Infrastructure Explorer

Free-form sizing on a single screen. Edit any value and watch the numbers recompute. For multi-use-case sizing, use the wizard.

Hardware target

Model & precision

Workload pattern

Concurrency & tokens

Concurrent users: 50
Context length: 4096 tokens
Output length: 1024 tokens

Data architecture

Operations

VRAM: 199.3 GB
GPUs: 10
System RAM: 448.0 GB
Storage: 320.3 GB
Power: 9.1 kW
3-yr TCO: $943.4K

VRAM breakdown

Customer Support RAG: KV cache exceeds model weights
KV cache is 104.9 GB vs 84.0 GB of weights. This is the #1 surprise at scale — consider PagedAttention (vLLM), reducing the agent multiplier, or capping concurrent users.
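The KV-cache growth above can be sketched with a back-of-the-envelope formula. The model parameters below are illustrative assumptions (a 70B-class model with grouped-query attention), not the exact model configured in the explorer; the dashboard's 104.9 GB additionally applies an agent multiplier on top of this base figure.

```python
# Minimal KV-cache sizing sketch under assumed model parameters.
layers = 80                 # assumed transformer layer count
kv_heads = 8                # assumed grouped-query KV heads
head_dim = 128              # assumed head dimension
bytes_per_elem = 2          # FP16/BF16 cache entries
concurrent_users = 50       # from the Concurrency & tokens panel
tokens_per_user = 4096 + 1024  # context + output tokens

# K and V are cached per layer, hence the factor of 2.
bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
total_gb = bytes_per_token * tokens_per_user * concurrent_users / 1e9
print(f"{total_gb:.1f} GB")  # 83.9 GB for these assumed parameters
```

Because the total scales linearly with `concurrent_users` and `tokens_per_user`, capping either one (or paging the cache with vLLM's PagedAttention) directly shrinks this term.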
Multi-node deployment
The total of 10 GPUs spans 2 server nodes. You'll need InfiniBand (or an equivalent low-latency fabric) between them for tensor parallelism.
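The node count follows from a simple ceiling division over the chassis capacity; a minimal sketch, assuming 8-GPU HGX chassis as in the BOM:

```python
import math

gpus_needed = 10
gpus_per_node = 8  # 8-GPU HGX chassis

nodes = math.ceil(gpus_needed / gpus_per_node)
multi_node = nodes > 1  # if True, budget for an inter-node fabric
print(nodes, multi_node)  # 2 True
```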
Only one use case configured
If you're sizing for the whole org, add the other workloads now. Aggregate sizing finds shared infrastructure savings (e.g., shared embedding GPU, deduped model files).

Hardware BOM

Item | Quantity | Spec | Cost
NVIDIA H100 SXM | 10 | 80 GB · 3350 GB/s · 700 W | $300K
Server nodes (8-GPU HGX) | 2 | Chassis, CPU, motherboard, PSU | $40K
System RAM (DDR5 ECC) | 448.0 GB | ≥2× total VRAM, HNSW headroom | $2.2K
Enterprise NVMe storage | 320.3 GB | Models, vector index, logs | $48
Networking fabric | 1 | InfiniBand HDR (multi-node) | $50K
Power distribution + UPS | 2 | PDU + battery backup per node | $10K
Total CAPEX | | | $402.3K
Annual OPEX (power + cooling + licenses + ops) | | | $180.4K
3-Year TCO | | | $943.4K
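The 3-year TCO line is CAPEX plus three years of OPEX. A minimal check using the rounded figures from the table (the dashboard's $943.4K comes from the unrounded inputs, so the rounded sum lands $0.1K higher):

```python
capex_k = 402.3   # total hardware cost from the BOM, in $K
opex_k = 180.4    # annual power + cooling + licenses + ops, in $K
years = 3

tco_k = capex_k + years * opex_k
print(f"${tco_k:.1f}K")  # $943.5K (vs. the displayed $943.4K, a rounding artifact)
```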

CAPEX breakdown