Formulas

Every number in this app comes from one of these formulas. Claude is not in this loop — these run as pure TypeScript on every input change.

VRAM

  • Model weights
    Params(B) × Bytes_per_Param × 1.2

    Bytes: FP32=4, FP16/BF16=2, INT8=1, INT4=0.5. The 1.2 is framework overhead.

    Source: NVIDIA LLM Inference Best Practices

  • KV cache per token (bytes)
    2 × Layers × KV_Heads × Head_Dim × Bytes_per_Param

    The leading 2 is for both Key and Value matrices.

    Source: HuggingFace KV Cache Guide

  • KV cache per request
    KV_per_Token × (Input_Tokens + Output_Tokens) / 1e9

    GB per single in-flight request.

    Source: HuggingFace KV Cache Guide

  • Total KV cache
    KV_per_Request × Concurrent_Users × Agent_Multiplier

    Agent patterns hold VRAM during tool execution — that's the multiplier.

    Source: Empirical (vLLM, TGI production)

  • Activation memory
    Model_Weights × 0.10

    Inference activations are reused across layers; ~10% is sufficient.

    Source: NVIDIA / Baseten

  • Total VRAM
    Weights + KV_Cache + Activations + 2 GB framework

    The 2 GB is constant CUDA context overhead.

    Source: NVIDIA
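The VRAM formulas above compose into one small calculation. A minimal TypeScript sketch (function and field names here are illustrative assumptions, not the app's actual code):

```typescript
// Illustrative sketch of the VRAM formulas above; names are assumptions.
interface VramInputs {
  paramsB: number;         // model size in billions of parameters
  bytesPerParam: number;   // FP32=4, FP16/BF16=2, INT8=1, INT4=0.5
  layers: number;
  kvHeads: number;         // KV heads (GQA), not total attention heads
  headDim: number;
  inputTokens: number;
  outputTokens: number;
  concurrentUsers: number;
  agentMultiplier: number; // 1 for plain chat, >1 for agent/tool-use patterns
}

const FRAMEWORK_OVERHEAD = 1.2; // weight-loading overhead
const CUDA_CONTEXT_GB = 2;      // constant CUDA context overhead

function weightsGB(i: VramInputs): number {
  // billions of params × bytes/param gives GB directly
  return i.paramsB * i.bytesPerParam * FRAMEWORK_OVERHEAD;
}

function kvPerTokenBytes(i: VramInputs): number {
  return 2 * i.layers * i.kvHeads * i.headDim * i.bytesPerParam; // 2 = K and V
}

function kvCacheGB(i: VramInputs): number {
  const perRequestGB =
    (kvPerTokenBytes(i) * (i.inputTokens + i.outputTokens)) / 1e9;
  return perRequestGB * i.concurrentUsers * i.agentMultiplier;
}

function totalVramGB(i: VramInputs): number {
  const weights = weightsGB(i);
  const activations = weights * 0.10; // ~10% of weights for inference
  return weights + kvCacheGB(i) + activations + CUDA_CONTEXT_GB;
}
```

For a 70B FP16 model with Llama-3-70B-like geometry (80 layers, 8 KV heads, head dim 128), weights come to 70 × 2 × 1.2 = 168 GB before KV cache.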

GPU count

  • Inference GPUs (raw)
    ceil(Total_VRAM / GPU_VRAM)

    Tensor parallelism splits the model across GPUs.

    Source: NVIDIA Tensor Parallelism guide

  • Latency-adjusted
    ceil(Inference_GPUs × Latency_Provision_Factor)

    Real-time=1.5×, Interactive=1.2×, Near-RT=1.0×, Batch=0.6×.

    Source: BlueAlly empirical

  • Total with HA
    (Latency_GPUs + Embedding_GPUs) × HA_Multiplier + FineTune_GPUs

    HA: none=1×, N+1=2×, N+2=3×. Fine-tune GPUs are NOT replicated.

    Source: BlueAlly empirical
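The three GPU-count steps chain together. A sketch in the same spirit (the tier names and lookup tables below just encode the factors listed above; the identifiers are assumptions):

```typescript
// Illustrative GPU-count sketch; factor tables encode the values above.
type LatencyTier = "realtime" | "interactive" | "nearRT" | "batch";
type HaTier = "none" | "nPlus1" | "nPlus2";

const LATENCY_FACTOR: Record<LatencyTier, number> = {
  realtime: 1.5, interactive: 1.2, nearRT: 1.0, batch: 0.6,
};
const HA_MULTIPLIER: Record<HaTier, number> = { none: 1, nPlus1: 2, nPlus2: 3 };

function inferenceGpus(totalVramGB: number, gpuVramGB: number): number {
  return Math.ceil(totalVramGB / gpuVramGB); // tensor parallelism splits the model
}

function latencyAdjusted(rawGpus: number, tier: LatencyTier): number {
  return Math.ceil(rawGpus * LATENCY_FACTOR[tier]);
}

function totalWithHa(
  latencyGpus: number,
  embeddingGpus: number,
  ha: HaTier,
  fineTuneGpus: number,
): number {
  // Fine-tune GPUs sit deliberately outside the HA multiplier.
  return (latencyGpus + embeddingGpus) * HA_MULTIPLIER[ha] + fineTuneGpus;
}
```

For example, 200 GB of total VRAM on 80 GB GPUs needs 3 GPUs raw, 5 after the real-time 1.5× provision, and (5 + 1) × 2 + 2 = 14 with one embedding GPU, N+1 HA, and two fine-tune GPUs.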

Throughput

  • Tokens/sec/GPU (decode)
    GPU_Memory_BW(GB/s) / (Active_Params(B) × Bytes_per_Param)

    Decode is memory-bandwidth bound, NOT compute bound. For MoE, use ACTIVE params.

    Source: Baseten LLM Inference Guide

  • Requests/min
    Tokens_per_Sec_per_GPU × GPU_Count × 60 / Effective_Tokens

    Upper bound — production with continuous batching gets ~70–90% of this.

    Source: vLLM benchmarks
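Both throughput formulas are one-liners; a sketch under the same bandwidth-bound assumption (identifiers are illustrative):

```typescript
// Roofline-style decode throughput: every generated token streams all
// active weights from HBM once, so bandwidth / model-bytes bounds tok/s.
function tokensPerSecPerGpu(
  memBwGBs: number,        // GPU memory bandwidth in GB/s
  activeParamsB: number,   // ACTIVE params (B) — matters for MoE models
  bytesPerParam: number,
): number {
  return memBwGBs / (activeParamsB * bytesPerParam);
}

// Upper bound; continuous batching in production lands at ~70–90% of this.
function requestsPerMin(
  tokPerSecPerGpu: number,
  gpuCount: number,
  effectiveTokens: number, // tokens generated per request
): number {
  return (tokPerSecPerGpu * gpuCount * 60) / effectiveTokens;
}
```

With an assumed 3,000 GB/s of bandwidth and a 70B dense model at FP16, the bound is 3000 / 140 ≈ 21.4 tok/s/GPU.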

Storage

  • Vector index (raw)
    Docs × Embedding_Dim × 4 bytes / 1e9

    FP32 storage per dimension.

    Source: pgvector / Qdrant docs

  • HNSW RAM
    Vector_Index × 1.3

    HNSW graph overhead — must be RAM-resident for performance.

    Source: HNSW reference impl

  • Model files on disk
    Params(B) × 2 × 1.2

    Stored at FP16 regardless of inference precision; 1.2× covers checkpoints.

    Source: BlueAlly empirical
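The storage formulas in sketch form (function names are assumptions; constants are the ones given above):

```typescript
// Illustrative storage sketch.
function vectorIndexGB(docs: number, embeddingDim: number): number {
  return (docs * embeddingDim * 4) / 1e9; // FP32: 4 bytes per dimension
}

function hnswRamGB(indexGB: number): number {
  return indexGB * 1.3; // HNSW graph overhead; kept RAM-resident
}

function modelFilesGB(paramsB: number): number {
  return paramsB * 2 * 1.2; // FP16 on disk regardless of inference precision
}
```

One million 1024-dim embeddings is 4.096 GB raw and ~5.3 GB of RAM once the HNSW graph is on top.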

Cost

  • Annual power
    Total_GPUs × TDP(kW) × PUE × 8760 × $0.10/kWh

    TDP in kilowatts (a spec-sheet TDP in watts must be divided by 1000). PUE 1.3 = modern data center average. $0.10/kWh = US enterprise blended rate.

    Source: Uptime Institute

  • 3-year TCO
    Total_CAPEX + Annual_OPEX × 3

    Doesn't model refresh cycles or salvage value — those typically wash out.

    Source: BlueAlly empirical
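And the two cost formulas, sketched with the PUE and rate defaults above baked in as overridable parameters (names are illustrative; TDP is taken in kilowatts):

```typescript
// Illustrative cost sketch. tdpKw: GPU TDP in kilowatts (e.g. 0.7 for a 700 W part).
function annualPowerCost(
  totalGpus: number,
  tdpKw: number,
  pue = 1.3,          // modern data center average
  ratePerKwh = 0.10,  // US enterprise blended rate
): number {
  const HOURS_PER_YEAR = 8760;
  return totalGpus * tdpKw * pue * HOURS_PER_YEAR * ratePerKwh;
}

function threeYearTco(totalCapex: number, annualOpex: number): number {
  return totalCapex + annualOpex * 3; // no refresh cycles or salvage value
}
```

Eight GPUs at 0.7 kW each run about $6,377/year in power at the default PUE and rate.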