Formulas
Every number in this app comes from one of these formulas. Claude is not in this loop — these run as pure TypeScript on every input change.
VRAM
- Model weights
Params(B) × Bytes_per_Param × 1.2
Bytes: FP32=4, FP16/BF16=2, INT8=1, INT4=0.5. The 1.2× factor is framework overhead.
Source: NVIDIA LLM Inference Best Practices
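A minimal sketch of the weights formula as the kind of pure TypeScript the app runs (the names here are illustrative, not the app's actual identifiers):

```typescript
// Bytes per parameter by precision, from the table above.
const BYTES_PER_PARAM: Record<string, number> = {
  fp32: 4, fp16: 2, bf16: 2, int8: 1, int4: 0.5,
};

// Model weight VRAM in GB: Params(B) × Bytes_per_Param × 1.2 framework overhead.
function modelWeightsGB(paramsB: number, precision: string): number {
  return paramsB * BYTES_PER_PARAM[precision] * 1.2;
}
```

A 70B model at FP16 works out to 70 × 2 × 1.2 = 168 GB.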
- KV cache per token (bytes)
2 × Layers × KV_Heads × Head_Dim × Bytes_per_Param
The leading 2 is for both Key and Value matrices.
Source: HuggingFace KV Cache Guide
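As a TypeScript sketch (illustrative names, not the app's identifiers):

```typescript
// KV cache bytes per token: 2 (K and V) × layers × kv_heads × head_dim × bytes.
function kvBytesPerToken(
  layers: number, kvHeads: number, headDim: number, bytesPerParam: number,
): number {
  return 2 * layers * kvHeads * headDim * bytesPerParam;
}
```

For a Llama-3-70B-class config (80 layers, 8 KV heads via GQA, head_dim 128) at FP16 this gives 2 × 80 × 8 × 128 × 2 = 327,680 bytes, about 0.33 MB per token.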
- KV cache per request
KV_per_Token × (Input_Tokens + Output_Tokens) / 1e9
GB per single in-flight request.
Source: HuggingFace KV Cache Guide
- Total KV cache
KV_per_Request × Concurrent_Users × Agent_Multiplier
Agent patterns hold VRAM during tool execution — that's the multiplier.
Source: Empirical (vLLM, TGI production)
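Chaining the per-token byte count through a request and the whole fleet, again as an illustrative TypeScript sketch:

```typescript
// GB of KV cache held by one in-flight request.
function kvPerRequestGB(
  bytesPerToken: number, inputTokens: number, outputTokens: number,
): number {
  return (bytesPerToken * (inputTokens + outputTokens)) / 1e9;
}

// Total KV cache across concurrent users, scaled by the agent multiplier
// for patterns that hold VRAM during tool execution.
function totalKvGB(
  perRequestGB: number, concurrentUsers: number, agentMultiplier: number,
): number {
  return perRequestGB * concurrentUsers * agentMultiplier;
}
```

At 327,680 bytes/token and 4,096 input + 4,096 output tokens, one request holds about 2.68 GB; 50 concurrent users with a 1.5× agent multiplier holds about 201 GB.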
- Activation memory
Model_Weights × 0.10
Activation buffers are reused layer to layer during inference, so ~10% of the weight footprint is sufficient.
Source: NVIDIA / Baseten
- Total VRAM
Weights + KV_Cache + Activations + 2 GB framework
The 2 GB is constant CUDA context overhead.
Source: NVIDIA
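The total then folds the pieces together; a sketch, with illustrative names:

```typescript
// Total VRAM in GB: weights + KV cache + 10% activations + 2 GB CUDA context.
function totalVramGB(weightsGB: number, kvCacheGB: number): number {
  const activationsGB = weightsGB * 0.1;
  const frameworkGB = 2;
  return weightsGB + kvCacheGB + activationsGB + frameworkGB;
}
```

Continuing the 70B example: 168 GB weights + 200 GB KV cache yields 168 + 200 + 16.8 + 2 = 386.8 GB, well past a single GPU.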
GPU count
- Inference GPUs (raw)
ceil(Total_VRAM / GPU_VRAM)
Tensor parallelism splits the model across GPUs.
Source: NVIDIA Tensor Parallelism guide
- Latency-adjusted
ceil(Inference_GPUs × Latency_Provision_Factor)
Real-time=1.5×, Interactive=1.2×, Near-RT=1.0×, Batch=0.6×.
Source: BlueAlly empirical
- Total with HA
(Latency_GPUs + Embedding_GPUs) × HA_Multiplier + FineTune_GPUs
HA: none=1×, N+1=2×, N+2=3×. Fine-tune GPUs are NOT replicated.
Source: BlueAlly empirical
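The three GPU-count formulas, sketched in TypeScript (the factor values are the ones stated above; function names are illustrative):

```typescript
// Raw GPU count: ceil of total VRAM over per-GPU VRAM (tensor parallelism).
function inferenceGpus(totalVramGB: number, gpuVramGB: number): number {
  return Math.ceil(totalVramGB / gpuVramGB);
}

// Latency provisioning: real-time 1.5, interactive 1.2, near-RT 1.0, batch 0.6.
function latencyAdjustedGpus(rawGpus: number, latencyFactor: number): number {
  return Math.ceil(rawGpus * latencyFactor);
}

// HA multiplier: none 1, N+1 2, N+2 3. Fine-tune GPUs are not replicated.
function totalGpusWithHA(
  latencyGpus: number, embeddingGpus: number,
  haMultiplier: number, fineTuneGpus: number,
): number {
  return (latencyGpus + embeddingGpus) * haMultiplier + fineTuneGpus;
}
```

For 386.8 GB on 80 GB GPUs: ceil(386.8 / 80) = 5 raw; real-time gives ceil(5 × 1.5) = 8; with 1 embedding GPU, N+1 HA, and 2 fine-tune GPUs, (8 + 1) × 2 + 2 = 20 total.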
Throughput
- Tokens/sec/GPU (decode)
GPU_Memory_BW(GB/s) / (Active_Params(B) × Bytes_per_Param)
Decode is memory-bandwidth bound, NOT compute bound. For MoE, use ACTIVE params.
Source: Baseten LLM Inference Guide
- Requests/min
Tokens_per_Sec_per_GPU × GPU_Count × 60 / Effective_Tokens
Upper bound — production with continuous batching gets ~70–90% of this.
Source: vLLM benchmarks
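Both throughput formulas as a sketch (illustrative names; the H100 bandwidth figure in the note is an example input, not part of the formula):

```typescript
// Decode tokens/sec/GPU: bandwidth-bound, so BW over bytes moved per token.
// For MoE models pass ACTIVE params, not total.
function tokensPerSecPerGpu(
  memBwGBs: number, activeParamsB: number, bytesPerParam: number,
): number {
  return memBwGBs / (activeParamsB * bytesPerParam);
}

// Requests/min upper bound; continuous batching reaches ~70–90% of this.
function requestsPerMin(
  tokPerSecPerGpu: number, gpuCount: number, effectiveTokens: number,
): number {
  return (tokPerSecPerGpu * gpuCount * 60) / effectiveTokens;
}
```

An H100 SXM at roughly 3,350 GB/s serving a 70B dense model at FP16 gives 3350 / 140 ≈ 24 tokens/sec/GPU before batching.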
Storage
- Vector index (raw)
Docs × Embedding_Dim × 4 bytes / 1e9
FP32 storage per dimension.
Source: pgvector / Qdrant docs
- HNSW RAM
Vector_Index × 1.3
HNSW graph overhead — must be RAM-resident for performance.
Source: HNSW reference impl
- Model files on disk
Params(B) × 2 × 1.2
Stored at FP16 regardless of inference precision; 1.2× covers checkpoints.
Source: BlueAlly empirical
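All three storage formulas in one sketch (illustrative names; the 10M-doc example below is an assumption):

```typescript
// Raw vector index GB: one FP32 value (4 bytes) per dimension per document.
function vectorIndexGB(docs: number, embeddingDim: number): number {
  return (docs * embeddingDim * 4) / 1e9;
}

// HNSW keeps its graph RAM-resident; ~30% overhead on the raw index.
function hnswRamGB(rawIndexGB: number): number {
  return rawIndexGB * 1.3;
}

// Model files on disk: FP16 (2 bytes/param) with 1.2× for checkpoints.
function modelFilesGB(paramsB: number): number {
  return paramsB * 2 * 1.2;
}
```

10M documents at 1,536 dimensions come to 61.44 GB raw, so plan roughly 80 GB of RAM for the HNSW index.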
Cost
- Annual power
Total_GPUs × TDP(kW) × PUE × 8760 hrs × $0.10/kWh
PUE 1.3 = modern data center average. $0.10/kWh = US enterprise blended rate.
Source: Uptime Institute
- 3-year TCO
Total_CAPEX + Annual_OPEX × 3
Doesn't model refresh cycles or salvage value — those typically wash out.
Source: BlueAlly empirical
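The cost formulas as a sketch, with TDP taken in kW so the units cancel against $/kWh (names and the 0.7 kW example figure are illustrative):

```typescript
// Annual power cost in USD. tdpKw is per-GPU draw in kW (H100 SXM ≈ 0.7 kW).
function annualPowerUSD(
  totalGpus: number, tdpKw: number,
  pue: number = 1.3, usdPerKwh: number = 0.1,
): number {
  const hoursPerYear = 8760;
  return totalGpus * tdpKw * pue * hoursPerYear * usdPerKwh;
}

// 3-year TCO: capex once, opex every year; no refresh or salvage modeling.
function threeYearTcoUSD(totalCapexUSD: number, annualOpexUSD: number): number {
  return totalCapexUSD + annualOpexUSD * 3;
}
```

Twenty GPUs at 0.7 kW each: 20 × 0.7 × 1.3 × 8760 × 0.10 ≈ $15,943/year in power alone.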