Formulas

Every number in this app comes from one of these formulas. Claude is not in this loop — these run as pure TypeScript on every input change.

VRAM

  • Model weights
    Params(B) × Bytes_per_Param × 1.2

    Bytes: FP32=4, FP16/BF16=2, INT8=1, INT4=0.5. The 1.2 is framework overhead.

    Source: NVIDIA LLM Inference Best Practices

  • KV cache per token (bytes)
    2 × Layers × KV_Heads × Head_Dim × Bytes_per_Param

    The leading 2 is for both Key and Value matrices.

    Source: HuggingFace KV Cache Guide

  • KV cache per request
    KV_per_Token × (Input_Tokens + Output_Tokens) / 1e9

    GB per single in-flight request.

    Source: HuggingFace KV Cache Guide

  • Total KV cache
    KV_per_Request × Concurrent_Users × Agent_Multiplier

    Agent patterns hold VRAM during tool execution — that's the multiplier.

    Source: Empirical (vLLM, TGI production)

  • Activation memory
    Model_Weights × 0.10

    Inference activations are reused across layers; ~10% is sufficient.

    Source: NVIDIA / Baseten

  • Total VRAM
    Weights + KV_Cache + Activations + 2 GB framework

    The 2 GB is constant CUDA context overhead.

    Source: NVIDIA
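The VRAM formulas above compose into one small calculation. A minimal TypeScript sketch (function and field names here are illustrative assumptions, not the app's actual code):

```typescript
// Illustrative sketch of the VRAM formulas above; names are assumptions.
interface VramInputs {
  paramsB: number;         // model size in billions of parameters
  bytesPerParam: number;   // FP32=4, FP16/BF16=2, INT8=1, INT4=0.5
  layers: number;
  kvHeads: number;         // KV heads (GQA), not total attention heads
  headDim: number;
  inputTokens: number;
  outputTokens: number;
  concurrentUsers: number;
  agentMultiplier: number; // 1 for plain chat, >1 for agent/tool-use patterns
}

const FRAMEWORK_OVERHEAD = 1.2; // weight-loading overhead
const CUDA_CONTEXT_GB = 2;      // constant CUDA context overhead

function weightsGB(i: VramInputs): number {
  // billions of params × bytes/param gives GB directly
  return i.paramsB * i.bytesPerParam * FRAMEWORK_OVERHEAD;
}

function kvPerTokenBytes(i: VramInputs): number {
  return 2 * i.layers * i.kvHeads * i.headDim * i.bytesPerParam; // 2 = K and V
}

function kvCacheGB(i: VramInputs): number {
  const perRequestGB =
    (kvPerTokenBytes(i) * (i.inputTokens + i.outputTokens)) / 1e9;
  return perRequestGB * i.concurrentUsers * i.agentMultiplier;
}

function totalVramGB(i: VramInputs): number {
  const weights = weightsGB(i);
  const activations = weights * 0.10; // ~10% of weights for inference
  return weights + kvCacheGB(i) + activations + CUDA_CONTEXT_GB;
}
```

For a 70B FP16 model with Llama-3-70B-like geometry (80 layers, 8 KV heads, head dim 128), weights come to 70 × 2 × 1.2 = 168 GB before KV cache.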

GPU count

  • Inference GPUs (raw)
    ceil(Total_VRAM / GPU_VRAM)

    Tensor parallelism splits the model across GPUs.

    Source: NVIDIA Tensor Parallelism guide

  • Latency-adjusted
    ceil(Inference_GPUs × Latency_Provision_Factor)

    Real-time=1.5×, Interactive=1.2×, Near-RT=1.0×, Batch=0.6×.

    Source: BlueAlly empirical

  • Total with HA
    (Latency_GPUs + Embedding_GPUs) × HA_Multiplier + FineTune_GPUs

    HA: none=1×, N+1=2×, N+2=3×. Fine-tune GPUs are NOT replicated.

    Source: BlueAlly empirical
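The three GPU-count steps chain together. A sketch in the same spirit (the tier names and lookup tables below just encode the factors listed above; the identifiers are assumptions):

```typescript
// Illustrative GPU-count sketch; factor tables encode the values above.
type LatencyTier = "realtime" | "interactive" | "nearRT" | "batch";
type HaTier = "none" | "nPlus1" | "nPlus2";

const LATENCY_FACTOR: Record<LatencyTier, number> = {
  realtime: 1.5, interactive: 1.2, nearRT: 1.0, batch: 0.6,
};
const HA_MULTIPLIER: Record<HaTier, number> = { none: 1, nPlus1: 2, nPlus2: 3 };

function inferenceGpus(totalVramGB: number, gpuVramGB: number): number {
  return Math.ceil(totalVramGB / gpuVramGB); // tensor parallelism splits the model
}

function latencyAdjusted(rawGpus: number, tier: LatencyTier): number {
  return Math.ceil(rawGpus * LATENCY_FACTOR[tier]);
}

function totalWithHa(
  latencyGpus: number,
  embeddingGpus: number,
  ha: HaTier,
  fineTuneGpus: number,
): number {
  // Fine-tune GPUs sit deliberately outside the HA multiplier.
  return (latencyGpus + embeddingGpus) * HA_MULTIPLIER[ha] + fineTuneGpus;
}
```

For example, 200 GB of total VRAM on 80 GB GPUs needs 3 GPUs raw, 5 after the real-time 1.5× provision, and (5 + 1) × 2 + 2 = 14 with one embedding GPU, N+1 HA, and two fine-tune GPUs.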

Throughput

  • Tokens/sec/GPU (decode)
    GPU_Memory_BW(GB/s) / (Active_Params(B) × Bytes_per_Param)

    Decode is memory-bandwidth bound, NOT compute bound. For MoE, use ACTIVE params.

    Source: Baseten LLM Inference Guide

  • Requests/min
    Tokens_per_Sec_per_GPU × GPU_Count × 60 / Effective_Tokens

    Upper bound — production with continuous batching gets ~70–90% of this.

    Source: vLLM benchmarks
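Both throughput formulas are one-liners; a sketch under the same bandwidth-bound assumption (identifiers are illustrative):

```typescript
// Roofline-style decode throughput: every generated token streams all
// active weights from HBM once, so bandwidth / model-bytes bounds tok/s.
function tokensPerSecPerGpu(
  memBwGBs: number,        // GPU memory bandwidth in GB/s
  activeParamsB: number,   // ACTIVE params (B) — matters for MoE models
  bytesPerParam: number,
): number {
  return memBwGBs / (activeParamsB * bytesPerParam);
}

// Upper bound; continuous batching in production lands at ~70–90% of this.
function requestsPerMin(
  tokPerSecPerGpu: number,
  gpuCount: number,
  effectiveTokens: number, // tokens generated per request
): number {
  return (tokPerSecPerGpu * gpuCount * 60) / effectiveTokens;
}
```

With an assumed 3,000 GB/s of bandwidth and a 70B dense model at FP16, the bound is 3000 / 140 ≈ 21.4 tok/s/GPU.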

Storage

  • Vector index (raw)
    Docs × Embedding_Dim × 4 bytes / 1e9

    FP32 storage per dimension.

    Source: pgvector / Qdrant docs

  • HNSW RAM
    Vector_Index × 1.3

    HNSW graph overhead — must be RAM-resident for performance.

    Source: HNSW reference impl

  • Model files on disk
    Params(B) × 2 × 1.2

    Stored at FP16 regardless of inference precision; 1.2× covers checkpoints.

    Source: BlueAlly empirical
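The storage formulas in sketch form (function names are assumptions; constants are the ones given above):

```typescript
// Illustrative storage sketch.
function vectorIndexGB(docs: number, embeddingDim: number): number {
  return (docs * embeddingDim * 4) / 1e9; // FP32: 4 bytes per dimension
}

function hnswRamGB(indexGB: number): number {
  return indexGB * 1.3; // HNSW graph overhead; kept RAM-resident
}

function modelFilesGB(paramsB: number): number {
  return paramsB * 2 * 1.2; // FP16 on disk regardless of inference precision
}
```

One million 1024-dim embeddings is 4.096 GB raw and ~5.3 GB of RAM once the HNSW graph is on top.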

Cost

  • Annual power
    Total_GPUs × TDP(kW) × PUE × 8760 × $0.10/kWh

    TDP in kilowatts (a spec-sheet TDP in watts must be divided by 1000). PUE 1.3 = modern data center average. $0.10/kWh = US enterprise blended rate.

    Source: Uptime Institute

  • 3-year TCO
    Total_CAPEX + Annual_OPEX × 3

    Doesn't model refresh cycles or salvage value — those typically wash out.

    Source: BlueAlly empirical
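And the two cost formulas, sketched with the PUE and rate defaults above baked in as overridable parameters (names are illustrative; TDP is taken in kilowatts):

```typescript
// Illustrative cost sketch. tdpKw: GPU TDP in kilowatts (e.g. 0.7 for a 700 W part).
function annualPowerCost(
  totalGpus: number,
  tdpKw: number,
  pue = 1.3,          // modern data center average
  ratePerKwh = 0.10,  // US enterprise blended rate
): number {
  const HOURS_PER_YEAR = 8760;
  return totalGpus * tdpKw * pue * HOURS_PER_YEAR * ratePerKwh;
}

function threeYearTco(totalCapex: number, annualOpex: number): number {
  return totalCapex + annualOpex * 3; // no refresh cycles or salvage value
}
```

Eight GPUs at 0.7 kW each run about $6,377/year in power at the default PUE and rate.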