Best practices
Hard-won guidance from sizing private AI deployments. The advisor will surface these contextually as you configure.
Quantization first, hardware second
Before you procure a single GPU, A/B test INT8 against FP16 on your real workload. INT8 typically halves model-weight VRAM, which in turn can halve your GPU count, with near-zero quality loss on common enterprise tasks (classification, RAG, structured extraction, summarization).
INT4 (GPTQ/AWQ) is more aggressive — validate quality on a held-out eval set before committing, especially for regulated or high-stakes domains.
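As a back-of-envelope check before that A/B test, weight memory scales directly with bytes per parameter. The figures below (2/1/0.5 bytes per parameter, a hypothetical 80 GB card) are generic sizing assumptions, not numbers from this tool:

```python
import math

# Approximate bytes per parameter for each precision (weights only).
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_vram_gb(params_billion: float, precision: str) -> float:
    """Approximate VRAM (GB) needed just to hold model weights."""
    return params_billion * BYTES_PER_PARAM[precision]

def min_gpus(params_billion: float, precision: str, gpu_vram_gb: float = 80.0) -> int:
    """Minimum GPUs to hold the weights alone, ignoring KV cache and overhead."""
    return math.ceil(weight_vram_gb(params_billion, precision) / gpu_vram_gb)

# A 70B model: 140 GB at FP16 vs 70 GB at INT8 for raw weights.
print(weight_vram_gb(70, "fp16"), weight_vram_gb(70, "int8"))
```

Note this counts weights only; KV cache and runtime overhead (next section) come on top.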
KV cache is the silent killer
At scale, KV cache often exceeds the model weights in VRAM consumption. A 70B model at INT8 needs roughly 84 GB for weights (about 70 GB of parameters plus runtime overhead), yet 100 concurrent users running ReAct agents (a 6× request multiplier) on 8K contexts can pin 600+ GB of KV cache.
Mitigation: PagedAttention (vLLM), prompt caching, capping concurrent requests, smaller context windows where acceptable.
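The KV cache math is worth doing explicitly. The sketch below assumes a Llama-style 70B with grouped-query attention (80 layers, 8 KV heads, head dimension 128, FP16 cache); these architecture numbers are illustrative assumptions, not figures from this tool:

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, concurrent_seqs: int,
                bytes_per_elem: int = 2) -> float:
    """KV cache size in GB: 2 (K and V) * layers * kv_heads * head_dim
    * bytes, per token, times tokens per sequence, times sequences."""
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return per_token_bytes * context_tokens * concurrent_seqs / 1e9

# 100 users with a 6x agent multiplier = 600 concurrent sequences at 8K context.
print(kv_cache_gb(80, 8, 128, 8192, 600))
```

Under these assumptions each 8K sequence holds about 2.7 GB of cache, so 600 sequences land well past the 600 GB mark; even dropping to an FP8 cache (`bytes_per_elem=1`) does not get you back under it.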
Right-size the latency tier
Not every workload needs real-time SLAs. Batch and near-real-time tiers can run hot at 60–100% utilization with continuous batching, saving roughly half your GPU budget. Reserve real-time (sub-500ms) capacity for actual user-facing chat and voice.
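The utilization savings can be sketched directly. The throughput numbers below (10,000 tok/s of demand, a hypothetical 1,000 tok/s per GPU at peak) are made-up inputs for illustration:

```python
import math

def gpus_for_throughput(required_tps: float, gpu_peak_tps: float,
                        target_utilization: float) -> int:
    """GPUs needed when each card is run at only target_utilization of peak,
    e.g. headroom reserved for real-time latency SLAs."""
    return math.ceil(required_tps / (gpu_peak_tps * target_utilization))

# Real-time tier held at 40% utilization vs a batch tier run hot at 90%.
realtime = gpus_for_throughput(10_000, 1_000, 0.40)  # 25 GPUs
batch = gpus_for_throughput(10_000, 1_000, 0.90)     # 12 GPUs
```

Same demand, roughly half the GPUs, which is why tiering workloads correctly matters as much as picking the right card.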
HA is non-negotiable for production
N+1 doubles your GPU footprint for a single-replica deployment, but it is the minimum for revenue workloads. A single GPU failure on a non-HA deployment takes the workload completely offline, and GPUs do fail. Plan for it.
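A minimal sketch of the N+1 arithmetic, with hypothetical replica sizes; note the overhead shrinks as the base replica count grows:

```python
def n_plus_one_gpus(replicas_needed: int, gpus_per_replica: int) -> int:
    """GPUs for N+1: capacity for N replicas plus one spare replica,
    so any single replica (or GPU host) can fail without an outage."""
    return (replicas_needed + 1) * gpus_per_replica

# One 4-GPU replica serves the load: N+1 doubles the footprint (8 GPUs).
# Three replicas needed: N+1 adds only ~33% (16 GPUs instead of 12).
print(n_plus_one_gpus(1, 4), n_plus_one_gpus(3, 4))
```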
Multi-use-case savings are real
Sizing each use case in isolation overcounts shared infrastructure: one embedding GPU can serve the entire org, model files dedupe when multiple use cases run the same base model, and ops staffing is a single line item. The aggregate sizer in this tool catches these.
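A toy illustration of the dedupe effect. The use-case structure, model names, and GB figures here are all hypothetical:

```python
use_cases = [
    {"model": "llama-70b", "model_gb": 84, "embed_gb": 8},  # RAG assistant
    {"model": "llama-70b", "model_gb": 84, "embed_gb": 8},  # summarization
    {"model": "mistral-7b", "model_gb": 8, "embed_gb": 8},  # classification
]

def naive_gb(cases) -> int:
    """Each use case sized in isolation: weights + embedding server per case."""
    return sum(uc["model_gb"] + uc["embed_gb"] for uc in cases)

def aggregate_gb(cases) -> int:
    """Shared sizing: one copy per distinct base model, one embedding server."""
    models = {uc["model"]: uc["model_gb"] for uc in cases}
    shared_embed = max(uc["embed_gb"] for uc in cases)
    return sum(models.values()) + shared_embed

print(naive_gb(use_cases), aggregate_gb(use_cases))  # 200 vs 100
```

Even this toy case halves the VRAM bill, which is exactly the gap between per-use-case sizing and aggregate sizing.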
Compare to cloud before committing CAPEX
Above ~$2M in 3-year TCO, the on-prem case needs to clear a higher bar than cost alone: data residency, predictable spend, latency, compliance. Cloud (Bedrock, Together, Azure OpenAI) wins on burstability and zero CAPEX. On-prem wins on long-running, steady workloads with sensitive data. Run both numbers.
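Running both numbers can be as simple as the sketch below. All dollar figures are placeholders you would replace with real quotes; this is not a pricing model from the tool:

```python
def on_prem_tco(capex: float, annual_opex: float, years: int = 3) -> float:
    """Hardware CAPEX up front plus power, space, and ops staff per year."""
    return capex + annual_opex * years

def cloud_tco(monthly_spend: float, years: int = 3) -> float:
    """Steady-state monthly API/hosting spend over the same horizon."""
    return monthly_spend * 12 * years

# Hypothetical: $1.8M of hardware + $250k/yr opex vs $60k/mo of cloud spend.
print(on_prem_tco(1_800_000, 250_000), cloud_tco(60_000))
```

At these made-up numbers cloud edges out on-prem over three years; flat steady demand and sensitive data are what tilt the comparison the other way.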