Concurrency, tokens, latency
Concurrent users and the agent multiplier together determine KV cache size, which is usually the largest VRAM consumer after the model weights.
Customer Support RAG
Agent pattern
Retrieve relevant documents, augment the prompt with that context, then generate a response. Adds an embedding lookup and a larger context window.
Latency tier (SLA)
1–3s end-to-end. Dedicated GPUs with light batching.
50
Simultaneous in-flight requests at peak
4096
1024
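
To make the link between concurrency and KV cache concrete, here is a rough sizing sketch. It rests on several assumptions that do not come from this page: the model shape (32 layers, 8 KV heads, head dimension 128, 16-bit cache values) is hypothetical, the values 4096 and 1024 are treated as input and output tokens per request, and the agent multiplier is assumed to scale the number of in-flight sequences. The function name `kv_cache_bytes` is illustrative only.

```python
def kv_cache_bytes(concurrent_requests, tokens_per_request,
                   num_layers, num_kv_heads, head_dim,
                   bytes_per_value=2, agent_multiplier=1.0):
    """Estimate KV cache size in bytes for a decoder-only transformer.

    Assumption: each in-flight request holds its full context in the cache,
    and the agent multiplier scales the number of in-flight sequences.
    """
    # Factor of 2: one key vector and one value vector per layer per token.
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    sequences = concurrent_requests * agent_multiplier
    return int(sequences * tokens_per_request * bytes_per_token)


# Hypothetical model shape (not taken from the source).
NUM_LAYERS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128

# Values from the form; reading 4096/1024 as input/output tokens is an assumption.
concurrent = 50
tokens = 4096 + 1024

total = kv_cache_bytes(concurrent, tokens, NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM)
print(f"Estimated KV cache: {total / 2**30:.2f} GiB")
```

Under these assumptions each token costs 128 KiB of cache, so 50 requests at 5120 tokens each come to roughly 31 GiB. Real deployments will differ with the model architecture, cache precision, and how the serving stack allocates cache blocks.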