Concurrency, tokens, latency

Concurrent users and the agent multiplier determine the size of the KV cache, which is usually the largest VRAM consumer after the model weights.
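As a rough illustration of why concurrency dominates, the sketch below estimates KV cache size from the number of in-flight requests and tokens per request. The model shape (32 layers, 8 KV heads, head dimension 128, 2-byte values) is a hypothetical example and not taken from this page. Treating the agent multiplier as a plain factor on in-flight sequences, and summing the two token figures listed in this section as prompt plus generated tokens, are also assumptions.

```python
def kv_cache_bytes(
    concurrent_requests: int,
    tokens_per_request: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,        # assumed FP16/BF16 cache entries
    agent_multiplier: float = 1.0,   # assumed: scales in-flight sequences
) -> int:
    """Estimate total KV cache size in bytes (ignores allocator overhead)."""
    # Factor 2: one key vector and one value vector per layer, per token.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    sequences = concurrent_requests * agent_multiplier
    return int(per_token * tokens_per_request * sequences)


# Hypothetical model shape; 4096 + 1024 assumes prompt + generated tokens.
total = kv_cache_bytes(
    concurrent_requests=50,
    tokens_per_request=4096 + 1024,
    num_layers=32,
    num_kv_heads=8,
    head_dim=128,
)
print(f"Estimated KV cache: {total / 2**30:.2f} GiB")
```

The estimate scales linearly in every argument, so doubling either the concurrency or the tokens per request doubles the cache footprint.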

Customer Support RAG

Agent pattern

Retrieve relevant documents, augment the prompt with that context, then generate. This adds an embedding lookup and a larger context window.
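The retrieve, augment, generate flow can be sketched as follows. This is a toy illustration only: the word-overlap ranking stands in for a real embedding lookup, `generate` is a placeholder for the model call, and all function names are hypothetical.

```python
def retrieve(query: str, documents: list[str], top_k: int = 2) -> list[str]:
    """Toy retriever: rank documents by number of words shared with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def augment(query: str, context: list[str]) -> str:
    """Prepend the retrieved passages to the question, enlarging the prompt."""
    joined = "\n".join(f"- {passage}" for passage in context)
    return f"Context:\n{joined}\n\nQuestion: {query}"


def generate(prompt: str) -> str:
    """Placeholder for the model call; reports the prompt size it received."""
    return f"[model response to a {len(prompt.split())}-word prompt]"


docs = [
    "To reset your password open the Settings page",
    "Shipping takes five business days",
    "Password rules require twelve characters",
]
query = "how do I reset my password"
prompt = augment(query, retrieve(query, docs))
answer = generate(prompt)
```

The augmented prompt is longer than the bare question, which is why this pattern needs more context tokens per request.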

Latency tier (SLA)

1–3s end-to-end. Dedicated GPUs with light batching.

50 simultaneous in-flight requests at peak

4096
1024