Concurrency, tokens, latency

Concurrent users and the agent multiplier drive KV cache, which is usually the biggest VRAM consumer beyond model weights.

Customer Support RAG

Agent pattern

Retrieve relevant documents, augment the prompt with context, generate. Adds embedding lookup plus an enlarged context window.

Latency tier (SLA)

1–3s end-to-end. Dedicated GPUs with light batching.

Peak concurrent users50

Simultaneous in-flight requests at peak

Requests per day

Avg input tokens4096

Avg output tokens1024