# Capacity Planning
RadarOS includes a built-in capacity planning library for modeling LLM inference infrastructure. It answers questions like:

- How many GPUs do I need for N concurrent users?
- What happens to latency when I add NAND SSD offloading?
- What’s the KV cache pressure for my workload mix?
- Where is the TTFT SLA breach point?
| Tier | What | Where |
|---|---|---|
| Tier 1 | Pure-math capacity library | @radaros/core — zero dependencies |
| Tier 2 | Runtime session profiling | @radaros/core + @radaros/observability |
| Tier 3 | Interactive dashboard app | apps/capacity-planner/ (Next.js) |
## Quick Start

## Core Concepts
### KV Cache Sizing
Every token in the context window stores Key and Value tensors for every layer. For Llama 3.1 70B in bf16:

2 (K and V) × 80 layers × 8 KV heads × 128 head dim × 2 bytes = 327,680 bytes (~320 KB/token)
GQA (Grouped Query Attention) is the key lever: Llama 3.1 70B uses 8 KV heads instead of 64 query heads, shrinking the KV cache by 8×.
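This arithmetic can be expressed directly. A minimal sketch (the function name is illustrative, not the actual `kvBytesPerToken` export):

```typescript
// Per-token KV cache size: K and V tensors are stored for every layer.
//   2 (K+V) × layers × kvHeads × headDim × bytesPerElement
function kvBytesPerTokenSketch(
  layers: number,
  kvHeads: number,
  headDim: number,
  bytesPerElement: number,
): number {
  return 2 * layers * kvHeads * headDim * bytesPerElement;
}

// Llama 3.1 70B in bf16: 80 layers, 8 KV heads, head dim 128, 2 bytes
const llama70b = kvBytesPerTokenSketch(80, 8, 128, 2); // 327,680 B ≈ 320 KB

// Without GQA (64 KV heads, as in MHA) the same model would need 8× more:
const mhaEquivalent = kvBytesPerTokenSketch(80, 64, 128, 2); // 2,621,440 B ≈ 2.5 MB
```

The same formula reproduces the per-model KV/token figures in the table below.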
### Memory Budget
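In broad strokes, the budget is HBM capacity minus weight memory minus runtime overhead, with the remainder available for KV cache. A minimal sketch, assuming an illustrative 10% runtime overhead (the function name is hypothetical, not the library API):

```typescript
// Rough HBM budget: weights + KV cache must fit under (capacity − overhead).
// The 10% runtime overhead fraction is an illustrative assumption.
function maxKvTokensSketch(
  hbmBytes: number,
  weightBytes: number,
  kvBytesPerToken: number,
  overheadFraction = 0.1,
): number {
  const usable = hbmBytes * (1 - overheadFraction) - weightBytes;
  return Math.max(0, Math.floor(usable / kvBytesPerToken));
}

// H100 (80 GB) serving Llama 3.1 8B in bf16 (~16 GB weights, 128 KB/token KV):
const kvTokenBudget = maxKvTokensSketch(80e9, 16e9, 131072);
// ≈ 427k tokens of KV budget, e.g. ~52 concurrent sessions at 8k context
```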
### Latency Model
TPOT (Time Per Output Token) is memory-bandwidth-bound: each decode step must stream the model weights (plus the active KV cache) from HBM.

## Model Architectures
15 models are included out of the box, including:

| Model | Type | KV Heads | KV/token (bf16) |
|---|---|---|---|
| Llama 3.1 8B | GQA | 8 | 128 KB |
| Llama 3.1 70B | GQA | 8 | 320 KB |
| Llama 3.1 405B | GQA | 8 | 504 KB |
| Llama 2 7B | MHA | 32 | 512 KB |
| Mixtral 8×7B | GQA | 8 | 128 KB |
| Mixtral 8×22B | GQA | 8 | 176 KB |
| Falcon 7B | MQA | 1 | 8 KB |
| Mistral 7B | GQA | 8 | 128 KB |
| Phi-3 Mini | MHA | 32 | 384 KB |
| Gemma 2 9B | GQA | 8 | 168 KB |
| Gemma 2 27B | GQA | 16 | 184 KB |
## GPU Specs
| GPU | HBM | Bandwidth | bf16 TFLOPS |
|---|---|---|---|
| H100 SXM | 80 GB | 3.35 TB/s | 989 |
| A100 SXM | 80 GB | 2.0 TB/s | 312 |
| L40S | 48 GB | 0.864 TB/s | 366 |
| RTX 4090 | 24 GB | 1.008 TB/s | 330 |
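Given these bandwidth figures, the bandwidth-bound TPOT model above can be sketched (an illustrative helper, not the actual `estimateTpot` implementation):

```typescript
// Bandwidth-bound TPOT: bytes moved per decode step / memory bandwidth.
// Ignores compute time, kernel overheads, and batching effects (assumptions).
function estimateTpotSketch(
  weightBytes: number,
  kvBytes: number,
  bandwidthBytesPerSec: number,
): number {
  return (weightBytes + kvBytes) / bandwidthBytesPerSec; // seconds per token
}

// Llama 3.1 70B in bf16 (~140 GB weights) sharded over 2× H100 SXM
// (treating their 2 × 3.35 TB/s as aggregate bandwidth, a simplification),
// with an 8k-token context (8192 × 320 KB ≈ 2.7 GB of KV):
const tpot = estimateTpotSketch(140e9, 8192 * 327680, 2 * 3.35e12);
// ≈ 0.021 s/token, i.e. roughly 47 tokens/s per stream
```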
## Precision Options
| Precision | Bytes | Use Case |
|---|---|---|
| bf16 | 2 | Full quality baseline |
| fp8 | 1 | Standard production KV — negligible quality loss |
| int8 | 1 | Alternative to fp8 |
| int4 | 0.5 | Aggressive — noticeable on long contexts |
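The precision choice scales per-token KV size linearly. A quick sketch (helper names are illustrative, not the library API):

```typescript
// KV bytes per element for each supported precision (from the table above).
const KV_ELEMENT_BYTES: Record<string, number> = {
  bf16: 2,
  fp8: 1,
  int8: 1,
  int4: 0.5,
};

// Scale a bf16 per-token KV size down to the chosen precision.
function kvPerTokenAtPrecision(bf16BytesPerToken: number, precision: string): number {
  return bf16BytesPerToken * (KV_ELEMENT_BYTES[precision] / 2);
}

// Llama 3.1 70B: 320 KB/token in bf16 → 160 KB at fp8, 80 KB at int4
const fp8KvPerToken = kvPerTokenAtPrecision(327680, "fp8"); // 163,840 B
const int4KvPerToken = kvPerTokenAtPrecision(327680, "int4"); // 81,920 B
```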
## Interactive Dashboard
The `apps/capacity-planner/` Next.js app provides a full interactive UI with:
- Model selector (all 15 architectures)
- GPU type, count, NAND per GPU sliders
- KV/weight precision selectors
- Workload controls (avg context, cold ratio, SLA thresholds)
- Per-GPU breakdown panel
- 6 interactive charts (Users vs GPUs, TPOT, TTFT, Restore, Cost, GPU vs NAND)
## API Reference
See the detailed API pages:

- KV Estimator — `kvBytesPerToken`, `kvCacheForContext`, `maxContextForMemory`, `weightMemory`
- Capacity Planner — `planCapacity`, `maxConcurrentSessions`, `estimateGpuCount`
- Latency Estimator — `estimateTpot`, `estimateTtft`, `ttftBreachPoint`, `restoreLatency`
- Session Profiler — runtime monitoring with EventBus integration
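As a back-of-the-envelope companion to `restoreLatency`: restoring a cold session's KV cache from NAND is, to first order, session KV bytes divided by SSD read bandwidth. A sketch under that assumption (not the library's actual model):

```typescript
// First-order restore latency: session KV bytes / NAND read bandwidth.
// Ignores queueing, PCIe contention, and any compression (assumptions).
function restoreLatencySketch(
  contextTokens: number,
  kvBytesPerToken: number,
  nandReadBytesPerSec: number,
): number {
  return (contextTokens * kvBytesPerToken) / nandReadBytesPerSec; // seconds
}

// 8k-token Llama 3.1 70B session with fp8 KV (160 KB/token),
// restored from an NVMe SSD with an assumed 7 GB/s read bandwidth:
const restoreSeconds = restoreLatencySketch(8192, 163840, 7e9);
// ≈ 0.19 s before the session is hot again
```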