Capacity Planning
RadarOS includes a built-in capacity planning library for modeling LLM inference infrastructure. It answers questions like:
- How many GPUs do I need for N concurrent users?
- What happens to latency when I add NAND SSD offloading?
- What’s the KV cache pressure for my workload mix?
- Where is the TTFT SLA breach point?
The system has three tiers:
| Tier | What | Where |
|---|---|---|
| Tier 1 | Pure-math capacity library | @radaros/core — zero dependencies |
| Tier 2 | Runtime session profiling | @radaros/core + @radaros/observability |
| Tier 3 | Interactive dashboard app | apps/capacity-planner/ (Next.js) |
Quick Start
```ts
import {
  planCapacity,
  DEFAULT_ARCHITECTURES,
  DEFAULT_GPU_SPECS,
} from "@radaros/core";

const plan = planCapacity(
  DEFAULT_ARCHITECTURES["llama-3.1-70b"],
  {
    gpu: DEFAULT_GPU_SPECS["h100-sxm"],
    gpuCount: 4,
    nandPerGpuGb: 0,
    nandBandwidthGBs: 7,
  },
  { extreme: 1, heavy: 2, medium: 3, light: 4 },
  "fp8",  // KV precision
  "bf16", // weight precision
);

console.log(plan.hbmSlots);          // concurrent sessions in HBM
console.log(plan.ttftBreachPoint);   // max users before 5s TTFT SLA breach
console.log(plan.monthlyGpuCostUsd);
```
Glossary
Every term used in the capacity planning system, explained in detail.
KV Cache (Key-Value Cache)
During autoregressive generation, each transformer layer computes Key and Value projections for every token. Without caching, generating token N would require recomputing K and V for all N-1 prior tokens at every step, making total generation cost quadratic in sequence length. The KV cache stores these projections so each decode step only reads them, reducing the per-step cost to linear.
The KV cache is the single largest consumer of GPU memory during inference. For Llama 3.1 70B at 128K context, the KV cache alone is ~40 GB in bf16 — larger than many GPUs.
KV Bytes Per Token
The memory required to store one token’s KV cache entry across all layers:
KV bytes/token = 2 × layers × kv_heads × head_dim × precision_bytes
The 2× accounts for both the Key tensor and the Value tensor. Each layer has its own independent set of K and V vectors, and each KV head stores a vector of head_dim floating-point values.
Example — Llama 3.1 70B in bf16:
2 × 80 layers × 8 kv_heads × 128 head_dim × 2 bytes = 327,680 bytes (~320 KB per token)
This means a single 128K-context session consumes 128,000 × 320 KB = 40 GB of KV cache.
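The arithmetic is simple enough to sketch directly. Below is an illustrative TypeScript version of this calculation (the real kvBytesPerToken and kvCacheForContext live in @radaros/core's kv-estimator.ts; the internals here are a sketch, not the actual implementation):

```ts
type KvPrecision = "bf16" | "fp8" | "int8" | "int4";

// Bytes per element for each KV precision (see Precision Options below).
const KV_PRECISION_BYTES: Record<KvPrecision, number> = {
  bf16: 2,
  fp8: 1,
  int8: 1,
  int4: 0.5,
};

interface KvShape {
  layers: number;
  kvHeads: number;
  headDim: number;
}

// kv_bytes_per_token = 2 × layers × kv_heads × head_dim × precision_bytes
function kvBytesPerTokenSketch(arch: KvShape, precision: KvPrecision): number {
  return 2 * arch.layers * arch.kvHeads * arch.headDim * KV_PRECISION_BYTES[precision];
}

// kv_per_session = avg_context_tokens × kv_bytes_per_token
function kvCacheForContextSketch(arch: KvShape, contextTokens: number, precision: KvPrecision): number {
  return contextTokens * kvBytesPerTokenSketch(arch, precision);
}

// Llama 3.1 70B in bf16: 2 × 80 × 8 × 128 × 2 = 327,680 bytes (~320 KB/token)
const llama70b: KvShape = { layers: 80, kvHeads: 8, headDim: 128 };
console.log(kvBytesPerTokenSketch(llama70b, "bf16"));            // 327680
console.log(kvCacheForContextSketch(llama70b, 128_000, "bf16")); // ~4.19e10 bytes (~40 GB)
```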
Attention Types
How a model organizes its attention heads directly determines KV cache size:
| Type | Full Name | Description | KV Heads | KV Size Impact |
|---|---|---|---|---|
| MHA | Multi-Head Attention | Every query head has its own dedicated KV head. Original transformer design. | = query heads | Baseline (largest) |
| GQA | Grouped Query Attention | Multiple query heads share one KV head. Groups of query heads attend to the same K/V vectors. | fewer than query heads | Reduced by group factor (typically 4-8×) |
| MQA | Multi-Query Attention | All query heads share a single KV head. Extreme compression. | 1 | Minimal (smallest possible) |
Why it matters: Llama 3.1 70B uses GQA with 64 query heads but only 8 KV heads — an 8× reduction in KV cache compared to MHA. Falcon 7B uses MQA with just 1 KV head — KV cache is only 8 KB/token vs 320 KB for Llama 70B.
Layers
The number of transformer blocks stacked sequentially in the model. Each layer has its own independent attention weights and stores its own KV cache. More layers = deeper model = more KV memory per token.
| Model | Layers | Impact |
|---|---|---|
| Llama 3.1 8B | 32 | Lightweight |
| Llama 3.1 70B | 80 | KV cache scales 2.5× vs 8B |
| Llama 3.1 405B | 126 | Nearly 4× the 8B’s KV per token |
Head Dimension
The size of each attention vector (Q, K, and V alike), computed as hidden_dim / num_attention_heads. Larger head dimensions store more information per attention head but increase the KV cache proportionally.
Most modern models use 128 (Llama, Mistral, Mixtral). Falcon uses 64. Gemma 2 9B uses 256.
Hidden Dimension
The width of the model’s internal representation — the size of the vector that represents each token as it flows through the network. Determines the model’s capacity to represent complex patterns. Related to head_dim via hidden_dim = attention_heads × head_dim.
FFN Dimension
The intermediate size of the feed-forward network inside each transformer layer. Typically 3-4× the hidden dimension. Affects prefill compute cost because FFN operations scale linearly with N per layer.
HBM (High Bandwidth Memory)
The GPU’s on-package memory (often called VRAM). This is where model weights, KV cache, and activations must reside for active inference. HBM is fast (~2-3.35 TB/s on modern datacenter GPUs) but limited in capacity (roughly 22-80 GB per GPU).
The entire capacity planning problem reduces to: what fits in HBM?
total_hbm = gpu_hbm × gpu_count
free_for_kv = total_hbm - weight_memory - overhead (5 GB)
hbm_slots = floor(free_for_kv / kv_per_session)
HBM Slots
The number of concurrent sessions that can have their full KV cache resident in GPU HBM. These sessions can generate tokens at full speed with no restore penalty. When HBM is full, new sessions must either wait or be served from NAND (with restore latency).
Weight Memory
The GPU memory consumed by the model’s parameters (weights). This is a fixed cost that must be paid regardless of how many users are served.
| Precision | Llama 70B Weight Size | Notes |
|---|---|---|
| bf16 | 140 GB | Full quality, needs 2+ H100s |
| int8 | 70 GB | Fits on 1× H100 |
| int4 (AWQ/GPTQ) | 35 GB | Fits across 2× RTX A5000 with KV headroom |
NAND SSD Offloading
Using NVMe solid-state drives attached to each GPU server to store KV cache for inactive (parked) sessions. When a parked session becomes active, its KV cache is loaded from NAND back into HBM.
NAND expands the total number of sessions the system can manage but does not help active inference speed — decoding still requires KV data in HBM.
| Storage Tier | Bandwidth | Latency | Role |
|---|---|---|---|
| GPU HBM | 2,000–3,350 GB/s | ~100ns | Active decoding |
| NVMe Gen4 SSD | ~7 GB/s | ~100µs | Cold session parking |
| NVMe Gen5 SSD | ~14 GB/s | ~100µs | Faster cold parking |
NAND Slots
The number of sessions that can be parked on NAND SSD while inactive. Computed as total_nand_gb / kv_per_session_gb. These sessions can be restored to HBM when they become active, at the cost of restore latency.
Restore Latency
The time required to load a parked session’s KV cache from NAND SSD back into GPU HBM. This is the “wake-up cost” for a cold session.
restore_time = kv_size_gb / nand_bandwidth_gb_per_sec
Example: 5 GB KV cache on Gen4 NVMe (7 GB/s) = 714ms restore time.
When multiple sessions restore simultaneously, they share the SSD bandwidth pipe, increasing individual restore time: effective_bw = nand_bw / parallel_streams.
Cold Ratio
The percentage of total sessions that are parked on NAND at any given moment (inactive, not generating tokens). Typical values:
- 20-30% — most sessions are active (interactive chat)
- 50% — half parked (async agent workloads with tool waits)
- 70-80% — most parked (background research agents)
concurrent_active = total_sessions × (1 - cold_ratio)
TPOT (Time Per Output Token)
The latency for each decode step — generating one output token. Decoding is memory-bandwidth-bound because each step must stream the entire KV cache for all active sequences through HBM.
TPOT = (context_tokens × batch_size × kv_bytes_per_token) / aggregate_bandwidth
TPOT scales linearly with context length and batch size. A user perceives this as the streaming speed — lower TPOT = faster text output. Interactive applications target < 50ms TPOT (~20 tokens/sec streaming).
TTFT (Time To First Token)
The latency from when a user submits their prompt to when the first output token arrives. TTFT is dominated by prefill — processing the entire input prompt through every layer to build the KV cache.
Prefill is compute-bound (not memory-bound like decode) because attention scales quadratically with prompt length.
Under concurrent load, prefills are serialized on the GPU compute path. With C concurrent users arriving together, a random user's first token arrives after (C + 1) / 2 prefills on average (their own plus those queued ahead):
TTFT(C users) = single_prefill_time × (C + 1) / 2
Interactive applications target < 1-5 seconds TTFT.
TTFT Breach Point
The maximum number of concurrent users before average TTFT exceeds the configured SLA threshold. Computed by solving:
single_prefill × (C + 1) / 2 = ttft_sla
→ C = 2 × ttft_sla / single_prefill - 1
Adding more GPUs increases TFLOPS, which reduces single prefill time, which pushes the breach point out. Adding NAND does not move the breach point — NAND doesn’t help prefill compute.
Single Prefill Time
The time to process one prompt through all layers with no queue contention. This is the atomic unit that TTFT is built from.
prefill_flops = (4 × N² × hidden_dim + 4 × N × ffn_dim) × layers
single_prefill = prefill_flops / (gpu_tflops × gpu_count × efficiency)
Where efficiency is ~35% (real-world vs peak TFLOPS). The quadratic attention term dominates at long contexts — a 32K prompt takes ~64x longer than a 4K prompt, not 8x.
Prefix Caching / Prefix Hit Rate
When multiple requests share the same prefix (system prompt, RAG context, few-shot examples), the KV cache for that prefix can be computed once and reused. A prefix cache hit skips the expensive prefill entirely for the shared portion.
A 60% hit rate can effectively double throughput — the biggest “free” optimization in production inference.
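As a rough illustration, here is a hypothetical back-of-envelope helper (not part of @radaros/core) that treats prefill cost as proportional to the uncached fraction of the prompt. This is a first-order approximation, since attention cost is actually quadratic in prompt length:

```ts
// With prefix hit rate h, only the non-shared fraction of the prompt
// needs fresh prefill compute (linear approximation).
function effectivePrefillMs(singlePrefillMs: number, prefixHitRate: number): number {
  return singlePrefillMs * (1 - prefixHitRate);
}

// 60% hit rate: a 241.9 ms prefill drops to ~96.8 ms of new compute,
// meaning the same GPUs can absorb far more prompts per second.
console.log(effectivePrefillMs(241.9, 0.6)); // 96.76
```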
Tensor Parallelism
Splitting a model across multiple GPUs within the same node. Each GPU holds a shard of the weights and a shard of each KV cache. GPUs communicate via NVLink during each forward pass.
- Increases total HBM (more GPUs = more memory)
- Increases aggregate bandwidth (faster TPOT)
- Increases aggregate TFLOPS (faster prefill, lower TTFT)
- Adds ~5-15% communication overhead via NVLink
Workload Mix
The distribution of session types by token intensity:
| Category | Token Range | Typical Use Case | KV Cache (70B, fp8) |
|---|---|---|---|
| Light | 0 – 50K | Quick Q&A, lookups | up to 7.8 GB |
| Medium | 50K – 200K | Multi-turn explanations | 7.8 – 31.2 GB |
| Heavy | 200K – 500K | Deep research, SWE tasks | 31.2 – 78.1 GB |
| Extreme | 500K+ | Full repo analysis, long agents | 78.1+ GB |
The workload mix determines the weighted average context length and drives the capacity plan. A mix of { extreme: 1, heavy: 2, medium: 3, light: 4 } (10 users) produces a weighted average context of ~243K tokens (see Step 13).
Session Category Thresholds
The token boundaries used by the SessionProfiler to classify live sessions:
```ts
const SESSION_CATEGORY_THRESHOLDS = {
  light: 50_000,     // up to 50K tokens
  medium: 200_000,   // 50K - 200K
  heavy: 500_000,    // 200K - 500K
  extreme: Infinity, // 500K+
};
```
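For illustration, a classifier over these boundaries might look like the sketch below (the actual SessionProfiler logic may differ in detail):

```ts
type SessionCategory = "light" | "medium" | "heavy" | "extreme";

// Map a session's total token count to its category using the thresholds above.
function classifySession(totalTokens: number): SessionCategory {
  if (totalTokens <= 50_000) return "light";
  if (totalTokens <= 200_000) return "medium";
  if (totalTokens <= 500_000) return "heavy";
  return "extreme";
}

console.log(classifySession(12_000));  // "light"
console.log(classifySession(340_000)); // "heavy"
```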
Overhead
A fixed 5 GB budget for activations, CUDA contexts, framework metadata, and vLLM’s internal data structures (page tables, scheduling state). This is subtracted from total HBM before computing KV capacity.
Precision Options
KV cache and model weights can be quantized independently:
KV Precision
| Precision | Bytes/element | KV/token (70B) | Quality Impact |
|---|---|---|---|
| bf16 | 2 | 320 KB | Lossless — baseline |
| fp8 | 1 | 160 KB | ~0.1-0.3% perplexity increase — standard production choice |
| int8 | 1 | 160 KB | Slightly more lossy than fp8 |
| int4 | 0.5 | 80 KB | Noticeable degradation on long contexts |
fp8 KV is standard practice — it halves memory and bandwidth usage with negligible quality loss.
Weight Precision
| Precision | 70B Size | Min GPUs | Quality Impact |
|---|---|---|---|
| bf16 | 140 GB | 2× H100 | Lossless |
| int8 | 70 GB | 1× H100 | Minor degradation |
| int4 (AWQ/GPTQ) | 35 GB | 2× RTX A5000 | Acceptable for most tasks |
The standard production setup is fp8 KV + bf16 weights for cloud GPUs, or fp8 KV + int4 weights for cost-sensitive self-hosted deployments.
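A minimal sketch of the weight-sizing math (the ratios come from Step 3 below; the real weightMemory() lives in kv-estimator.ts and may differ in detail):

```ts
// weight_memory = weight_size_bf16 × precision_ratio
const WEIGHT_PRECISION_RATIO = { bf16: 1.0, int8: 0.5, int4: 0.25 } as const;
type WeightPrecision = keyof typeof WEIGHT_PRECISION_RATIO;

function weightMemoryGbSketch(weightSizeBf16Gb: number, precision: WeightPrecision): number {
  return weightSizeBf16Gb * WEIGHT_PRECISION_RATIO[precision];
}

console.log(weightMemoryGbSketch(140, "bf16")); // 140 GB - needs 2x H100 for weights alone
console.log(weightMemoryGbSketch(140, "int4")); // 35 GB
```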
Model Architectures
14 models included out of the box, with specs sourced from HuggingFace config.json:
| Model | Type | Layers | KV Heads | Head Dim | KV/token (bf16) | Max Context |
|---|---|---|---|---|---|---|
| Llama 3.1 8B | GQA | 32 | 8 | 128 | 128 KB | 128K |
| Llama 3.1 70B | GQA | 80 | 8 | 128 | 320 KB | 128K |
| Llama 3.1 405B | GQA | 126 | 8 | 128 | 504 KB | 128K |
| Llama 2 7B | MHA | 32 | 32 | 128 | 512 KB | 4K |
| Llama 2 13B | MHA | 40 | 40 | 128 | 800 KB | 4K |
| Llama 2 70B | GQA | 80 | 8 | 128 | 320 KB | 4K |
| Mixtral 8×7B | GQA | 32 | 8 | 128 | 128 KB | 32K |
| Mixtral 8×22B | GQA | 56 | 8 | 128 | 224 KB | 64K |
| Falcon 7B | MQA | 32 | 1 | 64 | 8 KB | 8K |
| Falcon 40B | GQA | 60 | 8 | 64 | 120 KB | 8K |
| Mistral 7B | GQA | 32 | 8 | 128 | 128 KB | 32K |
| Phi-3 Mini | MHA | 32 | 32 | 96 | 384 KB | 128K |
| Gemma 2 9B | GQA | 42 | 8 | 256 | 336 KB | 8K |
| Gemma 2 27B | GQA | 46 | 16 | 128 | 368 KB | 8K |
Custom architectures can be passed to any function:
```ts
const myModel: ModelArchitecture = {
  id: "my-model",
  displayName: "My Custom 13B",
  family: "custom",
  params: "13B",
  layers: 40,
  attentionHeads: 40,
  kvHeads: 8,
  headDim: 128,
  hiddenDim: 5120,
  ffnDim: 13824,
  maxContext: 32768,
  attentionType: "gqa",
  weightSizeBf16Gb: 26,
};
```
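The custom model then drops into the same helpers as a built-in architecture. A sketch reusing the Quick Start setup (the "h100-sxm" key, argument order, and signatures follow the examples elsewhere in this page):

```ts
import { kvBytesPerToken, planCapacity, DEFAULT_GPU_SPECS } from "@radaros/core";

// 2 × 40 layers × 8 kv_heads × 128 head_dim × 1 byte (fp8) = 81,920 bytes (~80 KB/token)
console.log(kvBytesPerToken(myModel, "fp8"));

// Plan capacity for the custom model exactly as for a built-in one.
const customPlan = planCapacity(
  myModel,
  { gpu: DEFAULT_GPU_SPECS["h100-sxm"], gpuCount: 2, nandPerGpuGb: 0, nandBandwidthGBs: 7 },
  { extreme: 0, heavy: 1, medium: 2, light: 3 },
  "fp8",
  "bf16",
);
console.log(customPlan.hbmSlots);
```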
GPU Specs
| GPU | HBM | Bandwidth | bf16 TFLOPS | NVLink | Use Case |
|---|---|---|---|---|---|
| H100 SXM | 80 GB | 3.35 TB/s | 989 | 900 GB/s | Premium cloud inference |
| A100 SXM | 80 GB | 2.0 TB/s | 312 | 600 GB/s | Standard cloud inference |
| L40S | 48 GB | 0.864 TB/s | 366 | None | Mid-tier / batch workloads |
| RTX A5000 | 22.5 GB | 0.768 TB/s | 65 | None | Cost-sensitive self-hosted |
| RTX 4090 | 24 GB | 1.008 TB/s | 330 | None | Dev / small-scale serving |
Key metrics explained:
- HBM — Total GPU memory. Determines how much fits (weights + KV + overhead).
- Bandwidth — How fast data streams from HBM. Determines TPOT (decode speed).
- bf16 TFLOPS — Peak compute throughput. Determines prefill speed and TTFT.
- NVLink — GPU-to-GPU interconnect bandwidth. Only matters for tensor parallelism across multiple GPUs in the same node. GPUs without NVLink communicate over PCIe (~64 GB/s), which adds latency for multi-GPU setups.
How the Math Works — Step by Step
This section walks through every calculation the capacity planner performs, with a worked example using Llama 3.1 70B on 8× RTX A5000 with int4 AWQ weights and fp8 KV cache.
Step 1: KV Bytes Per Token
What: How many bytes does one token cost in the KV cache?
Formula:
kv_bytes_per_token = 2 × layers × kv_heads × head_dim × precision_bytes
Why each term:
- 2 — one Key vector + one Value vector per layer
- layers (80) — each of the 80 transformer blocks stores its own K and V
- kv_heads (8) — GQA means only 8 KV heads (not all 64 query heads)
- head_dim (128) — each head stores a 128-dimensional vector
- precision_bytes (1 for fp8) — bytes per floating-point element
Calculation:
2 × 80 × 8 × 128 × 1 = 163,840 bytes = 160 KB/token
If we used bf16 instead of fp8, it would be × 2 bytes = 327,680 bytes = 320 KB/token — double.
Code: kvBytesPerToken(arch, "fp8") in kv-estimator.ts
Step 2: KV Cache Per Session
What: Total KV memory for one session at a given average context length.
Formula:
kv_per_session = avg_context_tokens × kv_bytes_per_token
Calculation (16K context):
16,384 tokens × 163,840 bytes = 2,684,354,560 bytes = 2.5 GB per session
Calculation (128K full context):
131,072 tokens × 163,840 bytes = 21,474,836,480 bytes = 20 GB per session
Code: kvCacheForContext(arch, 16384, "fp8") in kv-estimator.ts
Step 3: Weight Memory
What: GPU memory consumed by the model’s parameters.
Formula:
weight_memory = weight_size_bf16 × precision_ratio
Precision ratios: bf16 = 1.0, int8 = 0.5, int4 = 0.25
Calculation (int4 AWQ):
weight_memory = 140 GB × 0.25 = 35 GB
Without quantization (bf16), the weights would be 140 GB — needing 2× H100s just for weights. With int4, they fit on a single 80 GB GPU with room to spare.
Code: weightMemory(arch, "int4") in kv-estimator.ts
Step 4: Free HBM for KV Cache
What: How much GPU memory is available for KV cache after weights and overhead.
Formula:
total_hbm = gpu_hbm × gpu_count
free_hbm = total_hbm - weight_memory - overhead
Calculation (8× RTX A5000):
total_hbm = 22.5 GB × 8 = 180 GB
free_hbm = 180 - 35 - 5 = 140 GB
The 5 GB overhead covers CUDA contexts, vLLM paging metadata, activation buffers, and framework state.
Code: Lines 92-94 in capacity-planner.ts
Step 5: HBM Slots (Active Sessions)
What: How many concurrent sessions fit in free HBM.
Formula:
hbm_slots = floor(free_hbm / kv_per_session)
Calculation (16K avg context, fp8):
hbm_slots = floor(140 GB / 2.5 GB) = 56 sessions
Calculation (4K avg context, fp8):
kv_per_session = 4,096 × 163,840 = 0.625 GB
hbm_slots = floor(140 / 0.625) = 224 sessions
Notice how context length dominates: 4× shorter context = 4× more sessions.
Code: maxConcurrentSessions() in capacity-planner.ts
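A minimal TypeScript sketch of Steps 4 and 5 combined (illustrative only; the real maxConcurrentSessions() in capacity-planner.ts may differ in detail). It reproduces the 8× RTX A5000 worked example:

```ts
// Fixed budget for activations, CUDA contexts, and framework state (see Overhead).
const OVERHEAD_GB = 5;

// free_hbm = gpu_hbm × gpu_count - weight_memory - overhead
// hbm_slots = floor(free_hbm / kv_per_session)
function hbmSlotsSketch(
  gpuHbmGb: number,
  gpuCount: number,
  weightMemoryGb: number,
  kvPerSessionGb: number,
): number {
  const totalHbmGb = gpuHbmGb * gpuCount;                      // 22.5 × 8 = 180 GB
  const freeHbmGb = totalHbmGb - weightMemoryGb - OVERHEAD_GB; // 180 - 35 - 5 = 140 GB
  return Math.floor(freeHbmGb / kvPerSessionGb);
}

console.log(hbmSlotsSketch(22.5, 8, 35, 2.5));   // 56 sessions at 16K avg context (fp8)
console.log(hbmSlotsSketch(22.5, 8, 35, 0.625)); // 224 sessions at 4K avg context
```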
Step 6: NAND Slots (Parked Sessions)
What: How many additional sessions can be parked on SSD.
Formula:
total_nand = nand_per_gpu × gpu_count
nand_slots = floor(total_nand / kv_per_session)
Calculation (4 TB NAND per GPU, 16K context, fp8):
total_nand = 4,000 GB × 8 = 32,000 GB
nand_slots = floor(32,000 / 2.5) = 12,800 parked sessions
total_sessions = 56 (HBM) + 12,800 (NAND) = 12,856
NAND massively expands capacity. But those 12,800 sessions are parked — they need restore latency to become active.
Code: Lines 32-36 in capacity-planner.ts
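The NAND side follows the same pattern. An illustrative sketch of Step 6:

```ts
// nand_slots = floor(nand_per_gpu × gpu_count / kv_per_session)
function nandSlotsSketch(nandPerGpuGb: number, gpuCount: number, kvPerSessionGb: number): number {
  const totalNandGb = nandPerGpuGb * gpuCount; // 4,000 × 8 = 32,000 GB
  return Math.floor(totalNandGb / kvPerSessionGb);
}

console.log(nandSlotsSketch(4_000, 8, 2.5)); // 12,800 parked sessions
```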
Step 7: TPOT (Decode Latency)
What: How long each output token takes to generate.
Why bandwidth-bound: Each decode step must read the entire KV cache for all active sequences from HBM. The GPU compute is idle waiting for memory.
Formula:
total_bytes = context_tokens × batch_size × kv_bytes_per_token
bandwidth = gpu_bandwidth_TBs × gpu_count × 10^12 (convert to bytes/sec)
tpot_ms = (total_bytes / bandwidth) × 1000
Calculation (16K context, 1 user, 8× RTX A5000):
total_bytes = 16,384 × 1 × 163,840 = 2,684,354,560 bytes
bandwidth = 0.768 × 8 × 10^12 = 6,144,000,000,000 bytes/sec
tpot_ms = (2,684,354,560 / 6,144,000,000,000) × 1000 = 0.44 ms
With 10 concurrent users (batch=10):
total_bytes = 16,384 × 10 × 163,840 = 26,843,545,600 bytes
tpot_ms = (26,843,545,600 / 6,144,000,000,000) × 1000 = 4.37 ms
TPOT scales linearly with batch size. At 50ms SLA, you breach at ~114 concurrent users.
Code: estimateTpot() in latency-estimator.ts
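An illustrative sketch of the TPOT model (the real estimateTpot() lives in latency-estimator.ts): decode is treated as streaming the entire active KV working set through aggregate HBM bandwidth once per output token.

```ts
function estimateTpotMsSketch(
  contextTokens: number,
  batchSize: number,
  kvBytesPerToken: number,
  gpuBandwidthTBs: number,
  gpuCount: number,
): number {
  const totalBytes = contextTokens * batchSize * kvBytesPerToken;
  const bytesPerSec = gpuBandwidthTBs * gpuCount * 1e12; // TB/s -> bytes/s
  return (totalBytes / bytesPerSec) * 1000;
}

console.log(estimateTpotMsSketch(16_384, 1, 163_840, 0.768, 8));  // ~0.44 ms
console.log(estimateTpotMsSketch(16_384, 10, 163_840, 0.768, 8)); // ~4.37 ms
```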
Step 8: Single Prefill Time
What: Time to process one prompt through all layers (no queue).
Why compute-bound: Prefill runs the full attention computation (quadratic in prompt length) plus FFN (linear). The GPU compute units are saturated, not memory.
Formula:
prefill_flops = (4 × N² × hidden_dim + 4 × N × ffn_dim) × layers
gpu_flops = gpu_tflops × gpu_count × efficiency × 10^12
single_prefill_ms = (prefill_flops / gpu_flops) × 1000
The efficiency factor is 0.35 (35%) — real-world GPU utilization vs peak spec. This accounts for memory stalls, kernel launch overhead, and tensor parallelism communication.
Calculation (4K prompt, 8× RTX A5000):
N = 4,096
prefill_flops = (4 × 4096² × 8192 + 4 × 4096 × 28672) × 80
= (4 × 16,777,216 × 8192 + 4 × 4096 × 28672) × 80
= (549,755,813,888 + 469,762,048) × 80
= 550,225,575,936 × 80
= 44,018,046,074,880 flops
gpu_flops = 65 × 8 × 0.35 × 10^12 = 182,000,000,000,000 flops/sec
single_prefill = (44,018,046,074,880 / 182,000,000,000,000) × 1000 = 241.9 ms
Why 32K prompt is ~64× slower than 4K (not 8×):
The quadratic attention term dominates. When N grows 8x, the attention cost grows 64x. This is why long-context prefill is so expensive.
Code: singlePrefillMs() in latency-estimator.ts
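An illustrative sketch of Step 8 (the real singlePrefillMs() is in latency-estimator.ts): the quadratic attention term plus the linear FFN term, divided by derated aggregate compute.

```ts
const EFFICIENCY = 0.35; // real-world utilization vs peak TFLOPS

function singlePrefillMsSketch(
  n: number, // prompt tokens
  hiddenDim: number,
  ffnDim: number,
  layers: number,
  gpuTflops: number,
  gpuCount: number,
): number {
  // prefill_flops = (4 × N² × hidden_dim + 4 × N × ffn_dim) × layers
  const prefillFlops = (4 * n * n * hiddenDim + 4 * n * ffnDim) * layers;
  const gpuFlops = gpuTflops * gpuCount * EFFICIENCY * 1e12;
  return (prefillFlops / gpuFlops) * 1000;
}

// Llama 3.1 70B (hidden 8192, ffn 28672, 80 layers) on 8× RTX A5000:
console.log(singlePrefillMsSketch(4_096, 8_192, 28_672, 80, 65, 8)); // ~241.9 ms
```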
Step 9: TTFT Under Load
What: How long a user waits for the first token when other users are also submitting prompts.
Why it degrades: Prefills are serialized on the GPU compute path. With C concurrent users, each user’s prefill waits behind the others in a queue.
Formula:
TTFT(C users) = single_prefill × (C + 1) / 2
The (C+1)/2 is the average queue position — if C users arrive simultaneously, a random user is at position 1 to C uniformly, so the average wait is (C+1)/2 prefills.
Calculation (10 concurrent users, 4K prompt):
TTFT = 241.9 ms × (10 + 1) / 2 = 241.9 × 5.5 = 1,330 ms (1.3 seconds)
Calculation (100 concurrent users):
TTFT = 241.9 × (100 + 1) / 2 = 241.9 × 50.5 = 12,216 ms (12.2 seconds)
Code: estimateTtft() in latency-estimator.ts
Step 10: TTFT Breach Point
What: Maximum concurrent users before average TTFT exceeds the SLA.
Formula (solving Step 9 for C):
single_prefill × (C + 1) / 2 = ttft_sla_ms
C = 2 × ttft_sla_ms / single_prefill - 1
Calculation (5 second SLA, 4K prompt):
C = 2 × 5000 / 241.9 - 1 = 41.3 - 1 = 40.3 → 40 users
At 40 concurrent users, the average TTFT is just under 5 seconds. The 41st user pushes it past the SLA.
Important: Adding NAND does NOT change this number. NAND parks cold sessions but doesn’t add TFLOPS — the prefill queue bottleneck is compute, not memory.
Code: ttftBreachPoint() in latency-estimator.ts
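Steps 9 and 10 reduce to a few lines. An illustrative sketch (the real estimateTtft() and ttftBreachPoint() are in latency-estimator.ts):

```ts
// TTFT(C users) = single_prefill × (C + 1) / 2 (average queue position)
function estimateTtftMsSketch(singlePrefillMs: number, concurrentUsers: number): number {
  return (singlePrefillMs * (concurrentUsers + 1)) / 2;
}

// Solve for the largest C whose average TTFT stays within the SLA.
function ttftBreachPointSketch(singlePrefillMs: number, ttftSlaMs: number): number {
  return Math.floor((2 * ttftSlaMs) / singlePrefillMs - 1);
}

console.log(estimateTtftMsSketch(241.9, 10));    // ~1,330 ms
console.log(ttftBreachPointSketch(241.9, 5000)); // 40 users
```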
Step 11: Restore Latency
What: Time to wake up a cold session from NAND SSD.
Formula:
per_gpu_kv = kv_per_session / gpu_count (tensor parallel sharding)
restore_ms = (per_gpu_kv / nand_bandwidth) × 1000
Each GPU restores its own shard in parallel, so the KV per session is divided by GPU count.
Calculation (16K context fp8, 8× GPU, Gen4 NVMe 7 GB/s):
kv_per_session = 2.5 GB
per_gpu_kv = 2.5 / 8 = 0.3125 GB
restore_ms = (0.3125 / 7) × 1000 = 44.6 ms
With parallel restore streams (4 sessions restoring simultaneously):
effective_bw = 7 / 4 = 1.75 GB/s per stream
restore_ms = (0.3125 / 1.75) × 1000 = 178.6 ms per session
Code: restoreLatency() in latency-estimator.ts
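An illustrative sketch of Step 11 (the real restoreLatency() is in latency-estimator.ts): each GPU restores its tensor-parallel shard in parallel, and concurrent restores share the SSD bandwidth.

```ts
function restoreMsSketch(
  kvPerSessionGb: number,
  gpuCount: number,
  nandBandwidthGBs: number,
  parallelStreams = 1,
): number {
  const perGpuKvGb = kvPerSessionGb / gpuCount;           // shard per GPU
  const effectiveBw = nandBandwidthGBs / parallelStreams; // shared SSD pipe
  return (perGpuKvGb / effectiveBw) * 1000;
}

console.log(restoreMsSketch(2.5, 8, 7));    // ~44.6 ms
console.log(restoreMsSketch(2.5, 8, 7, 4)); // ~178.6 ms with 4 parallel restores
```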
Step 12: Monthly GPU Cost
What: Infrastructure cost estimate.
Formula:
monthly_cost = gpu_count × price_per_hour × 730 hours/month
Calculation (8× RTX A5000 on-demand at $1.10/hr):
monthly_cost = 8 × $1.10 × 730 = $6,424/month
Per-slot cost:
cost_per_slot_per_day = $6,424 / 56 slots / 30 days = $3.82/day
Code: monthlyGpuCost() in infra-cost.ts
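An illustrative sketch of Step 12 (the real monthlyGpuCost() is in infra-cost.ts): straight on-demand pricing at 730 hours per month.

```ts
function monthlyGpuCostUsdSketch(gpuCount: number, pricePerHourUsd: number): number {
  return gpuCount * pricePerHourUsd * 730;
}

const monthly = monthlyGpuCostUsdSketch(8, 1.1); // $6,424
console.log(monthly / 56 / 30);                  // ~$3.82 per HBM slot per day
```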
Step 13: Weighted Average Context (Workload Mix)
What: Converts the session type distribution into a single average context length.
Formula:
avg_context = Σ(count_i × midpoint_i) / Σ(count_i)
Midpoints: light=35K, medium=130K, heavy=325K, extreme=1,250K
Calculation (extreme=1, heavy=2, medium=3, light=4):
avg = (1×1,250,000 + 2×325,000 + 3×130,000 + 4×35,000) / (1+2+3+4)
= (1,250,000 + 650,000 + 390,000 + 140,000) / 10
= 2,430,000 / 10
= 243,000 tokens
This weighted average drives all the session slot calculations in planCapacity().
Code: weightedAvgContext() in capacity-planner.ts
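An illustrative sketch of Step 13 (the real weightedAvgContext() is in capacity-planner.ts), using the midpoints defined above:

```ts
const CONTEXT_MIDPOINTS = {
  light: 35_000,
  medium: 130_000,
  heavy: 325_000,
  extreme: 1_250_000,
} as const;

type WorkloadMix = Partial<Record<keyof typeof CONTEXT_MIDPOINTS, number>>;

// avg_context = Σ(count_i × midpoint_i) / Σ(count_i)
function weightedAvgContextSketch(mix: WorkloadMix): number {
  let tokens = 0;
  let sessions = 0;
  for (const [category, count] of Object.entries(mix)) {
    tokens += (count ?? 0) * CONTEXT_MIDPOINTS[category as keyof typeof CONTEXT_MIDPOINTS];
    sessions += count ?? 0;
  }
  return sessions === 0 ? 0 : tokens / sessions;
}

console.log(weightedAvgContextSketch({ extreme: 1, heavy: 2, medium: 3, light: 4 })); // 243000
```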
Full Worked Example Summary
Config: Llama 3.1 70B, 8× RTX A5000, int4 weights, fp8 KV, 16K avg context
| Step | Calculation | Result |
|---|---|---|
| KV/token | 2 × 80 × 8 × 128 × 1 | 160 KB |
| KV/session (16K) | 16,384 × 160 KB | 2.5 GB |
| Weights (int4) | 140 × 0.25 | 35 GB |
| Total HBM | 22.5 × 8 | 180 GB |
| Free for KV | 180 - 35 - 5 | 140 GB |
| HBM slots | floor(140 / 2.5) | 56 |
| NAND slots (4TB/GPU) | floor(32,000 / 2.5) | 12,800 |
| TPOT (1 user) | 2.68 GB / 6.14 TB/s | 0.44 ms |
| Single prefill (4K) | 44T flops / 182T flops/s | 241.9 ms |
| TTFT (10 users) | 241.9 × 5.5 | 1,330 ms |
| TTFT breach (5s SLA) | 2×5000/241.9 - 1 | 40 users |
| Restore (Gen4 NVMe) | 0.3125 GB / 7 GB/s | 44.6 ms |
| Monthly cost | 8 × $1.10 × 730 | $6,424 |
The CapacityPlan Object
The planCapacity() function returns a complete CapacityPlan with every metric:
| Field | Type | Description |
|---|---|---|
| model | ModelArchitecture | The model being planned for |
| hardware | HardwareConfig | GPU setup (type, count, NAND) |
| kvPrecision | KvPrecision | KV cache precision used |
| weightPrecision | WeightPrecision | Weight quantization used |
| totalHbmGb | number | Total GPU memory across all GPUs |
| weightMemoryGb | number | Memory consumed by model weights |
| freeHbmForKvGb | number | HBM available for KV cache after weights + overhead |
| kvBytesPerToken | number | Bytes per token in KV cache |
| hbmSlots | number | Concurrent sessions fitting in HBM |
| nandSlots | number | Sessions parkable on NAND SSD |
| totalSessions | number | hbmSlots + nandSlots |
| tpotMs | number | Estimated Time Per Output Token (ms) |
| ttftMs | number | Estimated Time To First Token (ms) |
| restoreLatencyMs | number \| null | NAND → HBM restore time (null if no NAND) |
| ttftBreachPoint | number | Max concurrent users before 5s TTFT SLA breach |
| monthlyGpuCostUsd | number | Estimated monthly GPU infrastructure cost |
Interactive Dashboard
The apps/capacity-planner/ Next.js app provides a full interactive UI with:
- Model selector (all 14 architectures)
- GPU type, count, NAND per GPU sliders
- KV/weight precision selectors
- Workload controls (avg context, cold ratio, SLA thresholds)
- Per-GPU breakdown panel (shows free HBM + NAND per card)
- 6 interactive charts:
- Users vs GPUs — session capacity scaling with GPU count
- Users vs Context — how capacity drops as context grows
- TPOT vs Users — decode latency at different context sizes
- TTFT vs Users — prefill queue congestion with SLA breach markers
- Restore Budget — NAND restore time at Gen4/Gen5 bandwidth
- GPU vs NAND — total sessions across NAND sizes
```bash
cd apps/capacity-planner
npm install
npm run dev
# → http://localhost:3000
```
Library Modules
- KV Estimator — kvBytesPerToken, kvCacheForContext, maxContextForMemory, weightMemory
- Capacity Planner — planCapacity, maxConcurrentSessions, estimateGpuCount
- Latency Estimator — estimateTpot, estimateTtft, ttftBreachPoint, restoreLatency
- Session Profiler — Runtime monitoring with EventBus integration
- Edge GPU Monitoring — Real-time GPU memory, utilization, and temperature via nvidia-smi
- Observability Metrics — Prometheus counters for KV cache and session categories
- Examples — Live monitor, config comparison, KV sizing