
Capacity Planning

RadarOS includes a built-in capacity planning library for modeling LLM inference infrastructure. It answers questions like:
  • How many GPUs do I need for N concurrent users?
  • What happens to latency when I add NAND SSD offloading?
  • What’s the KV cache pressure for my workload mix?
  • Where is the TTFT SLA breach point?
The system has three tiers:

| Tier | What | Where |
| --- | --- | --- |
| Tier 1 | Pure-math capacity library | @radaros/core — zero dependencies |
| Tier 2 | Runtime session profiling | @radaros/core + @radaros/observability |
| Tier 3 | Interactive dashboard app | apps/capacity-planner/ (Next.js) |

Quick Start

import {
  planCapacity,
  DEFAULT_ARCHITECTURES,
  DEFAULT_GPU_SPECS,
} from "@radaros/core";

const plan = planCapacity(
  DEFAULT_ARCHITECTURES["llama-3.1-70b"],
  {
    gpu: DEFAULT_GPU_SPECS["h100-sxm"],
    gpuCount: 4,
    nandPerGpuGb: 0,
    nandBandwidthGBs: 7,
  },
  { extreme: 1, heavy: 2, medium: 3, light: 4 },
  "fp8",   // KV precision
  "bf16",  // weight precision
);

console.log(plan.hbmSlots);        // concurrent sessions in HBM
console.log(plan.ttftBreachPoint);  // max users before 5s TTFT SLA breach
console.log(plan.monthlyGpuCostUsd);

Glossary

Every term used in the capacity planning system, explained in detail.

KV Cache (Key-Value Cache)

During autoregressive generation, each transformer layer computes Key and Value projections for every token. Without caching, generating token N would require recomputing K and V for all N-1 prior tokens — quadratic cost per step. The KV cache stores these projections so each decode step only reads them — reducing cost to linear per step. The KV cache is the single largest consumer of GPU memory during inference. For Llama 3.1 70B at 128K context, the KV cache alone is ~40 GB in bf16 — larger than many GPUs.

KV Bytes Per Token

The memory required to store one token’s KV cache entry across all layers:
KV bytes/token = 2 × layers × kv_heads × head_dim × precision_bytes
The factor of 2 accounts for both the Key tensor and the Value tensor. Each layer has its own independent set of K and V vectors, and each KV head stores a vector of head_dim floating-point values.

Example — Llama 3.1 70B in bf16:

2 × 80 layers × 8 kv_heads × 128 head_dim × 2 bytes = 327,680 bytes (~320 KB per token)

This means a single 128K-context session consumes 128,000 × 320 KB ≈ 40 GB of KV cache.
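For a quick sanity check, the formula can be expressed directly in code. This is a minimal sketch with illustrative type names — not the actual @radaros/core implementation, though the docs do reference a kvBytesPerToken() helper in kv-estimator.ts:

```typescript
// Bytes per element for each supported KV precision.
const PRECISION_BYTES = { bf16: 2, fp8: 1, int8: 1, int4: 0.5 } as const;

// Hypothetical shape — only the fields the formula needs.
interface ArchShape {
  layers: number;
  kvHeads: number;
  headDim: number;
}

function kvBytesPerToken(arch: ArchShape, precision: keyof typeof PRECISION_BYTES): number {
  // 2 = one Key vector + one Value vector per layer
  return 2 * arch.layers * arch.kvHeads * arch.headDim * PRECISION_BYTES[precision];
}

// Llama 3.1 70B (80 layers, 8 KV heads, head_dim 128) in bf16:
const llama31_70b = { layers: 80, kvHeads: 8, headDim: 128 };
console.log(kvBytesPerToken(llama31_70b, "bf16")); // → 327680 (~320 KB)
console.log(kvBytesPerToken(llama31_70b, "fp8"));  // → 163840 (~160 KB)
```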

Attention Types

How a model organizes its attention heads directly determines KV cache size:
| Type | Full Name | Description | KV Heads | KV Size Impact |
| --- | --- | --- | --- | --- |
| MHA | Multi-Head Attention | Every query head has its own dedicated KV head. Original transformer design. | = query heads | Baseline (largest) |
| GQA | Grouped Query Attention | Multiple query heads share one KV head. Groups of query heads attend to the same K/V vectors. | fewer than query heads | Reduced by group factor (typically 4-8×) |
| MQA | Multi-Query Attention | All query heads share a single KV head. Extreme compression. | 1 | Minimal (smallest possible) |
Why it matters: Llama 3.1 70B uses GQA with 64 query heads but only 8 KV heads — an 8× reduction in KV cache compared to MHA. Falcon 7B uses MQA with just 1 KV head — KV cache is only 8 KB/token vs 320 KB for Llama 70B.

Layers

The number of transformer blocks stacked sequentially in the model. Each layer has its own independent attention weights and stores its own KV cache. More layers = deeper model = more KV memory per token.
| Model | Layers | Impact |
| --- | --- | --- |
| Llama 3.1 8B | 32 | Lightweight |
| Llama 3.1 70B | 80 | KV cache scales 2.5× vs 8B |
| Llama 3.1 405B | 126 | Nearly 4× the 8B’s KV per token |

Head Dimension

The size of each attention vector (both Q/K/V). Determined by hidden_dim / num_attention_heads. Larger head dimensions store more information per attention head but increase KV cache proportionally. Most modern models use 128 (Llama, Mistral, Mixtral). Falcon uses 64. Gemma 2 9B uses 256.

Hidden Dimension

The width of the model’s internal representation — the size of the vector that represents each token as it flows through the network. Determines the model’s capacity to represent complex patterns. Related to head_dim via hidden_dim = attention_heads × head_dim.

FFN Dimension

The intermediate size of the feed-forward network inside each transformer layer. Typically 3-4× the hidden dimension. Affects prefill compute cost because FFN operations scale linearly with N per layer.

HBM (High Bandwidth Memory)

The GPU’s on-chip memory (often called VRAM). This is where model weights, KV cache, and activations must reside for active inference. HBM is fast (~2-3.35 TB/s on modern GPUs) but limited in capacity (24-80 GB per GPU). The entire capacity planning problem reduces to: what fits in HBM?
total_hbm = gpu_hbm × gpu_count
free_for_kv = total_hbm - weight_memory - overhead (5 GB)
hbm_slots = floor(free_for_kv / kv_per_session)

HBM Slots

The number of concurrent sessions that can have their full KV cache resident in GPU HBM. These sessions can generate tokens at full speed with no restore penalty. When HBM is full, new sessions must either wait or be served from NAND (with restore latency).

Weight Memory

The GPU memory consumed by the model’s parameters (weights). This is a fixed cost that must be paid regardless of how many users are served.
| Precision | Llama 70B Weight Size | Notes |
| --- | --- | --- |
| bf16 | 140 GB | Full quality, needs 2+ H100s |
| int8 | 70 GB | Fits on 1× H100 |
| int4 (AWQ/GPTQ) | 35 GB | Fits on 1× RTX A5000 with KV headroom |

NAND SSD Offloading

Using NVMe solid-state drives attached to each GPU server to store KV cache for inactive (parked) sessions. When a parked session becomes active, its KV cache is loaded from NAND back into HBM. NAND expands the total number of sessions the system can manage but does not help active inference speed — decoding still requires KV data in HBM.
| Storage Tier | Bandwidth | Latency | Role |
| --- | --- | --- | --- |
| GPU HBM | 2,000–3,350 GB/s | ~100 ns | Active decoding |
| NVMe Gen4 SSD | ~7 GB/s | ~100 µs | Cold session parking |
| NVMe Gen5 SSD | ~14 GB/s | ~100 µs | Faster cold parking |

NAND Slots

The number of sessions that can be parked on NAND SSD while inactive. Computed as total_nand_gb / kv_per_session_gb. These sessions can be restored to HBM when they become active, at the cost of restore latency.

Restore Latency

The time required to load a parked session’s KV cache from NAND SSD back into GPU HBM. This is the “wake-up cost” for a cold session.
restore_time = kv_size_gb / nand_bandwidth_gb_per_sec
Example: 5 GB KV cache on Gen4 NVMe (7 GB/s) = 714ms restore time. When multiple sessions restore simultaneously, they share the SSD bandwidth pipe, increasing individual restore time: effective_bw = nand_bw / parallel_streams.

Cold Ratio

The percentage of total sessions that are parked on NAND at any given moment (inactive, not generating tokens). Typical values:
  • 20-30% — most sessions are active (interactive chat)
  • 50% — half parked (async agent workloads with tool waits)
  • 70-80% — most parked (background research agents)
concurrent_active = total_sessions × (1 - cold_ratio)

TPOT (Time Per Output Token)

The latency for each decode step — generating one output token. Decoding is memory-bandwidth-bound because each step must stream the entire KV cache for all active sequences through HBM.
TPOT = (context_tokens × batch_size × kv_bytes_per_token) / aggregate_bandwidth
TPOT scales linearly with context length and batch size. A user perceives this as the streaming speed — lower TPOT = faster text output. Interactive applications target < 50ms TPOT (~20 tokens/sec streaming).

TTFT (Time To First Token)

The latency from when a user submits their prompt to when the first output token arrives. TTFT is dominated by prefill — processing the entire input prompt through every layer to build the KV cache. Prefill is compute-bound (not memory-bound like decode) because attention scales quadratically with prompt length. Under concurrent load, prefills are serialized on the GPU compute path. With C concurrent users, a random user waits behind (C + 1)/2 prefills on average:
TTFT(C users) = single_prefill_time × (C + 1) / 2
Interactive applications target < 1-5 seconds TTFT.

TTFT Breach Point

The maximum number of concurrent users before average TTFT exceeds the configured SLA threshold. Computed by solving:
single_prefill × (C + 1) / 2 = ttft_sla
→ C = 2 × ttft_sla / single_prefill - 1
Adding more GPUs increases TFLOPS, which reduces single prefill time, which pushes the breach point out. Adding NAND does not move the breach point — NAND doesn’t help prefill compute.

Single Prefill Time

The time to process one prompt through all layers with no queue contention. This is the atomic unit that TTFT is built from.
prefill_flops = (4 × N² × hidden_dim + 4 × N × ffn_dim) × layers
single_prefill = prefill_flops / (gpu_tflops × gpu_count × efficiency)
Where efficiency is ~35% (real-world vs peak TFLOPS). The quadratic attention term dominates at long contexts — a 32K prompt takes ~64x longer than a 4K prompt, not 8x.

Prefix Caching / Prefix Hit Rate

When multiple requests share the same prefix (system prompt, RAG context, few-shot examples), the KV cache for that prefix can be computed once and reused. A prefix cache hit skips the expensive prefill entirely for the shared portion. A 60% hit rate can effectively double throughput — the biggest “free” optimization in production inference.

Tensor Parallelism

Splitting a model across multiple GPUs within the same node. Each GPU holds a shard of the weights and a shard of each KV cache. GPUs communicate via NVLink during each forward pass.
  • Increases total HBM (more GPUs = more memory)
  • Increases aggregate bandwidth (faster TPOT)
  • Increases aggregate TFLOPS (faster prefill, lower TTFT)
  • Adds ~5-15% communication overhead via NVLink

Workload Mix

The distribution of session types by token intensity:
| Category | Token Range | Typical Use Case | KV Cache (70B, fp8) |
| --- | --- | --- | --- |
| Light | 0 – 50K | Quick Q&A, lookups | up to 7.8 GB |
| Medium | 50K – 200K | Multi-turn explanations | 7.8 – 31.2 GB |
| Heavy | 200K – 500K | Deep research, SWE tasks | 31.2 – 78.1 GB |
| Extreme | 500K+ | Full repo analysis, long agents | 78.1+ GB |

The workload mix determines the weighted average context length and drives the capacity plan. A mix of { extreme: 1, heavy: 2, medium: 3, light: 4 } (10 users) produces a weighted average context of ~243K tokens.

Session Category Thresholds

The token boundaries used by the SessionProfiler to classify live sessions:
SESSION_CATEGORY_THRESHOLDS = {
  light:   50_000,     // up to 50K tokens
  medium:  200_000,    // 50K - 200K
  heavy:   500_000,    // 200K - 500K
  extreme: Infinity,   // 500K+
};
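A classifier over these thresholds could look like the following sketch. The classify() helper is illustrative — the real classification logic lives in the SessionProfiler:

```typescript
// Token boundaries, in ascending order (insertion order matters for the scan).
const SESSION_CATEGORY_THRESHOLDS = {
  light:   50_000,   // up to 50K tokens
  medium:  200_000,  // 50K - 200K
  heavy:   500_000,  // 200K - 500K
  extreme: Infinity, // 500K+
} as const;

type SessionCategory = keyof typeof SESSION_CATEGORY_THRESHOLDS;

// Hypothetical helper: return the first category whose upper bound
// the session's token count does not exceed.
function classify(totalTokens: number): SessionCategory {
  for (const [category, limit] of Object.entries(SESSION_CATEGORY_THRESHOLDS)) {
    if (totalTokens <= limit) return category as SessionCategory;
  }
  return "extreme";
}

console.log(classify(12_000));  // → "light"
console.log(classify(750_000)); // → "extreme"
```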

Overhead

A fixed 5 GB budget for activations, CUDA contexts, framework metadata, and vLLM’s internal data structures (page tables, scheduling state). This is subtracted from total HBM before computing KV capacity.

Precision Options

KV cache and model weights can be quantized independently:

KV Precision

| Precision | Bytes/element | KV/token (70B) | Quality Impact |
| --- | --- | --- | --- |
| bf16 | 2 | 320 KB | Lossless — baseline |
| fp8 | 1 | 160 KB | ~0.1-0.3% perplexity increase — standard production choice |
| int8 | 1 | 160 KB | Slightly more lossy than fp8 |
| int4 | 0.5 | 80 KB | Noticeable degradation on long contexts |
fp8 KV is standard practice — it halves memory and bandwidth usage with negligible quality loss.

Weight Precision

| Precision | 70B Size | Min GPUs | Quality Impact |
| --- | --- | --- | --- |
| bf16 | 140 GB | 2× H100 | Lossless |
| int8 | 70 GB | 1× H100 | Minor degradation |
| int4 (AWQ/GPTQ) | 35 GB | 1× RTX A5000 | Acceptable for most tasks |
The standard production setup is fp8 KV + bf16 weights for cloud GPUs, or fp8 KV + int4 weights for cost-sensitive self-hosted deployments.

Model Architectures

15 models included out of the box, with specs sourced from HuggingFace config.json:
| Model | Type | Layers | KV Heads | Head Dim | KV/token (bf16) | Max Context |
| --- | --- | --- | --- | --- | --- | --- |
| Llama 3.1 8B | GQA | 32 | 8 | 128 | 128 KB | 128K |
| Llama 3.1 70B | GQA | 80 | 8 | 128 | 320 KB | 128K |
| Llama 3.1 405B | GQA | 126 | 8 | 128 | 504 KB | 128K |
| Llama 2 7B | MHA | 32 | 32 | 128 | 512 KB | 4K |
| Llama 2 13B | MHA | 40 | 40 | 128 | 640 KB | 4K |
| Llama 2 70B | GQA | 80 | 8 | 128 | 320 KB | 4K |
| Mixtral 8×7B | GQA | 32 | 8 | 128 | 128 KB | 32K |
| Mixtral 8×22B | GQA | 56 | 8 | 128 | 176 KB | 64K |
| Falcon 7B | MQA | 32 | 1 | 64 | 8 KB | 8K |
| Falcon 40B | GQA | 60 | 8 | 64 | 60 KB | 8K |
| Mistral 7B | GQA | 32 | 8 | 128 | 128 KB | 32K |
| Phi-3 Mini | MHA | 32 | 32 | 96 | 384 KB | 128K |
| Gemma 2 9B | GQA | 42 | 8 | 256 | 168 KB | 8K |
| Gemma 2 27B | GQA | 46 | 16 | 128 | 184 KB | 8K |
Custom architectures can be passed to any function:
const myModel: ModelArchitecture = {
  id: "my-model",
  displayName: "My Custom 13B",
  family: "custom",
  params: "13B",
  layers: 40,
  attentionHeads: 40,
  kvHeads: 8,
  headDim: 128,
  hiddenDim: 5120,
  ffnDim: 13824,
  maxContext: 32768,
  attentionType: "gqa",
  weightSizeBf16Gb: 26,
};

GPU Specs

| GPU | HBM | Bandwidth | bf16 TFLOPS | NVLink | Use Case |
| --- | --- | --- | --- | --- | --- |
| H100 SXM | 80 GB | 3.35 TB/s | 989 | 900 GB/s | Premium cloud inference |
| A100 SXM | 80 GB | 2.0 TB/s | 312 | 600 GB/s | Standard cloud inference |
| L40S | 48 GB | 0.864 TB/s | 366 | None | Mid-tier / batch workloads |
| RTX A5000 | 22.5 GB | 0.768 TB/s | 65 | None | Cost-sensitive self-hosted |
| RTX 4090 | 24 GB | 1.008 TB/s | 330 | None | Dev / small-scale serving |
Key metrics explained:
  • HBM — Total GPU memory. Determines how much fits (weights + KV + overhead).
  • Bandwidth — How fast data streams from HBM. Determines TPOT (decode speed).
  • bf16 TFLOPS — Peak compute throughput. Determines prefill speed and TTFT.
  • NVLink — GPU-to-GPU interconnect bandwidth. Only matters for tensor parallelism across multiple GPUs in the same node. GPUs without NVLink communicate over PCIe (~64 GB/s), which adds latency for multi-GPU setups.

How the Math Works — Step by Step

This section walks through every calculation the capacity planner performs, with a worked example using Llama 3.1 70B on 8× RTX A5000 with int4 AWQ weights and fp8 KV cache.

Step 1: KV Bytes Per Token

What: How many bytes does one token cost in the KV cache? Formula:
kv_bytes_per_token = 2 × layers × kv_heads × head_dim × precision_bytes
Why each term:
  • 2 — one Key vector + one Value vector per layer
  • layers (80) — each of the 80 transformer blocks stores its own K and V
  • kv_heads (8) — GQA means only 8 KV heads (not all 64 query heads)
  • head_dim (128) — each head stores a 128-dimensional vector
  • precision_bytes (1 for fp8) — bytes per floating-point element
Calculation:
2 × 80 × 8 × 128 × 1 = 163,840 bytes = 160 KB/token
If we used bf16 instead of fp8, it would be × 2 bytes = 327,680 bytes = 320 KB/token — double. Code: kvBytesPerToken(arch, "fp8") in kv-estimator.ts

Step 2: KV Cache Per Session

What: Total KV memory for one session at a given average context length. Formula:
kv_per_session = avg_context_tokens × kv_bytes_per_token
Calculation (16K context):
16,384 tokens × 163,840 bytes = 2,684,354,560 bytes = 2.5 GB per session
Calculation (128K full context):
131,072 tokens × 163,840 bytes = 21,474,836,480 bytes = 20 GB per session
Code: kvCacheForContext(arch, 16384, "fp8") in kv-estimator.ts
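Steps 1–2 can be checked in a few lines. This is an illustrative sketch (not the kv-estimator.ts source), assuming the Llama 3.1 70B fp8 figures above:

```typescript
// Step 1: 2 × 80 layers × 8 kv_heads × 128 head_dim × 1 byte (fp8)
const KV_BYTES_PER_TOKEN_FP8 = 2 * 80 * 8 * 128 * 1; // 163,840 bytes (160 KB)

// Step 2: session KV is context length × the per-token figure, reported in GiB.
function kvPerSessionGb(contextTokens: number): number {
  return (contextTokens * KV_BYTES_PER_TOKEN_FP8) / 1024 ** 3;
}

console.log(kvPerSessionGb(16_384));  // → 2.5  (GB for a 16K-context session)
console.log(kvPerSessionGb(131_072)); // → 20   (GB for a full 128K session)
```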

Step 3: Weight Memory

What: GPU memory consumed by the model’s parameters. Formula:
weight_memory = weight_size_bf16 × precision_ratio
Precision ratios: bf16 = 1.0, int8 = 0.5, int4 = 0.25 Calculation (int4 AWQ):
140 GB × 0.25 = 35 GB
Without quantization (bf16), weights would be 140 GB — needing 2× H100s just for weights. With int4, they fit on a single GPU with room to spare. Code: weightMemory(arch, "int4") in kv-estimator.ts

Step 4: Free HBM for KV Cache

What: How much GPU memory is available for KV cache after weights and overhead. Formula:
total_hbm = gpu_hbm × gpu_count
free_hbm = total_hbm - weight_memory - overhead
Calculation (8× RTX A5000):
total_hbm = 22.5 GB × 8 = 180 GB
free_hbm  = 180 - 35 - 5 = 140 GB
The 5 GB overhead covers CUDA contexts, vLLM paging metadata, activation buffers, and framework state. Code: Lines 92-94 in capacity-planner.ts
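Steps 3–4 as code — a sketch with hypothetical helper names (the real logic lives in capacity-planner.ts):

```typescript
// Fixed budget for CUDA contexts, vLLM paging metadata, activation buffers.
const OVERHEAD_GB = 5;

// Weight quantization ratios relative to bf16.
const WEIGHT_RATIO = { bf16: 1.0, int8: 0.5, int4: 0.25 } as const;

function freeHbmGb(
  gpuHbmGb: number,
  gpuCount: number,
  weightSizeBf16Gb: number,
  weightPrecision: keyof typeof WEIGHT_RATIO,
): number {
  const totalHbm = gpuHbmGb * gpuCount;
  const weightMemory = weightSizeBf16Gb * WEIGHT_RATIO[weightPrecision];
  return totalHbm - weightMemory - OVERHEAD_GB;
}

// 8× RTX A5000 (22.5 GB each), Llama 3.1 70B int4 AWQ (140 GB bf16 → 35 GB):
console.log(freeHbmGb(22.5, 8, 140, "int4")); // → 140
```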

Step 5: HBM Slots (Active Sessions)

What: How many concurrent sessions fit in free HBM. Formula:
hbm_slots = floor(free_hbm / kv_per_session)
Calculation (16K avg context, fp8):
hbm_slots = floor(140 GB / 2.5 GB) = 56 sessions
Calculation (4K avg context, fp8):
kv_per_session = 4,096 × 163,840 = 0.625 GB
hbm_slots = floor(140 / 0.625) = 224 sessions
Notice how context length dominates: 4× shorter context = 4× more sessions. Code: maxConcurrentSessions() in capacity-planner.ts
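Step 5 is a plain floor division; the sketch below uses an illustrative helper name rather than the actual maxConcurrentSessions() signature:

```typescript
// Concurrent sessions whose full KV cache fits in free HBM.
function hbmSlots(freeHbmGb: number, kvPerSessionGb: number): number {
  return Math.floor(freeHbmGb / kvPerSessionGb);
}

console.log(hbmSlots(140, 2.5));   // → 56  (16K avg context, fp8)
console.log(hbmSlots(140, 0.625)); // → 224 (4K avg context, fp8)
```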

Step 6: NAND Slots (Parked Sessions)

What: How many additional sessions can be parked on SSD. Formula:
total_nand = nand_per_gpu × gpu_count
nand_slots = floor(total_nand / kv_per_session)
Calculation (4 TB NAND per GPU, 16K context, fp8):
total_nand = 4,000 GB × 8 = 32,000 GB
nand_slots = floor(32,000 / 2.5) = 12,800 parked sessions
total_sessions = 56 (HBM) + 12,800 (NAND) = 12,856
NAND massively expands capacity. But those 12,800 sessions are parked — they need restore latency to become active. Code: Lines 32-36 in capacity-planner.ts

Step 7: TPOT (Decode Latency)

What: How long each output token takes to generate. Why bandwidth-bound: Each decode step must read the entire KV cache for all active sequences from HBM. The GPU compute is idle waiting for memory. Formula:
total_bytes = context_tokens × batch_size × kv_bytes_per_token
bandwidth = gpu_bandwidth_TBs × gpu_count × 10^12  (convert to bytes/sec)
tpot_ms = (total_bytes / bandwidth) × 1000
Calculation (16K context, 1 user, 8× RTX A5000):
total_bytes = 16,384 × 1 × 163,840 = 2,684,354,560 bytes
bandwidth   = 0.768 × 8 × 10^12 = 6,144,000,000,000 bytes/sec
tpot_ms     = (2,684,354,560 / 6,144,000,000,000) × 1000 = 0.44 ms
With 10 concurrent users (batch=10):
total_bytes = 16,384 × 10 × 163,840 = 26,843,545,600 bytes
tpot_ms     = (26,843,545,600 / 6,144,000,000,000) × 1000 = 4.37 ms
TPOT scales linearly with batch size. At 50ms SLA, you breach at ~114 concurrent users. Code: estimateTpot() in latency-estimator.ts
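The bandwidth-bound decode model above can be sketched directly (illustrative helper, not the estimateTpot() source):

```typescript
// TPOT: time to stream every active sequence's KV cache through HBM once.
function tpotMs(
  contextTokens: number,
  batchSize: number,
  kvBytesPerToken: number,
  bandwidthTBs: number, // per-GPU HBM bandwidth in TB/s
  gpuCount: number,
): number {
  const totalBytes = contextTokens * batchSize * kvBytesPerToken;
  const bytesPerSec = bandwidthTBs * gpuCount * 1e12;
  return (totalBytes / bytesPerSec) * 1000;
}

// 16K context, fp8 KV (163,840 B/token), 8× RTX A5000 (0.768 TB/s each):
console.log(tpotMs(16_384, 1, 163_840, 0.768, 8).toFixed(2));  // → 0.44
console.log(tpotMs(16_384, 10, 163_840, 0.768, 8).toFixed(2)); // → 4.37
```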

Step 8: Single Prefill Time

What: Time to process one prompt through all layers (no queue). Why compute-bound: Prefill runs the full attention computation (quadratic in prompt length) plus FFN (linear). The GPU compute units are saturated, not memory. Formula:
prefill_flops = (4 × N² × hidden_dim + 4 × N × ffn_dim) × layers
gpu_flops     = gpu_tflops × gpu_count × efficiency × 10^12
single_prefill_ms = (prefill_flops / gpu_flops) × 1000
The efficiency factor is 0.35 (35%) — real-world GPU utilization vs peak spec. This accounts for memory stalls, kernel launch overhead, and tensor parallelism communication. Calculation (4K prompt, 8× RTX A5000):
N = 4,096
prefill_flops = (4 × 4096² × 8192 + 4 × 4096 × 28672) × 80
              = (4 × 16,777,216 × 8192 + 4 × 4096 × 28672) × 80
              = (549,755,813,888 + 469,762,048) × 80
              = 550,225,575,936 × 80
              = 44,018,046,074,880 flops

gpu_flops     = 65 × 8 × 0.35 × 10^12 = 182,000,000,000,000 flops/sec

single_prefill = (44,018,046,074,880 / 182,000,000,000,000) × 1000 = 241.9 ms
Why 32K prompt is ~64× slower than 4K (not 8×): The quadratic attention term dominates. When N grows 8x, the attention cost grows 64x. This is why long-context prefill is so expensive. Code: singlePrefillMs() in latency-estimator.ts
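The prefill model can be sketched the same way (this mirrors the formulas above; the actual singlePrefillMs() implementation may differ in detail):

```typescript
// Real-world utilization vs peak TFLOPS (memory stalls, kernel launches, TP comms).
const EFFICIENCY = 0.35;

function singlePrefillMs(
  n: number,          // prompt length in tokens
  hiddenDim: number,
  ffnDim: number,
  layers: number,
  gpuTflops: number,  // per-GPU peak bf16 TFLOPS
  gpuCount: number,
): number {
  // Quadratic attention term + linear FFN term, per layer.
  const prefillFlops = (4 * n * n * hiddenDim + 4 * n * ffnDim) * layers;
  const gpuFlops = gpuTflops * gpuCount * EFFICIENCY * 1e12;
  return (prefillFlops / gpuFlops) * 1000;
}

// Llama 3.1 70B (hidden 8192, ffn 28672, 80 layers), 4K prompt, 8× RTX A5000:
console.log(singlePrefillMs(4096, 8192, 28_672, 80, 65, 8).toFixed(1)); // → 241.9
```

Doubling gpuCount halves the result, which is why adding GPUs pushes the TTFT breach point out.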

Step 9: TTFT Under Load

What: How long a user waits for the first token when other users are also submitting prompts. Why it degrades: Prefills are serialized on the GPU compute path. With C concurrent users, each user’s prefill waits behind the others in a queue. Formula:
TTFT(C users) = single_prefill × (C + 1) / 2
The (C+1)/2 is the average queue position — if C users arrive simultaneously, a random user is at position 1 to C uniformly, so the average wait is (C+1)/2 prefills. Calculation (10 concurrent users, 4K prompt):
TTFT = 241.9 ms × (10 + 1) / 2 = 241.9 × 5.5 = 1,330 ms (1.3 seconds)
Calculation (100 concurrent users):
TTFT = 241.9 × (100 + 1) / 2 = 241.9 × 50.5 = 12,216 ms (12.2 seconds)
Code: estimateTtft() in latency-estimator.ts

Step 10: TTFT Breach Point

What: Maximum concurrent users before average TTFT exceeds the SLA. Formula (solving Step 9 for C):
single_prefill × (C + 1) / 2 = ttft_sla_ms
C = 2 × ttft_sla_ms / single_prefill - 1
Calculation (5 second SLA, 4K prompt):
C = 2 × 5000 / 241.9 - 1 = 41.3 - 1 = 40.3 → 40 users
At 40 concurrent users, average TTFT is ~4.96 seconds — the last load that stays under the SLA. The 41st user pushes the average past 5 seconds. Important: Adding NAND does NOT change this number. NAND parks cold sessions but doesn’t add TFLOPS — the prefill queue bottleneck is compute, not memory. Code: ttftBreachPoint() in latency-estimator.ts
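Steps 9–10 together, as a sketch (illustrative helpers; estimateTtft() and ttftBreachPoint() are the real entry points). The breach point floors the solved value, since it is the largest user count still under the SLA:

```typescript
// Step 9: average queue position under C simultaneous prefills is (C + 1) / 2.
function ttftMs(singlePrefillMs: number, concurrentUsers: number): number {
  return (singlePrefillMs * (concurrentUsers + 1)) / 2;
}

// Step 10: solve ttftMs(C) = slaMs for C, rounding down.
function ttftBreachPoint(singlePrefillMs: number, slaMs: number): number {
  return Math.floor((2 * slaMs) / singlePrefillMs - 1);
}

// 241.9 ms single prefill (4K prompt on 8× RTX A5000):
console.log(ttftMs(241.9, 10).toFixed(0));     // → 1330 (ms, 10 users)
console.log(ttftBreachPoint(241.9, 5000));     // → 40 users at a 5s SLA
```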

Step 11: Restore Latency

What: Time to wake up a cold session from NAND SSD. Formula:
per_gpu_kv = kv_per_session / gpu_count   (tensor parallel sharding)
restore_ms = (per_gpu_kv / nand_bandwidth) × 1000
Each GPU restores its own shard in parallel, so the KV per session is divided by GPU count. Calculation (16K context fp8, 8× GPU, Gen4 NVMe 7 GB/s):
kv_per_session = 2.5 GB
per_gpu_kv     = 2.5 / 8 = 0.3125 GB
restore_ms     = (0.3125 / 7) × 1000 = 44.6 ms
With parallel restore streams (4 sessions restoring simultaneously):
effective_bw = 7 / 4 = 1.75 GB/s per stream
restore_ms   = (0.3125 / 1.75) × 1000 = 178.6 ms per session
Code: restoreLatency() in latency-estimator.ts
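Step 11 as a sketch (hypothetical helper; the real function is restoreLatency() in latency-estimator.ts). Each GPU restores its own tensor-parallel shard in parallel, and concurrent restores divide the SSD bandwidth:

```typescript
function restoreMs(
  kvPerSessionGb: number,
  gpuCount: number,
  nandBandwidthGBs: number, // per-GPU NVMe bandwidth, GB/s
  parallelStreams = 1,      // sessions restoring simultaneously
): number {
  const perGpuKvGb = kvPerSessionGb / gpuCount; // tensor-parallel shard
  const effectiveBw = nandBandwidthGBs / parallelStreams;
  return (perGpuKvGb / effectiveBw) * 1000;
}

// 16K-context fp8 session (2.5 GB), 8 GPUs, Gen4 NVMe (7 GB/s):
console.log(restoreMs(2.5, 8, 7).toFixed(1));    // → 44.6  (single stream)
console.log(restoreMs(2.5, 8, 7, 4).toFixed(1)); // → 178.6 (4 parallel restores)
```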

Step 12: Monthly GPU Cost

What: Infrastructure cost estimate. Formula:
monthly_cost = gpu_count × price_per_hour × 730 hours/month
Calculation (8× RTX A5000 on-demand at $1.10/hr):
monthly_cost = 8 × $1.10 × 730 = $6,424/month
Per-slot cost:
cost_per_slot_per_day = $6,424 / 56 slots / 30 days = $3.82/day
Code: monthlyGpuCost() in infra-cost.ts

Step 13: Weighted Average Context (Workload Mix)

What: Converts the session type distribution into a single average context length. Formula:
avg_context = Σ(count_i × midpoint_i) / Σ(count_i)
Midpoints: light=35K, medium=130K, heavy=325K, extreme=1,250K Calculation (extreme=1, heavy=2, medium=3, light=4):
avg = (1×1,250,000 + 2×325,000 + 3×130,000 + 4×35,000) / (1+2+3+4)
    = (1,250,000 + 650,000 + 390,000 + 140,000) / 10
    = 2,430,000 / 10
    = 243,000 tokens
This weighted average drives all the session slot calculations in planCapacity(). Code: weightedAvgContext() in capacity-planner.ts
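Step 13 as a sketch using the midpoints quoted above (the shape mirrors weightedAvgContext(), but is not its source):

```typescript
// Representative context length per session category, in tokens.
const MIDPOINTS = {
  light: 35_000,
  medium: 130_000,
  heavy: 325_000,
  extreme: 1_250_000,
} as const;

type Mix = Record<keyof typeof MIDPOINTS, number>;

function weightedAvgContext(mix: Mix): number {
  let tokens = 0;
  let sessions = 0;
  for (const category of Object.keys(MIDPOINTS) as (keyof typeof MIDPOINTS)[]) {
    tokens += mix[category] * MIDPOINTS[category];
    sessions += mix[category];
  }
  return tokens / sessions;
}

console.log(weightedAvgContext({ extreme: 1, heavy: 2, medium: 3, light: 4 })); // → 243000
```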

Full Worked Example Summary

Config: Llama 3.1 70B, 8× RTX A5000, int4 weights, fp8 KV, 16K avg context
| Step | Calculation | Result |
| --- | --- | --- |
| KV/token | 2 × 80 × 8 × 128 × 1 | 160 KB |
| KV/session (16K) | 16,384 × 160 KB | 2.5 GB |
| Weights (int4) | 140 × 0.25 | 35 GB |
| Total HBM | 22.5 × 8 | 180 GB |
| Free for KV | 180 - 35 - 5 | 140 GB |
| HBM slots | floor(140 / 2.5) | 56 |
| NAND slots (4TB/GPU) | floor(32,000 / 2.5) | 12,800 |
| TPOT (1 user) | 2.68 GB / 6.14 TB/s | 0.44 ms |
| Single prefill (4K) | 44T flops / 182T flops/s | 241.9 ms |
| TTFT (10 users) | 241.9 × 5.5 | 1,330 ms |
| TTFT breach (5s SLA) | 2×5000/241.9 - 1 | 40 users |
| Restore (Gen4 NVMe) | 0.3125 GB / 7 GB/s | 44.6 ms |
| Monthly cost | 8 × $1.10 × 730 | $6,424 |

The CapacityPlan Object

The planCapacity() function returns a complete CapacityPlan with every metric:
| Field | Type | Description |
| --- | --- | --- |
| model | ModelArchitecture | The model being planned for |
| hardware | HardwareConfig | GPU setup (type, count, NAND) |
| kvPrecision | KvPrecision | KV cache precision used |
| weightPrecision | WeightPrecision | Weight quantization used |
| totalHbmGb | number | Total GPU memory across all GPUs |
| weightMemoryGb | number | Memory consumed by model weights |
| freeHbmForKvGb | number | HBM available for KV cache after weights + overhead |
| kvBytesPerToken | number | Bytes per token in KV cache |
| hbmSlots | number | Concurrent sessions fitting in HBM |
| nandSlots | number | Sessions parkable on NAND SSD |
| totalSessions | number | hbmSlots + nandSlots |
| tpotMs | number | Estimated Time Per Output Token (ms) |
| ttftMs | number | Estimated Time To First Token (ms) |
| restoreLatencyMs | number \| null | NAND → HBM restore time (null if no NAND) |
| ttftBreachPoint | number | Max concurrent users before 5s TTFT SLA breach |
| monthlyGpuCostUsd | number | Estimated monthly GPU infrastructure cost |

Interactive Dashboard

The apps/capacity-planner/ Next.js app provides a full interactive UI with:
  • Model selector (all 15 architectures)
  • GPU type, count, NAND per GPU sliders
  • KV/weight precision selectors
  • Workload controls (avg context, cold ratio, SLA thresholds)
  • Per-GPU breakdown panel (shows free HBM + NAND per card)
  • 6 interactive charts:
    • Users vs GPUs — session capacity scaling with GPU count
    • Users vs Context — how capacity drops as context grows
    • TPOT vs Users — decode latency at different context sizes
    • TTFT vs Users — prefill queue congestion with SLA breach markers
    • Restore Budget — NAND restore time at Gen4/Gen5 bandwidth
    • GPU vs NAND — total sessions across NAND sizes
cd apps/capacity-planner
npm install
npm run dev
# → http://localhost:3000