Capacity Planning

RadarOS includes a built-in capacity planning library for modeling LLM inference infrastructure. It answers questions like:
  • How many GPUs do I need for N concurrent users?
  • What happens to latency when I add NAND SSD offloading?
  • What’s the KV cache pressure for my workload mix?
  • Where is the TTFT SLA breach point?
The system has three tiers:
| Tier | What | Where |
| --- | --- | --- |
| Tier 1 | Pure-math capacity library | @radaros/core — zero dependencies |
| Tier 2 | Runtime session profiling | @radaros/core + @radaros/observability |
| Tier 3 | Interactive dashboard app | apps/capacity-planner/ (Next.js) |

Quick Start

import {
  planCapacity,
  DEFAULT_ARCHITECTURES,
  DEFAULT_GPU_SPECS,
} from "@radaros/core";

const plan = planCapacity(
  DEFAULT_ARCHITECTURES["llama-3.1-70b"],
  {
    gpu: DEFAULT_GPU_SPECS["h100-sxm"],
    gpuCount: 4,
    nandPerGpuGb: 0,
    nandBandwidthGBs: 7,
  },
  { extreme: 1, heavy: 2, medium: 3, light: 4 },
  "fp8",   // KV precision
  "bf16",  // weight precision
);

console.log(plan.hbmSlots);        // concurrent sessions in HBM
console.log(plan.ttftBreachPoint);  // max users before 5s TTFT SLA breach
console.log(plan.monthlyGpuCostUsd);

Core Concepts

KV Cache Sizing

Every token in the context window stores Key and Value tensors per layer:
KV bytes/token = 2 × layers × kv_heads × head_dim × precision_bytes
For Llama 3.1 70B in bf16: 2 × 80 × 8 × 128 × 2 = 327,680 bytes (~320 KB/token). GQA (Grouped Query Attention) is the key — Llama 3.1 uses 8 KV heads instead of 64 query heads, reducing KV cache by 8×.
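The formula can be sketched directly in TypeScript. This is a minimal illustration of the arithmetic; the library's exported kvBytesPerToken may differ in signature:

```typescript
// KV bytes/token = 2 (K and V) × layers × kv_heads × head_dim × precision_bytes
function kvBytesPerToken(
  layers: number,
  kvHeads: number,
  headDim: number,
  precisionBytes: number,
): number {
  return 2 * layers * kvHeads * headDim * precisionBytes;
}

// Llama 3.1 70B in bf16: 80 layers, 8 KV heads, head_dim 128, 2 bytes/value
console.log(kvBytesPerToken(80, 8, 128, 2)); // → 327680 (~320 KB/token)
```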

Memory Budget

free_hbm = (gpu_hbm × gpu_count) - weight_memory - overhead
hbm_slots = floor(free_hbm / kv_per_session)
Weight quantization (int4 AWQ) shrinks weights from 140 GB to 35 GB, freeing massive KV headroom.
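The budget arithmetic can be sketched as follows. The helper name, the 10 GB overhead figure, and the ~5 GB/session value (a 32k-token session at fp8 KV, ~160 KB/token) are illustrative assumptions, not library constants:

```typescript
// free_hbm = gpu_hbm × gpu_count − weight_memory − overhead
// hbm_slots = floor(free_hbm / kv_per_session)
function hbmSlots(
  gpuHbmGb: number,
  gpuCount: number,
  weightGb: number,
  overheadGb: number,
  kvPerSessionGb: number,
): number {
  const freeHbmGb = gpuHbmGb * gpuCount - weightGb - overheadGb;
  return Math.floor(freeHbmGb / kvPerSessionGb);
}

// 4× H100 (80 GB each), Llama 3.1 70B int4 weights (35 GB),
// 10 GB assumed overhead, ~5 GB KV per 32k-token fp8 session
console.log(hbmSlots(80, 4, 35, 10, 5)); // → 55 concurrent sessions
```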

Latency Model

TPOT (Time Per Output Token) — memory-bandwidth-bound:
TPOT = (context × batch × kv_bytes_per_token) / aggregate_bandwidth
TTFT (Time To First Token) — compute-bound, queued:
TTFT(C users) = single_prefill_time × (C + 1) / 2
Adding NAND expands session capacity but does not improve prefill speed.
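Both formulas can be sketched together. These are local illustrations of the model above; the library's estimateTpot/estimateTtft exports may take different parameters, and the H100 aggregate-bandwidth figure is an assumption (4 × 3.35 TB/s):

```typescript
// Decode (TPOT) is memory-bandwidth-bound: each output token streams the
// batch's full KV cache through HBM.
function estimateTpotSec(
  contextTokens: number,
  batchSize: number,
  kvBytesPerToken: number,
  aggregateBandwidthBytesPerSec: number,
): number {
  return (contextTokens * batchSize * kvBytesPerToken) / aggregateBandwidthBytesPerSec;
}

// Prefill (TTFT) is compute-bound and queued: with C concurrent users the
// average request waits behind (C + 1) / 2 prefills.
function estimateTtftSec(singlePrefillSec: number, concurrentUsers: number): number {
  return (singlePrefillSec * (concurrentUsers + 1)) / 2;
}

// 32k context, batch 8, fp8 KV (163,840 B/token), 4× H100 ≈ 13.4 TB/s aggregate
console.log(estimateTpotSec(32768, 8, 163840, 13.4e12)); // ≈ 0.0032 s (~3.2 ms/token)
console.log(estimateTtftSec(0.5, 9));                    // → 2.5 s at 9 users
```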

Model Architectures

15 models are included out of the box; a selection:
| Model | Type | KV Heads | KV/token (bf16) |
| --- | --- | --- | --- |
| Llama 3.1 8B | GQA | 8 | 128 KB |
| Llama 3.1 70B | GQA | 8 | 320 KB |
| Llama 3.1 405B | GQA | 8 | 504 KB |
| Llama 2 7B | MHA | 32 | 512 KB |
| Mixtral 8×7B | GQA | 8 | 128 KB |
| Mixtral 8×22B | GQA | 8 | 176 KB |
| Falcon 7B | MQA | 1 | 8 KB |
| Mistral 7B | GQA | 8 | 128 KB |
| Phi-3 Mini | MHA | 32 | 384 KB |
| Gemma 2 9B | GQA | 8 | 168 KB |
| Gemma 2 27B | GQA | 16 | 184 KB |
Custom architectures can be passed to any function:
const myModel: ModelArchitecture = {
  id: "my-model",
  displayName: "My Custom 13B",
  family: "custom",
  params: "13B",
  layers: 40,
  attentionHeads: 40,
  kvHeads: 8,
  headDim: 128,
  hiddenDim: 5120,
  ffnDim: 13824,
  maxContext: 32768,
  attentionType: "gqa",
  weightSizeBf16Gb: 26,
};

GPU Specs

| GPU | HBM | Bandwidth | bf16 TFLOPS |
| --- | --- | --- | --- |
| H100 SXM | 80 GB | 3.35 TB/s | 989 |
| A100 SXM | 80 GB | 2.0 TB/s | 312 |
| L40S | 48 GB | 0.864 TB/s | 366 |
| RTX 4090 | 24 GB | 1.008 TB/s | 330 |
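These specs feed directly into the memory budget: weights must fit in aggregate HBM before any KV slots exist. A quick fit check, using a hypothetical weightFit helper (not part of the library):

```typescript
// Does a model's weight footprint fit in aggregate HBM, and with how much
// headroom left over for KV cache?
function weightFit(
  weightGb: number,
  gpuHbmGb: number,
  gpuCount: number,
): { fits: boolean; freeGb: number } {
  const freeGb = gpuHbmGb * gpuCount - weightGb;
  return { fits: freeGb > 0, freeGb };
}

// Llama 3.1 70B bf16 (140 GB) on 2× H100: fits, but only 20 GB of KV headroom
console.log(weightFit(140, 80, 2)); // → { fits: true, freeGb: 20 }
```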

Precision Options

| Precision | Bytes | Use Case |
| --- | --- | --- |
| bf16 | 2 | Full quality baseline |
| fp8 | 1 | Standard production KV — negligible quality loss |
| int8 | 1 | Alternative to fp8 |
| int4 | 0.5 | Aggressive — noticeable on long contexts |
KV and weight precision are independent. The standard production setup is fp8 KV + bf16 weights (or int4 weights for AWQ/GPTQ models).
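To see what each KV precision costs in practice, the per-session footprint can be computed from the KV formula; this sketch assumes the Llama 3.1 70B shape (80 layers, 8 KV heads, head_dim 128) and a 32k-token context:

```typescript
// KV cache per 32k-token Llama 3.1 70B session at each KV precision
const precisionBytes: Record<string, number> = { bf16: 2, fp8: 1, int8: 1, int4: 0.5 };
const contextTokens = 32768;

for (const [name, bytes] of Object.entries(precisionBytes)) {
  const perSessionGb = (2 * 80 * 8 * 128 * bytes * contextTokens) / 1024 ** 3;
  console.log(`${name}: ${perSessionGb.toFixed(1)} GB/session`);
}
// → bf16: 10.0 GB/session
// → fp8:  5.0 GB/session
// → int8: 5.0 GB/session
// → int4: 2.5 GB/session
```

Halving KV bytes doubles HBM slots for the same hardware, which is why fp8 KV is the standard production default.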

Interactive Dashboard

The apps/capacity-planner/ Next.js app provides a full interactive UI with:
  • Model selector (all 15 architectures)
  • GPU type, count, NAND per GPU sliders
  • KV/weight precision selectors
  • Workload controls (avg context, cold ratio, SLA thresholds)
  • Per-GPU breakdown panel
  • 6 interactive charts (Users vs GPUs, TPOT, TTFT, Restore, Cost, GPU vs NAND)
cd apps/capacity-planner
npm install
npm run dev
# → http://localhost:3000

API Reference

See the detailed API pages:
  • KV Estimator — kvBytesPerToken, kvCacheForContext, maxContextForMemory, weightMemory
  • Capacity Planner — planCapacity, maxConcurrentSessions, estimateGpuCount
  • Latency Estimator — estimateTpot, estimateTtft, ttftBreachPoint, restoreLatency
  • Session Profiler — Runtime monitoring with EventBus integration