Latency Estimator

Physics-based latency modeling for LLM inference. Captures the two fundamental bottlenecks: memory bandwidth (decode) and compute (prefill).

estimateTpot(arch, contextTokens, batchSize, hardware, kvPrecision?)

Time Per Output Token — each decode step streams the full KV cache from HBM.

```ts
import { estimateTpot, DEFAULT_ARCHITECTURES, DEFAULT_GPU_SPECS } from "@radaros/core";

const tpot = estimateTpot(
  DEFAULT_ARCHITECTURES["llama-3.1-70b"],
  4096,    // 4K context
  1,       // batch size
  { gpu: DEFAULT_GPU_SPECS["h100-sxm"], gpuCount: 4, nandPerGpuGb: 0, nandBandwidthGBs: 7 },
  "fp8",
);
// → milliseconds per output token
```

TPOT scales linearly with context length and batch size; fp8 KV cache halves it relative to bf16.
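Under the hood this is a bandwidth division: the bytes the KV cache occupies, read once per output token, over the aggregate HBM bandwidth. A minimal sketch of that memory-bound model, assuming illustrative Llama-3.1-70B-like dimensions (80 layers, 8 KV heads, head dim 128) and ~3350 GB/s of HBM per H100; this is not the library's actual implementation:

```typescript
// Simplified memory-bound decode model: each output token requires one full
// read of the KV cache from HBM, so TPOT ≈ bytes streamed / aggregate bandwidth.
type KvPrecision = "bf16" | "fp8";

function kvCacheBytes(
  layers: number,
  kvHeads: number,
  headDim: number,
  contextTokens: number,
  precision: KvPrecision,
): number {
  const bytesPerElem = precision === "fp8" ? 1 : 2;
  // K and V: 2 tensors of kvHeads × headDim elements per token, per layer
  return 2 * layers * kvHeads * headDim * contextTokens * bytesPerElem;
}

function tpotMsSketch(
  kvBytes: number,
  batchSize: number,
  aggregateHbmGBs: number,
): number {
  return ((kvBytes * batchSize) / (aggregateHbmGBs * 1e9)) * 1000;
}

// Illustrative Llama-3.1-70B-like shape, 4× H100 SXM ≈ 4 × 3350 GB/s of HBM
const kvBf16 = kvCacheBytes(80, 8, 128, 4096, "bf16");
const kvFp8 = kvCacheBytes(80, 8, 128, 4096, "fp8");
tpotMsSketch(kvBf16, 1, 4 * 3350); // ≈ 0.1 ms per token
tpotMsSketch(kvFp8, 1, 4 * 3350);  // half that: fp8 KV is half the bytes
```

Both scaling claims fall out of the arithmetic: context tokens and batch size each enter the numerator linearly, and fp8 halves `bytesPerElem`.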

estimateTtft(arch, promptTokens, concurrentUsers, hardware)

Time To First Token — compute-bound prefill with queue congestion.

```ts
import { estimateTtft } from "@radaros/core";

// llama70b = DEFAULT_ARCHITECTURES["llama-3.1-70b"]; hw is the hardware
// object from the estimateTpot example above.

// Single user
estimateTtft(llama70b, 4096, 1, hw);   // fast — just one prefill

// 20 concurrent users
estimateTtft(llama70b, 4096, 20, hw);  // ~10× worse — queue serialization
```

The model uses TTFT(C) = singlePrefill × (C + 1) / 2: with C concurrent users, the average user waits for C/2 prefills ahead of their own.
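The congestion formula can be written out directly. A minimal sketch of the stated model; the 16 ms figure is the single-prefill time quoted in the singlePrefillMs section, used here purely as an illustration:

```typescript
// TTFT under C concurrent users: the average user queues behind C/2
// prefills and then runs their own, so TTFT(C) = singlePrefill × (C + 1) / 2.
function ttftMsSketch(singlePrefillMs: number, concurrentUsers: number): number {
  return (singlePrefillMs * (concurrentUsers + 1)) / 2;
}

ttftMsSketch(16, 1);  // → 16 ms: one user, just their own prefill
ttftMsSketch(16, 20); // → 168 ms: 10.5× the single-user time
```

Note the formula is exact about the "10×" claim: C = 20 gives a factor of (20 + 1) / 2 = 10.5.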

singlePrefillMs(arch, promptTokens, hardware)

Base prefill time for a single prompt with no queue contention.

```ts
import { singlePrefillMs } from "@radaros/core";

singlePrefillMs(llama70b, 4096, hw);    // ~16ms on 4× H100
singlePrefillMs(llama70b, 32768, hw);   // much longer — quadratic attention
```

Prefill is compute-bound: flops = (4 × N² × hiddenDim + 4 × N × ffnDim) × layers.
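To see why long prompts blow up, the flop formula above can be turned into a time estimate. A sketch assuming illustrative Llama-3.1-70B-like dimensions (hiddenDim 8192, ffnDim 28672, 80 layers) and ~990 bf16 TFLOPS per H100; this is not the library's implementation:

```typescript
// Compute-bound prefill: flops = (4·N²·hiddenDim + 4·N·ffnDim) × layers.
// The N² term is attention, which dominates at long prompt lengths.
function prefillFlops(n: number, hiddenDim: number, ffnDim: number, layers: number): number {
  return (4 * n * n * hiddenDim + 4 * n * ffnDim) * layers;
}

function prefillMsSketch(flops: number, aggregateTflops: number): number {
  return (flops / (aggregateTflops * 1e12)) * 1000;
}

const f4k = prefillFlops(4096, 8192, 28672, 80);
const f32k = prefillFlops(32768, 8192, 28672, 80);
// 8× the tokens costs ~64× the flops: the quadratic term dominates
prefillMsSketch(f32k, 4 * 990) / prefillMsSketch(f4k, 4 * 990); // ≈ 64
```

Real hardware runs below peak TFLOPS, so the library's absolute numbers will differ; the quadratic shape is the point.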

ttftBreachPoint(arch, hardware, ttftSlaMs, promptTokens?)

Maximum concurrent users before TTFT exceeds the SLA.

```ts
import { ttftBreachPoint } from "@radaros/core";

const breach = ttftBreachPoint(llama70b, hw, 5000, 4096);
// → e.g. 629 users before 5s SLA breach

// More GPUs push the breach point out
const breach8gpu = ttftBreachPoint(llama70b, { ...hw, gpuCount: 8 }, 5000);
// → higher number — more TFLOPS
```

Adding NAND does not change the breach point — NAND doesn't help prefill compute.
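The breach point follows from inverting the congestion model: solve singlePrefill × (C + 1) / 2 ≤ SLA for C. A sketch using the ~16 ms single-prefill figure from above; the library's exact output will differ slightly with the real prefill time:

```typescript
// TTFT(C) = single × (C + 1) / 2 ≤ SLA  =>  C ≤ 2 × SLA / single - 1
function ttftBreachUsersSketch(singlePrefillMs: number, ttftSlaMs: number): number {
  return Math.floor((2 * ttftSlaMs) / singlePrefillMs - 1);
}

ttftBreachUsersSketch(16, 5000); // → 624 users at a 16 ms prefill
ttftBreachUsersSketch(8, 5000);  // → 1249: halving prefill time roughly doubles capacity
```

This also makes the NAND remark concrete: only singlePrefillMs appears in the formula, and NAND bandwidth never enters prefill compute.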

restoreLatency(kvSizeGb, nandBandwidthGBs, parallelism?)

Time to restore a cold session from NAND SSD to GPU HBM.

```ts
import { restoreLatency } from "@radaros/core";

restoreLatency(40, 7);     // 40 GB at Gen4 (7 GB/s) → 5,714ms
restoreLatency(40, 14);    // 40 GB at Gen5 (14 GB/s) → 2,857ms
restoreLatency(40, 7, 4);  // 4 parallel streams → 22,857ms per stream
```

Parallel restore streams share the same SSD pipe — higher parallelism helps throughput but increases per-stream latency.
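The timings above follow from a single division. A sketch of the stated behavior (not the library's implementation):

```typescript
// Cold restore streams the session's KV cache from NAND. Parallel streams
// split the same pipe, so per-stream bandwidth is bandwidth / parallelism.
function restoreLatencyMsSketch(
  kvSizeGb: number,
  nandBandwidthGBs: number,
  parallelism: number = 1,
): number {
  const perStreamGBs = nandBandwidthGBs / parallelism;
  return (kvSizeGb / perStreamGBs) * 1000;
}

restoreLatencyMsSketch(40, 7);    // ≈ 5,714 ms at Gen4
restoreLatencyMsSketch(40, 14);   // ≈ 2,857 ms at Gen5
restoreLatencyMsSketch(40, 7, 4); // ≈ 22,857 ms per stream with 4 streams
```

Doubling SSD bandwidth halves restore time; adding parallel streams only overlaps restores without making any single one faster.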