Latency Estimator
Physics-based latency modeling for LLM inference. Captures the two fundamental bottlenecks: memory bandwidth (decode) and compute (prefill).
estimateTpot(arch, contextTokens, batchSize, hardware, kvPrecision?)
Time Per Output Token — each decode step streams the full KV cache from HBM.
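The bandwidth-bound idea can be illustrated with a minimal sketch. It is not the library's implementation: the `ArchSketch` and `HardwareSketch` field names, the `bytesPerElement` parameter (standing in for `kvPrecision`), and the choice to ignore weight traffic are all assumptions made for illustration.

```typescript
// Hypothetical shapes: these field names are assumptions, not the library's types.
interface ArchSketch {
  layers: number;   // number of transformer layers
  kvHeads: number;  // number of key/value heads per layer
  headDim: number;  // dimension of each head
}

interface HardwareSketch {
  hbmBandwidthGBps: number; // HBM bandwidth in GB/s (1 GB = 1e9 bytes)
}

// Bytes of KV cache read in one decode step: a K and a V tensor for every
// layer, every cached token, and every sequence in the batch.
function kvCacheBytes(
  arch: ArchSketch,
  contextTokens: number,
  batchSize: number,
  bytesPerElement: number,
): number {
  return 2 * arch.layers * arch.kvHeads * arch.headDim
    * contextTokens * batchSize * bytesPerElement;
}

// Bandwidth-bound time for one decode step, in milliseconds.
// Assumption: only KV cache traffic is counted; model weights are ignored.
function sketchTpotMs(
  arch: ArchSketch,
  contextTokens: number,
  batchSize: number,
  hardware: HardwareSketch,
  bytesPerElement: number = 2, // assumed 16-bit KV cache
): number {
  const bytes = kvCacheBytes(arch, contextTokens, batchSize, bytesPerElement);
  const seconds = bytes / (hardware.hbmBandwidthGBps * 1e9);
  return seconds * 1000;
}
```

Under this model the step time grows linearly with `contextTokens` and `batchSize`, and halving `bytesPerElement` halves the estimate.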
estimateTtft(arch, promptTokens, concurrentUsers, hardware)
Time To First Token — compute-bound prefill with queue congestion.
TTFT(C) = singlePrefill × (C+1)/2. With C concurrent users whose prefills run one at a time, the average user waits for (C-1)/2 prefills ahead of them plus their own.
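The queue formula can be checked against a direct average over queue positions. The sketch below assumes prefills are served strictly one at a time in arrival order; both function names are hypothetical and not part of the documented API.

```typescript
// Closed form from the text: TTFT(C) = singlePrefill × (C + 1) / 2.
function queuedTtftMs(singlePrefillMs: number, concurrentUsers: number): number {
  if (concurrentUsers < 1) {
    throw new RangeError("concurrentUsers must be at least 1");
  }
  return singlePrefillMs * (concurrentUsers + 1) / 2;
}

// Brute-force check: the user at queue position k (1-based) sees their first
// token after k prefills have completed; average that over all positions.
function averageTtftByPosition(singlePrefillMs: number, concurrentUsers: number): number {
  let totalMs = 0;
  for (let position = 1; position <= concurrentUsers; position++) {
    totalMs += position * singlePrefillMs;
  }
  return totalMs / concurrentUsers;
}
```

With a single user the closed form reduces to the bare prefill time. For example, with a 100 ms prefill and 5 users both functions give 300 ms.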
singlePrefillMs(arch, promptTokens, hardware)
Base prefill time for a single prompt with no queue contention.
flops = (4 × N² × hiddenDim + 4 × N × ffnDim) × layers, where N = promptTokens.
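The FLOP count can be transcribed term by term, as in the sketch below. The `PrefillArchSketch` interface is a hypothetical shape. The conversion from FLOPs to milliseconds (dividing by a peak throughput, here a hypothetical `peakTflops` parameter) is an assumption about how a compute-bound estimate would be derived, since the text does not spell it out.

```typescript
// Hypothetical shape holding only the fields the formula uses.
interface PrefillArchSketch {
  hiddenDim: number;
  ffnDim: number;
  layers: number;
}

// flops = (4 × N² × hiddenDim + 4 × N × ffnDim) × layers, with N = promptTokens.
function prefillFlops(arch: PrefillArchSketch, promptTokens: number): number {
  const n = promptTokens;
  const attentionTerm = 4 * n * n * arch.hiddenDim; // quadratic in prompt length
  const ffnTerm = 4 * n * arch.ffnDim;              // linear in prompt length, as written in the formula
  return (attentionTerm + ffnTerm) * arch.layers;
}

// Assumption: compute-bound time = flops / peak throughput, at full utilization.
function prefillMsFromFlops(flops: number, peakTflops: number): number {
  const seconds = flops / (peakTflops * 1e12);
  return seconds * 1000;
}
```

Because the first term is quadratic in N, doubling the prompt length roughly quadruples the estimate once the attention term dominates.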