# Semantic Cache
Semantic caching stores LLM responses indexed by the semantic meaning of the input. When a similar query arrives, the cached response is returned without calling the LLM, reducing costs and latency.

## Quick Start
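A minimal wiring sketch. The import path and the option names below (`semanticCache`, `ttl`, `scope`) are assumptions standing in for the library's actual API; only `similarityThreshold`, the store classes, and the scope values appear in this page. Check your version's reference for the real names.

```typescript
// Hypothetical wiring: "your-framework" and the option names are
// illustrative, not confirmed API.
import { Agent, SemanticCache, InMemoryVectorStore } from "your-framework";

const agent = new Agent({
  semanticCache: new SemanticCache({
    store: new InMemoryVectorStore(),
    similarityThreshold: 0.92, // minimum similarity for a cache hit
    ttl: 3600,                 // entry lifetime in seconds
    scope: "agent",            // "global" | "agent" | "session"
  }),
});
```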
## Configuration
### Scope
| Scope | Behavior |
|---|---|
| `global` | All agents share one cache |
| `agent` | Each agent has its own cache partition |
| `session` | Each session has its own cache partition |
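One way to realize these scopes is to derive a cache partition key per scope. The key format below is an assumption for illustration; the library's actual partitioning scheme may differ.

```typescript
// Sketch: derive a partition key per scope. The "cache:..." key format
// is illustrative, not the library's actual scheme.
type Scope = "global" | "agent" | "session";

function partitionKey(scope: Scope, agentName: string, sessionId: string): string {
  switch (scope) {
    case "global":
      return "cache:global"; // one shared partition for all agents
    case "agent":
      return `cache:agent:${agentName}`; // one partition per agent
    case "session":
      return `cache:session:${sessionId}`; // one partition per session
  }
}
```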
## How It Works
- Before calling the LLM, the input is embedded and searched against the vector store
- If a result exceeds the `similarityThreshold`, it is returned as a cache hit
- Output guardrails still run on cached responses
- After an LLM call, the input + output are stored in the vector store (fire-and-forget)
- TTL is enforced on lookup; expired entries are evicted lazily
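The steps above can be condensed into a self-contained sketch. The toy letter-frequency embedder and the `MiniSemanticCache` class are illustrative stand-ins, not the library's implementation; a real setup uses an embedding model and one of the supported vector stores.

```typescript
// Self-contained sketch of the lookup/store cycle.
type Entry = { embedding: number[]; response: string; expiresAt: number };

// Toy embedder (letter frequencies) standing in for a real embedding model.
const embed = (s: string): number[] => {
  const v = new Array(26).fill(0);
  for (const ch of s.toLowerCase()) {
    const i = ch.charCodeAt(0) - 97;
    if (i >= 0 && i < 26) v[i]++;
  }
  return v;
};

const cosine = (a: number[], b: number[]): number => {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i];
  }
  const denom = Math.sqrt(na) * Math.sqrt(nb);
  return denom === 0 ? 0 : dot / denom;
};

class MiniSemanticCache {
  private entries: Entry[] = [];
  constructor(private similarityThreshold = 0.92, private ttlMs = 3_600_000) {}

  // Embed the input, search the store, return a hit above the threshold.
  lookup(input: string): string | undefined {
    const now = Date.now();
    // TTL enforced on lookup: expired entries evicted lazily.
    this.entries = this.entries.filter(e => e.expiresAt > now);
    const q = embed(input);
    let best: Entry | undefined, bestScore = -1;
    for (const e of this.entries) {
      const s = cosine(q, e.embedding);
      if (s > bestScore) { bestScore = s; best = e; }
    }
    return best && bestScore >= this.similarityThreshold ? best.response : undefined;
  }

  // Store input + output after an LLM call.
  store(input: string, response: string): void {
    this.entries.push({
      embedding: embed(input),
      response,
      expiresAt: Date.now() + this.ttlMs,
    });
  }
}
```

With the toy embedder, a case-insensitive rephrasing of a stored input scores 1.0 and hits, while unrelated text falls below the threshold and misses.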
## Events
| Event | Payload |
|---|---|
| `cache.hit` | `{ agentName, input, cachedId }` |
| `cache.miss` | `{ agentName, input }` |
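The payload shapes follow the table above; the `on`/`emit` emitter below is a generic sketch for observing these events, not necessarily the library's subscription API. A typical use is tracking the hit rate:

```typescript
// Payload types from the events table above.
type CacheHit  = { agentName: string; input: string; cachedId: string };
type CacheMiss = { agentName: string; input: string };
type Events = { "cache.hit": CacheHit; "cache.miss": CacheMiss };

// Generic emitter sketch; the real subscription API may differ.
class Emitter {
  private handlers = new Map<string, ((p: any) => void)[]>();
  on<K extends keyof Events>(name: K, fn: (p: Events[K]) => void): void {
    const list = this.handlers.get(name) ?? [];
    list.push(fn);
    this.handlers.set(name, list);
  }
  emit<K extends keyof Events>(name: K, payload: Events[K]): void {
    for (const fn of this.handlers.get(name) ?? []) fn(payload);
  }
}

// Track hit rate from the two events.
const events = new Emitter();
let hits = 0, misses = 0;
events.on("cache.hit", () => { hits++; });
events.on("cache.miss", () => { misses++; });
```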
## Supported Backends
Any `VectorStore` implementation works: `InMemoryVectorStore`, `QdrantVectorStore`, `MongoDBVectorStore`, `PgVectorStore`.
## Backend Examples
### InMemory (Development)
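A configuration sketch; the import path and constructor options are assumptions, as only the class names appear in this page.

```typescript
// Hypothetical import path and options; check the library reference.
import { SemanticCache, InMemoryVectorStore } from "your-framework";

const cache = new SemanticCache({
  store: new InMemoryVectorStore(), // volatile: cleared on process restart
  similarityThreshold: 0.92,
});
```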
### Qdrant (Production)
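A configuration sketch; the import path and the `url`/`collection` option names are assumptions standing in for the real constructor signature.

```typescript
// Hypothetical import path and options; check the library reference.
import { SemanticCache, QdrantVectorStore } from "your-framework";

const cache = new SemanticCache({
  store: new QdrantVectorStore({
    url: process.env.QDRANT_URL,    // e.g. a local or hosted Qdrant instance
    collection: "semantic-cache",   // collection holding cached entries
  }),
  similarityThreshold: 0.92,
  ttl: 3600, // seconds
});
```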
### PgVector (PostgreSQL)
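A configuration sketch; the `connectionString`/`table` option names are assumptions. PostgreSQL needs the `pgvector` extension installed for vector similarity search.

```typescript
// Hypothetical import path and options; check the library reference.
import { SemanticCache, PgVectorStore } from "your-framework";

const cache = new SemanticCache({
  store: new PgVectorStore({
    connectionString: process.env.DATABASE_URL, // requires the pgvector extension
    table: "semantic_cache",                    // table holding cached entries
  }),
  similarityThreshold: 0.92,
});
```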
## Cache Hit vs Miss Behavior
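A self-contained sketch of the two paths. An exact-match map and stub functions stand in for the semantic lookup, the LLM call, and the guardrail; on a hit the LLM is skipped entirely, but output guardrails still run on both paths.

```typescript
// Stubs standing in for the real components.
const stubCache = new Map<string, string>();
let llmCalls = 0;

const lookup = (input: string) => stubCache.get(input);      // stand-in for vector search
const callLLM = (input: string) => { llmCalls++; return `answer:${input}`; };
const guardrails = (output: string) => output;               // runs on hits and misses alike

function respond(input: string): string {
  const cached = lookup(input);
  if (cached !== undefined) return guardrails(cached); // hit: no LLM call, no model cost
  const output = callLLM(input);                       // miss: full model latency and cost
  stubCache.set(input, output);                        // store (fire-and-forget in the real system)
  return guardrails(output);
}
```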
## Tuning `similarityThreshold`
| Threshold | Behavior |
|---|---|
| 0.98+ | Nearly exact matches only |
| 0.92-0.95 | Good default; catches rephrasings |
| 0.85-0.90 | Aggressive caching; may return irrelevant results |
| < 0.85 | Not recommended; too many false matches |
Start with 0.92 and adjust based on your cache hit rate and quality.
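The effect of the threshold can be illustrated with cosine similarity directly. The toy letter-frequency embedder below is an assumption for demonstration; real embedding models produce different score distributions, which is why the threshold needs tuning per model.

```typescript
// Toy letter-frequency embedder; real models score differently.
const embedText = (s: string): number[] => {
  const v = new Array(26).fill(0);
  for (const ch of s.toLowerCase()) {
    const i = ch.charCodeAt(0) - 97;
    if (i >= 0 && i < 26) v[i]++;
  }
  return v;
};

const cosineSim = (a: number[], b: number[]): number => {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i];
  }
  const denom = Math.sqrt(na) * Math.sqrt(nb);
  return denom === 0 ? 0 : dot / denom;
};

// A query hits only if its similarity to a cached input clears the threshold.
const isHit = (query: string, cached: string, threshold: number) =>
  cosineSim(embedText(query), embedText(cached)) >= threshold;
```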
## Cross-References
- Tool Caching: cache individual tool results (distinct from the semantic cache)
- Cost Tracking: the semantic cache reduces LLM costs; track savings with `CostTracker`