Semantic Cache

Semantic caching stores LLM responses indexed by the semantic meaning of the input. When a similar query arrives, the cached response is returned without calling the LLM — reducing costs and latency.

Quick Start

import { Agent, openai, InMemoryVectorStore, OpenAIEmbedding } from "@radaros/core";

const embedding = new OpenAIEmbedding();

const agent = new Agent({
  name: "assistant",
  model: openai("gpt-4o"),
  semanticCache: {
    vectorStore: new InMemoryVectorStore(embedding),
    embedding,
    similarityThreshold: 0.92,
    scope: "agent",
  },
});

// First call: LLM call, result cached
await agent.run("What is the capital of France?");

// Second call: returns from cache (no LLM call)
await agent.run("What's the capital of France?");

Configuration

interface SemanticCacheConfig {
  vectorStore: VectorStore;        // Any vector store backend
  embedding: EmbeddingProvider;    // Embedding model for similarity
  similarityThreshold?: number;    // 0-1, default 0.92
  ttl?: number;                    // Cache expiry in ms
  collection?: string;             // Vector collection name
  scope?: "global" | "agent" | "session";
}

Scope

| Scope | Behavior |
| --- | --- |
| global | All agents share one cache |
| agent | Each agent has its own cache partition |
| session | Each session has its own cache partition |

How It Works

  1. Before calling the LLM, the input is embedded and searched against the vector store
  2. If a result exceeds the similarityThreshold, it’s returned as a cache hit
  3. Output guardrails still run on cached responses
  4. After an LLM call, the input + output are stored in the vector store (fire-and-forget)
  5. TTL is enforced on lookup — expired entries are evicted lazily
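
The steps above can be sketched as a minimal in-memory implementation. This is a hypothetical illustration (the `SketchCache` class, `lookup`/`store` names, and plain cosine similarity are assumptions, not the library's internals), but it captures the lookup, threshold check, lazy TTL eviction, and store-after-call flow:

```typescript
// Hypothetical sketch of the semantic-cache flow — not @radaros/core internals.
type CacheEntry = { vector: number[]; response: string; storedAt: number };

// Cosine similarity between two embedding vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

class SketchCache {
  private entries: CacheEntry[] = [];
  constructor(private threshold = 0.92, private ttl = Infinity) {}

  // Steps 1-2 and 5: evict expired entries lazily, then return the best
  // match at or above the similarity threshold, if any.
  lookup(vector: number[], now = Date.now()): string | undefined {
    this.entries = this.entries.filter((e) => now - e.storedAt < this.ttl);
    let best: CacheEntry | undefined;
    let bestScore = this.threshold;
    for (const e of this.entries) {
      const score = cosine(vector, e.vector);
      if (score >= bestScore) {
        best = e;
        bestScore = score;
      }
    }
    return best?.response;
  }

  // Step 4: after an LLM call, store the input embedding with the output.
  store(vector: number[], response: string, now = Date.now()): void {
    this.entries.push({ vector, response, storedAt: now });
  }
}
```

A real implementation delegates the similarity search to the configured VectorStore and the embedding step to the EmbeddingProvider; the structure of the flow is the same.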

Events

| Event | Payload |
| --- | --- |
| cache.hit | { agentName, input, cachedId } |
| cache.miss | { agentName, input } |

Supported Backends

Any VectorStore implementation works: InMemoryVectorStore, QdrantVectorStore, MongoDBVectorStore, PgVectorStore.

Backend Examples

InMemory (Development)

import { Agent, openai, InMemoryVectorStore, OpenAIEmbedding } from "@radaros/core";

const embedding = new OpenAIEmbedding();

const agent = new Agent({
  name: "assistant",
  model: openai("gpt-4o"),
  semanticCache: {
    vectorStore: new InMemoryVectorStore(embedding),
    embedding,
    similarityThreshold: 0.92,
  },
});

Fast and zero-config. The cache is lost when the process restarts — ideal for development and testing.

Qdrant (Production)

import { Agent, openai, QdrantVectorStore, OpenAIEmbedding } from "@radaros/core";

const embedding = new OpenAIEmbedding();

const agent = new Agent({
  name: "assistant",
  model: openai("gpt-4o"),
  semanticCache: {
    vectorStore: new QdrantVectorStore({
      url: "http://localhost:6333",
      collection: "semantic_cache",
      embedding,
    }),
    embedding,
    similarityThreshold: 0.90,
    ttl: 3600_000, // 1 hour
  },
});

PgVector (PostgreSQL)

import { Agent, openai, PgVectorStore, OpenAIEmbedding } from "@radaros/core";

const embedding = new OpenAIEmbedding();

const agent = new Agent({
  name: "assistant",
  model: openai("gpt-4o"),
  semanticCache: {
    vectorStore: new PgVectorStore({
      connectionString: "postgresql://localhost:5432/myapp",
      table: "semantic_cache",
      embedding,
    }),
    embedding,
    similarityThreshold: 0.92,
  },
});

Cache Hit vs Miss Behavior

import { Agent, openai, InMemoryVectorStore, OpenAIEmbedding } from "@radaros/core";

const embedding = new OpenAIEmbedding();

const agent = new Agent({
  name: "assistant",
  model: openai("gpt-4o"),
  semanticCache: {
    vectorStore: new InMemoryVectorStore(embedding),
    embedding,
    similarityThreshold: 0.92,
    ttl: 60_000, // 1 minute
  },
});

// Listen to cache events
agent.on("cache.hit", ({ input, cachedId }) => {
  console.log(`Cache HIT for: "${input}" (id: ${cachedId})`);
});
agent.on("cache.miss", ({ input }) => {
  console.log(`Cache MISS for: "${input}"`);
});

// First call: MISS — calls LLM, stores result
await agent.run("What is the capital of France?");
// → Cache MISS for: "What is the capital of France?"

// Semantically similar: HIT — returns cached result (no LLM call)
await agent.run("What's France's capital city?");
// → Cache HIT for: "What's France's capital city?"

// Different enough: MISS
await agent.run("What is the population of France?");
// → Cache MISS for: "What is the population of France?"

// After TTL expires: MISS again
// (wait 60 seconds...)
await agent.run("What is the capital of France?");
// → Cache MISS for: "What is the capital of France?"

Tuning similarityThreshold

| Threshold | Behavior |
| --- | --- |
| 0.98+ | Nearly exact matches only |
| 0.92-0.95 | Good default — catches rephrasings |
| 0.85-0.90 | Aggressive caching — may return irrelevant results |
| < 0.85 | Not recommended — too many false matches |

Start with 0.92 and adjust based on your cache hit rate and quality.

Cross-References

  • Tool Caching — Cache individual tool results (different from semantic cache)
  • Cost Tracking — Semantic cache reduces LLM costs; track savings with CostTracker