Eval Framework

The @radaros/eval package provides automated quality testing for agents. Define test cases, run them against your agent, and score the outputs using built-in or custom scorers.
npm install @radaros/eval

Quick Start

import { Agent, openai } from "@radaros/core";
import { EvalSuite, contains, regexMatch, ConsoleReporter } from "@radaros/eval";

const agent = new Agent({
  name: "assistant",
  model: openai("gpt-4o-mini"),
  instructions: "Be factual and concise.",
});

const suite = new EvalSuite({
  name: "Quality Suite",
  agent,
  cases: [
    { name: "Capital", input: "What is the capital of France?", expected: "Paris" },
    { name: "Math", input: "What is 15 * 7?", expected: "105" },
  ],
  scorers: [
    contains(), // no argument: checks output against each case's expected value
    regexMatch(/\d+/),
  ],
  threshold: 0.7,
  concurrency: 2,
});

const result = await suite.run([new ConsoleReporter()]);
console.log(`${result.passed}/${result.total} passed`);

Built-in Scorers

| Scorer | Description |
| --- | --- |
| `contains(text?)` | Output contains `text`, or the case's expected value when called with no argument |
| `regexMatch(pattern)` | Output matches a regex pattern |
| `semanticSimilarity({ expected, embedding })` | Cosine similarity to the expected text is above a threshold |
| `llmJudge({ model, criteria })` | An LLM rates the output on each criterion (relevance, helpfulness, etc.) |
| `jsonMatch(fields)` | Structured output fields match expected values |
| `custom(name, fn)` | User-defined scoring function |
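The semanticSimilarity scorer compares an embedding of the agent's output against an embedding of the expected text. The comparison itself is standard cosine similarity; here is a standalone sketch of that computation (the actual scorer also calls the configured embedding model, which is omitted here):

```typescript
// Cosine similarity between two embedding vectors: dot(a, b) / (|a| * |b|).
// This is the comparison semanticSimilarity applies after embedding both
// the expected text and the agent's output.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Identical vectors score 1; orthogonal vectors score 0.
console.log(cosineSimilarity([1, 0], [1, 0])); // 1
console.log(cosineSimilarity([1, 0], [0, 1])); // 0
```

Scores near 1 mean the output is semantically close to the expected text even when the wording differs, which is what makes this scorer useful for free-form answers.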

LLM-as-Judge

Use another model to evaluate the output:
import { llmJudge } from "@radaros/eval";

llmJudge({
  model: openai("gpt-4o"),
  criteria: ["faithfulness", "relevance", "helpfulness"],
});
Each criterion is scored 0-1. The overall score is the average. Customize the judge prompt for domain-specific evaluation:
llmJudge({
  model: openai("gpt-4o"),
  criteria: ["faithfulness", "relevance", "helpfulness"],
  systemPrompt: `You are evaluating a customer support agent.
Score each criterion 0-1 based on whether the response:
- faithfulness: only states verifiable facts
- relevance: addresses the user's specific question
- helpfulness: provides actionable next steps`,
});

Reporters

| Reporter | Output |
| --- | --- |
| `ConsoleReporter` | Pretty-printed table in the terminal |
| `JsonReporter` | JSON file with detailed results |
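If you need a different output format, say a one-line summary for a chat notification, you can post-process the result object that suite.run returns. The sketch below assumes the result carries the suite name plus the passed/failed/total/passRate fields shown in the JSON output below; summarize is a hypothetical helper, not part of @radaros/eval:

```typescript
// Minimal shape of a suite result (a subset of the fields in the JSON output).
interface SuiteResult {
  suite: string;
  passed: number;
  failed: number;
  total: number;
  passRate: number;
}

// Hypothetical one-line formatter for chat or log notifications.
function summarize(result: SuiteResult): string {
  const pct = (result.passRate * 100).toFixed(0);
  const status = result.failed === 0 ? "PASS" : "FAIL";
  return `[${status}] ${result.suite}: ${result.passed}/${result.total} (${pct}%)`;
}

console.log(
  summarize({ suite: "Quality Suite", passed: 8, failed: 2, total: 10, passRate: 0.8 }),
);
// [FAIL] Quality Suite: 8/10 (80%)
```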

JSON Reporter Output

const reporter = new JsonReporter({ outputPath: "./eval-results.json" });
const result = await suite.run([reporter]);

// eval-results.json:
// {
//   "suite": "Quality Suite",
//   "timestamp": "2026-02-26T10:30:00Z",
//   "passed": 8,
//   "failed": 2,
//   "total": 10,
//   "passRate": 0.8,
//   "cases": [
//     {
//       "name": "Capital",
//       "input": "What is the capital of France?",
//       "output": "The capital of France is Paris.",
//       "expected": "Paris",
//       "scores": { "contains": 1.0, "llm-judge": 0.95 },
//       "pass": true,
//       "duration": 1240
//     },
//     ...
//   ]
// }

Custom Scorer

import { custom } from "@radaros/eval";

custom("word-count", async (input, output) => {
  const count = output.text.split(/\s+/).length;
  const pass = count >= 10 && count <= 200;
  return { score: pass ? 1 : 0, pass, reason: `${count} words` };
});
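Scores need not be binary: returning a fractional score lets a case earn partial credit toward the suite threshold. Here is the word-count logic above reworked as a standalone function with a linear ramp instead of a hard cutoff (a sketch of one possible policy; the min/max bounds and the decay slope are arbitrary choices you would wrap in custom(...)):

```typescript
// Partial-credit length scorer: full credit inside [min, max] words,
// linearly decaying credit past max, reaching zero at 2 * max words.
function lengthScore(text: string, min = 10, max = 200): number {
  const count = text.trim().split(/\s+/).length;
  if (count < min) return 0;
  if (count <= max) return 1;
  // Ramp from 1 at `max` words down to 0 at `2 * max` words.
  return Math.max(0, 1 - (count - max) / max);
}
```

An output of 300 words with the defaults scores 0.5, so a case can still pass overall if its other scorers are strong.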

Full Eval Suite Example

A comprehensive eval setup testing a support agent across multiple dimensions:
import { Agent, openai } from "@radaros/core";
import {
  EvalSuite, contains, regexMatch, llmJudge, custom,
  semanticSimilarity, ConsoleReporter, JsonReporter,
} from "@radaros/eval";

const agent = new Agent({
  name: "support-bot",
  model: openai("gpt-4o-mini"),
  instructions: "You are a helpful customer support agent for an e-commerce platform.",
});

const suite = new EvalSuite({
  name: "Support Agent Quality",
  agent,
  cases: [
    {
      name: "Return policy",
      input: "What is your return policy?",
      expected: "30-day return policy",
    },
    {
      name: "Order tracking",
      input: "Where is my order #12345?",
      expected: "tracking information",
    },
    {
      name: "Refund timeline",
      input: "How long do refunds take?",
      expected: "5-10 business days",
    },
    {
      name: "Greeting",
      input: "Hi there",
      expected: "polite greeting",
    },
    {
      name: "Out of scope",
      input: "What's the weather like?",
      expected: "redirect to relevant topic",
    },
  ],
  scorers: [
    contains(),
    llmJudge({
      model: openai("gpt-4o"),
      criteria: ["relevance", "helpfulness", "professionalism"],
    }),
    custom("not-too-long", async (input, output) => {
      const words = output.text.split(/\s+/).length;
      const pass = words <= 150;
      return { score: pass ? 1 : 0, pass, reason: `${words} words (max 150)` };
    }),
  ],
  threshold: 0.7,
  concurrency: 3,
});

const result = await suite.run([
  new ConsoleReporter(),
  new JsonReporter({ outputPath: "./eval-results.json" }),
]);

console.log(`\nResults: ${result.passed}/${result.total} passed (${(result.passRate * 100).toFixed(0)}%)`);
if (!result.allPassed) {
  console.log("Failed cases:", result.failures.map(f => f.name).join(", "));
  process.exit(1);
}

Running Evals in CI

Add eval runs to your CI pipeline:
// eval.ts
const result = await suite.run([new JsonReporter({ outputPath: "./eval-results.json" })]);
process.exit(result.allPassed ? 0 : 1);
# In your CI config (GitHub Actions, etc.)
npx tsx eval.ts
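A minimal GitHub Actions job wiring this together might look like the following. The workflow name, Node version, secret name, and artifact step are placeholders to adapt to your setup:

```yaml
name: evals
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      # eval.ts exits non-zero if any case fails, which fails the job.
      - run: npx tsx eval.ts
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      # Keep the detailed results around for inspection.
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: eval-results
          path: eval-results.json
```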