
## Overview

AccuracyEval uses an LLM judge to score agent responses against expected answers on a 0.0–1.0 scale.

## Quick Start

```typescript
import { AccuracyEval } from "@radaros/eval";
import { Agent, openai } from "@radaros/core";

const agent = new Agent({ name: "qa-bot", model: openai("gpt-4o") });

// Note: "eval" is a reserved word in strict-mode JavaScript/TypeScript
// (and ES modules are always strict), so bind the instance to another name.
const accuracyEval = new AccuracyEval({
  name: "qa-accuracy",
  agent,
  judge: openai("gpt-4o-mini"),
  cases: [
    { name: "capital", input: "What is the capital of France?", expected: "Paris" },
    { name: "math", input: "What is 2+2?", expected: "4" },
  ],
  threshold: 0.8,
});

const result = await accuracyEval.run();
console.log(`Passed: ${result.passed}/${result.total}, Avg: ${result.averageScore}`);
```

## Configuration

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `name` | `string` | required | Name of the evaluation |
| `agent` | `Agent` | required | Agent to evaluate |
| `judge` | `ModelProvider` | required | Model used for scoring |
| `cases` | `EvalCase[]` | required | Test cases with `input`/`expected` |
| `threshold` | `number` | `0.7` | Minimum score to pass |
| `timeoutMs` | `number` | `30000` | Timeout per case, in milliseconds |
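
To make the `threshold` and summary fields concrete, here is a minimal, self-contained sketch of how a judge-score summary like the one `run()` returns could be computed. It assumes each case passes when its judge score meets the threshold; the names `EvalCaseResult` and `aggregate` are illustrative, not part of the `@radaros/eval` API:

```typescript
// Illustrative only: models the pass/fail aggregation described above,
// not the actual @radaros/eval implementation.
interface EvalCaseResult {
  name: string;
  score: number; // LLM judge score on the 0.0–1.0 scale
}

interface EvalSummary {
  passed: number;       // cases with score >= threshold
  total: number;        // total number of cases
  averageScore: number; // mean judge score across cases
}

function aggregate(results: EvalCaseResult[], threshold: number): EvalSummary {
  const passed = results.filter((r) => r.score >= threshold).length;
  const averageScore =
    results.reduce((sum, r) => sum + r.score, 0) / results.length;
  return { passed, total: results.length, averageScore };
}

// Mirrors the two Quick Start cases with hypothetical judge scores.
const summary = aggregate(
  [
    { name: "capital", score: 1.0 },
    { name: "math", score: 0.6 },
  ],
  0.8
);
console.log(summary); // { passed: 1, total: 2, averageScore: 0.8 }
```

With `threshold: 0.8`, a case scoring 0.6 counts as failed even though it pulls the average up to a passing level, which is why both per-case counts and the average are worth inspecting.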