Multi-Modal Input

RadarOS agents accept not only text but also images, audio, and files. Use the MessageContent type and ContentPart[] to send multi-modal input to vision- and audio-capable models.

MessageContent Type

Input to agent.run() or agent.stream() can be:
type MessageContent = string | ContentPart[];
  • string — Plain text (most common)
  • ContentPart[] — Array of text, image, audio, or file parts

ContentPart Types

TextPart

{ type: "text", text: string }

ImagePart

{ type: "image", data: string, mimeType? }

AudioPart

{ type: "audio", data: string, mimeType? }

FilePart

{ type: "file", data: string, mimeType, filename? }
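Taken together, the four shapes form a discriminated union on the `type` field, and a plain string is just shorthand for a single text part. A minimal normalization sketch (the types are restated locally for illustration; in practice, use the exports from @radaros/core):

```typescript
// Restated locally for illustration; mirrors the shapes documented above.
type ContentPart =
  | { type: "text"; text: string }
  | { type: "image"; data: string; mimeType?: string }
  | { type: "audio"; data: string; mimeType?: string }
  | { type: "file"; data: string; mimeType: string; filename?: string };

type MessageContent = string | ContentPart[];

// A plain string input is equivalent to a single text part.
function toParts(input: MessageContent): ContentPart[] {
  return typeof input === "string" ? [{ type: "text", text: input }] : input;
}
```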

Image Input

Images can be provided as a base64-encoded string or a URL:
import { Agent, openai, type ContentPart } from "@radaros/core";

const agent = new Agent({
  name: "VisionAgent",
  model: openai("gpt-4o"),
  instructions: "Describe and analyze images in detail.",
});

// Image via URL
const input: ContentPart[] = [
  { type: "text", text: "What's in this image?" },
  {
    type: "image",
    data: "https://example.com/image.png",
    mimeType: "image/png",
  },
];

// Image via base64
const base64Image = "data:image/png;base64,iVBORw0KGgo...";
const inputBase64: ContentPart[] = [
  { type: "text", text: "Analyze this." },
  { type: "image", data: base64Image, mimeType: "image/png" },
];

const result = await agent.run(input);
Supported mimeType values: image/png, image/jpeg, image/gif, image/webp.

Audio Input

Audio is provided as base64-encoded data:
import { Agent, google, type ContentPart } from "@radaros/core";
import { readFileSync } from "node:fs";

const agent = new Agent({
  name: "AudioAnalyzer",
  model: google("gemini-2.5-flash"),
  instructions: "Transcribe and analyze audio content.",
});

const audioData = readFileSync("sample.mp3");
const base64Audio = audioData.toString("base64");

const result = await agent.run([
  { type: "text", text: "Transcribe and summarize this audio." },
  { type: "audio", data: base64Audio, mimeType: "audio/mp3" },
] as ContentPart[]);
Supported mimeType values: audio/mp3, audio/wav, audio/ogg, audio/webm.

File Input

Generic files (PDFs, documents, etc.) use FilePart:
const input: ContentPart[] = [
  { type: "text", text: "Summarize this document." },
  {
    type: "file",
    data: "https://example.com/doc.pdf",
    mimeType: "application/pdf",
    filename: "document.pdf",
  },
];
data can be a URL or base64-encoded content.

Example: Vision Agent Analyzing an Image

import { Agent, openai, type ContentPart } from "@radaros/core";
import { z } from "zod";

const ImageAnalysis = z.object({
  description: z.string().describe("Detailed description of the image"),
  objects: z.array(z.string()).describe("Objects detected"),
  dominantColors: z.array(z.string()).describe("Dominant colors"),
  mood: z.string().describe("Overall mood"),
});

const analyzer = new Agent({
  name: "ImageAnalyzer",
  model: openai("gpt-4o"),
  instructions: "Analyze images and return structured JSON.",
  structuredOutput: ImageAnalysis,
});

const multiModalInput: ContentPart[] = [
  { type: "text", text: "Analyze this image in detail." },
  {
    type: "image",
    data: "https://upload.wikimedia.org/wikipedia/commons/4/47/PNG_transparency_demonstration_1.png",
    mimeType: "image/png",
  },
];

const result = await analyzer.run(multiModalInput);
console.log(result.structured);

Example: Audio Analysis with Gemini

import { Agent, google, type ContentPart } from "@radaros/core";
import { readFileSync } from "node:fs";
import { z } from "zod";

const AudioAnalysis = z.object({
  transcription: z.string(),
  language: z.string(),
  speakerCount: z.number(),
  summary: z.string(),
  mood: z.string(),
  topics: z.array(z.string()),
});

const agent = new Agent({
  name: "AudioAnalyzer",
  model: google("gemini-2.5-flash"),
  instructions: "Analyze audio: transcribe, detect language, summarize.",
  structuredOutput: AudioAnalysis,
});

const audioData = readFileSync("audio/sample.mp3");
const base64Audio = audioData.toString("base64");

const result = await agent.run([
  { type: "text", text: "Analyze this audio clip in detail." },
  { type: "audio", data: base64Audio, mimeType: "audio/mp3" },
] as ContentPart[]);

console.log(result.structured);

Provider Support Matrix

Not all providers support all content types. When an unsupported type is passed, the provider logs a warning and either skips the content or substitutes a placeholder.
Content Type   | OpenAI | Anthropic | Google/Vertex | AWS Claude | AWS Bedrock | Azure OpenAI | Azure Foundry   | Ollama
Image (URL)    | Yes    | Yes       | Yes           | Yes        | No          | Yes          | Model-dependent | No
Image (base64) | Yes    | Yes       | Yes           | Yes        | Yes*        | Yes          | Model-dependent | Yes
Audio (base64) | Yes    | No        | Yes           | No         | No          | Yes          | No              | No
File (URL)     | Yes    | Yes       | Yes           | Yes        | No          | Yes          | No              | No
File (base64)  | Yes    | Yes       | Yes           | Yes        | Yes*        | Yes          | No              | No
  • Ollama image support requires a vision-capable model (e.g., llava, bakllava, llama3.2-vision).
  • AWS Bedrock multi-modal support (*) depends on the specific model. Amazon Nova supports images; document support varies by model.
  • AWS Claude supports the same multi-modal features as the direct Anthropic provider.
  • Azure OpenAI supports the same multi-modal features as the direct OpenAI provider.
  • Azure AI Foundry vision support depends on the model (e.g., Phi-3.5-vision-instruct supports images).
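The skip-on-unsupported behavior described above can also be approximated client-side by filtering parts against a copy of the matrix before sending. A simplified sketch, assuming a hard-coded table transcribed from the base64 columns above (provider keys and types restated locally for illustration):

```typescript
type ContentPart =
  | { type: "text"; text: string }
  | { type: "image"; data: string; mimeType?: string }
  | { type: "audio"; data: string; mimeType?: string }
  | { type: "file"; data: string; mimeType: string; filename?: string };

// Base64 support per provider, transcribed from the matrix above (illustrative subset).
const SUPPORTED: Record<string, Set<ContentPart["type"]>> = {
  openai: new Set(["text", "image", "audio", "file"]),
  anthropic: new Set(["text", "image", "file"]),
  google: new Set(["text", "image", "audio", "file"]),
  ollama: new Set(["text", "image"]),
};

// Drop unsupported parts with a warning, mirroring the provider-side fallback.
function filterParts(provider: string, parts: ContentPart[]): ContentPart[] {
  const supported = SUPPORTED[provider] ?? new Set<ContentPart["type"]>(["text"]);
  return parts.filter((part) => {
    if (supported.has(part.type)) return true;
    console.warn(`${provider} does not support ${part.type} parts; skipping.`);
    return false;
  });
}
```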

Reading CSV Data

CSV files can be sent to Anthropic and OpenAI as file input. The model reads and analyzes the data directly:
import { Agent, anthropic, type ContentPart } from "@radaros/core";
import { readFileSync } from "node:fs";

const agent = new Agent({
  name: "DataAnalyst",
  model: anthropic("claude-sonnet-4-6"),
  instructions: "Analyze data files. Provide insights with specific numbers.",
});

// From a local CSV file
const csvData = readFileSync("sales-data.csv").toString("base64");
const result = await agent.run([
  { type: "text", text: "Analyze this sales data. What are the top 3 products by revenue?" },
  { type: "file", data: csvData, mimeType: "text/csv", filename: "sales-data.csv" },
] as ContentPart[]);

console.log(result.text);
// "Based on the sales data, the top 3 products by revenue are:
//  1. Widget Pro - $142,500 (1,425 units)
//  2. Gadget Plus - $98,200 (982 units)
//  3. Tool Basic - $67,800 (2,260 units)"

Analyzing PDFs

PDF documents can be sent via URL (no download needed) or base64:
import { Agent, anthropic, type ContentPart } from "@radaros/core";

const agent = new Agent({
  name: "DocumentReader",
  model: anthropic("claude-sonnet-4-6"),
  instructions: "Extract key information from documents. Be thorough but concise.",
});

// PDF via URL — Anthropic fetches it directly
const result = await agent.run([
  { type: "text", text: "Summarize the key findings in this research paper." },
  {
    type: "file",
    data: "https://example.com/research-paper.pdf",
    mimeType: "application/pdf",
    filename: "paper.pdf",
  },
] as ContentPart[]);

// PDF via base64
import { readFileSync } from "node:fs";
const pdfData = readFileSync("contract.pdf").toString("base64");

const contractResult = await agent.run([
  { type: "text", text: "What are the payment terms and termination clauses?" },
  { type: "file", data: pdfData, mimeType: "application/pdf", filename: "contract.pdf" },
] as ContentPart[]);

XLSX and Binary Formats

Most providers cannot process Excel (.xlsx) files directly. Google Gemini is the exception — it handles XLSX natively via inlineData. For other providers, convert to CSV first:
import * as XLSX from "xlsx"; // npm install xlsx
import { readFileSync } from "node:fs";

function xlsxToCsv(filePath: string): string {
  const workbook = XLSX.read(readFileSync(filePath), { type: "buffer" });
  const sheet = workbook.Sheets[workbook.SheetNames[0]];
  return XLSX.utils.sheet_to_csv(sheet);
}

const csvContent = xlsxToCsv("report.xlsx");
const csvBase64 = Buffer.from(csvContent).toString("base64");

const result = await agent.run([
  { type: "text", text: "Analyze this spreadsheet data." },
  { type: "file", data: csvBase64, mimeType: "text/csv", filename: "report.csv" },
] as ContentPart[]);

Multi-Modal via HTTP File Upload

When exposing agents via Express, you can accept file uploads and convert them to ContentPart[]; the transport layer provides buildMultiModalInput for this. See File Upload for how to handle multipart/form-data and build multi-modal input from uploaded files.
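As a rough sketch of what such a conversion involves (the UploadedFile shape below mirrors a multer-style in-memory upload; buildMultiModalInput's actual signature may differ, so treat this as an illustration):

```typescript
type ContentPart =
  | { type: "text"; text: string }
  | { type: "image"; data: string; mimeType?: string }
  | { type: "audio"; data: string; mimeType?: string }
  | { type: "file"; data: string; mimeType: string; filename?: string };

// Shape of a multer-style upload: in-memory buffer plus client-reported metadata.
interface UploadedFile {
  buffer: Buffer;
  mimetype: string;
  originalname: string;
}

// Hypothetical helper: map each upload to a typed part, prefixed by the user's prompt.
function buildInput(prompt: string, files: UploadedFile[]): ContentPart[] {
  const parts: ContentPart[] = [{ type: "text", text: prompt }];
  for (const file of files) {
    const data = file.buffer.toString("base64");
    if (file.mimetype.startsWith("image/")) {
      parts.push({ type: "image", data, mimeType: file.mimetype });
    } else if (file.mimetype.startsWith("audio/")) {
      parts.push({ type: "audio", data, mimeType: file.mimetype });
    } else {
      parts.push({ type: "file", data, mimeType: file.mimetype, filename: file.originalname });
    }
  }
  return parts;
}
```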