
Vision Agents

RadarOS supports real-time vision+audio conversations through the VisionAgent class. Vision agents stream both audio and video frames to multimodal models, enabling agents that can see camera feeds, screen shares, or images while having a spoken conversation. VisionAgent is a separate class from VoiceAgent with its own VisionProvider interface, designed for clean multi-provider support.

Quick Start

npm install @radaros/core @google/genai
import { VisionAgent, geminiVisionLive } from "@radaros/core";
import { readFileSync } from "fs";

const agent = new VisionAgent({
  name: "vision-assistant",
  provider: geminiVisionLive("gemini-3.1-flash-live-preview"),
  instructions: "You can see and hear. Describe what you see when asked.",
  voice: "Aoede",
  fps: 1,
  thinkingLevel: "minimal",
});

const session = await agent.connect();

// Send an image frame (JPEG from camera, screenshot, etc.)
const frame = readFileSync("./photo.jpg");
session.sendImage(frame, "image/jpeg");

// Send audio (PCM 16kHz)
session.sendAudio(micBuffer);

// Send text
session.sendText("What do you see?");

// Listen for responses
session.on("transcript", ({ text, role }) => {
  console.log(`[${role}] ${text}`);
});

session.on("audio", ({ data }) => {
  // Play audio response (PCM 24kHz)
});

session.on("interrupted", () => {
  // User started speaking — stop playback
});
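sendAudio() expects raw 16 kHz, 16-bit PCM. Browser microphone capture via the Web Audio API yields Float32 samples in [-1, 1], so they must be converted before sending. A minimal conversion sketch (the helper name floatTo16BitPCM is illustrative, not part of the SDK):

```typescript
// Convert Web Audio Float32 samples ([-1, 1]) to 16-bit PCM,
// the sample format sendAudio() expects. Illustrative helper, not SDK API.
function floatTo16BitPCM(samples: Float32Array): Int16Array {
  const pcm = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    // Clamp first so out-of-range floats don't wrap around on conversion.
    const s = Math.max(-1, Math.min(1, samples[i]));
    pcm[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return pcm;
}
```

Wrap the result in a Buffer before sending, e.g. `session.sendAudio(Buffer.from(pcm.buffer))`. Note that most browsers capture at 44.1 or 48 kHz, so resampling to 16 kHz is a separate step.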

Architecture: Agent vs VoiceAgent vs VisionAgent

RadarOS has three independent agent tiers:
| Class | Provider | Input | Output | Use Case |
| --- | --- | --- | --- | --- |
| Agent | ModelProvider | Text | Text + tools | Chat, RAG, workflows |
| VoiceAgent | RealtimeProvider | Audio | Audio + text | Voice assistants |
| VisionAgent | VisionProvider | Audio + Images | Audio + text | Vision assistants, video analysis |
Each tier has its own provider interface. Existing code is never affected when a new tier is added.

VisionProvider Interface

Any provider that supports audio + video can implement this:
interface VisionProvider {
  readonly providerId: string;
  readonly modelId: string;
  connect(config: VisionSessionConfig): Promise<VisionConnection>;
}

interface VisionConnection {
  sendAudio(data: Buffer): void;
  sendImage(data: Buffer, mimeType?: string): void;
  sendText(text: string): void;
  sendToolResult(callId: string, result: string): void;
  interrupt(): void;
  close(): Promise<void>;
  on(event: string, handler: (payload: unknown) => void): void;
  off(event: string, handler: (payload: unknown) => void): void;
}
The difference from RealtimeConnection (voice) is sendImage() — this is what makes a provider “vision-capable.”
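To illustrate the shape of the interface, here is a minimal in-memory provider. The echo behavior and the EchoConnection/echoVisionProvider names are invented for illustration; a real provider would stream audio and frames to a model backend over a websocket:

```typescript
// Illustrative VisionProvider sketch: echoes text back as a transcript event.
// Not a real model integration — audio/image inputs are accepted but ignored.
type Handler = (payload: any) => void;

class EchoConnection {
  private handlers = new Map<string, Set<Handler>>();
  sendAudio(_data: Buffer): void {}                      // no-op in this sketch
  sendImage(_data: Buffer, _mimeType?: string): void {}  // no-op in this sketch
  sendText(text: string): void {
    this.emit("transcript", { text, role: "assistant" });
  }
  sendToolResult(_callId: string, _result: string): void {}
  interrupt(): void { this.emit("interrupted", {}); }
  async close(): Promise<void> {}
  on(event: string, handler: Handler): void {
    if (!this.handlers.has(event)) this.handlers.set(event, new Set());
    this.handlers.get(event)!.add(handler);
  }
  off(event: string, handler: Handler): void {
    this.handlers.get(event)?.delete(handler);
  }
  private emit(event: string, payload: any): void {
    this.handlers.get(event)?.forEach((h) => h(payload));
  }
}

const echoVisionProvider = {
  providerId: "echo",
  modelId: "echo-v1",
  async connect(_config: unknown): Promise<EchoConnection> {
    return new EchoConnection();
  },
};
```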

Gemini 3.1 Flash Live

The first VisionProvider implementation uses Google’s Gemini 3.1 Flash Live model.
import { geminiVisionLive } from "@radaros/core";

const provider = geminiVisionLive("gemini-3.1-flash-live-preview");
Model specs:
  • Input: Text, images, audio, video
  • Output: Text and audio
  • Context: 131K input, 65K output
  • Features: Function calling, search grounding, thinking
  • Thinking: Uses thinkingLevel (“minimal”, “low”, “medium”, “high”) instead of thinkingBudget

Configuration

const agent = new VisionAgent({
  name: "assistant",
  provider: geminiVisionLive(),
  instructions: "You are a helpful vision assistant.",
  voice: "Aoede",              // voice selection (30 voices available)
  language: "hi-IN",           // optional — force a BCP-47 language code
  fps: 1,                      // frame rate hint
  thinkingLevel: "minimal",    // lowest latency
  temperature: 0.7,
  tools: [myTool],
  memory: { /* UnifiedMemoryConfig */ },
});

Full Configuration Reference

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| name | string | required | Agent name for logging and identification |
| provider | VisionProvider | required | The vision provider to use |
| instructions | string | — | System prompt / instructions |
| voice | string | Provider default | Voice name for audio responses |
| language | string | Auto-detect | BCP-47 language code (e.g. "en-US", "hi-IN") |
| fps | number | 1 | Suggested video frame rate |
| thinkingLevel | "minimal" \| "low" \| "medium" \| "high" | "minimal" | Reasoning depth (Gemini 3.1+) |
| temperature | number | — | Sampling temperature |
| tools | ToolDef[] | — | Tools the agent can call |
| memory | UnifiedMemoryConfig | — | Memory configuration |
| logLevel | LogLevel | — | Logging verbosity |

Voice Selection

Gemini Live supports 30 prebuilt voices, each with a distinct personality. Set the voice property on the agent config:
const agent = new VisionAgent({
  name: "assistant",
  provider: geminiVisionLive(),
  voice: "Puck",  // Upbeat personality
});

Available Voices

| Voice | Personality | Voice | Personality |
| --- | --- | --- | --- |
| Puck | Upbeat | Laomedeia | Upbeat |
| Kore | Firm | Schedar | Even |
| Charon | Informative | Achird | Friendly |
| Fenrir | Excitable | Sadachbia | Lively |
| Aoede | Breezy | Enceladus | Breathy |
| Zephyr | Bright | Algieba | Smooth |
| Orus | Firm | Algenib | Gravelly |
| Autonoe | Bright | Achernar | Soft |
| Umbriel | Easy-going | Gacrux | Mature |
| Erinome | Clear | Zubenelgenubi | Casual |
| Leda | Youthful | Sadaltager | Knowledgeable |
| Callirrhoe | Easy-going | Rasalgethi | Informative |
| Iapetus | Clear | Alnilam | Firm |
| Despina | Smooth | Pulcherrima | Forward |
| Vindemiatrix | Gentle | Sulafat | Warm |

Per-Session Voice Selection

You can create agents with different voices per session — useful for letting users pick:
socket.on("vision.start", async ({ voice }) => {
  const agent = new VisionAgent({
    name: "assistant",
    provider: geminiVisionLive(),
    instructions: "You are a helpful assistant.",
    voice: voice || undefined,  // user's choice or provider default
  });
  const session = await agent.connect();
});

Multilingual Support

Vision agents support 24+ languages with automatic language detection: the agent detects the user’s spoken language and responds in kind. To make this behavior explicit and robust, add a language rule to your instructions:
const agent = new VisionAgent({
  name: "assistant",
  provider: geminiVisionLive(),
  instructions: [
    "You are a helpful multilingual vision assistant.",
    "LANGUAGE RULE: Always detect the language the user is speaking and respond in that same language.",
    "If the user switches languages mid-conversation, switch with them immediately.",
  ].join(" "),
});
The user can speak Hindi, switch to English, then ask something in Japanese — the agent follows naturally.

Explicit Language (Optional)

Force a specific language using the language config with a BCP-47 code:
const agent = new VisionAgent({
  name: "assistant",
  provider: geminiVisionLive(),
  language: "hi-IN",  // always respond in Hindi
});

Supported Languages

| Language | Code | Language | Code |
| --- | --- | --- | --- |
| English (US) | en-US | Korean | ko-KR |
| Hindi | hi-IN | Japanese | ja-JP |
| Spanish (US) | es-US | Arabic (Egyptian) | ar-EG |
| French | fr-FR | Bengali | bn-BD |
| German | de-DE | Marathi | mr-IN |
| Portuguese (Brazil) | pt-BR | Tamil | ta-IN |
| Italian | it-IT | Telugu | te-IN |
| Dutch | nl-NL | Thai | th-TH |
| Polish | pl-PL | Turkish | tr-TR |
| Romanian | ro-RO | Ukrainian | uk-UA |
| Russian | ru-RU | Vietnamese | vi-VN |
| Indonesian | id-ID | English (India) | en-IN |

Audio Interruption

When the user starts speaking while the agent is responding, the agent automatically stops its current response. This is handled by Gemini Live’s built-in Voice Activity Detection (VAD).

Server-Side

The VisionSession emits an interrupted event:
const session = await agent.connect();

session.on("interrupted", () => {
  console.log("User interrupted — response stopped");
  // Forward to client to stop audio playback
});

Client-Side

Stop all queued audio playback when interrupted:
const playbackCtx = new AudioContext();
let nextPlayTime = 0;   // schedule cursor for gapless playback
let activeSources = [];

socket.on("vision.audio", (data) => {
  const source = playbackCtx.createBufferSource();
  // ... decode the PCM 24kHz payload into an AudioBuffer, connect, start ...
  activeSources.push(source);
  source.onended = () => {
    activeSources = activeSources.filter((s) => s !== source);
  };
});

socket.on("vision.interrupted", () => {
  activeSources.forEach((s) => { try { s.stop(); } catch (e) {} });
  activeSources = [];
  nextPlayTime = 0;   // reset the schedule cursor
});

Sending Image Frames

From a file

import { readFileSync } from "fs";

const frame = readFileSync("./screenshot.jpg");
session.sendImage(frame, "image/jpeg");

Screen sharing (via browser getDisplayMedia)

const stream = await navigator.mediaDevices.getDisplayMedia({ video: true });
const video = document.createElement("video");
video.srcObject = stream;
await video.play();

const canvas = document.createElement("canvas");
const nw = video.videoWidth;
const nh = video.videoHeight;
const scale = Math.min(1, 1280 / nw);
canvas.width = Math.round(nw * scale);
canvas.height = Math.round(nh * scale);
const ctx = canvas.getContext("2d");

setInterval(() => {
  ctx.drawImage(video, 0, 0, canvas.width, canvas.height);
  canvas.toBlob((blob) => {
    blob.arrayBuffer().then((buf) => {
      // Note: spreading a large buffer into fromCharCode can exceed the
      // engine's argument limit; chunk the conversion for large frames.
      const base64 = btoa(String.fromCharCode(...new Uint8Array(buf)));
      socket.emit("vision.image", {
        data: base64,
        mimeType: "image/jpeg",
        source: "screen",
      });
    });
  }, "image/jpeg", 0.7);
}, 2000);

Camera feed (via browser getUserMedia)

const stream = await navigator.mediaDevices.getUserMedia({
  video: { width: 640, height: 480, facingMode: "user" },
  audio: false,
});
const video = document.createElement("video");
video.srcObject = stream;
await video.play();

const canvas = document.createElement("canvas");
canvas.width = video.videoWidth || 640;
canvas.height = video.videoHeight || 480;
const ctx = canvas.getContext("2d");

setInterval(() => {
  ctx.drawImage(video, 0, 0, canvas.width, canvas.height);
  canvas.toBlob((blob) => {
    blob.arrayBuffer().then((buf) => {
      const base64 = btoa(String.fromCharCode(...new Uint8Array(buf)));
      socket.emit("vision.image", {
        data: base64,
        mimeType: "image/jpeg",
        source: "camera",
      });
    });
  }, "image/jpeg", 0.7);
}, 2000);

Server-side relay (Socket.IO)

import { createVisionGateway } from "@radaros/transport";

createVisionGateway({
  agents: { assistant: visionAgent },
  io: socketIoServer,
});

Socket.IO Vision Gateway

The vision gateway relays audio, images, and text between browser clients and VisionAgent sessions. Namespace: /radaros-vision

Client-to-server events

| Event | Payload | Description |
| --- | --- | --- |
| vision.start | { agentName, voice?, apiKey?, userId?, sessionId? } | Start a vision session (with optional voice) |
| vision.audio | { data: base64 } | Send audio frame |
| vision.image | { data: base64, mimeType?, source? } | Send image/video frame (source: "screen" or "camera") |
| vision.text | { text } | Send text message |
| vision.interrupt | — | Interrupt response |
| vision.stop | — | End session |

Server-to-client events

| Event | Payload | Description |
| --- | --- | --- |
| vision.started | { userId } | Session connected |
| vision.audio | { data: base64, mimeType } | Audio response |
| vision.transcript | { text, role } | Transcript |
| vision.text | { text } | Text response |
| vision.tool.call | { name, args } | Tool invocation |
| vision.tool.result | { name, result } | Tool result |
| vision.interrupted | — | Response interrupted (stop audio playback) |
| vision.stopped | — | Session ended |
| vision.error | { error } | Error |
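Binary payloads cross the socket as base64 strings. A small sketch of building a vision.image payload on the server side (the imagePayload helper is illustrative, not part of the SDK; in the browser you would use btoa as shown in the capture examples below):

```typescript
// Build a `vision.image` payload from raw frame bytes. Binary data is sent
// as base64 over the socket. Illustrative helper, not part of the SDK.
function imagePayload(
  frame: Uint8Array,
  mimeType = "image/jpeg",
  source: "screen" | "camera" = "camera",
) {
  return {
    data: Buffer.from(frame).toString("base64"),
    mimeType,
    source,
  };
}

// Usage:
// socket.emit("vision.image", imagePayload(jpegBytes, "image/jpeg", "screen"));
```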

Tool Calling

Vision agents support the same tool system as voice and text agents:
import { defineTool } from "@radaros/core";
import { z } from "zod";

const weatherTool = defineTool({
  name: "getWeather",
  description: "Get weather for a city",
  parameters: z.object({ city: z.string() }),
  execute: async ({ city }) => `Sunny, 22C in ${city}`,
});

const agent = new VisionAgent({
  name: "assistant",
  provider: geminiVisionLive(),
  tools: [weatherTool],
});
The model can call tools based on what it sees or hears. For example, it could read a barcode from a camera frame and look up the product.

Memory

Vision agents support the same unified memory system as other agents:
const agent = new VisionAgent({
  name: "assistant",
  provider: geminiVisionLive(),
  memory: {
    storage: new SqliteStorage("vision-memory.db"),
    summary: { enabled: true },
    userProfile: { enabled: true },
  },
});
Transcripts from the vision session are persisted to memory when the session ends, enabling context across sessions.

Full Example: Screen + Camera Assistant

A complete example with screen sharing, camera feed, voice selection, multilingual auto-detection, and interruption support:
GOOGLE_API_KEY=your-key npx tsx examples/voice/screen-assistant.ts
This starts a web UI at http://localhost:4200 with:
  • Voice picker — choose from 30 Gemini voices before starting the session
  • Screen sharing — share your entire screen or a specific window
  • Camera feed — enable your webcam for face-to-face conversation
  • Microphone — speak naturally in any language
  • Text chat — type messages as an alternative to speech
  • Auto language detection — speak Hindi, English, Spanish, etc. and the agent responds in your language
  • Interruption — start speaking to immediately stop the agent’s response
import { VisionAgent, geminiVisionLive } from "@radaros/core";

const INSTRUCTIONS = [
  "You are a helpful multilingual vision assistant.",
  "You can see the user's screen and/or camera feed, and hear them speak.",
  "LANGUAGE RULE: Always detect the language the user is speaking and respond in that same language.",
  "If the user switches languages mid-conversation, switch with them immediately.",
  "Keep responses conversational and concise. Speak naturally in the user's language.",
].join(" ");

// Create a per-session agent with the user's chosen voice
const agent = new VisionAgent({
  name: "VisionAssistant",
  provider: geminiVisionLive("gemini-3.1-flash-live-preview"),
  instructions: INSTRUCTIONS,
  voice: "Puck",     // or let the user pick
  fps: 1,
  thinkingLevel: "low",
});

const session = await agent.connect();

// Forward screen/camera frames
session.sendImage(jpegBuffer, "image/jpeg");

// Forward microphone audio
session.sendAudio(pcm16kBuffer);

// Handle responses
session.on("audio", ({ data }) => { /* play PCM 24kHz */ });
session.on("transcript", ({ text, role }) => { /* show text */ });
session.on("interrupted", () => { /* stop audio playback */ });