
Vision Agents

RadarOS supports real-time vision+audio conversations through the VisionAgent class. Vision agents stream both audio and video frames to multimodal models, enabling agents that can see camera feeds, screen shares, or images while having a spoken conversation. VisionAgent is a separate class from VoiceAgent with its own VisionProvider interface, designed for clean multi-provider support.

Quick Start

npm install @radaros/core @google/genai
import { VisionAgent, geminiVisionLive } from "@radaros/core";
import { readFileSync } from "fs";

const agent = new VisionAgent({
  name: "vision-assistant",
  provider: geminiVisionLive("gemini-3.1-flash-live-preview"),
  instructions: "You can see and hear. Describe what you see when asked.",
  voice: "Aoede",
  fps: 1,
  thinkingLevel: "minimal",
});

const session = await agent.connect();

// Send an image frame (JPEG from camera, screenshot, etc.)
const frame = readFileSync("./photo.jpg");
session.sendImage(frame, "image/jpeg");

// Send audio (PCM 16kHz)
session.sendAudio(micBuffer);

// Send text
session.sendText("What do you see?");

// Listen for responses
session.on("transcript", ({ text, role }) => {
  console.log(`[${role}] ${text}`);
});

session.on("audio", ({ data }) => {
  // Play audio response (PCM 24kHz)
});

session.on("interrupted", () => {
  // User started speaking — stop playback
});
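sendAudio() expects raw 16 kHz, 16-bit PCM. Browser microphone capture via the Web Audio API yields Float32 samples in [-1, 1], so they must be converted before sending. A minimal conversion sketch (the helper name floatTo16BitPCM is illustrative, not part of the SDK):

```typescript
// Convert Web Audio Float32 samples ([-1, 1]) to 16-bit PCM,
// the sample format sendAudio() expects. Illustrative helper, not SDK API.
function floatTo16BitPCM(samples: Float32Array): Int16Array {
  const pcm = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    // Clamp first so out-of-range floats don't wrap around on conversion.
    const s = Math.max(-1, Math.min(1, samples[i]));
    pcm[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return pcm;
}
```

Wrap the result in a Buffer before sending, e.g. `session.sendAudio(Buffer.from(pcm.buffer))`. Note that most browsers capture at 44.1 or 48 kHz, so resampling to 16 kHz is a separate step.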

Architecture: Agent vs VoiceAgent vs VisionAgent

RadarOS has three independent agent tiers:
| Class | Provider | Input | Output | Use Case |
| --- | --- | --- | --- | --- |
| Agent | ModelProvider | Text | Text + tools | Chat, RAG, workflows |
| VoiceAgent | RealtimeProvider | Audio | Audio + text | Voice assistants |
| VisionAgent | VisionProvider | Audio + Images | Audio + text | Vision assistants, video analysis |
Each tier has its own provider interface. Existing code is never affected when a new tier is added.

VisionProvider Interface

Any provider that supports audio + video can implement this:
interface VisionProvider {
  readonly providerId: string;
  readonly modelId: string;
  connect(config: VisionSessionConfig): Promise<VisionConnection>;
}

interface VisionConnection {
  sendAudio(data: Buffer): void;
  sendImage(data: Buffer, mimeType?: string): void;
  sendText(text: string): void;
  sendToolResult(callId: string, result: string): void;
  interrupt(): void;
  close(): Promise<void>;
  on(event: string, handler: (payload: unknown) => void): void;
  off(event: string, handler: (payload: unknown) => void): void;
}
The difference from RealtimeConnection (voice) is sendImage() — this is what makes a provider “vision-capable.”
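To illustrate the shape of the interface, here is a minimal in-memory provider. The echo behavior and the EchoConnection/echoVisionProvider names are invented for illustration; a real provider would stream audio and frames to a model backend over a websocket:

```typescript
// Illustrative VisionProvider sketch: echoes text back as a transcript event.
// Not a real model integration — audio/image inputs are accepted but ignored.
type Handler = (payload: any) => void;

class EchoConnection {
  private handlers = new Map<string, Set<Handler>>();
  sendAudio(_data: Buffer): void {}                      // no-op in this sketch
  sendImage(_data: Buffer, _mimeType?: string): void {}  // no-op in this sketch
  sendText(text: string): void {
    this.emit("transcript", { text, role: "assistant" });
  }
  sendToolResult(_callId: string, _result: string): void {}
  interrupt(): void { this.emit("interrupted", {}); }
  async close(): Promise<void> {}
  on(event: string, handler: Handler): void {
    if (!this.handlers.has(event)) this.handlers.set(event, new Set());
    this.handlers.get(event)!.add(handler);
  }
  off(event: string, handler: Handler): void {
    this.handlers.get(event)?.delete(handler);
  }
  private emit(event: string, payload: any): void {
    this.handlers.get(event)?.forEach((h) => h(payload));
  }
}

const echoVisionProvider = {
  providerId: "echo",
  modelId: "echo-v1",
  async connect(_config: unknown): Promise<EchoConnection> {
    return new EchoConnection();
  },
};
```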

Gemini 3.1 Flash Live

The first VisionProvider implementation uses Google’s Gemini 3.1 Flash Live model.
import { geminiVisionLive } from "@radaros/core";

const provider = geminiVisionLive("gemini-3.1-flash-live-preview");
Model specs:
  • Input: Text, images, audio, video
  • Output: Text and audio
  • Context: 131K input, 65K output
  • Features: Function calling, search grounding, thinking
  • Thinking: Uses thinkingLevel (“minimal”, “low”, “medium”, “high”) instead of thinkingBudget

Configuration

const agent = new VisionAgent({
  name: "assistant",
  provider: geminiVisionLive(),
  instructions: "You are a helpful vision assistant.",
  voice: "Aoede",              // voice selection (30 voices available)
  language: "hi-IN",           // optional — force a BCP-47 language code
  fps: 1,                      // frame rate hint
  thinkingLevel: "minimal",    // lowest latency
  temperature: 0.7,
  tools: [myTool],
  memory: { /* UnifiedMemoryConfig */ },
});

Full Configuration Reference

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| name | string | required | Agent name for logging and identification |
| provider | VisionProvider | required | The vision provider to use |
| instructions | string | — | System prompt / instructions |
| voice | string | Provider default | Voice name for audio responses |
| language | string | Auto-detect | BCP-47 language code (e.g. "en-US", "hi-IN") |
| fps | number | 1 | Suggested video frame rate |
| thinkingLevel | "minimal" \| "low" \| "medium" \| "high" | "minimal" | Reasoning depth (Gemini 3.1+) |
| temperature | number | — | Sampling temperature |
| tools | ToolDef[] | — | Tools the agent can call |
| memory | UnifiedMemoryConfig | — | Memory configuration |
| logLevel | LogLevel | — | Logging verbosity |

Voice Selection

Gemini Live supports 30 prebuilt voices, each with a distinct personality. Set the voice property on the agent config:
const agent = new VisionAgent({
  name: "assistant",
  provider: geminiVisionLive(),
  voice: "Puck",  // Upbeat personality
});

Available Voices

| Voice | Personality | Voice | Personality |
| --- | --- | --- | --- |
| Puck | Upbeat | Laomedeia | Upbeat |
| Kore | Firm | Schedar | Even |
| Charon | Informative | Achird | Friendly |
| Fenrir | Excitable | Sadachbia | Lively |
| Aoede | Breezy | Enceladus | Breathy |
| Zephyr | Bright | Algieba | Smooth |
| Orus | Firm | Algenib | Gravelly |
| Autonoe | Bright | Achernar | Soft |
| Umbriel | Easy-going | Gacrux | Mature |
| Erinome | Clear | Zubenelgenubi | Casual |
| Leda | Youthful | Sadaltager | Knowledgeable |
| Callirrhoe | Easy-going | Rasalgethi | Informative |
| Iapetus | Clear | Alnilam | Firm |
| Despina | Smooth | Pulcherrima | Forward |
| Vindemiatrix | Gentle | Sulafat | Warm |

Per-Session Voice Selection

You can create agents with different voices per session — useful for letting users pick:
socket.on("vision.start", async ({ voice }) => {
  const agent = new VisionAgent({
    name: "assistant",
    provider: geminiVisionLive(),
    instructions: "You are a helpful assistant.",
    voice: voice || undefined,  // user's choice or provider default
  });
  const session = await agent.connect();
});

Multilingual Support

Vision agents support 24+ languages with automatic language detection: the agent detects the user’s spoken language and responds in kind. To make this behavior explicit and robust, add a language rule to your instructions:
const agent = new VisionAgent({
  name: "assistant",
  provider: geminiVisionLive(),
  instructions: [
    "You are a helpful multilingual vision assistant.",
    "LANGUAGE RULE: Always detect the language the user is speaking and respond in that same language.",
    "If the user switches languages mid-conversation, switch with them immediately.",
  ].join(" "),
});
The user can speak Hindi, switch to English, then ask something in Japanese — the agent follows naturally.

Explicit Language (Optional)

Force a specific language using the language config with a BCP-47 code:
const agent = new VisionAgent({
  name: "assistant",
  provider: geminiVisionLive(),
  language: "hi-IN",  // always respond in Hindi
});

Supported Languages

| Language | Code | Language | Code |
| --- | --- | --- | --- |
| English (US) | en-US | Korean | ko-KR |
| Hindi | hi-IN | Japanese | ja-JP |
| Spanish (US) | es-US | Arabic (Egyptian) | ar-EG |
| French | fr-FR | Bengali | bn-BD |
| German | de-DE | Marathi | mr-IN |
| Portuguese (Brazil) | pt-BR | Tamil | ta-IN |
| Italian | it-IT | Telugu | te-IN |
| Dutch | nl-NL | Thai | th-TH |
| Polish | pl-PL | Turkish | tr-TR |
| Romanian | ro-RO | Ukrainian | uk-UA |
| Russian | ru-RU | Vietnamese | vi-VN |
| Indonesian | id-ID | English (India) | en-IN |

Audio Interruption

When the user starts speaking while the agent is responding, the agent automatically stops its current response. This is handled by Gemini Live’s built-in Voice Activity Detection (VAD).

Server-Side

The VisionSession emits an interrupted event:
const session = await agent.connect();

session.on("interrupted", () => {
  console.log("User interrupted — response stopped");
  // Forward to client to stop audio playback
});

Client-Side

Stop all queued audio playback when interrupted:
const playbackCtx = new AudioContext();
let nextPlayTime = 0;   // schedule cursor for gapless playback
let activeSources = [];

socket.on("vision.audio", (data) => {
  const source = playbackCtx.createBufferSource();
  // ... decode the PCM 24kHz payload into an AudioBuffer, connect, start ...
  activeSources.push(source);
  source.onended = () => {
    activeSources = activeSources.filter((s) => s !== source);
  };
});

socket.on("vision.interrupted", () => {
  activeSources.forEach((s) => { try { s.stop(); } catch (e) {} });
  activeSources = [];
  nextPlayTime = 0;   // reset the schedule cursor
});

Sending Image Frames

From a file

import { readFileSync } from "fs";

const frame = readFileSync("./screenshot.jpg");
session.sendImage(frame, "image/jpeg");

Screen sharing (via browser getDisplayMedia)

const stream = await navigator.mediaDevices.getDisplayMedia({ video: true });
const video = document.createElement("video");
video.srcObject = stream;
await video.play();

const canvas = document.createElement("canvas");
const nw = video.videoWidth;
const nh = video.videoHeight;
const scale = Math.min(1, 1280 / nw);
canvas.width = Math.round(nw * scale);
canvas.height = Math.round(nh * scale);
const ctx = canvas.getContext("2d");

setInterval(() => {
  ctx.drawImage(video, 0, 0, canvas.width, canvas.height);
  canvas.toBlob((blob) => {
    blob.arrayBuffer().then((buf) => {
      // Note: spreading a large buffer into fromCharCode can exceed the
      // engine's argument limit; chunk the conversion for large frames.
      const base64 = btoa(String.fromCharCode(...new Uint8Array(buf)));
      socket.emit("vision.image", {
        data: base64,
        mimeType: "image/jpeg",
        source: "screen",
      });
    });
  }, "image/jpeg", 0.7);
}, 2000);

Camera feed (via browser getUserMedia)

const stream = await navigator.mediaDevices.getUserMedia({
  video: { width: 640, height: 480, facingMode: "user" },
  audio: false,
});
const video = document.createElement("video");
video.srcObject = stream;
await video.play();

const canvas = document.createElement("canvas");
canvas.width = video.videoWidth || 640;
canvas.height = video.videoHeight || 480;
const ctx = canvas.getContext("2d");

setInterval(() => {
  ctx.drawImage(video, 0, 0, canvas.width, canvas.height);
  canvas.toBlob((blob) => {
    blob.arrayBuffer().then((buf) => {
      const base64 = btoa(String.fromCharCode(...new Uint8Array(buf)));
      socket.emit("vision.image", {
        data: base64,
        mimeType: "image/jpeg",
        source: "camera",
      });
    });
  }, "image/jpeg", 0.7);
}, 2000);

Server-side relay (Socket.IO)

import { createVisionGateway } from "@radaros/transport";

createVisionGateway({
  agents: { assistant: visionAgent },
  io: socketIoServer,
});

Socket.IO Vision Gateway

The vision gateway relays audio, images, and text between browser clients and VisionAgent sessions. Namespace: /radaros-vision

Client-to-server events

| Event | Payload | Description |
| --- | --- | --- |
| vision.start | { agentName, voice?, apiKey?, userId?, sessionId? } | Start a vision session (with optional voice) |
| vision.audio | { data: base64 } | Send audio frame |
| vision.image | { data: base64, mimeType?, source? } | Send image/video frame (source: "screen" or "camera") |
| vision.text | { text } | Send text message |
| vision.interrupt | — | Interrupt response |
| vision.stop | — | End session |

Server-to-client events

| Event | Payload | Description |
| --- | --- | --- |
| vision.started | { userId } | Session connected |
| vision.audio | { data: base64, mimeType } | Audio response |
| vision.transcript | { text, role } | Transcript |
| vision.text | { text } | Text response |
| vision.tool.call | { name, args } | Tool invocation |
| vision.tool.result | { name, result } | Tool result |
| vision.interrupted | — | Response interrupted (stop audio playback) |
| vision.stopped | — | Session ended |
| vision.error | { error } | Error |
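Binary payloads cross the socket as base64 strings. A small sketch of building a vision.image payload on the server side (the imagePayload helper is illustrative, not part of the SDK; in the browser you would use btoa as shown in the capture examples below):

```typescript
// Build a `vision.image` payload from raw frame bytes. Binary data is sent
// as base64 over the socket. Illustrative helper, not part of the SDK.
function imagePayload(
  frame: Uint8Array,
  mimeType = "image/jpeg",
  source: "screen" | "camera" = "camera",
) {
  return {
    data: Buffer.from(frame).toString("base64"),
    mimeType,
    source,
  };
}

// Usage:
// socket.emit("vision.image", imagePayload(jpegBytes, "image/jpeg", "screen"));
```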

Tool Calling

Vision agents support the same tool system as voice and text agents:
import { defineTool } from "@radaros/core";
import { z } from "zod";

const weatherTool = defineTool({
  name: "getWeather",
  description: "Get weather for a city",
  parameters: z.object({ city: z.string() }),
  execute: async ({ city }) => `Sunny, 22C in ${city}`,
});

const agent = new VisionAgent({
  name: "assistant",
  provider: geminiVisionLive(),
  tools: [weatherTool],
});
The model can call tools based on what it sees or hears. For example, it could read a barcode from a camera frame and look up the product.

Memory

Vision agents support the same unified memory system as other agents:
const agent = new VisionAgent({
  name: "assistant",
  provider: geminiVisionLive(),
  memory: {
    storage: new SqliteStorage("vision-memory.db"),
    summary: { enabled: true },
    userProfile: { enabled: true },
  },
});
Transcripts from the vision session are persisted to memory when the session ends, enabling context across sessions.

Full Example: Screen + Camera Assistant

A complete example with screen sharing, camera feed, voice selection, multilingual auto-detection, and interruption support:
GOOGLE_API_KEY=your-key npx tsx examples/voice/screen-assistant.ts
This starts a web UI at http://localhost:4200 with:
  • Voice picker — choose from 30 Gemini voices before starting the session
  • Screen sharing — share your entire screen or a specific window
  • Camera feed — enable your webcam for face-to-face conversation
  • Microphone — speak naturally in any language
  • Text chat — type messages as an alternative to speech
  • Auto language detection — speak Hindi, English, Spanish, etc. and the agent responds in your language
  • Interruption — start speaking to immediately stop the agent’s response
import { VisionAgent, geminiVisionLive } from "@radaros/core";

const INSTRUCTIONS = [
  "You are a helpful multilingual vision assistant.",
  "You can see the user's screen and/or camera feed, and hear them speak.",
  "LANGUAGE RULE: Always detect the language the user is speaking and respond in that same language.",
  "If the user switches languages mid-conversation, switch with them immediately.",
  "Keep responses conversational and concise. Speak naturally in the user's language.",
].join(" ");

// Create a per-session agent with the user's chosen voice
const agent = new VisionAgent({
  name: "VisionAssistant",
  provider: geminiVisionLive("gemini-3.1-flash-live-preview"),
  instructions: INSTRUCTIONS,
  voice: "Puck",     // or let the user pick
  fps: 1,
  thinkingLevel: "low",
});

const session = await agent.connect();

// Forward screen/camera frames
session.sendImage(jpegBuffer, "image/jpeg");

// Forward microphone audio
session.sendAudio(pcm16kBuffer);

// Handle responses
session.on("audio", ({ data }) => { /* play PCM 24kHz */ });
session.on("transcript", ({ text, role }) => { /* show text */ });
session.on("interrupted", () => { /* stop audio playback */ });