Vision Agents
RadarOS supports real-time vision+audio conversations through the VisionAgent class. Vision agents stream both audio and video frames to multimodal models, enabling agents that can see camera feeds, screen shares, or images while having a spoken conversation.
VisionAgent is a separate class from VoiceAgent with its own VisionProvider interface, designed for clean multi-provider support.
Quick Start
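A minimal setup looks roughly like the sketch below. The import path, class names, and method names (`VisionAgent`, `GeminiVisionProvider`, `connect`, `sendImage`) are assumptions inferred from this page, not exact RadarOS signatures:

```typescript
// Hypothetical import — the actual package entry point may differ:
// import { VisionAgent, GeminiVisionProvider } from "radaros";

// Minimal agent configuration (see the full reference below):
const quickStartConfig = {
  name: "screen-helper",
  instructions: "You can see the user's screen. Answer questions about it.",
  voice: "Puck",
  fps: 1,
};

// Sketch of a session — method names are assumptions:
// const agent = new VisionAgent({
//   ...quickStartConfig,
//   provider: new GeminiVisionProvider({ apiKey: process.env.GOOGLE_API_KEY }),
// });
// const session = await agent.connect();
// session.sendImage(frameBase64, "image/jpeg");
// session.sendText("What is on my screen?");
```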
Architecture: Agent vs VoiceAgent vs VisionAgent
RadarOS has three independent agent tiers:

| Class | Provider | Input | Output | Use Case |
|---|---|---|---|---|
Agent | ModelProvider | Text | Text + tools | Chat, RAG, workflows |
VoiceAgent | RealtimeProvider | Audio | Audio + text | Voice assistants |
VisionAgent | VisionProvider | Audio + Images | Audio + text | Vision assistants, video analysis |
VisionProvider Interface
Any provider that supports audio + video can implement this interface. The key addition over RealtimeConnection (voice) is sendImage() — this is what makes a provider "vision-capable."
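As a sketch, the provider/connection pair might look like the following. The method names and option shapes here are inferred from this page and are assumptions, not the exact RadarOS signatures; a tiny in-memory mock shows the shape:

```typescript
// Assumed shape of the vision interfaces — verify against the real RadarOS types.
interface VisionConnection {
  sendAudio(base64: string): void;
  sendImage(base64: string, mimeType: string): void; // the vision-specific method
  sendText(text: string): void;
  interrupt(): void;
  close(): Promise<void>;
}

interface VisionProvider {
  connect(opts: { instructions?: string; voice?: string }): Promise<VisionConnection>;
}

// Minimal in-memory implementation, purely to illustrate the contract:
class MockVisionProvider implements VisionProvider {
  sentImages: string[] = [];
  async connect(_opts: { instructions?: string; voice?: string }): Promise<VisionConnection> {
    return {
      sendAudio: () => {},
      sendImage: (_b64, mimeType) => this.sentImages.push(mimeType),
      sendText: () => {},
      interrupt: () => {},
      close: async () => {},
    };
  }
}
```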
Gemini 3.1 Flash Live
The first VisionProvider implementation uses Google’s Gemini 3.1 Flash Live model.
- Input: Text, images, audio, video
- Output: Text and audio
- Context: 131K input, 65K output
- Features: Function calling, search grounding, thinking
- Thinking: Uses thinkingLevel (“minimal”, “low”, “medium”, “high”) instead of thinkingBudget
Configuration
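An example configuration exercising most of the properties documented below. The provider construction is commented out because the exact provider class name is an assumption:

```typescript
// Config object following the reference table below.
const config = {
  name: "support-agent",
  // provider: new GeminiVisionProvider({ apiKey: process.env.GOOGLE_API_KEY }),
  instructions: "Help the user debug whatever is on their screen.",
  voice: "Charon",               // an informative voice (see Voice Selection)
  language: "en-US",             // BCP-47 code; omit to auto-detect
  fps: 1,                        // one frame per second is usually enough
  thinkingLevel: "low" as const, // "minimal" | "low" | "medium" | "high"
  temperature: 0.7,
};
```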
Full Configuration Reference
| Property | Type | Default | Description |
|---|---|---|---|
name | string | required | Agent name for logging and identification |
provider | VisionProvider | required | The vision provider to use |
instructions | string | — | System prompt / instructions |
voice | string | Provider default | Voice name for audio responses |
language | string | Auto-detect | BCP-47 language code (e.g. "en-US", "hi-IN") |
fps | number | 1 | Suggested video frame rate |
thinkingLevel | "minimal" \| "low" \| "medium" \| "high" | "minimal" | Reasoning depth (Gemini 3.1+)
temperature | number | — | Sampling temperature |
tools | ToolDef[] | — | Tools the agent can call |
memory | UnifiedMemoryConfig | — | Memory configuration |
logLevel | LogLevel | — | Logging verbosity |
Voice Selection
Gemini Live supports 30 prebuilt voices, each with a distinct personality. Set the voice property on the agent config:
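For example, a hypothetical helper that builds a per-user config from a voice picked in your UI (the config shape follows the reference table above; the helper itself is illustrative):

```typescript
// Build an agent config around a user-picked voice name.
function configForVoice(pickedVoice: string) {
  return {
    name: "vision-assistant",
    voice: pickedVoice, // any of the 30 names in the table below
  };
}
```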
Available Voices
| Voice | Personality | Voice | Personality |
|---|---|---|---|
| Puck | Upbeat | Laomedeia | Upbeat |
| Kore | Firm | Schedar | Even |
| Charon | Informative | Achird | Friendly |
| Fenrir | Excitable | Sadachbia | Lively |
| Aoede | Breezy | Enceladus | Breathy |
| Zephyr | Bright | Algieba | Smooth |
| Orus | Firm | Algenib | Gravelly |
| Autonoe | Bright | Achernar | Soft |
| Umbriel | Easy-going | Gacrux | Mature |
| Erinome | Clear | Zubenelgenubi | Casual |
| Leda | Youthful | Sadaltager | Knowledgeable |
| Callirrhoe | Easy-going | Rasalgethi | Informative |
| Iapetus | Clear | Alnilam | Firm |
| Despina | Smooth | Pulcherrima | Forward |
| Vindemiatrix | Gentle | Sulafat | Warm |
Per-Session Voice Selection
You can create agents with different voices per session — useful for letting users pick a voice at runtime.
Multilingual Support
Vision agents support 24+ languages with automatic language detection.
Auto-Detection (Recommended)
The agent detects the user’s spoken language and responds in kind — no configuration needed. Optionally, add a language rule to your instructions.
Explicit Language (Optional)
Force a specific language using the language config with a BCP-47 code:
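For example, pinning the agent to Hindi (the config shape follows the reference table above):

```typescript
// Force Hindi responses regardless of the user's spoken language.
const hindiConfig = {
  name: "hindi-assistant",
  language: "hi-IN", // BCP-47 code from the supported-language table below
};
```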
Supported Languages
| Language | Code | Language | Code |
|---|---|---|---|
| English (US) | en-US | Korean | ko-KR |
| Hindi | hi-IN | Japanese | ja-JP |
| Spanish (US) | es-US | Arabic (Egyptian) | ar-EG |
| French | fr-FR | Bengali | bn-BD |
| German | de-DE | Marathi | mr-IN |
| Portuguese (Brazil) | pt-BR | Tamil | ta-IN |
| Italian | it-IT | Telugu | te-IN |
| Dutch | nl-NL | Thai | th-TH |
| Polish | pl-PL | Turkish | tr-TR |
| Romanian | ro-RO | Ukrainian | uk-UA |
| Russian | ru-RU | Vietnamese | vi-VN |
| Indonesian | id-ID | English (India) | en-IN |
Audio Interruption
When the user starts speaking while the agent is responding, the agent automatically stops its current response. This is handled by Gemini Live’s built-in Voice Activity Detection (VAD).
Server-Side
The VisionSession emits an interrupted event:
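A sketch of handling it, assuming VisionSession follows Node’s EventEmitter conventions (that assumption is not confirmed by this page; a plain EventEmitter stands in for the session here):

```typescript
import { EventEmitter } from "node:events";

// Stand-in for a VisionSession; the real class's emitter surface is assumed.
const session = new EventEmitter();

let queuedAudioChunks = 3;
session.on("interrupted", () => {
  // In a real server you would forward this to the browser, e.g.
  // socket.emit("vision.interrupted"), so it can flush playback.
  queuedAudioChunks = 0;
});

session.emit("interrupted");
```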
Client-Side
Stop all queued audio playback when interrupted.
Sending Image Frames
From a file
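A sketch of sending a single image from disk. The session’s sendImage(base64, mimeType) signature is an assumption; adjust it to the real VisionSession API:

```typescript
import { readFile } from "node:fs/promises";

// Read an image from disk and send it to the session as a base64 frame.
async function sendImageFile(
  session: { sendImage(base64: string, mimeType: string): void },
  path: string,
  mimeType = "image/jpeg",
): Promise<void> {
  const bytes = await readFile(path);
  session.sendImage(bytes.toString("base64"), mimeType);
}
```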
Screen sharing (via browser getDisplayMedia)
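A browser-side sketch: capture the screen with getDisplayMedia, draw frames to a canvas, and emit one JPEG per second. The "vision.image" payload matches the gateway table below; the socket parameter can be any emitter (e.g. a socket.io-client socket):

```typescript
// Strip the "data:image/jpeg;base64," prefix from canvas.toDataURL() output.
function dataUrlToBase64(dataUrl: string): string {
  return dataUrl.slice(dataUrl.indexOf(",") + 1);
}

async function startScreenShare(
  socket: { emit(event: string, payload: unknown): void },
  fps = 1,
): Promise<void> {
  const stream = await navigator.mediaDevices.getDisplayMedia({ video: true });
  const video = document.createElement("video");
  video.srcObject = stream;
  await video.play();

  const canvas = document.createElement("canvas");
  const ctx = canvas.getContext("2d")!;
  setInterval(() => {
    canvas.width = video.videoWidth;
    canvas.height = video.videoHeight;
    ctx.drawImage(video, 0, 0);
    socket.emit("vision.image", {
      data: dataUrlToBase64(canvas.toDataURL("image/jpeg", 0.7)),
      mimeType: "image/jpeg",
      source: "screen",
    });
  }, 1000 / fps);
}
```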
Camera feed (via browser getUserMedia)
Server-side relay (Socket.IO)
Socket.IO Vision Gateway
The vision gateway relays audio, images, and text between browser clients and VisionAgent sessions. Namespace: /radaros-vision
Client-to-server events
| Event | Payload | Description |
|---|---|---|
vision.start | { agentName, voice?, apiKey?, userId?, sessionId? } | Start a vision session (with optional voice) |
vision.audio | { data: base64 } | Send audio frame |
vision.image | { data: base64, mimeType?, source? } | Send image/video frame (source: “screen” or “camera”) |
vision.text | { text } | Send text message |
vision.interrupt | — | Interrupt response |
vision.stop | — | End session |
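The payloads above can be expressed as types, with a small builder for image frames. Event and field names come from the table; the builder itself is just an illustrative convenience:

```typescript
// Client-to-server payload shapes for the vision gateway.
type VisionStartPayload = {
  agentName: string;
  voice?: string;
  apiKey?: string;
  userId?: string;
  sessionId?: string;
};

type VisionImagePayload = {
  data: string; // base64-encoded image bytes
  mimeType?: string;
  source?: "screen" | "camera";
};

function imageFrame(base64: string, source: "screen" | "camera"): VisionImagePayload {
  return { data: base64, mimeType: "image/jpeg", source };
}

// With a connected socket.io-client socket:
// socket.emit("vision.start", { agentName: "vision-assistant", voice: "Puck" } satisfies VisionStartPayload);
// socket.emit("vision.image", imageFrame(frameBase64, "screen"));
```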
Server-to-client events
| Event | Payload | Description |
|---|---|---|
vision.started | { userId } | Session connected |
vision.audio | { data: base64, mimeType } | Audio response |
vision.transcript | { text, role } | Transcript |
vision.text | { text } | Text response |
vision.tool.call | { name, args } | Tool invocation |
vision.tool.result | { name, result } | Tool result |
vision.interrupted | — | Response interrupted (stop audio playback) |
vision.stopped | — | Session ended |
vision.error | { error } | Error |
Tool Calling
Vision agents support the same tool system as voice and text agents.
Memory
Vision agents support the same unified memory system as other agents.
Full Example: Screen + Camera Assistant
A complete example combines screen sharing, camera feed, voice selection, multilingual auto-detection, and interruption support. Run it and open http://localhost:4200 to get:
- Voice picker — choose from 30 Gemini voices before starting the session
- Screen sharing — share your entire screen or a specific window
- Camera feed — enable your webcam for face-to-face conversation
- Microphone — speak naturally in any language
- Text chat — type messages as an alternative to speech
- Auto language detection — speak Hindi, English, Spanish, etc. and the agent responds in your language
- Interruption — start speaking to immediately stop the agent’s response