# Voice Agents
RadarOS supports real-time voice conversations through the `VoiceAgent` class. Voice agents connect to speech-to-speech APIs (OpenAI Realtime, Google Gemini Live) over WebSocket and handle audio streaming, tool calling, and persistent user memory — all with the same patterns as regular text agents.
Voice agents use a separate `RealtimeProvider` interface (not the regular `ModelProvider`). The realtime API manages its own conversation context within the WebSocket connection.

## Quick Start
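A minimal end-to-end sketch. The import path and exact option names are assumptions inferred from this page (see the config and `connect()` sections below); check the package exports for the real API:

```typescript
// Import path is an assumption; adjust to your RadarOS installation.
import { VoiceAgent, openaiRealtime } from "radaros";

const agent = new VoiceAgent({
  name: "assistant",
  provider: openaiRealtime(), // shorthand helper (see Realtime Providers)
  instructions: "You are a concise voice assistant.",
  voice: "alloy",
});

// connect() opens the realtime WebSocket and returns a VoiceSession.
const session = await agent.connect({ userId: "akash" });

session.on("transcript", ({ text, role }) => console.log(`${role}: ${text}`));
session.sendText("Hello!"); // triggers a spoken response
```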
## Architecture

Voice agents have a layered architecture:

- **VoiceAgent**: Orchestrator. Manages the realtime connection, tools, user memory, and session lifecycle.
- **RealtimeProvider**: WebSocket adapter for a specific speech-to-speech API. Translates between RadarOS events and the provider’s protocol.
- **Voice Gateway**: Thin Socket.IO relay. Bridges browser audio to `VoiceAgent`. No business logic.
## VoiceAgent Config

- Name of the voice agent.
- The realtime provider to use. Use the shorthand helpers `openaiRealtime()` or `googleLive()`, or instantiate `OpenAIRealtimeProvider` / `GoogleLiveProvider` directly.
- System instructions for the voice agent. User memory facts are automatically appended on connect.
- Tools the agent can call during a voice conversation. Same `defineTool()` API as regular agents.
- Voice to use for speech synthesis (e.g., `"alloy"`, `"shimmer"`, `"echo"`). Provider-specific.
- Cross-session user memory. Facts are loaded into instructions on connect and auto-extracted from transcripts on disconnect.
- LLM model used by `UserMemory` for auto-extracting facts from conversation transcripts. Required when `userMemory` is set.
- Default user ID. Can be overridden per `connect()` call.
- Temperature for response generation.
- Server-side voice activity detection config. Set to `null` to disable.
- Logging level: `"debug"`, `"info"`, `"warn"`, `"error"`, `"silent"`.

## connect()
Call `connect()` to start a voice session:
1. Loads user facts from `UserMemory` (if configured) and appends them to instructions
2. Opens a WebSocket to the realtime provider
3. Sends session config (instructions, tools, voice, etc.)
4. Returns a `VoiceSession` handle
## VoiceSession

The session handle returned by `connect()`:
| Method | Description |
|---|---|
| `sendAudio(data: Buffer)` | Send raw PCM audio to the agent |
| `sendText(text: string)` | Send a text message (triggers a spoken response) |
| `interrupt()` | Interrupt the current response |
| `close()` | End the session. Triggers user memory extraction. |
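Putting the methods together. Method names come from the table above; `agent` and `micStream` are assumed to exist (a configured `VoiceAgent` and a raw PCM16 microphone stream):

```typescript
const session = await agent.connect({ userId: "akash" });

// Stream raw PCM16 mic audio to the agent.
micStream.on("data", (chunk: Buffer) => session.sendAudio(chunk));

// Or drive the conversation with text; the agent replies with speech.
session.sendText("Summarize my last meeting.");

// Barge-in: stop the current spoken response.
session.interrupt();

// End the session; this also triggers user memory extraction.
session.close();
```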
## Events
| Event | Payload | Description |
|---|---|---|
| `audio` | `{ data: Buffer, mimeType: string }` | Audio response chunk (PCM16) |
| `transcript` | `{ text: string, role: "user" \| "assistant" }` | Speech-to-text transcript |
| `text` | `{ text: string }` | Text-only response delta |
| `tool_call_start` | `{ name: string, args: unknown }` | Tool call initiated |
| `tool_result` | `{ name: string, result: string }` | Tool call completed |
| `interrupted` | `{}` | Response was interrupted |
| `error` | `{ error: Error }` | Error occurred |
| `disconnected` | `{}` | Session ended |
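Transcript and text deltas arrive incrementally; the User Memory section notes that small deltas are merged into full messages before fact extraction. A self-contained sketch of that consolidation, independent of any RadarOS internals:

```typescript
type Role = "user" | "assistant";
interface TranscriptDelta {
  text: string;
  role: Role;
}

// Merge consecutive same-role deltas into full messages.
function consolidate(deltas: TranscriptDelta[]): TranscriptDelta[] {
  const out: TranscriptDelta[] = [];
  for (const d of deltas) {
    const last = out[out.length - 1];
    if (last && last.role === d.role) {
      last.text += d.text; // same speaker: append to the running message
    } else {
      out.push({ ...d }); // speaker changed: start a new message
    }
  }
  return out;
}

const merged = consolidate([
  { text: "what's the ", role: "user" },
  { text: "weather?", role: "user" },
  { text: "It's sunny.", role: "assistant" },
]);
// merged[0] is the full user question, merged[1] the assistant reply
```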
## Realtime Providers

### OpenAI Realtime

```bash
npm install ws
```
### Google Gemini Live

```bash
npm install @google/genai
```
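Both providers are typically reached through the shorthand helpers named in the config section. A hedged sketch (the import path and option names are assumptions; keys may instead be read from environment variables automatically):

```typescript
// Import path and option names are assumptions.
import { openaiRealtime, googleLive } from "radaros";

const openai = openaiRealtime({ apiKey: process.env.OPENAI_API_KEY });
const gemini = googleLive({ apiKey: process.env.GOOGLE_API_KEY });
```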
## User Memory in Voice

Voice agents support the same `UserMemory` as regular agents. The flow:

1. **User connects**: `connect({ userId: "akash" })` loads stored facts and appends them to the agent’s instructions.
2. **User disconnects**: On `close()` or disconnect, all transcripts are consolidated (small deltas merged into full messages) and sent to the LLM for fact extraction.

Voice agents do not use the `Memory` class (long-term summarization) or `SessionManager`. The realtime API manages its own conversation context within the WebSocket connection. Only `UserMemory` persists across sessions.

## Tool Calling
Tools work the same as regular agents. When the realtime API detects a tool call intent:

1. The provider emits a `tool_call` event
2. `VoiceAgent` executes the tool via `ToolExecutor`
3. The result is sent back to the provider
4. The agent speaks the result
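The dispatch in steps 1-3 can be sketched as a self-contained registry (names here are illustrative, not the actual `ToolExecutor` internals):

```typescript
type ToolFn = (args: unknown) => Promise<string> | string;

// Illustrative stand-in for the tool execution layer.
class ToolRegistry {
  private tools = new Map<string, ToolFn>();

  register(name: string, fn: ToolFn): void {
    this.tools.set(name, fn);
  }

  // Steps 2-3: look up the tool, run it, return a string result
  // that would be sent back to the realtime provider.
  async handleToolCall(name: string, args: unknown): Promise<string> {
    const fn = this.tools.get(name);
    if (!fn) return `Unknown tool: ${name}`;
    return await fn(args);
  }
}

const registry = new ToolRegistry();
registry.register("get_weather", (args) => {
  const { city } = args as { city: string };
  return `It's sunny in ${city}.`;
});

registry.handleToolCall("get_weather", { city: "Pune" }).then(console.log);
// prints: It's sunny in Pune.
```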
## Voice Gateway (Socket.IO)
For browser-based voice apps, use `createVoiceGateway` from `@radaros/transport`. The gateway relays browser audio to the `VoiceAgent` and streams audio/events back. All memory, session, and tool logic lives in the agent.
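A hedged wiring sketch (the gateway's exact signature and options are assumptions; consult the `@radaros/transport` docs):

```typescript
import { Server } from "socket.io";
// Named export per this page; the options shape is an assumption.
import { createVoiceGateway } from "@radaros/transport";

const io = new Server(3001, { cors: { origin: "*" } });

// Assumed shape: agents resolved by the agentName sent in voice.start.
createVoiceGateway(io, { agents: { assistant: voiceAgent } });
```

Here `voiceAgent` is a configured `VoiceAgent` instance.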
### Client-Side Events
| Event (emit) | Payload | Description |
|---|---|---|
| `voice.start` | `{ agentName, userId?, apiKey? }` | Start a voice session |
| `voice.audio` | `{ data: base64 }` | Send mic audio (PCM16, base64) |
| `voice.text` | `{ text: string }` | Send text input |
| `voice.interrupt` | — | Interrupt the current response |
| `voice.stop` | — | End the session |
| Event (listen) | Payload | Description |
|---|---|---|
| `voice.started` | `{ userId }` | Session started |
| `voice.audio` | `{ data: base64, mimeType }` | Audio response (PCM16, base64) |
| `voice.transcript` | `{ text, role }` | Transcript delta |
| `voice.tool.call` | `{ name, args }` | Tool call started |
| `voice.tool.result` | `{ name, result }` | Tool call result |
| `voice.interrupted` | — | Response interrupted |
| `voice.error` | `{ error: string }` | Error |
| `voice.stopped` | — | Session ended |
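A browser-side sketch of the emit/listen pairs above (event names come from the tables; `playChunk` is a hypothetical playback helper, and microphone capture is omitted):

```typescript
import { io } from "socket.io-client";

const socket = io("http://localhost:3001");

socket.emit("voice.start", { agentName: "assistant", userId: "akash" });

socket.on("voice.audio", ({ data, mimeType }) => {
  // Decode base64 PCM16 before playback.
  const pcm = Uint8Array.from(atob(data), (c) => c.charCodeAt(0));
  playChunk(pcm, mimeType); // hypothetical playback helper
});

socket.on("voice.transcript", ({ text, role }) => console.log(role, text));
socket.on("voice.stopped", () => socket.disconnect());
```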
## Examples
| Example | Description |
|---|---|
| `examples/voice/26-voice-openai.ts` | OpenAI voice agent with mic/speaker |
| `examples/voice/27-voice-google.ts` | Google Gemini Live voice agent |
| `examples/voice/29-voice-socketio.ts` | Full browser voice app with Socket.IO, tools, and unified memory |