# WebSocket Protocol

The `/v1/realtime` endpoint supports real-time bidirectional audio streaming with JSON control messages and binary audio frames.
## Connecting

```bash
wscat -c "ws://localhost:8000/v1/realtime?model=faster-whisper-large-v3"
```

```python
import asyncio

import websockets

async def main():
    async with websockets.connect(
        "ws://localhost:8000/v1/realtime?model=faster-whisper-large-v3"
    ) as ws:
        # Send audio frames, receive events
        ...

asyncio.run(main())
```
## Query Parameters

| Parameter | Required | Description |
|---|---|---|
| `model` | Yes | STT model ID |
## Message Flow

```
Client                           Server
  |                                  |
  | ---- [connect] ----------------> |
  | <--- session.created ----------- |
  |                                  |
  | ---- session.configure --------> |
  |                                  |
  | ---- [binary PCM frames] ------> |
  | <--- vad.speech_start ---------- |
  | <--- transcript.partial -------- |
  | <--- transcript.partial -------- |
  | <--- transcript.final ---------- |
  | <--- vad.speech_end ------------ |
  |                                  |
  | ---- tts.speak ----------------> |
  | <--- tts.speaking_start -------- |
  | <--- [binary audio] ------------ |
  | <--- tts.speaking_end ---------- |
  |                                  |
  | ---- [close] ------------------> |
```
## Client to Server Messages

### Binary Frames (Audio)

Send raw PCM audio as binary WebSocket frames:
| Property | Value |
|---|---|
| Format | PCM 16-bit signed integer |
| Sample rate | Any (resampled automatically to 16 kHz) |
| Channels | Mono (or first channel extracted) |
You can send audio at any sample rate -- the runtime automatically resamples to 16 kHz before processing.
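For example, a minimal Python client might stream a local 16-bit mono WAV file in roughly real-time chunks. This is only a sketch: the file name, chunk size, and pacing below are illustrative, not part of the protocol.

```python
import asyncio
import wave

import websockets

AUDIO_PATH = "speech.wav"  # hypothetical 16-bit mono WAV; any sample rate works
URL = "ws://localhost:8000/v1/realtime?model=faster-whisper-large-v3"

async def stream_file():
    async with websockets.connect(URL) as ws:
        with wave.open(AUDIO_PATH, "rb") as wav:
            chunk_frames = wav.getframerate() // 10  # ~100 ms of audio per chunk
            while chunk := wav.readframes(chunk_frames):
                await ws.send(chunk)      # binary frame = STT audio
                await asyncio.sleep(0.1)  # pace sends roughly in real time

asyncio.run(stream_file())
```

Pacing the sends is not required for file input, but it mimics a live microphone feed and keeps partial transcripts arriving incrementally.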
### session.configure

Configure the session after connecting. Optional -- defaults are used if not sent.
```json
{
  "type": "session.configure",
  "vad": {
    "sensitivity": "normal"
  },
  "language": "en",
  "hot_words": ["Macaw", "OpenVoice", "transcription"],
  "tts_model": "kokoro-v1"
}
```
| Field | Type | Description |
|---|---|---|
| `vad.sensitivity` | string | `high`, `normal`, or `low` |
| `language` | string | ISO 639-1 language code |
| `hot_words` | string[] | Domain-specific keywords to boost |
| `tts_model` | string | TTS model for full-duplex mode |
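As a sketch, configuring a session from Python is just a JSON text frame sent before any audio. The helper below and its defaults are illustrative, not a prescribed client API:

```python
import json

async def configure(ws, language="en", sensitivity="normal"):
    # session.configure should be the first message after connecting;
    # omitted fields fall back to server defaults.
    await ws.send(json.dumps({
        "type": "session.configure",
        "vad": {"sensitivity": sensitivity},
        "language": language,
    }))
```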
### tts.speak

Trigger text-to-speech synthesis. The server will stream audio back as binary frames.
```json
{
  "type": "tts.speak",
  "text": "Hello, how can I help you?",
  "voice": "default"
}
```
| Field | Type | Description |
|---|---|---|
| `text` | string | Text to synthesize |
| `voice` | string | Voice identifier |
Sending a new tts.speak while one is already active cancels the previous one. TTS requests do not queue -- only the latest one is processed.
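A small helper makes this last-writer-wins behavior explicit. The function is hypothetical, shown only to illustrate the semantics:

```python
import json

async def speak(ws, text, voice="default"):
    # No need to cancel first: the server drops any in-flight synthesis
    # and processes only this latest tts.speak.
    await ws.send(json.dumps({"type": "tts.speak", "text": text, "voice": voice}))
```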
### tts.cancel

Cancel the currently active TTS synthesis.

```json
{
  "type": "tts.cancel"
}
```
## Server to Client Events

### session.created

Sent immediately after the WebSocket connection is established.

```json
{
  "type": "session.created",
  "session_id": "abc123"
}
```
### vad.speech_start

Speech activity detected. The runtime has started buffering audio for transcription.

```json
{
  "type": "vad.speech_start",
  "timestamp": 1234567890.123
}
```
### transcript.partial

Intermediate transcription hypothesis. Updated as more audio arrives. Unstable -- may change with subsequent partials.

```json
{
  "type": "transcript.partial",
  "text": "Hello how can"
}
```
Partials are best-effort hypotheses. Never apply post-processing such as inverse text normalization (ITN) to partials -- they are too unstable for reliable formatting.
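A typical consumer therefore treats partials as display-only and commits text only on finals. A minimal sketch (the rendering strategy is illustrative):

```python
import json

async def consume_transcripts(ws):
    async for message in ws:
        if isinstance(message, bytes):
            continue  # binary frames carry TTS audio, not transcripts
        event = json.loads(message)
        if event["type"] == "transcript.partial":
            # Overwrite the pending line; later partials may revise it.
            print("\r" + event["text"], end="", flush=True)
        elif event["type"] == "transcript.final":
            # Commit the stable, post-processed segment.
            print("\r" + event["text"])
```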
### transcript.final

Confirmed transcription segment. This is the stable, post-processed result.

```json
{
  "type": "transcript.final",
  "text": "Hello, how can I help you today?",
  "language": "en",
  "duration": 3.42
}
```
### vad.speech_end

Speech activity has ended.

```json
{
  "type": "vad.speech_end",
  "timestamp": 1234567890.456
}
```
### tts.speaking_start

TTS synthesis has begun. STT is automatically muted during TTS to prevent feedback loops.

```json
{
  "type": "tts.speaking_start"
}
```
### Binary Frames (TTS Audio)

During TTS synthesis, the server sends binary WebSocket frames containing audio data:
| Direction | Content |
|---|---|
| Server to client (binary) | Always TTS audio |
| Client to server (binary) | Always STT audio |
There is no ambiguity in binary frame direction -- server-to-client binary frames are always TTS audio, and client-to-server binary frames are always STT audio.
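That makes client-side routing a single type check. For example (the callback names here are placeholders, not part of any client library):

```python
import json

async def route_frames(ws, play_audio, handle_event):
    async for message in ws:
        if isinstance(message, bytes):
            play_audio(message)                # always TTS audio
        else:
            handle_event(json.loads(message))  # always a JSON event
```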
### tts.speaking_end

TTS synthesis is complete. STT is automatically unmuted and resumes processing audio.

```json
{
  "type": "tts.speaking_end"
}
```
### error

An error occurred during processing.

```json
{
  "type": "error",
  "message": "Worker connection lost",
  "recoverable": true
}
```
| Field | Type | Description |
|---|---|---|
| `message` | string | Human-readable error description |
| `recoverable` | boolean | Whether the session can continue |
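A client can use `recoverable` to decide between continuing and reconnecting. A sketch of that policy (the recovery strategy itself is up to the application):

```python
async def on_error(ws, event):
    if event.get("recoverable"):
        # Transient fault: keep the session open and keep streaming.
        print(f"recoverable error: {event['message']}")
    else:
        # Fatal fault: close and reconnect with a fresh session.
        await ws.close()
        raise RuntimeError(event["message"])
```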
## Full-Duplex Mode

When a `tts_model` is configured, the WebSocket operates in full-duplex mode:

- Client streams audio for STT continuously
- Client sends `tts.speak` to trigger speech synthesis
- Server automatically mutes STT during TTS playback (via `try/finally` -- unmute is guaranteed even if TTS crashes)
- After TTS completes, STT resumes automatically
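Putting it together, a minimal full-duplex client might look like the sketch below. The audio I/O callbacks `get_mic_chunk` and `play_audio` are placeholders, and echoing final transcripts back through TTS is just demo behavior:

```python
import asyncio
import json

import websockets

URL = "ws://localhost:8000/v1/realtime?model=faster-whisper-large-v3"

async def full_duplex(get_mic_chunk, play_audio):
    async with websockets.connect(URL) as ws:
        await ws.send(json.dumps({
            "type": "session.configure",
            "tts_model": "kokoro-v1",  # enables full-duplex mode
        }))

        async def uplink():
            # Stream microphone audio continuously as binary frames.
            while True:
                await ws.send(await get_mic_chunk())

        async def downlink():
            async for message in ws:
                if isinstance(message, bytes):
                    play_audio(message)  # TTS audio
                else:
                    event = json.loads(message)
                    if event["type"] == "transcript.final":
                        # Demo: echo the user's speech back via TTS.
                        await ws.send(json.dumps(
                            {"type": "tts.speak", "text": event["text"]}))

        await asyncio.gather(uplink(), downlink())
```

Because the server mutes STT while TTS plays, the echoed audio is not re-transcribed into a feedback loop.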
See the Full-Duplex Guide for implementation details.