Macaw OpenVoice

WebSocket Protocol

The /v1/realtime endpoint supports real-time bidirectional audio streaming with JSON control messages and binary audio frames.


Connecting

Connect with wscat
wscat -c "ws://localhost:8000/v1/realtime?model=faster-whisper-large-v3"
Connect with Python websockets
import asyncio
import websockets

async def main():
    async with websockets.connect(
        "ws://localhost:8000/v1/realtime?model=faster-whisper-large-v3"
    ) as ws:
        # Send audio frames, receive events
        ...

asyncio.run(main())

Query Parameters

Parameter   Required   Description
model       Yes        STT model ID

Message Flow

Client                                          Server
  |                                                |
  |  ---- [connect] ---->                          |
  |                          <---- session.created |
  |                                                |
  |  ---- session.configure ---->                  |
  |                                                |
  |  ---- [binary PCM frames] ---->                |
  |                          <---- vad.speech_start|
  |                          <---- transcript.partial
  |                          <---- transcript.partial
  |                          <---- transcript.final|
  |                          <---- vad.speech_end  |
  |                                                |
  |  ---- tts.speak ---->                          |
  |                          <---- tts.speaking_start
  |                          <---- [binary audio]  |
  |                          <---- tts.speaking_end|
  |                                                |
  |  ---- [close] ---->                            |

Client to Server Messages

Binary Frames (Audio)

Send raw PCM audio as binary WebSocket frames:

Property      Value
Format        PCM 16-bit signed integer
Sample rate   Any (resampled automatically to 16 kHz)
Channels      Mono (or first channel extracted)

You can send audio at any sample rate -- the runtime automatically resamples to 16 kHz before processing.
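For example, a client might stream 16-bit mono PCM from a raw file in fixed-size chunks. This is a minimal sketch; the file path, chunk size, and real-time pacing are illustrative assumptions, not requirements of the protocol.

import asyncio
import websockets

CHUNK_SIZE = 3200  # 100 ms of 16 kHz 16-bit mono PCM (illustrative choice)

async def stream_pcm(path: str) -> None:
    async with websockets.connect(
        "ws://localhost:8000/v1/realtime?model=faster-whisper-large-v3"
    ) as ws:
        with open(path, "rb") as f:
            while chunk := f.read(CHUNK_SIZE):
                await ws.send(chunk)      # binary frame -> STT audio
                await asyncio.sleep(0.1)  # pace roughly at real time

asyncio.run(stream_pcm("speech.raw"))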

session.configure

Configure the session after connecting. Optional -- defaults are used if not sent.

{
  "type": "session.configure",
  "vad": {
    "sensitivity": "normal"
  },
  "language": "en",
  "hot_words": ["Macaw", "OpenVoice", "transcription"],
  "tts_model": "kokoro-v1"
}
Field             Type       Description
vad.sensitivity   string     high, normal, or low
language          string     ISO 639-1 language code
hot_words         string[]   Domain-specific keywords to boost
tts_model         string     TTS model for full-duplex mode
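In Python, the configure message is plain JSON sent as a text frame. A minimal sketch, assuming ws is an open connection from the Connecting example; the field values here are illustrative:

import json

async def configure(ws) -> None:
    # All fields are optional; omitted fields keep their defaults
    await ws.send(json.dumps({
        "type": "session.configure",
        "vad": {"sensitivity": "high"},
        "language": "en",
        "hot_words": ["Macaw", "OpenVoice"],
    }))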

tts.speak

Trigger text-to-speech synthesis. The server will stream audio back as binary frames.

{
  "type": "tts.speak",
  "text": "Hello, how can I help you?",
  "voice": "default"
}
Field   Type     Description
text    string   Text to synthesize
voice   string   Voice identifier

Sending a new tts.speak while one is already active cancels the previous one. TTS requests do not queue -- only the latest one is processed.
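A sending helper is correspondingly simple. A minimal sketch, again assuming ws is an open connection:

import json

async def speak(ws, text: str) -> None:
    # A new tts.speak replaces any synthesis already in progress;
    # use tts.cancel to stop playback without starting a new one
    await ws.send(json.dumps({
        "type": "tts.speak",
        "text": text,
        "voice": "default",
    }))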

tts.cancel

Cancel the currently active TTS synthesis.

{
  "type": "tts.cancel"
}

Server to Client Events

session.created

Sent immediately after the WebSocket connection is established.

{
  "type": "session.created",
  "session_id": "abc123"
}

vad.speech_start

Speech activity detected. The runtime has started buffering audio for transcription.

{
  "type": "vad.speech_start",
  "timestamp": 1234567890.123
}

transcript.partial

Intermediate transcription hypothesis. Updated as more audio arrives. Unstable -- may change with subsequent partials.

{
  "type": "transcript.partial",
  "text": "Hello how can"
}

Partials are best-effort hypotheses. Do not apply post-processing such as inverse text normalization (ITN) to partials -- they are too unstable for reliable formatting.

transcript.final

Confirmed transcription segment. This is the stable, post-processed result.

{
  "type": "transcript.final",
  "text": "Hello, how can I help you today?",
  "language": "en",
  "duration": 3.42
}
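Together, partials and finals support a simple display pattern: redraw one console line for each partial, then commit the text on transcript.final. A minimal sketch; handle_transcript is a hypothetical handler fed with parsed JSON events:

def handle_transcript(event: dict) -> None:
    if event["type"] == "transcript.partial":
        # Overwrite the current line -- the text may still change
        print("\r" + event["text"], end="", flush=True)
    elif event["type"] == "transcript.final":
        # Commit the stable, post-processed segment
        print("\r" + event["text"])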

vad.speech_end

Speech activity has ended.

{
  "type": "vad.speech_end",
  "timestamp": 1234567890.456
}

tts.speaking_start

TTS synthesis has begun. STT is automatically muted during TTS to prevent feedback loops.

{
  "type": "tts.speaking_start"
}

Binary Frames (TTS Audio)

During TTS synthesis, the server sends binary WebSocket frames containing audio data:

Direction                   Content
Server to client (binary)   Always TTS audio
Client to server (binary)   Always STT audio

There is no ambiguity in binary frame direction -- server-to-client binary frames are always TTS audio, and client-to-server binary frames are always STT audio.
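In the Python websockets client, this means a receive loop needs only an isinstance check to route frames. A sketch; play_audio is a hypothetical playback callback, not part of the protocol:

import json

async def receive_loop(ws) -> None:
    async for message in ws:
        if isinstance(message, bytes):
            play_audio(message)          # binary frame: always TTS audio
        else:
            event = json.loads(message)  # text frame: always a JSON event
            print("event:", event["type"])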

tts.speaking_end

TTS synthesis is complete. STT is automatically unmuted and resumes processing audio.

{
  "type": "tts.speaking_end"
}

error

An error occurred during processing.

{
  "type": "error",
  "message": "Worker connection lost",
  "recoverable": true
}
Field         Type      Description
message       string    Human-readable error description
recoverable   boolean   Whether the session can continue
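Clients can branch on the recoverable flag. A minimal sketch; the close-and-reconnect policy is an illustrative assumption:

async def handle_error(ws, event: dict) -> None:
    if event["recoverable"]:
        # Session is still usable -- log and keep streaming
        print("recoverable error:", event["message"])
    else:
        # Session cannot continue -- close and establish a new connection
        await ws.close()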

Full-Duplex Mode

When a tts_model is configured, the WebSocket operates in full-duplex mode:

  1. Client streams audio for STT continuously
  2. Client sends tts.speak to trigger speech synthesis
  3. Server automatically mutes STT during TTS playback (via try/finally -- unmute is guaranteed even if TTS crashes)
  4. After TTS completes, STT resumes automatically

See the Full-Duplex Guide for implementation details.
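Putting the pieces together, a minimal full-duplex client streams audio in one task while reading events and TTS audio in another. A sketch under illustrative assumptions: read_mic and play_audio are hypothetical audio I/O helpers, and the reply text is a placeholder.

import asyncio
import json
import websockets

async def run_full_duplex() -> None:
    url = "ws://localhost:8000/v1/realtime?model=faster-whisper-large-v3"
    async with websockets.connect(url) as ws:
        await ws.send(json.dumps({
            "type": "session.configure",
            "tts_model": "kokoro-v1",  # enables full-duplex mode
        }))

        async def send_audio():
            while True:
                # read_mic is a hypothetical blocking 16-bit PCM capture
                chunk = await asyncio.to_thread(read_mic)
                await ws.send(chunk)

        async def receive():
            async for message in ws:
                if isinstance(message, bytes):
                    play_audio(message)  # hypothetical audio playback
                elif json.loads(message)["type"] == "transcript.final":
                    # Reply to the user; STT is muted while TTS plays
                    await ws.send(json.dumps({
                        "type": "tts.speak",
                        "text": "Got it.",
                        "voice": "default",
                    }))

        await asyncio.gather(send_audio(), receive())

asyncio.run(run_full_duplex())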