Streaming STT

Macaw provides real-time speech-to-text over a WebSocket connection at /v1/realtime. Clients send audio frames as binary messages; the server returns transcription events as JSON text messages.

Quick Start

Using wscat

Connect and stream
wscat -c "ws://localhost:8000/v1/realtime?model=faster-whisper-large-v3"

Using Python

stream_audio.py
import asyncio
import json
import websockets

async def stream_microphone():
    uri = "ws://localhost:8000/v1/realtime?model=faster-whisper-large-v3"

    async with websockets.connect(uri) as ws:
        # Wait for session.created
        msg = json.loads(await ws.recv())
        print(f"Session: {msg['session_id']}")

        # Send audio frames (16-bit PCM, 16 kHz)
        # In production, read from a microphone
        with open("audio.raw", "rb") as f:
            while chunk := f.read(3200):  # 100 ms frames
                await ws.send(chunk)
                await asyncio.sleep(0.1)

                # Check for transcription events (non-blocking)
                try:
                    response = await asyncio.wait_for(ws.recv(), timeout=0.01)
                    event = json.loads(response)
                    if event["type"] == "transcript.partial":
                        print(f"  ...{event['text']}", end="\r")
                    elif event["type"] == "transcript.final":
                        print(f"  {event['text']}")
                except asyncio.TimeoutError:
                    pass

asyncio.run(stream_microphone())

Using the CLI

Stream from microphone
macaw transcribe --stream --model faster-whisper-large-v3

Connection

URL Format

ws://HOST:PORT/v1/realtime?model=MODEL&language=LANG

Parameter | Required | Default | Description
model | Yes | — | STT model name
language | No | auto | ISO 639-1 code (e.g., en, pt)
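
If you build the URL in code, the query string can be assembled with urllib.parse (a sketch; host, port, and parameter values are placeholders):

from urllib.parse import urlencode

# Hypothetical host/port and parameters; substitute your own.
params = {"model": "faster-whisper-large-v3", "language": "en"}
uri = f"ws://localhost:8000/v1/realtime?{urlencode(params)}"
print(uri)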

Session Created

After connecting, the server immediately sends a session.created event:

Server → Client
{
  "type": "session.created",
  "session_id": "sess_a1b2c3d4"
}

Save the session_id for logging and debugging.

Audio Format

Send audio as binary WebSocket frames:

Property | Value
Encoding | PCM 16-bit signed, little-endian
Sample rate | 16,000 Hz
Channels | Mono
Frame size | Recommended: 3,200 bytes (100 ms)

Preprocessing is automatic

If your audio isn't already 16 kHz mono, the StreamingPreprocessor converts it automatically. Sending correctly formatted audio, however, avoids the extra processing.
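
If your source file is in another container or sample rate, one way to pre-convert it is with ffmpeg (a sketch; assumes ffmpeg is on your PATH, and the file names are placeholders):

import subprocess

# Convert any input file to raw 16-bit little-endian PCM, 16 kHz, mono.
subprocess.run([
    "ffmpeg", "-i", "input.wav",
    "-f", "s16le",   # raw signed 16-bit little-endian samples
    "-ar", "16000",  # 16 kHz sample rate
    "-ac", "1",      # mono
    "audio.raw",
], check=True)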

Transcription Events

Partial Transcripts

Emitted in real time while speech is being recognized. Partials are unstable: the text may change as more context arrives.

Server → Client
{
  "type": "transcript.partial",
  "text": "hello how are",
  "segment_id": 1
}

Final Transcripts

Emitted when a speech segment ends (the VAD detects silence). Finals are stable: the text will not change.

Server → Client
{
  "type": "transcript.final",
  "text": "Hello, how are you doing today?",
  "segment_id": 1,
  "start": 0.5,
  "end": 2.8,
  "confidence": 0.94
}

ITN on finals only

Inverse Text Normalization (e.g., "one hundred" → "100") is applied only to final transcripts. Partials return raw text because they change too frequently for ITN to be useful.
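
Putting the two event types together, a minimal receive loop might look like this (a sketch; ws is an open connection as in the Quick Start, and the fields match the payloads above):

import json

async def read_transcripts(ws):
    # Partials are unstable, so overwrite the current line;
    # finals are committed along with their timing information.
    async for raw in ws:
        event = json.loads(raw)
        if event["type"] == "transcript.partial":
            print(f"\r...{event['text']}", end="", flush=True)
        elif event["type"] == "transcript.final":
            print(f"\r{event['text']}  [{event['start']:.1f}s-{event['end']:.1f}s]")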

Session Configuration

After connecting, you can adjust session settings dynamically:

Client → Server
{
  "type": "session.configure",
  "language": "pt",
  "vad_sensitivity": "high",
  "hot_words": ["Macaw", "OpenVoice", "gRPC"],
  "enable_itn": true
}

Field | Type | Description
language | string | Change language mid-session
vad_sensitivity | string | "high", "normal", or "low"
hot_words | string[] | Domain-specific terms to boost recognition
enable_itn | boolean | Enable/disable Inverse Text Normalization
model_tts | string | Set TTS model for full-duplex (see Full-Duplex)
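
A small helper makes this a one-liner at the call site (a sketch; ws is the connected WebSocket from the Quick Start):

import json

async def configure_session(ws, **settings):
    # Build and send a session.configure command from keyword arguments.
    await ws.send(json.dumps({"type": "session.configure", **settings}))

# Usage, inside an async context:
#   await configure_session(ws, language="pt", hot_words=["Macaw", "gRPC"])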

Buffer Management

Manual Commit

Force the audio buffer to commit and produce a final transcript, even without a VAD silence event:

Client → Server
{
  "type": "input_audio_buffer.commit"
}

This is useful when you know the user has finished speaking (e.g., they pressed a "done" button) but the VAD hasn't detected silence yet.
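
For example, wired to a hypothetical "done" button handler (a sketch):

import json

async def on_done_pressed(ws):
    # Hypothetical UI callback: the user signaled they are finished,
    # so force a final transcript instead of waiting for VAD silence.
    await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))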

Closing the Session

Graceful Close

Client → Server
{
  "type": "session.close"
}

The server flushes remaining data, emits any final transcripts, and sends:

Server → Client
{
  "type": "session.closed",
  "session_id": "sess_a1b2c3d4",
  "reason": "client_close"
}
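
A graceful-shutdown helper can drain those trailing events (a sketch, assuming only the event types documented on this page):

import json

async def close_session(ws):
    # Ask for a graceful close, then read until the server acknowledges,
    # collecting any final transcripts flushed on the way out.
    await ws.send(json.dumps({"type": "session.close"}))
    finals = []
    async for raw in ws:
        event = json.loads(raw)
        if event["type"] == "transcript.final":
            finals.append(event["text"])
        elif event["type"] == "session.closed":
            break
    return finals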

Cancel

Client → Server
{
  "type": "session.cancel"
}

Immediately closes the session without flushing. Pending transcripts are discarded.

Backpressure

If the client sends audio faster than real-time (e.g., reading from a file without throttling), the server applies backpressure:

Rate Limit Warning

Server → Client
{
  "type": "session.rate_limit",
  "delay_ms": 50,
  "message": "Audio arriving faster than 1.2x real-time"
}

Action: slow down your send rate by the suggested delay_ms.

Frames Dropped

Server → Client
{
  "type": "session.frames_dropped",
  "dropped_ms": 200,
  "message": "Backlog exceeded 10s, frames dropped"
}

Action: this is informational — frames have already been dropped. Reduce send rate to prevent further drops.

Throttle file streaming

When streaming from a file (rather than a live microphone), await asyncio.sleep(0.1) between 100 ms frames to simulate real-time capture. Without throttling, the server will trigger backpressure.
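
A backpressure-aware sender folds these rules together: pace frames in real time and stretch the delay when the server asks (a sketch; a complete client would also dispatch the transcript events it reads here):

import asyncio
import json

async def send_paced(ws, frames):
    # `frames` is an iterable of 3,200-byte (100 ms) audio chunks.
    extra = 0.0
    for frame in frames:
        await ws.send(frame)
        await asyncio.sleep(0.1 + extra)
        try:
            raw = await asyncio.wait_for(ws.recv(), timeout=0.01)
            event = json.loads(raw)
            if event["type"] == "session.rate_limit":
                # Honor the server's suggested slow-down.
                extra = event["delay_ms"] / 1000
        except asyncio.TimeoutError:
            pass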

Error Handling

Error Events

Server → Client
{
  "type": "error",
  "code": "worker_unavailable",
  "message": "STT worker not available for model faster-whisper-large-v3",
  "recoverable": true
}

Field | Description
code | Machine-readable error code
message | Human-readable description
recoverable | true if the client can retry or continue

Common Errors

Code | Recoverable | Description
model_not_found | No | Requested model is not loaded
worker_unavailable | Yes | Worker crashed, recovery in progress
session_timeout | No | Session exceeded idle timeout
invalid_command | Yes | Unrecognized JSON command
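
Routing on the recoverable flag is usually enough for a simple client (a sketch):

def handle_error(event):
    # Return True to keep the session alive, False to reconnect fresh.
    if event["recoverable"]:
        print(f"transient {event['code']}: {event['message']}")
        return True
    print(f"fatal {event['code']}: {event['message']}")
    return False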

Reconnection

If the WebSocket disconnects unexpectedly:

  1. Reconnect with the same parameters
  2. A new session_id will be assigned
  3. Previous session state is not preserved — this is a fresh session
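
A reconnect helper with exponential backoff might look like this (a sketch; max_attempts and the backoff schedule are arbitrary choices):

import asyncio
import json
import websockets

async def connect_with_retry(uri, max_attempts=5):
    # Every successful attempt is a brand-new session with a new session_id.
    for attempt in range(max_attempts):
        try:
            ws = await websockets.connect(uri)
            created = json.loads(await ws.recv())  # session.created event
            return ws, created["session_id"]
        except OSError:
            await asyncio.sleep(2 ** attempt)  # 1s, 2s, 4s, ...
    raise ConnectionError("realtime endpoint unreachable")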

Inactivity Timeout

The server monitors session activity:

Parameter | Value
Heartbeat ping | Every 10s
Auto-close timeout | 60s of inactivity

If no audio frames arrive for 60 seconds, the server closes the session automatically.

Next Steps

Goal | Guide
Add TTS to the same connection | Full-Duplex
Batch file transcription instead | Batch Transcription
Full protocol reference | WebSocket Protocol