Macaw OpenVoice
Guides

Streaming STT

Macaw provides real-time speech-to-text via WebSocket at /v1/realtime. Audio frames are sent as binary messages and transcription events are returned as JSON.

Quick Start

Using wscat

Connect and stream
wscat -c "ws://localhost:8000/v1/realtime?model=faster-whisper-large-v3"

Using Python

stream_audio.py
import asyncio
import json
import websockets

async def stream_microphone():
    uri = "ws://localhost:8000/v1/realtime?model=faster-whisper-large-v3"

    async with websockets.connect(uri) as ws:
        # Wait for session.created
        msg = json.loads(await ws.recv())
        print(f"Session: {msg['session_id']}")

        # Send audio frames (16-bit PCM, 16kHz)
        # In production, read from microphone
        with open("audio.raw", "rb") as f:
            while chunk := f.read(3200):  # 100ms frames
                await ws.send(chunk)
                await asyncio.sleep(0.1)

                # Check for transcription events (non-blocking)
                try:
                    response = await asyncio.wait_for(ws.recv(), timeout=0.01)
                    event = json.loads(response)
                    if event["type"] == "transcript.partial":
                        print(f"  ...{event['text']}", end="\r")
                    elif event["type"] == "transcript.final":
                        print(f"  {event['text']}")
                except asyncio.TimeoutError:
                    pass

asyncio.run(stream_microphone())

Using the CLI

Stream from microphone
macaw transcribe --stream --model faster-whisper-large-v3

Connection

URL Format

ws://HOST:PORT/v1/realtime?model=MODEL&language=LANG
| Parameter | Required | Default | Description |
|-----------|----------|---------|-------------|
| model     | Yes      | -       | STT model name |
| language  | No       | auto    | ISO 639-1 code (e.g., en, pt) |

Session Created

After connecting, the server immediately sends a session.created event:

Server → Client
{
  "type": "session.created",
  "session_id": "sess_a1b2c3d4"
}

Save the session_id for logging and debugging.

Audio Format

Send audio as binary WebSocket frames:

| Property    | Value |
|-------------|-------|
| Encoding    | PCM 16-bit signed, little-endian |
| Sample rate | 16,000 Hz |
| Channels    | Mono |
| Frame size  | Recommended: 3,200 bytes (100 ms) |

Preprocessing is automatic

If your audio isn't exactly 16kHz mono, the StreamingPreprocessor will resample it automatically. However, sending pre-formatted audio avoids unnecessary processing.
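The recommended frame size follows directly from the format: bytes = sample rate × bytes per sample × channels × duration. A small helper makes the arithmetic explicit (illustrative only, not part of any Macaw SDK):

```python
def frame_bytes(sample_rate_hz: int, frame_ms: int,
                bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Size in bytes of one raw PCM audio frame.

    Illustrative helper: defaults match the table above
    (16-bit samples, mono).
    """
    return sample_rate_hz * bytes_per_sample * channels * frame_ms // 1000

# 16 kHz, 16-bit mono, 100 ms frames -> the recommended 3,200 bytes
print(frame_bytes(16000, 100))  # 3200
```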

Transcription Events

Partial Transcripts

Emitted in real-time as speech is being recognized. These are unstable — text may change as more context arrives:

Server → Client
{
  "type": "transcript.partial",
  "text": "hello how are",
  "segment_id": 1
}

Final Transcripts

Emitted when a speech segment ends (VAD detects silence). These are stable — the text will not change:

Server → Client
{
  "type": "transcript.final",
  "text": "Hello, how are you doing today?",
  "segment_id": 1,
  "start": 0.5,
  "end": 2.8,
  "confidence": 0.94
}

ITN on finals only

Inverse Text Normalization (e.g., "one hundred" → "100") is applied only to final transcripts. Partials return raw text because they change too frequently for ITN to be useful.
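A client typically renders the two event types differently: partials overwrite a live line, finals are appended permanently. A minimal dispatcher might look like this (sketch; the event shapes match the examples above, the function name is an assumption):

```python
import json

def handle_transcript(raw: str):
    """Return a display line for a transcript event, or None otherwise.

    Illustrative sketch: only the two event types documented above
    are handled here.
    """
    event = json.loads(raw)
    if event["type"] == "transcript.partial":
        # Unstable: overwrite the current line as text is revised
        return f"...{event['text']}"
    if event["type"] == "transcript.final":
        # Stable: ITN already applied, safe to persist
        return event["text"]
    return None
```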

Session Configuration

After connecting, you can adjust session settings dynamically:

Client → Server
{
  "type": "session.configure",
  "language": "pt",
  "vad_sensitivity": "high",
  "hot_words": ["Macaw", "OpenVoice", "gRPC"],
  "enable_itn": true
}
| Field           | Type     | Description |
|-----------------|----------|-------------|
| language        | string   | Change language mid-session |
| vad_sensitivity | string   | "high", "normal", or "low" |
| hot_words       | string[] | Domain-specific terms to boost recognition |
| enable_itn      | boolean  | Enable/disable Inverse Text Normalization |
| model_tts       | string   | Set TTS model for full-duplex (see Full-Duplex) |
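For example, a client could switch to Portuguese and boost domain terms mid-session with a single message. A small builder for the command (sketch; field names are from the table above, the helper itself is hypothetical):

```python
import json

def configure_message(**settings) -> str:
    """Serialize a session.configure command from keyword settings.

    Illustrative helper: pass any fields from the table above,
    e.g. language="pt", hot_words=[...].
    """
    return json.dumps({"type": "session.configure", **settings})

msg = configure_message(language="pt", hot_words=["Macaw", "OpenVoice", "gRPC"])
# send with: await ws.send(msg)
```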

Buffer Management

Manual Commit

Force the audio buffer to commit and produce a final transcript, even without a VAD silence event:

Client → Server
{
  "type": "input_audio_buffer.commit"
}

This is useful when you know the user has finished speaking (e.g., they pressed a "done" button) but the VAD hasn't detected silence yet.
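In a push-to-talk UI, the "done" handler can simply send the commit command (sketch; `ws` is the connected WebSocket from the quick start, the handler name is illustrative):

```python
import json

# The commit command is a fixed JSON message
COMMIT = json.dumps({"type": "input_audio_buffer.commit"})

async def on_done_pressed(ws) -> None:
    """Force a final transcript without waiting for VAD silence.

    Illustrative handler: wire this to your UI's "done" button.
    """
    await ws.send(COMMIT)
```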

Closing the Session

Graceful Close

Client → Server
{
  "type": "session.close"
}

The server flushes remaining data, emits any final transcripts, and sends:

Server → Client
{
  "type": "session.closed",
  "session_id": "sess_a1b2c3d4",
  "reason": "client_close"
}
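Because the server flushes before closing, final transcripts may still arrive between session.close and session.closed, so a client should keep reading until it sees the closed event. A pure helper over a stream of already-decoded events (illustrative sketch):

```python
def drain_finals(events):
    """Collect final transcripts emitted before session.closed.

    Illustrative helper: `events` is any iterable of decoded
    event dicts, e.g. json.loads over ws.recv() results.
    """
    finals = []
    for event in events:
        if event["type"] == "transcript.final":
            finals.append(event["text"])
        elif event["type"] == "session.closed":
            break
    return finals
```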

Cancel

Client → Server
{
  "type": "session.cancel"
}

Immediately closes the session without flushing. Pending transcripts are discarded.

Backpressure

If the client sends audio faster than real-time (e.g., reading from a file without throttling), the server applies backpressure:

Rate Limit Warning

Server → Client
{
  "type": "session.rate_limit",
  "delay_ms": 50,
  "message": "Audio arriving faster than 1.2x real-time"
}

Action: slow down your send rate by the suggested delay_ms.

Frames Dropped

Server → Client
{
  "type": "session.frames_dropped",
  "dropped_ms": 200,
  "message": "Backlog exceeded 10s, frames dropped"
}

Action: this is informational — frames have already been dropped. Reduce send rate to prevent further drops.

Throttle file streaming

When streaming from a file (not a microphone), add an `await asyncio.sleep(0.1)` between 100 ms frames to simulate real-time playback, as in the quick-start script. Without throttling, the server will trigger backpressure.
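One way to react to a session.rate_limit event is to add the server-suggested delay_ms on top of the nominal frame interval (a sketch of one possible policy, not a prescribed algorithm):

```python
def next_frame_delay_ms(base_ms: int, event: dict) -> int:
    """Return the inter-frame sleep (ms) after a server event.

    Illustrative policy: add the server-suggested delay on a
    rate-limit warning; otherwise keep the nominal real-time pace.
    Sleep with asyncio.sleep(delay_ms / 1000) in the send loop.
    """
    if event.get("type") == "session.rate_limit":
        return base_ms + event["delay_ms"]
    return base_ms
```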

Error Handling

Error Events

Server → Client
{
  "type": "error",
  "code": "worker_unavailable",
  "message": "STT worker not available for model faster-whisper-large-v3",
  "recoverable": true
}
| Field       | Description |
|-------------|-------------|
| code        | Machine-readable error code |
| message     | Human-readable description |
| recoverable | true if the client can retry or continue |

Common Errors

| Code               | Recoverable | Description |
|--------------------|-------------|-------------|
| model_not_found    | No          | Requested model is not loaded |
| worker_unavailable | Yes         | Worker crashed, recovery in progress |
| session_timeout    | No          | Session exceeded idle timeout |
| invalid_command    | Yes         | Unrecognized JSON command |

Reconnection

If the WebSocket disconnects unexpectedly:

  1. Reconnect with the same parameters
  2. A new session_id will be assigned
  3. Previous session state is not preserved — this is a fresh session
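A reconnect loop can key off the recoverable flag from error events and back off exponentially between attempts (sketch; the retry policy is an assumption, not part of the protocol):

```python
def should_retry(error_event: dict) -> bool:
    """Retry only errors the server marks as recoverable."""
    return error_event.get("type") == "error" and error_event.get("recoverable", False)

def backoff_delays(attempts: int, base_s: float = 0.5, cap_s: float = 8.0):
    """Exponential backoff schedule for reconnect attempts.

    Illustrative policy: 0.5s, 1s, 2s, ... capped at cap_s.
    """
    return [min(base_s * 2 ** i, cap_s) for i in range(attempts)]
```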

Inactivity Timeout

The server monitors session activity:

| Parameter          | Value |
|--------------------|-------|
| Heartbeat ping     | Every 10s |
| Auto-close timeout | 60s of inactivity |

If no audio frames arrive for 60 seconds, the server closes the session automatically.

Next Steps

| Goal                             | Guide |
|----------------------------------|-------|
| Add TTS to the same connection   | Full-Duplex |
| Batch file transcription instead | Batch Transcription |
| Full protocol reference          | WebSocket Protocol |