Open-Source Voice Runtime
Build voice apps in minutes, not months
Macaw OpenVoice is a production-ready runtime for real-time speech-to-text and text-to-speech. Drop-in OpenAI API compatibility, streaming WebSocket support, and multi-engine architecture — all in a single Python process.
Everything you need for voice
A single runtime that handles the entire voice pipeline — from raw audio to structured text and back.
Streaming STT
Real-time partial and final transcripts via WebSocket with sub-300 ms time-to-first-byte and backpressure control.
Text-to-Speech
OpenAI-compatible speech endpoint with streaming PCM or WAV output and low time-to-first-byte.
Full-Duplex
Simultaneous STT and TTS on one WebSocket connection with automatic mute-on-speak safety.
Session Manager
6-state machine with ring buffer, WAL-based crash recovery, and zero segment duplication.
Multi-Engine
Faster-Whisper, WeNet, and Kokoro through a single interface. Add new engines in ~500 lines.
Voice Pipeline
Preprocessing, Silero VAD, ITN post-processing, and Prometheus metrics — all built in.
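The streaming STT and full-duplex cards above imply a small wire protocol of JSON frames over one WebSocket. As an illustration only, here is a minimal sketch of what the client-side framing might look like; the message shapes (`config`, `partial`, `final` and their fields) are assumptions for this sketch, not Macaw's documented protocol.

```python
import json

def build_config_frame(sample_rate: int = 16000, language: str = "en") -> str:
    """Hypothetical first frame telling the server how to decode raw audio."""
    return json.dumps({"type": "config", "sample_rate": sample_rate, "language": language})

def classify_transcript(frame: str) -> tuple[str, str]:
    """Split an incoming JSON frame into (kind, text).

    kind is 'partial' for interim hypotheses that may still change,
    'final' for committed segments that will not be revised.
    """
    msg = json.loads(frame)
    return msg["type"], msg.get("text", "")

# Example: an interim hypothesis arriving mid-utterance.
kind, text = classify_transcript('{"type": "partial", "text": "hello wor"}')
print(kind, text)  # partial hello wor
```

In a real client these helpers would sit inside an async receive loop; partial frames update the UI in place, while final frames are appended to the transcript.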
Drop-in Replacement
OpenAI SDK compatible
Existing OpenAI client libraries work out of the box. Change one line and your code talks to Macaw instead.
/v1/audio/transcriptions · /v1/audio/speech · /v1/audio/translations
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",
)

with open("audio.wav", "rb") as audio:
    result = client.audio.transcriptions.create(
        model="faster-whisper-tiny",
        file=audio,
    )
print(result.text)
```
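The same base URL serves the speech endpoint. A standard-library sketch of the TTS direction, assuming the endpoint accepts an OpenAI-style JSON body (`model`, `voice`, `input`, `response_format`) and returns audio bytes; the model and voice names here are placeholders, not confirmed Macaw identifiers.

```python
import json
import urllib.request

def build_speech_request(base_url: str, text: str) -> urllib.request.Request:
    """Build a POST to the OpenAI-compatible /audio/speech endpoint."""
    body = json.dumps({
        "model": "kokoro",        # placeholder engine name
        "voice": "default",       # placeholder voice id
        "input": text,
        "response_format": "wav",
    }).encode()
    return urllib.request.Request(
        f"{base_url}/audio/speech",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

if __name__ == "__main__":
    req = build_speech_request("http://localhost:8000/v1", "Hello from Macaw")
    # Sending it requires a running Macaw instance:
    # with urllib.request.urlopen(req) as resp, open("out.wav", "wb") as f:
    #     f.write(resp.read())
    print(req.full_url)
```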
How It Works
Architecture at a glance
A single runtime orchestrates isolated gRPC workers per engine. Workers crash independently — the runtime recovers automatically.
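The crash-recovery behavior described above follows a classic supervisor pattern: watch each worker process and restart it when it dies. A minimal sketch of that pattern, not Macaw's actual implementation; the restart limit and backoff are made-up parameters.

```python
import subprocess
import sys
import time

def supervise(cmd: list[str], max_restarts: int = 3, backoff: float = 0.0) -> int:
    """Run a worker process, restarting it whenever it exits non-zero.

    Returns the number of restarts performed (illustrative policy).
    """
    restarts = 0
    while True:
        proc = subprocess.run(cmd)
        if proc.returncode == 0:
            return restarts          # clean exit: stop supervising
        if restarts >= max_restarts:
            return restarts          # give up after the limit
        restarts += 1
        time.sleep(backoff)          # optional delay between restarts

# A worker that always crashes gets restarted up to the limit, then dropped.
crashing_worker = [sys.executable, "-c", "raise SystemExit(1)"]
print(supervise(crashing_worker, max_restarts=2))  # → 2
```

Because each engine runs in its own gRPC worker process, a crash in one engine never takes down the runtime or the other engines; the supervisor simply spawns a fresh worker.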