Everything you need for voice
A single runtime that handles the entire voice pipeline — from raw audio to structured text and back.
Streaming STT
Real-time partial and final transcripts via WebSocket with sub-300ms TTFB and backpressure control.
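
For a feel of the protocol, here is a minimal client sketch. The endpoint path, the end-of-stream signal, and the partial/final message shape are assumptions for illustration; the real framing is defined by the server docs.

import asyncio
import json
import websockets  # pip install websockets

async def stream_transcribe(path: str) -> None:
    # Hypothetical endpoint; check your Macaw server for the real URL.
    uri = "ws://localhost:8000/v1/audio/transcriptions/stream"
    async with websockets.connect(uri) as ws:
        with open(path, "rb") as f:
            # ~100 ms of 16 kHz 16-bit mono audio per frame
            while chunk := f.read(3200):
                await ws.send(chunk)
        await ws.send(json.dumps({"event": "end"}))  # assumed end signal
        async for message in ws:
            msg = json.loads(message)
            # Assumed shape: {"type": "partial" | "final", "text": "..."}
            print(msg.get("type"), msg.get("text"))
            if msg.get("type") == "final":
                break

asyncio.run(stream_transcribe("audio.wav"))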
Text-to-Speech
OpenAI-compatible speech endpoint with streaming PCM or WAV output and low time-to-first-byte.
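
Because the endpoint is OpenAI-compatible, the stock SDK covers it. The model and voice names below are placeholders for whatever your server exposes.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Stream synthesized audio straight to disk.
with client.audio.speech.with_streaming_response.create(
    model="kokoro",            # placeholder model name
    voice="af_heart",          # placeholder voice id
    input="Hello from Macaw.",
    response_format="wav",
) as response:
    response.stream_to_file("hello.wav")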
Full-Duplex
Simultaneous STT and TTS on one WebSocket connection with automatic mute-on-speak safety.
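
The mute-on-speak idea, reduced to its core: while TTS playback is active, mic frames are dropped before they reach STT, so the model never hears itself. A conceptual sketch, not Macaw's implementation.

speaking = False  # set while TTS playback is active

def on_tts_start() -> None:
    global speaking
    speaking = True

def on_tts_done() -> None:
    global speaking
    speaking = False

def on_mic_frame(frame: bytes) -> bytes | None:
    # Drop mic audio while speaking so STT never transcribes TTS output.
    return None if speaking else frame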
Session Manager
6-state machine with ring buffer, WAL-based crash recovery, and zero segment duplication.
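
The zero-duplication guarantee follows a standard write-ahead pattern: durably log a segment id before emitting the segment, and skip already-logged ids on restart. A sketch of that technique under those assumptions, not Macaw's actual on-disk format.

import os

WAL = "segments.wal"

def recover_emitted() -> set[str]:
    # After a crash, every id in the log was already (or about to be) emitted.
    if not os.path.exists(WAL):
        return set()
    with open(WAL) as f:
        return {line.strip() for line in f if line.strip()}

def emit_once(segment_id: str, text: str, emitted: set[str]) -> None:
    if segment_id in emitted:
        return  # seen before the crash: never emit twice
    with open(WAL, "a") as f:
        f.write(segment_id + "\n")
        f.flush()
        os.fsync(f.fileno())  # make the record durable before emitting
    emitted.add(segment_id)
    print(segment_id, text)  # stand-in for actual delivery

Logging before emitting trades a possibly dropped segment at the crash point for a hard no-duplicates guarantee.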
Multi-Engine
Faster-Whisper, WeNet, and Kokoro through a single interface. Add new engines in ~500 lines.
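
The single interface might look roughly like this; the actual Macaw base class will differ, but it shows why a new engine is mostly glue around an existing library.

from abc import ABC, abstractmethod
from typing import Iterator

class STTEngine(ABC):
    @abstractmethod
    def load(self, model_id: str) -> None: ...

    @abstractmethod
    def transcribe(self, audio_path: str) -> Iterator[str]:
        """Yield transcript segments in order."""

class FasterWhisperEngine(STTEngine):
    def load(self, model_id: str) -> None:
        from faster_whisper import WhisperModel  # pip install faster-whisper
        self._model = WhisperModel(model_id)

    def transcribe(self, audio_path: str) -> Iterator[str]:
        segments, _info = self._model.transcribe(audio_path)
        for seg in segments:
            yield seg.text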
Voice Pipeline
Preprocessing, Silero VAD, ITN post-processing, and Prometheus metrics — all built in.
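
Macaw wires the VAD stage in for you; to make that stage concrete, here is what Silero VAD does on its own, via the standalone silero-vad package.

# pip install silero-vad
from silero_vad import load_silero_vad, read_audio, get_speech_timestamps

model = load_silero_vad()
wav = read_audio("audio.wav")  # loads and resamples to 16 kHz mono
speech = get_speech_timestamps(wav, model, return_seconds=True)
print(speech)  # e.g. [{'start': 0.3, 'end': 2.1}, ...]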
OpenAI SDK compatible
Existing OpenAI client libraries work out of the box. Just point base_url to your Macaw server.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

result = client.audio.transcriptions.create(
    model="faster-whisper-tiny",
    file=open("audio.wav", "rb"),
)
print(result.text)

Architecture at a glance
A single runtime orchestrates isolated gRPC workers per engine. Workers crash independently — the runtime recovers automatically.
      Clients (REST / WebSocket / CLI)
                     │
         ┌───────────┴───────────┐
         │      API Server       │
         │  (FastAPI + Uvicorn)  │
         └───────────┬───────────┘
                     │
         ┌───────────┴───────────┐
         │       Scheduler       │
         │  Priority · Batching  │
         │  Cancellation · TTFB  │
         └─────┬─────────┬───────┘
               │         │
     ┌─────────┴──┐   ┌──┴─────────┐
     │ STT Worker │   │ TTS Worker │
     │   (gRPC)   │   │   (gRPC)   │
     ├────────────┤   ├────────────┤
     │  Faster-   │   │   Kokoro   │
     │  Whisper   │   └────────────┘
     │   WeNet    │
     └────────────┘