Welcome to Macaw OpenVoice

Macaw OpenVoice is an open-source voice runtime for real-time Speech-to-Text and Text-to-Speech, with an OpenAI-compatible API, streaming session control, and an extensible execution architecture.

Macaw is not a fork, wrapper, or thin layer on top of existing projects. It is the runtime layer that sits between inference engines and production -- handling session management, audio preprocessing, post-processing, scheduling, observability, and a unified CLI.


Capabilities

| Capability | Description |
| --- | --- |
| OpenAI-Compatible API | POST /v1/audio/transcriptions, /v1/audio/translations, /v1/audio/speech -- existing SDKs work out of the box |
| Real-Time Streaming | Partial and final transcripts via WebSocket with sub-300 ms TTFB |
| Full-Duplex | Simultaneous STT + TTS on a single WebSocket with mute-on-speak safety |
| Multi-Engine | Faster-Whisper (encoder-decoder), WeNet (CTC), Kokoro (TTS) through one interface |
| Session Manager | 6-state machine, ring buffer, WAL-based crash recovery, backpressure control |
| Voice Activity Detection | Silero VAD with an energy pre-filter and configurable sensitivity levels |
| Audio Preprocessing | Automatic resampling to 16 kHz, DC removal, and gain normalization |
| Post-Processing | Inverse Text Normalization via NeMo (e.g., "two thousand" becomes "2000") |
| Hot Words | Domain-specific keyword boosting per session |
| CLI | Ollama-style UX -- macaw serve, macaw transcribe, macaw list, macaw pull |
| Observability | Prometheus metrics for TTFB, session duration, VAD events, TTS latency |
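The Inverse Text Normalization step can be pictured with a tiny sketch. This is illustrative only -- the real pipeline uses NeMo's WFST grammars, which also handle teens, ordinals, dates, currency, and much more; the dictionaries and function name below are not Macaw's actual code:

```python
# Tiny ITN sketch: spelled-out cardinal numbers to digits.
_UNITS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
          "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9}
_TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
         "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}

def itn(phrase: str) -> str:
    total, current = 0, 0
    for word in phrase.lower().split():
        if word in _UNITS:
            current += _UNITS[word]
        elif word in _TENS:
            current += _TENS[word]
        elif word == "hundred":
            current *= 100
        elif word in ("thousand", "million"):
            total += current * (1_000 if word == "thousand" else 1_000_000)
            current = 0
        else:
            return phrase  # not a pure number -- leave the text untouched
    return str(total + current)
```

With this sketch, `itn("two thousand")` yields `"2000"`, matching the example above, while non-numeric phrases pass through unchanged.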

Supported Engines

| Engine | Type | Architecture | Streaming | Hot Words |
| --- | --- | --- | --- | --- |
| Faster-Whisper | STT | Encoder-Decoder | LocalAgreement | via initial_prompt |
| WeNet | STT | CTC | Native partials | Native keyword boosting |
| Kokoro | TTS | Neural | Chunked streaming | -- |
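The LocalAgreement policy listed for Faster-Whisper finalizes only the tokens on which successive decoding passes agree; everything after the first disagreement remains a revisable partial. A minimal sketch of the agreement step (the function name is illustrative, not Macaw's API):

```python
def agreed_prefix(prev_hyp: list[str], new_hyp: list[str]) -> list[str]:
    """Return the longest common token prefix of two successive hypotheses.

    Tokens in this prefix are treated as stable and can be emitted as
    final transcript; the remainder stays a revisable partial.
    """
    stable: list[str] = []
    for a, b in zip(prev_hyp, new_hyp):
        if a != b:
            break  # first disagreement -- everything after is unstable
        stable.append(a)
    return stable
```

For example, if one pass decodes "the cat sat on" and the next decodes "the cat sits on", only "the cat" is emitted as final.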
Adding new engines

Adding a new STT or TTS engine requires approximately 400-700 lines of code and zero changes to the runtime core. See the Adding an Engine guide.
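As a rough illustration of what such an adapter involves -- the interface below is a hypothetical sketch, not Macaw's actual base class or method names:

```python
from dataclasses import dataclass
from typing import Iterator, Protocol


@dataclass
class Segment:
    text: str
    start: float  # seconds
    end: float


class STTEngine(Protocol):
    """Hypothetical adapter surface -- names are illustrative."""

    def load(self, model_id: str) -> None: ...
    def transcribe(self, pcm16: bytes, sample_rate: int) -> Iterator[Segment]: ...


class EchoEngine:
    """Toy implementation satisfying the protocol, for demonstration only."""

    def load(self, model_id: str) -> None:
        self.model_id = model_id

    def transcribe(self, pcm16: bytes, sample_rate: int) -> Iterator[Segment]:
        # A real adapter would run inference here; we just describe the input.
        duration = len(pcm16) / 2 / sample_rate  # 16-bit mono PCM
        yield Segment(text=f"{len(pcm16)} bytes @ {sample_rate} Hz",
                      start=0.0, end=duration)
```

Because the runtime talks to engines only through this narrow surface, session management, VAD, and post-processing need no changes when an engine is added.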


How It Works

```
  Clients (REST / WebSocket / CLI)
                 |
     +-----------+-----------+
     |       API Server      |
     |  (FastAPI + Uvicorn)  |
     +-----------+-----------+
                 |
     +-----------+-----------+
     |       Scheduler       |
     |  Priority . Batching  |
     |  Cancellation . TTFB  |
     +-----+-----------+-----+
           |           |
   +-------+----+ +----+-------+
   | STT Worker | | TTS Worker |
   |   (gRPC)   | |   (gRPC)   |
   +------------+ +------------+
   | Faster-    | |  Kokoro    |
   | Whisper    | +------------+
   | WeNet      |
   +------------+
```

Workers run as isolated gRPC subprocesses. If a worker crashes, the runtime recovers automatically via the WAL -- no data is lost, no segments are duplicated.
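The recovery guarantee can be pictured as an append-only segment log: a segment is made durable (fsync) before it is acknowledged, and replay deduplicates by segment id. This is a simplified illustration, not Macaw's actual WAL format:

```python
import json
import os


class SegmentWAL:
    """Append-only write-ahead-log sketch: one JSON line per finalized segment."""

    def __init__(self, path: str):
        self.path = path

    def append(self, seg_id: int, text: str) -> None:
        # Durable before acknowledgement: flush + fsync, then ack.
        with open(self.path, "a") as f:
            f.write(json.dumps({"id": seg_id, "text": text}) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def replay(self) -> dict[int, str]:
        # After a crash, replay recovers every segment exactly once:
        # re-appended duplicates are skipped by id.
        recovered: dict[int, str] = {}
        if os.path.exists(self.path):
            with open(self.path) as f:
                for line in f:
                    rec = json.loads(line)
                    recovered.setdefault(rec["id"], rec["text"])
        return recovered
```

Fsync-before-ack is what makes "no data is lost" hold, and id-based deduplication on replay is what makes "no segments are duplicated" hold.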


Quick Example

Install and start

```shell
pip install "macaw-openvoice[server,grpc,faster-whisper]"
macaw serve
```

Transcribe a file

```shell
curl -X POST http://localhost:8000/v1/audio/transcriptions \
  -F file=@audio.wav \
  -F model=faster-whisper-large-v3
```

Using the OpenAI SDK

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

result = client.audio.transcriptions.create(
    model="faster-whisper-large-v3",
    file=open("audio.wav", "rb"),
)
print(result.text)
```

Next Steps

  • Installation -- Set up Python, install Macaw, and configure your first engine
  • Quickstart -- Run your first transcription in under 5 minutes
  • Streaming STT -- Connect via WebSocket for real-time transcription
  • Full-Duplex -- Build voice assistants with simultaneous STT and TTS
  • API Reference -- Complete endpoint documentation
  • Architecture -- Understand how the runtime is structured

Contact