
Welcome to Macaw OpenVoice

Macaw OpenVoice is an open-source voice runtime for real-time Speech-to-Text and Text-to-Speech, with an OpenAI-compatible API, streaming session control, and an extensible execution architecture.

Macaw is not a fork, wrapper, or thin layer on top of existing projects. It is the runtime layer that sits between inference engines and production -- handling session management, audio preprocessing, post-processing, scheduling, observability, and a unified CLI.


Capabilities

| Capability | Description |
|---|---|
| OpenAI-Compatible API | `POST /v1/audio/transcriptions`, `/translations`, `/speech` -- existing SDKs work out of the box |
| Real-Time Streaming | Partial and final transcripts via WebSocket with sub-300 ms TTFB |
| Full-Duplex | Simultaneous STT + TTS on a single WebSocket with mute-on-speak safety |
| Multi-Engine | Faster-Whisper (encoder-decoder), Kokoro (TTS) through one interface |
| Session Manager | 6-state machine, ring buffer, WAL-based crash recovery, backpressure control |
| Voice Activity Detection | Silero VAD with energy pre-filter and configurable sensitivity levels |
| Audio Preprocessing | Automatic resampling to 16 kHz, DC removal, and gain normalization |
| Post-Processing | Inverse Text Normalization via NeMo (e.g., "two thousand" becomes "2000") |
| Hot Words | Domain-specific keyword boosting per session |
| CLI | Ollama-style UX -- `macaw serve`, `macaw transcribe`, `macaw list`, `macaw pull` |
| Observability | Prometheus metrics for TTFB, session duration, VAD events, TTS latency |
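The preprocessing stage in the table above (resampling to 16 kHz, DC removal, gain normalization) can be sketched in a few lines of NumPy. This is an illustration of the concepts only, not Macaw's actual implementation -- a production pipeline would use a proper polyphase resampler rather than linear interpolation:

```python
import numpy as np

def preprocess(audio: np.ndarray, sr: int, target_sr: int = 16000,
               target_peak: float = 0.95) -> np.ndarray:
    """Toy preprocessing: resample to 16 kHz, remove DC offset, normalize gain."""
    # Naive linear resampling (real engines use a polyphase filter)
    if sr != target_sr:
        n_out = int(len(audio) * target_sr / sr)
        x_old = np.linspace(0.0, 1.0, num=len(audio), endpoint=False)
        x_new = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
        audio = np.interp(x_new, x_old, audio)
    audio = audio - audio.mean()              # DC removal
    peak = np.max(np.abs(audio))
    if peak > 0:
        audio = audio * (target_peak / peak)  # gain normalization
    return audio
```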

Supported Engines

| Engine | Type | Architecture | Streaming | Hot Words |
|---|---|---|---|---|
| Faster-Whisper | STT | Encoder-Decoder | LocalAgreement | via `initial_prompt` |
| Kokoro | TTS | Neural | Chunked streaming | -- |
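LocalAgreement, the streaming policy listed for Faster-Whisper, emits only the token prefix on which consecutive decoding passes agree -- tokens that keep changing between passes stay unconfirmed. A toy sketch of the agreement step (the function name and token-list representation are illustrative, not Macaw's internals):

```python
def local_agreement(prev_hyp: list[str], new_hyp: list[str]) -> list[str]:
    """Return the longest common prefix of two consecutive hypotheses --
    the tokens considered stable enough to emit as a partial transcript."""
    stable = []
    for a, b in zip(prev_hyp, new_hyp):
        if a != b:
            break  # first disagreement: everything after it is unstable
        stable.append(a)
    return stable
```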

Adding new engines

Adding a new STT or TTS engine requires approximately 400-700 lines of code and zero changes to the runtime core. See the Adding an Engine guide.
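As a rough sketch of what such an adapter could look like -- the method names and signatures below are hypothetical, invented for illustration, not Macaw's actual interface; see the Adding an Engine guide for the real contract:

```python
from typing import Iterator, Protocol

class STTEngine(Protocol):
    """Hypothetical shape of an STT engine adapter (names are illustrative)."""
    def load(self, model_id: str) -> None: ...
    def transcribe_stream(self, pcm_chunks: Iterator[bytes]) -> Iterator[str]: ...

class EchoEngine:
    """Dummy engine satisfying the contract without real inference."""
    def load(self, model_id: str) -> None:
        self.model_id = model_id

    def transcribe_stream(self, pcm_chunks: Iterator[bytes]) -> Iterator[str]:
        # A real adapter would feed PCM into the model; here we just count chunks.
        for i, _chunk in enumerate(pcm_chunks):
            yield f"segment-{i}"
```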


How It Works

              Clients (REST / WebSocket / CLI)
                          |
              +-----------+-----------+
              |     API Server        |
              |  (FastAPI + Uvicorn)  |
              +-----------+-----------+
                          |
              +-----------+-----------+
              |      Scheduler        |
              |  Priority . Batching  |
              |  Cancellation . TTFB  |
              +-----+----------+------+
                    |          |
           +--------+--+  +---+--------+
           | STT Worker |  | TTS Worker |
           |  (gRPC)    |  |  (gRPC)    |
           +------------+  +------------+
            | Faster-    |  | Kokoro     |
            | Whisper    |  +------------+
            +------------+

Workers run as isolated gRPC subprocesses. If a worker crashes, the runtime recovers automatically via the WAL -- no data is lost, no segments are duplicated.
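The "no segments are duplicated" guarantee can be illustrated with a toy WAL replay: each finalized segment carries an id, and recovery skips records it has already seen. The JSON record format below is invented for illustration and is not Macaw's actual on-disk format:

```python
import json

def replay_wal(lines: list[str]) -> list[str]:
    """Replay WAL records in order, dropping duplicate segment ids."""
    seen: set[int] = set()
    segments: list[str] = []
    for line in lines:
        rec = json.loads(line)
        if rec["seg_id"] in seen:
            continue  # duplicate write from a retried worker -- skip it
        seen.add(rec["seg_id"])
        segments.append(rec["text"])
    return segments
```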


Quick Example

Install and start:

```bash
pip install "macaw-openvoice[server,grpc,faster-whisper]"
macaw serve
```

Transcribe a file:

```bash
curl -X POST http://localhost:8000/v1/audio/transcriptions \
  -F file=@audio.wav \
  -F model=faster-whisper-large-v3
```

Using the OpenAI SDK:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

with open("audio.wav", "rb") as f:
    result = client.audio.transcriptions.create(
        model="faster-whisper-large-v3",
        file=f,
    )
print(result.text)
```

Next Steps

  • Installation -- Set up Python, install Macaw, and configure your first engine
  • Quickstart -- Run your first transcription in under 5 minutes
  • Streaming STT -- Connect via WebSocket for real-time transcription
  • Full-Duplex -- Build voice assistants with simultaneous STT and TTS
  • API Reference -- Complete endpoint documentation
  • Architecture -- Understand how the runtime is structured

Contact