
Architecture Overview

Macaw OpenVoice is a unified voice runtime that orchestrates STT (Speech-to-Text) and TTS (Text-to-Speech) engines from a single runtime process, with each engine isolated in its own gRPC worker subprocess. It exposes an OpenAI-compatible API while keeping engines modular and crash-isolated.

High-Level Architecture

┌───────────┐      ┌──────────────────────────────────────────────────────────────────┐
│ Clients   │      │                          Macaw Runtime                           │
│           │      │                                                                  │
│ REST      │      │  ┌─────────────┐    ┌───────────┐    ┌───────────┐               │
│ WebSocket │─────▶│  │ API Server  │───▶│ Scheduler │───▶│ STT Worker│ (subprocess)  │
│ CLI       │      │  │ (FastAPI)   │    │           │    │ gRPC:50051│               │
└───────────┘      │  │ /v1/audio/  │    │ Priority  │    └───────────┘               │
                   │  │ /v1/realtime│    │ Queue     │                                │
                   │  └──────┬──────┘    │ Batching  │    ┌───────────┐               │
                   │         │           │ Cancel    │───▶│ TTS Worker│ (subprocess)  │
                   │         │           └───────────┘    │ gRPC:50052│               │
                   │  ┌──────┴───────────┐                └───────────┘               │
                   │  │ Session Manager  │                                            │
                   │  │ (streaming only) │                                            │
                   │  │                  │                                            │
                   │  │ State Machine    │                                            │
                   │  │ Ring Buffer      │                                            │
                   │  │ WAL Recovery     │                                            │
                   │  └──────┬───────────┘                                            │
                   │         │                                                        │
                   │  ┌──────┴─────────────────────────────┐                          │
                   │  │ Audio Pipeline                     │                          │
                   │  │ Preprocessing → VAD → Postprocess  │                          │
                   │  └────────────────────────────────────┘                          │
                   └──────────────────────────────────────────────────────────────────┘

Core Layers

API Server

The FastAPI server exposes three types of interfaces:

| Interface    | Endpoint                      | Use Case                        |
|--------------|-------------------------------|---------------------------------|
| REST (batch) | POST /v1/audio/transcriptions | File transcription              |
| REST (batch) | POST /v1/audio/translations   | File translation to English     |
| REST (batch) | POST /v1/audio/speech         | Text-to-speech synthesis        |
| WebSocket    | WS /v1/realtime               | Streaming STT + full-duplex TTS |
| Health       | GET /health, GET /v1/models   | Monitoring and model listing    |

All REST endpoints are OpenAI API-compatible — existing OpenAI client libraries work without modification.

Scheduler

The Scheduler routes batch (REST) requests to gRPC workers. It provides:

  • Priority queue with two levels: REALTIME and BATCH
  • Cancellation for queued and in-flight requests
  • Dynamic batching to group requests by model
  • Latency tracking with TTL-based cleanup

Streaming bypasses the Scheduler

WebSocket streaming uses StreamingGRPCClient directly; it does not pass through the priority queue. The Scheduler serves REST batch requests only.

See Scheduling for details.
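The two-level queue with cancellation can be sketched as follows. This is a minimal illustration, not the runtime's actual API: the class and method names (`PriorityQueue`, `push`, `cancel`, `pop`) are hypothetical, and cancellation of in-flight requests is omitted.

```python
import heapq
import itertools

REALTIME, BATCH = 0, 1  # lower value = higher priority

class PriorityQueue:
    """Two-level priority queue with lazy cancellation via tombstones."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # FIFO tie-break within a level
        self._cancelled = set()

    def push(self, request_id, payload, priority=BATCH):
        heapq.heappush(self._heap, (priority, next(self._counter), request_id, payload))

    def cancel(self, request_id):
        self._cancelled.add(request_id)  # skipped lazily on pop

    def pop(self):
        while self._heap:
            _, _, request_id, payload = heapq.heappop(self._heap)
            if request_id not in self._cancelled:
                return request_id, payload
            self._cancelled.discard(request_id)
        return None

q = PriorityQueue()
q.push("b1", "batch job", BATCH)
q.push("r1", "realtime job", REALTIME)
q.cancel("b1")
print(q.pop())  # → ('r1', 'realtime job')
print(q.pop())  # → None (b1 was cancelled)
```

A REALTIME entry always pops before a BATCH entry regardless of arrival order, and cancelled entries cost nothing until they reach the top of the heap.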

Session Manager

The Session Manager coordinates streaming STT only. Each WebSocket connection gets its own session with:

  • State machine — 6 states: INIT → ACTIVE → SILENCE → HOLD → CLOSING → CLOSED
  • Ring buffer — pre-allocated circular buffer for audio frames (zero allocations during streaming)
  • WAL — in-memory Write-Ahead Log for crash recovery
  • Backpressure — rate limiting at 1.2x real-time, frame dropping when overloaded

TTS is stateless

TTS does not use the Session Manager. Each tts.speak request is independent; no state is carried between synthesis calls.

See Session Manager for details.
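The zero-allocation property of the ring buffer comes from pre-allocating every frame slot up front and copying into slots during streaming. A simplified sketch (field and method names are illustrative, not the runtime's actual implementation):

```python
import array

class RingBuffer:
    """Pre-allocated circular buffer for fixed-size 16-bit PCM frames."""

    def __init__(self, capacity_frames, frame_samples):
        # All memory is allocated here, before streaming begins
        self._frames = [array.array("h", [0] * frame_samples)
                        for _ in range(capacity_frames)]
        self._capacity = capacity_frames
        self._head = 0   # next write slot
        self._count = 0  # frames currently stored

    def write(self, samples):
        """Copy one frame (array.array('h')) into the next slot, no allocation."""
        slot = self._frames[self._head]
        slot[:] = samples  # in-place copy into the pre-allocated slot
        self._head = (self._head + 1) % self._capacity
        self._count = min(self._count + 1, self._capacity)

    def latest(self, n):
        """Return the most recent n frames, oldest first."""
        n = min(n, self._count)
        start = (self._head - n) % self._capacity
        return [self._frames[(start + i) % self._capacity] for i in range(n)]
```

When the buffer is full, the oldest frame is silently overwritten, which is exactly the behavior wanted for crash recovery: only the most recent, uncommitted audio needs to survive.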

Audio Pipeline

The audio pipeline runs in the runtime, not in the engine. This guarantees consistent behavior across all engines.

Input Audio → Resample (16kHz) → DC Remove → Gain Normalize → VAD → Engine

Raw Text → ITN → Output

| Stage             | Layer          | Description                                          |
|-------------------|----------------|------------------------------------------------------|
| Resample          | Preprocessing  | Convert to 16kHz mono via scipy.signal.resample_poly |
| DC Remove         | Preprocessing  | 2nd-order Butterworth HPF at 20Hz                    |
| Gain Normalize    | Preprocessing  | Peak normalization to -3.0 dBFS                      |
| Energy Pre-filter | VAD            | RMS + spectral flatness check (~0.1ms)               |
| Silero VAD        | VAD            | Neural speech probability (~2ms on CPU)              |
| ITN               | Postprocessing | Inverse Text Normalization via NeMo (fail-open)      |
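The three preprocessing stages can be sketched with SciPy as below. This is a simplified sketch: the filter order, cutoff, resampler, and target level come from the table above, but everything else (function name, mono input assumption) is illustrative.

```python
import numpy as np
from scipy.signal import resample_poly, butter, sosfilt

def preprocess(audio, sr, target_sr=16000, peak_dbfs=-3.0):
    """Resample → DC remove → peak normalize, per the documented stages."""
    # Resample to 16 kHz with a polyphase filter
    if sr != target_sr:
        g = np.gcd(sr, target_sr)
        audio = resample_poly(audio, target_sr // g, sr // g)
    # 2nd-order Butterworth high-pass at 20 Hz removes DC offset
    sos = butter(2, 20, btype="highpass", fs=target_sr, output="sos")
    audio = sosfilt(sos, audio)
    # Peak-normalize to -3 dBFS
    peak = np.max(np.abs(audio))
    if peak > 0:
        audio = audio * (10 ** (peak_dbfs / 20) / peak)
    return audio
```

Running VAD after this chain is what keeps its thresholds stable: every engine sees audio at the same rate, with no DC bias, at a known peak level.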

See VAD Pipeline for details.

Workers

Workers are gRPC subprocesses. A worker crash does not bring down the runtime — the Session Manager recovers by resending uncommitted audio from the ring buffer.

| Worker | Port  | Protocol                | Engines               |
|--------|-------|-------------------------|-----------------------|
| STT    | 50051 | Bidirectional streaming | Faster-Whisper, WeNet |
| TTS    | 50052 | Server streaming        | Kokoro                |
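The recovery contract between the WAL and the workers can be sketched as a commit offset over appended frames: frames are logged before being sent, the offset advances only on worker acknowledgment, and everything past the offset is resent after a crash. The class below is illustrative; the runtime's actual WAL structure is not shown here.

```python
class AudioWAL:
    """In-memory write-ahead log of streamed audio frames (sketch)."""

    def __init__(self):
        self._frames = []
        self._committed = 0  # count of frames the worker has acknowledged

    def append(self, frame):
        # Log the frame before it is sent to the worker
        self._frames.append(frame)

    def commit(self, upto):
        # Worker acknowledged transcripts covering frames [0, upto)
        self._committed = max(self._committed, upto)

    def uncommitted(self):
        # On worker crash/restart, resend exactly these frames
        return self._frames[self._committed:]
```

Because the log lives in memory, commit and recovery are both O(1) bookkeeping plus a slice, at the cost of not surviving a crash of the runtime process itself.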

Worker lifecycle:

STARTING → READY → BUSY → STOPPING → STOPPED
             ↑       │
             └───────┘
             (on idle)

CRASHED → (auto-restart, max 3 in 60s)

The WorkerManager handles health probing (exponential backoff, 30s timeout), graceful shutdown (SIGTERM → 5s wait → SIGKILL), and automatic restart with rate limiting.
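The "max 3 restarts in 60s" policy amounts to a sliding-window rate limiter. A sketch, with a `now` parameter injected for testability (the real WorkerManager's interface is not shown here):

```python
from collections import deque

class RestartLimiter:
    """Allow at most max_restarts restarts per window_s seconds (sliding window)."""

    def __init__(self, max_restarts=3, window_s=60.0):
        self._max = max_restarts
        self._window = window_s
        self._times = deque()

    def try_restart(self, now):
        # Drop restart timestamps that have fallen out of the window
        while self._times and now - self._times[0] >= self._window:
            self._times.popleft()
        if len(self._times) >= self._max:
            return False  # rate limit hit: stop restart-looping, escalate instead
        self._times.append(now)
        return True
```

The deque keeps only timestamps inside the window, so a crash-looping worker is cut off after the third restart and only regains restart budget as old attempts age out.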

Model Registry

The Registry manages model manifests (macaw.yaml files) and lifecycle. Models declare their architecture field, which tells the runtime how to adapt the pipeline:

| Architecture     | Example        | LocalAgreement | Cross-segment Context | Native Partials |
|------------------|----------------|----------------|-----------------------|-----------------|
| encoder-decoder  | Faster-Whisper | Yes            | Yes (224 tokens)      | No              |
| ctc              | WeNet          | No             | No                    | Yes             |
| streaming-native | Paraformer     | No             | No                    | Yes             |
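LocalAgreement stabilizes encoder-decoder partials by committing only the prefix on which consecutive hypotheses agree. A word-level sketch (the runtime may operate on tokens rather than words, and its function signature will differ):

```python
def local_agreement(prev_hyp, curr_hyp, committed):
    """Return newly committable words: the longest common prefix of two
    consecutive hypotheses, minus what was already committed."""
    agreed = []
    for a, b in zip(prev_hyp, curr_hyp):
        if a != b:
            break
        agreed.append(a)
    return agreed[len(committed):]

committed = []
h1 = "the cat sat on".split()
h2 = "the cat sat on the".split()
committed += local_agreement(h1, h2, committed)
print(committed)  # → ['the', 'cat', 'sat', 'on']
```

Because a word is emitted only after two decodes agree on it, committed text never flickers; the trade-off is roughly one extra decode of latency before each word appears.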

Data Flow

Batch Request (REST)

Client → POST /v1/audio/transcriptions
→ Preprocessing pipeline (resample, DC remove, normalize)
→ Scheduler priority queue
→ gRPC TranscribeFile to STT worker
→ Postprocessing (ITN)
→ JSON response to client

Streaming Request (WebSocket)

Client → WS /v1/realtime
→ Session created (state: INIT)
→ Binary frames arrive
→ StreamingPreprocessor (per-frame)
→ VAD (energy pre-filter → Silero)
→ SPEECH_START → state: ACTIVE
→ Frames written to ring buffer
→ Frames sent via StreamingGRPCClient to STT worker
→ Partial/final transcripts sent back to client
→ SPEECH_END → state: SILENCE
→ ITN applied on final transcripts only
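The SPEECH_START/SPEECH_END events above drive the session's state machine. The sketch below enforces guarded transitions; the documented states are real, but the exact set of legal transitions beyond the ACTIVE ⇄ SILENCE cycle is an assumption here.

```python
class SessionStateMachine:
    """Guarded transitions between the six session states (sketch;
    the allowed-transition sets are illustrative assumptions)."""

    TRANSITIONS = {
        "INIT":    {"ACTIVE", "CLOSING"},
        "ACTIVE":  {"SILENCE", "CLOSING"},          # SPEECH_END → SILENCE
        "SILENCE": {"ACTIVE", "HOLD", "CLOSING"},   # SPEECH_START → ACTIVE
        "HOLD":    {"ACTIVE", "CLOSING"},
        "CLOSING": {"CLOSED"},
        "CLOSED":  set(),
    }

    def __init__(self):
        self.state = "INIT"

    def transition(self, new_state):
        if new_state not in self.TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state
```

Rejecting illegal transitions at one choke point keeps VAD events, client close, and worker crashes from racing the session into an inconsistent state.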

Full-Duplex (STT + TTS)

Client sends audio (STT) ──────────────────────────────▶ partials/finals
Client sends tts.speak ──▶ mute STT
                       ──▶ gRPC Synthesize to TTS worker
                       ──▶ tts.speaking_start event
                       ──▶ binary audio chunks (server → client)
                       ──▶ tts.speaking_end event
                       ──▶ unmute STT (guaranteed via try/finally)
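The "guaranteed via try/finally" part is the load-bearing detail: the unmute must run even if synthesis fails mid-stream. A sketch, with `session` and `tts_client` interfaces assumed for illustration:

```python
async def handle_tts_speak(session, tts_client, text):
    """Mute STT while TTS audio plays; unmute is guaranteed by finally.
    (Illustrative sketch; session/tts_client interfaces are assumptions.)"""
    session.stt_muted = True  # prevent TTS output from feeding back into STT
    try:
        await session.send_event({"type": "tts.speaking_start"})
        async for chunk in tts_client.synthesize(text):
            await session.send_audio(chunk)
        await session.send_event({"type": "tts.speaking_end"})
    finally:
        session.stt_muted = False  # always restored, even if synthesis raises
```

Without the finally, a TTS worker crash mid-synthesis would leave the session permanently muted and the client would see STT silently stop working.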

Key Design Decisions

| Decision                           | Rationale                                                     |
|------------------------------------|---------------------------------------------------------------|
| Single process, subprocess workers | Crash isolation without distributed-system complexity         |
| VAD in runtime, not engine         | Consistent behavior across all engines                        |
| Preprocessing before VAD           | Normalized audio ensures stable VAD thresholds                |
| Streaming bypasses Scheduler       | Direct gRPC connection avoids queue latency for real-time     |
| Mute-on-speak for full-duplex      | Prevents TTS audio from feeding back into STT                 |
| Pipeline adapts by architecture    | Encoder-decoder gets LocalAgreement; CTC uses native partials |
| ITN on finals only                 | Partials are unstable; ITN would produce confusing output     |
| In-memory WAL                      | Fast recovery without disk I/O overhead                       |
| gRPC stream break as heartbeat     | No separate health polling needed for crash detection         |