# Macaw OpenVoice > Open-source voice runtime for real-time Speech-to-Text and Text-to-Speech with OpenAI-compatible API, streaming session control, and extensible execution architecture. --- # Welcome to Macaw OpenVoice Macaw OpenVoice is an open-source **voice runtime** for real-time Speech-to-Text and Text-to-Speech with an OpenAI-compatible API, streaming session control, and extensible execution architecture. > Macaw is **not** a fork, wrapper, or thin layer on top of existing projects. It is the **runtime layer** that sits between inference engines and production -- handling session management, audio preprocessing, post-processing, scheduling, observability, and a unified CLI. --- ## Capabilities | Capability | Description | |---|---| | **OpenAI-Compatible API** | `POST /v1/audio/transcriptions`, `/translations`, `/speech` -- existing SDKs work out of the box | | **Real-Time Streaming** | Partial and final transcripts via WebSocket with sub-300ms TTFB | | **Full-Duplex** | Simultaneous STT + TTS on a single WebSocket with mute-on-speak safety | | **Multi-Engine** | Faster-Whisper (encoder-decoder), WeNet (CTC), Kokoro (TTS) through one interface | | **Session Manager** | 6-state machine, ring buffer, WAL-based crash recovery, backpressure control | | **Voice Activity Detection** | Silero VAD with energy pre-filter and configurable sensitivity levels | | **Audio Preprocessing** | Automatic resample, DC removal, and gain normalization to 16 kHz | | **Post-Processing** | Inverse Text Normalization via NeMo (e.g., "two thousand" becomes "2000") | | **Hot Words** | Domain-specific keyword boosting per session | | **CLI** | Ollama-style UX -- `macaw serve`, `macaw transcribe`, `macaw list`, `macaw pull` | | **Observability** | Prometheus metrics for TTFB, session duration, VAD events, TTS latency | --- ## Supported Engines | Engine | Type | Architecture | Streaming | Hot Words | |---|---|---|---|---| | [Faster-Whisper](https://github.com/SYSTRAN/faster-whisper) | STT | Encoder-Decoder | LocalAgreement | via `initial_prompt` | | [WeNet](https://github.com/wenet-e2e/wenet) | STT | CTC | Native partials | Native keyword boosting | | [Kokoro](https://github.com/hexgrad/kokoro) | TTS | Neural | Chunked streaming | -- | > **Tip Adding new engines** > Adding a new STT or TTS engine requires approximately 400-700 lines of code and **zero changes to the runtime core**. See the [Adding an Engine](guides/adding-engine) guide. --- ## How It Works ``` Clients (REST / WebSocket / CLI) | +-----------+-----------+ | API Server | | (FastAPI + Uvicorn) | +-----------+-----------+ | +-----------+-----------+ | Scheduler | | Priority . Batching | | Cancellation . TTFB | +-----+----------+------+ | | +--------+--+ +---+--------+ | STT Worker | | TTS Worker | | (gRPC) | | (gRPC) | +------------+ +------------+ | Faster- | | Kokoro | | Whisper | +------------+ | WeNet | +------------+ ``` Workers run as **isolated gRPC subprocesses**. If a worker crashes, the runtime recovers automatically via the WAL -- no data is lost, no segments are duplicated.
--- ## Quick Example ```bash title="Install and start" pip install macaw-openvoice[server,grpc,faster-whisper] macaw serve ``` ```bash title="Transcribe a file" curl -X POST http://localhost:8000/v1/audio/transcriptions \ -F file=@audio.wav \ -F model=faster-whisper-large-v3 ``` ```python title="Using the OpenAI SDK" from openai import OpenAI client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed") result = client.audio.transcriptions.create( model="faster-whisper-large-v3", file=open("audio.wav", "rb"), ) print(result.text) ``` --- ## Next Steps - **[Installation](getting-started/installation)** -- Set up Python, install Macaw, and configure your first engine - **[Quickstart](getting-started/quickstart)** -- Run your first transcription in under 5 minutes - **[Streaming STT](guides/streaming-stt)** -- Connect via WebSocket for real-time transcription - **[Full-Duplex](guides/full-duplex)** -- Build voice assistants with simultaneous STT and TTS - **[API Reference](api-reference/rest-api)** -- Complete endpoint documentation - **[Architecture](architecture/overview)** -- Understand how the runtime is structured --- ## Contact - **Website:** [usemacaw.io](https://usemacaw.io) - **Email:** [hello@usemacaw.io](mailto:hello@usemacaw.io) --- # Installation Macaw OpenVoice requires **Python 3.11+** and uses pip extras to install only the engines you need. --- ## Prerequisites | Requirement | Minimum | Recommended | |---|---|---| | Python | 3.11 | 3.12 | | pip | 21.0+ | latest | | OS | Linux, macOS | Linux (for GPU support) | | CUDA | Optional | 12.x (for GPU inference) | > **Info** > Macaw runs on CPU by default. GPU support depends on the engine -- Faster-Whisper uses CTranslate2 which supports CUDA out of the box. --- ## Install with pip The simplest way to get started: ```bash title="Minimal install (STT only)" pip install macaw-openvoice[server,grpc,faster-whisper] ``` ```bash title="Full install (STT + TTS + ITN)" pip install macaw-openvoice[server,grpc,faster-whisper,kokoro,itn] ``` ### Available Extras | Extra | What it adds | Size | |---|---|---| | `server` | FastAPI + Uvicorn (required for serving) | ~20 MB | | `grpc` | gRPC runtime for worker communication | ~15 MB | | `faster-whisper` | Faster-Whisper STT engine | ~100 MB | | `wenet` | WeNet CTC STT engine | ~80 MB | | `kokoro` | Kokoro TTS engine | ~50 MB | | `itn` | NeMo Inverse Text Normalization | ~200 MB | | `stream` | Microphone streaming via sounddevice | ~5 MB | | `dev` | Development tools (ruff, mypy, pytest) | ~50 MB | --- ## Install with uv (recommended for development) [uv](https://github.com/astral-sh/uv) is significantly faster than pip and handles virtual environments automatically: ```bash title="Create a virtual environment and install" uv venv --python 3.12 uv sync --all-extras ``` ```bash title="Activate the environment" source .venv/bin/activate ``` --- ## GPU Setup For GPU-accelerated inference with Faster-Whisper: 1. Install CUDA drivers for your GPU 2. Install the CUDA-enabled version of CTranslate2: ```bash pip install ctranslate2 ``` > **Warning** > Ensure your CUDA version matches the CTranslate2 build. Check compatibility at the [CTranslate2 releases page](https://github.com/OpenNMT/CTranslate2/releases). --- ## Verify Installation ```bash title="Check that Macaw is installed correctly" macaw --help ``` You should see: ``` Usage: macaw [OPTIONS] COMMAND [ARGS]... 
Macaw OpenVoice CLI Commands: serve Start the API server transcribe Transcribe an audio file translate Translate audio to English list List installed models pull Download a model inspect Show model details ``` --- ## Next Steps - **[Quickstart](quickstart)** -- Run your first transcription - **[Configuration](configuration)** -- Customize runtime settings --- # Quickstart Get from zero to your first transcription in under 5 minutes. --- ## Step 1: Install ```bash pip install macaw-openvoice[server,grpc,faster-whisper] ``` > **Tip** > If you plan to use TTS as well, add the `kokoro` extra: > ```bash > pip install macaw-openvoice[server,grpc,faster-whisper,kokoro] > ``` --- ## Step 2: Start the Server ```bash macaw serve ``` You should see output like this: ``` ╔═══════════════════════════════════════╗ ║ Macaw OpenVoice v1.0.0 ║ ╚═══════════════════════════════════════╝ INFO Scanning models in ~/.macaw/models INFO Found 2 model(s): faster-whisper-tiny (STT), kokoro-v1 (TTS) INFO Spawning STT worker port=50051 engine=faster-whisper INFO Spawning TTS worker port=50052 engine=kokoro INFO Scheduler started aging=30.0s batch_ms=75.0 batch_max=8 INFO Uvicorn running on http://127.0.0.1:8000 ``` The server is now ready to accept requests on port **8000**. --- ## Step 3: Transcribe Audio ### Via REST API (curl) ```bash curl -X POST http://localhost:8000/v1/audio/transcriptions \ -F file=@audio.wav \ -F model=faster-whisper-large-v3 ``` **Response:** ```json { "text": "Hello, how can I help you today?" } ``` ### Via CLI ```bash macaw transcribe audio.wav --model faster-whisper-large-v3 ``` ### Via OpenAI Python SDK ```python from openai import OpenAI client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed") result = client.audio.transcriptions.create( model="faster-whisper-large-v3", file=open("audio.wav", "rb"), ) print(result.text) ``` > **Info** > Macaw implements the OpenAI Audio API contract, so any OpenAI-compatible client library works without modification. Just change the `base_url`. --- ## Step 4: Try Real-Time Streaming Connect via WebSocket for live transcription: ```bash wscat -c "ws://localhost:8000/v1/realtime?model=faster-whisper-large-v3" ``` Send binary PCM audio frames and receive JSON transcript events: ```json {"type": "transcript.partial", "text": "Hello how"} {"type": "transcript.final", "text": "Hello, how can I help you today?"} ``` See the [Streaming STT guide](../guides/streaming-stt) for the full protocol. --- ## Step 5: Text-to-Speech Generate speech from text: ```bash curl -X POST http://localhost:8000/v1/audio/speech \ -H "Content-Type: application/json" \ -d '{"model": "kokoro-v1", "input": "Hello, welcome to Macaw!", "voice": "default"}' \ --output speech.wav ``` Or use the OpenAI SDK: ```python response = client.audio.speech.create( model="kokoro-v1", input="Hello, welcome to Macaw!", voice="default", ) response.stream_to_file("output.wav") ``` --- ## What's Next? | Want to... | Read... 
| |---|---| | Stream audio in real time | [Streaming STT](../guides/streaming-stt) | | Build a voice assistant with STT + TTS | [Full-Duplex Guide](../guides/full-duplex) | | Transcribe files in batch | [Batch Transcription](../guides/batch-transcription) | | Understand the architecture | [Architecture Overview](../architecture/overview) | | Add a new engine | [Adding an Engine](../guides/adding-engine) | | Use the CLI | [CLI Reference](../guides/cli) | --- # Configuration Macaw OpenVoice uses a combination of model manifests, runtime defaults, and environment variables for configuration. --- ## Model Manifests Each engine model is described by a `macaw.yaml` manifest file. This file declares the model's capabilities and how the runtime should interact with it. ```yaml title="Example: macaw.yaml for Faster-Whisper" name: faster-whisper-large-v3 type: stt engine: faster-whisper architecture: encoder-decoder languages: - en - pt - es options: beam_size: 5 vad_filter: false # VAD is handled by the runtime, not the engine word_timestamps: false ``` ### Key Fields | Field | Type | Description | |---|---|---| | `name` | string | Unique model identifier | | `type` | string | `stt` or `tts` | | `engine` | string | Engine backend (`faster-whisper`, `wenet`, `kokoro`) | | `architecture` | string | `encoder-decoder`, `ctc`, or `streaming-native` | | `languages` | list | Supported language codes | | `options` | dict | Engine-specific configuration | > **Warning** > Always set `vad_filter: false` in your manifest. The VAD is managed by the Macaw runtime -- enabling the engine's internal VAD would duplicate the work and cause unpredictable behavior. --- ## Runtime Configuration Runtime behavior is controlled through server startup options: ```bash title="Start with custom settings" macaw serve --host 0.0.0.0 --port 8000 ``` ### Server Options | Option | Default | Description | |---|---|---| | `--host` | `127.0.0.1` | Bind address | | `--port` | `8000` | HTTP port | | `--workers` | `1` | Uvicorn workers | ### Scheduler Settings The scheduler manages request prioritization and batching: | Setting | Default | Description | |---|---|---| | Aging timeout | `30.0s` | Max time a request waits in queue | | Batch window | `75ms` | Time window to accumulate batch requests | | Batch max size | `8` | Maximum requests per batch | > **Tip** > Streaming WebSocket requests bypass the scheduler entirely -- they use a direct gRPC streaming connection for minimum latency. --- ## Pipeline Configuration ### Preprocessing The audio preprocessing pipeline runs **before** VAD and is not configurable per-request -- it ensures all audio reaches the VAD and engine in a consistent format: 1. **Resample** to 16 kHz mono 2. **DC removal** (high-pass filter) 3. **Gain normalization** ### VAD Settings VAD can be configured per WebSocket session via the `session.configure` command: ```json title="WebSocket session configuration" { "type": "session.configure", "vad": { "sensitivity": "normal" }, "language": "en", "hot_words": ["Macaw", "OpenVoice"] } ``` | VAD Setting | Options | Description | |---|---|---| | `sensitivity` | `high`, `normal`, `low` | Controls speech detection threshold | ### Post-Processing (ITN) Inverse Text Normalization converts spoken numbers and patterns to their written form. It is applied **only to final transcripts**, never to partials. 
| Input | Output | |---|---| | "two thousand twenty six" | "2026" | | "ten dollars and fifty cents" | "$10.50" | | "one two three four" | "1234" | > **Info** > ITN requires the `itn` extra: `pip install macaw-openvoice[itn]`. If not installed, transcripts are returned as-is (fail-open behavior). --- ## Environment Variables | Variable | Description | Default | |---|---|---| | `MACAW_MODELS_DIR` | Directory for model files | `~/.macaw/models` | | `MACAW_LOG_LEVEL` | Logging level | `INFO` | | `MACAW_STT_PORT` | gRPC port for STT worker | `50051` | | `MACAW_TTS_PORT` | gRPC port for TTS worker | `50052` | --- ## Next Steps - **[Architecture Overview](../architecture/overview)** -- Understand the runtime design - **[Adding an Engine](../guides/adding-engine)** -- Add custom STT or TTS engines --- # Supported Models Macaw OpenVoice is **engine-agnostic** — it supports multiple STT and TTS engines through a unified backend interface. Each engine runs as an isolated gRPC subprocess, and the runtime adapts its pipeline automatically based on the model's architecture. ## Model Catalog These are the official models available via `macaw pull`: ### STT Models | Model | Engine | Architecture | Memory | GPU | Languages | Translation | |-------|--------|:---:|:---:|:---:|-----------|:---:| | [`faster-whisper-large-v3`](/docs/models/faster-whisper#large-v3) | Faster-Whisper | encoder-decoder | 3,072 MB | Recommended | 100+ (auto-detect) | Yes | | [`faster-whisper-medium`](/docs/models/faster-whisper#medium) | Faster-Whisper | encoder-decoder | 1,536 MB | Recommended | 100+ (auto-detect) | Yes | | [`faster-whisper-small`](/docs/models/faster-whisper#small) | Faster-Whisper | encoder-decoder | 512 MB | Optional | 100+ (auto-detect) | Yes | | [`faster-whisper-tiny`](/docs/models/faster-whisper#tiny) | Faster-Whisper | encoder-decoder | 256 MB | Optional | 100+ (auto-detect) | Yes | | [`distil-whisper-large-v3`](/docs/models/faster-whisper#distil-large-v3) | Faster-Whisper | encoder-decoder | 1,536 MB | Recommended | English only | No | ### TTS Models | Model | Engine | Memory | GPU | Languages | Default Voice | |-------|--------|:---:|:---:|-----------|---------------| | [`kokoro-v1`](/docs/models/kokoro) | Kokoro | 512 MB | Recommended | 9 languages | `af_heart` | ### VAD (Internal) | Model | Purpose | Memory | GPU | Cost | |-------|---------|:---:|:---:|:---:| | [Silero VAD](/docs/models/silero-vad) | Voice Activity Detection | ~50 MB | Not needed | ~2ms/frame | > **Info WeNet — bring your own model** > [WeNet](/docs/models/wenet) is a supported engine but has no pre-configured models in the catalog. You provide your own WeNet model and create a `macaw.yaml` manifest for it. ## Quick Install ```bash title="Install a model from the catalog" macaw pull faster-whisper-large-v3 ``` ```bash title="List installed models" macaw list ``` ```bash title="Inspect model details" macaw inspect faster-whisper-large-v3 ``` ```bash title="Remove a model" macaw remove faster-whisper-large-v3 ``` Models are downloaded from HuggingFace Hub and stored in `~/.macaw/models/` by default. 
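If you want to see what is installed without shelling out to `macaw list`, the manifests can be read directly. The following is a minimal sketch, assuming the layout used throughout these docs (one directory per model under `~/.macaw/models/`, each containing a `macaw.yaml` manifest) and that PyYAML is available; `MACAW_MODELS_DIR` is the environment variable documented in [Configuration](/docs/getting-started/configuration).

```python title="Sketch: list installed models by reading manifests"
import os
from pathlib import Path

import yaml  # PyYAML, assumed to be available

# Default location documented above; override with MACAW_MODELS_DIR.
models_dir = Path(os.environ.get("MACAW_MODELS_DIR", "~/.macaw/models")).expanduser()

for manifest_path in sorted(models_dir.glob("*/macaw.yaml")):
    manifest = yaml.safe_load(manifest_path.read_text())
    # name, type, and engine are standard manifest fields.
    print(f"{manifest['name']:<32} {manifest['type']:<4} {manifest['engine']}")
```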
## Engine Comparison ### STT Engines | Feature | Faster-Whisper | WeNet | |---------|:-:|:-:| | Architecture | Encoder-decoder | CTC | | Streaming partials | Via LocalAgreement | Native | | Hot words | Via `initial_prompt` workaround | Native keyword boosting | | Cross-segment context | Yes (224 tokens) | No | | Language detection | Yes | No | | Translation | Yes (to English) | No | | Word timestamps | Yes | Yes | | Batch inference | Yes | Yes | | Best for | Accuracy, multilingual | Low latency, Chinese | ### How Architecture Affects the Pipeline The `architecture` field in the model manifest tells the runtime how to adapt its streaming pipeline: | | Encoder-Decoder | CTC | Streaming-Native | |---|:-:|:-:|:-:| | **LocalAgreement** | Yes — confirms tokens across multiple inference passes | No | No | | **Cross-segment context** | Yes — 224 tokens from previous final as `initial_prompt` | No | No | | **Native partials** | No — runtime generates partials via LocalAgreement | Yes | Yes | | **Accumulation** | 5s chunks before inference | Frame-by-frame (160ms minimum) | Frame-by-frame | | **Example** | Faster-Whisper | WeNet | Paraformer (future) | > **Tip Choosing a model** > - **Best accuracy**: `faster-whisper-large-v3` — highest quality, 100+ languages > - **Best speed/accuracy trade-off**: `faster-whisper-small` — runs on CPU, good quality > - **Fastest startup**: `faster-whisper-tiny` — 256 MB, loads in ~2s > - **English only, fast**: `distil-whisper-large-v3` — 6x faster than large-v3, ~1% WER gap > - **Low-latency streaming**: WeNet (CTC) — frame-by-frame native partials > - **Chinese focus**: WeNet — optimized for Chinese with native hot word support ## Model Manifest Every model has a `macaw.yaml` manifest that describes its capabilities, resource requirements, and engine configuration. See [Configuration](/docs/getting-started/configuration) for the full manifest format. ```yaml title="Example: macaw.yaml" name: faster-whisper-large-v3 version: "1.0.0" type: stt engine: faster-whisper capabilities: architecture: encoder-decoder streaming: true languages: ["auto", "en", "pt", "es", "ja", "zh"] word_timestamps: true translation: true partial_transcripts: true hot_words: false batch_inference: true language_detection: true initial_prompt: true resources: memory_mb: 3072 gpu_required: false gpu_recommended: true load_time_seconds: 8 engine_config: model_size: "large-v3" compute_type: "float16" device: "auto" beam_size: 5 vad_filter: false ``` ## Dependencies Each engine has its own optional dependency group. Install only what you need: | Extra | Command | What It Installs | |-------|---------|-----------------| | `faster-whisper` | `pip install macaw-openvoice[faster-whisper]` | `faster-whisper>=1.1,<2.0` | | `wenet` | `pip install macaw-openvoice[wenet]` | `wenet>=2.0,<3.0` | | `kokoro` | `pip install macaw-openvoice[kokoro]` | `kokoro>=0.1,<1.0` | | `huggingface` | `pip install macaw-openvoice[huggingface]` | `huggingface_hub>=0.20,<1.0` | | `itn` | `pip install macaw-openvoice[itn]` | `nemo_text_processing>=1.1,<2.0` | ```bash title="Install everything for a typical deployment" pip install macaw-openvoice[server,grpc,faster-whisper,kokoro,huggingface] ``` ## Adding Your Own Engine Macaw is designed to make adding new engines straightforward — approximately 400-700 lines of code with zero changes to the runtime core. See the [Adding an Engine](/docs/guides/adding-engine) guide. --- # Faster-Whisper Faster-Whisper is the primary STT engine in Macaw OpenVoice. 
It provides high-accuracy multilingual speech recognition using CTranslate2-optimized Whisper models. Five model variants are available in the official catalog, covering use cases from lightweight testing to production-grade transcription. ## Installation ```bash pip install macaw-openvoice[faster-whisper] ``` This installs `faster-whisper>=1.1,<2.0` as an optional dependency. ## Architecture Faster-Whisper uses the **encoder-decoder** architecture (based on OpenAI Whisper). This means the runtime adapts the streaming pipeline with: - **LocalAgreement** — confirms tokens across multiple inference passes before emitting partials - **Cross-segment context** — passes up to 224 tokens from the previous final transcript as `initial_prompt` to maintain continuity across segments - **Accumulation** — audio is buffered for ~5 seconds before each inference pass (not frame-by-frame) ``` Audio → [5s accumulation] → Inference → LocalAgreement → Partial/Final ↑ initial_prompt (224 tokens from previous final) ``` > **Info Why accumulation?** > Encoder-decoder models like Whisper process audio in fixed-length windows. Sending tiny chunks would produce poor results. The 5-second accumulation threshold balances latency with transcription quality. ## Model Variants ### large-v3 {#large-v3} The highest quality Faster-Whisper model. Best accuracy across 100+ languages. | Property | Value | |----------|-------| | Catalog name | `faster-whisper-large-v3` | | HuggingFace repo | `Systran/faster-whisper-large-v3` | | Memory | 3,072 MB | | GPU | Recommended | | Load time | ~8 seconds | | Languages | 100+ (auto-detect) | | Translation | Yes (any → English) | ```bash macaw pull faster-whisper-large-v3 ``` **Best for:** Production workloads where accuracy matters most. Multilingual support. Translation tasks. ### medium {#medium} Good balance between quality and speed. Suitable for production with GPU. | Property | Value | |----------|-------| | Catalog name | `faster-whisper-medium` | | HuggingFace repo | `Systran/faster-whisper-medium` | | Memory | 1,536 MB | | GPU | Recommended | | Load time | ~5 seconds | | Languages | 100+ (auto-detect) | | Translation | Yes (any → English) | ```bash macaw pull faster-whisper-medium ``` **Best for:** Production with moderate GPU resources. Near large-v3 quality at lower cost. ### small {#small} Lightweight model that runs well on CPU. Good quality for common languages. | Property | Value | |----------|-------| | Catalog name | `faster-whisper-small` | | HuggingFace repo | `Systran/faster-whisper-small` | | Memory | 512 MB | | GPU | Optional | | Load time | ~3 seconds | | Languages | 100+ (auto-detect) | | Translation | Yes (any → English) | ```bash macaw pull faster-whisper-small ``` **Best for:** CPU-only deployments. Development and staging environments. Good speed/accuracy trade-off. ### tiny {#tiny} Ultra-lightweight model for testing and prototyping. Fastest to load. | Property | Value | |----------|-------| | Catalog name | `faster-whisper-tiny` | | HuggingFace repo | `Systran/faster-whisper-tiny` | | Memory | 256 MB | | GPU | Optional | | Load time | ~2 seconds | | Languages | 100+ (auto-detect) | | Translation | Yes (any → English) | ```bash macaw pull faster-whisper-tiny ``` **Best for:** Quick testing and prototyping. CI/CD pipelines. Environments with minimal resources. ### distil-large-v3 {#distil-large-v3} Distilled version of large-v3. Approximately 6x faster with only ~1% WER gap. English only. 
| Property | Value | |----------|-------| | Catalog name | `distil-whisper-large-v3` | | HuggingFace repo | `Systran/faster-distil-whisper-large-v3` | | Memory | 1,536 MB | | GPU | Recommended | | Load time | ~5 seconds | | Languages | English only | | Translation | No | ```bash macaw pull distil-whisper-large-v3 ``` **Best for:** English-only production workloads where speed matters. High-throughput transcription. When large-v3 is too slow but you want near-equal quality. ## Capabilities | Capability | Supported | Notes | |------------|:---------:|-------| | Streaming | Yes | 5s accumulation threshold | | Batch inference | Yes | Via `POST /v1/audio/transcriptions` | | Word timestamps | Yes | Per-word start/end/probability | | Language detection | Yes | Automatic when `language` is `"auto"` or omitted | | Translation | Yes | Any language → English (except distil-large-v3) | | Initial prompt | Yes | Context string to guide transcription | | Hot words | No | Workaround via `initial_prompt` prefix | | Partial transcripts | Yes | Via LocalAgreement in streaming mode | ### Hot Words Workaround Faster-Whisper does not support native keyword boosting. However, Macaw provides a workaround by prepending hot words to the `initial_prompt`: ```python # In the backend, hot_words are converted to an initial_prompt prefix: # hot_words=["Macaw", "OpenVoice"] → initial_prompt="Terms: Macaw, OpenVoice." ``` This biases the model toward recognizing these terms but is less reliable than native hot word support (see [WeNet](./wenet) for native hot words). ### Language Handling | Input | Behavior | |-------|----------| | `"auto"` | Auto-detect language (passed as `None` to Faster-Whisper) | | `"mixed"` | Auto-detect language (same as `"auto"`) | | `"en"`, `"pt"`, etc. | Force specific language | | Omitted | Auto-detect | The model supports 100+ languages. The catalog manifests list `["auto", "en", "pt", "es", "ja", "zh"]` as common examples, but all Whisper-supported languages work. ## Engine Configuration The `engine_config` section in the model manifest controls Faster-Whisper behavior: ```yaml title="engine_config defaults" engine_config: model_size: "large-v3" # Model size or path compute_type: "float16" # float16, int8, int8_float16, float32 device: "auto" # "auto", "cpu", "cuda" beam_size: 5 # Beam search width vad_filter: false # Always false — VAD is handled by the runtime ``` | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | `model_size` | string | (from catalog) | Whisper model size or path to model directory | | `compute_type` | string | `"float16"` | Quantization type. `float16` for GPU, `int8` for CPU | | `device` | string | `"auto"` | Inference device. `"auto"` selects GPU if available | | `beam_size` | int | `5` | Beam search width. Higher = more accurate, slower | | `vad_filter` | bool | `false` | Internal VAD filter. **Always `false`** — Macaw handles VAD | > **Warning Never set `vad_filter: true`** > The Macaw runtime runs its own VAD pipeline (energy pre-filter + Silero VAD). Enabling the internal Faster-Whisper VAD filter would duplicate the work and produce inconsistent behavior. 
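Returning to the hot words workaround above: for batch requests, the client can apply the same bias through the OpenAI-compatible `prompt` parameter on `POST /v1/audio/transcriptions` (documented under Batch Transcription). A minimal sketch, using the `Terms:` phrasing shown in the workaround section; the exact wording of the hint is not prescribed:

```python title="Sketch: biasing batch transcription toward domain terms"
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

hot_words = ["Macaw", "OpenVoice", "gRPC"]

with open("audio.wav", "rb") as f:
    result = client.audio.transcriptions.create(
        model="faster-whisper-large-v3",
        file=f,
        # Prepend domain terms as a context hint, mirroring the
        # hot-words-to-initial_prompt conversion described above.
        prompt="Terms: " + ", ".join(hot_words) + ".",
    )

print(result.text)
```

This biases recognition toward the listed terms, with the same caveat as above: it is less reliable than native keyword boosting.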
## Usage Examples ### Batch Transcription (REST API) ```bash title="Transcribe a file" curl -X POST http://localhost:8000/v1/audio/transcriptions \ -F "file=@meeting.wav" \ -F "model=faster-whisper-large-v3" \ -F "language=auto" \ -F "response_format=verbose_json" ``` ```bash title="Translate to English" curl -X POST http://localhost:8000/v1/audio/translations \ -F "file=@audio_pt.wav" \ -F "model=faster-whisper-large-v3" ``` ### Streaming (WebSocket) ```python title="Python WebSocket client" import asyncio import json import websockets async def stream_audio(): uri = "ws://localhost:8000/v1/realtime" async with websockets.connect(uri) as ws: # Configure session await ws.send(json.dumps({ "type": "session.configure", "model": "faster-whisper-large-v3", "language": "auto" })) # Stream audio chunks (16-bit PCM, 16kHz, mono) with open("audio.raw", "rb") as f: while chunk := f.read(3200): # 100ms chunks await ws.send(chunk) # Check for transcripts try: msg = await asyncio.wait_for(ws.recv(), timeout=0.1) event = json.loads(msg) if event["type"] == "transcript.partial": print(f" ...{event['text']}") elif event["type"] == "transcript.final": print(f" >> {event['text']}") except asyncio.TimeoutError: pass asyncio.run(stream_audio()) ``` ### CLI ```bash title="Transcribe via CLI" macaw transcribe meeting.wav --model faster-whisper-large-v3 # With word timestamps macaw transcribe meeting.wav --model faster-whisper-large-v3 --word-timestamps ``` ## Comparison of Variants | | tiny | small | medium | large-v3 | distil-large-v3 | |---|:-:|:-:|:-:|:-:|:-:| | **Memory** | 256 MB | 512 MB | 1,536 MB | 3,072 MB | 1,536 MB | | **GPU needed** | No | No | Recommended | Recommended | Recommended | | **Load time** | ~2s | ~3s | ~5s | ~8s | ~5s | | **Languages** | 100+ | 100+ | 100+ | 100+ | English only | | **Translation** | Yes | Yes | Yes | Yes | No | | **Relative speed** | Fastest | Fast | Moderate | Slowest | ~6x faster than large-v3 | | **Best for** | Testing | CPU deploy | Balanced | Max accuracy | Fast English | ## Manifest Reference Every Faster-Whisper model in the catalog uses the same manifest structure. Here is the full manifest for `faster-whisper-large-v3`: ```yaml title="macaw.yaml (faster-whisper-large-v3)" name: faster-whisper-large-v3 version: "3.0.0" engine: faster-whisper type: stt description: "Faster Whisper Large V3 - encoder-decoder STT" capabilities: streaming: true architecture: encoder-decoder languages: ["auto", "en", "pt", "es", "ja", "zh"] word_timestamps: true translation: true partial_transcripts: true hot_words: false batch_inference: true language_detection: true initial_prompt: true resources: memory_mb: 3072 gpu_required: false gpu_recommended: true load_time_seconds: 8 engine_config: model_size: "large-v3" compute_type: "float16" device: "auto" beam_size: 5 vad_filter: false ``` --- # WeNet WeNet is a CTC-based STT engine optimized for low-latency streaming and Chinese speech recognition. Unlike Faster-Whisper, WeNet produces **native partial transcripts** frame-by-frame without requiring LocalAgreement, making it ideal for real-time applications where latency is critical. > **Info Bring your own model** > WeNet has **no pre-configured models** in the Macaw catalog. You provide your own WeNet model and create a `macaw.yaml` manifest for it. See [Creating a Manifest](#creating-a-manifest) below. ## Installation ```bash pip install macaw-openvoice[wenet] ``` This installs `wenet>=2.0,<3.0` as an optional dependency. 
## Architecture WeNet uses the **CTC (Connectionist Temporal Classification)** architecture. This means the runtime adapts the streaming pipeline with: - **No LocalAgreement** — CTC produces native partial transcripts directly - **No cross-segment context** — CTC does not support `initial_prompt` conditioning - **No accumulation** — each chunk is processed immediately (frame-by-frame, minimum 160ms) ``` Audio → [immediate processing] → Native CTC partials → Final (160ms minimum) ``` ### Faster-Whisper vs. WeNet Streaming | Behavior | Faster-Whisper (Encoder-Decoder) | WeNet (CTC) | |----------|:-:|:-:| | Audio buffering | 5s accumulation | Frame-by-frame (160ms min) | | Partial generation | Via LocalAgreement | Native | | Cross-segment context | 224 tokens via initial_prompt | Not supported | | First partial latency | ~5 seconds | ~160 milliseconds | | Best for | Accuracy | Low latency | ## Capabilities | Capability | Supported | Notes | |------------|:---------:|-------| | Streaming | Yes | Native frame-by-frame partials | | Batch inference | Yes | Via `POST /v1/audio/transcriptions` | | Word timestamps | Yes | From token-level output | | Language detection | No | Language is fixed per model | | Translation | No | | | Initial prompt | No | CTC does not support conditioning | | Hot words | Yes | Native keyword boosting via context biasing | | Partial transcripts | Yes | Native CTC partials | ### Native Hot Words WeNet supports native keyword boosting (context biasing), unlike Faster-Whisper which uses an `initial_prompt` workaround. This makes hot word recognition more reliable for domain-specific vocabulary: ```json title="WebSocket session.configure" { "type": "session.configure", "model": "my-wenet-model", "hot_words": ["CPF", "CNPJ", "PIX"] } ``` ### Language Handling | Input | Behavior | |-------|----------| | `"auto"` | Falls back to `"zh"` (Chinese) | | `"mixed"` | Falls back to `"zh"` (Chinese) | | `"zh"`, `"en"`, etc. | Uses the specified language | | Omitted | Falls back to `"zh"` | WeNet models are typically trained for a specific language (most commonly Chinese). The `language` parameter is informational — the model always uses the language it was trained for. ### Device Handling | Input | Behavior | |-------|----------| | `"auto"` | Maps to `"cpu"` | | `"cpu"` | CPU inference | | `"cuda"` | GPU inference | | `"cuda:0"` | Specific GPU | > **Tip** > Unlike Faster-Whisper where `"auto"` selects GPU if available, WeNet's `"auto"` always maps to `"cpu"`. Explicitly set `device: "cuda"` if you want GPU inference. 
## Creating a Manifest Since WeNet has no catalog entries, you must create a `macaw.yaml` manifest manually in your model directory: ```yaml title="~/.macaw/models/my-wenet-model/macaw.yaml" name: my-wenet-model version: "1.0.0" engine: wenet type: stt description: "Custom WeNet CTC model for Mandarin" capabilities: streaming: true architecture: ctc languages: ["zh"] word_timestamps: true translation: false partial_transcripts: true hot_words: true batch_inference: true language_detection: false initial_prompt: false resources: memory_mb: 512 gpu_required: false gpu_recommended: false load_time_seconds: 3 engine_config: language: "chinese" device: "cpu" ``` ### Manifest Fields for WeNet | Field | Required | Description | |-------|:--------:|-------------| | `capabilities.architecture` | Yes | Must be `ctc` | | `capabilities.hot_words` | Yes | Set to `true` — WeNet supports native hot words | | `capabilities.initial_prompt` | Yes | Must be `false` — CTC does not support conditioning | | `capabilities.translation` | Yes | Must be `false` — WeNet does not translate | | `capabilities.language_detection` | Yes | Must be `false` — WeNet does not auto-detect language | | `engine_config.language` | No | Default language for the model (default: `"chinese"`) | | `engine_config.device` | No | Inference device (default: `"cpu"`) | ## Setting Up a WeNet Model 1. **Download or train a WeNet model** — obtain a model directory with the required files (model weights, config, etc.) 2. **Create the model directory:** ```bash mkdir -p ~/.macaw/models/my-wenet-model ``` 3. **Copy model files** into the directory 4. **Create the manifest:** ```bash title="Create macaw.yaml" cat > ~/.macaw/models/my-wenet-model/macaw.yaml << 'EOF' name: my-wenet-model version: "1.0.0" engine: wenet type: stt description: "Custom WeNet model" capabilities: streaming: true architecture: ctc languages: ["zh"] word_timestamps: true translation: false partial_transcripts: true hot_words: true batch_inference: true language_detection: false initial_prompt: false resources: memory_mb: 512 gpu_required: false gpu_recommended: false load_time_seconds: 3 engine_config: language: "chinese" device: "cpu" EOF ``` 5. **Verify the model is detected:** ```bash macaw list # Should show: my-wenet-model wenet stt ctc ``` 6. 
**Test transcription:** ```bash macaw transcribe audio_zh.wav --model my-wenet-model ``` ## Usage Examples ### Batch Transcription ```bash title="Transcribe a Chinese audio file" curl -X POST http://localhost:8000/v1/audio/transcriptions \ -F "file=@audio_zh.wav" \ -F "model=my-wenet-model" ``` ### Streaming (WebSocket) ```python title="Low-latency streaming with WeNet" import asyncio import json import websockets async def stream_low_latency(): uri = "ws://localhost:8000/v1/realtime" async with websockets.connect(uri) as ws: # Configure with WeNet model and hot words await ws.send(json.dumps({ "type": "session.configure", "model": "my-wenet-model", "hot_words": ["CPF", "CNPJ", "PIX"] })) # Stream audio — partials arrive within ~160ms with open("audio.raw", "rb") as f: while chunk := f.read(3200): # 100ms chunks await ws.send(chunk) try: msg = await asyncio.wait_for(ws.recv(), timeout=0.05) event = json.loads(msg) if event["type"] == "transcript.partial": print(f" ...{event['text']}") elif event["type"] == "transcript.final": print(f" >> {event['text']}") except asyncio.TimeoutError: pass asyncio.run(stream_low_latency()) ``` ## Engine Configuration Reference | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | `language` | string | `"chinese"` | Model language (informational) | | `device` | string | `"cpu"` | Inference device (`"cpu"`, `"cuda"`, `"auto"` → `"cpu"`) | ## When to Choose WeNet **Choose WeNet when:** - You need the lowest possible latency for streaming (partials in ~160ms vs ~5s for Faster-Whisper) - Your application is Chinese-focused - You need reliable native hot word support for domain-specific vocabulary - You have your own trained WeNet model **Choose Faster-Whisper instead when:** - You need multilingual support (100+ languages) - You need translation capabilities - You want ready-to-use catalog models (no manual setup) - Accuracy is more important than latency --- # Kokoro TTS Kokoro is a lightweight neural text-to-speech engine with 82 million parameters. It supports 9 languages, multiple voices, and produces high-quality 24kHz audio. Kokoro is the default TTS engine in Macaw OpenVoice, used for full-duplex voice interactions via WebSocket and REST speech synthesis. ## Installation ```bash pip install macaw-openvoice[kokoro] ``` This installs `kokoro>=0.1,<1.0` as an optional dependency. ## Model | Property | Value | |----------|-------| | Catalog name | `kokoro-v1` | | HuggingFace repo | `hexgrad/Kokoro-82M` | | Parameters | 82M | | Memory | 512 MB | | GPU | Recommended | | Load time | ~3 seconds | | Output sample rate | 24,000 Hz | | Output format | 16-bit PCM | | Chunk size | 4,096 bytes (~85ms at 24kHz) | | API version | Kokoro v0.9.4 | ```bash macaw pull kokoro-v1 ``` ## Languages Kokoro supports 9 languages, identified by a single-character prefix in the voice name: | Prefix | Language | Example Voice | |:------:|----------|---------------| | `a` | English (American) | `af_heart` | | `b` | English (British) | `bf_emma` | | `e` | Spanish | `ef_dora` | | `f` | French | `ff_siwis` | | `h` | Hindi | `hf_alpha` | | `i` | Italian | `if_sara` | | `j` | Japanese | `jf_alpha` | | `p` | Portuguese | `pf_dora` | | `z` | Chinese | `zf_xiaobei` | The language is selected automatically based on the voice prefix. When loading the model, the `lang_code` in `engine_config` sets the default pipeline language. 
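Because the language is inferred from the voice prefix, selecting a language in practice means selecting a voice. The mapping below is illustrative only, using the example voices from the table above; the voices actually available depend on which `.pt` files ship with your `kokoro-v1` installation.

```python title="Sketch: choosing an example voice per language"
# Illustrative mapping from language to an example voice, taken from the
# prefix table above. Availability depends on the installed voice files.
EXAMPLE_VOICE_BY_LANGUAGE = {
    "en-US": "af_heart",   # a = English (American)
    "en-GB": "bf_emma",    # b = English (British)
    "es": "ef_dora",       # e = Spanish
    "fr": "ff_siwis",      # f = French
    "hi": "hf_alpha",      # h = Hindi
    "it": "if_sara",       # i = Italian
    "ja": "jf_alpha",      # j = Japanese
    "pt": "pf_dora",       # p = Portuguese
    "zh": "zf_xiaobei",    # z = Chinese
}


def voice_for(language: str) -> str:
    """Return an example voice for a language, falling back to the default voice."""
    return EXAMPLE_VOICE_BY_LANGUAGE.get(language, "af_heart")
```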
## Voices ### Naming Convention Kokoro voices follow a strict naming pattern: ``` [language][gender][_name] ``` | Position | Meaning | Values | |----------|---------|--------| | 1st character | Language prefix | `a`, `b`, `e`, `f`, `h`, `i`, `j`, `p`, `z` | | 2nd character | Gender | `f` = female, `m` = male | | Rest | Voice name | Unique identifier (e.g., `_heart`, `_emma`) | **Examples:** - `af_heart` — American English, female, "heart" - `bm_george` — British English, male, "george" - `pf_dora` — Portuguese, female, "dora" - `jf_alpha` — Japanese, female, "alpha" ### Default Voice The default voice is **`af_heart`** (American English, female). This is used when: - `voice` is set to `"default"` in the API request - No voice is specified ### Voice Resolution When a voice is requested, Kokoro resolves it in this order: 1. `"default"` → uses the configured `default_voice` (default: `af_heart`) 2. Simple name (e.g., `"af_heart"`) → looks for `af_heart.pt` in the model's `voices/` directory 3. Absolute path or `.pt` extension → uses as-is Voice files are `.pt` (PyTorch) files stored in the `voices/` subdirectory of the model path. ### Voice Discovery The backend scans the model's `voices/` directory for `*.pt` files to discover available voices. Each `.pt` file becomes a selectable voice: ```bash title="List available voices" ls ~/.macaw/models/kokoro-v1/voices/ # af_heart.pt af_sky.pt am_adam.pt bf_emma.pt ... ``` ## Usage Examples ### REST API — Speech Synthesis ```bash title="Generate speech (WAV)" curl -X POST http://localhost:8000/v1/audio/speech \ -H "Content-Type: application/json" \ -d '{ "model": "kokoro-v1", "input": "Hello! Welcome to Macaw OpenVoice.", "voice": "af_heart" }' \ --output speech.wav ``` ```bash title="Generate speech (raw PCM)" curl -X POST http://localhost:8000/v1/audio/speech \ -H "Content-Type: application/json" \ -d '{ "model": "kokoro-v1", "input": "Olá! Bem-vindo ao Macaw OpenVoice.", "voice": "pf_dora", "response_format": "pcm" }' \ --output speech.raw ``` ```bash title="Adjust speed" curl -X POST http://localhost:8000/v1/audio/speech \ -H "Content-Type: application/json" \ -d '{ "model": "kokoro-v1", "input": "Speaking slowly for clarity.", "voice": "af_heart", "speed": 0.8 }' \ --output slow.wav ``` ### Full-Duplex (WebSocket) In a full-duplex WebSocket session, TTS is triggered via the `tts.speak` command: ```python title="Full-duplex voice assistant" import asyncio import json import websockets async def voice_assistant(): uri = "ws://localhost:8000/v1/realtime" async with websockets.connect(uri) as ws: # Configure session with STT + TTS models await ws.send(json.dumps({ "type": "session.configure", "model": "faster-whisper-large-v3", "tts_model": "kokoro-v1", "tts_voice": "af_heart" })) # When user says something and you get a final transcript... # Send TTS response: await ws.send(json.dumps({ "type": "tts.speak", "text": "I heard you! Let me help with that." })) # Receive TTS audio as binary frames while True: msg = await ws.recv() if isinstance(msg, bytes): # Binary frame = TTS audio (16-bit PCM, 24kHz) play_audio(msg) else: event = json.loads(msg) if event["type"] == "tts.speaking_start": print("TTS started") elif event["type"] == "tts.speaking_end": print("TTS finished") break ``` > **Info Mute-on-speak** > During TTS playback, the STT pipeline automatically mutes (discards incoming audio frames). This prevents the system from transcribing its own speech output. Unmute is guaranteed via `try/finally` — even if TTS crashes, the microphone is restored.
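The example above calls a `play_audio` helper that is left undefined. One possible sketch, assuming `sounddevice` is available (the same library the `stream` extra uses for microphone capture); TTS audio arrives as 16-bit PCM, mono, 24 kHz:

```python title="Sketch: a play_audio helper for TTS frames"
import sounddevice as sd  # assumed to be installed separately

# Raw 16-bit PCM output at Kokoro's fixed 24 kHz sample rate.
playback = sd.RawOutputStream(samplerate=24000, channels=1, dtype="int16")
playback.start()


def play_audio(chunk: bytes) -> None:
    """Write one binary TTS frame straight to the default audio output."""
    playback.write(chunk)
```

A real client would typically buffer a few chunks before starting playback to smooth out network jitter.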
### Python SDK ```python title="Direct synthesis" import httpx async def synthesize(): async with httpx.AsyncClient() as client: response = await client.post( "http://localhost:8000/v1/audio/speech", json={ "model": "kokoro-v1", "input": "Macaw OpenVoice converts text to natural speech.", "voice": "af_heart", "speed": 1.0, }, ) with open("output.wav", "wb") as f: f.write(response.content) ``` ## Engine Configuration ```yaml title="engine_config in macaw.yaml" engine_config: device: "auto" # "auto" (→ cpu), "cpu", "cuda" default_voice: "af_heart" # Default voice when "default" is requested sample_rate: 24000 # Output sample rate (fixed at 24kHz) lang_code: "a" # Default pipeline language prefix ``` | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | `device` | string | `"auto"` | Inference device. `"auto"` maps to `"cpu"` | | `default_voice` | string | `"af_heart"` | Voice used when `"default"` is requested | | `sample_rate` | int | `24000` | Output sample rate (always 24kHz) | | `lang_code` | string | `"a"` | Default language prefix for the KPipeline | ## Streaming Behavior Kokoro synthesizes text in a streaming fashion using `AsyncIterator[bytes]`: 1. The full text is passed to Kokoro's `KPipeline` 2. KPipeline processes text in segments, yielding `(graphemes, phonemes, audio)` tuples 3. Audio arrays are concatenated and converted to 16-bit PCM 4. The PCM data is chunked into **4,096-byte segments** (~85ms of audio at 24kHz) 5. Chunks are yielded one at a time for low time-to-first-byte (TTFB) ``` Text → KPipeline → [segment audio] → float32→PCM16 → 4096-byte chunks → Client ``` ### Speed Control The `speed` parameter adjusts synthesis speed: | Value | Effect | |-------|--------| | `0.25` | 4x slower (minimum) | | `1.0` | Normal speed (default) | | `2.0` | 2x faster | | `4.0` | 4x faster (maximum) | ## Response Formats The `POST /v1/audio/speech` endpoint supports two output formats: | Format | Content-Type | Description | |--------|-------------|-------------| | `wav` (default) | `audio/wav` | WAV file with header (44 bytes header + PCM data) | | `pcm` | `audio/pcm` | Raw 16-bit PCM, little-endian, mono, 24kHz | ## Manifest Reference ```yaml title="macaw.yaml (kokoro-v1)" name: kokoro-v1 version: "1.0.0" engine: kokoro type: tts description: "Kokoro TTS - neural text-to-speech" capabilities: streaming: true languages: ["en", "pt", "ja"] resources: memory_mb: 512 gpu_required: false gpu_recommended: true load_time_seconds: 3 engine_config: device: "auto" default_voice: "af_heart" sample_rate: 24000 lang_code: "a" ``` ## Key Behaviors - **TTS is stateless per request.** Each `tts.speak` or `POST /v1/audio/speech` is independent. The Session Manager is not used for TTS. - **New `tts.speak` cancels the previous.** If a `tts.speak` command arrives while another is in progress, the previous one is cancelled automatically. - **Binary WebSocket frames are directional.** Server→client binary frames are always TTS audio. Client→server binary frames are always STT audio. No ambiguity. - **TTS worker is a separate subprocess.** Runs on a different gRPC port (default 50052 vs 50051 for STT). Crash does not affect the runtime. --- # Silero VAD Silero VAD (Voice Activity Detection) is the neural speech detector used internally by Macaw OpenVoice. It determines which audio frames contain speech and which are silence, enabling the runtime to process only relevant audio. 
Silero VAD is not a user-installable model — it is bundled with the runtime and downloaded automatically via `torch.hub`. > **Info Internal component** > Silero VAD is not something you `macaw pull`. It is loaded automatically when the runtime starts a streaming session. You configure its behavior through the `vad_sensitivity` setting in `session.configure`. ## How It Works Macaw uses a **two-stage VAD pipeline** that combines a fast energy pre-filter with the Silero neural classifier: ``` Audio Frame │ ▼ ┌──────────────────────┐ │ Energy Pre-Filter │ ~0.1ms/frame │ (RMS + Spectral │ │ Flatness) │ │ │ │ Low energy + flat │──── Silence (skip Silero) │ spectrum? │ └──────────┬───────────┘ │ Non-silence ▼ ┌──────────────────────┐ │ Silero VAD │ ~2ms/frame │ (Neural classifier) │ │ │ │ Speech probability │ │ > threshold? │ └──────────┬───────────┘ │ ▼ ┌──────────────────────┐ │ Debounce │ │ (VADDetector) │ │ │ │ Confirmed state │ │ transition? │ └──────────┬───────────┘ │ ▼ VADEvent (SPEECH_START / SPEECH_END) ``` This two-stage design reduces unnecessary Silero invocations by 60-70% in noisy environments, since obvious silence is filtered out at the energy level without invoking the neural model. ## Stage 1: Energy Pre-Filter The energy pre-filter (`EnergyPreFilter`) uses two metrics to classify obvious silence: ### RMS Energy (dBFS) Computes the Root Mean Square energy of the frame and converts to dBFS (decibels relative to full scale). Frames below the energy threshold are candidates for silence. | Sensitivity | Energy Threshold | Description | |:-----------:|:----------------:|-------------| | HIGH | -50 dBFS | Very sensitive — detects whispers | | NORMAL | -40 dBFS | Normal conversation (default) | | LOW | -30 dBFS | Noisy environments, call centers | ### Spectral Flatness After the energy check, the pre-filter computes spectral flatness (ratio of geometric mean to arithmetic mean of the magnitude spectrum). A value above **0.8** indicates a flat spectrum (white noise or silence), while tonal speech typically has low spectral flatness (~0.1-0.5). A frame is classified as **silence** only when **both** conditions are met: - RMS energy < threshold (dBFS) - Spectral flatness > 0.8 **Cost:** ~0.1ms per frame. ## Stage 2: Silero VAD Classifier Frames that pass the energy pre-filter are sent to the Silero neural classifier (`SileroVADClassifier`). It returns a speech probability between 0.0 and 1.0. ### Speech Probability Thresholds | Sensitivity | Threshold | Behavior | |:-----------:|:---------:|----------| | HIGH | 0.3 | Detects soft speech, whispers — more false positives | | NORMAL | 0.5 | Balanced for normal conversation (default) | | LOW | 0.7 | Requires clear speech — fewer false positives, may miss quiet speakers | A frame is classified as **speech** when `probability > threshold`. ### Frame Processing - **Expected frame size:** 512 samples (32ms at 16kHz) - **Large frames:** automatically split into 512-sample sub-frames, processed sequentially (preserving Silero's internal temporal state). 
Returns the **maximum** probability among sub-frames - **Sample rate:** 16,000 Hz (required — validated on initialization) ### Model Loading Silero VAD is **lazy-loaded** on the first call to `get_speech_probability()`: - Downloaded via `torch.hub.load("snakers4/silero-vad", "silero_vad")` - Cached by PyTorch's hub mechanism (typically in `~/.cache/torch/hub/`) - **Thread-safe** — uses `threading.Lock` with double-check locking pattern - Can be preloaded with `await classifier.preload()` to avoid first-call latency **Cost:** ~2ms per frame on CPU. ## Stage 3: Debounce (VADDetector) The `VADDetector` orchestrates both stages and applies debounce to prevent rapid state changes from producing noisy events. ### Debounce Parameters | Parameter | Default | Description | |-----------|:-------:|-------------| | `min_speech_duration_ms` | 250ms | Consecutive speech frames required before emitting `SPEECH_START` | | `min_silence_duration_ms` | 300ms | Consecutive silence frames required before emitting `SPEECH_END` | | `max_speech_duration_ms` | 30,000ms | Maximum continuous speech before forcing `SPEECH_END` | ### State Machine ``` 250ms consecutive speech SILENCE ──────────────────────────────► SPEAKING ▲ │ │ │ │ 300ms consecutive silence │ ◄──────────────────────────────────────┘ OR 30s max speech duration ``` ### Events | Event | When | |-------|------| | `SPEECH_START` | After `min_speech_duration_ms` of consecutive speech | | `SPEECH_END` | After `min_silence_duration_ms` of consecutive silence during speech | | `SPEECH_END` (forced) | After `max_speech_duration_ms` of continuous speech | Each event includes a `timestamp_ms` computed from total processed samples. ## Configuration VAD sensitivity is configured per session via the WebSocket `session.configure` command: ```json title="WebSocket session.configure" { "type": "session.configure", "model": "faster-whisper-large-v3", "vad_sensitivity": "normal" } ``` Valid values: `"high"`, `"normal"` (default), `"low"`. Changing the sensitivity adjusts **both** the energy pre-filter threshold and the Silero speech probability threshold simultaneously. ### Sensitivity Guide | Environment | Recommended | Why | |-------------|:-----------:|-----| | Quiet office, banking app | HIGH | Detects soft-spoken customers, whispers | | Normal conversation | NORMAL | Balanced for typical voice interactions | | Call center, noisy background | LOW | Reduces false triggers from background noise | ## Performance | Metric | Value | |--------|-------| | Energy pre-filter cost | ~0.1ms/frame | | Silero classifier cost | ~2ms/frame | | Total cost (silence frame) | ~0.1ms (Silero skipped) | | Total cost (speech frame) | ~2.1ms | | Model memory | ~50 MB | | GPU required | No | | False positive reduction | 60-70% in noisy environments | ## Dependencies Silero VAD requires PyTorch: ```bash pip install torch ``` PyTorch is not listed as a direct Macaw dependency — it is typically installed as a transitive dependency of the STT or TTS engines (Faster-Whisper, Kokoro). If you are using a minimal installation, ensure `torch` is available. ## Key Design Decisions - **VAD runs in the runtime, not in the engine.** The Macaw runtime owns the VAD pipeline. Engines receive only speech audio. This ensures consistent behavior across all STT engines. - **Preprocessing comes before VAD.** Audio must be normalized (DC removal, gain normalization, resample to 16kHz) before reaching the VAD, otherwise Silero's thresholds produce inconsistent results. 
- **Never enable engine-internal VAD.** The `vad_filter` in Faster-Whisper's engine config is always `false`. Enabling it would duplicate the VAD work and create conflicts. - **Energy pre-filter is a performance optimization, not a replacement.** It reduces Silero invocations for obvious silence but never classifies speech on its own. Only Silero can confirm speech. - **Debounce uses sample counts, not timers.** The debounce counters accumulate actual processed samples, making the timing deterministic regardless of processing speed. --- # Batch Transcription Macaw's REST API is **OpenAI-compatible** — you can use the official OpenAI SDK or any HTTP client to transcribe and translate audio files. ## Transcription ### Using curl ```bash title="Basic transcription" curl -X POST http://localhost:8000/v1/audio/transcriptions \ -F file=@meeting.wav \ -F model=faster-whisper-large-v3 ``` ```json title="Response" { "text": "Hello, welcome to the meeting. Let's get started." } ``` ### Using the OpenAI SDK ```python title="Python" from openai import OpenAI client = OpenAI( base_url="http://localhost:8000/v1", api_key="not-needed" # Macaw doesn't require auth ) with open("meeting.wav", "rb") as f: transcript = client.audio.transcriptions.create( model="faster-whisper-large-v3", file=f ) print(transcript.text) ``` > **Tip Drop-in replacement** > Since Macaw implements the OpenAI Audio API, switching from OpenAI's hosted API is a one-line change — just update the `base_url`. ## Request Parameters | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | `file` | binary | *required* | Audio file (WAV, MP3, FLAC, OGG, WebM) | | `model` | string | *required* | Model name (e.g., `faster-whisper-large-v3`) | | `language` | string | auto | ISO 639-1 language code (e.g., `en`, `pt`, `es`) | | `prompt` | string | — | Context hint for the model (hot words, domain terms) | | `response_format` | string | `json` | Output format (see below) | | `temperature` | float | `0.0` | Sampling temperature (0.0 = deterministic) | | `itn` | boolean | `true` | Apply Inverse Text Normalization | ## Response Formats ### `json` (default) ```bash curl -X POST http://localhost:8000/v1/audio/transcriptions \ -F file=@audio.wav \ -F model=faster-whisper-large-v3 \ -F response_format=json ``` ```json { "text": "The total is one hundred and fifty dollars." } ``` ### `verbose_json` Includes segment-level detail with timestamps and confidence scores: ```bash curl -X POST http://localhost:8000/v1/audio/transcriptions \ -F file=@audio.wav \ -F model=faster-whisper-large-v3 \ -F response_format=verbose_json ``` ```json { "text": "The total is one hundred and fifty dollars.", "segments": [ { "id": 0, "start": 0.0, "end": 2.5, "text": "The total is one hundred and fifty dollars.", "avg_logprob": -0.15, "no_speech_prob": 0.02 } ], "language": "en", "duration": 2.5 } ``` ### `text` Returns plain text without JSON wrapping: ```bash curl -X POST http://localhost:8000/v1/audio/transcriptions \ -F file=@audio.wav \ -F model=faster-whisper-large-v3 \ -F response_format=text ``` ``` The total is one hundred and fifty dollars. ``` ### `srt` SubRip subtitle format: ``` 1 00:00:00,000 --> 00:00:02,500 The total is one hundred and fifty dollars. ``` ### `vtt` WebVTT subtitle format: ``` WEBVTT 00:00:00.000 --> 00:00:02.500 The total is one hundred and fifty dollars. 
``` ## Translation The translation endpoint translates audio from **any supported language to English**: ```bash title="Translate Portuguese audio to English" curl -X POST http://localhost:8000/v1/audio/translations \ -F file=@reuniao.wav \ -F model=faster-whisper-large-v3 ``` ```json { "text": "Hello, welcome to the meeting. Let's get started." } ``` > **Info Translation target** > Translation always produces English output. This matches the OpenAI API behavior. The source language is detected automatically. The translation endpoint accepts the same parameters as transcription (except `language`, which is ignored since the output is always English). ## Inverse Text Normalization (ITN) By default, Macaw applies ITN to final transcripts, converting spoken forms to written forms: | Spoken | Written (with ITN) | |--------|-------------------| | "one hundred and fifty dollars" | "$150" | | "january twenty third twenty twenty five" | "January 23, 2025" | | "five five five one two three four" | "555-1234" | To disable ITN: ```bash curl -X POST http://localhost:8000/v1/audio/transcriptions \ -F file=@audio.wav \ -F model=faster-whisper-large-v3 \ -F itn=false ``` > **Tip ITN is fail-open** > ITN uses NeMo Text Processing. If NeMo is not installed or fails, the raw text passes through unchanged — no errors are raised. ## Cancellation Long-running requests can be cancelled using the request ID: ```bash title="Start a transcription" # The request_id is returned in the response headers curl -v -X POST http://localhost:8000/v1/audio/transcriptions \ -F file=@long_recording.wav \ -F model=faster-whisper-large-v3 ``` ```bash title="Cancel it" curl -X POST http://localhost:8000/v1/audio/transcriptions/req_abc123/cancel ``` ```json title="Response" { "request_id": "req_abc123", "cancelled": true } ``` Cancellation is **idempotent** — cancelling an already-completed or already-cancelled request returns `cancelled: false`. ## Supported Audio Formats | Format | MIME Type | Notes | |--------|-----------|-------| | WAV | `audio/wav` | Preferred — no transcoding needed | | MP3 | `audio/mpeg` | Decoded automatically | | FLAC | `audio/flac` | Lossless, good for archival | | OGG | `audio/ogg` | Opus/Vorbis codec | | WebM | `audio/webm` | Common from browsers | > **Warning Preprocessing is automatic** > All audio is automatically resampled to 16kHz mono, DC-filtered, and gain-normalized before reaching the engine. You don't need to preprocess your files. ## Error Responses | Status | Description | |:---:|-------------| | 400 | Invalid request (missing file, unsupported format, unknown model) | | 413 | File too large | | 503 | No workers available for the requested model | | 500 | Internal error during transcription | ```json title="Error response" { "error": { "message": "Model 'unknown-model' not found", "type": "invalid_request_error", "code": "model_not_found" } } ``` ## Next Steps | Goal | Guide | |------|-------| | Real-time streaming transcription | [Streaming STT](./streaming-stt) | | Text-to-speech synthesis | [REST API - Speech](../api-reference/rest-api#post-v1audiospeech) | | Full-duplex voice interaction | [Full-Duplex](./full-duplex) | --- # Streaming STT Macaw provides real-time speech-to-text via WebSocket at `/v1/realtime`. Audio frames are sent as binary messages and transcription events are returned as JSON. 
## Quick Start ### Using wscat ```bash title="Connect and stream" wscat -c "ws://localhost:8000/v1/realtime?model=faster-whisper-large-v3" ``` ### Using Python ```python title="stream_audio.py" import asyncio import json import websockets async def stream_microphone(): uri = "ws://localhost:8000/v1/realtime?model=faster-whisper-large-v3" async with websockets.connect(uri) as ws: # Wait for session.created msg = json.loads(await ws.recv()) print(f"Session: {msg['session_id']}") # Send audio frames (16-bit PCM, 16kHz) # In production, read from microphone with open("audio.raw", "rb") as f: while chunk := f.read(3200): # 100ms frames await ws.send(chunk) await asyncio.sleep(0.1) # Check for transcription events (non-blocking) try: response = await asyncio.wait_for(ws.recv(), timeout=0.01) event = json.loads(response) if event["type"] == "transcript.partial": print(f" ...{event['text']}", end="\r") elif event["type"] == "transcript.final": print(f" {event['text']}") except asyncio.TimeoutError: pass asyncio.run(stream_microphone()) ``` ### Using the CLI ```bash title="Stream from microphone" macaw transcribe --stream --model faster-whisper-large-v3 ``` ## Connection ### URL Format ``` ws://HOST:PORT/v1/realtime?model=MODEL&language=LANG ``` | Parameter | Required | Default | Description | |-----------|:---:|---------|-------------| | `model` | Yes | — | STT model name | | `language` | No | auto | ISO 639-1 code (e.g., `en`, `pt`) | ### Session Created After connecting, the server immediately sends a `session.created` event: ```json title="Server → Client" { "type": "session.created", "session_id": "sess_a1b2c3d4" } ``` Save the `session_id` for logging and debugging. ## Audio Format Send audio as **binary WebSocket frames**: | Property | Value | |----------|-------| | Encoding | PCM 16-bit signed, little-endian | | Sample rate | 16,000 Hz | | Channels | Mono | | Frame size | Recommended: 3,200 bytes (100ms) | > **Tip Preprocessing is automatic** > If your audio isn't exactly 16kHz mono, the `StreamingPreprocessor` will resample it automatically. However, sending pre-formatted audio avoids unnecessary processing. ## Transcription Events ### Partial Transcripts Emitted in real-time as speech is being recognized. These are **unstable** — text may change as more context arrives: ```json title="Server → Client" { "type": "transcript.partial", "text": "hello how are", "segment_id": 1 } ``` ### Final Transcripts Emitted when a speech segment ends (VAD detects silence). These are **stable** — the text will not change: ```json title="Server → Client" { "type": "transcript.final", "text": "Hello, how are you doing today?", "segment_id": 1, "start": 0.5, "end": 2.8, "confidence": 0.94 } ``` > **Info ITN on finals only** > Inverse Text Normalization (e.g., "one hundred" → "100") is applied **only** to final transcripts. Partials return raw text because they change too frequently for ITN to be useful. 
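The Quick Start example polls for events in between sends. In practice it is often cleaner to consume events in a dedicated coroutine. The sketch below is illustrative only; it assumes an open `websockets` connection and the partial/final event shapes shown above.

```python title="Handling partial and final transcripts (sketch)"
import json


async def handle_events(ws) -> None:
    """Consume transcription events from an open /v1/realtime connection.

    Partials are unstable, so they overwrite the current console line;
    finals are stable and printed with their timing and confidence.
    """
    async for message in ws:
        if isinstance(message, bytes):
            continue  # binary frames are TTS audio (full-duplex only)
        event = json.loads(message)
        if event["type"] == "transcript.partial":
            print(f"\r... {event['text']}", end="", flush=True)
        elif event["type"] == "transcript.final":
            print(
                f"\r[{event['start']:.1f}s-{event['end']:.1f}s] "
                f"{event['text']} (confidence {event['confidence']:.2f})"
            )
```

Run it alongside your audio-sending coroutine, for example with `asyncio.gather(send_audio(ws), handle_events(ws))`, so sending and receiving never block each other.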
## Session Configuration After connecting, you can adjust session settings dynamically: ```json title="Client → Server" { "type": "session.configure", "language": "pt", "vad_sensitivity": "high", "hot_words": ["Macaw", "OpenVoice", "gRPC"], "enable_itn": true } ``` | Field | Type | Description | |-------|------|-------------| | `language` | string | Change language mid-session | | `vad_sensitivity` | string | `"high"`, `"normal"`, or `"low"` | | `hot_words` | string[] | Domain-specific terms to boost recognition | | `enable_itn` | boolean | Enable/disable Inverse Text Normalization | | `model_tts` | string | Set TTS model for full-duplex (see [Full-Duplex](./full-duplex)) | ## Buffer Management ### Manual Commit Force the audio buffer to commit and produce a final transcript, even without a VAD silence event: ```json title="Client → Server" { "type": "input_audio_buffer.commit" } ``` This is useful when you know the user has finished speaking (e.g., they pressed a "done" button) but the VAD hasn't detected silence yet. ## Closing the Session ### Graceful Close ```json title="Client → Server" { "type": "session.close" } ``` The server flushes remaining data, emits any final transcripts, and sends: ```json title="Server → Client" { "type": "session.closed", "session_id": "sess_a1b2c3d4", "reason": "client_close" } ``` ### Cancel ```json title="Client → Server" { "type": "session.cancel" } ``` Immediately closes the session without flushing. Pending transcripts are discarded. ## Backpressure If the client sends audio faster than real-time (e.g., reading from a file without throttling), the server applies backpressure: ### Rate Limit Warning ```json title="Server → Client" { "type": "session.rate_limit", "delay_ms": 50, "message": "Audio arriving faster than 1.2x real-time" } ``` **Action:** slow down your send rate by the suggested `delay_ms`. ### Frames Dropped ```json title="Server → Client" { "type": "session.frames_dropped", "dropped_ms": 200, "message": "Backlog exceeded 10s, frames dropped" } ``` **Action:** this is informational — frames have already been dropped. Reduce send rate to prevent further drops. > **Warning Throttle file streaming** > When streaming from a file (not a microphone), add `asyncio.sleep(0.1)` between 100ms frames to simulate real-time. Without throttling, the server will trigger backpressure. ## Error Handling ### Error Events ```json title="Server → Client" { "type": "error", "code": "worker_unavailable", "message": "STT worker not available for model faster-whisper-large-v3", "recoverable": true } ``` | Field | Description | |-------|-------------| | `code` | Machine-readable error code | | `message` | Human-readable description | | `recoverable` | `true` if the client can retry or continue | ### Common Errors | Code | Recoverable | Description | |------|:---:|-------------| | `model_not_found` | No | Requested model is not loaded | | `worker_unavailable` | Yes | Worker crashed, recovery in progress | | `session_timeout` | No | Session exceeded idle timeout | | `invalid_command` | Yes | Unrecognized JSON command | ### Reconnection If the WebSocket disconnects unexpectedly: 1. Reconnect with the same parameters 2. A new `session_id` will be assigned 3. 
Previous session state is not preserved — this is a fresh session ## Inactivity Timeout The server monitors session activity: | Parameter | Value | |-----------|-------| | Heartbeat ping | Every 10s | | Auto-close timeout | 60s of inactivity | If no audio frames arrive for 60 seconds, the server closes the session automatically. ## Next Steps | Goal | Guide | |------|-------| | Add TTS to the same connection | [Full-Duplex](./full-duplex) | | Batch file transcription instead | [Batch Transcription](./batch-transcription) | | Full protocol reference | [WebSocket Protocol](../api-reference/websocket-protocol) | --- # Full-Duplex STT + TTS Macaw supports **full-duplex** voice interactions on a single WebSocket connection. The client streams audio for STT while simultaneously receiving synthesized speech from TTS — all on the same `/v1/realtime` endpoint. ## How It Works The key mechanism is **mute-on-speak**: when TTS is active, STT is muted to prevent the synthesized audio from being fed back into the speech recognizer. ``` Timeline ─────────────────────────────────────────────────▶ Client sends audio (STT) ████████████░░░░░░░░░░░████████████ │ │ tts.speak tts.speaking_end │ │ Server sends audio (TTS) │██████████│ │ │ STT active ████████│ muted │████████████ ``` ### Flow 1. Client streams audio frames for STT (binary messages) 2. Client sends a `tts.speak` command (JSON message) 3. Server **mutes** STT — incoming audio frames are dropped 4. Server emits `tts.speaking_start` event 5. Server streams TTS audio as binary frames (server → client) 6. When synthesis completes, server emits `tts.speaking_end` 7. Server **unmutes** STT — audio processing resumes 8. Client continues streaming audio for STT > **Tip Directionality is unambiguous** > - **Binary frames client → server** are always STT audio > - **Binary frames server → client** are always TTS audio > - **Text frames** (both directions) are always JSON events/commands ## Setup ### 1. Connect to the WebSocket ```python title="Connect with STT model" import asyncio import json import websockets async def full_duplex(): uri = "ws://localhost:8000/v1/realtime?model=faster-whisper-large-v3" async with websockets.connect(uri) as ws: # Wait for session.created event = json.loads(await ws.recv()) print(f"Session: {event['session_id']}") ``` ### 2. Configure TTS Model Set the TTS model for the session: ```python title="Configure TTS" await ws.send(json.dumps({ "type": "session.configure", "model_tts": "kokoro" })) ``` > **Info Auto-discovery** > If you don't set `model_tts`, the server will auto-discover the first available TTS model from the registry when you send a `tts.speak` command. ### 3. Request Speech Synthesis ```python title="Send tts.speak" await ws.send(json.dumps({ "type": "tts.speak", "text": "Hello! How can I help you today?", "voice": "af_heart" })) ``` ### 4. 
Handle Events and Audio ```python title="Event loop" async for message in ws: if isinstance(message, bytes): # TTS audio chunk (PCM 16-bit, 24kHz) play_audio(message) else: event = json.loads(message) match event["type"]: case "transcript.partial": print(f" ...{event['text']}", end="\r") case "transcript.final": print(f" User: {event['text']}") # Generate response and speak it response = get_llm_response(event["text"]) await ws.send(json.dumps({ "type": "tts.speak", "text": response })) case "tts.speaking_start": print(" [Speaking...]") case "tts.speaking_end": print(f" [Done, {event['duration_ms']}ms]") ``` ## Commands ### `tts.speak` Request speech synthesis: ```json title="Client → Server" { "type": "tts.speak", "text": "Hello, how can I help you?", "voice": "af_heart" } ``` | Field | Type | Default | Description | |-------|------|---------|-------------| | `text` | string | *required* | Text to synthesize | | `voice` | string | `"default"` | Voice ID (see [available voices](#available-voices)) | > **Warning Auto-cancellation** > If a `tts.speak` command arrives while a previous synthesis is still in progress, the **previous one is cancelled automatically**. TTS commands do not accumulate — only the latest one plays. ### `tts.cancel` Cancel the current TTS synthesis: ```json title="Client → Server" { "type": "tts.cancel" } ``` This immediately: 1. Stops sending audio chunks 2. Unmutes STT 3. Emits `tts.speaking_end` with `"cancelled": true` ## Events ### `tts.speaking_start` Emitted when the first audio chunk is ready to send: ```json title="Server → Client" { "type": "tts.speaking_start", "text": "Hello, how can I help you?" } ``` At this point, STT is muted and audio chunks will follow. ### `tts.speaking_end` Emitted when synthesis completes (or is cancelled): ```json title="Server → Client" { "type": "tts.speaking_end", "duration_ms": 1250, "cancelled": false } ``` | Field | Type | Description | |-------|------|-------------| | `duration_ms` | int | Total duration of audio sent | | `cancelled` | bool | `true` if stopped early via `tts.cancel` or new `tts.speak` | After this event, STT is unmuted and audio processing resumes. ## TTS Audio Format TTS audio chunks are sent as binary WebSocket frames: | Property | Value | |----------|-------| | Encoding | PCM 16-bit signed, little-endian | | Sample rate | 24,000 Hz (Kokoro default) | | Channels | Mono | | Chunk size | ~4,096 bytes (~85ms at 24kHz) | > **Info Different sample rates** > STT input is 16kHz, but TTS output is 24kHz (Kokoro's native rate). The client is responsible for handling both sample rates appropriately (e.g., separate audio output streams). ## Mute-on-Speak Details The mute mechanism ensures STT doesn't hear the TTS output: ``` tts.speak received │ ▼ ┌──────────┐ │ mute() │ STT frames dropped (counter incremented) └────┬─────┘ │ ▼ ┌──────────────────────┐ │ Stream TTS audio │ Binary frames server → client │ chunks to client │ └────┬─────────────────┘ │ ▼ (in finally block — always executes) ┌──────────┐ │ unmute() │ STT processing resumes └──────────┘ ``` ### Guarantees | Property | Guarantee | |----------|-----------| | Unmute on completion | Always — via `try/finally` | | Unmute on TTS error | Always — via `try/finally` | | Unmute on cancel | Always — via `try/finally` | | Unmute on WebSocket close | Always — session cleanup | | Idempotent | `mute()` and `unmute()` can be called multiple times | > **Warning** > The `try/finally` pattern is critical. 
If TTS crashes mid-synthesis, the `finally` block still calls `unmute()`. Without this, a TTS error would permanently mute STT for the session. ## Available Voices Kokoro supports multiple languages and voices. The voice ID prefix determines the language: | Prefix | Language | Example | |--------|----------|---------| | `a` | English (US) | `af_heart`, `am_adam` | | `b` | English (UK) | `bf_emma`, `bm_george` | | `e` | Spanish | `ef_dora`, `em_alex` | | `f` | French | `ff_siwis` | | `h` | Hindi | `hf_alpha`, `hm_omega` | | `i` | Italian | `if_sara`, `im_nicola` | | `j` | Japanese | `jf_alpha`, `jm_omega` | | `p` | Portuguese | `pf_dora`, `pm_alex` | | `z` | Chinese | `zf_xiaobei`, `zm_yunjian` | The second character indicates gender: `f` = female, `m` = male. **Default voice:** `af_heart` (English US, female) ## Complete Example ```python title="voice_assistant.py" import asyncio import json import websockets async def voice_assistant(): uri = "ws://localhost:8000/v1/realtime?model=faster-whisper-large-v3" async with websockets.connect(uri) as ws: # Wait for session event = json.loads(await ws.recv()) print(f"Connected: {event['session_id']}") # Configure TTS await ws.send(json.dumps({ "type": "session.configure", "model_tts": "kokoro", "vad_sensitivity": "normal", "enable_itn": True })) # Greet the user await ws.send(json.dumps({ "type": "tts.speak", "text": "Hi! I'm ready to help. Go ahead and speak.", "voice": "af_heart" })) # Main loop: listen for events async for message in ws: if isinstance(message, bytes): # TTS audio — send to speaker play_audio(message) continue event = json.loads(message) if event["type"] == "transcript.final": user_text = event["text"] print(f"User: {user_text}") # Get response from your LLM response = await get_llm_response(user_text) print(f"Assistant: {response}") # Speak the response await ws.send(json.dumps({ "type": "tts.speak", "text": response, "voice": "af_heart" })) elif event["type"] == "tts.speaking_end": if event.get("cancelled"): print(" (interrupted)") asyncio.run(voice_assistant()) ``` ## Next Steps | Goal | Guide | |------|-------| | WebSocket protocol reference | [WebSocket Protocol](../api-reference/websocket-protocol) | | Understanding mute and session state | [Session Manager](../architecture/session-manager) | | Batch transcription instead | [Batch Transcription](./batch-transcription) | --- # Adding an Engine Macaw is **engine-agnostic**. Adding a new STT or TTS engine requires implementing the backend interface, registering it in the factory, creating a model manifest, and writing tests. Zero changes to the runtime core. 
## Overview ``` Step 1: Implement the Backend interface (STTBackend or TTSBackend) Step 2: Register in the factory function Step 3: Create a model manifest (macaw.yaml) Step 4: Declare dependencies (optional extra) Step 5: Write tests ``` ## Step 1: Implement the Backend ### STT Engine Create a new file in `src/macaw/workers/stt/`: ```python title="src/macaw/workers/stt/my_engine.py" from macaw.workers.stt.interface import ( STTBackend, STTArchitecture, EngineCapabilities, BatchResult, TranscriptSegment, ) from typing import AsyncIterator class MyEngineBackend(STTBackend): """STT backend for MyEngine.""" @property def architecture(self) -> STTArchitecture: # Choose one: # - STTArchitecture.ENCODER_DECODER (like Whisper) # - STTArchitecture.CTC (like WeNet) # - STTArchitecture.STREAMING_NATIVE (like Paraformer) return STTArchitecture.ENCODER_DECODER async def load(self, model_path: str, config: dict) -> None: """Load the model into memory.""" # Initialize your engine here # config comes from macaw.yaml engine_config section ... def capabilities(self) -> EngineCapabilities: """Declare what this engine supports.""" return EngineCapabilities( supports_hot_words=False, supports_initial_prompt=True, supports_batch=True, supports_word_timestamps=True, max_concurrent=1, # GPU concurrency limit ) async def transcribe_file( self, audio_data: bytes, language: str | None = None, initial_prompt: str | None = None, hot_words: list[str] | None = None, temperature: float = 0.0, word_timestamps: bool = False, ) -> BatchResult: """Transcribe a complete audio file.""" # audio_data is PCM 16-bit, 16kHz, mono (already preprocessed) # Return BatchResult with text, segments, language, duration ... async def transcribe_stream( self, audio_chunks: AsyncIterator[bytes], language: str | None = None, initial_prompt: str | None = None, hot_words: list[str] | None = None, ) -> AsyncIterator[TranscriptSegment]: """Transcribe streaming audio.""" # Yield TranscriptSegment for each partial/final result async for chunk in audio_chunks: # Process chunk, yield results ... async def unload(self) -> None: """Free model resources.""" ... async def health(self) -> dict: """Return health status.""" return {"status": "ok", "engine": "my_engine"} ``` ### TTS Engine Create a new file in `src/macaw/workers/tts/`: ```python title="src/macaw/workers/tts/my_tts.py" from macaw.workers.tts.interface import TTSBackend, VoiceInfo from typing import AsyncIterator class MyTTSBackend(TTSBackend): """TTS backend for MyTTS.""" async def load(self, model_path: str, config: dict) -> None: """Load the TTS model.""" ... async def synthesize( self, text: str, voice: str = "default", sample_rate: int = 24000, speed: float = 1.0, ) -> AsyncIterator[bytes]: """Synthesize speech from text. Yields PCM 16-bit audio chunks for streaming with low TTFB. """ # Process text and yield audio chunks # Each chunk should be ~4096 bytes for smooth streaming ... async def voices(self) -> list[VoiceInfo]: """List available voices.""" return [ VoiceInfo(id="default", name="Default", language="en"), ] async def unload(self) -> None: """Free model resources.""" ... async def health(self) -> dict: """Return health status.""" return {"status": "ok", "engine": "my_tts"} ``` > **Info Streaming TTS** > The `synthesize()` method returns an `AsyncIterator[bytes]` — not a single bytes object. This enables **streaming with low TTFB** (Time to First Byte). Yield audio chunks as they become available rather than waiting for the full synthesis to complete. 
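To make the chunking concrete, here is one way to re-slice whatever your engine produces into ~4,096-byte pieces before yielding them. This is a sketch, not part of Macaw's interface; `chunk_pcm` is a hypothetical helper that `synthesize()` could wrap around the engine's native output stream.

```python title="Re-chunking engine output for low TTFB (sketch)"
from typing import AsyncIterator

CHUNK_SIZE = 4096  # ~85 ms of 16-bit mono PCM at 24 kHz


async def chunk_pcm(pcm_stream: AsyncIterator[bytes]) -> AsyncIterator[bytes]:
    """Re-slice an arbitrary PCM byte stream into CHUNK_SIZE pieces.

    Yielding each chunk as soon as it is full keeps TTFB low; nothing
    waits for the full synthesis to finish.
    """
    buffer = b""
    async for piece in pcm_stream:
        buffer += piece
        while len(buffer) >= CHUNK_SIZE:
            yield buffer[:CHUNK_SIZE]
            buffer = buffer[CHUNK_SIZE:]
    if buffer:
        yield buffer  # flush whatever remains at the end of synthesis
```

Inside `synthesize()`, you would then write `async for chunk in chunk_pcm(engine_output): yield chunk`, where `engine_output` is whatever incremental API your engine exposes.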
## Step 2: Register in the Factory Add your engine to the factory function that creates backends: ```python title="Registration" # In the worker factory (e.g., _create_backend) def _create_backend(engine: str) -> STTBackend: match engine: case "faster-whisper": from macaw.workers.stt.faster_whisper import FasterWhisperBackend return FasterWhisperBackend() case "wenet": from macaw.workers.stt.wenet import WeNetBackend return WeNetBackend() case "my-engine": # Add your engine here from macaw.workers.stt.my_engine import MyEngineBackend return MyEngineBackend() case _: raise ValueError(f"Unknown engine: {engine}") ``` > **Tip Lazy imports** > Use lazy imports inside the `match` branches. This way, engine dependencies are only loaded when that specific engine is requested. Users who don't use your engine don't need to install its dependencies. ## Step 3: Create a Model Manifest Every model needs a `macaw.yaml` manifest in its model directory: ```yaml title="models/my-model/macaw.yaml" name: my-model-large type: stt # stt or tts engine: my-engine # matches the factory key architecture: encoder-decoder # encoder-decoder, ctc, or streaming-native capabilities: hot_words: false initial_prompt: true batch: true word_timestamps: true engine_config: beam_size: 5 vad_filter: false # Always false — runtime handles VAD compute_type: float16 device: cuda # cuda or cpu files: - model.bin - tokenizer.json - config.json ``` ### Architecture Field The `architecture` field tells the runtime how to adapt the streaming pipeline: | Architecture | LocalAgreement | Cross-segment Context | Native Partials | |-------------|:-:|:-:|:-:| | `encoder-decoder` | Yes (confirms tokens across passes) | Yes (224 tokens from previous segment) | No | | `ctc` | No (not needed) | No (`initial_prompt` not supported) | Yes | | `streaming-native` | No (not needed) | No | Yes | > **Warning Set the right architecture** > Choosing the wrong architecture will cause incorrect streaming behavior. If your engine produces native partial transcripts, use `ctc` or `streaming-native`. If it needs multiple inference passes to produce stable output, use `encoder-decoder`. ### Engine Config The `engine_config` section is passed directly to your `load()` method as a dict. Define whatever configuration your engine needs: ```python async def load(self, model_path: str, config: dict) -> None: beam_size = config.get("beam_size", 5) compute_type = config.get("compute_type", "float16") device = config.get("device", "cuda") # Initialize engine with these settings ``` ## Step 4: Declare Dependencies If your engine requires additional Python packages, add them as an optional extra in `pyproject.toml`: ```toml title="pyproject.toml" [project.optional-dependencies] my-engine = ["my-engine-lib>=1.0"] # Users install with: # pip install macaw-openvoice[my-engine] ``` This keeps the base Macaw installation lightweight — users only install engine dependencies they actually use. 
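Whether the third-party library is imported at the top of your backend module (loaded lazily via the factory) or inside `load()`, a guard with a clear install hint makes the optional dependency easier to diagnose. Below is a minimal sketch for the `MyEngineBackend.load()` method, assuming the hypothetical `my-engine-lib` package from the `pyproject.toml` example; the error wording is illustrative.

```python title="Guarding the optional dependency (sketch)"
async def load(self, model_path: str, config: dict) -> None:
    """Load the model, importing the engine library only when needed."""
    try:
        # Imported here rather than at module top so that a base install
        # of macaw-openvoice does not break before the extra is installed.
        import my_engine_lib  # hypothetical package from the optional extra
    except ImportError as exc:
        raise RuntimeError(
            "MyEngine requires the optional dependency 'my-engine-lib'. "
            "Install it with: pip install macaw-openvoice[my-engine]"
        ) from exc

    # engine_config from macaw.yaml arrives here as `config`
    self._model = my_engine_lib.load_model(model_path, **config)
```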
## Step 5: Write Tests ### Unit Tests Test your backend in isolation with mocked inference: ```python title="tests/unit/workers/stt/test_my_engine.py" import pytest from unittest.mock import AsyncMock, patch from macaw.workers.stt.my_engine import MyEngineBackend from macaw.workers.stt.interface import STTArchitecture class TestMyEngineBackend: def test_architecture(self): backend = MyEngineBackend() assert backend.architecture == STTArchitecture.ENCODER_DECODER def test_capabilities(self): backend = MyEngineBackend() caps = backend.capabilities() assert caps.supports_batch is True assert caps.max_concurrent == 1 async def test_transcribe_file(self): backend = MyEngineBackend() # Mock the engine's inference with patch.object(backend, "_inference", new_callable=AsyncMock) as mock: mock.return_value = "Hello world" result = await backend.transcribe_file(b"fake_audio") assert result.text == "Hello world" async def test_health(self): backend = MyEngineBackend() status = await backend.health() assert status["status"] == "ok" ``` ### Integration Tests Test with a real model (mark as integration): ```python title="tests/integration/workers/stt/test_my_engine_integration.py" import pytest from macaw.workers.stt.my_engine import MyEngineBackend @pytest.mark.integration class TestMyEngineIntegration: async def test_transcribe_real_audio(self, audio_440hz_wav): backend = MyEngineBackend() await backend.load("path/to/model", {"device": "cpu"}) try: result = await backend.transcribe_file(audio_440hz_wav) assert isinstance(result.text, str) finally: await backend.unload() ``` ## Checklist Before submitting your engine: - [ ] Implements all abstract methods from `STTBackend` or `TTSBackend` - [ ] `architecture` property returns the correct type - [ ] `capabilities()` accurately reflects engine features - [ ] `vad_filter: false` in the manifest (runtime handles VAD) - [ ] Lazy import in the factory function - [ ] Optional dependency declared in `pyproject.toml` - [ ] Unit tests with mocked inference - [ ] Integration tests marked with `@pytest.mark.integration` - [ ] `health()` returns meaningful status ## What You Don't Need to Touch The engine-agnostic design means you do **not** modify: | Component | Reason | |-----------|--------| | API Server | Routes are engine-agnostic | | Session Manager | Adapts automatically via `architecture` field | | VAD Pipeline | Runs before audio reaches the engine | | Preprocessing | Engines receive normalized PCM 16kHz | | Postprocessing | ITN runs after transcription, independent of engine | | Scheduler | Routes requests by model name, not engine type | | CLI | Commands work with any registered model | --- # CLI Reference Macaw ships with an Ollama-style CLI for managing models, running the server, and transcribing audio. All commands are available via the `macaw` binary. ```bash macaw --help ``` ## Commands ### `macaw serve` Start the API server and gRPC workers. ```bash title="Basic usage" macaw serve ``` ```bash title="Custom host and port" macaw serve --host 0.0.0.0 --port 9000 ``` | Flag | Default | Description | |------|---------|-------------| | `--host` | `127.0.0.1` | Bind address | | `--port` | `8000` | HTTP port | | `--models-dir` | `~/.macaw/models` | Models directory | | `--cors-origins` | `*` | Allowed CORS origins | | `--log-format` | `text` | Log format (`text` or `json`) | | `--log-level` | `info` | Log level (`debug`, `info`, `warning`, `error`) | The server starts: 1. FastAPI HTTP server on the specified port 2. STT gRPC worker on port 50051 3. 
TTS gRPC worker on port 50052 > **Tip Production** > For production deployments, use structured logging and bind to all interfaces: > ```bash > macaw serve --host 0.0.0.0 --log-format json --log-level info > ``` --- ### `macaw transcribe` Transcribe an audio file or stream from the microphone. ```bash title="Transcribe a file" macaw transcribe recording.wav --model faster-whisper-large-v3 ``` ```bash title="Stream from microphone" macaw transcribe --stream --model faster-whisper-large-v3 ``` | Flag | Short | Default | Description | |------|:---:|---------|-------------| | `--model` | `-m` | — | Model to use (required) | | `--format` | | `json` | Output format (`json`, `text`, `verbose_json`, `srt`, `vtt`) | | `--language` | `-l` | auto | Language code (e.g., `en`, `pt`) | | `--no-itn` | | — | Disable Inverse Text Normalization | | `--hot-words` | | — | Comma-separated hot words | | `--stream` | | — | Stream from microphone instead of file | | `--server` | | `http://localhost:8000` | Server URL (connects to running server) | **File mode** sends the audio to the REST API for batch transcription. **Stream mode** connects to the WebSocket endpoint for real-time transcription. ```bash title="With hot words and language" macaw transcribe call.wav -m faster-whisper-large-v3 -l pt --hot-words "Macaw,OpenVoice" ``` ```bash title="Output as subtitles" macaw transcribe video.wav -m faster-whisper-large-v3 --format srt ``` --- ### `macaw translate` Translate audio to English. ```bash title="Translate Portuguese audio" macaw translate reuniao.wav --model faster-whisper-large-v3 ``` | Flag | Short | Default | Description | |------|:---:|---------|-------------| | `--model` | `-m` | — | Model to use (required) | | `--format` | | `json` | Output format | | `--no-itn` | | — | Disable ITN | | `--hot-words` | | — | Comma-separated hot words | | `--server` | | `http://localhost:8000` | Server URL | > **Info English output only** > Translation always produces English text, regardless of the source language. This matches the OpenAI API behavior. --- ### `macaw list` List installed models. ```bash macaw list ``` ``` NAME TYPE ENGINE SIZE faster-whisper-large-v3 stt faster-whisper 3.1 GB kokoro tts kokoro 982 MB wenet-chinese stt wenet 1.2 GB ``` | Flag | Default | Description | |------|---------|-------------| | `--models-dir` | `~/.macaw/models` | Models directory to scan | --- ### `macaw inspect` Show detailed information about a model. ```bash macaw inspect faster-whisper-large-v3 ``` ``` Name: faster-whisper-large-v3 Type: stt Engine: faster-whisper Architecture: encoder-decoder Size: 3.1 GB Capabilities: Hot words: false (via initial_prompt workaround) Initial prompt: true Batch: true Word timestamps: true Max concurrent: 1 Engine Config: beam_size: 5 vad_filter: false compute_type: float16 device: cuda ``` | Flag | Default | Description | |------|---------|-------------| | `--models-dir` | `~/.macaw/models` | Models directory | --- ### `macaw ps` List models loaded on a running server. ```bash macaw ps ``` ``` NAME TYPE ENGINE STATUS faster-whisper-large-v3 stt faster-whisper ready kokoro tts kokoro ready ``` | Flag | Default | Description | |------|---------|-------------| | `--server` | `http://localhost:8000` | Server URL to query | This queries the `GET /v1/models` endpoint on the running server. --- ### `macaw pull` Download a model from HuggingFace Hub. 
```bash title="Download a model" macaw pull faster-whisper-large-v3 ``` ```bash title="Force re-download" macaw pull faster-whisper-large-v3 --force ``` | Flag | Default | Description | |------|---------|-------------| | `--models-dir` | `~/.macaw/models` | Download destination | | `--force` | — | Re-download even if already exists | --- ### `macaw remove` Remove an installed model. ```bash title="Remove with confirmation" macaw remove faster-whisper-large-v3 ``` ```bash title="Skip confirmation" macaw remove faster-whisper-large-v3 --yes ``` | Flag | Short | Default | Description | |------|:---:|---------|-------------| | `--models-dir` | | `~/.macaw/models` | Models directory | | `--yes` | `-y` | — | Skip confirmation prompt | ## Typical Workflow ```bash title="1. Install a model" macaw pull faster-whisper-large-v3 # 2. Verify it's installed macaw list # 3. Start the server macaw serve # 4. In another terminal — check loaded models macaw ps # 5. Transcribe a file macaw transcribe audio.wav -m faster-whisper-large-v3 # 6. Stream from microphone macaw transcribe --stream -m faster-whisper-large-v3 # 7. Translate foreign audio macaw translate foreign.wav -m faster-whisper-large-v3 ``` ## Environment Variables CLI commands respect these environment variables: | Variable | Default | Description | |----------|---------|-------------| | `MACAW_MODELS_DIR` | `~/.macaw/models` | Default models directory | | `MACAW_SERVER_URL` | `http://localhost:8000` | Default server URL for client commands | | `MACAW_LOG_LEVEL` | `info` | Default log level | | `MACAW_LOG_FORMAT` | `text` | Default log format | --- # REST API Reference Macaw implements the [OpenAI Audio API](https://platform.openai.com/docs/api-reference/audio) contract. Existing OpenAI client libraries work without modification -- just change the `base_url`. --- ## Endpoints Overview | Method | Path | Description | |---|---|---| | `POST` | `/v1/audio/transcriptions` | Transcribe audio to text | | `POST` | `/v1/audio/translations` | Translate audio to English | | `POST` | `/v1/audio/speech` | Generate speech from text | | `GET` | `/health` | Health check | --- ## POST /v1/audio/transcriptions Transcribe an audio file into text. ### Request | Field | Type | Required | Description | |---|---|---|---| | `file` | file | Yes | Audio file (WAV, MP3, FLAC, OGG, WebM) | | `model` | string | Yes | Model ID (e.g., `faster-whisper-large-v3`) | | `language` | string | No | ISO 639-1 language code | | `prompt` | string | No | Context or hot words for the model | | `response_format` | string | No | `json` (default), `text`, `srt`, `vtt`, `verbose_json` | | `temperature` | float | No | Sampling temperature (0.0 - 1.0) | ### Examples ```bash title="Basic transcription" curl -X POST http://localhost:8000/v1/audio/transcriptions \ -F file=@audio.wav \ -F model=faster-whisper-large-v3 ``` ```bash title="With language and format options" curl -X POST http://localhost:8000/v1/audio/transcriptions \ -F file=@audio.wav \ -F model=faster-whisper-large-v3 \ -F language=en \ -F response_format=verbose_json ``` ```python title="Python (OpenAI SDK)" from openai import OpenAI client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed") result = client.audio.transcriptions.create( model="faster-whisper-large-v3", file=open("audio.wav", "rb"), language="en", response_format="verbose_json", ) print(result.text) ``` ### Response ```json title="json format (default)" { "text": "Hello, how can I help you today?" 
} ``` ```json title="verbose_json format" { "task": "transcribe", "language": "en", "duration": 3.42, "text": "Hello, how can I help you today?", "segments": [ { "id": 0, "start": 0.0, "end": 3.42, "text": "Hello, how can I help you today?" } ] } ``` --- ## POST /v1/audio/translations Translate audio from any supported language to English. ### Request | Field | Type | Required | Description | |---|---|---|---| | `file` | file | Yes | Audio file | | `model` | string | Yes | Model ID | | `prompt` | string | No | Context for the model | | `response_format` | string | No | Same options as transcriptions | | `temperature` | float | No | Sampling temperature | ### Example ```bash curl -X POST http://localhost:8000/v1/audio/translations \ -F file=@audio_portuguese.wav \ -F model=faster-whisper-large-v3 ``` ### Response ```json { "text": "Hello, how can I help you today?" } ``` > **Info** > Translation always outputs English text, regardless of the source language. --- ## POST /v1/audio/speech Generate speech audio from text. ### Request | Field | Type | Required | Description | |---|---|---|---| | `model` | string | Yes | TTS model ID (e.g., `kokoro-v1`) | | `input` | string | Yes | Text to synthesize | | `voice` | string | Yes | Voice identifier (e.g., `default`) | | `response_format` | string | No | `wav` (default) or `pcm` | ### Examples ```bash title="Generate WAV file" curl -X POST http://localhost:8000/v1/audio/speech \ -H "Content-Type: application/json" \ -d '{"model": "kokoro-v1", "input": "Hello, welcome to Macaw!", "voice": "default"}' \ --output speech.wav ``` ```bash title="Generate raw PCM" curl -X POST http://localhost:8000/v1/audio/speech \ -H "Content-Type: application/json" \ -d '{"model": "kokoro-v1", "input": "Hello!", "voice": "default", "response_format": "pcm"}' \ --output speech.pcm ``` ```python title="Python (OpenAI SDK)" response = client.audio.speech.create( model="kokoro-v1", input="Hello, welcome to Macaw!", voice="default", ) response.stream_to_file("output.wav") ``` ### Response The response body is the audio file in the requested format. | Format | Content-Type | Description | |---|---|---| | `wav` | `audio/wav` | WAV with headers (default) | | `pcm` | `audio/pcm` | Raw PCM 16-bit, 16kHz, mono | --- ## GET /health Returns the runtime health status. ```bash curl http://localhost:8000/health ``` ```json { "status": "ok" } ``` --- ## Error Responses All endpoints return standard HTTP error codes with a JSON body: ```json { "error": { "message": "Model 'nonexistent' not found", "type": "model_not_found", "code": 404 } } ``` | Status | Meaning | |---|---| | `400` | Invalid request (missing fields, bad format) | | `404` | Model not found | | `422` | Validation error | | `500` | Internal server error | | `503` | Worker unavailable | --- # WebSocket Protocol The `/v1/realtime` endpoint supports real-time bidirectional audio streaming with JSON control messages and binary audio frames. --- ## Connecting ```bash title="Connect with wscat" wscat -c "ws://localhost:8000/v1/realtime?model=faster-whisper-large-v3" ``` ```python title="Connect with Python websockets" import websockets async with websockets.connect( "ws://localhost:8000/v1/realtime?model=faster-whisper-large-v3" ) as ws: # Send audio frames, receive events ... 
``` ### Query Parameters | Parameter | Required | Description | |---|---|---| | `model` | Yes | STT model ID | --- ## Message Flow ``` Client Server | | | ---- [connect] ----> | | <---- session.created | | | | ---- session.configure ----> | | | | ---- [binary PCM frames] ----> | | <---- vad.speech_start| | <---- transcript.partial | <---- transcript.partial | <---- transcript.final| | <---- vad.speech_end | | | | ---- tts.speak ----> | | <---- tts.speaking_start | <---- [binary audio] | | <---- tts.speaking_end| | | | ---- [close] ----> | ``` --- ## Client to Server Messages ### Binary Frames (Audio) Send raw PCM audio as binary WebSocket frames: | Property | Value | |---|---| | Format | PCM 16-bit signed integer | | Sample rate | Any (resampled automatically to 16 kHz) | | Channels | Mono (or first channel extracted) | > **Tip** > You can send audio at any sample rate -- the runtime automatically resamples to 16 kHz before processing. ### session.configure Configure the session after connecting. Optional -- defaults are used if not sent. ```json { "type": "session.configure", "vad": { "sensitivity": "normal" }, "language": "en", "hot_words": ["Macaw", "OpenVoice", "transcription"], "tts_model": "kokoro-v1" } ``` | Field | Type | Description | |---|---|---| | `vad.sensitivity` | string | `high`, `normal`, or `low` | | `language` | string | ISO 639-1 language code | | `hot_words` | string[] | Domain-specific keywords to boost | | `tts_model` | string | TTS model for full-duplex mode | ### tts.speak Trigger text-to-speech synthesis. The server will stream audio back as binary frames. ```json { "type": "tts.speak", "text": "Hello, how can I help you?", "voice": "default" } ``` | Field | Type | Description | |---|---|---| | `text` | string | Text to synthesize | | `voice` | string | Voice identifier | > **Warning** > Sending a new `tts.speak` while one is already active **cancels the previous one**. TTS requests do not queue -- only the latest one is processed. ### tts.cancel Cancel the currently active TTS synthesis. ```json { "type": "tts.cancel" } ``` --- ## Server to Client Events ### session.created Sent immediately after the WebSocket connection is established. ```json { "type": "session.created", "session_id": "abc123" } ``` ### vad.speech_start Speech activity detected. The runtime has started buffering audio for transcription. ```json { "type": "vad.speech_start", "timestamp": 1234567890.123 } ``` ### transcript.partial Intermediate transcription hypothesis. Updated as more audio arrives. **Unstable** -- may change with subsequent partials. ```json { "type": "transcript.partial", "text": "Hello how can" } ``` > **Info** > Partials are best-effort hypotheses. Never apply post-processing (ITN) to partials -- they are too unstable for reliable formatting. ### transcript.final Confirmed transcription segment. This is the stable, post-processed result. ```json { "type": "transcript.final", "text": "Hello, how can I help you today?", "language": "en", "duration": 3.42 } ``` ### vad.speech_end Speech activity has ended. ```json { "type": "vad.speech_end", "timestamp": 1234567890.456 } ``` ### tts.speaking_start TTS synthesis has begun. **STT is automatically muted** during TTS to prevent feedback loops. 
```json { "type": "tts.speaking_start" } ``` ### Binary Frames (TTS Audio) During TTS synthesis, the server sends binary WebSocket frames containing audio data: | Direction | Content | |---|---| | Server to client (binary) | Always TTS audio | | Client to server (binary) | Always STT audio | > **Tip** > There is no ambiguity in binary frame direction -- server-to-client binary frames are **always** TTS audio, and client-to-server binary frames are **always** STT audio. ### tts.speaking_end TTS synthesis is complete. STT is **automatically unmuted** and resumes processing audio. ```json { "type": "tts.speaking_end" } ``` ### error An error occurred during processing. ```json { "type": "error", "message": "Worker connection lost", "recoverable": true } ``` | Field | Type | Description | |---|---|---| | `message` | string | Human-readable error description | | `recoverable` | boolean | Whether the session can continue | --- ## Full-Duplex Mode When a `tts_model` is configured, the WebSocket operates in full-duplex mode: 1. Client streams audio for STT continuously 2. Client sends `tts.speak` to trigger speech synthesis 3. Server automatically **mutes STT** during TTS playback (via `try/finally` -- unmute is guaranteed even if TTS crashes) 4. After TTS completes, STT **resumes automatically** See the [Full-Duplex Guide](../guides/full-duplex) for implementation details. --- # gRPC Internal Protocol Macaw uses gRPC for communication between the runtime process and its worker subprocesses. This protocol is **internal** and not intended for direct client use. > **Warning** > The gRPC API is an implementation detail. Use the [REST API](rest-api) or [WebSocket Protocol](websocket-protocol) for client integrations. --- ## Overview Each engine runs in an isolated subprocess that exposes a gRPC server. The runtime connects as a gRPC client. ``` Runtime Process Worker Subprocess +------------------+ +------------------+ | API Server | gRPC | STT Backend | | Scheduler | <========> | (Faster-Whisper) | | Session Manager | :50051 | | +------------------+ +------------------+ +------------------+ +------------------+ | API Server | gRPC | TTS Backend | | | <========> | (Kokoro) | | | :50052 | | +------------------+ +------------------+ ``` --- ## Proto Definitions ### STT Worker ``` src/macaw/proto/stt_worker.proto ``` The STT worker uses a **bidirectional streaming** RPC for real-time transcription: - **Client stream**: Audio chunks (PCM bytes) - **Server stream**: Transcription results (partial and final) The gRPC stream itself serves as the health check mechanism -- a broken stream indicates a crashed worker, triggering automatic recovery. ### TTS Worker ``` src/macaw/proto/tts_worker.proto ``` The TTS worker uses a **server-side streaming** RPC: - **Request**: Text input, voice ID, and synthesis parameters - **Server stream**: Audio chunks for low-latency streaming --- ## Worker Lifecycle 1. **Spawn**: Runtime launches worker as a subprocess on a designated port 2. **Ready**: Worker loads the model and starts the gRPC server 3. **Serve**: Runtime sends requests over gRPC streams 4. **Crash/Recovery**: If the stream breaks, the runtime respawns the worker and replays uncommitted data from the WAL ### Default Ports | Worker | Port | |---|---| | STT | `50051` | | TTS | `50052` | --- ## Regenerating Stubs If you modify the proto files, regenerate the Python stubs: ```bash make proto ``` This runs `grpcio-tools` to generate `*_pb2.py` and `*_pb2_grpc.py` files in `src/macaw/proto/`. 
--- # Architecture Overview Macaw OpenVoice is a **unified voice runtime** that orchestrates STT (Speech-to-Text) and TTS (Text-to-Speech) engines through a single process with isolated gRPC workers. It provides an OpenAI-compatible API while keeping engines modular and crash-isolated. ## High-Level Architecture ``` ┌─────────────────────────────────────────────┐ │ Macaw Runtime │ │ │ ┌──────────┐ ┌─────┴─────┐ ┌───────────┐ ┌───────────┐ │ │ Clients │────▶│ API Server │────▶│ Scheduler │────▶│ STT Worker│ │ (subprocess) │ │ │ (FastAPI) │ │ │ │ gRPC:50051│ │ │ REST │ │ │ │ Priority │ └───────────┘ │ │ WebSocket │ │ /v1/audio/ │ │ Queue │ │ │ CLI │ │ /v1/realtime│ │ Batching │ ┌───────────┐ │ └──────────┘ │ │ │ Cancel │────▶│ TTS Worker│ │ (subprocess) └─────┬─────┘ └───────────┘ │ gRPC:50052│ │ │ └───────────┘ │ ┌─────┴──────────────┐ │ │ Session Manager │ │ │ (streaming only) │ │ │ │ │ │ State Machine │ │ │ Ring Buffer │ │ │ WAL Recovery │ │ └────────────────────┘ │ │ │ ┌────┴───────────────────────────────┐ │ │ Audio Pipeline │ │ │ Preprocessing → VAD → Postprocess │ │ └────────────────────────────────────┘ │ └────────────────────────────────────────────┘ ``` ## Core Layers ### API Server The FastAPI server exposes three types of interfaces: | Interface | Endpoint | Use Case | |-----------|----------|----------| | REST (batch) | `POST /v1/audio/transcriptions` | File transcription | | REST (batch) | `POST /v1/audio/translations` | File translation to English | | REST (batch) | `POST /v1/audio/speech` | Text-to-speech synthesis | | WebSocket | `WS /v1/realtime` | Streaming STT + full-duplex TTS | | Health | `GET /health`, `GET /v1/models` | Monitoring and model listing | All REST endpoints are **OpenAI API-compatible** — existing OpenAI client libraries work without modification. ### Scheduler The Scheduler routes **batch** (REST) requests to gRPC workers. It provides: - **Priority queue** with two levels: `REALTIME` and `BATCH` - **Cancellation** for queued and in-flight requests - **Dynamic batching** to group requests by model - **Latency tracking** with TTL-based cleanup > **Warning Streaming bypasses the Scheduler** > WebSocket streaming uses `StreamingGRPCClient` directly — it does **not** pass through the priority queue. The Scheduler is only for REST batch requests. See [Scheduling](./scheduling) for details. ### Session Manager The Session Manager coordinates **streaming STT only**. Each WebSocket connection gets its own session with: - **State machine** — 6 states: `INIT → ACTIVE → SILENCE → HOLD → CLOSING → CLOSED` - **Ring buffer** — pre-allocated circular buffer for audio frames (zero allocations during streaming) - **WAL** — in-memory Write-Ahead Log for crash recovery - **Backpressure** — rate limiting at 1.2x real-time, frame dropping when overloaded > **Info TTS is stateless** > TTS does not use the Session Manager. Each `tts.speak` request is independent — no state is carried between synthesis calls. See [Session Manager](./session-manager) for details. ### Audio Pipeline The audio pipeline runs **in the runtime**, not in the engine. This guarantees consistent behavior across all engines. 
``` Input Audio → Resample (16kHz) → DC Remove → Gain Normalize → VAD → Engine ↓ Raw Text → ITN → Output ``` | Stage | Layer | Description | |-------|-------|-------------| | Resample | Preprocessing | Convert to 16kHz mono via `scipy.signal.resample_poly` | | DC Remove | Preprocessing | 2nd-order Butterworth HPF at 20Hz | | Gain Normalize | Preprocessing | Peak normalization to -3.0 dBFS | | Energy Pre-filter | VAD | RMS + spectral flatness check (~0.1ms) | | Silero VAD | VAD | Neural speech probability (~2ms on CPU) | | ITN | Postprocessing | Inverse Text Normalization via NeMo (fail-open) | See [VAD Pipeline](./vad-pipeline) for details. ### Workers Workers are **gRPC subprocesses**. A worker crash does not bring down the runtime — the Session Manager recovers by resending uncommitted audio from the ring buffer. | Worker | Port | Protocol | Engines | |--------|------|----------|---------| | STT | 50051 | Bidirectional streaming | Faster-Whisper, WeNet | | TTS | 50052 | Server streaming | Kokoro | **Worker lifecycle:** ``` STARTING → READY → BUSY → STOPPING → STOPPED ↑ │ └───────┘ (on idle) CRASHED → (auto-restart, max 3 in 60s) ``` The WorkerManager handles health probing (exponential backoff, 30s timeout), graceful shutdown (SIGTERM → 5s wait → SIGKILL), and automatic restart with rate limiting. ### Model Registry The Registry manages model manifests (`macaw.yaml` files) and lifecycle. Models declare their `architecture` field, which tells the runtime how to adapt the pipeline: | Architecture | Example | LocalAgreement | Cross-segment Context | Native Partials | |-------------|---------|:-:|:-:|:-:| | `encoder-decoder` | Faster-Whisper | Yes | Yes (224 tokens) | No | | `ctc` | WeNet | No | No | Yes | | `streaming-native` | Paraformer | No | No | Yes | ## Data Flow ### Batch Request (REST) ``` Client → POST /v1/audio/transcriptions → Preprocessing pipeline (resample, DC remove, normalize) → Scheduler priority queue → gRPC TranscribeFile to STT worker → Postprocessing (ITN) → JSON response to client ``` ### Streaming Request (WebSocket) ``` Client → WS /v1/realtime → Session created (state: INIT) → Binary frames arrive → StreamingPreprocessor (per-frame) → VAD (energy pre-filter → Silero) → SPEECH_START → state: ACTIVE → Frames written to ring buffer → Frames sent via StreamingGRPCClient to STT worker → Partial/final transcripts sent back to client → SPEECH_END → state: SILENCE → ITN applied on final transcripts only ``` ### Full-Duplex (STT + TTS) ``` Client sends audio (STT) ──────────────────────────────▶ partials/finals Client sends tts.speak ──▶ mute STT ──▶ gRPC Synthesize to TTS worker ──▶ tts.speaking_start event ──▶ binary audio chunks (server → client) ──▶ tts.speaking_end event ──▶ unmute STT (guaranteed via try/finally) ``` ## Key Design Decisions | Decision | Rationale | |----------|-----------| | **Single process, subprocess workers** | Crash isolation without distributed system complexity | | **VAD in runtime, not engine** | Consistent behavior across all engines | | **Preprocessing before VAD** | Normalized audio ensures stable VAD thresholds | | **Streaming bypasses Scheduler** | Direct gRPC connection avoids queue latency for real-time | | **Mute-on-speak for full-duplex** | Prevents TTS audio from feeding back into STT | | **Pipeline adapts by architecture** | Encoder-decoder gets LocalAgreement; CTC uses native partials | | **ITN on finals only** | Partials are unstable — ITN would produce confusing output | | **In-memory WAL** | Fast recovery without disk I/O 
overhead | | **gRPC stream break as heartbeat** | No separate health polling needed for crash detection | --- # Session Manager The Session Manager is the core component for **streaming STT**. It coordinates audio buffering, speech detection, worker communication, and crash recovery for each WebSocket connection. > **Info STT only** > The Session Manager is used exclusively for streaming STT. TTS is stateless per request — each `tts.speak` call is independent. ## State Machine Each streaming session progresses through a 6-state finite state machine. Transitions are validated — invalid transitions raise `InvalidTransitionError`, and `CLOSED` is terminal. ``` ┌─────────────────────────┐ │ │ ▼ │ ┌──────┐ ┌────────┐ ┌─────────┐ ┌──────┐ │ INIT │───▶│ ACTIVE │───▶│ SILENCE │───▶│ HOLD │ └──┬───┘ └───┬────┘ └────┬────┘ └──┬───┘ │ │ │ │ │ │ │ ┌───┴────┐ │ │ └────────▶│CLOSING │ │ │ └───┬────┘ │ │ │ ▼ ▼ ▼ ┌──────────────────────────────────────────────┐ │ CLOSED │ └──────────────────────────────────────────────┘ ``` ### States | State | Description | Behavior | |-------|-------------|----------| | `INIT` | Session created, waiting for first speech | Frames preprocessed but not sent to worker | | `ACTIVE` | Speech detected, actively transcribing | Frames written to ring buffer and sent to gRPC worker | | `SILENCE` | Speech ended, waiting for next speech | Final transcript emitted, worker stream may close | | `HOLD` | Extended silence, conserving resources | Frames **not** sent to worker (saves GPU). Worker stream closed | | `CLOSING` | Graceful shutdown in progress | Flushing remaining data, preparing to close | | `CLOSED` | Terminal state | No transitions allowed. Session resources released | ### Timeouts Each state has a configurable timeout that triggers an automatic transition: | State | Default Timeout | Transition Target | |-------|:-:|---| | `INIT` | 30s | `CLOSED` (no speech detected) | | `SILENCE` | 30s | `HOLD` (extended silence) | | `HOLD` | 300s (5 min) | `CLOSING` (session idle too long) | | `CLOSING` | 2s | `CLOSED` (flush complete) | ### Triggers | Trigger | Transition | |---------|-----------| | `SPEECH_START` (VAD) | `INIT → ACTIVE` or `SILENCE → ACTIVE` or `HOLD → ACTIVE` | | `SPEECH_END` (VAD) | `ACTIVE → SILENCE` | | Silence timeout | `SILENCE → HOLD` | | Hold timeout | `HOLD → CLOSING` | | `session.close` command | Any → `CLOSING → CLOSED` | | Init timeout | `INIT → CLOSED` | ## Ring Buffer The ring buffer is a **pre-allocated circular buffer** that stores audio frames during streaming. It is designed for zero allocations during operation. ### Specifications | Property | Value | |----------|-------| | Default capacity | 1,920,000 bytes (60s at 16kHz, 16-bit) | | Allocation | Pre-allocated at session start | | Offset tracking | Absolute (`total_written`), monotonically increasing | | Overwrite protection | Read fence prevents overwriting uncommitted data | | Force commit threshold | 90% of capacity | ### Read Fence The read fence (`_read_fence`) marks the boundary between committed and uncommitted data: ``` ┌──────────────────────────────────────────────────────┐ │ Ring Buffer │ │ │ │ [committed] │ [uncommitted] │ [free space] │ │ ▲ ▲ │ │ read_fence write_pos │ │ │ │ ◀── safe to overwrite never overwrite ──▶ │ └──────────────────────────────────────────────────────┘ ``` > **Warning** > Never overwrite data past `last_committed_offset` — this data is needed for recovery. If a write would overwrite uncommitted data, `BufferOverrunError` is raised. 
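For intuition, here is a heavily simplified sketch of the read-fence bookkeeping using absolute offsets, as described above. It is a conceptual model only, not Macaw's ring buffer, and it leaves out the force-commit path.

```python title="Read-fence invariant (conceptual sketch)"
class BufferOverrunError(Exception):
    """A write would overwrite data that has not been committed yet."""


class SketchRingBuffer:
    """Absolute offsets plus a read fence; not the real implementation."""

    def __init__(self, capacity: int = 1_920_000) -> None:  # 60 s at 16 kHz, 16-bit
        self._buf = bytearray(capacity)  # pre-allocated once, never resized
        self._capacity = capacity
        self.total_written = 0           # absolute write offset, only grows
        self._read_fence = 0             # absolute offset of the last commit

    def write(self, data: bytes) -> None:
        # Uncommitted bytes plus the new frame must fit inside capacity;
        # otherwise the write would clobber data still needed for recovery.
        if (self.total_written - self._read_fence) + len(data) > self._capacity:
            raise BufferOverrunError("write would overwrite uncommitted data")
        for i, byte in enumerate(data):
            self._buf[(self.total_written + i) % self._capacity] = byte
        self.total_written += len(data)

    def commit(self, absolute_offset: int) -> None:
        # Everything before this offset may now be overwritten safely.
        self._read_fence = max(self._read_fence, absolute_offset)

    def get_uncommitted(self) -> bytes:
        # Bytes between the fence and the write position (resent on recovery).
        return bytes(
            self._buf[offset % self._capacity]
            for offset in range(self._read_fence, self.total_written)
        )
```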
### Force Commit When uncommitted data exceeds **90%** of buffer capacity, the ring buffer triggers a force commit: 1. The `on_force_commit` callback fires **synchronously** from `write()` 2. The callback sets a `_force_commit_pending` flag 3. `process_frame()` (async) checks this flag and commits the segment 4. This prevents buffer overrun while keeping the write path non-blocking ## WAL (Write-Ahead Log) The WAL provides crash recovery using an **in-memory, single-record, overwrite** strategy. ### Checkpoint Structure ```python @dataclass(frozen=True, slots=True) class WALCheckpoint: segment_id: int # Current speech segment buffer_offset: int # Ring buffer position timestamp_ms: int # Monotonic timestamp (never wall-clock) ``` > **Tip Why monotonic time?** > The WAL uses `time.monotonic()` instead of `time.time()`. This ensures checkpoint consistency even if the system clock is adjusted (NTP sync, DST changes). ### Atomicity WAL updates are atomic via Python reference assignment within the single asyncio event loop. No locks are needed — the GIL and single-threaded event loop guarantee consistency. ## Recovery When a gRPC worker crashes (detected via stream break), the Session Manager recovers automatically: ``` Worker crash detected (gRPC stream break) │ ▼ ┌─────────────────────┐ │ Set _recovering flag │ (prevents recursion) └──────────┬──────────┘ │ ▼ ┌─────────────────────┐ │ Open new gRPC stream │ (to restarted worker) └──────────┬──────────┘ │ ▼ ┌─────────────────────┐ │ Read WAL checkpoint │ (get segment_id, buffer_offset) └──────────┬──────────┘ │ ▼ ┌─────────────────────────────┐ │ Resend uncommitted data │ (ring_buffer.get_uncommitted()) │ from ring buffer │ └──────────┬──────────────────┘ │ ▼ ┌─────────────────────┐ │ Resume normal flow │ (clear _recovering flag) └─────────────────────┘ ``` | Property | Value | |----------|-------| | Recovery timeout | 10s | | Anti-recursion | `_recovering` flag prevents nested recovery attempts | | Data guarantee | Only uncommitted data is resent (no duplicates) | | Detection method | gRPC bidirectional stream break | ## Backpressure The backpressure controller prevents the client from overwhelming the system when audio arrives faster than real-time. ### Thresholds | Parameter | Value | |-----------|-------| | Rate limit | 1.2x real-time | | Max backlog | 10s of audio | | Burst detection window | 5s sliding window | | Rate limit cooldown | 1s between emissions | | Minimum wall-clock before checks | 0.5s | ### Actions When thresholds are exceeded, the backpressure controller emits one of two actions: | Action | Event | Description | |--------|-------|-------------| | `RateLimitAction` | `session.rate_limit` | Client should slow down. Includes `delay_ms` hint | | `FramesDroppedAction` | `session.frames_dropped` | Frames were dropped. Includes `dropped_ms` | ## Mute-on-Speak For full-duplex operation, the Session Manager supports muting STT while TTS is active: ```python # In the TTS speak task (simplified) try: session.mute() # STT frames dropped # ... stream TTS audio to client ... 
finally: session.unmute() # STT always resumes, even on error ``` When muted: - Incoming audio frames are **dropped** without processing - The `stt_muted_frames_total` metric is incremented - Unmute is **guaranteed** via `try/finally` — even if TTS crashes ## Metrics The Session Manager exposes 9 Prometheus metrics (optional — graceful degradation if `prometheus_client` is not installed): | Metric | Type | Description | |--------|------|-------------| | `stt_ttfb_seconds` | Histogram | Time to first byte (first partial transcript) | | `stt_final_delay_seconds` | Histogram | Time from speech end to final transcript | | `stt_active_sessions` | Gauge | Currently active streaming sessions | | `stt_vad_events_total` | Counter | VAD events by type (speech_start, speech_end) | | `stt_session_duration_seconds` | Histogram | Total session duration | | `stt_segments_force_committed_total` | Counter | Ring buffer force commits | | `stt_confidence_avg` | Histogram | Average transcript confidence | | `stt_worker_recoveries_total` | Counter | Worker crash recoveries | | `stt_muted_frames_total` | Counter | Frames dropped due to mute | ## Pipeline Adaptation The `StreamingSession` adapts its behavior based on the engine's `architecture` field: ### Encoder-Decoder (Whisper) - **LocalAgreement** — compares tokens across consecutive inference passes. Only tokens confirmed by `min_confirm_passes` (default: 2) passes are emitted as partials. `flush()` on speech end emits remaining tokens as final - **Cross-segment context** — last 224 tokens (half of Whisper's 448 context window) from the previous final are used as `initial_prompt` for the next segment ### CTC (WeNet) - **Native partials** — CTC produces real-time partial transcripts directly - **No LocalAgreement** — not needed, partials are native - **No cross-segment context** — CTC does not support `initial_prompt` - Hot words with native support use the engine's built-in mechanism --- # VAD Pipeline Macaw runs all audio preprocessing and Voice Activity Detection (VAD) **in the runtime**, not in the engine. This guarantees consistent behavior regardless of which STT engine is active. > **Warning Preprocessing comes before VAD** > Audio must be normalized before reaching Silero VAD. Without normalization, VAD thresholds become unreliable across different audio sources and recording conditions. ## Pipeline Overview ``` Raw Audio Input │ ▼ ┌──────────────┐ │ Resample │ 16kHz mono │ (~0.5ms) │ └──────┬───────┘ │ ▼ ┌──────────────┐ │ DC Remove │ Butterworth HPF @ 20Hz │ (~0.1ms) │ └──────┬───────┘ │ ▼ ┌──────────────┐ │ Gain │ Peak normalize to -3dBFS │ Normalize │ │ (~0.1ms) │ └──────┬───────┘ │ ▼ ┌──────────────┐ │ Energy │ RMS + spectral flatness │ Pre-filter │ (if silence → skip Silero) │ (~0.1ms) │ └──────┬───────┘ │ (only if energy detected) ▼ ┌──────────────┐ │ Silero VAD │ Neural speech probability │ (~2ms) │ └──────┬───────┘ │ ▼ VAD Decision (SPEECH_START / SPEECH_END) ``` ## Stage 1: Resample Converts input audio to **16kHz mono**, the standard format expected by all STT engines. 
| Property | Value | |----------|-------| | Method | `scipy.signal.resample_poly` (polyphase filter) | | Target rate | 16,000 Hz | | Channel handling | Multi-channel averaged to mono | | Quality | High-quality polyphase resampling (no aliasing) | ```python title="What it does" # Input: any sample rate, any channels # Output: 16kHz mono float32 audio, sample_rate = resample_stage.process(audio, original_sample_rate) # sample_rate is now 16000 ``` ## Stage 2: DC Remove Removes DC offset using a **2nd-order Butterworth high-pass filter** at 20Hz. DC offset is common in low-quality microphones and can bias VAD energy calculations. | Property | Value | |----------|-------| | Filter type | Butterworth high-pass (2nd order) | | Cutoff frequency | 20 Hz | | Implementation | `scipy.signal.sosfilt` with cached SOS coefficients | > **Tip Why 20Hz?** > Human speech starts around 85Hz (male fundamental) to 255Hz (female fundamental). A 20Hz cutoff removes DC and sub-bass rumble without affecting any speech content. ## Stage 3: Gain Normalize Peak normalization ensures audio reaches the VAD at a consistent level, regardless of the original recording volume. | Property | Value | |----------|-------| | Method | Peak normalization | | Target level | -3.0 dBFS (default) | | Clip protection | Yes — output clamped to [-1.0, 1.0] | Without gain normalization, a quiet recording might produce energy levels below the VAD threshold, causing missed speech detection. A loud recording might trigger false positives. ## Stage 4: Energy Pre-filter The energy pre-filter is a **fast, cheap check** (~0.1ms) that gates access to the more expensive Silero VAD (~2ms). If the frame is clearly silence, Silero is never called. ### How It Works The pre-filter combines two measurements: 1. **RMS energy (dBFS)** — overall loudness of the frame 2. **Spectral flatness** — how "noise-like" vs "tonal" the signal is ``` RMS Energy │ ┌───────────────┼───────────────┐ │ │ │ Below threshold Above threshold │ │ │ │ ▼ ▼ │ SILENCE Check spectral │ (skip Silero) flatness │ │ │ ┌─────────┼─────────┐ │ │ │ │ Flatness > 0.8 Flatness ≤ 0.8 (white noise) (tonal/speech) │ │ ▼ ▼ SILENCE PASS TO SILERO (skip Silero) ``` ### Sensitivity Levels Energy thresholds vary by sensitivity level: | Sensitivity | Energy Threshold | Effect | |-------------|:---:|--------| | `HIGH` | -50 dBFS | Detects very quiet speech. More false positives | | `NORMAL` | -40 dBFS | Balanced default | | `LOW` | -30 dBFS | Requires louder speech. Fewer false positives | **Spectral flatness threshold:** 0.8 (fixed). Values above 0.8 indicate white noise or silence — signals with no tonal content. ## Stage 5: Silero VAD The Silero VAD model is the final decision-maker. It uses a neural network to compute a **speech probability** for each frame. | Property | Value | |----------|-------| | Model | `snakers4/silero-vad` via `torch.hub` | | Loading | Lazy (loaded on first use) | | Thread safety | `threading.Lock` | | Frame size | 512 samples (32ms at 16kHz) | | Large frames | Split into 512-sample sub-frames, max probability returned | | Cost | ~2ms per frame on CPU | ### Speech Probability Thresholds | Sensitivity | Threshold | Meaning | |-------------|:---:|---------| | `HIGH` | 0.3 | Low bar — detects quiet/uncertain speech | | `NORMAL` | 0.5 | Balanced default | | `LOW` | 0.7 | High bar — only clear speech triggers | > **Info Sensitivity affects both stages** > The sensitivity level controls thresholds in **both** the energy pre-filter and Silero VAD. 
Setting `HIGH` makes both stages more permissive. ## Debounce and Duration Limits The VAD detector applies debounce to prevent rapid state toggling: | Parameter | Default | Description | |-----------|:---:|-------------| | `min_speech_duration_ms` | 250ms | Minimum speech before emitting `SPEECH_START` | | `min_silence_duration_ms` | 300ms | Minimum silence before emitting `SPEECH_END` | | `max_speech_duration_ms` | 30,000ms | Maximum speech segment (forces `SPEECH_END`) | ### Why max speech duration? Encoder-decoder models like Whisper have a fixed context window (30 seconds). If speech exceeds this window, the segment is force-ended to trigger transcription before the buffer overflows. The Session Manager handles cross-segment context to maintain continuity. ## VAD Events The VAD emits two event types: ```python @dataclass class VADEvent: event_type: str # "SPEECH_START" or "SPEECH_END" timestamp_ms: int # Monotonic timestamp ``` These events drive state transitions in the [Session Manager](./session-manager): | Event | Session Transition | |-------|-------------------| | `SPEECH_START` | `INIT → ACTIVE`, `SILENCE → ACTIVE`, `HOLD → ACTIVE` | | `SPEECH_END` | `ACTIVE → SILENCE` | ## Streaming vs Batch The preprocessing pipeline has two modes: ### Batch (REST API) Used for file uploads via `POST /v1/audio/transcriptions`: ```python title="AudioPreprocessingPipeline" # Decodes entire file (WAV/FLAC/OGG), applies all stages, outputs PCM 16-bit WAV pipeline = AudioPreprocessingPipeline(stages=[resample, dc_remove, gain_normalize]) processed_audio = pipeline.process(uploaded_file) ``` Supported input formats: WAV, FLAC, OGG (via `libsndfile` with `stdlib wave` fallback). ### Streaming (WebSocket) Used for real-time audio via `WS /v1/realtime`: ```python title="StreamingPreprocessor" # Processes one frame at a time # Input: raw PCM int16 bytes # Output: float32 16kHz mono preprocessor = StreamingPreprocessor(stages=[resample, dc_remove, gain_normalize]) processed_frame = preprocessor.process_frame(raw_pcm_bytes) ``` Each WebSocket connection gets its own `StreamingPreprocessor` instance to maintain per-connection filter state (DC remove uses stateful IIR filters). ## Configuration VAD settings can be adjusted per session via the `session.configure` WebSocket command: ```json title="Client → Server" { "type": "session.configure", "vad_sensitivity": "high", "hot_words": ["Macaw", "OpenVoice"] } ``` > **Warning Engine VAD must be disabled** > Always set `vad_filter: false` in the engine manifest. The runtime manages VAD — enabling the engine's built-in VAD (e.g., Faster-Whisper's `vad_filter`) would duplicate the work and cause unpredictable behavior. --- # Scheduling The Scheduler routes **batch** (REST API) requests to gRPC workers. It provides priority queuing, request cancellation, dynamic batching, and latency tracking. > **Warning Streaming bypasses the Scheduler** > WebSocket streaming uses `StreamingGRPCClient` directly. The Scheduler and its priority queue are **only** for REST batch requests (`POST /v1/audio/transcriptions`, `/translations`, `/speech`). 
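As a mental model for the queue ordering described below (REALTIME ahead of BATCH, FIFO within each level), the sketch uses a plain `asyncio.PriorityQueue` keyed by `(priority, sequence)`. This is an illustration only, not the runtime's scheduler classes.

```python title="Illustrative priority ordering (not the runtime's implementation)"
import asyncio
import itertools
from enum import IntEnum

class Priority(IntEnum):
    REALTIME = 0
    BATCH = 1

_seq = itertools.count()  # monotonically increasing tie-breaker -> FIFO within a level

async def demo() -> None:
    queue: asyncio.PriorityQueue = asyncio.PriorityQueue()
    await queue.put((Priority.BATCH, next(_seq), "file-1"))
    await queue.put((Priority.REALTIME, next(_seq), "live-1"))
    await queue.put((Priority.BATCH, next(_seq), "file-2"))

    while not queue.empty():
        priority, _, request = await queue.get()
        # Prints: REALTIME live-1, then BATCH file-1, BATCH file-2 (FIFO within the level)
        print(priority.name, request)

asyncio.run(demo())
```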
## Components ``` ┌─────────────────────────────────────────────┐ │ Scheduler │ │ │ REST Request ─┤ ┌──────────────────┐ ┌────────────────┐ │ │ │ BatchAccumulator │ │ PriorityQueue │ │ │ │ │──▶│ │ │ │ │ Group by model │ │ REALTIME first │ │ │ │ Flush: 50ms / 8 │ │ FIFO within │ │ │ └──────────────────┘ │ level │ │ │ └───────┬────────┘ │ │ │ │ │ ┌──────────────────┐ ┌───────▼────────┐ │ │ │ CancellationMgr │ │ Dispatch Loop │ │ │ │ │ │ │ │ │ │ Queue + in-flight │ │ gRPC channel │ │ │ │ Cancel via gRPC │ │ pool │ │ │ └──────────────────┘ └───────┬────────┘ │ │ │ │ │ ┌──────────────────┐ │ │ │ │ LatencyTracker │◀─────────┘ │ │ │ │ │ │ │ 4 phases, 60s TTL│ │ │ └──────────────────┘ │ └─────────────────────────────────────────────┘ │ ▼ gRPC Workers ``` ## Priority Queue The queue has **two priority levels** with FIFO ordering within each level: | Level | Value | Use Case | |-------|:---:|----------| | `REALTIME` | 0 | High-priority requests | | `BATCH` | 1 | Standard file transcriptions | ### Aging To prevent starvation, `BATCH` requests that have been queued for more than **30 seconds** are automatically promoted to `REALTIME` priority. The `aging_promotions_total` metric tracks how often this occurs. ### Request Structure Each queued request carries: ```python @dataclass class ScheduledRequest: request_id: str priority: Priority # REALTIME or BATCH audio_data: bytes model: str cancel_event: asyncio.Event # Set to cancel result_future: asyncio.Future # Resolved when complete enqueue_time: float # For aging calculation ``` ## Batch Accumulator The BatchAccumulator groups `BATCH` requests by model to improve GPU utilization: | Parameter | Value | Description | |-----------|:---:|-------------| | Flush timer | 50ms | Maximum wait before flushing a partial batch | | Max batch size | 8 | Maximum requests per batch | | Model grouping | Per-model | Different models are batched separately | ### Flush Triggers A batch is flushed (sent to the queue) when **any** of these conditions is met: 1. **Timer expires** — 50ms since the first request in the batch 2. **Batch full** — 8 requests accumulated 3. **Model mismatch** — new request targets a different model > **Info REALTIME bypasses batching** > Only `BATCH` priority requests go through the accumulator. `REALTIME` requests are sent directly to the priority queue. ### Flush Lifecycle The flush callback (`_dispatch_batch`) is fired by an asyncio timer. If the Scheduler stops before the timer fires, `stop()` performs a manual flush to avoid losing queued requests. ## Cancellation Manager The CancellationManager handles cancellation for both **queued** and **in-flight** requests. ### Cancellation Flow ``` cancel(request_id) │ ├─── Request in queue? │ │ │ ▼ │ Set cancel_event │ Remove from queue │ Remove from tracking │ └─── Request in-flight? │ ▼ Set cancel_event Send gRPC Cancel RPC to worker (100ms timeout) Remove from tracking (best-effort — cannot interrupt CUDA kernels) ``` | Property | Value | |----------|-------| | Queue cancel | Immediate — request removed from queue | | In-flight cancel | Best-effort — gRPC `Cancel` RPC with 100ms timeout | | Idempotent | Yes — cancelling an already-cancelled request is a no-op | | Tracking | Entry removed on cancel. 
`unregister()` is no-op if already cancelled | ### REST API Cancellation is exposed via the REST endpoint: ```bash title="Cancel a request" curl -X POST http://localhost:8000/v1/audio/transcriptions/{request_id}/cancel ``` ```json title="Response" { "request_id": "req_abc123", "cancelled": true } ``` ## Dispatch Loop The dispatch loop runs as a background asyncio task and processes the priority queue: 1. Dequeue next request (REALTIME first, then BATCH, FIFO within each) 2. Check if request was cancelled (skip if so) 3. Acquire gRPC channel from the pool 4. Send `TranscribeFile` RPC to worker 5. Track latency phases 6. Resolve the `result_future` with the transcription result 7. Apply postprocessing (ITN) if enabled ### Timeouts Request timeout is calculated dynamically: ``` timeout = max(30s, audio_duration_estimate × 2.0) ``` This ensures long audio files get proportionally more time while maintaining a reasonable minimum. ### Graceful Shutdown When `stop()` is called: 1. Flush any pending batches in the BatchAccumulator 2. Signal the dispatch loop to stop 3. Wait up to **10 seconds** for in-flight requests to complete 4. Cancel remaining requests ## Streaming gRPC Client For WebSocket streaming (which bypasses the Scheduler), Macaw uses `StreamingGRPCClient` with a `StreamHandle` abstraction: ```python title="StreamHandle API" handle = await client.open_stream(model, session_id, language) # Send audio frames await handle.send_frame(audio_data, is_last=False) # Receive transcript events async for event in handle.receive_events(): # TranscriptSegment with text, is_final, confidence, etc. ... # Close gracefully await handle.close() # Sends is_last=True + done_writing() # Or cancel await handle.cancel() # Target: ≤50ms ``` ### gRPC Keepalive Aggressive keepalive settings prevent stream drops: | Parameter | Value | |-----------|-------| | `keepalive_time` | 10s | | `keepalive_timeout` | 5s | ## Latency Tracker The LatencyTracker measures request duration across 4 phases: ``` start() dequeued() grpc_started() complete() │ │ │ │ ▼ ▼ ▼ ▼ ┌─────────┐ ┌───────────┐ ┌──────────┐ ┌───────────┐ │ Enqueue │───▶│ Queue Wait│───▶│ gRPC Time│───▶│ Done │ └─────────┘ └───────────┘ └──────────┘ └───────────┘ │ │ └──────────── total_time ──────────────────────┘ ``` | Phase | Measurement | |-------|-------------| | `queue_wait` | Time spent waiting in the priority queue | | `grpc_time` | Time spent in the gRPC call to the worker | | `total_time` | End-to-end from enqueue to completion | ### TTL Entries expire after **60 seconds**. `cleanup()` runs periodically to remove entries for requests that never completed (e.g., cancelled, timed out). 
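To make the phase boundaries and the 60-second TTL concrete, here is a minimal sketch of a tracker with the same four phases. `PhaseTracker` and its method names are illustrative, not the runtime's `LatencyTracker` API.

```python title="Illustrative phase tracker (not the runtime's LatencyTracker)"
import time
from dataclasses import dataclass, field

@dataclass
class _Timing:
    enqueued: float = field(default_factory=time.monotonic)
    dequeued: float | None = None
    grpc_started: float | None = None

class PhaseTracker:
    """Tracks queue_wait / grpc_time / total_time per request and expires stale entries."""

    def __init__(self, ttl_s: float = 60.0) -> None:
        self._ttl_s = ttl_s
        self._entries: dict[str, _Timing] = {}

    def start(self, request_id: str) -> None:
        self._entries[request_id] = _Timing()

    def dequeued(self, request_id: str) -> None:
        self._entries[request_id].dequeued = time.monotonic()

    def grpc_started(self, request_id: str) -> None:
        self._entries[request_id].grpc_started = time.monotonic()

    def complete(self, request_id: str) -> dict[str, float] | None:
        timing = self._entries.pop(request_id, None)
        if timing is None or timing.dequeued is None or timing.grpc_started is None:
            return None
        now = time.monotonic()
        return {
            "queue_wait": timing.dequeued - timing.enqueued,
            "grpc_time": now - timing.grpc_started,
            "total_time": now - timing.enqueued,
        }

    def cleanup(self) -> None:
        # Drop entries that never completed (cancelled, timed out) once the TTL expires.
        cutoff = time.monotonic() - self._ttl_s
        self._entries = {
            rid: t for rid, t in self._entries.items() if t.enqueued >= cutoff
        }
```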
## Metrics ### Scheduler Metrics | Metric | Type | Description | |--------|------|-------------| | `scheduler_queue_depth` | Gauge | Current queue depth by priority level | | `scheduler_queue_wait_seconds` | Histogram | Time spent in queue | | `scheduler_grpc_duration_seconds` | Histogram | gRPC call duration | | `scheduler_cancel_latency_seconds` | Histogram | Time to cancel a request | | `scheduler_batch_size` | Histogram | Batch sizes dispatched | | `scheduler_requests_total` | Counter | Total requests by status (completed, cancelled, failed) | | `scheduler_aging_promotions_total` | Counter | BATCH → REALTIME promotions | ### TTS Metrics | Metric | Type | Description | |--------|------|-------------| | `tts_ttfb_seconds` | Histogram | Time to first audio byte | | `tts_synthesis_duration_seconds` | Histogram | Total synthesis time | | `tts_requests_total` | Counter | Total TTS requests | | `tts_active_sessions` | Gauge | Currently active TTS sessions | > **Tip Metrics are optional** > All metrics use `try/except ImportError` with a `HAS_METRICS` flag. If `prometheus_client` is not installed, metrics are silently skipped. Always check `if metric is not None` before observing. --- # Contributing Thank you for your interest in contributing to Macaw OpenVoice! This guide covers everything you need to set up a development environment, run tests, and submit changes. ## Development Setup ### Prerequisites | Tool | Version | Purpose | |------|---------|---------| | Python | 3.12+ | Runtime (project requires >=3.11) | | uv | latest | Fast Python package manager | | make | any | Build automation | | git | any | Version control | ### Clone and Install ```bash title="Clone the repository" git clone https://github.com/usemacaw/macaw-openvoice.git cd macaw-openvoice ``` ```bash title="Create virtual environment and install dependencies" uv venv --python 3.12 source .venv/bin/activate uv pip install -e ".[dev]" ``` ### Verify Setup ```bash title="Run the full check pipeline" make check # format + lint + typecheck make test-unit # unit tests ``` If both pass, you're ready to contribute. ## Development Workflow ### 1. Create a Branch ```bash git checkout -b feat/my-feature ``` Branch naming follows conventional prefixes: | Prefix | Use Case | |--------|----------| | `feat/` | New features | | `fix/` | Bug fixes | | `refactor/` | Code restructuring | | `test/` | Test additions/improvements | | `docs/` | Documentation changes | ### 2. Make Changes Follow the [Code Style](#code-style) guidelines below. ### 3. Run Checks ```bash title="During development — run unit tests (fast)" make test-unit ``` ```bash title="Before committing — run everything" make ci # format + lint + typecheck + all tests ``` ### 4. Commit Commits follow [Conventional Commits](https://www.conventionalcommits.org/): ```bash git commit -m "feat: add support for Paraformer streaming" git commit -m "fix: prevent ring buffer overrun on force commit" git commit -m "test: add integration tests for WeNet CTC partials" ``` | Type | Description | |------|-------------| | `feat:` | New feature | | `fix:` | Bug fix | | `refactor:` | Code change that neither fixes a bug nor adds a feature | | `test:` | Adding or updating tests | | `docs:` | Documentation only | ### 5. Submit a Pull Request Push your branch and open a PR against `main`. 
Include: - Clear description of what changed and why - Reference to any related issues - Test evidence (new tests or existing tests passing) ## Make Targets All `make` targets use `.venv/bin/` automatically — no need to activate the venv manually. | Target | Description | |--------|-------------| | `make check` | Format + lint + typecheck | | `make test` | All tests | | `make test-unit` | Unit tests only (use during development) | | `make test-integration` | Integration tests only | | `make test-fast` | All tests except `@pytest.mark.slow` | | `make ci` | Full pipeline: format + lint + typecheck + test | | `make proto` | Regenerate protobuf stubs | > **Tip Use `make test-unit` during development** > The full test suite includes integration tests that may require models and GPU. Unit tests run in seconds and catch most issues. ## Code Style ### General Rules - **Python 3.12** with strict mypy typing - **Async-first** — all public interfaces are `async` - **Formatting** — ruff (format + lint) - **Imports** — absolute from `macaw.` (e.g., `from macaw.registry import Registry`) - **Naming** — `snake_case` for functions/variables, `PascalCase` for classes - **Docstrings** — only on public interfaces (ABCs) and non-obvious functions - **No obvious comments** — code should be self-explanatory - **Errors** — typed domain exceptions, never generic `Exception` ### Testing Guidelines | Rule | Details | |------|---------| | Framework | pytest + pytest-asyncio with `asyncio_mode = "auto"` | | No `@pytest.mark.asyncio` | Auto mode handles it | | Async HTTP tests | `httpx.AsyncClient` with `ASGITransport` | | Error handler tests | `ASGITransport(raise_app_exceptions=False)` | | Fixtures | `tests/conftest.py` (auto-generated sine tones) | | Mocks | `unittest.mock` for inference engines | | Integration tests | `@pytest.mark.integration` marker | | Pattern | Arrange-Act-Assert | ### Running Individual Tests ```bash title="Run a specific test" .venv/bin/python -m pytest tests/unit/test_foo.py::test_bar -q ``` ```bash title="Run with verbose output" .venv/bin/python -m pytest tests/unit/test_foo.py -v ``` ## Project Structure ``` src/macaw/ ├── server/ # FastAPI — REST + WebSocket endpoints │ └── routes/ # transcriptions, translations, speech, health, realtime ├── scheduler/ # Priority queue, cancellation, batching, latency tracking ├── registry/ # Model Registry (macaw.yaml, lifecycle) ├── workers/ # Subprocess gRPC management │ ├── stt/ # STTBackend interface + implementations │ └── tts/ # TTSBackend interface + implementations ├── preprocessing/ # Audio pipeline (resample, DC remove, gain normalize) ├── postprocessing/ # Text pipeline (ITN via NeMo, fail-open) ├── vad/ # Voice Activity Detection (energy + Silero) ├── session/ # Session Manager (state machine, ring buffer, WAL) ├── cli/ # CLI commands (click) └── proto/ # gRPC protobuf definitions ``` ``` tests/ ├── unit/ # Fast, no I/O, mocked dependencies ├── integration/ # Real dependencies, may need GPU/models └── conftest.py # Shared fixtures ``` ## Common Pitfalls These are the most common issues contributors encounter. Read these before diving into the code: > **Warning Things that will bite you** > - **gRPC streams are the heartbeat.** Don't implement separate health check polling — the stream break is the crash detection. > - **Ring buffer has a read fence.** Never overwrite data past `last_committed_offset`. > - **ITN only on `transcript.final`.** Never apply ITN to partials. 
> - **Preprocessing before VAD.** Audio must be normalized before Silero VAD. > - **`vad_filter: false` in manifests.** Runtime handles VAD, not the engine. > - **Session Manager is STT only.** TTS is stateless per request. > - **LocalAgreement is for encoder-decoder only.** CTC has native partials. > - **Streaming bypasses the Scheduler.** WebSocket uses `StreamingGRPCClient` directly. ## Getting Help - Open an issue on [GitHub](https://github.com/usemacaw/macaw-openvoice/issues) for bugs and feature requests - Check existing issues and PRs before creating duplicates - For architecture questions, review the [Architecture Overview](../architecture/overview) - For general questions, reach out at [hello@usemacaw.io](mailto:hello@usemacaw.io) - Visit our website at [usemacaw.io](https://usemacaw.io) --- # Changelog All notable changes to Macaw OpenVoice are documented here. This project follows [Semantic Versioning](https://semver.org/) and the [Keep a Changelog](https://keepachangelog.com/) format. ## [Unreleased] ### Added - Full-duplex STT + TTS on a single WebSocket connection (M9) - Mute-on-speak mechanism with guaranteed unmute via try/finally - TTS cancel and auto-cancel of previous synthesis - `tts.speaking_start` and `tts.speaking_end` WebSocket events - KokoroBackend with 9-language support and streaming synthesis - TTS gRPC worker on port 50052 - `POST /v1/audio/speech` REST endpoint (OpenAI-compatible) - TTS metrics (TTFB, synthesis duration, requests, active sessions) - Session backpressure controller (rate limit at 1.2x real-time) - `session.configure` command for dynamic session settings - `input_audio_buffer.commit` command for manual buffer commit - `macaw pull` and `macaw remove` CLI commands - `macaw ps` command to list models on a running server ### Changed - WebSocket protocol extended with TTS commands and events - Session Manager now supports mute/unmute for full-duplex - Scheduler metrics made optional (graceful degradation without prometheus_client) --- ## Milestone History | Milestone | Description | Status | |-----------|-------------|:---:| | M1 | FastAPI server + health endpoint | Done | | M2 | Model Registry + manifests | Done | | M3 | Scheduler + priority queue + cancellation | Done | | M4 | gRPC STT workers (Faster-Whisper) | Done | | M5 | Streaming STT via WebSocket | Done | | M6 | Session Manager (state machine, ring buffer, WAL) | Done | | M7 | WeNet CTC engine + pipeline adaptation | Done | | M8 | TTS engine (Kokoro) + REST endpoint | Done | | M9 | Full-duplex STT + TTS | Done | > **Info** > For the complete commit history, see the [GitHub repository](https://github.com/usemacaw/macaw-openvoice). --- # Roadmap Macaw OpenVoice has completed all 9 milestones of the initial Product Requirements Document. The runtime is fully functional with STT, TTS, and full-duplex capabilities. 
## Completed Milestones | Phase | Milestone | What Was Delivered | |:---:|-----------|-------------------| | 1 | **M1 — API Server** | FastAPI with health endpoint, CORS, OpenAI-compatible structure | | 1 | **M2 — Model Registry** | `macaw.yaml` manifests, model lifecycle, architecture field | | 2 | **M3 — Scheduler** | Priority queue, cancellation, dynamic batching, latency tracking | | 2 | **M4 — STT Workers** | gRPC subprocess workers, Faster-Whisper backend, crash recovery | | 3 | **M5 — Streaming STT** | WebSocket `/v1/realtime`, VAD pipeline, streaming preprocessor | | 3 | **M6 — Session Manager** | State machine (6 states), ring buffer, WAL, backpressure | | 4 | **M7 — Multi-Engine** | WeNet CTC backend, pipeline adaptation by architecture | | 4 | **M8 — TTS** | Kokoro TTS backend, `POST /v1/audio/speech`, gRPC TTS worker | | 5 | **M9 — Full-Duplex** | Mute-on-speak, `tts.speak`/`tts.cancel`, STT+TTS on same WebSocket | ## Current State - **1,600+ tests** passing (unit + integration) - **3 STT architectures** supported: encoder-decoder, CTC, streaming-native - **2 STT engines**: Faster-Whisper, WeNet - **1 TTS engine**: Kokoro (9 languages) - **Full-duplex** voice interactions on a single WebSocket - **OpenAI-compatible** REST API - **Ollama-style** CLI ## What's Next The following areas are under consideration for future development. These are not commitments — they represent directions the project may explore based on community feedback and priorities. ### Engine Ecosystem | Feature | Description | |---------|-------------| | Paraformer backend | Streaming-native architecture support | | Piper TTS | Lightweight TTS alternative for CPU-only deployments | | Whisper.cpp | GGML-based inference without Python/CUDA dependency | | Multi-model serving | Load multiple models per worker type | ### Scalability | Feature | Description | |---------|-------------| | Worker pooling | Multiple worker instances per engine for higher throughput | | Horizontal scaling | Multiple runtime instances behind a load balancer | | GPU sharing | Time-slice GPU across STT and TTS workers | | Kubernetes operator | Automated deployment with GPU scheduling | ### Features | Feature | Description | |---------|-------------| | Speaker diarization | Identify and label different speakers | | Word-level timestamps | Per-word timing in streaming mode | | Custom vocabularies | User-defined vocabularies beyond hot words | | Audio streaming output | Server-Sent Events for TTS as an alternative to WebSocket | | Barge-in | Client interrupts TTS to speak (currently requires `tts.cancel`) | ### Observability | Feature | Description | |---------|-------------| | OpenTelemetry | Distributed tracing across runtime and workers | | Dashboard templates | Pre-built Grafana dashboards for Prometheus metrics | | Structured audit logging | Request/response logging for compliance | ## Contributing Want to help shape the roadmap? See the [Contributing Guide](./contributing) to get started, or open an issue on [GitHub](https://github.com/usemacaw/macaw-openvoice/issues) to discuss new ideas.