# Macaw OpenVoice

> Open-source voice runtime for real-time Speech-to-Text and Text-to-Speech with OpenAI-compatible API, streaming session control, and extensible execution architecture.

- [Welcome to Macaw OpenVoice](https://docs.usemacaw.io/docs/intro): Macaw OpenVoice is an open-source voice runtime for real-time Speech-to-Text and Text-to-Speech with an OpenAI-compatible API, streaming session control, and an extensible execution architecture.

## Getting Started

- [Installation](https://docs.usemacaw.io/docs/getting-started/installation): Macaw OpenVoice requires Python 3.11+ and uses pip extras to install only the engines you need.
- [Quickstart](https://docs.usemacaw.io/docs/getting-started/quickstart): Get from zero to your first transcription in under 5 minutes.
- [Configuration](https://docs.usemacaw.io/docs/getting-started/configuration): Macaw OpenVoice uses a combination of model manifests, runtime defaults, and environment variables for configuration.

## Supported Models

- [Supported Models](https://docs.usemacaw.io/docs/models): Macaw OpenVoice is engine-agnostic — it supports multiple STT and TTS engines through a unified backend interface. Each engine runs as an isolated gRPC subprocess, and the runtime adapts its pipeline automatically based on the model's architecture.
- [Faster-Whisper](https://docs.usemacaw.io/docs/models/faster-whisper): Faster-Whisper is the primary STT engine in Macaw OpenVoice. It provides high-accuracy multilingual speech recognition using CTranslate2-optimized Whisper models. Five model variants are available in the official catalog, covering use cases from lightweight testing to production-grade transcription.
- [WeNet](https://docs.usemacaw.io/docs/models/wenet): WeNet is a CTC-based STT engine optimized for low-latency streaming and Chinese speech recognition. Unlike Faster-Whisper, WeNet produces native partial transcripts frame-by-frame without requiring LocalAgreement, making it ideal for real-time applications where latency is critical.
- [Kokoro TTS](https://docs.usemacaw.io/docs/models/kokoro): Kokoro is a lightweight neural text-to-speech engine with 82 million parameters. It supports 9 languages, multiple voices, and produces high-quality 24kHz audio. Kokoro is the default TTS engine in Macaw OpenVoice, used for full-duplex voice interactions via WebSocket and REST speech synthesis.
- [Silero VAD](https://docs.usemacaw.io/docs/models/silero-vad): Silero VAD (Voice Activity Detection) is the neural speech detector used internally by Macaw OpenVoice. It determines which audio frames contain speech and which are silence, enabling the runtime to process only relevant audio. Silero VAD is not a user-installable model — it is bundled with the runtime and downloaded automatically via torch.hub.

## Guides

- [Batch Transcription](https://docs.usemacaw.io/docs/guides/batch-transcription): Macaw's REST API is OpenAI-compatible — you can use the official OpenAI SDK or any HTTP client to transcribe and translate audio files (a minimal client sketch follows this list).
- [Streaming STT](https://docs.usemacaw.io/docs/guides/streaming-stt): Macaw provides real-time speech-to-text via WebSocket at /v1/realtime. Audio frames are sent as binary messages and transcription events are returned as JSON.
- [Full-Duplex STT + TTS](https://docs.usemacaw.io/docs/guides/full-duplex): Macaw supports full-duplex voice interactions on a single WebSocket connection. The client streams audio for STT while simultaneously receiving synthesized speech from TTS — all on the same /v1/realtime endpoint.
- [Adding an Engine](https://docs.usemacaw.io/docs/guides/adding-engine): Macaw is engine-agnostic. Adding a new STT or TTS engine requires implementing the backend interface, registering it in the factory, creating a model manifest, and writing tests. Zero changes to the runtime core.
- [CLI Reference](https://docs.usemacaw.io/docs/guides/cli): Macaw ships with an Ollama-style CLI for managing models, running the server, and transcribing audio. All commands are available via the macaw binary.
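The Batch Transcription guide covers this workflow in full; as a rough illustration, a client built on the official OpenAI Python SDK could look like the sketch below. The server URL, port, API key, and model id are assumptions made for the example, not values taken from these docs; check your own deployment and the model catalog.

```python
# Minimal sketch, assuming a local Macaw server exposing the OpenAI-compatible
# REST API. The base URL, port, API key, and model id are illustrative
# placeholders, not values documented here.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local Macaw endpoint
    api_key="unused",                     # placeholder; a local server may not check it
)

# Transcribe a local audio file via the standard OpenAI Audio API call.
with open("meeting.wav", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="faster-whisper-small",     # hypothetical model id from the catalog
        file=audio_file,
    )

print(transcription.text)
```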
## API Reference

- [REST API](https://docs.usemacaw.io/docs/api-reference/rest-api): Macaw implements the OpenAI Audio API contract. Existing OpenAI client libraries work without modification -- just change the base_url.
- [WebSocket Protocol](https://docs.usemacaw.io/docs/api-reference/websocket-protocol): The /v1/realtime endpoint supports real-time bidirectional audio streaming with JSON control messages and binary audio frames (see the client sketch at the end of this page).
- [gRPC (Internal)](https://docs.usemacaw.io/docs/api-reference/grpc-internal): Macaw uses gRPC for communication between the runtime process and its worker subprocesses. This protocol is internal and not intended for direct client use.

## Architecture

- [Architecture Overview](https://docs.usemacaw.io/docs/architecture/overview): Macaw OpenVoice is a unified voice runtime that orchestrates STT (Speech-to-Text) and TTS (Text-to-Speech) engines through a single process with isolated gRPC workers. It provides an OpenAI-compatible API while keeping engines modular and crash-isolated.
- [Session Manager](https://docs.usemacaw.io/docs/architecture/session-manager): The Session Manager is the core component for streaming STT. It coordinates audio buffering, speech detection, worker communication, and crash recovery for each WebSocket connection.
- [VAD Pipeline](https://docs.usemacaw.io/docs/architecture/vad-pipeline): Macaw runs all audio preprocessing and Voice Activity Detection (VAD) in the runtime, not in the engine. This guarantees consistent behavior regardless of which STT engine is active.
- [Scheduling](https://docs.usemacaw.io/docs/architecture/scheduling): The Scheduler routes batch (REST API) requests to gRPC workers. It provides priority queuing, request cancellation, dynamic batching, and latency tracking.

## Community

- [Contributing](https://docs.usemacaw.io/docs/community/contributing): Thank you for your interest in contributing to Macaw OpenVoice! This guide covers everything you need to set up a development environment, run tests, and submit changes.
- [Changelog](https://docs.usemacaw.io/docs/community/changelog): All notable changes to Macaw OpenVoice are documented here. This project follows Semantic Versioning and the Keep a Changelog format.
- [Roadmap](https://docs.usemacaw.io/docs/community/roadmap): Macaw OpenVoice has completed all 9 milestones of the initial Product Requirements Document. The runtime is fully functional with STT, TTS, and full-duplex capabilities.
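To complement the Streaming STT guide and WebSocket Protocol reference above, here is a minimal client sketch for the /v1/realtime endpoint: binary audio frames go to the server and JSON transcription events come back. The host, port, audio format, chunk size, and shutdown handling are assumptions; the actual control messages and event schema are defined in the WebSocket Protocol reference.

```python
# Minimal sketch of a streaming STT client for /v1/realtime, assuming a local
# Macaw server. Host, port, audio format, chunk size, and shutdown handling are
# illustrative assumptions; see the WebSocket Protocol reference for the real
# control messages and event schema.
import asyncio
import json

import websockets  # third-party client library, not part of Macaw


async def stream_audio(path: str) -> None:
    uri = "ws://localhost:8000/v1/realtime"  # assumed local endpoint
    async with websockets.connect(uri) as ws:

        async def receive_events() -> None:
            # Transcription events arrive as JSON text frames.
            async for message in ws:
                print(json.loads(message))

        receiver = asyncio.create_task(receive_events())

        # Stream the file in small binary chunks to simulate a live microphone feed.
        with open(path, "rb") as f:
            while chunk := f.read(3200):      # chunk size is an arbitrary choice
                await ws.send(chunk)          # binary audio frame
                await asyncio.sleep(0.1)

        # A real client would signal end-of-audio per the protocol; here we just
        # wait briefly for trailing events before closing the connection.
        await asyncio.sleep(2.0)
        receiver.cancel()


asyncio.run(stream_audio("meeting.pcm"))
```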