Macaw OpenVoice
Supported Models

TTS Models

Macaw's TTS catalog ranges from ultra-lightweight models (Kokoro-82M, 82M parameters) to large LLM-based systems (Kimi-Audio-7B, 7B parameters). Models differ in synthesis quality, latency, voice control, and language support. All integrate through Macaw's unified engine interface with consistent audio output handling. Kokoro-82M is available today; the remaining models are planned.

Models

| Status | Model | Parameters | Latency | Streaming | Languages | Hardware | License | HuggingFace |
|---|---|---|---|---|---|---|---|---|
| Available | Kokoro-82M | 82M | | | 9 | CPU / GPU | Apache-2.0 | Link |
| Planned | Qwen3-TTS-12Hz-0.6B | 0.6B | ~97 ms E2E | Yes | 10 (zh, en, ja, ko, de, fr, ru, pt, es, it) | GPU recommended | Apache-2.0 | Link |
| Planned | Pocket-TTS | 100M | ~200 ms | Yes | English | CPU (2 cores) | CC-BY-4.0 | Link |
| Planned | NeuTTS-Air | 748M | ~100-200 ms | Yes | 4 (en, es, de, fr) | CPU / GPU | Apache-2.0 | Link |
| Planned | Orpheus-3B | 4B | ~100-200 ms | Yes | English (+ multilingual) | GPU | Apache-2.0 | Link |
| Planned | CosyVoice 3 | 0.5B | ~150 ms | Yes | 9 (zh, en, ja, ko, de, es, fr, it, ru) | GPU recommended | Apache-2.0 | Link |
| Planned | Chatterbox Turbo | 350M | sub-200 ms | No | 23 | GPU | MIT | Link |
| Planned | Kimi-Audio-7B | 7B | ~300 ms | Yes | zh, en | GPU (high VRAM) | MIT | Link |
| Planned | VibeVoice-Realtime-0.5B | 0.5B | ~300 ms | Yes | English (+ 9 experimental) | GPU | MIT | Link |
| Planned | Dia2-2B | 2B | | Yes | English | GPU | Apache-2.0 | Link |
| Planned | CSM-1B | ~1B | | No | English | GPU | Apache-2.0 | Link |
| Planned | parler-tts-mini-multilingual | 0.9B | | No | 8 (en, fr, es, pt, pl, de, it, nl) | GPU recommended | Apache-2.0 | Link |
| Planned | MegaTTS3 | 0.45B | | No | zh, en | GPU | Apache-2.0 | Link |
| Planned | OuteTTS-1.0-0.6B | 0.6B | | No | 14 | GPU | Apache-2.0 | Link |
| Planned | Zonos-v0.1 | 1.6B | 200-300 ms | No | 5+ (en, ja, zh, fr, de + Indic) | GPU | Apache-2.0 | Link |
| Planned | Bark | 300M | | No | 13 | GPU recommended | MIT | Link |
| Planned | LFM2.5-Audio-1.5B | 1.5B | <100 ms E2E | Yes (interleaved) | English | CPU (GGUF) / GPU | LFM Open v1.0 | Link |
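
The table can also be read as a small dataset to filter against deployment constraints. The following is a minimal sketch, not part of Macaw itself: the `TTSModel` record and the `pick` helper are hypothetical names introduced for illustration, only a few rows are transcribed, and the upper bound is used wherever the table gives a latency range.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TTSModel:
    """One row of the model table (hypothetical record type, not a Macaw API)."""
    name: str
    status: str                  # "Available" or "Planned"
    latency_ms: Optional[int]    # upper bound from the table; None if not listed
    streaming: Optional[bool]    # None if the table leaves the cell empty
    hardware: str


# A few rows transcribed from the table; values are taken as listed there.
CATALOG = [
    TTSModel("Kokoro-82M", "Available", None, None, "CPU / GPU"),
    TTSModel("Qwen3-TTS-12Hz-0.6B", "Planned", 97, True, "GPU recommended"),
    TTSModel("Pocket-TTS", "Planned", 200, True, "CPU (2 cores)"),
    TTSModel("NeuTTS-Air", "Planned", 200, True, "CPU / GPU"),
    TTSModel("Chatterbox Turbo", "Planned", 200, False, "GPU"),
    TTSModel("Kimi-Audio-7B", "Planned", 300, True, "GPU (high VRAM)"),
]


def pick(
    catalog: List[TTSModel],
    max_latency_ms: Optional[int] = None,
    need_streaming: bool = False,
    need_cpu: bool = False,
    status: Optional[str] = None,
) -> List[str]:
    """Return names of models meeting every constraint, lowest latency first."""
    hits = []
    for m in catalog:
        # A model with no listed latency cannot satisfy a latency budget.
        if max_latency_ms is not None and (
            m.latency_ms is None or m.latency_ms > max_latency_ms
        ):
            continue
        if need_streaming and m.streaming is not True:
            continue
        if need_cpu and "CPU" not in m.hardware:
            continue
        if status is not None and m.status != status:
            continue
        hits.append(m)
    # Models without a listed latency sort last; ties keep catalog order.
    hits.sort(key=lambda m: float("inf") if m.latency_ms is None else m.latency_ms)
    return [m.name for m in hits]


print(pick(CATALOG, max_latency_ms=200, need_streaming=True))
print(pick(CATALOG, status="Available"))
```

Treating an empty table cell as `None` rather than `False` keeps "not listed" distinct from "not supported", so a model such as Kokoro-82M is excluded from latency-bounded queries without being labeled non-streaming.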

Choosing a model

Lowest latency (streaming):

  • Qwen3-TTS-12Hz-0.6B: ~97 ms end-to-end latency with streaming generation. 10 languages.
  • NeuTTS-Air: ~100-200 ms streaming on-device TTS with voice cloning. Runs on CPU.
  • Orpheus-3B: ~100-200 ms streaming with emotion/prosody control.

Best for production today:

  • Kokoro-82M: ultra-lightweight at 82M parameters, runs on CPU. Available now via macaw pull.
  • Pocket-TTS: 100M parameters, 6x real-time on CPU, ~200 ms streaming latency. Planned; not yet available in Macaw.
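
To make the "6x real-time" figure concrete, here is a small worked example. It assumes the figure means synthesis runs six times faster than playback, which is the usual reading but is not spelled out on this page; the function names are hypothetical.

```python
def synthesis_time_s(audio_seconds: float, speedup: float) -> float:
    """Wall-clock seconds needed to synthesize `audio_seconds` of speech."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


def real_time_factor(speedup: float) -> float:
    """Real-time factor (RTF): compute time per second of audio; below 1 is faster than playback."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 / speedup


# Under the assumed reading, 12 s of speech at 6x real-time takes about 2 s to generate.
print(synthesis_time_s(12.0, 6.0))  # 2.0
```

Throughput and first-audio latency are separate measures: the ~200 ms figure describes how soon streaming audio begins, while the speedup describes how quickly the full utterance is produced.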

Most languages:

  • Chatterbox Turbo: 23 languages with voice cloning and sub-200 ms latency.
  • OuteTTS-1.0-0.6B: 14 trained languages, Apache-2.0 licensed.
  • Bark: 13 languages with non-speech audio generation (music, sound effects).

High quality:

  • Kimi-Audio-7B: large LLM-based TTS from Moonshot AI with streaming at ~300 ms.
  • Zonos-v0.1: 1.6B parameters, high expressiveness, trained on 200k+ hours.

Open-source ecosystem:

  • CosyVoice 3: Alibaba's streaming multilingual TTS with ~150 ms first-packet latency.
  • Dia2-2B: Nari Labs' streaming dialogue TTS model.

End-to-end audio (interleaved):

  • LFM2.5-Audio-1.5B: a single 1.5B model for ASR, TTS, and voice chat. English only, with 4 preset voices (US/UK male/female). CPU-friendly via GGUF.
