# Supported Models
## TTS Models
Macaw supports a diverse set of TTS models ranging from ultra-lightweight (82M parameters) to large LLM-based systems. Models vary in synthesis quality, latency, voice control, and language support. All integrate through Macaw's unified engine interface with consistent audio output handling.
### Models
| Status | Model | Parameters | Latency | Streaming | Languages | Hardware | License | HuggingFace |
|---|---|---|---|---|---|---|---|---|
| Available | Kokoro-82M | 82M | — | — | 9 | CPU / GPU | Apache-2.0 | Link |
| Planned | Qwen3-TTS-12Hz-0.6B | 0.6B | ~97 ms E2E | Yes | 10 (zh, en, ja, ko, de, fr, ru, pt, es, it) | GPU recommended | Apache-2.0 | Link |
| Planned | Pocket-TTS | 100M | ~200 ms | Yes | English | CPU (2 cores) | CC-BY-4.0 | Link |
| Planned | NeuTTS-Air | 748M | ~100-200 ms | Yes | 4 (en, es, de, fr) | CPU / GPU | Apache-2.0 | Link |
| Planned | Orpheus-3B | 4B | ~100-200 ms | Yes | English (+ multilingual) | GPU | Apache-2.0 | Link |
| Planned | CosyVoice 3 | 0.5B | ~150 ms | Yes | 9 (zh, en, ja, ko, de, es, fr, it, ru) | GPU recommended | Apache-2.0 | Link |
| Planned | Chatterbox Turbo | 350M | sub-200 ms | No | 23 | GPU | MIT | Link |
| Planned | Kimi-Audio-7B | 7B | ~300 ms | Yes | zh, en | GPU (high VRAM) | MIT | Link |
| Planned | VibeVoice-Realtime-0.5B | 0.5B | ~300 ms | Yes | English (+ 9 experimental) | GPU | MIT | Link |
| Planned | Dia2-2B | 2B | — | Yes | English | GPU | Apache-2.0 | Link |
| Planned | CSM-1B | ~1B | — | No | English | GPU | Apache-2.0 | Link |
| Planned | parler-tts-mini-multilingual | 0.9B | — | No | 8 (en, fr, es, pt, pl, de, it, nl) | GPU recommended | Apache-2.0 | Link |
| Planned | MegaTTS3 | 0.45B | — | No | zh, en | GPU | Apache-2.0 | Link |
| Planned | OuteTTS-1.0-0.6B | 0.6B | — | No | 14 | GPU | Apache-2.0 | Link |
| Planned | Zonos-v0.1 | 1.6B | 200-300 ms | No | 5+ (en, ja, zh, fr, de + Indic) | GPU | Apache-2.0 | Link |
| Planned | Bark | 300M | — | No | 13 | GPU recommended | MIT | Link |
| Planned | LFM2.5-Audio-1.5B | 1.5B | <100 ms E2E | Yes (interleaved) | English | CPU (GGUF) / GPU | LFM Open v1.0 | Link |
## Choosing a model
Lowest latency (streaming):
- Qwen3-TTS-12Hz-0.6B — ~97 ms end-to-end latency with streaming generation. 10 languages.
- NeuTTS-Air — ~100-200 ms streaming on-device TTS with voice cloning. Runs on CPU.
- Orpheus-3B — ~100-200 ms streaming with emotion/prosody control.
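The latency figures above mix first-packet and end-to-end numbers; what streaming buys is a low time-to-first-audio regardless of utterance length. A minimal sketch with illustrative numbers (not benchmarks of any model listed) makes the difference concrete:

```python
def chunk_ready_times(audio_s: float, chunk_s: float, rtf: float):
    """Simulate a streaming synthesizer: yield the wall-clock time at which
    each audio chunk becomes available. `rtf` is the real-time factor
    (seconds of compute per second of audio produced)."""
    elapsed, emitted = 0.0, 0.0
    while emitted < audio_s:
        dur = min(chunk_s, audio_s - emitted)
        elapsed += dur * rtf
        emitted += dur
        yield elapsed

# A 10 s utterance at RTF 0.5 (i.e. 2x real time), streamed in 0.5 s chunks:
times = list(chunk_ready_times(10.0, 0.5, 0.5))
first_audio_streaming = times[0]   # 0.25 s: playback can begin here
first_audio_batch = 10.0 * 0.5     # 5.0 s: non-streaming waits for the whole clip
print(first_audio_streaming, first_audio_batch)
```

Total compute is identical in both cases; streaming only changes when the first packet reaches the listener, which is why it dominates the "lowest latency" picks above.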
Best for production today:
- Kokoro-82M — Ultra-lightweight at 82M params, runs on CPU. Available now via `macaw pull`.
- Pocket-TTS — 100M params, 6x real-time on CPU, ~200 ms streaming latency.
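The "6x real-time" figure for Pocket-TTS converts directly into synthesis wall-clock time; a quick sketch of that arithmetic (the function name is ours, not a Macaw API):

```python
def synthesis_seconds(audio_seconds: float, speedup: float) -> float:
    """Wall-clock compute time for a model running at `speedup`x real time
    (audio seconds produced per second of compute)."""
    return audio_seconds / speedup

# At 6x real time, a 12 s utterance costs 2 s of CPU compute.
print(synthesis_seconds(12.0, 6.0))  # 2.0
```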
Most languages:
- Chatterbox Turbo — 23 languages with voice cloning and sub-200 ms latency.
- OuteTTS-1.0-0.6B — 14 trained languages, Apache-2.0 licensed.
- Bark — 13 languages with non-speech audio generation (music, sound effects).
High quality:
- Kimi-Audio-7B — Large LLM-based TTS from Moonshot AI with streaming at ~300 ms.
- Zonos-v0.1 — 1.6B params, high expressiveness, trained on 200k+ hours.
Open-source ecosystem:
- CosyVoice 3 — Alibaba's streaming multilingual TTS with ~150 ms first-packet latency.
- Dia2-2B — Nari Labs streaming dialogue TTS model.
End-to-end audio (interleaved):
- LFM2.5-Audio-1.5B — Single 1.5B model for ASR+TTS+voice chat. 4 preset voices (US/UK male/female). CPU-friendly via GGUF (English only).