# Supported Models
## STT Models
Macaw supports a growing catalog of speech-to-text models spanning multiple architectures, from encoder-decoder models like Whisper to transducer-based streaming models like NeMo Parakeet-TDT and LLM-based ASR systems. All models integrate through Macaw's unified engine interface, which automatically adapts the inference pipeline to each model's architecture.
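To make the dispatch idea concrete, here is a minimal illustrative sketch of how an engine can route a model to a pipeline based on its declared architecture. All names below (`PIPELINES`, `select_pipeline`, the pipeline class names) are hypothetical and do not reflect Macaw's actual API.

```python
# Hypothetical sketch: mapping a model's architecture tag to a
# pipeline implementation, in the spirit of the dispatch described above.
PIPELINES = {
    "encoder-decoder": "SequenceToSequencePipeline",  # e.g. Whisper
    "transducer": "StreamingTransducerPipeline",      # e.g. Parakeet-TDT
    "llm-asr": "LLMDecodingPipeline",                 # e.g. Canary-Qwen
}

def select_pipeline(architecture: str) -> str:
    """Return the pipeline name registered for a model architecture."""
    try:
        return PIPELINES[architecture]
    except KeyError:
        raise ValueError(f"unsupported architecture: {architecture}")
```

A real engine would register pipeline classes rather than strings, but the lookup-plus-fallback shape is the core of architecture-based adaptation.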
### Models
| Status | Model | Parameters | WER (%) | RTFx | Streaming | Languages | Hardware | License | HuggingFace |
|---|---|---|---|---|---|---|---|---|---|
| Available | faster-whisper-large-v3 | 1.55B | 7.4 | — | Yes (chunked) | 100+ | GPU recommended | MIT | Link |
| Available | faster-whisper-medium | 769M | — | — | Yes (chunked) | 100+ | GPU recommended | MIT | Link |
| Available | faster-whisper-small | 244M | — | — | Yes (chunked) | 100+ | CPU / GPU | MIT | Link |
| Available | faster-whisper-tiny | 39M | — | — | Yes (chunked) | 100+ | CPU / GPU | MIT | Link |
| Available | distil-whisper-large-v3 | 756M | ~7.5 | 6x large-v3 | Yes (chunked) | English | GPU recommended | MIT | Link |
| Planned | Canary-Qwen-2.5B | 2.5B | 5.63 | 418x | — | English | GPU | CC-BY-4.0 | Link |
| Planned | Parakeet-TDT-1.1B | 1.1B | ~8.0 | >2,000x | Yes (native) | English | GPU ~4 GB | CC-BY-4.0 | Link |
| Planned | Whisper Large V3 Turbo | 809M | 7.75 | 216x | No | 99+ | GPU ~6 GB | MIT | Link |
| Planned | Voxtral-Mini-4B | 4B | — | — | Yes (native) | 13 | GPU ≥16 GB | Apache-2.0 | Link |
| Planned | Qwen3-ASR-0.6B | 0.6B | — | — | Yes (unified) | 30+ | GPU recommended | Apache-2.0 | Link |
| Planned | Qwen3-ASR-1.7B | 1.7B | — | — | Yes (unified) | 30+ | GPU recommended | Apache-2.0 | Link |
| Planned | Moonshine Streaming Medium | 245M | 6.65 | — | Yes (native) | English | CPU / GPU | MIT | Link |
| Planned | Granite Speech 3.3 8B | ~9B | 5.85 | — | No | English + multi-lang AST | GPU (high VRAM) | Apache-2.0 | Link |
| Planned | kyutai/stt-2.6b-en | 2.6B | — | — | — | English | GPU | — | Link |
| Planned | SenseVoice Small | 234M | — | — | No | 50+ | CPU / GPU | Custom | Link |
| Planned | distil-large-v3.5 | 756M | 7.10 | 1.5x Turbo | — | English | GPU recommended | MIT | Link |
| Planned | stt_pt_fastconformer | — | — | — | — | Portuguese | GPU recommended | CC-BY-4.0 | Link |
| Planned | Canary-1B-Flash | 883M | 1.48 ^1^ | — | — | 4 (en, de, fr, es) | GPU | CC-BY-4.0 | Link |
| Planned | Canary-180M-Flash | 182M | — | — | — | 4 (en, de, fr, es) | CPU / GPU | CC-BY-4.0 | Link |
| Planned | omniASR-LLM-7B | 7B + 1.2B | — | — | No | 1,600+ | GPU (high VRAM) | Apache-2.0 | Link |
| Planned | Phi-4-multimodal-instruct | 5.6B | 6.14 | — | No | 8 (en, zh, de, fr, it, ja, es, pt) | GPU (high VRAM) | MIT | Link |
| Planned | whisper-large-v3 | 1.55B | — | — | No | 100+ | GPU recommended | MIT | Link |
| Planned | whisper-large-v2 | 1.55B | — | — | No | 100+ | GPU recommended | MIT | Link |
| Planned | whisper-large | 1.55B | — | — | No | 100+ | GPU recommended | MIT | Link |
| Planned | whisper-medium | 769M | — | — | No | 100+ | GPU recommended | MIT | Link |
| Planned | whisper-small | 244M | — | — | No | 100+ | CPU / GPU | MIT | Link |
| Planned | whisper-tiny | 39M | — | — | No | 100+ | CPU / GPU | MIT | Link |
| Planned | whisper-base | 74M | — | — | No | 100+ | CPU / GPU | MIT | Link |
| Planned | whisper-medium.en | 769M | — | — | No | English | GPU recommended | MIT | Link |
| Planned | whisper-small.en | 244M | — | — | No | English | CPU / GPU | MIT | Link |
| Planned | whisper-tiny.en | 39M | — | — | No | English | CPU / GPU | MIT | Link |
| Planned | whisper-base.en | 74M | — | — | No | English | CPU / GPU | MIT | Link |
| Planned | LFM2.5-Audio-1.5B | 1.5B | 7.53 ^2^ | — | No | English | CPU (GGUF) / GPU | LFM Open v1.0 | Link |
### Choosing a model
**Best accuracy (English):**
- Canary-Qwen-2.5B — 5.63% WER, state-of-the-art on Open ASR benchmarks
- Granite Speech 3.3 8B — 5.85% WER, also supports multilingual audio-to-text translation
- Phi-4-multimodal-instruct — 6.14% WER, multimodal LLM with audio understanding (MIT)
**Best multilingual:**
- faster-whisper-large-v3 — 100+ languages with auto-detection (available now)
- Qwen3-ASR-1.7B — 30+ languages with unified streaming/offline inference
**Best speed:**
- Parakeet-TDT-1.1B — RTFx >2,000x, among the fastest open models
- Whisper Large V3 Turbo — RTFx 216x, with only a ~0.35-point WER gap versus large-v3 (7.75% vs 7.4%)
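The RTFx figures in the table are inverse real-time factors: seconds of audio transcribed per second of wall-clock compute, so higher is faster, and an RTFx above 1 means faster than real time. A quick sketch of the calculation:

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: audio duration / processing time.

    rtfx > 1 means the model transcribes faster than real time.
    """
    return audio_seconds / processing_seconds

# Example: transcribing a 1-hour recording in 9 seconds of compute.
print(rtfx(3600, 9))  # 400.0
```

By this measure, an RTFx of 2,000x (as reported for Parakeet-TDT-1.1B) corresponds to roughly 2,000 seconds of audio per second of compute on the benchmark hardware.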
**Best for edge/low-resource:**
- faster-whisper-tiny — 39M params, runs on CPU (available now)
- Canary-180M-Flash — 182M params, designed for resource-constrained deployment
- Moonshine Streaming Medium — Designed for on-device streaming ASR
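Selecting for edge deployment boils down to filtering the catalog by hardware constraint and taking the smallest fit. A minimal sketch, using rows copied from the table above (`smallest_cpu_model` and the tuple layout are illustrative, not part of Macaw):

```python
# (model name, parameter count, runs on CPU) — values from the table above.
CATALOG = [
    ("faster-whisper-tiny", 39_000_000, True),
    ("Canary-180M-Flash", 182_000_000, True),
    ("Moonshine Streaming Medium", 245_000_000, True),
    ("faster-whisper-large-v3", 1_550_000_000, False),
]

def smallest_cpu_model(catalog):
    """Return the name of the smallest model that can run on CPU."""
    cpu_ok = [(name, params) for name, params, on_cpu in catalog if on_cpu]
    return min(cpu_ok, key=lambda row: row[1])[0]
```

The same filter-then-minimize pattern extends to other constraints, e.g. restricting to streaming-capable models or to a required language.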
**Real-time streaming:**
- Voxtral-Mini-4B — Native streaming with <500ms latency (requires vLLM)
**End-to-end audio (interleaved):**
- LFM2.5-Audio-1.5B — Single model for ASR+TTS+voice chat. Sub-100ms end-to-end latency. CPU-friendly via GGUF (English only).