# Supported Models
## STT Models
Macaw supports a growing catalog of speech-to-text models spanning multiple architectures, from encoder-decoder models like Whisper to transducer-based streaming models like NeMo Parakeet-TDT and LLM-based ASR systems. All models integrate through Macaw's unified engine interface, which automatically adapts the inference pipeline to each model's architecture.
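To make the dispatch idea concrete, here is a minimal illustrative sketch of how an engine can route a model to a pipeline based on its declared architecture. All names below (`PIPELINES`, `select_pipeline`, the pipeline class names) are hypothetical and do not reflect Macaw's actual API.

```python
# Hypothetical sketch: mapping a model's architecture tag to a
# pipeline implementation, in the spirit of the dispatch described above.
PIPELINES = {
    "encoder-decoder": "SequenceToSequencePipeline",  # e.g. Whisper
    "transducer": "StreamingTransducerPipeline",      # e.g. Parakeet-TDT
    "llm-asr": "LLMDecodingPipeline",                 # e.g. Canary-Qwen
}

def select_pipeline(architecture: str) -> str:
    """Return the pipeline name registered for a model architecture."""
    try:
        return PIPELINES[architecture]
    except KeyError:
        raise ValueError(f"unsupported architecture: {architecture}")
```

A real engine would register pipeline classes rather than strings, but the lookup-plus-fallback shape is the core of architecture-based adaptation.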
### Models
| Status | Model | Parameters | WER (%) | RTFx | Streaming | Languages | Hardware | License | HuggingFace |
|---|---|---|---|---|---|---|---|---|---|
| Available | faster-whisper-large-v3 | 1.55B | 7.4 | — | Yes (chunked) | 100+ | GPU recommended | MIT | Link |
| Available | faster-whisper-medium | 769M | — | — | Yes (chunked) | 100+ | GPU recommended | MIT | Link |
| Available | faster-whisper-small | 244M | — | — | Yes (chunked) | 100+ | CPU / GPU | MIT | Link |
| Available | faster-whisper-tiny | 39M | — | — | Yes (chunked) | 100+ | CPU / GPU | MIT | Link |
| Available | distil-whisper-large-v3 | 756M | ~7.5 | 6x large-v3 | Yes (chunked) | English | GPU recommended | MIT | Link |
| Planned | Canary-Qwen-2.5B | 2.5B | 5.63 | 418x | — | English | GPU | CC-BY-4.0 | Link |
| Planned | Parakeet-TDT-1.1B | 1.1B | ~8.0 | >2,000x | Yes (native) | English | GPU ~4 GB | CC-BY-4.0 | Link |
| Planned | Whisper Large V3 Turbo | 809M | 7.75 | 216x | No | 99+ | GPU ~6 GB | MIT | Link |
| Planned | Voxtral-Mini-4B | 4B | — | — | Yes (native) | 13 | GPU ≥16 GB | Apache-2.0 | Link |
| Planned | Qwen3-ASR-0.6B | 0.6B | — | — | Yes (unified) | 30+ | GPU recommended | Apache-2.0 | Link |
| Planned | Qwen3-ASR-1.7B | 1.7B | — | — | Yes (unified) | 30+ | GPU recommended | Apache-2.0 | Link |
| Planned | Moonshine Streaming Medium | 245M | 6.65 | — | Yes (native) | English | CPU / GPU | MIT | Link |
| Planned | Granite Speech 3.3 8B | ~9B | 5.85 | — | No | English + multi-lang AST | GPU (high VRAM) | Apache-2.0 | Link |
| Planned | kyutai/stt-2.6b-en | 2.6B | — | — | — | English | GPU | — | Link |
| Planned | SenseVoice Small | 234M | — | — | No | 50+ | CPU / GPU | Custom | Link |
| Planned | distil-large-v3.5 | 756M | 7.10 | 1.5x Turbo | — | English | GPU recommended | MIT | Link |
| Planned | stt_pt_fastconformer | — | — | — | — | Portuguese | GPU recommended | CC-BY-4.0 | Link |
| Planned | Canary-1B-Flash | 883M | 1.48 ^1^ | — | — | 4 (en, de, fr, es) | GPU | CC-BY-4.0 | Link |
| Planned | Canary-180M-Flash | 182M | — | — | — | 4 (en, de, fr, es) | CPU / GPU | CC-BY-4.0 | Link |
| Planned | omniASR-LLM-7B | 7B + 1.2B | — | — | No | 1,600+ | GPU (high VRAM) | Apache-2.0 | Link |
| Planned | Phi-4-multimodal-instruct | 5.6B | 6.14 | — | No | 8 (en, zh, de, fr, it, ja, es, pt) | GPU (high VRAM) | MIT | Link |
| Planned | whisper-large-v3 | 1.55B | — | — | No | 100+ | GPU recommended | MIT | Link |
| Planned | whisper-large-v2 | 1.55B | — | — | No | 100+ | GPU recommended | MIT | Link |
| Planned | whisper-large | 1.55B | — | — | No | 100+ | GPU recommended | MIT | Link |
| Planned | whisper-medium | 769M | — | — | No | 100+ | GPU recommended | MIT | Link |
| Planned | whisper-small | 244M | — | — | No | 100+ | CPU / GPU | MIT | Link |
| Planned | whisper-tiny | 39M | — | — | No | 100+ | CPU / GPU | MIT | Link |
| Planned | whisper-base | 74M | — | — | No | 100+ | CPU / GPU | MIT | Link |
| Planned | whisper-medium.en | 769M | — | — | No | English | GPU recommended | MIT | Link |
| Planned | whisper-small.en | 244M | — | — | No | English | CPU / GPU | MIT | Link |
| Planned | whisper-tiny.en | 39M | — | — | No | English | CPU / GPU | MIT | Link |
| Planned | whisper-base.en | 74M | — | — | No | English | CPU / GPU | MIT | Link |
| Planned | LFM2.5-Audio-1.5B | 1.5B | 7.53 ^2^ | — | No | English | CPU (GGUF) / GPU | LFM Open v1.0 | Link |
### Choosing a model
**Best accuracy (English):**
- Canary-Qwen-2.5B — 5.63% WER, state-of-the-art on Open ASR benchmarks
- Granite Speech 3.3 8B — 5.85% WER, also supports multilingual audio-to-text translation
- Phi-4-multimodal-instruct — 6.14% WER, multimodal LLM with audio understanding (MIT)
**Best multilingual:**
- faster-whisper-large-v3 — 100+ languages with auto-detection (available now)
- Qwen3-ASR-1.7B — 30+ languages with unified streaming/offline inference
**Best speed:**
- Parakeet-TDT-1.1B — RTFx >2,000x, among the fastest open models
- Whisper Large V3 Turbo — RTFx 216x, with only a ~0.35-point WER gap versus large-v3 (7.75% vs 7.4%)
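The RTFx figures in the table are inverse real-time factors: seconds of audio transcribed per second of wall-clock compute, so higher is faster, and an RTFx above 1 means faster than real time. A quick sketch of the calculation:

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: audio duration / processing time.

    rtfx > 1 means the model transcribes faster than real time.
    """
    return audio_seconds / processing_seconds

# Example: transcribing a 1-hour recording in 9 seconds of compute.
print(rtfx(3600, 9))  # 400.0
```

By this measure, an RTFx of 2,000x (as reported for Parakeet-TDT-1.1B) corresponds to roughly 2,000 seconds of audio per second of compute on the benchmark hardware.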
**Best for edge/low-resource:**
- faster-whisper-tiny — 39M params, runs on CPU (available now)
- Canary-180M-Flash — 182M params, designed for resource-constrained deployment
- Moonshine Streaming Medium — Designed for on-device streaming ASR
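Selecting for edge deployment boils down to filtering the catalog by hardware constraint and taking the smallest fit. A minimal sketch, using rows copied from the table above (`smallest_cpu_model` and the tuple layout are illustrative, not part of Macaw):

```python
# (model name, parameter count, runs on CPU) — values from the table above.
CATALOG = [
    ("faster-whisper-tiny", 39_000_000, True),
    ("Canary-180M-Flash", 182_000_000, True),
    ("Moonshine Streaming Medium", 245_000_000, True),
    ("faster-whisper-large-v3", 1_550_000_000, False),
]

def smallest_cpu_model(catalog):
    """Return the name of the smallest model that can run on CPU."""
    cpu_ok = [(name, params) for name, params, on_cpu in catalog if on_cpu]
    return min(cpu_ok, key=lambda row: row[1])[0]
```

The same filter-then-minimize pattern extends to other constraints, e.g. restricting to streaming-capable models or to a required language.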
**Real-time streaming:**
- Voxtral-Mini-4B — Native streaming with <500ms latency (requires vLLM)
**End-to-end audio (interleaved):**
- LFM2.5-Audio-1.5B — Single model for ASR+TTS+voice chat. Sub-100ms end-to-end latency. CPU-friendly via GGUF (English only).