Macaw OpenVoice
Supported Models

STT Models

Macaw supports a growing catalog of speech-to-text models spanning multiple architectures: encoder-decoder models like Whisper, CTC-based streaming models like NeMo Parakeet, and LLM-based ASR systems. All models integrate through Macaw's unified engine interface, which adapts the pipeline automatically to each model's architecture.
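The architecture-based dispatch can be sketched roughly as follows. This is an illustrative sketch only; `ModelSpec`, `select_pipeline`, and the pipeline labels are hypothetical placeholders, not Macaw's actual API:

```python
# Hypothetical sketch of architecture-based pipeline dispatch (not Macaw's real API).
from dataclasses import dataclass


@dataclass
class ModelSpec:
    name: str
    architecture: str  # e.g. "encoder-decoder", "ctc", "llm"


# Illustrative mapping from model architecture to an inference pipeline.
PIPELINES = {
    "encoder-decoder": "chunked streaming pipeline",
    "ctc": "native streaming pipeline",
    "llm": "offline batch pipeline",
}


def select_pipeline(spec: ModelSpec) -> str:
    """Pick an inference pipeline based on the model's architecture."""
    try:
        return PIPELINES[spec.architecture]
    except KeyError:
        raise ValueError(f"unsupported architecture: {spec.architecture}")


print(select_pipeline(ModelSpec("faster-whisper-small", "encoder-decoder")))
```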

Models

| Status | Model | Parameters | WER (%) | RTFx | Streaming | Languages | Hardware | License | HuggingFace |
|---|---|---|---|---|---|---|---|---|---|
| Available | faster-whisper-large-v3 | 1.55B | 7.4 | | Yes (chunked) | 100+ | GPU recommended | MIT | Link |
| Available | faster-whisper-medium | 769M | | | Yes (chunked) | 100+ | GPU recommended | MIT | Link |
| Available | faster-whisper-small | 244M | | | Yes (chunked) | 100+ | CPU / GPU | MIT | Link |
| Available | faster-whisper-tiny | 39M | | | Yes (chunked) | 100+ | CPU / GPU | MIT | Link |
| Available | distil-whisper-large-v3 | 756M | ~7.5 | 6x large-v3 | Yes (chunked) | English | GPU recommended | MIT | Link |
| Planned | Canary-Qwen-2.5B | 2.5B | 5.63 | 418x | | English | GPU | CC-BY-4.0 | Link |
| Planned | Parakeet-TDT-1.1B | 1.1B | ~8.0 | >2,000x | Yes (native) | English | GPU ~4 GB | CC-BY-4.0 | Link |
| Planned | Whisper Large V3 Turbo | 809M | 7.75 | 216x | No | 99+ | GPU ~6 GB | MIT | Link |
| Planned | Voxtral-Mini-4B | 4B | | | Yes (native) | 13 | GPU ≥16 GB | Apache-2.0 | Link |
| Planned | Qwen3-ASR-0.6B | 0.6B | | | Yes (unified) | 30+ | GPU recommended | Apache-2.0 | Link |
| Planned | Qwen3-ASR-1.7B | 1.7B | | | Yes (unified) | 30+ | GPU recommended | Apache-2.0 | Link |
| Planned | Moonshine Streaming Medium | 245M | 6.65 | | Yes (native) | English | CPU / GPU | MIT | Link |
| Planned | Granite Speech 3.3 8B | ~9B | 5.85 | | No | English + multi-lang AST | GPU (high VRAM) | Apache-2.0 | Link |
| Planned | kyutai/stt-2.6b-en | 2.6B | | | | English | GPU | | Link |
| Planned | SenseVoice Small | 234M | | | No | 50+ | CPU / GPU | Custom | Link |
| Planned | distil-large-v3.5 | 756M | 7.10 | 1.5x Turbo | | English | GPU recommended | MIT | Link |
| Planned | stt_pt_fastconformer | | | | | Portuguese | GPU recommended | CC-BY-4.0 | Link |
| Planned | Canary-1B-Flash | 883M | 1.48 ^1^ | | | 4 (en, de, fr, es) | GPU | CC-BY-4.0 | Link |
| Planned | Canary-180M-Flash | 182M | | | | 4 (en, de, fr, es) | CPU / GPU | CC-BY-4.0 | Link |
| Planned | omniASR-LLM-7B | 7B + 1.2B | | | No | 1,600+ | GPU (high VRAM) | Apache-2.0 | Link |
| Planned | Phi-4-multimodal-instruct | 5.6B | 6.14 | | No | 8 (en, zh, de, fr, it, ja, es, pt) | GPU (high VRAM) | MIT | Link |
| Planned | whisper-large-v3 | 1.55B | | | No | 100+ | GPU recommended | MIT | Link |
| Planned | whisper-large-v2 | 1.55B | | | No | 100+ | GPU recommended | MIT | Link |
| Planned | whisper-large | 1.55B | | | No | 100+ | GPU recommended | MIT | Link |
| Planned | whisper-medium | 769M | | | No | 100+ | GPU recommended | MIT | Link |
| Planned | whisper-small | 244M | | | No | 100+ | CPU / GPU | MIT | Link |
| Planned | whisper-tiny | 39M | | | No | 100+ | CPU / GPU | MIT | Link |
| Planned | whisper-base | 74M | | | No | 100+ | CPU / GPU | MIT | Link |
| Planned | whisper-medium.en | 769M | | | No | English | GPU recommended | MIT | Link |
| Planned | whisper-small.en | 244M | | | No | English | CPU / GPU | MIT | Link |
| Planned | whisper-tiny.en | 39M | | | No | English | CPU / GPU | MIT | Link |
| Planned | whisper-base.en | 74M | | | No | English | CPU / GPU | MIT | Link |
| Planned | LFM2.5-Audio-1.5B | 1.5B | 7.53 ^2^ | | No | English | CPU (GGUF) / GPU | LFM Open v1.0 | Link |
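For reference when comparing the WER column: word error rate is the word-level edit distance (insertions + deletions + substitutions) divided by the number of words in the reference transcript. A minimal, self-contained sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: d[i][j] = edit distance between ref[:i] and hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / len(ref)


print(wer("the cat sat", "the cat sat down"))  # 1 insertion over 3 ref words
```

Published WER figures also depend on text normalization and the benchmark test set, so numbers from different model cards are only roughly comparable.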

Choosing a model

Best accuracy (English):

  • Canary-Qwen-2.5B — 5.63% WER, state-of-the-art on Open ASR benchmarks
  • Granite Speech 3.3 8B — 5.85% WER, also supports multilingual audio-to-text translation
  • Phi-4-multimodal-instruct — 6.14% WER, multimodal LLM with audio understanding (MIT)

Best multilingual:

  • faster-whisper-large-v3 — 100+ languages with auto-detection (available now)
  • Qwen3-ASR-1.7B — 30+ languages with unified streaming/offline inference

Best speed:

  • Parakeet-TDT-1.1B — RTFx >2,000x, among the fastest open models
  • Whisper Large V3 Turbo — 6x faster than large-v3 with only ~0.35% WER gap

Best for edge/low-resource:

  • faster-whisper-tiny — 39M params, runs on CPU (available now)
  • Canary-180M-Flash — 180M params, designed for resource-constrained deployment
  • Moonshine Streaming Medium — Designed for on-device streaming ASR

Real-time streaming:

  • Voxtral-Mini-4B — Native streaming with <500ms latency (requires vLLM)

End-to-end audio (interleaved):

  • LFM2.5-Audio-1.5B — Single model for ASR+TTS+voice chat. Sub-100ms end-to-end latency. CPU-friendly via GGUF (English only).
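The guidance above can be expressed as a simple constraint filter over the model table. The helper below is an illustrative sketch, not part of Macaw; the `MODELS` entries are hand-copied from the table above:

```python
# Illustrative sketch: pick the smallest model satisfying a hardware constraint.
# Data copied from the model table; this helper is not part of Macaw's API.
MODELS = [
    {"name": "faster-whisper-tiny", "params_m": 39, "cpu_ok": True},
    {"name": "faster-whisper-large-v3", "params_m": 1550, "cpu_ok": False},
    {"name": "Moonshine Streaming Medium", "params_m": 245, "cpu_ok": True},
]


def pick_for_cpu(models: list[dict]) -> str:
    """Return the smallest model (by parameter count) that can run on CPU."""
    cpu_models = [m for m in models if m["cpu_ok"]]
    if not cpu_models:
        raise ValueError("no CPU-capable model available")
    return min(cpu_models, key=lambda m: m["params_m"])["name"]


print(pick_for_cpu(MODELS))  # faster-whisper-tiny
```

In practice the choice also weighs accuracy and language coverage, so treat this as a starting filter rather than a full decision procedure.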
