# Supported Models

## VAD & Turn Detection Models
Voice Activity Detection (VAD) and turn detection are critical components in any real-time voice pipeline. VAD determines when someone is speaking, while semantic turn detection determines when someone has finished their thought — a much harder problem that requires understanding conversational context.
Macaw uses VAD internally to segment audio before sending it to STT engines. Turn detection can be layered on top to enable natural conversational AI experiences.
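Macaw's internal segmentation is handled by Silero VAD, but the basic idea is easy to illustrate. Below is a minimal, dependency-free sketch of energy-based frame segmentation: frames whose RMS energy clears a threshold are marked as speech, and a short "hangover" of silent frames keeps natural micro-pauses inside one segment. The function name, frame size, and threshold here are illustrative, not Macaw parameters.

```python
def segment_speech(samples, frame_size=480, threshold=0.01, hangover=10):
    """Minimal energy-based VAD sketch: mark frames whose RMS energy
    exceeds a threshold as speech, then merge adjacent speech frames
    into (start, end) sample segments. A 'hangover' of silent frames
    keeps short pauses inside a single segment."""
    segments = []
    start = None      # sample index where the current speech segment began
    silent = 0        # consecutive silent frames since last speech frame
    n_frames = len(samples) // frame_size
    for i in range(n_frames):
        frame = samples[i * frame_size:(i + 1) * frame_size]
        rms = (sum(x * x for x in frame) / frame_size) ** 0.5
        if rms >= threshold:
            if start is None:
                start = i * frame_size
            silent = 0
        elif start is not None:
            silent += 1
            if silent > hangover:
                # Close the segment at the frame where silence began.
                segments.append((start, (i - hangover) * frame_size))
                start = None
                silent = 0
    if start is not None:
        segments.append((start, n_frames * frame_size))
    return segments
```

A neural VAD like Silero replaces the RMS-vs-threshold test with a per-frame speech probability from a small model, which is far more robust to background noise, but the surrounding segmentation logic looks much the same.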
## Models
| Status | Model | Type | Latency | Parameters | Hardware | License | HuggingFace |
|---|---|---|---|---|---|---|---|
| Available | Silero VAD | Energy + Neural VAD | ~2 ms/frame | ~2M | CPU | MIT | Link |
| Planned | smart-turn-v2 | Semantic End-of-Turn | ~12 ms (L40S) | 94.8M | CPU / GPU | BSD-2-Clause | Link |
| Planned | parakeet-realtime-eou-120m | End-of-Utterance | — | 120M | GPU | CC-BY-4.0 | Link |
| Planned | FireRedChat-pVAD | Speaker-aware VAD | — | — | CPU / GPU | Apache-2.0 | Link |
## Choosing a model
- Default VAD: Silero VAD is already integrated and handles voice activity detection for all STT engines. No additional setup needed.
- Conversational AI: Add smart-turn-v2 for semantic end-of-turn detection — it understands when a user has finished speaking, not just when they paused. Supports 14 languages.
- Low-latency streaming: parakeet-realtime-eou-120m is designed for NVIDIA NeMo pipelines with real-time end-of-utterance detection.
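The layering described above — VAD detects a pause, then a semantic model decides whether the user is actually done — can be sketched as a small decision loop. Everything here is hypothetical glue code, not Macaw's API: `classify_end_of_turn` stands in for a semantic end-of-turn model such as smart-turn-v2, and the event-tuple shape is invented for the example.

```python
def await_end_of_turn(vad_events, classify_end_of_turn,
                      pause_ms=200, timeout_ms=3000):
    """Sketch of layering semantic turn detection over VAD.

    vad_events: iterable of (timestamp_ms, transcript_so_far, is_speech)
    classify_end_of_turn: callable returning True when the transcript
        reads as a completed thought (stand-in for smart-turn-v2).

    Commits the turn when VAD reports a pause AND the classifier agrees,
    or when the pause exceeds a hard timeout (fallback for speech that
    trails off without a clean semantic ending).
    """
    pause_start = None
    transcript = ""
    for ts, text, is_speech in vad_events:
        transcript = text
        if is_speech:
            pause_start = None          # user resumed; keep listening
            continue
        if pause_start is None:
            pause_start = ts            # pause just began
        pause = ts - pause_start
        if pause >= pause_ms and classify_end_of_turn(transcript):
            return transcript           # semantic end of turn
        if pause >= timeout_ms:
            return transcript           # fallback: long silence
    return transcript
```

The key behavior is the first branch: a mere pause is not enough to commit the turn, so mid-sentence hesitations ("I want, uh...") keep the microphone open instead of cutting the user off.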