Macaw OpenVoice

VAD & Turn Detection Models

Voice Activity Detection (VAD) and turn detection are critical components in any real-time voice pipeline. VAD determines when someone is speaking, while semantic turn detection determines when someone has finished their thought — a much harder problem that requires understanding conversational context.

Macaw uses VAD internally to segment audio before sending it to STT engines. Turn detection can be layered on top to enable natural conversational AI experiences.
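As a rough illustration of what "segmenting audio with VAD" means, the sketch below groups per-frame speech probabilities into speech segments. It is not Macaw's implementation: the function name `segment_speech`, the `Segment` type, the 32 ms frame length, and both thresholds are assumptions chosen for the example, and the probabilities would come from whatever VAD model is in use.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Segment:
    start_ms: int  # start of the first speech frame in the segment
    end_ms: int    # end of the last speech frame in the segment


def segment_speech(
    probs: Iterable[float],
    frame_ms: int = 32,         # assumed frame length, not a Macaw setting
    threshold: float = 0.5,     # assumed speech/non-speech cutoff
    min_silence_ms: int = 96,   # assumed silence needed to close a segment
) -> List[Segment]:
    """Group per-frame speech probabilities into speech segments.

    A segment opens on the first frame at or above `threshold` and closes
    once the silence after the last speech frame reaches `min_silence_ms`.
    """
    segments: List[Segment] = []
    start: Optional[int] = None  # start time of the open segment, if any
    last_speech_end = 0          # end time of the most recent speech frame
    for i, p in enumerate(probs):
        t0, t1 = i * frame_ms, (i + 1) * frame_ms
        if p >= threshold:
            if start is None:
                start = t0
            last_speech_end = t1
        elif start is not None and t1 - last_speech_end >= min_silence_ms:
            # Enough trailing silence: close the segment at the last speech frame.
            segments.append(Segment(start, last_speech_end))
            start = None
    if start is not None:
        # Audio ended while a segment was still open.
        segments.append(Segment(start, last_speech_end))
    return segments
```

Short pauses (below `min_silence_ms`) stay inside one segment, so a brief hesitation does not split an utterance before it reaches the STT engine.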

Models

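| Status    | Model                      | Type                 | Latency        | Parameters | Hardware  | License      | HuggingFace |
|-----------|----------------------------|----------------------|----------------|------------|-----------|--------------|-------------|
| Available | Silero VAD                 | Energy + Neural VAD  | ~2 ms/frame    | ~2M        | CPU       | MIT          | Link        |
| Planned   | smart-turn-v2              | Semantic End-of-Turn | ~12 ms (L40S)  | 94.8M      | CPU / GPU | BSD-2-Clause | Link        |
| Planned   | parakeet-realtime-eou-120m | End-of-Utterance     | -              | 120M       | GPU       | CC-BY-4.0    | Link        |
| Planned   | FireRedChat-pVAD           | Speaker-aware VAD    | -              | -          | CPU / GPU | Apache-2.0   | Link        |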

Choosing a model

  • Default VAD: Silero VAD is already integrated and handles voice activity detection for all STT engines. No additional setup needed.
  • Conversational AI: smart-turn-v2 (planned) adds semantic end-of-turn detection. It predicts when a user has finished speaking, not just when they have paused. Supports 14 languages.
  • Low-latency streaming: parakeet-realtime-eou-120m (planned) provides real-time end-of-utterance detection and is designed for NVIDIA NeMo pipelines.
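To make the difference between plain VAD and semantic turn detection concrete, here is a minimal sketch of how a turn model could be layered on top of VAD silence. Everything in it is hypothetical: `should_end_turn`, the `turn_complete_prob` callable, and the thresholds are illustrative names and values, not part of Macaw or of any of the models listed above.

```python
from typing import Callable


def should_end_turn(
    silence_ms: int,
    turn_complete_prob: Callable[[], float],
    min_silence_ms: int = 200,    # assumed: ignore pauses shorter than this
    max_silence_ms: int = 3000,   # assumed: hard fallback on long silence
    threshold: float = 0.5,       # assumed: cutoff on the turn model's score
) -> bool:
    """Decide whether the user's turn is over.

    silence_ms: trailing silence reported by the VAD.
    turn_complete_prob: hypothetical callable returning the turn model's
        probability that the utterance so far is complete.
    """
    if silence_ms < min_silence_ms:
        return False  # still speaking, or only a very short pause
    if silence_ms >= max_silence_ms:
        return True   # long silence ends the turn regardless of the model
    # In between, defer to the semantic model; it is only called here,
    # so it runs on pauses rather than on every audio frame.
    return turn_complete_prob() >= threshold
```

For example, with 500 ms of silence the decision depends on the model: `should_end_turn(500, lambda: 0.9)` ends the turn, while `should_end_turn(500, lambda: 0.1)` keeps listening. A VAD-only pipeline would have to treat both cases the same way.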
