Macaw OpenVoice
Supported Models

Forced Alignment Models

Forced alignment models take audio and its corresponding transcript, then predict precise timestamps for each word or phoneme. This is essential for subtitle generation, audio editing, pronunciation assessment, and synchronizing audio with text in production pipelines.
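To make the subtitle use case concrete, the sketch below turns word-level timestamps into SubRip (SRT) cues. It does not call any aligner: the `WordTiming` record, the `words_to_srt` helper, and the `max_words` grouping rule are hypothetical names chosen for illustration, and the input is assumed to be whatever per-word start and end times (in seconds) an alignment model returns.

```python
from dataclasses import dataclass


@dataclass
class WordTiming:
    """Hypothetical record for one aligned word (times in seconds)."""
    word: str
    start: float
    end: float


def format_srt_time(seconds: float) -> str:
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: list[WordTiming], max_words: int = 7) -> str:
    """Group consecutive words into cues of at most max_words words each."""
    cues = []
    for i in range(0, len(words), max_words):
        group = words[i:i + max_words]
        index = len(cues) + 1
        start = format_srt_time(group[0].start)  # cue starts with its first word
        end = format_srt_time(group[-1].end)     # and ends with its last word
        text = " ".join(w.word for w in group)
        cues.append(f"{index}\n{start} --> {end}\n{text}\n")
    # Cues are separated by a blank line.
    return "\n".join(cues)
```

Grouping by a fixed word count is the simplest possible rule; a production pipeline would more likely break cues on punctuation or pauses between words.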

Models

| Status | Model | Parameters | Languages | Hardware | License | HuggingFace |
|---|---|---|---|---|---|---|
| Planned | Qwen3-ForcedAligner-0.6B | 0.6B | 11 (zh, en, yue, fr, de, it, ja, ko, pt, ru, es) | GPU recommended | Apache-2.0 | Link |
| Planned | SoundChoice G2P | ~129M | English | CPU / GPU | Apache-2.0 | Link |

Choosing a model

  • Multilingual timestamp prediction: Qwen3-ForcedAligner-0.6B supports 11 languages and can align audio segments of up to ~5 minutes. It uses the same qwen-asr runtime as the Qwen3 ASR models.
  • English phoneme conversion: SoundChoice G2P (Grapheme-to-Phoneme) converts text to phoneme sequences, useful for TTS preprocessing and pronunciation dictionaries.
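Because the aligner handles segments of up to ~5 minutes, longer recordings have to be split before alignment. The sketch below shows one way to do that, assuming the transcript is already available as coarse, time-stamped segments (for example from an ASR pass). The `Segment` record, the `pack_segments` helper, and the 300-second constant are illustrative assumptions, not part of any Macaw OpenVoice or qwen-asr API.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical coarse transcript segment (times in seconds)."""
    start: float
    end: float
    text: str


# Assumed reading of the "~5 minutes" limit; adjust to the real model limit.
MAX_SPAN_SECONDS = 300.0


def pack_segments(segments: list[Segment],
                  max_span: float = MAX_SPAN_SECONDS) -> list[list[Segment]]:
    """Greedily pack consecutive segments into batches spanning <= max_span.

    A single segment longer than max_span still gets its own batch and
    would need to be split further by other means.
    """
    batches: list[list[Segment]] = []
    current: list[Segment] = []
    for seg in segments:
        # Close the batch if adding this segment would exceed the span limit.
        if current and seg.end - current[0].start > max_span:
            batches.append(current)
            current = []
        current.append(seg)
    if current:
        batches.append(current)
    return batches
```

Each batch can then be aligned on its own, using the audio between the first segment's start and the last segment's end. Timestamps returned for a batch are relative to that audio slice, so the batch's start time has to be added back to place them on the original timeline.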
