Macaw OpenVoice
Supported Models

Voice Cloning Models

Voice cloning models generate speech that mimics a target speaker's voice characteristics from a short reference audio sample. These models enable personalized TTS experiences, custom voice creation, and speaker-adaptive synthesis. Some models require just 3-6 seconds of reference audio.
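Because each model expects a minimum amount of reference audio, it can help to check a clip's length before submitting it. The following is a minimal sketch using only Python's standard `wave` module. The helper names (`wav_duration_seconds`, `is_long_enough`, `make_silent_wav`) and the 3-second default are illustrative assumptions, not part of any Macaw OpenVoice API, and the check covers uncompressed WAV input only.

```python
import io
import wave


def wav_duration_seconds(source) -> float:
    """Return the duration in seconds of a WAV file (path or binary file object)."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_long_enough(source, min_seconds: float = 3.0) -> bool:
    """Check a reference clip against a minimum length (assumed default: 3 s)."""
    return wav_duration_seconds(source) >= min_seconds


def make_silent_wav(seconds: float, sample_rate: int = 16000) -> io.BytesIO:
    """Build an in-memory mono 16-bit silent WAV, used here only for demonstration."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    clip = make_silent_wav(4.0)
    print(is_long_enough(clip, min_seconds=3.0))
```

Duration is only a lower bound on usability: a clip that is long enough can still be too noisy or contain several speakers, which this check does not detect.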

Consent and rights

Even when model weights are permissively licensed, voice cloning carries consent and intellectual property risks. Production deployments should implement consent verification, tenant isolation, and usage policies to prevent unauthorized voice replication.
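One way to picture consent verification combined with tenant isolation is a registry that is consulted before any cloning request runs. The sketch below is hypothetical: `ConsentRecord`, `ConsentRegistry`, and `may_clone` are names invented for illustration and do not describe how Macaw OpenVoice implements these controls. Records are keyed by tenant and speaker, so consent granted to one tenant never authorizes another.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class ConsentRecord:
    """Hypothetical record that a speaker consented to cloning for one tenant."""
    tenant_id: str
    speaker_id: str
    granted_at: datetime
    expires_at: Optional[datetime] = None  # None means no expiry was set


class ConsentRegistry:
    """In-memory registry keyed by (tenant, speaker) to keep tenants isolated."""

    def __init__(self) -> None:
        self._records: Dict[Tuple[str, str], ConsentRecord] = {}

    def grant(self, record: ConsentRecord) -> None:
        self._records[(record.tenant_id, record.speaker_id)] = record

    def revoke(self, tenant_id: str, speaker_id: str) -> None:
        self._records.pop((tenant_id, speaker_id), None)

    def may_clone(self, tenant_id: str, speaker_id: str,
                  now: Optional[datetime] = None) -> bool:
        """Allow cloning only with an unexpired record for this exact tenant."""
        record = self._records.get((tenant_id, speaker_id))
        if record is None:
            return False
        now = now or datetime.now(timezone.utc)
        return record.expires_at is None or now < record.expires_at


if __name__ == "__main__":
    registry = ConsentRegistry()
    granted = datetime.now(timezone.utc)
    registry.grant(ConsentRecord("tenant-a", "speaker-1", granted,
                                 expires_at=granted + timedelta(days=30)))
    print(registry.may_clone("tenant-a", "speaker-1"))
    print(registry.may_clone("tenant-b", "speaker-1"))
```

A production system would persist these records and log every decision. The in-memory dictionary only illustrates the lookup shape.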

Models

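| Status  | Model                      | Parameters | Reference Audio | Languages                                    | Hardware        | License    | HuggingFace |
|---------|----------------------------|------------|-----------------|----------------------------------------------|-----------------|------------|-------------|
| Planned | Qwen3-TTS-1.7B-CustomVoice | 1.7B       | 3 seconds       | 10 (zh, en, ja, ko, de, fr, ru, pt, es, it)  | GPU recommended | Apache-2.0 | Link        |
| Planned | OpenVoice V2               | -          | ~10 seconds     | 6+ (en, es, fr, zh, ja, ko)                  | GPU recommended | MIT        | Link        |
| Planned | Index-TTS                  | -          | 5-10 seconds    | zh, en                                       | GPU             | Apache-2.0 | Link        |
| Planned | OpenF5-TTS                 | 336M       | ~5 seconds      | Multilingual                                 | GPU             | Apache-2.0 | Link        |
| Planned | CosyVoice 2                | 0.5B       | 3-10 seconds    | zh, en, ja, ko, yue                          | GPU recommended | Apache-2.0 | Link        |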

Choosing a model

  • Fastest cloning: Qwen3-TTS-1.7B-CustomVoice needs only 3 seconds of reference audio, supports streaming generation, and covers 10 languages.
  • Open-source leader: OpenVoice V2 by MyShell is MIT-licensed and supports cross-lingual voice cloning with tone color control.
  • Chinese focus: CosyVoice 2 from Alibaba (FunAudioLLM) offers high-quality multilingual synthesis with strong Chinese language support.
  • Flow-matching TTS: OpenF5-TTS uses the F5-TTS flow-matching architecture for natural-sounding zero-shot voice cloning.
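The selection criteria above can also be applied mechanically. The sketch below encodes the reference-audio and language columns from the models table and filters on them. The `CloningModel` type and `candidates` function are hypothetical helpers, not a Macaw OpenVoice API. It assumes the lower bound of each reference-audio range as the minimum, and marks OpenF5-TTS as having an unlisted language set because the table says only "Multilingual".

```python
from typing import FrozenSet, List, NamedTuple, Optional


class CloningModel(NamedTuple):
    name: str
    min_reference_seconds: float          # lower bound of the table's range
    languages: Optional[FrozenSet[str]]   # None: table does not enumerate them
    license: str


# Values transcribed from the models table. OpenVoice V2 is listed as "6+",
# so it may support more languages than the six named here.
MODELS = [
    CloningModel("Qwen3-TTS-1.7B-CustomVoice", 3.0,
                 frozenset({"zh", "en", "ja", "ko", "de", "fr", "ru", "pt", "es", "it"}),
                 "Apache-2.0"),
    CloningModel("OpenVoice V2", 10.0,
                 frozenset({"en", "es", "fr", "zh", "ja", "ko"}), "MIT"),
    CloningModel("Index-TTS", 5.0, frozenset({"zh", "en"}), "Apache-2.0"),
    CloningModel("OpenF5-TTS", 5.0, None, "Apache-2.0"),
    CloningModel("CosyVoice 2", 3.0,
                 frozenset({"zh", "en", "ja", "ko", "yue"}), "Apache-2.0"),
]


def candidates(language: str, reference_seconds: float,
               include_unlisted: bool = False) -> List[str]:
    """Return names of models that fit the clip length and target language."""
    matches = []
    for model in MODELS:
        if reference_seconds < model.min_reference_seconds:
            continue  # clip is shorter than the model's stated minimum
        if model.languages is None:
            if include_unlisted:
                matches.append(model.name)
        elif language in model.languages:
            matches.append(model.name)
    return matches


if __name__ == "__main__":
    print(candidates("ja", reference_seconds=4.0))
```

Passing `include_unlisted=True` keeps models whose language coverage the table does not spell out, which is useful when the caller intends to verify support separately.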
