Voice Cloning Models

Voice cloning models generate speech that mimics a target speaker's voice characteristics from a short reference audio sample. These models enable personalized TTS experiences, custom voice creation, and speaker-adaptive synthesis. Some models require just 3-6 seconds of reference audio.

Consent and rights

Even when model weights are permissively licensed, voice cloning carries consent and intellectual property risks. Production deployments should implement consent verification, tenant isolation, and usage policies to prevent unauthorized voice replication.
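As a minimal sketch of what consent verification with tenant isolation might look like, the check below refuses a cloning request unless a matching, unrevoked, unexpired consent record exists. Everything here is hypothetical and for illustration only: `ConsentRecord`, `consent_denial_reason`, and the field names are assumptions, not part of any listed model or library.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ConsentRecord:
    """Hypothetical record of a speaker's permission to clone their voice."""
    tenant_id: str                          # tenant that collected the consent
    voice_id: str                           # identifier of the enrolled voice
    granted_at: datetime
    expires_at: Optional[datetime] = None   # None means no expiry was set
    revoked: bool = False


def consent_denial_reason(
    record: Optional[ConsentRecord],
    tenant_id: str,
    voice_id: str,
    now: Optional[datetime] = None,
) -> Optional[str]:
    """Return None if cloning may proceed, otherwise a short denial reason."""
    now = now or datetime.now(timezone.utc)
    if record is None:
        return "no consent record"
    # Tenant isolation: a record never authorizes another tenant or voice.
    if record.tenant_id != tenant_id or record.voice_id != voice_id:
        return "record belongs to another tenant or voice"
    if record.revoked:
        return "consent revoked"
    if record.expires_at is not None and now >= record.expires_at:
        return "consent expired"
    return None
```

A service would call a check like this before passing reference audio to any cloning model, and log the denial reason when it is not `None`.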

Models

Status	Model	Parameters	Reference Audio	Languages	Hardware	License	HuggingFace
Planned	Qwen3-TTS-1.7B-CustomVoice	1.7B	3 seconds	10 (zh, en, ja, ko, de, fr, ru, pt, es, it)	GPU recommended	Apache-2.0	Link
Planned	OpenVoice V2	—	~10 seconds	6+ (en, es, fr, zh, ja, ko)	GPU recommended	MIT	Link
Planned	Index-TTS	—	5-10 seconds	zh, en	GPU	Apache-2.0	Link
Planned	OpenF5-TTS	336M	~5 seconds	Multilingual	GPU	Apache-2.0	Link
Planned	CosyVoice 2	0.5B	3-10 seconds	zh, en, ja, ko, yue	GPU recommended	Apache-2.0	Link

Choosing a model

Fastest cloning: Qwen3-TTS-1.7B-CustomVoice needs only 3 seconds of reference audio and supports streaming generation across 10 languages.
Open-source leader: OpenVoice V2 by MyShell is MIT-licensed and supports cross-lingual voice cloning with tone color control.
Chinese focus: CosyVoice 2 from Alibaba (FunAudioLLM) offers high-quality multilingual synthesis with strong Chinese language support.
Flow-matching TTS: OpenF5-TTS uses the flow-matching architecture of F5-TTS for natural-sounding zero-shot voice cloning.
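The guidance above can be expressed as a small filter over the table's data. This is only a sketch: `VoiceModel` and `candidates` are hypothetical names, approximate reference lengths (such as "~10 seconds") are treated as exact minimums, and OpenF5-TTS is left out because the table lists its languages only as "Multilingual". It does not load or call any model.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class VoiceModel:
    name: str
    min_reference_seconds: float   # shortest reference clip listed in the table
    languages: FrozenSet[str]
    license: str


# Values copied from the Models table; ranges use their lower bound.
MODELS = [
    VoiceModel("Qwen3-TTS-1.7B-CustomVoice", 3,
               frozenset({"zh", "en", "ja", "ko", "de", "fr", "ru", "pt", "es", "it"}),
               "Apache-2.0"),
    VoiceModel("OpenVoice V2", 10,
               frozenset({"en", "es", "fr", "zh", "ja", "ko"}), "MIT"),
    VoiceModel("Index-TTS", 5, frozenset({"zh", "en"}), "Apache-2.0"),
    VoiceModel("CosyVoice 2", 3,
               frozenset({"zh", "en", "ja", "ko", "yue"}), "Apache-2.0"),
]


def candidates(language: str, reference_seconds: float) -> List[str]:
    """Names of models that list the language and accept a clip this short,
    ordered by how little reference audio they need."""
    matches = [
        m for m in MODELS
        if language in m.languages and m.min_reference_seconds <= reference_seconds
    ]
    # sorted() is stable, so ties keep their table order.
    return [m.name for m in sorted(matches, key=lambda m: m.min_reference_seconds)]


print(candidates("yue", 4))   # Cantonese with a 4-second clip
print(candidates("fr", 12))   # French with a 12-second clip
```

Filtering on language and clip length first, then weighing license and hardware by hand, keeps the choice tied to the table and avoids hard-coding a single default.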
