# Supported Models

## Speaker Diarization Models
Speaker diarization answers the question "who spoke when?" by segmenting audio into speaker-attributed regions. This is essential for meeting transcription, call center analytics, and any multi-speaker scenario. This category also covers the related tasks of speaker verification (confirming a speaker's identity) and language identification.
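The "who spoke when" output of a diarization system is typically a list of speaker-attributed time segments. A minimal sketch (the `Turn` structure and speaker labels are illustrative, not any specific library's API):

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class Turn:
    """One speaker-attributed region of the audio."""
    start: float   # seconds from beginning of file
    end: float     # seconds from beginning of file
    speaker: str   # anonymous label, e.g. "SPEAKER_00"

def talk_time(turns):
    """Aggregate total speech time per speaker, in seconds."""
    totals = defaultdict(float)
    for t in turns:
        totals[t.speaker] += t.end - t.start
    return dict(totals)

# Toy diarization result for a two-speaker exchange
turns = [
    Turn(0.0, 4.2, "SPEAKER_00"),
    Turn(4.2, 6.0, "SPEAKER_01"),
    Turn(6.0, 9.5, "SPEAKER_00"),
]
print(talk_time(turns))
```

Downstream consumers (meeting minutes, call analytics) usually join these segments with STT transcripts by timestamp overlap.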
### Models
| Status | Model | Type | Parameters | Languages | Hardware | License | HuggingFace |
|---|---|---|---|---|---|---|---|
| Planned | speaker-diarization-3.1 | Diarization Pipeline | — | Multilingual | GPU recommended | MIT | Link |
| Planned | Sortformer 4spk-v2 | Streaming Diarization | — | English | GPU | CC-BY-4.0 | Link |
| Planned | TitaNet-Large | Speaker Embedding | ~23M | English | GPU | CC-BY-4.0 | Link |
| Planned | lang-id-voxlingua107 | Language Identification | ~21M | 107 languages | CPU / GPU | Apache-2.0 | Link |
### Choosing a model
- Full diarization pipeline: pyannote speaker-diarization-3.1 is the most widely adopted open-source diarization system. MIT-licensed but requires accepting access conditions on HuggingFace (gated model).
- NVIDIA NeMo ecosystem: Sortformer uses a novel sort-based approach for up to 4 speakers. TitaNet provides speaker embeddings for verification and clustering.
- Language identification: lang-id-voxlingua107 identifies the spoken language from 107 options — useful for routing audio to the correct STT model.
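Speaker-embedding models such as TitaNet map an utterance to a fixed-size vector; verification and clustering then reduce to comparing vectors, most commonly with cosine similarity. A minimal sketch of that comparison step (the embeddings and the 0.7 threshold are illustrative assumptions; real thresholds are tuned on held-out data):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def same_speaker(emb_a, emb_b, threshold=0.7):
    """Verification decision: accept if similarity clears the threshold."""
    return cosine_similarity(emb_a, emb_b) >= threshold

# Toy 2-D "embeddings"; a model like TitaNet emits much higher-dimensional vectors
enrolled = [1.0, 0.0]
probe_same = [0.9, 0.1]
probe_other = [0.0, 1.0]
print(same_speaker(enrolled, probe_same))   # similar direction: accept
print(same_speaker(enrolled, probe_other))  # orthogonal: reject
```

The same similarity function drives clustering-based diarization: embeddings from short audio windows are grouped, and each cluster becomes one speaker.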
### Gated model access
pyannote/speaker-diarization-3.1 requires accepting user conditions on HuggingFace and sharing contact details before the model files become accessible. Downloads must then be authenticated with a HuggingFace access token, which may create friction in fully automated CI/CD pipelines or air-gapped environments.
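In automated environments the usual pattern is to inject the token as a secret and read it from the environment at load time. A hedged sketch, assuming the commonly used `HF_TOKEN` / `HUGGING_FACE_HUB_TOKEN` environment variables and the pyannote 3.x `Pipeline.from_pretrained` entry point:

```python
import os

def resolve_hf_token():
    """Read a HuggingFace access token from the environment.

    Gated repos such as pyannote/speaker-diarization-3.1 refuse anonymous
    downloads, so CI jobs must inject a token as a secret rather than
    prompting interactively.
    """
    token = os.environ.get("HF_TOKEN") or os.environ.get("HUGGING_FACE_HUB_TOKEN")
    if not token:
        raise RuntimeError(
            "No HuggingFace token found; accept the model's user conditions "
            "on huggingface.co, then set HF_TOKEN"
        )
    return token

if os.environ.get("HF_TOKEN"):
    # Only attempt the gated download when a token is actually present.
    from pyannote.audio import Pipeline  # requires `pip install pyannote.audio`

    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",
        use_auth_token=resolve_hf_token(),
    )
```

For fully air-gapped deployments, an alternative is to download the model files once on a connected machine and point the loader at the local copy, avoiding the token requirement at runtime.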