Macaw OpenVoice
Supported Models

Speaker Diarization Models

Speaker diarization answers the question "who spoke when?" — segmenting audio into speaker-attributed regions. This is essential for meeting transcription, call center analytics, and any multi-speaker scenario. Related models include speaker verification (confirming identity) and language identification.
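A diarizer's output is a set of speaker-attributed time segments, commonly exchanged in the RTTM format. The sketch below illustrates the concept with invented segment data (the timings and speaker labels are placeholders, not real model output):

```python
# Minimal illustration of diarization output: speaker-attributed time
# segments serialized to RTTM, the common diarization interchange format.
# Segment data is invented for illustration.

from dataclasses import dataclass

@dataclass
class Segment:
    start: float     # seconds from start of audio
    duration: float  # seconds
    speaker: str     # anonymous label assigned by the diarizer

def to_rttm(segments, file_id="meeting"):
    """Format segments as RTTM SPEAKER lines, one speech region per row."""
    return "\n".join(
        f"SPEAKER {file_id} 1 {s.start:.3f} {s.duration:.3f} "
        f"<NA> <NA> {s.speaker} <NA> <NA>"
        for s in segments
    )

segments = [
    Segment(0.0, 3.2, "spk_0"),
    Segment(3.2, 1.9, "spk_1"),
    Segment(5.1, 4.0, "spk_0"),
]
print(to_rttm(segments))
```

Note that speaker labels are anonymous cluster identifiers ("who spoke when", not "who is speaking"); mapping them to real identities requires a separate speaker verification step.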

Models

| Status | Model | Type | Parameters | Languages | Hardware | License | HuggingFace |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Planned | speaker-diarization-3.1 | Diarization Pipeline | | Multilingual | GPU recommended | MIT | Link |
| Planned | Sortformer 4spk-v2 | Streaming Diarization | | English | GPU | CC-BY-4.0 | Link |
| Planned | TitaNet-Large | Speaker Embedding | ~23M | English | GPU | CC-BY-4.0 | Link |
| Planned | lang-id-voxlingua107 | Language Identification | ~21M | 107 languages | CPU / GPU | Apache-2.0 | Link |

Choosing a model

  • Full diarization pipeline: pyannote speaker-diarization-3.1 is the most widely adopted open-source diarization system. MIT-licensed but requires accepting access conditions on HuggingFace (gated model).
  • NVIDIA NeMo ecosystem: Sortformer uses a novel sort-based approach for up to 4 speakers. TitaNet provides speaker embeddings for verification and clustering.
  • Language identification: lang-id-voxlingua107 identifies the spoken language from 107 options — useful for routing audio to the correct STT model.
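The language-ID-based routing mentioned above can be sketched as a simple lookup with a multilingual fallback. The model identifiers here are hypothetical placeholders, not actual Macaw OpenVoice configuration:

```python
# Hypothetical routing table: map a language code (as a LID model such as
# lang-id-voxlingua107 might emit) to an STT model identifier.
# All model names below are placeholders for illustration.

STT_BY_LANGUAGE = {
    "en": "stt-english-large",
    "de": "stt-german-large",
    "fr": "stt-french-large",
}
DEFAULT_STT = "stt-multilingual"

def route_stt(lang_code: str) -> str:
    """Pick an STT model for the detected language, falling back to a multilingual model."""
    return STT_BY_LANGUAGE.get(lang_code, DEFAULT_STT)

print(route_stt("de"))  # language-specific model
print(route_stt("sw"))  # no dedicated model -> multilingual fallback
```

A fallback entry matters in practice: with 107 possible languages, most deployments will only have dedicated STT models for a handful of them.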

Gated model access

pyannote/speaker-diarization-3.1 requires accepting user conditions on HuggingFace and sharing contact details to access model files. This may create friction in fully automated CI/CD pipelines or air-gapped environments.
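One way to reduce that friction is to fail fast when no access token is available, before any model download is attempted. A minimal sketch, assuming the token is supplied via an `HF_TOKEN` environment variable (a common convention; the exact token-passing mechanism depends on your loading code):

```python
# Fail-fast check for a Hugging Face access token before attempting to
# load a gated model in CI. Reading HF_TOKEN is a common convention,
# not something mandated by this project.

import os

def require_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the access token, or raise early with an actionable message."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"{env_var} is not set. Gated models such as "
            "pyannote/speaker-diarization-3.1 require an access token: "
            "accept the model's conditions on HuggingFace, then export a token."
        )
    return token
```

Surfacing a clear error at pipeline startup is far easier to debug than an authentication failure buried inside a model-download stack trace.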
