Kokoro TTS

Kokoro is a lightweight neural text-to-speech engine with 82 million parameters. It supports 9 languages and multiple voices, and produces high-quality 24 kHz audio. Kokoro is the default TTS engine in Macaw OpenVoice, used for full-duplex voice interactions over WebSocket and for REST speech synthesis.

Installation

```bash
pip install macaw-openvoice[kokoro]
```

This installs `kokoro>=0.1,<1.0` as an optional dependency.

Model

| Property | Value |
| --- | --- |
| Catalog name | kokoro-v1 |
| HuggingFace repo | hexgrad/Kokoro-82M |
| Parameters | 82M |
| Memory | 512 MB |
| GPU | Recommended |
| Load time | ~3 seconds |
| Output sample rate | 24,000 Hz |
| Output format | 16-bit PCM |
| Chunk size | 4,096 bytes (~85 ms at 24 kHz) |
| API version | Kokoro v0.9.4 |

Pull the model:

```bash
macaw pull kokoro-v1
```

Languages

Kokoro supports 9 languages, identified by a single-character prefix in the voice name:

| Prefix | Language | Example Voice |
| --- | --- | --- |
| a | English (American) | af_heart |
| b | English (British) | bf_emma |
| e | Spanish | ef_dora |
| f | French | ff_siwis |
| h | Hindi | hf_alpha |
| i | Italian | if_sara |
| j | Japanese | jf_alpha |
| p | Portuguese | pf_dora |
| z | Chinese | zf_xiaobei |

The language is selected automatically based on the voice prefix. When loading the model, the lang_code in engine_config sets the default pipeline language.

Voices

Naming Convention

Kokoro voices follow a strict naming pattern:

```
[language][gender][_name]
```

| Position | Meaning | Values |
| --- | --- | --- |
| 1st character | Language prefix | a, b, e, f, h, i, j, p, z |
| 2nd character | Gender | f = female, m = male |
| Rest | Voice name | Unique identifier (e.g., _heart, _emma) |

Examples:

  • af_heart — American English, female, "heart"
  • bm_george — British English, male, "george"
  • pf_dora — Portuguese, female, "dora"
  • jf_alpha — Japanese, female, "alpha"
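
This convention is easy to decode programmatically. A minimal sketch (the helper below is illustrative, not part of the Macaw API):

```python
# Hypothetical helper: decode a Kokoro voice name per the convention above.
LANGUAGES = {
    "a": "English (American)", "b": "English (British)", "e": "Spanish",
    "f": "French", "h": "Hindi", "i": "Italian", "j": "Japanese",
    "p": "Portuguese", "z": "Chinese",
}

def parse_voice(voice: str) -> dict:
    return {
        "language": LANGUAGES[voice[0]],                    # 1st char: language prefix
        "gender": "female" if voice[1] == "f" else "male",  # 2nd char: gender
        "name": voice[2:].lstrip("_"),                      # rest: unique identifier
    }

print(parse_voice("bm_george"))
# {'language': 'English (British)', 'gender': 'male', 'name': 'george'}
```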

Default Voice

The default voice is af_heart (American English, female). This is used when:

  • voice is set to "default" in the API request
  • No voice is specified

Voice Resolution

When a voice is requested, Kokoro resolves it in this order:

  1. "default" → uses the configured default_voice (default: af_heart)
  2. Simple name (e.g., "af_heart") → looks for <voices_dir>/af_heart.pt
  3. Absolute path or .pt extension → uses as-is

Voice files are .pt (PyTorch) files stored in the voices/ subdirectory of the model path.
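
A minimal sketch of this resolution order (the function and argument names are illustrative assumptions, not Macaw internals):

```python
from pathlib import Path

def resolve_voice(voice: str, voices_dir: Path, default_voice: str = "af_heart") -> Path:
    # 1. "default" resolves to the configured default_voice
    if voice == "default":
        voice = default_voice
    # 2. A simple name maps to <voices_dir>/<name>.pt
    if not voice.endswith(".pt") and not Path(voice).is_absolute():
        return voices_dir / f"{voice}.pt"
    # 3. Absolute paths and explicit .pt files are used as-is
    return Path(voice)
```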

Voice Discovery

The backend scans <model_path>/voices/*.pt to discover available voices. Each .pt file becomes a selectable voice:

List available voices:

```bash
ls ~/.macaw/models/kokoro-v1/voices/
# af_heart.pt  af_sky.pt  am_adam.pt  bf_emma.pt  ...
```
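
The same scan can be sketched in Python (illustrative only; the backend's actual discovery code may differ):

```python
from pathlib import Path

def discover_voices(model_path: Path) -> list[str]:
    # Every voices/*.pt file becomes a selectable voice name
    return sorted(p.stem for p in (model_path / "voices").glob("*.pt"))
```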

Usage Examples

REST API — Speech Synthesis

Generate speech (WAV):

```bash
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "kokoro-v1",
    "input": "Hello! Welcome to Macaw OpenVoice.",
    "voice": "af_heart"
  }' \
  --output speech.wav
```

Generate speech (raw PCM):

```bash
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "kokoro-v1",
    "input": "Olá! Bem-vindo ao Macaw OpenVoice.",
    "voice": "pf_dora",
    "response_format": "pcm"
  }' \
  --output speech.raw
```

Adjust speed:

```bash
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "kokoro-v1",
    "input": "Speaking slowly for clarity.",
    "voice": "af_heart",
    "speed": 0.8
  }' \
  --output slow.wav
```

Full-Duplex (WebSocket)

In a full-duplex WebSocket session, TTS is triggered via the tts.speak command:

Full-duplex voice assistant:

```python
import asyncio
import json
import websockets

async def voice_assistant():
    uri = "ws://localhost:8000/v1/realtime"
    async with websockets.connect(uri) as ws:
        # Configure session with STT + TTS models
        await ws.send(json.dumps({
            "type": "session.configure",
            "model": "faster-whisper-large-v3",
            "tts_model": "kokoro-v1",
            "tts_voice": "af_heart"
        }))

        # When the user says something and you get a final transcript...
        # Send the TTS response:
        await ws.send(json.dumps({
            "type": "tts.speak",
            "text": "I heard you! Let me help with that."
        }))

        # Receive TTS audio as binary frames
        while True:
            msg = await ws.recv()
            if isinstance(msg, bytes):
                # Binary frame = TTS audio (16-bit PCM, 24 kHz)
                play_audio(msg)  # play_audio() is your playback callback (not shown)
            else:
                event = json.loads(msg)
                if event["type"] == "tts.speaking_start":
                    print("TTS started")
                elif event["type"] == "tts.speaking_end":
                    print("TTS finished")
                    break

asyncio.run(voice_assistant())
```
Mute-on-speak

During TTS playback, the STT pipeline automatically mutes (discards incoming audio frames). This prevents the system from transcribing its own speech output. Unmute is guaranteed via try/finally — even if TTS crashes, the microphone is restored.
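
A minimal sketch of that guarantee (method and helper names such as mute(), unmute(), and send_audio() are assumptions for illustration):

```python
async def speak_with_mute(stt_pipeline, tts_engine, text: str) -> None:
    stt_pipeline.mute()  # discard incoming mic frames during playback
    try:
        async for chunk in tts_engine.synthesize(text):
            await send_audio(chunk)  # hypothetical transport helper
    finally:
        # Runs even if synthesis raises: the microphone is always restored
        stt_pipeline.unmute()
```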

Python SDK

Direct synthesis:

```python
import asyncio
import httpx

async def synthesize():
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "http://localhost:8000/v1/audio/speech",
            json={
                "model": "kokoro-v1",
                "input": "Macaw OpenVoice converts text to natural speech.",
                "voice": "af_heart",
                "speed": 1.0,
            },
        )
    with open("output.wav", "wb") as f:
        f.write(response.content)

asyncio.run(synthesize())
```

Engine Configuration

engine_config in macaw.yaml:

```yaml
engine_config:
  device: "auto"            # "auto" (→ cpu), "cpu", "cuda"
  default_voice: "af_heart" # Default voice when "default" is requested
  sample_rate: 24000        # Output sample rate (fixed at 24kHz)
  lang_code: "a"            # Default pipeline language prefix
```

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| device | string | "auto" | Inference device. "auto" maps to "cpu" |
| default_voice | string | "af_heart" | Voice used when "default" is requested |
| sample_rate | int | 24000 | Output sample rate (always 24 kHz) |
| lang_code | string | "a" | Default language prefix for the KPipeline |

Streaming Behavior

Kokoro synthesizes text in a streaming fashion using AsyncIterator[bytes]:

  1. The full text is passed to Kokoro's KPipeline
  2. KPipeline processes text in segments, yielding (graphemes, phonemes, audio) tuples
  3. Audio arrays are concatenated and converted to 16-bit PCM
  4. The PCM data is chunked into 4,096-byte segments (~85ms of audio at 24kHz)
  5. Chunks are yielded one at a time for low time-to-first-byte (TTFB)

```
Text → KPipeline → [segment audio] → float32→PCM16 → 4,096-byte chunks → Client
```
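
A hedged sketch of steps 3–5, the float32 → PCM16 conversion and fixed-size chunking (names are illustrative, not Macaw internals):

```python
from typing import AsyncIterator

import numpy as np

CHUNK_SIZE = 4096  # bytes: 4096 / 2 bytes per sample / 24000 Hz ≈ 85.3 ms

def float32_to_pcm16(audio: np.ndarray) -> bytes:
    # Clip to [-1, 1], scale to the int16 range, emit little-endian samples
    return (np.clip(audio, -1.0, 1.0) * 32767).astype("<i2").tobytes()

async def chunk_pcm(pcm: bytes) -> AsyncIterator[bytes]:
    # Yield fixed-size chunks so the client hears audio early (low TTFB)
    for i in range(0, len(pcm), CHUNK_SIZE):
        yield pcm[i : i + CHUNK_SIZE]
```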

Speed Control

The speed parameter adjusts synthesis speed:

| Value | Effect |
| --- | --- |
| 0.25 | 4x slower (minimum) |
| 1.0 | Normal speed (default) |
| 2.0 | 2x faster |
| 4.0 | 4x faster (maximum) |

Response Formats

The POST /v1/audio/speech endpoint supports two output formats:

| Format | Content-Type | Description |
| --- | --- | --- |
| wav (default) | audio/wav | WAV file with header (44-byte header + PCM data) |
| pcm | audio/pcm | Raw 16-bit PCM, little-endian, mono, 24 kHz |
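
If you request pcm but later need a playable file, you can wrap the raw bytes in a WAV container with the standard library; a minimal sketch using the parameters from the table above:

```python
import wave

def pcm_to_wav(pcm_path: str, wav_path: str) -> None:
    with open(pcm_path, "rb") as f:
        pcm = f.read()
    with wave.open(wav_path, "wb") as wav:
        wav.setnchannels(1)      # mono
        wav.setsampwidth(2)      # 16-bit samples
        wav.setframerate(24000)  # 24 kHz
        wav.writeframes(pcm)

pcm_to_wav("speech.raw", "speech.wav")
```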

Manifest Reference

macaw.yaml (kokoro-v1):

```yaml
name: kokoro-v1
version: "1.0.0"
engine: kokoro
type: tts
description: "Kokoro TTS - neural text-to-speech"

capabilities:
  streaming: true
  languages: ["en", "pt", "ja"]

resources:
  memory_mb: 512
  gpu_required: false
  gpu_recommended: true
  load_time_seconds: 3

engine_config:
  device: "auto"
  default_voice: "af_heart"
  sample_rate: 24000
  lang_code: "a"
```

Key Behaviors

  • TTS is stateless per request. Each tts.speak or POST /v1/audio/speech is independent. The Session Manager is not used for TTS.
  • New tts.speak cancels the previous. If a tts.speak command arrives while another is in progress, the previous one is cancelled automatically.
  • Binary WebSocket frames are directional. Server→client binary frames are always TTS audio. Client→server binary frames are always STT audio. No ambiguity.
  • TTS worker is a separate subprocess. Runs on a different gRPC port (default 50052 vs 50051 for STT). Crash does not affect the runtime.