Macaw OpenVoice
Guides

API Examples

Complete request/response examples for every Macaw OpenVoice endpoint. All outputs shown here are real responses captured from a running instance with faster-whisper-tiny (STT) and kokoro-v1 (TTS).

Base URL

All examples assume the server is running at http://localhost:8000. Adjust the URL if your setup differs.

Health & System

GET /health

curl
curl http://localhost:8000/health
Response
{
  "status": "ok",
  "version": "0.1.7",
  "models_loaded": 2,
  "workers_ready": 2,
  "workers_total": 2
}
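The workers_ready / workers_total pair makes this payload a convenient startup gate. A minimal sketch (the is_ready helper is ours, not part of the API):

```python
def is_ready(health: dict) -> bool:
    """True once the server reports ok and every model worker is up."""
    return (
        health.get("status") == "ok"
        and health.get("workers_ready") == health.get("workers_total")
    )

# Usage against a live server (httpx, as in the other examples):
#   r = httpx.get("http://localhost:8000/health", timeout=5)
#   ready = is_ready(r.json())
```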

GET /v1/models

curl
curl http://localhost:8000/v1/models
Response
{
  "object": "list",
  "data": [
    {
      "id": "faster-whisper-tiny",
      "object": "model",
      "owned_by": "macaw",
      "created": 0,
      "type": "stt",
      "engine": "faster-whisper"
    },
    {
      "id": "kokoro-v1",
      "object": "model",
      "owned_by": "macaw",
      "created": 0,
      "type": "tts",
      "engine": "kokoro"
    }
  ]
}
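Client code often needs to pick an STT or TTS model out of this listing. A small sketch that filters on the type field (the helper name is ours):

```python
def models_by_type(listing: dict, model_type: str) -> list:
    """Return the ids of models whose type matches ('stt' or 'tts')."""
    return [m["id"] for m in listing["data"] if m.get("type") == model_type]

listing = {
    "object": "list",
    "data": [
        {"id": "faster-whisper-tiny", "type": "stt", "engine": "faster-whisper"},
        {"id": "kokoro-v1", "type": "tts", "engine": "kokoro"},
    ],
}
print(models_by_type(listing, "tts"))  # ['kokoro-v1']
```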

Audio Transcription

JSON format (default)

curl
curl -X POST http://localhost:8000/v1/audio/transcriptions \
  -F file=@audio.wav \
  -F model=faster-whisper-tiny \
  -F response_format=json
Python
import httpx

with open("audio.wav", "rb") as f:
    r = httpx.post(
        "http://localhost:8000/v1/audio/transcriptions",
        files={"file": ("audio.wav", f, "audio/wav")},
        data={"model": "faster-whisper-tiny", "response_format": "json"},
        timeout=120,
    )
print(r.json())
Response
{
  "text": "Hello world, this is a test of the Macaw voice system."
}

Verbose JSON format

Includes segments with timestamps, language detection, and duration:

curl
curl -X POST http://localhost:8000/v1/audio/transcriptions \
  -F file=@audio.wav \
  -F model=faster-whisper-tiny \
  -F response_format=verbose_json
Response
{
  "text": "Hello world, this is a test of the Macaw voice system.",
  "segments": [
    {
      "id": 0,
      "start": 0.0,
      "end": 4.0,
      "text": "Hello world, this is a test of the Macaw voice system."
    }
  ],
  "language": "en",
  "duration": 3.9
}

Text format

Returns plain text without JSON wrapping:

curl
curl -X POST http://localhost:8000/v1/audio/transcriptions \
  -F file=@audio.wav \
  -F model=faster-whisper-tiny \
  -F response_format=text
Response
Hello world, this is a test of the Macaw voice system.

SRT subtitle format

curl
curl -X POST http://localhost:8000/v1/audio/transcriptions \
  -F file=@audio.wav \
  -F model=faster-whisper-tiny \
  -F response_format=srt
Response
1
00:00:00,000 --> 00:00:04,000
Hello world, this is a test of the Macaw voice system.

VTT subtitle format

curl
curl -X POST http://localhost:8000/v1/audio/transcriptions \
  -F file=@audio.wav \
  -F model=faster-whisper-tiny \
  -F response_format=vtt
Response
WEBVTT

00:00:00.000 --> 00:00:04.000
Hello world, this is a test of the Macaw voice system.

With explicit language

curl
curl -X POST http://localhost:8000/v1/audio/transcriptions \
  -F file=@audio.wav \
  -F model=faster-whisper-tiny \
  -F language=en \
  -F response_format=verbose_json
Response
{
  "language": "en",
  "duration": 3.9,
  "text": "Hello world, this is a test of the Macaw voice system."
}

Word-level timestamps

curl
curl -X POST http://localhost:8000/v1/audio/transcriptions \
  -F file=@audio.wav \
  -F model=faster-whisper-tiny \
  -F response_format=verbose_json \
  -F "timestamp_granularities[]=word"
Response — words array
[
  {"word": "Hello", "start": 0.0, "end": 0.64},
  {"word": "world,", "start": 0.64, "end": 1.06},
  {"word": "this", "start": 1.46, "end": 1.66},
  {"word": "is", "start": 1.66, "end": 1.86},
  {"word": "a", "start": 1.86, "end": 1.98}
]
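The start/end pairs above are easy to post-process. A sketch that computes per-word durations and flags pauses between words (the 0.3 s threshold is our choice, not an API value):

```python
def word_durations(words, pause_threshold=0.3):
    """Return (word, duration) pairs plus any inter-word gaps above the threshold."""
    durations = [(w["word"], round(w["end"] - w["start"], 2)) for w in words]
    pauses = [
        round(b["start"] - a["end"], 2)
        for a, b in zip(words, words[1:])
        if b["start"] - a["end"] > pause_threshold
    ]
    return durations, pauses

words = [
    {"word": "Hello", "start": 0.0, "end": 0.64},
    {"word": "world,", "start": 0.64, "end": 1.06},
    {"word": "this", "start": 1.46, "end": 1.66},
]
durations, pauses = word_durations(words)
# pauses: [0.4] -- the gap between "world," (1.06) and "this" (1.46)
```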

Audio Translation

Translates audio from any supported language to English:

curl
curl -X POST http://localhost:8000/v1/audio/translations \
  -F file=@audio.wav \
  -F model=faster-whisper-tiny
Response
{
  "text": "Hello world, this is a test of the Macaw voice system."
}

Translation target

Translation always outputs English text, regardless of the source language. This matches the OpenAI API behavior.


Speech Synthesis

WAV format (default)

curl
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "kokoro-v1", "input": "Hello world", "voice": "default"}' \
  --output speech.wav
Python
import httpx

r = httpx.post(
    "http://localhost:8000/v1/audio/speech",
    json={"model": "kokoro-v1", "input": "Hello world", "voice": "default"},
    timeout=120,
)
assert r.content[:4] == b"RIFF"  # WAV header
with open("speech.wav", "wb") as f:
    f.write(r.content)
Response
Content-Type: audio/wav
Body: 73,244 bytes (WAV file with RIFF header)

PCM format

Raw 16-bit PCM audio without a WAV header:

curl
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "kokoro-v1", "input": "Hello world", "response_format": "pcm"}' \
  --output speech.pcm
Response
Content-Type: audio/pcm
Body: 73,200 bytes (raw PCM 16-bit, 24kHz, mono)

WAV vs PCM size

The 44-byte difference between WAV (73,244) and PCM (73,200) is exactly the WAV file header.
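If you fetch pcm (for example, to post-process samples) and later need a playable file, the stdlib wave module can add that 44-byte header back. A sketch assuming the 24 kHz, 16-bit mono output shown above:

```python
import io
import wave

def pcm_to_wav(pcm: bytes, sample_rate: int = 24000) -> bytes:
    """Wrap raw 16-bit mono PCM in a standard 44-byte WAV header."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)   # mono
        w.setsampwidth(2)   # 16-bit samples
        w.setframerate(sample_rate)
        w.writeframes(pcm)
    return buf.getvalue()

# len(pcm_to_wav(pcm)) == len(pcm) + 44, matching the sizes above
```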

Speed control

curl — 2x speed
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "kokoro-v1", "input": "Testing speed control", "speed": 2.0}' \
  --output fast.wav
Response comparison
Speed 1.0x: 90,044 bytes
Speed 2.0x: 45,644 bytes  (~50% size, as expected)

Audio effects — pitch shift

Shift the pitch up or down by semitones:

curl
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "kokoro-v1", "input": "Pitch shift test",
       "effects": {"pitch_shift_semitones": 3.0}}' \
  --output pitched.wav
Response
Status: 200
Body: 82,844 bytes (WAV)

Audio effects — reverb

Add room reverb to the generated speech:

curl
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "kokoro-v1", "input": "Reverb test",
       "effects": {"reverb_room_size": 0.7, "reverb_damping": 0.5,
                   "reverb_wet_dry_mix": 0.3}}' \
  --output reverb.wav
Response
Status: 200
Body: 85,244 bytes (WAV)

Audio effects — combined

Pitch shift and reverb can be applied together:

curl
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "kokoro-v1", "input": "Combined effects",
       "effects": {"pitch_shift_semitones": -2.0,
                   "reverb_room_size": 0.5, "reverb_wet_dry_mix": 0.2}}' \
  --output combined.wav
Response
Status: 200
Body: 81,644 bytes (WAV)

Word-level alignment (NDJSON)

When include_alignment is enabled, the response switches from binary audio to NDJSON streaming — each line carries a base64-encoded audio chunk, and the first chunk of each segment also carries per-word timing:

curl
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "kokoro-v1", "input": "Hello world",
       "include_alignment": true, "alignment_granularity": "word"}'
Response — NDJSON lines (content-type: application/x-ndjson)
{"type": "audio", "audio": "<base64>", "alignment": {"items": [{"text": "Hello", "start_ms": 350, "duration_ms": 275}, {"text": "world", "start_ms": 625, "duration_ms": 625}], "granularity": "word"}}
{"type": "audio", "audio": "<base64>"}
{"type": "done", "duration": 1.525}

Alignment data

Alignment is attached to the first audio chunk of each synthesis segment. Subsequent chunks carry audio only. The done line provides the total audio duration.
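Consuming the stream amounts to decoding each NDJSON line in turn; a sketch that reassembles the audio and collects alignment items (the parser name is ours):

```python
import base64
import json

def parse_tts_ndjson(lines):
    """Accumulate decoded audio bytes and alignment items from NDJSON lines."""
    audio = bytearray()
    items = []
    for line in lines:
        msg = json.loads(line)
        if msg["type"] == "audio":
            audio += base64.b64decode(msg["audio"])
            if "alignment" in msg:
                items.extend(msg["alignment"]["items"])
    return bytes(audio), items
```

In practice you would feed this line by line as the response arrives (e.g., httpx's iter_lines) rather than buffering the whole stream.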

Character-level alignment

curl
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "kokoro-v1", "input": "Hi",
       "include_alignment": true, "alignment_granularity": "character"}'
Response — alignment items
{
  "items": [
    {"text": "H", "start_ms": 375, "duration_ms": 262},
    {"text": "i", "start_ms": 637, "duration_ms": 263}
  ],
  "granularity": "character"
}

Seed parameter

For reproducible output with non-deterministic engines (e.g., Qwen3-TTS):

curl
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "kokoro-v1", "input": "Reproducibility test", "seed": 42}' \
  --output seeded.wav
Response
Status: 200
Body: 98,444 bytes (WAV)

Deterministic engines

Kokoro is a deterministic engine — seed is accepted but has no effect. Seed is meaningful for non-deterministic engines like Qwen3-TTS where it controls torch.manual_seed() before generation.

Text normalization

Controls whether the engine normalizes text (e.g., numbers to words):

curl — normalization off
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "kokoro-v1", "input": "I have 3 cats.",
       "text_normalization": "off"}' \
  --output normalized.wav
Response for each mode
auto: 86,444 bytes
on:   86,444 bytes
off:  86,444 bytes

Voice Management

List preset voices

curl
curl http://localhost:8000/v1/voices
Python
import httpx

r = httpx.get("http://localhost:8000/v1/voices", timeout=30)
data = r.json()
print(f"Total voices: {len(data['data'])}")
for v in data["data"][:5]:
    print(f"  {v['voice_id']}: {v['name']} ({v.get('language')})")
Response (first 5 of 54)
{
  "object": "list",
  "data": [
    {"voice_id": "af_alloy", "name": "af_alloy", "language": "en"},
    {"voice_id": "af_aoede", "name": "af_aoede", "language": "en"},
    {"voice_id": "af_bella", "name": "af_bella", "language": "en"},
    {"voice_id": "af_heart", "name": "af_heart", "language": "en"},
    {"voice_id": "af_jessica", "name": "af_jessica", "language": "en"}
  ]
}

Create a saved voice

Requires VoiceStore

Voice CRUD requires starting the server with --voice-dir or setting the MACAW_VOICE_DIR environment variable.

curl
curl -X POST http://localhost:8000/v1/voices \
  -F name=test-voice \
  -F voice_type=designed \
  -F "instruction=A calm and warm English voice" \
  -F language=en
Python
import httpx

r = httpx.post(
    "http://localhost:8000/v1/voices",
    data={
        "name": "test-voice",
        "voice_type": "designed",
        "instruction": "A calm and warm English voice",
        "language": "en",
    },
    timeout=30,
)
voice_id = r.json()["voice_id"]
print(f"Created: {voice_id}")
Response — 201 Created
{
  "voice_id": "333742fe-336a-4858-8b67-f1fced51d0d1",
  "name": "test-voice",
  "voice_type": "designed",
  "instruction": "A calm and warm English voice",
  "language": "en",
  "created_at": 1740076801.0
}

Get a saved voice

curl
curl http://localhost:8000/v1/voices/333742fe-336a-4858-8b67-f1fced51d0d1
Response — 200 OK
{
  "voice_id": "333742fe-336a-4858-8b67-f1fced51d0d1",
  "name": "test-voice",
  "voice_type": "designed",
  "instruction": "A calm and warm English voice",
  "language": "en",
  "created_at": 1740076801.0
}

Use a saved voice in synthesis

Reference saved voices with the voice_ prefix:

curl
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "kokoro-v1", "input": "Saved voice test",
       "voice": "voice_333742fe-336a-4858-8b67-f1fced51d0d1"}' \
  --output saved_voice.wav
Response
Status: 200
Body: 88,844 bytes (WAV)

Delete a saved voice

curl
curl -X DELETE http://localhost:8000/v1/voices/333742fe-336a-4858-8b67-f1fced51d0d1
Response — 204 No Content
(empty body)

Confirming deletion:

curl — verify deleted
curl http://localhost:8000/v1/voices/333742fe-336a-4858-8b67-f1fced51d0d1
Response — 404 Not Found
{
  "error": {
    "message": "Voice '333742fe-336a-4858-8b67-f1fced51d0d1' not found",
    "type": "voice_not_found"
  }
}

WebSocket Realtime

The WebSocket endpoint at /v1/realtime supports full-duplex STT and TTS streaming over a single connection.

STT streaming

Python
import asyncio, json, websockets

async def stream_stt():
    ws_url = "ws://localhost:8000/v1/realtime?model=faster-whisper-tiny"
    async with websockets.connect(ws_url) as ws:
        # 1. Receive session.created
        msg = json.loads(await ws.recv())
        assert msg["type"] == "session.created"
        print(f"Session: {msg['session_id']}")

        # 2. Send PCM audio frames (16kHz, 16-bit, mono)
        with open("audio.wav", "rb") as f:
            pcm = f.read()[44:]  # skip WAV header
        for i in range(0, len(pcm), 3200):
            await ws.send(pcm[i:i+3200])
            await asyncio.sleep(0.05)

        # 3. Force commit
        await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))

        # 4. Receive transcript events
        while True:
            raw = await asyncio.wait_for(ws.recv(), timeout=15)
            if isinstance(raw, str):
                ev = json.loads(raw)
                print(f"  {ev['type']}")
                if ev["type"] == "transcript.final":
                    print(f"  Text: {ev['text']}")
                    break

        await ws.send(json.dumps({"type": "session.close"}))

asyncio.run(stream_stt())
Output
Session: sess_fa87d9875e6e
  vad.speech_start
  transcript.final
  Text: Hello world, this is a test.

Session configuration

Python
await ws.send(json.dumps({
    "type": "session.configure",
    "language": "en",
    "enable_partial_transcripts": True,
    "vad_sensitivity": "high",
}))

TTS via WebSocket

Python
import asyncio, json, websockets

async def stream_tts():
    ws_url = "ws://localhost:8000/v1/realtime?model=faster-whisper-tiny"
    async with websockets.connect(ws_url) as ws:
        json.loads(await ws.recv())  # session.created

        # Configure TTS model
        await ws.send(json.dumps({
            "type": "session.configure", "model_tts": "kokoro-v1"
        }))

        # Speak
        await ws.send(json.dumps({
            "type": "tts.speak",
            "text": "Hello from WebSocket",
            "request_id": "test_tts_1",
        }))

        # Receive events + binary audio frames
        events, audio_frames = [], []
        while True:
            raw = await asyncio.wait_for(ws.recv(), timeout=30)
            if isinstance(raw, bytes):
                audio_frames.append(raw)
            else:
                ev = json.loads(raw)
                events.append(ev)
                print(f"  {ev['type']}")
                if ev["type"] == "tts.speaking_end":
                    break

        total = sum(len(f) for f in audio_frames)
        print(f"Audio: {len(audio_frames)} frames, {total:,} bytes")
        await ws.send(json.dumps({"type": "session.close"}))

asyncio.run(stream_tts())
Output
  tts.speaking_start
  tts.speaking_end
Audio: 23 frames, 91,200 bytes

Binary frame direction

In the WebSocket protocol, binary frames server-to-client are always TTS audio. Binary frames client-to-server are always STT audio. No ambiguity.
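That rule makes client-side dispatch a one-line type check. A sketch of the routing step a receive loop might use (names are ours):

```python
import json

def route_frame(raw):
    """Binary frames from the server are TTS audio; text frames are JSON events."""
    if isinstance(raw, (bytes, bytearray)):
        return "audio", bytes(raw)
    return "event", json.loads(raw)

kind, payload = route_frame(b"\x00\x01")
# ("audio", b"\x00\x01")
kind, payload = route_frame('{"type": "tts.speaking_end"}')
# ("event", {"type": "tts.speaking_end"})
```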

TTS with word alignment

Python
await ws.send(json.dumps({
    "type": "tts.speak",
    "text": "Alignment test",
    "include_alignment": True,
    "request_id": "align",
}))
tts.alignment event (received before audio frame)
{
  "type": "tts.alignment",
  "items": [
    {"text": "Alignment", "start_ms": 350, "duration_ms": 425},
    {"text": "test", "start_ms": 775, "duration_ms": 575}
  ]
}

TTS cancel

Cancel an in-progress synthesis:

Python
# Start speaking
await ws.send(json.dumps({
    "type": "tts.speak",
    "text": "This long sentence should be cancelled before finishing.",
    "request_id": "cancel_test",
}))

# Wait for tts.speaking_start, then cancel
# ...receive tts.speaking_start...
await ws.send(json.dumps({"type": "tts.cancel"}))

# Receive tts.speaking_end with cancelled flag
# ...receive tts.speaking_end...
tts.speaking_end event
{
  "type": "tts.speaking_end",
  "request_id": "cancel_test",
  "cancelled": true
}

Error Handling

All error responses follow the OpenAI error format:

Empty text (422)

curl
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "kokoro-v1", "input": ""}'
Response
Status: 422 (Pydantic validation — input min_length=1)

Non-existent model (404)

curl
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "nonexistent", "input": "test"}'
Response
{
  "error": {
    "message": "Model 'nonexistent' not found",
    "type": "model_not_found"
  }
}

Non-existent voice (404)

curl
curl http://localhost:8000/v1/voices/nonexistent
Response
{
  "error": {
    "message": "Voice 'nonexistent' not found",
    "type": "voice_not_found"
  }
}

Invalid audio format (400)

curl
curl -X POST http://localhost:8000/v1/audio/transcriptions \
  -F "file=@bad.txt;type=text/plain" \
  -F model=faster-whisper-tiny
Response
Status: 400 (unsupported content type)

Alignment + Opus conflict (400)

curl
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "kokoro-v1", "input": "test",
       "include_alignment": true, "response_format": "opus"}'
Response
Status: 400 (alignment requires raw audio, not codec-encoded)

Error Response Summary

Status  Scenario      Description
400     Bad request   Invalid audio format, missing fields, conflicting options
404     Not found     Model or voice does not exist
422     Validation    Pydantic field validation (empty text, out-of-range values)
502     Worker crash  gRPC worker process died mid-request
503     Unavailable   No workers ready for the requested model
504     Timeout       Worker did not respond within gRPC deadline
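The 5xx rows are transient (a recovered worker can serve a retried request), while the 4xx rows indicate caller errors. A sketch of the classification a client retry loop might use (the helper is ours, not part of the API):

```python
RETRYABLE_STATUSES = {502, 503, 504}  # worker crash, no workers ready, gRPC deadline

def should_retry(status: int) -> bool:
    """Retry only transient server-side failures; a 4xx means fix the request."""
    return status in RETRYABLE_STATUSES
```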