
WeNet

WeNet is a CTC-based STT engine optimized for low-latency streaming and Chinese speech recognition. Unlike Faster-Whisper, WeNet produces native partial transcripts frame-by-frame without requiring LocalAgreement, making it ideal for real-time applications where latency is critical.

Bring your own model

WeNet has no pre-configured models in the Macaw catalog. You provide your own WeNet model and create a macaw.yaml manifest for it. See Creating a Manifest below.

Installation

pip install "macaw-openvoice[wenet]"

This installs wenet>=2.0,<3.0 as an optional dependency.

Architecture

WeNet uses the CTC (Connectionist Temporal Classification) architecture, so the runtime adapts the streaming pipeline accordingly:

  • No LocalAgreement — CTC produces native partial transcripts directly
  • No cross-segment context — CTC does not support initial_prompt conditioning
  • No accumulation — each chunk is processed immediately (frame-by-frame, minimum 160ms)
Audio → [immediate processing] → Native CTC partials → Final
(160ms minimum)
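
The 160ms floor translates directly into sample and byte counts. A quick back-of-the-envelope calculation, assuming 16 kHz mono 16-bit PCM input (the same format the streaming example below implies with its 3200-byte, 100ms chunks):

SAMPLE_RATE = 16_000    # Hz, mono
BYTES_PER_SAMPLE = 2    # 16-bit PCM

# Minimum audio WeNet decodes at once: 160ms
min_samples = int(0.160 * SAMPLE_RATE)       # 2560 samples
min_bytes = min_samples * BYTES_PER_SAMPLE   # 5120 bytes
print(min_samples, min_bytes)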

Faster-Whisper vs. WeNet Streaming

| Behavior | Faster-Whisper (Encoder-Decoder) | WeNet (CTC) |
| --- | --- | --- |
| Audio buffering | 5s accumulation | Frame-by-frame (160ms min) |
| Partial generation | Via LocalAgreement | Native |
| Cross-segment context | 224 tokens via initial_prompt | Not supported |
| First partial latency | ~5 seconds | ~160 milliseconds |
| Best for | Accuracy | Low latency |

Capabilities

| Capability | Supported | Notes |
| --- | --- | --- |
| Streaming | Yes | Native frame-by-frame partials |
| Batch inference | Yes | Via POST /v1/audio/transcriptions |
| Word timestamps | Yes | From token-level output |
| Language detection | No | Language is fixed per model |
| Translation | No |  |
| Initial prompt | No | CTC does not support conditioning |
| Hot words | Yes | Native keyword boosting via context biasing |
| Partial transcripts | Yes | Native CTC partials |

Native Hot Words

WeNet supports native keyword boosting (context biasing), unlike Faster-Whisper, which uses an initial_prompt workaround. This makes hot word recognition more reliable for domain-specific vocabulary:

WebSocket session.configure
{
  "type": "session.configure",
  "model": "my-wenet-model",
  "hot_words": ["CPF", "CNPJ", "PIX"]
}

Language Handling

| Input | Behavior |
| --- | --- |
| "auto" | Falls back to "zh" (Chinese) |
| "mixed" | Falls back to "zh" (Chinese) |
| "zh", "en", etc. | Uses the specified language |
| Omitted | Falls back to "zh" |

WeNet models are typically trained for a specific language (most commonly Chinese). The language parameter is informational — the model always uses the language it was trained for.
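
In other words, the fallback is a small mapping. A minimal sketch of the resolution logic, assuming the runtime normalizes the language before it reaches the engine (resolve_language is illustrative, not Macaw's actual API):

def resolve_language(requested: str | None) -> str:
    # "auto", "mixed", and an omitted value all fall back to Chinese;
    # an explicit code ("zh", "en", ...) is passed through unchanged.
    if requested in (None, "auto", "mixed"):
        return "zh"
    return requested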

Device Handling

| Input | Behavior |
| --- | --- |
| "auto" | Maps to "cpu" |
| "cpu" | CPU inference |
| "cuda" | GPU inference |
| "cuda:0" | Specific GPU |
Tip: Unlike Faster-Whisper, where "auto" selects GPU if available, WeNet's "auto" always maps to "cpu". Explicitly set device: "cuda" if you want GPU inference.
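
The device lookup follows the same pattern, except that "auto" never probes for a GPU. A minimal sketch (resolve_device is illustrative, not Macaw's actual API):

def resolve_device(requested: str | None) -> str:
    # Unlike Faster-Whisper, "auto" (or an omitted value) never
    # selects a GPU; "cuda" / "cuda:0" are passed through as-is.
    if requested in (None, "auto"):
        return "cpu"
    return requested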

Creating a Manifest

Since WeNet has no catalog entries, you must create a macaw.yaml manifest manually in your model directory:

~/.macaw/models/my-wenet-model/macaw.yaml
name: my-wenet-model
version: "1.0.0"
engine: wenet
type: stt
description: "Custom WeNet CTC model for Mandarin"

capabilities:
  streaming: true
  architecture: ctc
  languages: ["zh"]
  word_timestamps: true
  translation: false
  partial_transcripts: true
  hot_words: true
  batch_inference: true
  language_detection: false
  initial_prompt: false

resources:
  memory_mb: 512
  gpu_required: false
  gpu_recommended: false
  load_time_seconds: 3

engine_config:
  language: "chinese"
  device: "cpu"
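
Before pointing Macaw at the directory, it can help to confirm the manifest parses and that the CTC-specific flags are consistent. A quick local sanity check with PyYAML (this is not a Macaw command, just a plain script):

import yaml  # pip install pyyaml

with open("macaw.yaml") as f:
    manifest = yaml.safe_load(f)

caps = manifest["capabilities"]
# A CTC model must not advertise decoder-only features.
assert caps["architecture"] == "ctc"
assert caps["initial_prompt"] is False
assert caps["translation"] is False
assert caps["language_detection"] is False
print(f"{manifest['name']} looks consistent")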

Manifest Fields for WeNet

| Field | Required | Description |
| --- | --- | --- |
| capabilities.architecture | Yes | Must be ctc |
| capabilities.hot_words | Yes | Set to true — WeNet supports native hot words |
| capabilities.initial_prompt | Yes | Must be false — CTC does not support conditioning |
| capabilities.translation | Yes | Must be false — WeNet does not translate |
| capabilities.language_detection | Yes | Must be false — WeNet does not auto-detect language |
| engine_config.language | No | Default language for the model (default: "chinese") |
| engine_config.device | No | Inference device (default: "cpu") |

Setting Up a WeNet Model

  1. Download or train a WeNet model — obtain a model directory with the required files (model weights, config, etc.)

  2. Create the model directory:

    mkdir -p ~/.macaw/models/my-wenet-model
  3. Copy model files into the directory

  4. Create the manifest:

    Create macaw.yaml
    cat > ~/.macaw/models/my-wenet-model/macaw.yaml << 'EOF'
    name: my-wenet-model
    version: "1.0.0"
    engine: wenet
    type: stt
    description: "Custom WeNet model"
    capabilities:
      streaming: true
      architecture: ctc
      languages: ["zh"]
      word_timestamps: true
      translation: false
      partial_transcripts: true
      hot_words: true
      batch_inference: true
      language_detection: false
      initial_prompt: false
    resources:
      memory_mb: 512
      gpu_required: false
      gpu_recommended: false
      load_time_seconds: 3
    engine_config:
      language: "chinese"
      device: "cpu"
    EOF
  5. Verify the model is detected:

    macaw list
    # Should show: my-wenet-model wenet stt ctc
  6. Test transcription:

    macaw transcribe audio_zh.wav --model my-wenet-model

Usage Examples

Batch Transcription

Transcribe a Chinese audio file
curl -X POST http://localhost:8000/v1/audio/transcriptions \
-F "file=@audio_zh.wav" \
-F "model=my-wenet-model"

Streaming (WebSocket)

Low-latency streaming with WeNet
import asyncio
import json
import websockets

async def stream_low_latency():
    uri = "ws://localhost:8000/v1/realtime"
    async with websockets.connect(uri) as ws:
        # Configure with WeNet model and hot words
        await ws.send(json.dumps({
            "type": "session.configure",
            "model": "my-wenet-model",
            "hot_words": ["CPF", "CNPJ", "PIX"]
        }))

        # Stream audio — partials arrive within ~160ms.
        # 3200 bytes per 100ms chunk assumes 16 kHz mono 16-bit PCM.
        with open("audio.raw", "rb") as f:
            while chunk := f.read(3200):  # 100ms chunks
                await ws.send(chunk)

                # Drain any transcript event without stalling the send loop
                try:
                    msg = await asyncio.wait_for(ws.recv(), timeout=0.05)
                    event = json.loads(msg)
                    if event["type"] == "transcript.partial":
                        print(f" ...{event['text']}")
                    elif event["type"] == "transcript.final":
                        print(f" >> {event['text']}")
                except asyncio.TimeoutError:
                    pass

asyncio.run(stream_low_latency())

Engine Configuration Reference

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| language | string | "chinese" | Model language (informational) |
| device | string | "cpu" | Inference device ("cpu", "cuda"; "auto" maps to "cpu") |

When to Choose WeNet

Choose WeNet when:

  • You need the lowest possible latency for streaming (partials in ~160ms vs ~5s for Faster-Whisper)
  • Your application is Chinese-focused
  • You need reliable native hot word support for domain-specific vocabulary
  • You have your own trained WeNet model

Choose Faster-Whisper instead when:

  • You need multilingual support (100+ languages)
  • You need translation capabilities
  • You want ready-to-use catalog models (no manual setup)
  • Accuracy is more important than latency