Faster-Whisper

Faster-Whisper is the primary STT engine in Macaw OpenVoice. It provides high-accuracy multilingual speech recognition using CTranslate2-optimized Whisper models. Five model variants are available in the official catalog, covering use cases from lightweight testing to production-grade transcription.

Installation

pip install "macaw-openvoice[faster-whisper]"

This installs faster-whisper>=1.1,<2.0 as an optional dependency.
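
After installing, a quick import check confirms the optional dependency resolved (faster-whisper exposes __version__ in its 1.x releases):

Verify the install
python -c "import faster_whisper; print(faster_whisper.__version__)"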

Architecture

Faster-Whisper uses an encoder-decoder architecture (it is based on OpenAI Whisper). Because of this, the runtime adapts the streaming pipeline in three ways:

  • LocalAgreement — confirms tokens across consecutive inference passes before emitting partials (see the sketch below the diagram)
  • Cross-segment context — passes up to 224 tokens from the previous final transcript as initial_prompt to maintain continuity across segments
  • Accumulation — audio is buffered for ~5 seconds before each inference pass, rather than processed frame-by-frame
Audio → [5s accumulation] → Inference → LocalAgreement → Partial/Final
                                ↑
                initial_prompt (224 tokens from previous final)
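
The LocalAgreement step is easiest to see in miniature. The sketch below is illustrative, not Macaw's actual implementation: only the token prefix that two consecutive inference passes agree on is emitted, which keeps partials from flickering as the hypothesis changes.

def confirmed_prefix(prev_tokens, curr_tokens):
    """Longest common prefix of two consecutive hypothesis token lists."""
    agreed = []
    for a, b in zip(prev_tokens, curr_tokens):
        if a != b:
            break
        agreed.append(a)
    return agreed

# Pass n hears "the cat sat"; pass n+1 hears "the cat sat on the"
print(confirmed_prefix(["the", "cat", "sat"], ["the", "cat", "sat", "on", "the"]))
# ['the', 'cat', 'sat'] is emitted as a partial; "on the" waits for the next pass
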
Why accumulation?

Encoder-decoder models like Whisper process audio in fixed-length windows (30 seconds in Whisper's case), so feeding them tiny chunks wastes compute and hurts accuracy. The 5-second accumulation threshold balances latency against transcription quality.
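
In concrete terms, at the 16 kHz, 16-bit mono PCM format used in the streaming example later on this page, the threshold works out to 50 of the 100 ms chunks shown there:

SAMPLE_RATE = 16_000              # samples per second
BYTES_PER_SAMPLE = 2              # 16-bit PCM
CHUNK_BYTES = 3_200               # one 100 ms chunk, as in the WebSocket example

threshold_bytes = 5 * SAMPLE_RATE * BYTES_PER_SAMPLE
print(threshold_bytes)                  # 160000 bytes buffered per inference pass
print(threshold_bytes // CHUNK_BYTES)   # 50 chunks accumulate before each pass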

Model Variants

large-v3

The highest quality Faster-Whisper model. Best accuracy across 100+ languages.

| Property | Value |
|---|---|
| Catalog name | faster-whisper-large-v3 |
| HuggingFace repo | Systran/faster-whisper-large-v3 |
| Memory | 3,072 MB |
| GPU | Recommended |
| Load time | ~8 seconds |
| Languages | 100+ (auto-detect) |
| Translation | Yes (any → English) |

macaw pull faster-whisper-large-v3

Best for: Production workloads where accuracy matters most. Multilingual support. Translation tasks.

medium

Good balance between quality and speed. Suitable for production with GPU.

| Property | Value |
|---|---|
| Catalog name | faster-whisper-medium |
| HuggingFace repo | Systran/faster-whisper-medium |
| Memory | 1,536 MB |
| GPU | Recommended |
| Load time | ~5 seconds |
| Languages | 100+ (auto-detect) |
| Translation | Yes (any → English) |

macaw pull faster-whisper-medium

Best for: Production with moderate GPU resources. Near large-v3 quality at lower cost.

small

Lightweight model that runs well on CPU. Good quality for common languages.

| Property | Value |
|---|---|
| Catalog name | faster-whisper-small |
| HuggingFace repo | Systran/faster-whisper-small |
| Memory | 512 MB |
| GPU | Optional |
| Load time | ~3 seconds |
| Languages | 100+ (auto-detect) |
| Translation | Yes (any → English) |

macaw pull faster-whisper-small

Best for: CPU-only deployments. Development and staging environments. Good speed/accuracy trade-off.

tiny

Ultra-lightweight model for testing and prototyping. Fastest to load.

| Property | Value |
|---|---|
| Catalog name | faster-whisper-tiny |
| HuggingFace repo | Systran/faster-whisper-tiny |
| Memory | 256 MB |
| GPU | Optional |
| Load time | ~2 seconds |
| Languages | 100+ (auto-detect) |
| Translation | Yes (any → English) |

macaw pull faster-whisper-tiny

Best for: Quick testing and prototyping. CI/CD pipelines. Environments with minimal resources.

distil-large-v3

A distilled version of large-v3: approximately 6x faster with only a ~1% WER gap. English only.

| Property | Value |
|---|---|
| Catalog name | distil-whisper-large-v3 |
| HuggingFace repo | Systran/faster-distil-whisper-large-v3 |
| Memory | 1,536 MB |
| GPU | Recommended |
| Load time | ~5 seconds |
| Languages | English only |
| Translation | No |

macaw pull distil-whisper-large-v3

Best for: English-only production workloads where speed matters. High-throughput transcription. When large-v3 is too slow but you want near-equal quality.

Capabilities

| Capability | Supported | Notes |
|---|---|---|
| Streaming | Yes | 5s accumulation threshold |
| Batch inference | Yes | Via POST /v1/audio/transcriptions |
| Word timestamps | Yes | Per-word start/end/probability |
| Language detection | Yes | Automatic when language is "auto" or omitted |
| Translation | Yes | Any language → English (except distil-large-v3) |
| Initial prompt | Yes | Context string to guide transcription |
| Hot words | No | Workaround via initial_prompt prefix |
| Partial transcripts | Yes | Via LocalAgreement in streaming mode |

Hot Words Workaround

Faster-Whisper does not support native keyword boosting. However, Macaw provides a workaround by prepending hot words to the initial_prompt:

# In the backend, hot_words are converted to an initial_prompt prefix:
# hot_words=["Macaw", "OpenVoice"] → initial_prompt="Terms: Macaw, OpenVoice."

This biases the model toward recognizing these terms but is less reliable than native hot word support (see WeNet for native hot words).
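
A minimal sketch of that conversion, assuming the simple prefix format shown in the comment above (the function name is illustrative, not Macaw's internal API):

def hot_words_to_prompt(hot_words, user_prompt=""):
    """Fold hot words into an initial_prompt prefix, as described above."""
    prefix = f"Terms: {', '.join(hot_words)}."
    return f"{prefix} {user_prompt}".strip() if user_prompt else prefix

print(hot_words_to_prompt(["Macaw", "OpenVoice"]))
# Terms: Macaw, OpenVoice.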

Language Handling

| Input | Behavior |
|---|---|
| "auto" | Auto-detect language (passed as None to Faster-Whisper) |
| "mixed" | Auto-detect language (same as "auto") |
| "en", "pt", etc. | Force the specified language |
| Omitted | Auto-detect |

The model supports 100+ languages. The catalog manifests list ["auto", "en", "pt", "es", "ja", "zh"] as common examples, but all Whisper-supported languages work.
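
The mapping in the table reduces to a small normalization step. This is a sketch of the behavior, not the runtime's actual code:

def resolve_language(language=None):
    """Map the API's language field to what Faster-Whisper expects."""
    if language in (None, "auto", "mixed"):
        return None   # None tells Faster-Whisper to auto-detect
    return language   # e.g. "en" or "pt" forces that language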

Engine Configuration

The engine_config section in the model manifest controls Faster-Whisper behavior:

engine_config defaults
engine_config:
  model_size: "large-v3"    # Model size or path
  compute_type: "float16"   # float16, int8, int8_float16, float32
  device: "auto"            # "auto", "cpu", "cuda"
  beam_size: 5              # Beam search width
  vad_filter: false         # Always false — VAD is handled by the runtime

| Parameter | Type | Default | Description |
|---|---|---|---|
| model_size | string | (from catalog) | Whisper model size or path to a model directory |
| compute_type | string | "float16" | Quantization type: float16 for GPU, int8 for CPU |
| device | string | "auto" | Inference device; "auto" selects GPU if available |
| beam_size | int | 5 | Beam search width; higher = more accurate, slower |
| vad_filter | bool | false | Internal VAD filter. Always false — Macaw handles VAD |
Never set vad_filter: true

The Macaw runtime runs its own VAD pipeline (energy pre-filter + Silero VAD). Enabling the internal Faster-Whisper VAD filter would duplicate the work and produce inconsistent behavior.
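
As an example, a CPU-only deployment of the small model might override the defaults like this (values follow the quantization guidance in the table above; the exact combination is illustrative):

engine_config:
  model_size: "small"
  compute_type: "int8"   # int8 is the recommended quantization on CPU
  device: "cpu"
  beam_size: 5
  vad_filter: false      # never true; Macaw's runtime owns VAD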

Usage Examples

Batch Transcription (REST API)

Transcribe a file
curl -X POST http://localhost:8000/v1/audio/transcriptions \
  -F "file=@meeting.wav" \
  -F "model=faster-whisper-large-v3" \
  -F "language=auto" \
  -F "response_format=verbose_json"
Translate to English
curl -X POST http://localhost:8000/v1/audio/translations \
  -F "file=@audio_pt.wav" \
  -F "model=faster-whisper-large-v3"

Streaming (WebSocket)

Python WebSocket client
import asyncio
import json
import websockets

async def stream_audio():
    uri = "ws://localhost:8000/v1/realtime"
    async with websockets.connect(uri) as ws:
        # Configure session
        await ws.send(json.dumps({
            "type": "session.configure",
            "model": "faster-whisper-large-v3",
            "language": "auto"
        }))

        # Stream audio chunks (16-bit PCM, 16kHz, mono)
        with open("audio.raw", "rb") as f:
            while chunk := f.read(3200):  # 100ms chunks
                await ws.send(chunk)

                # Check for transcripts
                try:
                    msg = await asyncio.wait_for(ws.recv(), timeout=0.1)
                    event = json.loads(msg)
                    if event["type"] == "transcript.partial":
                        print(f" ...{event['text']}")
                    elif event["type"] == "transcript.final":
                        print(f" >> {event['text']}")
                except asyncio.TimeoutError:
                    pass

asyncio.run(stream_audio())

CLI

Transcribe via CLI
macaw transcribe meeting.wav --model faster-whisper-large-v3

# With word timestamps
macaw transcribe meeting.wav --model faster-whisper-large-v3 --word-timestamps

Comparison of Variants

| | tiny | small | medium | large-v3 | distil-large-v3 |
|---|---|---|---|---|---|
| Memory | 256 MB | 512 MB | 1,536 MB | 3,072 MB | 1,536 MB |
| GPU needed | No | No | Recommended | Recommended | Recommended |
| Load time | ~2s | ~3s | ~5s | ~8s | ~5s |
| Languages | 100+ | 100+ | 100+ | 100+ | English only |
| Translation | Yes | Yes | Yes | Yes | No |
| Relative speed | Fastest | Fast | Moderate | Slowest | ~6x faster than large-v3 |
| Best for | Testing | CPU deploy | Balanced | Max accuracy | Fast English |

Manifest Reference

Every Faster-Whisper model in the catalog uses the same manifest structure. Here is the full manifest for faster-whisper-large-v3:

macaw.yaml (faster-whisper-large-v3)
name: faster-whisper-large-v3
version: "3.0.0"
engine: faster-whisper
type: stt
description: "Faster Whisper Large V3 - encoder-decoder STT"

capabilities:
  streaming: true
  architecture: encoder-decoder
  languages: ["auto", "en", "pt", "es", "ja", "zh"]
  word_timestamps: true
  translation: true
  partial_transcripts: true
  hot_words: false
  batch_inference: true
  language_detection: true
  initial_prompt: true

resources:
  memory_mb: 3072
  gpu_required: false
  gpu_recommended: true
  load_time_seconds: 8

engine_config:
  model_size: "large-v3"
  compute_type: "float16"
  device: "auto"
  beam_size: 5
  vad_filter: false
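
As a quick sanity check when authoring a manifest, the structure above can be validated with a few lines of PyYAML (a sketch; the field names come from the manifest shown above):

import yaml

with open("macaw.yaml") as f:
    manifest = yaml.safe_load(f)

assert manifest["engine"] == "faster-whisper"
assert manifest["type"] == "stt"
# The rule from "Never set vad_filter: true":
assert manifest["engine_config"]["vad_filter"] is False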