Roadmap

Macaw OpenVoice has completed all nine milestones of the initial Product Requirements Document. The runtime is fully functional, with speech-to-text (STT), text-to-speech (TTS), and full-duplex voice capabilities.

Completed Milestones

| Phase | Milestone | What Was Delivered |
| --- | --- | --- |
| 1 | M1 — API Server | FastAPI with health endpoint, CORS, OpenAI-compatible structure |
| 1 | M2 — Model Registry | macaw.yaml manifests, model lifecycle, architecture field |
| 2 | M3 — Scheduler | Priority queue, cancellation, dynamic batching, latency tracking |
| 2 | M4 — STT Workers | gRPC subprocess workers, Faster-Whisper backend, crash recovery |
| 3 | M5 — Streaming STT | WebSocket /v1/realtime, VAD pipeline, streaming preprocessor |
| 3 | M6 — Session Manager | State machine (6 states), ring buffer, WAL, backpressure |
| 4 | M7 — Multi-Engine | WeNet CTC backend, pipeline adaptation by architecture |
| 4 | M8 — TTS | Kokoro TTS backend, POST /v1/audio/speech, gRPC TTS worker |
| 5 | M9 — Full-Duplex | Mute-on-speak, tts.speak/tts.cancel, STT+TTS on same WebSocket |
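
Milestones M5 and M9 together mean that streaming transcription and speech synthesis share one WebSocket. The sketch below is a minimal client-side illustration of that flow, not a reference implementation: it assumes the runtime listens on localhost:8000, that control messages are JSON frames with a `type` field, and that raw audio travels as binary frames. The field names are illustrative placeholders rather than the runtime's exact schema.

```python
# Minimal full-duplex client sketch.
# Assumptions (not taken from the milestone table): runtime on localhost:8000,
# JSON control frames with a "type" field, binary frames for raw audio.
import asyncio
import json

import websockets


async def full_duplex_demo(audio_chunks):
    async with websockets.connect("ws://localhost:8000/v1/realtime") as ws:
        # Stream captured audio as binary frames; the server-side VAD
        # pipeline (M5) segments it and emits transcripts.
        for chunk in audio_chunks:
            await ws.send(chunk)

        # Request speech on the same socket (M9). While the reply plays,
        # mute-on-speak keeps the runtime from transcribing its own output.
        await ws.send(json.dumps({"type": "tts.speak", "text": "Hello!"}))

        # Read whatever comes back: binary frames are synthesized audio,
        # text frames are JSON events such as transcripts.
        async for message in ws:
            if isinstance(message, bytes):
                continue  # synthesized audio
            print(json.loads(message))


# asyncio.run(full_duplex_demo(chunks))  # chunks: iterable of PCM byte strings
```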

Current State

  • 1,600+ tests passing (unit + integration)
  • 3 STT architectures supported: encoder-decoder, CTC, streaming-native
  • 2 STT engines: Faster-Whisper, WeNet
  • 1 TTS engine: Kokoro (9 languages)
  • Full-duplex voice interactions on a single WebSocket
  • OpenAI-compatible REST API (example below)
  • Ollama-style CLI
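
On the REST side, the documented POST /v1/audio/speech endpoint (M8) follows the OpenAI audio API shape. A rough sketch of a synthesis request is shown below; the model name, voice id, and request fields are assumptions based on that compatibility claim, not a confirmed contract.

```python
# Hedged TTS example against the OpenAI-compatible REST API.
# Assumptions: runtime on localhost:8000; OpenAI-style request fields
# ("model", "input", "voice"); the model and voice names are placeholders.
import requests

resp = requests.post(
    "http://localhost:8000/v1/audio/speech",
    json={
        "model": "kokoro",                      # placeholder model name
        "input": "Hello from Macaw OpenVoice.",
        "voice": "af_bella",                    # placeholder voice id
    },
    timeout=60,
)
resp.raise_for_status()

# The endpoint returns raw audio bytes, mirroring OpenAI's speech endpoint.
with open("speech.wav", "wb") as f:
    f.write(resp.content)
```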

What's Next

The following areas are under consideration for future development. These are not commitments — they represent directions the project may explore based on community feedback and priorities.

Engine Ecosystem

| Feature | Description |
| --- | --- |
| Paraformer backend | Streaming-native architecture support |
| Piper TTS | Lightweight TTS alternative for CPU-only deployments |
| Whisper.cpp | GGML-based inference without Python/CUDA dependency |
| Multi-model serving | Load multiple models per worker type |

Scalability

| Feature | Description |
| --- | --- |
| Worker pooling | Multiple worker instances per engine for higher throughput |
| Horizontal scaling | Multiple runtime instances behind a load balancer |
| GPU sharing | Time-slice GPU across STT and TTS workers |
| Kubernetes operator | Automated deployment with GPU scheduling |

Features

| Feature | Description |
| --- | --- |
| Speaker diarization | Identify and label different speakers |
| Word-level timestamps | Per-word timing in streaming mode |
| Custom vocabularies | User-defined vocabularies beyond hot words |
| Audio streaming output | Server-Sent Events for TTS as an alternative to WebSocket |
| Barge-in | Client interrupts TTS to speak (currently requires tts.cancel) |
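
Of these, barge-in already has a workable client-side pattern today: the client watches its own capture path and sends tts.cancel on the realtime socket as soon as the user starts talking. A minimal sketch of that pattern follows, reusing the JSON control-frame assumption from the earlier example; user_is_speaking() is a hypothetical stand-in for whatever local voice-activity check the client runs.

```python
# Client-side barge-in sketch: cancel TTS playback when the user speaks.
# Assumes the same "type"-tagged JSON control frames as the earlier example;
# user_is_speaking() is a hypothetical local voice-activity check.
import asyncio
import json


async def barge_in(ws, user_is_speaking):
    """Poll a local voice-activity signal and cancel server TTS when it fires."""
    while not user_is_speaking():
        await asyncio.sleep(0.02)  # ~20 ms polling interval
    await ws.send(json.dumps({"type": "tts.cancel"}))
```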

Observability

| Feature | Description |
| --- | --- |
| OpenTelemetry | Distributed tracing across runtime and workers |
| Dashboard templates | Pre-built Grafana dashboards for Prometheus metrics |
| Structured audit logging | Request/response logging for compliance |

Contributing

Want to help shape the roadmap? See the Contributing Guide to get started, or open an issue on GitHub to discuss new ideas.