STT Configuration

File: configs/stt.yaml
Command: kenzy-stt [config_path]

The STT service accepts POST requests with base64-encoded PCM audio and returns a transcript. It is built on faster-whisper, a CTranslate2-optimized implementation of OpenAI Whisper.

Full reference

Key        Default      Description
---------  -----------  -------------
host       "127.0.0.1"  Bind address
port       8767         HTTP port
log_level  "info"       Log verbosity

Whisper model

Key                   Default  Description
--------------------  -------  -----------
whisper.model         "tiny"   Model size: tiny, base, small, medium, large-v2, large-v3. Larger models are more accurate but slower and need more RAM.
whisper.device        "cpu"    Inference device: cpu or cuda
whisper.compute_type  "int8"   Quantisation: int8 (fastest on CPU), float16 (GPU), float32 (highest quality)
whisper.language      "en"     Language code (e.g. "en", "fr"), or null for auto-detect
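
As an illustration of how these keys combine, a GPU host might use something like the following. The values are drawn from the options listed above; the choice of medium and of auto-detection is only an example, not a recommended default.

```yaml
# Sketch of a GPU-oriented whisper section (illustrative values only)
whisper:
  model: "medium"          # larger model; see the model size guide
  device: "cuda"           # run inference on the GPU
  compute_type: "float16"  # the table lists float16 for GPU use
  language: null           # null enables language auto-detection
```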

Model size guide

Model     Size     Relative speed  Notes
--------  -------  --------------  -----
tiny      ~75 MB   Fastest         Good for modest hardware or simple commands
base      ~145 MB  Fast            Better accuracy, still CPU-friendly
small     ~460 MB  Moderate        Good balance for a dedicated CPU server
medium    ~1.5 GB  Slow on CPU     Recommended with a GPU
large-v3  ~3 GB    Slow            Best accuracy; GPU strongly recommended

Raspberry Pi

On a Pi Zero 2 W, do not run the STT service on the Pi itself. Run it on a more powerful machine and point stt.url in server.yaml at that machine. The tiny or base model on a modern x86 CPU gives acceptable latency.
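
A minimal sketch of that split setup follows. The address 192.168.1.50 is a placeholder, the nesting of stt.url as a url key under stt is inferred from the dotted name, and the exact URL form (scheme, path) that server.yaml expects is an assumption to check against the server configuration. Because the default host of 127.0.0.1 only accepts local connections, the STT machine needs to bind to an address the Pi can reach.

```yaml
# configs/stt.yaml on the more powerful machine
host: "0.0.0.0"   # listen on all interfaces instead of loopback only
port: 8767

whisper:
  model: "base"
  device: "cpu"
  compute_type: "int8"
```

```yaml
# server.yaml on the Pi (hypothetical address; URL form is an assumption)
stt:
  url: "http://192.168.1.50:8767"
```

Binding to 0.0.0.0 exposes the service to the whole network, so a specific LAN address or a firewall rule may be preferable.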

Example

host: "127.0.0.1"
port: 8767

whisper:
  model: "base"
  device: "cpu"
  compute_type: "int8"
  language: "en"