Architecture¶

End-to-end pipeline¶

Node (mic) ──PCM frames over WebSocket──► Server
                                              │
                          ┌───────────────────┘ on_session_end
                          ▼
                  STT service ──┐  (parallel)
                  Speaker ID ───┘
                          │
                          ▼
                   LLM service  ◄──► Skills (tool-calling loop)
                          │
                          ▼
                   TTS service
                          │
               PCM frames over WebSocket ──► Node (speaker)

When a wake word fires, the node streams raw 16 kHz PCM to the server. When the user stops speaking (VAD silence detection or hard cap), the server runs STT and speaker identification in parallel, passes the transcript and speaker name to the LLM, and streams the synthesized speech response back to the node.

Protocol¶

Control messages are JSON text frames. Audio is raw int16 PCM binary frames at 16 kHz, 80 ms / 1 280 samples / 2 560 bytes each.

Message	Direction	Purpose
`hello`	node → server	Registration with `room_id` on connect
`audio_start`	node → server	Begins a capture session
`audio_end`	node → server	Ends a capture session, includes `reason`
`wakeword`	node → server	Wake word fired mid-stream
`trigger`	server → node	Server-initiated activation of an idle node
`stop`	server → node	Tells node to stop streaming or TTS playback
`ack`	server → node	Confirms `audio_start` was received
`tts_start`	server → node	Begins TTS playback; includes `sample_rate` and `channels`
`tts_end`	server → node	Ends TTS playback

Binary frames sent server → node between tts_start and tts_end are raw int16 PCM at 24 kHz mono.

Node state machine¶

The node runs three concurrent asyncio tasks:

_audio_loop — pulls frames from a thread-safe queue fed by the sounddevice callback. Runs openwakeword on every frame regardless of state.

IDLE: wake word detected → begin streaming
STREAMING: frames sent to server; VAD/timeout/hard-cap ends session
TTS: wake word detected → interrupt playback, begin new session

_recv_loop — reads inbound messages from the server WebSocket and routes them to _cmd_q (JSON control) or _tts_q (binary PCM).

_cmd_loop — processes _cmd_q: handles trigger, stop, tts_start, tts_end.

State transitions¶

IDLE ──── wake word ────────────────────────────► STREAMING
IDLE ──── server TRIGGER ───────────────────────► STREAMING
IDLE ──── server TTS_START ─────────────────────► TTS

STREAMING ──── silence / no-speech / hard cap ──► IDLE
STREAMING ──── server STOP ─────────────────────► IDLE

TTS ──── server TTS_END ────────────────────────► IDLE (after playback finishes)
TTS ──── server STOP ───────────────────────────► IDLE
TTS ──── wake word ─────────────────────────────► STREAMING (interrupts playback)

Sound playback¶

_SoundPlayer holds a single persistent sd.OutputStream that runs for the lifetime of the process, outputting silence when idle. This avoids the hardware DAC activation pop that occurs when opening a stream from cold. A waiting sound plays between audio_end and the TTS response to give the user feedback during server processing.

Server pipeline¶

TranscribingServer (subclass of AudioServer) implements the full pipeline:

Buffers raw PCM per room in _buffers
On on_session_end: snapshots the buffer, runs _call_stt and _call_speaker in parallel via asyncio.gather, then calls _call_llm, then streams TTS back
A new wake word from the same node cancels any in-flight pipeline task for that room
Mid-stream disconnect triggers on_session_end("disconnect")

Extending the pipeline¶

Subclass TranscribingServer (or AudioServer) and override any hooks:

async def on_session_start(self, session: NodeSession) -> None: ...
async def on_audio_frame(self, session: NodeSession, data: bytes) -> None: ...
async def on_session_end(self, session: NodeSession, reason: str) -> None: ...
async def on_wakeword(self, session: NodeSession, model: str, score: float) -> None: ...

LLM service¶

Deterministic fast path¶

Before the model is consulted, each /process request is run through the registered fast intents (@fast_intent matchers). If one confidently matches — a time/date query, a Home Assistant control command — it answers locally with no remote model call, and the pipeline skips straight to TTS. Only requests that every matcher misses fall through to the tool-calling LLM below. This keeps high-frequency commands like "turn on the lights" at local-parse-plus-one-HTTP-call latency instead of a full model round-trip. See Skills → Two resolution tiers.

The /process response carries an expect_response flag (set by the fast path) reserved for re-opening the mic for a follow-up turn without the wake word.

Conversation history¶

The LLM service maintains a per-room conversation history — a rolling window of the last 10 turns, each expiring 3 minutes after it was recorded. History is injected as real role: user / role: assistant message pairs between the system prompt and the current request, so the model can resolve follow-up references ("tell me more about the second one") naturally. Fast-path responses are recorded too, so context carries across both tiers.

The tool-calling loop executes skills sequentially until the model returns a plain text response or the iteration limit is reached. Internal tool calls are never stored in conversation history — only the final spoken response and the user's utterance are recorded.

Wake-word models¶

Two .tflite models ship with the package: hey_kenzie.tflite (loaded by default) and ken_zee.tflite. Custom models (.tflite or .onnx) can be specified via wakeword_models in configs/node.yaml; the inference framework is inferred from the file extension.