Speaker Enrollment

Speaker identification lets Kenzy know who is talking. This enables personalized responses and is required for sensitive operations like locking and unlocking doors.

How it works

During enrollment, Kenzy records several short audio samples from a person and computes a speaker embedding — a compact numerical representation of their voice. At runtime, each captured utterance is compared against all enrolled embeddings using cosine similarity. The speaker with the highest similarity above the configured threshold is returned; otherwise the speaker is reported as unknown.
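The matching step described above can be sketched in a few lines of plain Python. This is an illustration only, not Kenzy's actual implementation: the function names `cosine_similarity` and `identify`, the use of plain lists for embeddings, and the `(name, score)` return shape are all assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction, so treat it as no match
    return dot / (norm_a * norm_b)


def identify(utterance, enrolled, threshold=0.25):
    """Return (name, score) for the best match, or ("unknown", score).

    `enrolled` is a hypothetical dict mapping speaker name -> embedding.
    """
    best_name, best_score = None, -1.0
    for name, embedding in enrolled.items():
        score = cosine_similarity(utterance, embedding)
        if score > best_score:
            best_name, best_score = name, score
    # Only accept the best candidate if it clears the threshold.
    if best_name is not None and best_score > threshold:
        return best_name, best_score
    return "unknown", best_score
```

Note that only the single best candidate is compared against the threshold, so adding more enrolled speakers never makes a given match stricter or looser.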

Each embedding is stored as a NumPy .npy file at data/speakers/<name>.npy.
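Because each speaker corresponds to one file, the set of enrolled speakers can be inferred from the file names alone. The helper below is a hypothetical convenience, not part of Kenzy; it assumes the default data/speakers directory and only inspects file names, without loading the vectors.

```python
from pathlib import Path


def enrolled_speakers(speaker_dir="data/speakers"):
    """Return enrolled speaker names, inferred from <name>.npy file names."""
    directory = Path(speaker_dir)
    if not directory.is_dir():
        return []  # nothing enrolled yet, or a different data directory is in use
    return sorted(path.stem for path in directory.glob("*.npy"))


if __name__ == "__main__":
    for name in enrolled_speakers():
        print(name)
```

The vectors themselves can be read with `numpy.load` if you need to inspect them.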

Requirements

- The kenzy-speaker service must be running
- The kenzy-tts service must be running (used to read prompts aloud during enrollment)
- A microphone connected to the machine running kenzy-enroll

Running enrollment

kenzy-enroll [configs/speaker.yaml]

The CLI will:

1. Ask for the speaker's name
2. Read each enrollment prompt aloud via TTS
3. Record the speaker saying the prompt
4. Repeat for all prompts
5. Compute and save the embedding to data/speakers/<name>.npy
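How the recorded samples are combined into a single embedding is not specified here. One common approach, shown purely as an assumption, is to average the per-sample embeddings and scale the result to unit length. The function name `combine_embeddings` and the list-of-floats representation are hypothetical.

```python
import math


def combine_embeddings(samples):
    """Average per-sample embeddings and scale the result to unit length.

    Assumption for illustration: Kenzy may combine samples differently.
    """
    if not samples:
        raise ValueError("at least one sample embedding is required")
    dim = len(samples[0])
    if any(len(sample) != dim for sample in samples):
        raise ValueError("all sample embeddings must have the same length")
    # Element-wise mean across all recorded samples.
    mean = [sum(sample[i] for sample in samples) / len(samples) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    # Leave an all-zero mean untouched to avoid dividing by zero.
    return [x / norm for x in mean] if norm > 0.0 else mean
```

Averaging over several prompts tends to smooth out per-recording variation, which is one reason to record more than one sample.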

The default prompts are phonetically diverse sentences chosen to capture a broad range of sounds. You can customize them in configs/speaker.yaml under enroll_prompts.

Re-enrolling a speaker

Run kenzy-enroll again with the same name. The existing embedding file is overwritten.

Removing a speaker

Delete the embedding file:

rm data/speakers/<name>.npy

Restart kenzy-speaker for the change to take effect.

Tuning the identification threshold

The identify_threshold in configs/speaker.yaml controls how strict the match must be:

| Threshold | Behavior |
|---|---|
| `0.20` | Permissive: fewer `unknown` results, higher risk of misidentification |
| `0.25` | Default: good balance for a home environment |
| `0.30–0.35` | Strict: more `unknown` results if audio quality varies |

If enrolled speakers are frequently returned as unknown, lower the threshold. If strangers are being matched to enrolled speakers, raise it.
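If you have similarity scores logged from real use, you can make this trade-off concrete by counting how many genuine attempts would be rejected and how many impostor attempts would be accepted at each candidate threshold. The sketch below is not a Kenzy tool: `error_rates` is a hypothetical helper, and the scores are made-up numbers for illustration only.

```python
def error_rates(genuine_scores, impostor_scores, threshold):
    """Return (false_reject_rate, false_accept_rate) at a given threshold."""
    if not genuine_scores or not impostor_scores:
        raise ValueError("both score lists must be non-empty")
    # A genuine speaker is rejected when their score does not exceed the threshold.
    false_rejects = sum(1 for score in genuine_scores if score <= threshold)
    # An impostor is accepted when their score exceeds the threshold.
    false_accepts = sum(1 for score in impostor_scores if score > threshold)
    return (false_rejects / len(genuine_scores),
            false_accepts / len(impostor_scores))


# Made-up scores for illustration only; substitute values from your own setup.
genuine = [0.42, 0.31, 0.27, 0.55, 0.22]   # enrolled speaker vs. own embedding
impostor = [0.05, 0.12, 0.21, 0.28, 0.09]  # other voices vs. that embedding

for threshold in (0.20, 0.25, 0.30, 0.35):
    frr, far = error_rates(genuine, impostor, threshold)
    print(f"threshold={threshold:.2f}  false rejects={frr:.0%}  false accepts={far:.0%}")
```

Raising the threshold can only increase false rejects and decrease false accepts, so the useful setting is the lowest one at which strangers stop being matched.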

Enrollment quality

Record in the room and with the microphone you will use day-to-day. Enrollment done in a quiet studio with a headset will not generalize well to a noisy kitchen with a far-field mic.

Security implications

Speaker identification is not a strong authentication mechanism — it can be fooled by a recording or a similar-sounding voice. It is used as a convenience gate (requiring a recognisable voice for lock/cover operations) rather than a cryptographic security boundary.