Speaker Enrollment
Speaker identification lets Kenzy know who is talking. This enables personalized responses and is required for sensitive operations like locking and unlocking doors.
How it works
During enrollment, Kenzy records several short audio samples from a person and computes a speaker embedding — a compact numerical representation of their voice. At runtime, each captured utterance is compared against all enrolled embeddings using cosine similarity. The speaker with the highest similarity above the configured threshold is returned; otherwise the speaker is reported as unknown.
Embeddings are stored as .npy files in data/speakers/<name>.npy.
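The matching step described above can be sketched in a few lines. This is a minimal illustration, not Kenzy's actual implementation: the function names `cosine_similarity` and `identify` are hypothetical, plain Python lists stand in for the arrays stored in the `.npy` files, and the three-dimensional vectors are toy values (real speaker embeddings are much longer). Treating a score exactly equal to the threshold as unknown is also an assumption.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat it as no match
    return dot / (norm_a * norm_b)


def identify(utterance, enrolled, threshold=0.25):
    """Return (name, score) for the best match, or ("unknown", score).

    enrolled maps a speaker name to that speaker's embedding.
    """
    best_name, best_score = "unknown", -1.0
    for name, embedding in enrolled.items():
        score = cosine_similarity(utterance, embedding)
        if score > best_score:
            best_name, best_score = name, score
    # Assumption: the best score must be strictly above the threshold.
    if best_score > threshold:
        return best_name, best_score
    return "unknown", best_score


# Toy embeddings for illustration only.
enrolled = {"alice": [1.0, 0.0, 0.0], "bob": [0.0, 1.0, 0.0]}
print(identify([0.9, 0.1, 0.0], enrolled))  # close to alice's embedding
print(identify([0.0, 0.0, 1.0], enrolled))  # matches nobody
```

Because only the single best score is compared against the threshold, adding more enrolled speakers never lowers the bar for a match; it only adds candidates for the best score.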
Requirements
- The kenzy-speaker service must be running
- The kenzy-tts service must be running (used to read prompts aloud during enrollment)
- A microphone connected to the machine running kenzy-enroll
Running enrollment
kenzy-enroll [configs/speaker.yaml]
The CLI will:
- Ask for the speaker's name
- Read each enrollment prompt aloud via TTS
- Record the speaker saying the prompt
- Repeat for all prompts
- Compute and save the embedding to data/speakers/<name>.npy
The default prompts are phonetically diverse sentences chosen to capture a broad range of sounds. You can customize them in configs/speaker.yaml under enroll_prompts.
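This page does not say how the recordings from the individual prompts are combined into one embedding. One common approach, shown here purely as an assumption, is to compute an embedding per recording, average them, and scale the result to unit length. The helper name `mean_embedding` is hypothetical, and plain lists stand in for the arrays that would be written to the `.npy` file.

```python
import math


def mean_embedding(sample_embeddings):
    """Average per-sample embeddings and scale the result to unit length."""
    if not sample_embeddings:
        raise ValueError("at least one sample embedding is required")
    dim = len(sample_embeddings[0])
    if any(len(e) != dim for e in sample_embeddings):
        raise ValueError("all sample embeddings must have the same length")

    count = len(sample_embeddings)
    # Element-wise mean across all samples.
    mean = [sum(e[i] for e in sample_embeddings) / count for i in range(dim)]

    norm = math.sqrt(sum(x * x for x in mean))
    if norm == 0.0:
        return mean  # cannot normalize a zero vector; return it unchanged
    return [x / norm for x in mean]


# Two toy per-prompt embeddings (illustrative values only).
speaker_embedding = mean_embedding([[1.0, 0.0], [0.0, 1.0]])
print(speaker_embedding)
```

Averaging over several phonetically diverse prompts is one reason to record more than a single sample: no single sentence dominates the stored representation.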
Re-enrolling a speaker
Run kenzy-enroll again with the same name. The existing embedding file is overwritten.
Removing a speaker
Delete the embedding file:
rm data/speakers/<name>.npy
Restart kenzy-speaker for the change to take effect.
Tuning the identification threshold
The identify_threshold in configs/speaker.yaml controls how strict the match must be:
| Threshold | Behavior |
|---|---|
| 0.20 | Permissive: fewer unknown results, higher risk of misidentification |
| 0.25 | Default: good balance for a home environment |
| 0.30–0.35 | Strict: more unknown results if audio quality varies |
If enrolled speakers are frequently returned as unknown, lower the threshold. If strangers are being matched to enrolled speakers, raise it.
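To see the trade-off concretely, you can collect best-match similarity scores from a few test utterances and count both kinds of error at each candidate threshold. The sketch below assumes you have logged such scores yourself; the `sweep_thresholds` helper and the trial values are hypothetical and are not produced by any Kenzy tool.

```python
def sweep_thresholds(trials, thresholds):
    """Count both error types at each threshold.

    trials is a list of (score, is_correct_speaker) pairs: score is the best
    cosine similarity for one utterance, and is_correct_speaker says whether
    that best match really was the person talking.
    """
    rows = []
    for t in thresholds:
        # The right speaker was found but rejected as unknown.
        false_unknown = sum(1 for s, ok in trials if ok and s <= t)
        # A stranger or the wrong speaker was accepted.
        false_match = sum(1 for s, ok in trials if not ok and s > t)
        rows.append((t, false_unknown, false_match))
    return rows


# Made-up scores for illustration only.
trials = [
    (0.41, True), (0.33, True), (0.27, True), (0.22, True),
    (0.28, False), (0.21, False), (0.12, False),
]
for threshold, false_unknown, false_match in sweep_thresholds(trials, [0.20, 0.25, 0.30]):
    print(threshold, false_unknown, false_match)
# prints:
# 0.2 0 2
# 0.25 1 1
# 0.3 2 0
```

With these made-up scores, lowering the threshold to 0.20 removes every false unknown at the cost of two false matches, while raising it to 0.30 does the reverse.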
Enrollment quality
Record in the room and with the microphone you will use day-to-day. Enrollment done in a quiet studio with a headset will not generalize well to a noisy kitchen with a far-field mic.
Security implications
Speaker identification is not a strong authentication mechanism — it can be fooled by a recording or a similar-sounding voice. It is used as a convenience gate (requiring a recognisable voice for lock/cover operations) rather than a cryptographic security boundary.