Beach Wreck Ignition:

Challenges in open source voice

Kathy Reid @KathyReid

(formerly) Director of Developer Relations, @Mycroft_AI

Attribution: Vanity 365 Day 55 via Rocky Sun on Flickr.

Attribution: Kitt aus Knight Rider via Marco Verch on Flickr.

Attribution: LCARS desktop via Morn on Wikimedia Commons.

Introduction to the general voice stack

Overview

Voice Stack - components that make up a voice stack
Wake Word - detection that the user wants to issue a command
Speech to Text - transcribing voice sounds into written form
Intent matching - matching utterances to a command
Skills - executing commands
Text to Speech - turning written text into voice sounds
Multilingual considerations - how do you handle this for multiple languages?
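
As a rough sketch of how the components above hand off to one another, the stages can be wired together as plain functions. Everything here is hypothetical: the `VoiceStack` class, its field names, and the toy stand-ins (which treat "audio" as a string) are illustrative assumptions, not the API of Mycroft or any other project.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class VoiceStack:
    """Hypothetical container wiring the stages of a voice stack together."""
    detect_wake_word: Callable[[str], bool]   # Wake Word
    transcribe: Callable[[str], str]          # Speech to Text
    match_intent: Callable[[str], str]        # Intent matching
    skills: Dict[str, Callable[[str], str]]   # Skills, keyed by intent name
    synthesize: Callable[[str], str]          # Text to Speech

    def handle(self, audio: str) -> str:
        # Nothing downstream runs unless the wake word fires.
        if not self.detect_wake_word(audio):
            return ""
        utterance = self.transcribe(audio)
        intent = self.match_intent(utterance)
        reply = self.skills[intent](utterance)
        return self.synthesize(reply)


# Toy stand-ins: "audio" is just a string so the flow can be run end to end.
stack = VoiceStack(
    detect_wake_word=lambda audio: audio.lower().startswith("hey mycroft"),
    transcribe=lambda audio: audio[len("hey mycroft"):].strip(" ,"),
    match_intent=lambda text: "time" if "time" in text else "fallback",
    skills={
        "time": lambda text: "It is twelve o'clock",
        "fallback": lambda text: "Sorry, I did not understand",
    },
    synthesize=lambda text: f"<spoken>{text}</spoken>",
)

print(stack.handle("Hey Mycroft, what time is it"))
print(repr(stack.handle("what time is it")))  # no wake word, so nothing happens
```

Each stage only sees the output of the one before it, so an error early in the chain (a missed wake word, a bad transcription) propagates to everything after it.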

Wake Word

PocketSphinx - https://github.com/cmusphinx/pocketsphinx
Snowboy - https://github.com/Kitt-AI/snowboy
Mycroft AI Precise - https://github.com/MycroftAI/mycroft-precise

Phonemes

"The smallest unit of sound that distinguishes one word from another in a particular language.
Different languages have different phonemes."

Attribution: EnglishClub.com

Similar-sounding phonemes

"p* / b*" sounds - try saying bizza instead of pizza.
"s* / z*" sounds - try saying soo instead of zoo
"k* / g*" sounds - try saying gate instead of Kate

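The pairs above can be turned into a tiny generator of easily confused words, which is one way to think up test phrases for a Wake Word listener. This is only a sketch: it swaps the first letter of the spelling, which is a crude stand-in for real phonemes, and the names `CONFUSABLE_INITIALS` and `confusable_variant` are made up for illustration.

```python
# Similar-sounding pairs from the list above, keyed by initial letter.
# Spelling is only a rough proxy for phonemes; this is illustrative.
CONFUSABLE_INITIALS = {"p": "b", "b": "p", "s": "z", "z": "s", "k": "g", "g": "k"}


def confusable_variant(word: str) -> str:
    """Swap the first letter for its similar-sounding partner, if it has one."""
    if not word:
        return word
    first, rest = word[0], word[1:]
    swapped = CONFUSABLE_INITIALS.get(first.lower())
    if swapped is None:
        return word
    # Preserve the capitalisation of the original first letter.
    return (swapped.upper() if first.isupper() else swapped) + rest


for word in ["pizza", "zoo", "Kate"]:
    print(word, "->", confusable_variant(word))
```
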
Wake Word - Challenges

Always listening - Wake Word listeners are "always on"
Accuracy - False negatives and false positives

Haber's Classification of Contexts

Attribution: Haber, J., Greening, M., Castellano, L., & Wheaton, P. (n.d.). Proxemic Conversational UI: Moving beyond simple conversation.

Attribution: Project Alias via Project Alias

Wake Word - Accuracy

Attribution: bullseye via Emilio Kuffer on Flickr.

Wake Word - measuring accuracy

False positive - failure - Wake Word detected when it wasn't spoken
True positive - success - Wake Word correctly detected when it was spoken
True negative - success - Wake Word not detected when it wasn't spoken
False negative - failure - Wake Word spoken but not detected
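
The four outcomes above can be tallied from a labelled test set and reduced to two rates. The sketch below assumes each test clip comes with a ground-truth flag (was the Wake Word spoken?) and the detector's decision; the class and function names are hypothetical, and the sample data is invented purely to show the arithmetic.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class WakeWordCounts:
    true_positive: int = 0
    false_positive: int = 0
    true_negative: int = 0
    false_negative: int = 0

    def false_accept_rate(self) -> float:
        """Share of clips without the Wake Word that still triggered."""
        negatives = self.false_positive + self.true_negative
        return self.false_positive / negatives if negatives else 0.0

    def false_reject_rate(self) -> float:
        """Share of clips with the Wake Word that were missed."""
        positives = self.true_positive + self.false_negative
        return self.false_negative / positives if positives else 0.0


def tally(spoken: Iterable[bool], detected: Iterable[bool]) -> WakeWordCounts:
    """Count the four outcomes from ground truth and detector decisions."""
    counts = WakeWordCounts()
    for was_spoken, was_detected in zip(spoken, detected):
        if was_spoken and was_detected:
            counts.true_positive += 1
        elif was_detected:
            counts.false_positive += 1
        elif was_spoken:
            counts.false_negative += 1
        else:
            counts.true_negative += 1
    return counts


# Invented sample: eight clips, four of which contain the Wake Word.
spoken = [True, True, True, False, False, False, False, True]
detected = [True, False, True, False, True, False, False, True]
counts = tally(spoken, detected)
print(counts)
print("false accept rate:", counts.false_accept_rate())
print("false reject rate:", counts.false_reject_rate())
```

The two failure types trade off against each other: making the listener more sensitive lowers the false reject rate but raises the false accept rate, and vice versa.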

Speech to Text

Kaldi - https://github.com/kaldi-asr/kaldi
Mozilla DeepSpeech - https://github.com/mozilla/DeepSpeech
Mozilla Common Voice - https://voice.mozilla.org/en

STT - Challenges

Training a model - Amount of data and training required
Accuracy - Accuracy has an impact on voice user experience
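
One common way to put a number on STT accuracy is word error rate (WER): the word-level edit distance between a reference transcript and the engine's output, divided by the number of reference words. The slides do not name a metric, so treat this as one illustrative option; the function below is a minimal sketch, and the sample sentences are invented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One substituted word out of five reference words.
print(word_error_rate("turn on the kitchen light", "turn on the chicken light"))
```

A single wrong word can be enough to change which command is matched, which is why even a low WER can still hurt the voice user experience.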

Consider the phrase

"Yeah nah mate, there's been a bingle in Broady, and the Western's chokkas back to the servo, I'm gonna be late for bevvies at Tommo's."

Translation for non-Australians ;-)


Greetings, friend
There's been a car accident in Broadmeadows
and the Western Freeway is congested
back to the service station
and as a result I will be late
to the social function at Mr Thompson's.

Mycroft Translate - Challenges

Line by line translation - Does not allow for context
Gender - Different languages handle gender differently
Hierarchy - Different language for different formality

Attribution: kia ora mate via @waikatoreo on Twitter.

Intent Parsers

Rasa - https://rasa.com/docs/nlu/
Mycroft Adapt - https://github.com/MycroftAI/adapt
Mycroft Padatious - https://github.com/MycroftAI/padatious

Intent Parser challenges

Intent collisions - Disambiguating intents so that the "most likely" command is invoked for the user

Common Play Framework

CPSMatchLevel.EXACT (The input matches exactly)

CPSMatchLevel.MULTI_KEY (The input contains multiple matches such as Artist and Album title)

CPSMatchLevel.TITLE (The phrase contains a matching title)

CPSMatchLevel.ARTIST (The phrase contains a matching artist)

CPSMatchLevel.CATEGORY (The phrase contains a category supported by the skill, Rock, bitpop, Podcast etc.)

CPSMatchLevel.GENERIC (Generic match, maybe contains the skill title but no media match)

where CPSMatchLevel.EXACT is the highest confidence and CPSMatchLevel.GENERIC is the lowest.
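
To illustrate how an ordering like this resolves a collision between several media skills, the sketch below ranks each skill's reported match level and picks the most confident one. It does not import or reproduce Mycroft's code: `MatchLevel` is a stand-in enum whose numeric values are an assumption made for this example, and `pick_skill` and the skill names are hypothetical.

```python
from enum import IntEnum
from typing import List, Optional, Tuple


class MatchLevel(IntEnum):
    """Illustrative stand-in for CPSMatchLevel; numeric values are assumed."""
    GENERIC = 1
    CATEGORY = 2
    ARTIST = 3
    TITLE = 4
    MULTI_KEY = 5
    EXACT = 6


def pick_skill(responses: List[Tuple[str, Optional[MatchLevel]]]) -> Optional[str]:
    """Return the skill reporting the highest match level, or None if none matched."""
    candidates = [(level, name) for name, level in responses if level is not None]
    if not candidates:
        return None
    # On a tie, max() keeps the first candidate it saw.
    best_level, best_name = max(candidates, key=lambda candidate: candidate[0])
    return best_name


# Hypothetical replies from four skills to the same "play ..." utterance.
responses = [
    ("podcast-skill", MatchLevel.CATEGORY),
    ("music-skill", MatchLevel.MULTI_KEY),
    ("news-skill", None),  # this skill found nothing it could play
    ("radio-skill", MatchLevel.GENERIC),
]
print(pick_skill(responses))
```

Here the skill that matched on multiple keys wins over the ones that only matched a category or the skill name.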

Text to Speech

Mary TTS - http://mary.dfki.de/
Espeak - http://espeak.sourceforge.net/
Mycroft Mimic - https://mycroft.ai/documentation/mimic/
Mycroft Mimic 2 - https://github.com/MycroftAI/mimic2

Text to Speech Challenges

Natural sounding voice - making the voice sound not robotic
Pronunciation - often requires correction

A parting quote

"When the whole world is silent, even one voice becomes powerful."

- MALALA YOUSAFZAI

Thank you :-)

Questions warmly welcomed