Beach Wreck Ignition:

Challenges in open source voice

Kathy Reid @KathyReid

(formerly) Director of Developer Relations, @Mycroft_AI

Attribution: Vanity 365 Day 55 via Rocky Sun on Flickr.

Attribution: Kitt aus Knight Rider via Marco Verch on Flickr.

Attribution: LCARS desktop via Morn on Wikimedia Commons.

Introduction to the general voice stack

Overview

  • Voice Stack - components that make up a voice stack
  • Wake Word - detection that the user wants to issue a command
  • Speech to Text - transcribing voice sounds into written form
  • Intent matching - matching utterances to a command
  • Skills - executing commands
  • Text to Speech - turning written text into voice sounds
  • Multilingual considerations - how do you handle this for multiple languages?

Wake Word

  • PocketSphinx - https://github.com/cmusphinx/pocketsphinx
  • Snowboy - https://github.com/Kitt-AI/snowboy
  • Mycroft AI Precise - https://github.com/MycroftAI/mycroft-precise

Phonemes

"The smallest unit of sound that distinguishes one word from another in a particular language.
Different languages have different phonemes."

Attribution: EnglishClub.com

Similar-sounding phonemes

  • "p* / b*" sounds - try saying "bizza" instead of "pizza"
  • "s* / z*" sounds - try saying "soo" instead of "zoo"
  • "k* / g*" sounds - try saying "gate" instead of "Kate"
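
Each pair above differs only in voicing, which is one reason a listener can mistake one word for the other. A minimal sketch of that idea, assuming letters as a rough stand-in for phonemes (real systems compare phoneme sequences from a pronunciation dictionary); `VOICING_PAIRS` and `differs_only_by_voicing` are hypothetical names used for illustration only:

```python
# Letter-level stand-ins for the voiced/unvoiced pairs listed above.
# Assumption for illustration: real phonemes are not letters.
VOICING_PAIRS = {frozenset("pb"), frozenset("sz"), frozenset("kg")}


def differs_only_by_voicing(a: str, b: str) -> bool:
    """Return True if the words differ in exactly one position and that
    position swaps a sound for its voiced or unvoiced partner."""
    a, b = a.lower(), b.lower()
    if len(a) != len(b):
        return False
    diffs = [(x, y) for x, y in zip(a, b) if x != y]
    return len(diffs) == 1 and frozenset(diffs[0]) in VOICING_PAIRS


for pair in [("pizza", "bizza"), ("zoo", "soo"), ("Kate", "gate"), ("zoo", "moo")]:
    print(pair, differs_only_by_voicing(*pair))
```

A Wake Word built from sounds like these may be more prone to confusion with everyday words that differ by a single voicing swap.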

Wake Word - Challenges

  • Always listening - Wake Word listeners are "always on"
  • Accuracy - False negatives and false positives

Haber's Classification of Contexts

Attribution: Haber, J., Greening, M., Castellano, L., & Wheaton, P. (n.d.). Proxemic Conversational UI: Moving beyond simple conversation.

Attribution: Project Alias via Project Alias

Wake Word - Accuracy

Attribution: bullseye via Emilio Kuffer on Flickr.

Wake Word - measuring accuracy

  • False positive - failure - Wake Word detected when it wasn't spoken
  • True positive - success - Wake Word correctly detected when it was spoken
  • True negative - success - Wake Word not detected when it wasn't spoken
  • False negative - failure - Wake Word spoken but not detected
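
The four outcomes above can be tallied and turned into summary rates. This is a minimal sketch, not taken from any of the Wake Word projects listed earlier: `WakeWordCounts` is a hypothetical helper, and the counts in the example are invented purely to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class WakeWordCounts:
    true_positive: int   # Wake Word spoken and detected
    false_positive: int  # Wake Word detected when it wasn't spoken
    true_negative: int   # Wake Word not spoken and not detected
    false_negative: int  # Wake Word spoken but not detected

    def precision(self) -> float:
        """Of all detections, the fraction that were real Wake Words."""
        detected = self.true_positive + self.false_positive
        return self.true_positive / detected if detected else 0.0

    def recall(self) -> float:
        """Of all spoken Wake Words, the fraction that were detected."""
        spoken = self.true_positive + self.false_negative
        return self.true_positive / spoken if spoken else 0.0

    def false_positive_rate(self) -> float:
        """Of all audio without the Wake Word, the fraction that triggered."""
        not_spoken = self.false_positive + self.true_negative
        return self.false_positive / not_spoken if not_spoken else 0.0


# Invented counts, for illustration only.
counts = WakeWordCounts(true_positive=90, false_positive=5,
                        true_negative=895, false_negative=10)
print(f"precision={counts.precision():.3f}")
print(f"recall={counts.recall():.3f}")
print(f"false positive rate={counts.false_positive_rate():.4f}")
```

Tuning a detector usually trades one failure for the other: a lower threshold reduces false negatives but tends to raise false positives.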

Speech to Text

  • Kaldi - https://github.com/kaldi-asr/kaldi
  • Mozilla DeepSpeech - https://github.com/mozilla/DeepSpeech
  • Mozilla Common Voice - https://voice.mozilla.org/en

STT - Challenges

  • Training a model - Amount of data and training required
  • Accuracy - Accuracy has an impact on voice user experience
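
One common way to put a number on STT accuracy is word error rate (WER): the word-level edit distance between a reference transcript and the recognised text, divided by the number of reference words. The slides do not name a metric, so treat this as a sketch under that assumption; `word_error_rate` and the example phrases are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts, for illustration only.
print(word_error_rate("turn on the light", "turn of the light"))
```

A single substituted word in a four-word command already gives a WER of 0.25, which is often enough to change which command gets run.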

Consider the phrase

"Yeah nah mate, there's been a bingle in Broady, and the Western's chokkas back to the servo, I'm gonna be late for bevvies at Tommo's."

Translation for non-Australians ;-)


Greetings, friend
There's been a car accident in Broadmeadows
and the Western Freeway is congested
back to the service station
and as a result I will be late
to the social function at Mr Thompson's.

Mycroft Translate - Challenges

  • Line by line translation - Does not allow for context
  • Gender - Different languages handle gender differently
  • Hierarchy - Different levels of formality call for different language

Intent Parsers

  • Rasa - https://rasa.com/docs/nlu/
  • Mycroft Adapt - https://github.com/MycroftAI/adapt
  • Mycroft Padatious - https://github.com/MycroftAI/padatious

Intent Parser challenges

  • Intent collisions - Disambiguating intents so that the "most likely" command is invoked for the user
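
One simple way to handle a collision is to rank candidate intents by confidence and ask the user to clarify when the top two are too close. This is a minimal sketch, not how Rasa, Adapt or Padatious resolve collisions; `IntentMatch`, `resolve`, the `margin` parameter and the skill names are all hypothetical.

```python
from typing import List, NamedTuple, Optional


class IntentMatch(NamedTuple):
    skill: str
    intent: str
    confidence: float  # 0.0 (no match) to 1.0 (certain)


def resolve(matches: List[IntentMatch], margin: float = 0.1) -> Optional[IntentMatch]:
    """Return the most likely match, or None when there are no candidates
    or the top two are too close to call and the user should be asked."""
    ranked = sorted(matches, key=lambda m: m.confidence, reverse=True)
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0].confidence - ranked[1].confidence < margin:
        return None
    return ranked[0]


# Hypothetical candidates for the utterance "play the news".
clear = [IntentMatch("news-skill", "play_news", 0.9),
         IntentMatch("music-skill", "play_track", 0.55)]
close = [IntentMatch("news-skill", "play_news", 0.8),
         IntentMatch("podcast-skill", "play_podcast", 0.78)]

print(resolve(clear))  # a single clear winner
print(resolve(close))  # None: too close, ask the user
```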

Common Play Framework

CPSMatchLevel.EXACT (The input matches exactly)

CPSMatchLevel.MULTI_KEY (The input contains multiple matches such as Artist and Album title)

CPSMatchLevel.TITLE (The phrase contains a matching title)

CPSMatchLevel.ARTIST (The phrase contains a matching artist)

CPSMatchLevel.CATEGORY (The phrase contains a category supported by the skill, Rock, bitpop, Podcast etc.)

CPSMatchLevel.GENERIC (Generic match, maybe contains the skill title but no media match)


where CPSMatchLevel.EXACT is the highest confidence and CPSMatchLevel.GENERIC is the lowest.
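
The ordering above lets a framework pick one skill when several claim they can play something. A minimal sketch of that selection, assuming a stand-in `MatchLevel` enum that mirrors the level names on this slide; it is not the Mycroft class, and the skill names and replies are hypothetical.

```python
from enum import IntEnum


class MatchLevel(IntEnum):
    # Stand-in for the levels listed above; higher value = greater confidence.
    GENERIC = 1
    CATEGORY = 2
    ARTIST = 3
    TITLE = 4
    MULTI_KEY = 5
    EXACT = 6


# Hypothetical replies from three skills to the same "play ..." request.
replies = [
    ("podcast-skill", MatchLevel.GENERIC),
    ("radio-skill", MatchLevel.CATEGORY),
    ("music-skill", MatchLevel.ARTIST),
]

# The skill reporting the most confident match level handles the request.
best_skill, best_level = max(replies, key=lambda reply: reply[1])
print(best_skill, best_level.name)
```

Because every skill reports on the same scale, a skill with only a generic match does not take the request away from one that recognised the artist or title.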

Text to Speech

  • Mary TTS
    - http://mary.dfki.de/
  • Espeak
    - http://espeak.sourceforge.net/
  • Mycroft Mimic
    - https://mycroft.ai/documentation/mimic/
  • Mycroft Mimic 2
    - https://github.com/MycroftAI/mimic2

Text to Speech Challenges

  • Natural sounding voice - making the voice sound not robotic
  • Pronunciation - often requires correction

A parting quote

"When the whole world is silent, even one voice becomes powerful."

- MALALA YOUSAFZAI

Thank you :-)

Questions warmly welcomed
