Voice control is becoming a popular interface, with hands-free capabilities that make daily tasks easier and quicker. How exactly does this technology seem to magically respond to your client’s every command? Here are 16 voice control keywords that help explain how it all works.
1. Far-Field Microphones
Personal computing devices have had microphones for a long time, but they don’t work well from far away. Far-field microphones, on the other hand, are arrays of mics that use their positions in space to amplify some signals and reduce others. This makes it possible to speak from across the room in a “hands-free” environment. Using algorithms that suppress surrounding noise, these microphones deliver a clear and easily understandable signal. The far-field voice experience is enhanced by other technologies, some of which are defined below, including barge-in, beamforming, noise reduction, acoustic echo cancellation, and automatic speech recognition. Because the array uses the distance between microphones in its calculations, it’s hard to shrink these devices below a minimum size.
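To see why spacing matters, here is a rough back-of-the-envelope sketch. It assumes sound travels at about 343 m/s in room-temperature air and that the device samples audio at 16 kHz; the function name and both example spacings are illustrative choices, not figures from any particular product.

```python
# Assumption: approximate speed of sound in room-temperature air.
SPEED_OF_SOUND_M_PER_S = 343.0


def max_delay_in_samples(mic_spacing_m, sample_rate_hz):
    """Largest arrival-time difference between two mics, measured in audio samples."""
    return mic_spacing_m / SPEED_OF_SOUND_M_PER_S * sample_rate_hz


# Hypothetical spacings: 10 cm apart versus 1 cm apart, both sampled at 16 kHz.
wide = max_delay_in_samples(0.10, 16_000)
narrow = max_delay_in_samples(0.01, 16_000)
print(f"10 cm spacing: {wide:.2f} samples, 1 cm spacing: {narrow:.2f} samples")
```

With the mics only a centimeter apart, the largest possible delay is under one sample, so the array has very little timing information to work with. That is one plausible reading of the size limit described above.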
2. Barge-In
Imagine playing music or watching TV with a nearby far-field microphone. Trying to yell over the noise can be quite difficult. This is where “barge-in” technology comes in. With “barge-in,” the listening microphone is aware of the audio source and able to digitally remove it, thus reducing noise and increasing accuracy. Amazon Echo is a great example of this technology. Saying “Alexa” while it’s playing music will interrupt the music and alert Alexa to listen for your next command. Unfortunately, this is much harder when the music comes from a source external to the device, because the microphone no longer knows what is being played, but expect that to improve over time.
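The idea of removing a known audio source can be sketched in a few lines. This is a deliberately simplified illustration, not how any specific product does it: it assumes the playback reaches the microphone as a scaled copy with no delay or room echo, whereas real devices typically rely on adaptive echo-cancellation filters. The function name and the toy signals are made up for the example.

```python
def remove_known_playback(mic, playback):
    """Subtract the best-fitting scaled copy of `playback` from the `mic` signal."""
    energy = sum(p * p for p in playback)
    if energy == 0:
        return list(mic)
    # Least-squares estimate of how loudly the playback shows up at the mic.
    gain = sum(m * p for m, p in zip(mic, playback)) / energy
    return [m - gain * p for m, p in zip(mic, playback)]


# Toy signals (assumed for illustration): the device knows what it is playing.
playback = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
speech = [0.0, 0.0, 0.5, 0.5, -0.5, -0.5, 0.0, 0.0]
mic = [s + 0.8 * p for s, p in zip(speech, playback)]

cleaned = remove_known_playback(mic, playback)
print([round(c, 3) for c in cleaned])
```

The subtraction only works because the device has the playback samples. When the music comes from a separate speaker, there is no reference signal to subtract.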
3. Beamforming
Imagine you have a far-field microphone in a room with a TV on one side and you on the other. Even if the TV is relatively loud, beamforming technology enables the microphones to amplify your speech and reduce the noise from the TV, effectively making it easy to be heard in a loud environment. This is particularly useful in automotive applications, where the driver is always in a fixed location and noise in front of the car can be reduced. Unfortunately, if you take the earlier example and move next to the TV, beamforming can’t separate your voice from the TV because both sounds arrive from the same direction, which is why beamforming by itself is not a perfect solution.
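One classic way to do this is delay-and-sum beamforming, sketched below for two microphones. The setup is an assumption made for illustration: the voice reaches mic B one sample after mic A, and the TV, on the opposite side, reaches mic A one sample after mic B. Real systems estimate these delays from the array geometry and usually need fractional delays.

```python
def delayed(signal, samples):
    """Return `signal` delayed by `samples`, padded with silence at the start."""
    return [0.0] * samples + signal[: len(signal) - samples]


def delay_and_sum(channels, delays):
    """Average the channels after advancing each one by its steering delay."""
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[i + d] for ch, d in zip(channels, delays)) / len(channels)
        for i in range(length)
    ]


# Toy sources (assumed): a slow "voice" wave and a faster "TV" wave.
voice = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5] * 2
tv = [1.0, 1.0, -1.0, -1.0] * 4

mic_a = [v + t for v, t in zip(voice, delayed(tv, 1))]
mic_b = [v + t for v, t in zip(delayed(voice, 1), tv)]

# Steer toward the talker: advance mic B by the one-sample voice delay.
steered = delay_and_sum([mic_a, mic_b], delays=[0, 1])
print(steered)
```

After alignment the two copies of the voice add up, while the two copies of the TV signal end up out of step and, in this contrived case, cancel. If the talker stood next to the TV, both sources would share the same delays and no steering choice could separate them.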
4. Microphone Array
We’ve mentioned this term a couple of times, but it’s important to define on its own. A microphone array is a single piece of hardware with multiple individual microphones operating in tandem. This improves voice recognition accuracy by accepting sounds from multiple directions, regardless of background noise, the position of the microphone, and the speaker placement.
5. Automatic Speech Recognition
Often abbreviated as ASR, automatic speech recognition is the conversion of spoken language into written text. When you say “Hey Siri” and follow with “…send a text,” you’re seeing ASR in action. In other words, “speech rec” (as it’s sometimes shortened) makes it possible for computers to know what you’re saying.
6. Speaker Recognition
Although easy to confuse with speech recognition, speaker recognition is the specific art of determining who is speaking. This is achieved based on the characteristics of voices and a variety of technologies including Markov models, pattern recognition algorithms, and neural networks (defined below). Another term you might hear related to speaker recognition is “voice biometrics,” which refers to the technology behind speaker rec. There are two major applications of speaker recognition: 1) verification, which aims to verify whether the speaker is who they claim to be, and 2) identification, the task of determining an unknown speaker’s identity.
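The difference between verification and identification can be shown with a small sketch. It assumes each enrolled user is represented by a short numeric “voiceprint” vector and that similarity is measured with cosine similarity; the names, vectors, and the 0.9 threshold are all hypothetical values chosen for illustration.

```python
import math


def cosine_similarity(a, b):
    """Similarity of two vectors: 1.0 means they point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical enrolled voiceprints (real ones have far more dimensions).
ENROLLED = {
    "alex": [0.9, 0.1, 0.3],
    "sam": [0.2, 0.8, 0.5],
}


def verify(claimed_name, voiceprint, threshold=0.9):
    """Verification: is the speaker who they claim to be?"""
    return cosine_similarity(ENROLLED[claimed_name], voiceprint) >= threshold


def identify(voiceprint):
    """Identification: which enrolled speaker is the closest match?"""
    return max(ENROLLED, key=lambda name: cosine_similarity(ENROLLED[name], voiceprint))


sample = [0.85, 0.15, 0.35]  # assumed voiceprint taken from a new utterance
print(verify("alex", sample), verify("sam", sample), identify(sample))
```

Verification answers a yes-or-no question about one claimed identity, while identification searches across everyone who is enrolled.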
7. Markov Models
Rooted in probability theory, a Markov model describes a randomly changing system in which the next state depends only on the current one, which makes it useful for forecasting future states. A great example is the predictive text you’ve probably seen on your iPhone. If you type “I love,” the system can predict the next word to be “you” based on probability. There are four types of Markov models, including Markov chains and hidden Markov models. If you’re interested in learning more, we suggest the Clemson University Intro to Markov Models. Markov models are very important in speech recognition because they work similarly to how humans process text. The sentences “make the lights red” and “make the lights read” are pronounced the same, but knowing which word sequence is more probable helps ensure accurate speech recognition.
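Both the predictive-text example and the “red” versus “read” example can be sketched with a tiny Markov chain over words, often called a bigram model. The training sentences below are invented for illustration, and real systems learn from vastly more text.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count how often each word is followed by each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current, following in zip(words, words[1:]):
            counts[current][following] += 1
    return counts


def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


def probability(counts, word, following):
    """Estimated probability that `following` comes right after `word`."""
    followers = counts.get(word.lower(), Counter())
    total = sum(followers.values())
    return followers[following] / total if total else 0.0


# Made-up training text for the sketch.
corpus = [
    "i love you",
    "i love you too",
    "i love pizza",
    "make the lights red",
    "turn the lights red",
    "make the lights blue",
]
counts = train_bigrams(corpus)
print(predict_next(counts, "love"))
print(probability(counts, "lights", "red"), probability(counts, "lights", "read"))
```

Because “red” follows “lights” in the training text and “read” never does, the model prefers “make the lights red” even though the two sentences sound identical.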
8. Pattern Recognition
As the name suggests, this is a branch of machine learning that utilizes patterns and regularities in data to train systems. There’s a lot to pattern rec, with algorithms aiding in classification, clustering, learning, predicting, regression, sequencing, and more. Pattern recognition is very important in speech recognition, where it helps determine which sounds form which words.
9. Artificial Neural Networks
A computer system modeled on how we believe the human brain works, neural networks utilize artificial neurons to learn how to solve problems that typical rule-based systems struggle with. For example, neural networks are essential for facial recognition, self-driving cars, and, of course, voice control. For a great, if highly technical, article on how neural networks are used in speech recognition, see this post by Andrew Gibiansky.
10. Natural Language Processing (NLP)
When a computer can analyze, understand, and derive meaning from human language, it is utilizing natural language processing. NLP covers a range of applications including syntax, semantics, discourse, and speech. One example is named entity recognition, available in tools such as Stanford CoreNLP, which tags the names of people, places, and organizations in a sentence.
11. Natural Language Understanding (NLU)
NLU is a subtopic of natural language processing in artificial intelligence that deals with machine reading comprehension. This gives the user flexibility when speaking to the system, as it understands the intent. Whether you say, “turn off the lights” in “my room,” “the bedroom,” or “Alex’s room,” you can get the same desired result. NLU focuses on the problem of handling unstructured inputs governed by poorly defined and flexible rules and converts them to a structured form that a machine can understand and act upon. While humans are able to effortlessly handle mispronunciations, swapped words, contractions, colloquialisms, and other quirks, machines are less adept at handling unpredictable inputs. In other words, NLU focuses on the machine’s ability to understand what we say.
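The “same desired result” idea can be sketched as a function that turns differently worded requests into one structured command. This is a toy, rule-based illustration only: production NLU relies on trained models rather than a single pattern, and the alias table, intent name, and output format here are hypothetical.

```python
import re

# Hypothetical aliases: several ways of naming the same room.
ROOM_ALIASES = {
    "my room": "bedroom",
    "the bedroom": "bedroom",
    "alex's room": "bedroom",
}

# One pattern for the sketch; real systems cover far more phrasings.
LIGHTS_PATTERN = re.compile(r"turn (?P<state>on|off) the lights in (?P<room>.+)")


def parse(utterance):
    """Convert free-form text into a structured command, or None if not understood."""
    text = utterance.lower().strip().rstrip(".!?").replace("’", "'")
    match = LIGHTS_PATTERN.fullmatch(text)
    if match is None:
        return None
    room = ROOM_ALIASES.get(match["room"])
    if room is None:
        return None
    return {"intent": "set_lights", "state": match["state"], "room": room}


requests = [
    "Turn off the lights in my room",
    "Turn off the lights in the bedroom.",
    "Turn off the lights in Alex’s room",
]
parsed = [parse(request) for request in requests]
print(parsed[0])
```

All three requests produce the same structured output, which is the form a home-control system can actually act on.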
12. Anaphora Resolution
Anaphora resolution is the act of recalling earlier references and properly responding to their associated pronouns. By saying, “Turn on the TV,” and later saying, “Turn it up,” there is an implied understanding that “it” refers to the TV’s volume. This is very important when it comes to natural speech control, particularly in the home.
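A minimal way to picture this is a dialogue context that remembers the most recently mentioned device and substitutes it for “it.” The sketch below assumes a small fixed list of device names and a single pronoun; real assistants handle far more varied references, and the class and method names are made up for the example.

```python
# Assumed vocabulary of controllable devices for this sketch.
DEVICES = ("tv", "lights", "music")


class DialogueContext:
    """Remembers the last device mentioned so that "it" can be resolved later."""

    def __init__(self):
        self.last_device = None

    def resolve(self, utterance):
        words = utterance.lower().rstrip(".!?").split()
        for word in words:
            if word in DEVICES:
                self.last_device = word
        if self.last_device is not None:
            # Replace the pronoun with the remembered device.
            words = ["the " + self.last_device if w == "it" else w for w in words]
        return " ".join(words)


context = DialogueContext()
first = context.resolve("Turn on the TV")
second = context.resolve("Turn it up")
print(first)
print(second)
```

The second request only makes sense because the context carried the TV forward from the first one.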
13. Compound Commands
This is the ability to understand and process multiple commands uttered in a single breath. For example, “Turn off the lights, stop the music, and watch Black Mirror.”
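As a rough sketch, the first step is splitting one utterance into separate commands. The version below assumes commands are separated by commas or the word “and,” which is a simplification: a title such as “Law and Order” would be split incorrectly, so real systems need more than punctuation to find the boundaries.

```python
import re


def split_commands(utterance):
    """Split a single utterance into individual commands on commas and "and"."""
    text = utterance.strip().rstrip(".!?")
    parts = re.split(r",\s*(?:and\s+)?|\s+and\s+", text)
    return [part.strip() for part in parts if part.strip()]


commands = split_commands("Turn off the lights, stop the music, and watch Black Mirror.")
for command in commands:
    print(command)
```

Each resulting command can then be interpreted on its own, just as if it had been spoken separately.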
14. Virtual Assistant
A software agent that can perform tasks or services for an individual can be referred to as a virtual assistant. For example, a chatbot is a virtual assistant that is accessed via online chat. By combining NLU with automatic speech recognition, Alexa can act as a virtual assistant that completes daily tasks, such as ordering pizza or an Uber.
15. Voice User Interface (VUI)
You’ve probably heard the term GUI (graphical user interface). A VUI is a voice user interface, which relates to how a user interacts with a voice assistant. As prevalent as they are becoming, VUIs are not without their challenges. People have little patience for a machine that doesn’t understand, so there is little room for error. VUIs need to respond to input reliably and fail gracefully when they can’t. Designing a good VUI requires interdisciplinary skills in computer science, linguistics, and psychology. Constructing an effective VUI requires an in-depth understanding of both the tasks to be performed and the target audience using it. If designed properly, a VUI requires little or no training and provides a delightful user experience.
16. Wake Word
When you say “Alexa” or “Hey Google,” you’re activating a wake word, also known as a hot word or key word. Wake word detection typically runs on the local device, which is why “always listening” devices generally need direct power and are difficult to run on batteries. Once the wake word is heard, the voice assistant is activated and speech is typically processed in the cloud. Wake words need to be fine-tuned in order to work untrained with most users, which is why it’s tough to arbitrarily choose a wake word and expect it to work well. That said, when you do train a wake word, hard stop sounds like “k” in “okay” or “x” in “Alexa,” as well as multiple syllables, help increase reliability.
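The gating role of the wake word can be sketched as follows. Note the big simplification: real wake word detectors run on the audio itself, whereas this example works on already-transcribed text purely to show the flow. The function name is hypothetical; the idea is that nothing is passed along for further processing unless the wake word comes first.

```python
WAKE_WORDS = ("alexa", "hey google")


def command_after_wake_word(transcript):
    """Return the command that follows a wake word, or None if no wake word led it."""
    text = transcript.lower().strip()
    for wake_word in WAKE_WORDS:
        if text.startswith(wake_word):
            rest = text[len(wake_word):]
            # Require a word boundary so "alexander" does not trigger "alexa".
            if not rest or rest[0] in " ,":
                return rest.lstrip(" ,")
    return None


heard = command_after_wake_word("Alexa, play some jazz")
ignored = command_after_wake_word("Play some jazz")
print(heard, ignored)
```

Only the first utterance would be handed off, for example to a cloud service, while the second is discarded on the device.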