This article starts a new series on Speech Recognition. A “smart microphone” is an array of microphones with special signal processing hardware to locate and isolate speech, even in noisy environments. Looky here:
All the cool kids now have in-home, voice-activated devices like Amazon Echo or Google Home. These devices can play your favorite music, answer questions, read books, control home automation, and do all those other things people in the 1960s thought the future was about. For the most part, the speech recognition on the devices works well, although you may find yourself with an extra dollhouse or two occasionally.
One of the enabling technologies of these devices is the microphone array. Several microphones are placed in a circle, and their outputs are sent to a Digital Signal Processor, or DSP for short. The DSP runs several special algorithms that detect where a voice originates (localization), use audio beamforming to focus on that direction, and reduce echo and reverberation in the signal. The result is an audio stream that is an accurate representation of the original voice.
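To make localization and beamforming a little more concrete, here is a minimal two-microphone sketch in plain Python. It is not what the DSP in any particular product runs; it is one textbook approach (cross-correlation to find the arrival delay, then delay-and-sum to line the channels up). The function names, the 16 kHz sample rate, the 0.1 m microphone spacing, and the test burst are all assumptions made up for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # metres per second, approximate value for room-temperature air


def best_lag(reference, other, max_lag):
    """Return the shift (in samples) at which `other` best lines up with `reference`."""
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, sample in enumerate(reference):
            j = i + lag
            if 0 <= j < len(other):
                score += sample * other[j]
        if score > best_score:
            best, best_score = lag, score
    return best


def bearing_degrees(lag, sample_rate, spacing):
    """Convert a lag between two microphones into an angle from broadside (far-field model)."""
    ratio = SPEED_OF_SOUND * (lag / sample_rate) / spacing
    # Clamp so measurement noise cannot push the ratio outside asin's domain.
    return math.degrees(math.asin(max(-1.0, min(1.0, ratio))))


def delay_and_sum(channels, lags):
    """Advance each channel by its lag so the voice lines up, then average the channels."""
    length = len(channels[0])
    out = [0.0] * length
    for channel, lag in zip(channels, lags):
        for i in range(length):
            j = i + lag
            if 0 <= j < length:
                out[i] += channel[j]
    return [value / len(channels) for value in out]


# Made-up test signal: a short burst that reaches the second microphone 3 samples late.
mic_a = [0.0] * 10 + [1.0, -2.0, 3.0, -1.0, 2.0] + [0.0] * 10
mic_b = [0.0] * 3 + mic_a[:-3]

lag = best_lag(mic_a, mic_b, max_lag=8)                       # localization: estimate the delay
angle = bearing_degrees(lag, sample_rate=16000, spacing=0.1)  # assumed rate and spacing
focused = delay_and_sum([mic_a, mic_b], [0, lag])             # beamforming: align and average
print(lag, round(angle, 1))
```

Sound arriving from the estimated direction adds up coherently after alignment, while sound from other directions (and uncorrelated noise) tends to average down. A real array does this across more microphones and with fractional delays.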
Once a suitable audio stream has been acquired, the stream can be either processed locally or sent to a server for further processing. In the case of something like an Amazon Echo, a local processor “listens” to the incoming audio stream for a keyword trigger, e.g., “Alexa”. Once the keyword has been identified, the rest of the audio stream is sent to online servers, which run speech recognition on the stream and then parse the result into “actions”. The service then sends the action back to the device. These actions vary from device to device, but typically allow the user to ask the device to play music, control home automation devices, or answer questions. Amazon, Google and Microsoft all have APIs to interface their online services with audio.
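The keyword-then-stream flow described above can be sketched as a small gate that drops everything until the trigger is heard, and then forwards frames until the speaker pauses. This is a toy illustration only, not how the Echo or any vendor's service is implemented: `gate_stream`, the two detector callbacks, and the `max_silence` cutoff are hypothetical names, and strings stand in for audio frames.

```python
def gate_stream(frames, is_wake_word, is_silence, max_silence=3):
    """Yield only the frames that follow a wake word, stopping after a long pause."""
    listening = False
    quiet = 0
    for frame in frames:
        if not listening:
            # Nothing leaves the device until the keyword is spotted locally.
            if is_wake_word(frame):
                listening = True
                quiet = 0
            continue
        if is_silence(frame):
            quiet += 1
            if quiet >= max_silence:
                listening = False  # request is over; go back to waiting
            continue
        quiet = 0
        yield frame  # in a real device this frame would be sent to the online service


# Strings stand in for audio frames; real detectors would work on samples.
frames = ["tv", "noise", "alexa", "play", "music", "", "", "", "private", "chat"]
forwarded = list(gate_stream(frames,
                             is_wake_word=lambda f: f == "alexa",
                             is_silence=lambda f: f == ""))
print(forwarded)
```

Only the frames between the trigger and the pause are forwarded; everything before the keyword and after the request stays on the device.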
The online services have large databases, which they have used along with machine learning techniques to train their speech recognizers. You may have noticed that many of the online services have become significantly better at recognizing speech over the last couple of years. This improvement is mostly due to advances in machine learning.
Speech Recognition for the rest of us
The consumer devices are interesting, and the technology for smart microphones is now available separately from several manufacturers. In the video, a Seeed Studio Respeaker is shown. There are several other manufacturers; the Respeaker in the video was ordered through a Kickstarter campaign.
The Far Field Microphone Array is built around an XVSM-2000 chip from XMOS. Watch the video for a rundown of the rest of the fun hardware that is available on the Respeaker, with sprinkles like RGB LEDs and an Arduino-type processor. The Jetson can talk to either the Respeaker Core or the Microphone Array over USB.
Over the course of the next few articles, we’ll figure out how to interface with the Microphone Array, gather the audio stream, and then perform speech recognition both locally and through online services.