Speech recognition has been an area of great improvement in the mobile device field over the last several years. In part, this is due to advances in speech recognition algorithms, but it has also been bolstered by increases in mobile device computing power. The JetsonBot leverages these advances to include voice commands as a control mechanism. Looky here:
Speech recognition has been around for several decades. About 20 years ago, Carnegie Mellon University started developing Sphinx, an open source speech recognition and acoustic model training system. Sphinx is a speaker-independent, continuous speech recognizer which uses Hidden Markov Models (HMMs) and statistical language models. This can be thought of as an early application of deep learning, where a large corpus of speech was used to ‘train’ speech recognizers. An example of one of the early training sets was the oral arguments for the US Supreme Court.
Typically a trainer takes a transcript of speech and an audio recording of that speech as input, and builds a dictionary of the matching phonemes. The speech recognizer then uses this dictionary to try to understand incoming audio by breaking the audio into phonemes and performing a predictive dictionary match. Typically a speech recognizer on an embedded device has a small dictionary on board to do speech lookup. As you might have guessed, someone figured out that you could send audio from a phone-type device to the ‘cloud’ and do speech recognition as a service. For example, this approach is used by Apple Siri. For more in-depth information on such systems, there happens to be a link to a paper on the JetsonHacks site: Sirius: Open End-to-End Voice and Vision Intelligent Personal Assistant.
However, for a small dictionary, like that for commanding a robot, it’s easy to keep the dictionary on board the device. Another project that CMU works on, PocketSphinx, is optimized to work on embedded systems like the Jetson TK1. Icing on the cake? There is a ROS package which supports PocketSphinx!
Michael Ferguson wrote a simple wrapper around PocketSphinx for ROS using Gstreamer and Python. You might remember Gstreamer from some of the earlier articles on image input. Gstreamer does audio input too! Gstreamer takes an incoming audio stream and breaks it down into utterances to be recognized by PocketSphinx via a Python interface.
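For the curious, the pipeline the wrapper builds looks roughly like the following sketch (shown here as a gst-launch command you could experiment with, not the wrapper’s exact code; it assumes a PulseAudio input source). The ‘vader’ element is the voice activity detector that chops the stream into utterances before they reach the pocketsphinx element:

```shell
$ gst-launch-0.10 pulsesrc ! audioconvert ! audioresample ! vader auto-threshold=true ! pocketsphinx ! fakesink
```

Each element hands its output to the next: raw audio is converted and resampled to what the recognizer expects, split into utterances, recognized, and then discarded by fakesink (the ROS wrapper instead picks up the recognition results through the Python interface).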
Installation of ROS PocketSphinx is straightforward. You can use the install script in the JetsonHacks Github repository. The script is to be executed on the JetsonBot:
# Add the gstreamer dependencies
sudo apt-get install gstreamer0.10-pocketsphinx gstreamer0.10-gconf python-gst0.10 -y
# Install virtual x11 for running pocketsphinx on JetsonBot without attached display
sudo apt-get install xvfb -y
# Assume that the installation for the JetsonBot is in ~/jetsonbot
git clone https://github.com/jetsonhacks/pocketsphinx.git
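After cloning, the package needs to be built. Assuming ~/jetsonbot is a standard catkin workspace and the repository was cloned into its src directory, something like the following should do it:

```shell
$ cd ~/jetsonbot
$ catkin_make
$ source devel/setup.bash
```

Sourcing the setup file makes the new pocketsphinx package visible to roslaunch in that Terminal.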
Of course, to be able to actually talk to the JetsonBot you’ll need some type of audio device. The on-board ASUS Xtion has a microphone, but unfortunately it’s sitting on top of the robot, so it can’t hear all that well when the robot is in motion. A better method is to use a wireless headset of some type. For the demos, I used a Jawbone ERA Bluetooth headset which has since been discontinued, but it worked with the Jetson TK1.
Pair the headset of choice with the Jetson, and you are ready to modify the Jetson voice command launch file. The launch file should be located in: ~/jetsonbot/src/pocketsphinx/demo/jetsonbot_voice_cmd.launch
You will have to modify the ‘mic_name’ parameter to match the name of the audio input device. You can get the name of the audio input device from:
$ pacmd list-sources
Your audio device will have a ‘name:‘ field associated with it. Watch the video for an example if need be. If you remove the mic_name parameter from the launch file, the default audio input will be used.
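The pacmd listing can be long; filtering it down to the interesting fields makes the device name easier to spot. For example (the actual name will depend on your headset):

```shell
$ pacmd list-sources | grep -e 'index:' -e 'name:'
```

Copy the value of the ‘name:‘ field for your headset into the mic_name parameter in the launch file.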
Note: In the jetsonbot_voice_cmd.launch file the “cmd_vel” topic has been remapped to “cmd_vel_mux/input/teleop” to match what the JetsonBot expects for command sequences; this is different from the other examples in the folder.
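In launch-file terms, that remapping is a single line inside the node definition (a sketch; the actual file may differ slightly):

```
<remap from="cmd_vel" to="cmd_vel_mux/input/teleop"/>
```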
Once you have the headset paired and the software installed, you can start interfacing from the remote Robot Operating Center (ROC) machine. In the example, the ROC is being run on a Jetson TK1.
On the ROC, open three Terminals. When the JetsonBot finishes booting, SSH into the JetsonBot in each Terminal. In the first Terminal, launch ROS. In the example, the simple minimal.launch script was used.
In the second Terminal, start a virtual X11 session. Gstreamer will use this session to process events from the audio input device. Because the JetsonBot does not have a connected display, there is no X11 session normally present. So a virtual one is used:
$ Xvfb :1 &
In the third Terminal, launch PocketSphinx. First, make sure that the headset is paired with the JetsonBot. Then set up the Terminal process to use the virtual X11 session and launch PocketSphinx:
$ export DISPLAY=:1
$ roslaunch pocketsphinx jetsonbot_voice_cmd.launch
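If you want to check that utterances are being recognized before the robot starts moving, you can watch the recognizer’s output in another Terminal (this assumes the standard /recognizer/output topic published by the ROS PocketSphinx wrapper):

```shell
$ rostopic echo /recognizer/output
```

Each recognized phrase should appear as a string message as you speak.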
At that point, the JetsonBot should respond to voice commands. The list of commands is in: ~/jetsonbot/src/pocketsphinx/demo/voice_cmd_corpus
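A corpus file is just a plain-text list of the phrases the recognizer should match, one per line. A hypothetical example (the actual JetsonBot corpus may differ):

```
forward
back
left
right
stop
half speed
full speed
```

Editing this list (and regenerating the language model from it) is how you would grow the command set.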
After playing with the voice commands on the JetsonBot, you’ll notice that there are a lot of false positives with this command set. That’s one of the areas that needs improvement going forward, along with a larger command set. For example, it would be nice to say “turn left 90 degrees”. Fortunately this is mostly a matter of fiddling with the ROS recognizer Python script; it is not an inherent issue with PocketSphinx itself.
I think what is ideally desired is the ability to “talk” over the recognizer and have it reject normal conversation. In most science fiction settings this starts by addressing the robot with something like “Computer …” just to signal that you’re asking a question or delivering a command. On your phone or TV box it’s usually preceded by pressing a button to indicate that a command or question is coming. Also, some type of feedback from the robot acknowledging that it is processing a voice command would be useful. But isn’t that what the future is for?