Back in October 2014, Google’s Pete Warden wrote an interesting article: How to run the Caffe deep learning vision library on Nvidia’s Jetson mobile GPU board. At the time, I thought, “What fun!” However, I noticed in the article that there were issues with running Caffe on CUDA 6.5, which was just being introduced in L4T 21.1.
After the holiday break, I realized that enough time had passed that most of those issues had probably been worked out, since we are now on L4T 21.2, and that I would be able to run Caffe in all its CUDA 6.5 goodness. In fact, I could, and with even better results than the original! Looky here:
Here’s the install script on Github: installCaffe.sh. You may want to season it to taste, as it does run the examples towards the end of the script. After you download the script, set the Permissions of the file to ‘Allow executing file as program’ in the file ‘Properties’ dialog before executing.
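For terminal users, the Properties-dialog step above is just setting the executable bit; a minimal sketch (assuming the script has already been downloaded into the current directory) looks like this:

```shell
# Command-line equivalent of 'Allow executing file as program'
chmod +x installCaffe.sh

# Run the installer; note it also runs the examples near the end
./installCaffe.sh
```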
NOTE: This installation was done on a clean system, flashed using JetPack. JetPack installed L4T 21.2, CUDA 6.5, and OpenCV.
NOTE (6-15-2015): Aaron Schumacher found an issue with some of the later versions of Caffe. From his article: The NVIDIA Jetson TK1 with Caffe on MNIST
Unfortunately master has a really large value for LMDB_MAP_SIZE in src/caffe/util/db.cpp, which confuses our little 32-bit ARM processor on the Jetson, eventually leading to Caffe tests failing with errors like MDB_MAP_FULL: Environment mapsize limit reached. Caffe GitHub issue #1861 has some discussion about this and maybe it will be fixed eventually, but for the moment if you manually adjust the value from 1099511627776 to 536870912, you’ll be able to run all the Caffe tests successfully.
NOTE (7-10-2015): Corey Thompson also adds:
To get the LMDB portion of tests to work, make sure to also update examples/mnist/convert_mnist_data.cpp as well:
examples/mnist/convert_mnist_data.cpp:89:56: warning: large integer implicitly truncated to unsigned type [-Woverflow]
CHECK_EQ(mdb_env_set_mapsize(mdb_env, 1099511627776), MDB_SUCCESS) // 1TB
adjust the value from 1099511627776 to 536870912.
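Taken together, the two notes above amount to replacing the 1 TB map size with 512 MB in both files. A quick way to do that is with sed, run from the root of the Caffe source tree (this assumes the stock paths and the stock value of 1099511627776; verify the edits afterward):

```shell
# Shrink LMDB's map size from 1 TB (1099511627776) to 512 MB (536870912)
# so the 32-bit ARM processor on the Jetson can handle it.
sed -i 's/1099511627776/536870912/' src/caffe/util/db.cpp
sed -i 's/1099511627776/536870912/' examples/mnist/convert_mnist_data.cpp
```

After the change, the Caffe tests should run without the MDB_MAP_FULL error.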
Just as a reminder, Caffe is a deep learning framework developed with cleanliness, readability, and speed in mind. It was created by Yangqing Jia during his PhD at UC Berkeley, and is in active development by the Berkeley Vision and Learning Center (BVLC) and by community contributors.
So why is that an interesting topic? The quick answer is that deep learning is currently all the rage in the “in the know” community. You’ll see it talked about in various terms, but here’s some high-level background from Wikipedia: Deep Learning.
Caffe comes with a pre-built ‘AlexNet’ model that recognizes 1,000 different kinds of objects.
The reason this is an interesting topic on the Jetson is the speed at which images can be recognized once a model has been trained, taking advantage of the Jetson’s CUDA GPU. In the example, recognition takes around 24 milliseconds per image. And what does that mean? As an example, it means that when you’re driving your car, on-board cameras can recognize traffic signs better and faster. Deep learning also means better speech recognition is possible, and better alignment of audio and video with transcripts. All sorts of things, too numerous to list here.
Note: In the video, the results are posted as ~235 ms for the ‘Average Forward Pass’. That is the time for an iteration of 10 images, so the timing for one image recognition is around 24 ms.
I’ll note here that Deep Learning also makes it possible to find cute kitten and puppy pictures faster than ever before, and pray that the servers hosting this blog don’t get crushed or melted from all the web traffic. (Sorry, that just seems like a gratuitous mention to increase traffic).
I did play around with some of the parameters of the Jetson board when running the example. In Pete Warden’s test, he was able to get 34 ms per recognition using CUDA 6.0. Not surprisingly, CUDA 6.5 is mo’ better and faster (the Caffe code is probably better now too), and I was able to get around 27 ms per recognition. But I also used another trick: clocking the CPUs into performance mode, which lowered the recognition time to 24 ms. The maxCPU script sets the appropriate flags. Overall, that’s quite a difference in speed. Note that I haven’t used cuDNN or reclocked the GPU just yet. This is all within a power budget of around 10 watts!
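In essence, putting the CPUs into performance mode means writing to the standard Linux cpufreq governor files; a minimal sketch of what a script like maxCPU does might look like this (the sysfs path is the usual cpufreq layout and requires root; the CPUFREQ_ROOT variable is my own addition for testability):

```shell
# Force every online CPU's frequency governor to 'performance' (run as root).
# CPUFREQ_ROOT defaults to the standard sysfs location.
CPUFREQ_ROOT=${CPUFREQ_ROOT:-/sys/devices/system/cpu}
for gov in "$CPUFREQ_ROOT"/cpu[0-9]*/cpufreq/scaling_governor; do
    [ -w "$gov" ] && echo performance > "$gov"
done
```

This keeps the cores at their maximum clock instead of letting the ondemand governor scale them down between recognition runs.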