NVIDIA’s cuDNN is a GPU-accelerated library of primitives for deep neural networks, designed to be integrated into higher-level machine learning frameworks such as UC Berkeley’s Caffe deep learning framework.
In an earlier blog post, we installed Caffe on a Jetson TK1. Here’s a short video on how to install cuDNN and compile Caffe with cuDNN support. Looky here:
As of this writing, cuDNN requires CUDA 6.5. On the Jetson this means L4T 21.x must be installed. The current release of cuDNN is R2; however, the current implementation of Caffe requires R1, so the video shows installation of R1. The Jetson is running L4T 21.2 with CUDA 6.5.
In order to install cuDNN, go to the NVIDIA cuDNN page and download the cuDNN libraries using your NVIDIA Developer account. There are two versions currently listed on the website, R1 and R2; R2 is the more recent. As noted above, R1 is the version currently used by Caffe. Once you download cuDNN, untar the archive and place the files in your library and include paths. There is a gist on Github to install cuDNN R1 and another gist to install cuDNN R2. The gists are of the form:
$ tar -zxvf cudnn-6.5version.tgz
$ cd cudnn-6.5version
# copy the include file
$ sudo cp cudnn.h /usr/local/cuda-6.5/include
$ sudo cp libcudnn* /usr/local/cuda-6.5/lib
The cuDNN libraries are placed in the cuda-6.5 library directory, a convenient location since CUDA 6.5 already needs to be on your LD_LIBRARY_PATH. Note: You should only install one of the releases.
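If CUDA 6.5 is not already on your paths, additions to ~/.bashrc along these lines will take care of it (a sketch; adjust the directories if your CUDA install differs):

```shell
# Hypothetical ~/.bashrc additions for CUDA 6.5 on the Jetson;
# adjust the paths if CUDA is installed elsewhere
export PATH=/usr/local/cuda-6.5/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-6.5/lib:$LD_LIBRARY_PATH
```

Open a new terminal (or `source ~/.bashrc`) for the change to take effect.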
Installing Caffe is straightforward; here is a Github gist for installation. Basically, install the tools and get Caffe:
Note: Make sure that you are in the ~/ directory, i.e. $ cd ~/ before the next steps.
$ sudo add-apt-repository universe
$ sudo apt-get update
$ sudo apt-get install libprotobuf-dev protobuf-compiler gfortran \
libboost-dev cmake libleveldb-dev libsnappy-dev \
libboost-thread-dev libboost-system-dev \
libatlas-base-dev libhdf5-serial-dev libgflags-dev \
libgoogle-glog-dev liblmdb-dev -y
$ sudo usermod -a -G video $USER
# Git clone Caffe
$ sudo apt-get install -y git
$ git clone https://github.com/BVLC/caffe.git
$ cd caffe && git checkout dev
$ cp Makefile.config.example Makefile.config
At this point, edit Makefile.config:
$ gedit Makefile.config
and uncomment the line:
# USE_CUDNN := 1
Changing the line to:
USE_CUDNN := 1
For the tests, I also uncommented the CUDA_ARCH *_50 lines which are lower down in the file.
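If you prefer to make the cuDNN edit non-interactively, a sed one-liner can flip the switch (a sketch, assuming the stock comment format from Makefile.config.example):

```shell
# Uncomment "# USE_CUDNN := 1" in Makefile.config (run from the caffe directory)
sed -i 's/^# *USE_CUDNN := 1/USE_CUDNN := 1/' Makefile.config
# Verify the change took effect
grep '^USE_CUDNN' Makefile.config
```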
NOTE (6-15-2015): Aaron Schumacher found an issue with some of the later versions of Caffe. From his article: The NVIDIA Jetson TK1 with Caffe on MNIST
Unfortunately master has a really large value for LMDB_MAP_SIZE in src/caffe/util/db.cpp, which confuses our little 32-bit ARM processor on the Jetson, eventually leading to Caffe tests failing with errors like MDB_MAP_FULL: Environment mapsize limit reached. Caffe GitHub issue #1861 has some discussion about this and maybe it will be fixed eventually, but for the moment if you manually adjust the value from 1099511627776 to 536870912, you’ll be able to run all the Caffe tests successfully.
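The adjustment Aaron describes can also be applied with sed (a sketch; it simply swaps the 1 TB map size for 512 MB in db.cpp):

```shell
# Shrink LMDB_MAP_SIZE from 1099511627776 (1 TB) to 536870912 (512 MB)
# so the 32-bit ARM Jetson can run the Caffe tests; run from the caffe directory
sed -i 's/1099511627776/536870912/' src/caffe/util/db.cpp
```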
Then compile Caffe and run the tests:
$ make -j 4 all
$ make -j 4 runtest
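Once the build finishes, one quick sanity check (my own habit, not shown in the video) is to confirm the caffe binary actually linked against cuDNN:

```shell
# List the shared libraries the caffe tool links against, filtering for cuDNN;
# a line such as "libcudnn.so => /usr/local/cuda-6.5/lib/libcudnn.so" means
# the cuDNN build flag took effect
ldd build/tools/caffe | grep cudnn
```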
Caffe with cuDNN Results
In the video the results are shown for running the Caffe time example:
$ build/tools/caffe time --model=models/bvlc_alexnet/deploy.prototxt --gpu=0
Note: These results are the summation of 10 iterations, so per image recognition on the Average Forward Pass is the listed result divided by 10, i.e. 227.156 ms is ~23 ms per image recognition.
I also timed the examples without cuDNN installed:
I0119 22:22:13.032065 2223 caffe.cpp:273] Average Forward pass: 252.767 ms.
I0119 22:22:13.032985 2223 caffe.cpp:275] Average Backward pass: 261.981 ms.
I0119 22:22:13.033064 2223 caffe.cpp:277] Average Forward-Backward: 517.052 ms.
CPU Maximum Performance Setting
I0119 22:23:42.967684 2246 caffe.cpp:273] Average Forward pass: 233.343 ms.
I0119 22:23:42.967722 2246 caffe.cpp:275] Average Backward pass: 247.55 ms.
I0119 22:23:42.967759 2246 caffe.cpp:277] Average Forward-Backward: 481.215 ms.
I0119 22:23:42.967803 2246 caffe.cpp:279] Total Time: 24060.7 ms.
GPU and CPU Maximum Performance
I0119 22:24:59.754598 2261 caffe.cpp:273] Average Forward pass: 233.941 ms.
I0119 22:24:59.754642 2261 caffe.cpp:275] Average Backward pass: 246.8 ms.
I0119 22:24:59.754683 2261 caffe.cpp:277] Average Forward-Backward: 481.099 ms.
I0119 22:24:59.754729 2261 caffe.cpp:279] Total Time: 24055 ms.
With cuDNN installed:
I0119 21:21:15.920301 2080 caffe.cpp:273] Average Forward pass: 248.729 ms.
I0119 21:21:15.920436 2080 caffe.cpp:275] Average Backward pass: 243.773 ms.
I0119 21:21:15.920559 2080 caffe.cpp:277] Average Forward-Backward: 494.648 ms.
I0119 21:21:15.920708 2080 caffe.cpp:279] Total Time: 24732.4 ms.
CPU Maximum Performance Setting
I0119 21:25:51.013579 2228 caffe.cpp:273] Average Forward pass: 225.83 ms.
I0119 21:25:51.013624 2228 caffe.cpp:275] Average Backward pass: 227.208 ms.
I0119 21:25:51.013659 2228 caffe.cpp:277] Average Forward-Backward: 453.36 ms.
I0119 21:25:51.013701 2228 caffe.cpp:279] Total Time: 22668 ms.
GPU and CPU Maximum Performance
I0119 21:27:20.919353 2254 caffe.cpp:273] Average Forward pass: 225.722 ms.
I0119 21:27:20.919394 2254 caffe.cpp:275] Average Backward pass: 227.156 ms.
I0119 21:27:20.919428 2254 caffe.cpp:277] Average Forward-Backward: 453.203 ms.
I0119 21:27:20.919467 2254 caffe.cpp:279] Total Time: 22660.2 ms.
In comparing the numbers, you’ll note that cuDNN produces about the same results as the hand-written code. A couple of things to remember: First, this is cuDNN R1, so the newer release is likely faster. Second, by using a library such as cuDNN, developers of learning frameworks such as Caffe can concentrate on their particular problem set and avoid hand-writing CUDA code for common tasks.
I was surprised that cranking the GPU clocks up to their maximum didn’t shorten the times significantly. Interestingly, this suggests there are bottlenecks elsewhere.
Compared with the Big Iron
The Caffe website states:
Caffe can process over 40M images per day with a single NVIDIA K40 or Titan GPU*. That’s 5 ms/image in training, and 2 ms/image in test. We believe that Caffe is the fastest CNN implementation available.
Those results are about an order of magnitude faster than the TK1 on a Jetson. Why is that interesting?
While absolutely faster, the big GPU cards require a lot of power, probably starting at a minimum of 500 watts for an ultra-performance GPU card. In comparison, a Jetson requires about 10 watts. Math is not my strong point, but I could imagine putting together a cluster of Jetsons that produces similar results with a much smaller power footprint.
By leveraging an NVIDIA-supported library such as cuDNN, developers can take advantage of fast algorithms tuned to the GPU architecture without having to worry about the minutiae of hand-writing CUDA code. The advantage multiplies as hardware architectures and feature sets change.