Back in September, we installed the Caffe Deep Learning Framework on a Jetson TX1 Development Kit. With the advent of the Jetson TX2, now is the time to install Caffe and compare the performance difference between the two. Looky here:
As you recall, Caffe is a deep learning framework developed with cleanliness, readability, and speed in mind. It was created by Yangqing Jia during his PhD at UC Berkeley, and is in active development by the Berkeley Vision and Learning Center (BVLC) and by community contributors.
Over the last couple of years, a great deal of progress has been made in speeding up the performance of the supporting underlying software stack. In particular the cuDNN library has been tightly integrated with Caffe, giving a nice bump in performance.
A script is available in the JetsonHack Github repository which will install the dependencies for Caffe, download the source files, configure the build system, compile Caffe, and then run a suite of tests. Passing the tests indicates that Caffe is installed correctly.
This installation demonstration is for a NVIDIA Jetson TX2 running L4T 27.1, a 64-bit Ubuntu 16.04 variant. The installation of L4T 27.1 was done using JetPack 3.0, and includes installation of OpenCV4Tegra, CUDA 8.0, and cuDNN 5.1.
Before starting the installation, you may want to set the CPU and GPU clocks to maximum by running the script:
$ sudo ./jetson_clocks.sh
The script is in the home directory.
In order to install Caffe:
$ git clone https://github.com/jetsonhacks/installCaffeJTX2.git
$ cd installCaffeJTX2
Installation should not require intervention, in the video installation of dependencies and compilation took about 14 minutes. Running the unit tests takes about 19 minutes. While not strictly necessary, running the unit tests makes sure that the installation is correct.
At the end of the video, there are a couple of timed tests which can be compared with the Jetson TX1. The following table adds some more information:
|Jetson TK1 vs. Jetson TX1 vs. Jetson TX2 Caffe GPU Example Comparison
10 iterations, times in milliseconds
|Machine||Average FWD||Average BACK||Average FWD-BACK|
|Jetson TK1 (32-bit OS)||234||243||478|
|Jetson TX1 (64-bit OS)||80||119||200|
|Jetson TX2 (Mode Max-Q)||78||97||175|
|Jetson TX2 (Mode Max-P)||65||85||149|
|Jetson TX2 (Mode Max-N)||56||75||132|
The tests are running 50 iterations of the recognition pipeline, and each one is analyzing 10 different crops of the input image, so look at the ‘Average Forward pass’ time and divide by 10 to get the timing per recognition result. For the Max-N version of the Jetson TX2, that means that an image recognition takes about 5.6 ms.
The Jetson TX2 introduces the concept of performance modes. The Jetson TX1 has 4 ARM Cortex A57 CPU cores. In comparison, there are 6 CPU cores in the Tegra T2 SoC. Four are ARM Cortex-A57, the other two are NVIDIA Denver 2. Depending on performance and power requirements the cores can be taken on or offline, and the frequencies of their clocks set independently. There are five predefined modes available through the use of the nvpmodel CLI tool.
- sudo nvpmodel -m 1 (Max-Q)
- sudo nvpmodel -m 2 (Max-P)
- sudo nvpmodel -m 0 (Max-N)
Max-Q uses only the 4 ARM A57 cores at a minimal clock frequency. Note that from the table, this gives performance equivalent to the Jetson TX1. Max-Q sets the power profile to be 7.5W, so this represents Jetson TX1 performance while only using half the amount of power of a TX1 at full speed!
Max-P also uses only the 4 ARM A57 cores, but at a faster clock frequency. From the table, we can see that the Average Forward Pass drops from the Max-Q value of 78 to the Max-P value of 65. My understanding is that Max-P limits power usage to 15W.
Finally, we can see that in Max-N mode the Jetson TX2 performs best of all. (Note: This wasn’t shown in the video, it’s a special bonus for our readers here!) In addition to the 4 ARM A57 cores the Denver 2 cores come on line, and the clocks on the CPU and the GPU are put to their maximum values. To put it in perspective, the Jetson TX1 at max clock runs the test in about ~10000 ms, the Jetson TX2 at Max-N runs the same test in ~6600 ms. Quite a bit of giddy-up.
Deep learning is in its infancy and as people explore its potential, the Jetson TX2 seems well positioned to take the lessons learned and deploy them in the embedded computing ecosystem. There are several different deep learning platforms being developed, the improvement in Caffe on the Jetson Dev Kits over the last couple of years is way impressive.
The installation in this video was done directly after flashing L4T 27.1 on to the Jetson TX2 with CUDA 8.0, cuDNN r5.1 and OpenCV4Tegra.
The latest Caffe commit used in the video is: 317d162acbe420c4b2d1faa77b5c18a3841c444c