Caffe Deep Learning Framework – 64-bit NVIDIA Jetson TX1

Back in February, we installed Caffe on the TX1. At the time, the TX1 was running a 32-bit version of L4T 23.1. With the advent of the 64-bit L4T 24.2, this seems like a good time to do a performance comparison of the two. The TX1 can now do an image recognition in about 8 ms! For the install and test, Looky Here:

Background

As you recall, Caffe is a deep learning framework developed with cleanliness, readability, and speed in mind. It was created by Yangqing Jia during his PhD at UC Berkeley, and is in active development by the Berkeley Vision and Learning Center (BVLC) and by community contributors.

The L4T 23.1 Operating System release was a 64-bit kernel supporting a 32-bit user space. For the L4T 24.2 release, both the kernel and the user space are 64-bit.
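A quick, generic way to check which flavor you are running (a rough sketch, not specific to L4T; the expected values below assume a 64-bit L4T 24.2 install):

```shell
# Kernel architecture: reports "aarch64" on L4T 24.2
uname -m
# Word size the user-space toolchain was built for:
# 64 on L4T 24.2, 32 on the L4T 23.1 user space
getconf LONG_BIT
```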

Caffe Installation

A script is available in the JetsonHacks GitHub repository which installs the dependencies for Caffe, downloads the source files, configures the build system, compiles Caffe, and then runs a suite of tests. Passing the tests indicates that Caffe is installed correctly.

This installation demonstration is for an NVIDIA Jetson TX1 running L4T 24.2, an Ubuntu 16.04 variant. The installation of L4T 24.2 was done using JetPack 2.3, and includes installation of OpenCV4Tegra, CUDA 8.0, and cuDNN 5.1.

Before starting the installation, you may want to set the CPU and GPU clocks to maximum by running the script:

$ sudo ./jetson_clocks.sh

The script is in the home directory, and is also included in the installCaffeJTX1 repository for convenience.

In order to install Caffe:

$ git clone https://github.com/jetsonhacks/installCaffeJTX1.git
$ cd installCaffeJTX1
$ ./installCaffe.sh

Installation should not require intervention; in the video, installing the dependencies and compiling Caffe took about 10 minutes. Running the unit tests takes about 45 minutes. While not strictly necessary, running the unit tests verifies that the installation is correct.

Test Results

At the end of the video, there are a couple of timed tests which can be compared with the Jetson TK1 and with the previous installation:

Jetson TK1 vs. Jetson TX1 Caffe GPU Example Comparison
(10 iterations, times in milliseconds)

Machine                                Average FWD   Average BACK   Average FWD-BACK
Jetson TK1 (32-bit OS)                     234           243             478
Jetson TX1 (32-bit OS)                     179           144             324
Jetson TX1 with cuDNN (32-bit OS)          103           117             224
Jetson TX1 (64-bit OS)                     110           122             233
Jetson TX1 with cuDNN (64-bit OS)           80           119             200

There is definitely a performance improvement between the 32-bit and 64-bit releases. A couple of factors contribute. One is the change from a 32-bit to a 64-bit operating system. Another is the improvement in the supporting CUDA and cuDNN libraries between the releases. Considering that the tests run on exactly the same hardware, the performance boost is impressive. Using cuDNN provides a huge gain in the forward pass tests.
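As a rough sanity check on that claim, the relative gain between the two cuDNN builds can be computed directly from the forward-pass numbers in the table (values taken from the table above; simple shell arithmetic, integer precision):

```shell
# Average FWD times (ms) from the cuDNN rows of the table
fwd_32bit=103
fwd_64bit=80
# Integer percentage improvement: (103 - 80) * 100 / 103
echo "$(( (fwd_32bit - fwd_64bit) * 100 / fwd_32bit ))% faster forward pass"
```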

The tests run 50 iterations of the recognition pipeline, and each iteration analyzes 10 different crops of the input image, so divide the ‘Average Forward pass’ time by 10 to get the time per recognition result. For the 64-bit version, that means an image recognition takes about 8 ms.
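The arithmetic is simple enough to spell out (using the 80 ms forward-pass figure from the 64-bit cuDNN row of the table):

```shell
# 'Average Forward pass' covers 10 crops of the input image,
# so divide by 10 for the per-recognition time
avg_fwd_ms=80
echo "$(( avg_fwd_ms / 10 )) ms per image recognition"
```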

NVCaffe

It is worth mentioning that NVCaffe is a special branch of Caffe for the TX1 which includes support for FP16. The tests above use FP32. In many cases, FP32 and FP16 give very similar results, but FP16 is faster. For example, in the tests above, the FP16 Average Forward Pass finishes in about 60 ms, which works out to about 6 ms per image recognition!
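The same per-image arithmetic applies to the FP16 figure, and comparing it against the 80 ms FP32 forward pass gives a rough speedup ratio (both values taken from the text and table above):

```shell
# FP16 'Average Forward Pass' of ~60 ms over 10 crops -> ~6 ms per image
fp16_fwd_ms=60
echo "$(( fp16_fwd_ms / 10 )) ms per image recognition"
# Approximate speedup over the 80 ms FP32 forward pass
awk 'BEGIN { printf "%.1fx\n", 80 / 60 }'
```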

Conclusion

Deep learning is in its infancy, and as people explore its potential the Jetson TX1 seems well positioned to take the lessons learned and deploy them in the embedded computing ecosystem. Several different deep learning platforms are being developed; the improvement in Caffe on the Jetson Dev Kits over the last couple of years is quite impressive.

Notes

The installation in this video was done directly after flashing L4T 24.2 onto the Jetson TX1 with CUDA 8.0, cuDNN r5.1, and OpenCV4Tegra. Git was then installed:

$ sudo apt-get install git

The latest Caffe commit used in the video is: 80f44100e19fd371ff55beb3ec2ad5919fb6ac43

11 Comments

  1. Thanks for the great manual!

    Build worked flawlessly; however, the tests get stuck.
    Sometimes after just a few tests, sometimes after a few dozen, but they always get stuck. No errors.

    The board is brand new, full clean L4T using latest JetPack. jetson_clocks.sh executed.

    Any suggestions on how to debug the issue?

    • What version of L4T are you using?
      Also, do you have any idea how much memory is being used? You can try to open up the System Monitor while it is running and make sure that memory pressure isn’t causing the issue.
      Another thing to try is to turn off the jetson_clocks.sh script; some people have reported issues there.

      • ubuntu@tegra-ubuntu:~/caffe$ head -n 1 /etc/nv_tegra_release
        # R24 (release), REVISION: 2.1, GCID: 8028265, BOARD: t210ref, EABI: aarch64, DATE: Thu Nov 10 03:51:59 UTC 2016

        When it gets stuck, I seem to have plenty of memory:
        ubuntu@tegra-ubuntu:~$ free
                      total     used     free   shared  buff/cache  available
        Mem:        4090604  2197844   521628    42648     1371132    2154284
        Swap:             0        0        0

        Tried disabling jetson_clocks.sh and running the test without -j4.
        Same result.

        One more data point: it freezes on different tests, but always on some variant of gradient testing…

        [ RUN ] CuDNNConvolutionLayerTest/0.TestGradientGroupCuDNN
        or
        [ RUN ] ConvolutionLayerTest/2.Test1x1Gradient
        or
        [ RUN ] InnerProductLayerTest/2.TestGradientTranspose
        or
        [ RUN ] RNNLayerTest/3.TestGradientNonZeroContBufferSize2
        etc.

        Any way to see what it gets stuck on exactly?

          • NVCaffe has quite a few issues:
            1) I had to fix include path
            2) Had to create links to some of the libraries, so it finds them at linking
            3) Some of tests fail
            4) One of the tests runs out of memory and aborts “make runtest”…

            So I am back to your version and trying to debug it.
            According to GDB, the hanging tests are spinning around “usleep” inside
            cuMemcpy, probably waiting for the transfer to finish. Forever.

            Are there any non-caffe, Cuda tests that I can run on TX1?
            Maybe it’s just that this specific board/CPU is faulty and I should RMA it…

