Caffe Deep Learning Framework – NVIDIA Jetson TX2

Back in September, we installed the Caffe Deep Learning Framework on a Jetson TX1 Development Kit. With the advent of the Jetson TX2, now is the time to install Caffe and compare the performance difference between the two. Looky here:

Background

As you recall, Caffe is a deep learning framework developed with cleanliness, readability, and speed in mind. It was created by Yangqing Jia during his PhD at UC Berkeley, and is in active development by the Berkeley Vision and Learning Center (BVLC) and by community contributors.

Over the last couple of years, a great deal of progress has been made in speeding up the performance of the supporting underlying software stack. In particular the cuDNN library has been tightly integrated with Caffe, giving a nice bump in performance.

Caffe Installation

A script is available in the JetsonHacks GitHub repository which installs the dependencies for Caffe, downloads the source files, configures the build system, compiles Caffe, and then runs a suite of tests. Passing the tests indicates that Caffe is installed correctly.

This installation demonstration is for an NVIDIA Jetson TX2 running L4T 27.1, a 64-bit Ubuntu 16.04 variant. The installation of L4T 27.1 was done using JetPack 3.0, and includes installation of OpenCV4Tegra, CUDA 8.0, and cuDNN 5.1.

Before starting the installation, you may want to set the CPU and GPU clocks to their maximum by running the jetson_clocks.sh script, which is located in the home directory:

$ sudo ./jetson_clocks.sh

In order to install Caffe:

$ git clone https://github.com/jetsonhacks/installCaffeJTX2.git
$ cd installCaffeJTX2
$ ./installCaffe.sh

Installation should not require intervention; in the video, installing the dependencies and compiling Caffe took about 14 minutes. Running the unit tests takes about 19 minutes. While not strictly necessary, running the unit tests makes sure that the installation is correct.

Test Results

At the end of the video, there are a couple of timed tests which can be compared with the Jetson TX1. The following table adds some more information:

Jetson TK1 vs. Jetson TX1 vs. Jetson TX2 Caffe GPU Example Comparison
Times in milliseconds, averaged over 50 iterations

Machine                   Average FWD   Average BACK   Average FWD-BACK
Jetson TK1 (32-bit OS)    234           243            478
Jetson TX1 (64-bit OS)    80            119            200
Jetson TX2 (Mode Max-Q)   78            97             175
Jetson TX2 (Mode Max-P)   65            85             149
Jetson TX2 (Mode Max-N)   56            75             132

The tests run 50 iterations of the recognition pipeline, and each iteration analyzes 10 different crops of the input image, so divide the ‘Average FWD’ time by 10 to get the timing per recognition result. For the Jetson TX2 in Max-N mode, that means an image recognition takes about 5.6 ms.
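As a sanity check, the per-recognition arithmetic can be sketched out, using the forward-pass values from the table above and the 10-crop batch size described in this post:

```python
# Per-recognition latency derived from the table above: the benchmark
# analyzes 10 crops of the input image per forward pass, so the time
# per recognition is the average forward time divided by 10.
CROPS_PER_PASS = 10

avg_forward_ms = {
    "Jetson TX1":         80,
    "Jetson TX2 (Max-Q)": 78,
    "Jetson TX2 (Max-P)": 65,
    "Jetson TX2 (Max-N)": 56,
}

for machine, fwd_ms in avg_forward_ms.items():
    per_image_ms = fwd_ms / CROPS_PER_PASS
    print(f"{machine}: {per_image_ms:.1f} ms per recognition")
# The Max-N line works out to 5.6 ms, matching the figure above.
```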

The Jetson TX2 introduces the concept of performance modes. The Jetson TX1 has 4 ARM Cortex-A57 CPU cores. In comparison, the Jetson TX2's Tegra SoC has 6 CPU cores: four ARM Cortex-A57 cores and two NVIDIA Denver 2 cores. Depending on performance and power requirements, cores can be taken online or offline, and their clock frequencies set independently. There are five predefined modes, selected with the nvpmodel CLI tool; three of them are used here:

  • sudo nvpmodel -m 1 (Max-Q)
  • sudo nvpmodel -m 2 (Max-P)
  • sudo nvpmodel -m 0 (Max-N)
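The mode IDs above can be summarized as a small lookup table. The core counts and power figures below are as described in this post and are approximate; NVIDIA's documentation is the authoritative reference:

```python
# nvpmodel performance modes on the Jetson TX2, as described in this
# post (approximate; see NVIDIA's docs for authoritative definitions).
NVPMODEL_MODES = {
    0: {"name": "Max-N", "cores": "4x Cortex-A57 + 2x Denver 2", "power": "max clocks"},
    1: {"name": "Max-Q", "cores": "4x Cortex-A57", "power": "~7.5 W budget"},
    2: {"name": "Max-P", "cores": "4x Cortex-A57", "power": "~15 W budget"},
}

def mode_name(mode_id: int) -> str:
    """Return the marketing name for an nvpmodel mode ID."""
    return NVPMODEL_MODES[mode_id]["name"]

print(mode_name(1))  # Max-Q
```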

Max-Q uses only the 4 ARM A57 cores at a minimal clock frequency. Note from the table that this gives performance equivalent to the Jetson TX1. Max-Q sets the power budget to 7.5W, so this represents Jetson TX1 performance at roughly half the power of a TX1 running at full speed!

Max-P also uses only the 4 ARM A57 cores, but at a faster clock frequency. From the table, we can see that the Average Forward Pass drops from the Max-Q value of 78 to the Max-P value of 65. My understanding is that Max-P limits power usage to 15W.

Finally, we can see that in Max-N mode the Jetson TX2 performs best of all. (Note: This wasn’t shown in the video, it’s a special bonus for our readers here!) In addition to the 4 ARM A57 cores, the Denver 2 cores come online, and the clocks on the CPU and the GPU are set to their maximum values. To put it in perspective, the Jetson TX1 at max clock runs the test in roughly 10,000 ms; the Jetson TX2 at Max-N runs the same test in roughly 6,600 ms. Quite a bit of giddy-up.
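Those totals follow directly from the table; a quick sketch of the arithmetic:

```python
# Total test time and speedup, derived from the Average FWD-BACK values
# in the table and the 50-iteration test length described above.
ITERATIONS = 50

tx1_total_ms = 200 * ITERATIONS   # Jetson TX1 at max clocks
tx2_total_ms = 132 * ITERATIONS   # Jetson TX2 in Max-N mode

speedup = tx1_total_ms / tx2_total_ms
print(f"TX1: {tx1_total_ms} ms, TX2 Max-N: {tx2_total_ms} ms")
print(f"Speedup: {speedup:.2f}x")  # roughly 1.5x
```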

Conclusion

Deep learning is in its infancy, and as people explore its potential, the Jetson TX2 seems well positioned to take the lessons learned and deploy them in the embedded computing ecosystem. Several different deep learning platforms are being developed; the improvement in Caffe on the Jetson Dev Kits over the last couple of years is impressive.

Notes

The installation in this video was done directly after flashing L4T 27.1 onto the Jetson TX2, with CUDA 8.0, cuDNN 5.1 and OpenCV4Tegra.

The latest Caffe commit used in the video is: 317d162acbe420c4b2d1faa77b5c18a3841c444c

17 Comments

  1. Appreciate the video, just purchased the TX2 a day ago and expect it soon. Before I do the install, I was looking for some references on installing caffe and came upon your blog. Does this installation include the installation of pycaffe? Thanks

  2. Thanks a lot!
    I need to do just one extra step in the end:
    sudo pip install protobuf

    I could not understand why, though, because installCaffe.sh already installs it.

  3. I’m using a Caffe-trained model on my Jetson TX2. When I use OpenCV (which uses only the CPU) for reading and forward passing the network, I get 1.7 sec for one image (which is divided into 64 patches), approximately 26 ms for each patch. But the problem is that when I use Caffe (using the GPU) for reading and forward passing the network, the time increases to 2.1 sec, which is 32 ms for each patch. My conclusion is that I’m not using the GPU at all! Here is part of my code; would you please help me run my code on the GPU?

    ///
    void Network::useGPU(const bool _useGPU)
    {
        useGPU_ = _useGPU;
        caffe::Caffe::set_mode(useGPU_ ? caffe::Caffe::GPU : caffe::Caffe::CPU);
    }

    const Result Network::classifier(
        const cv::Mat& image,
        const size_t imageSize
    ) throw (utility::Exception)
    {
        caffe::Blob<float>* caffeInput;
        caffe::Blob<float>* caffeOutput;
        cv::Mat blob;
        cv::Mat caffeInputMatrix;
        cv::Mat probabilities;
        Class _class;
        Result result;
        caffe::Timer forwardTimer;

        // Convert image to batch of images
        blob = cv::dnn::blobFromImage(
            image,
            1.0f,
            cv::Size(imageSize, imageSize),
            cv::Scalar(meanPixels[0], meanPixels[1], meanPixels[2]),
            false
        );
        if (useGPU_) {
            // Run Caffe model using Caffe
            caffeInput = caffeNetwork_->input_blobs()[0];
            // Wrap Caffe's input blob to cv::Mat
            caffeInputMatrix = cv::Mat(
                caffeInput->shape(),
                CV_32F,
                (char*) caffeInput->mutable_cpu_data()
            );
            blob.copyTo(caffeInputMatrix);
            // forwardTimer.Start();
            caffeOutput = caffeNetwork_->Forward()[0];
            // std::cout << "Forward Time: " << forwardTimer.MilliSeconds() << std::endl;
            probabilities = cv::Mat(
                caffeOutput->shape(),
                CV_32F,
                (char*) caffeOutput->cpu_data()
            );
        } else {
            network_.setInput(blob, "data");
            probabilities = network_.forward("softmax");
        }
        _class = getClass(probabilities);
        result.label(labels[_class.first]);
        result.probability(_class.second);

        return result;
    }

  4. I checked my CPU usage using cpustat, which shows the CPU working the same in both GPU and CPU mode. Besides, I expect the time to decrease, not increase, when running in GPU mode 😐

    • I do not think those are correct assumptions. Because the CPU and GPU share the same memory on the Jetson and there can be different cache coherency issues, the only way to determine if the GPU is being utilized is to actually check GPU usage. It is possible that GPU code executes more slowly than CPU code under certain circumstances.

  5. One more test I’ve done after your comment: I used only Caffe, once with set_mode(GPU) and once with set_mode(CPU), and the time in both conditions is exactly the same!
    And by the way, how can I check my GPU usage on the Jetson TX2? I tried using gpustat but was not able to run it; it gives the following error:
    Error on querying NVIDIA devices. Use --debug flag for details
    and the debug result is:
    Error on querying NVIDIA devices. Use --debug flag for details
    Traceback (most recent call last):
    File "/usr/local/lib/python2.7/dist-packages/gpustat/__main__.py", line 16, in print_gpustat
    gpu_stats = GPUStatCollection.new_query()
    File "/usr/local/lib/python2.7/dist-packages/gpustat/core.py", line 261, in new_query
    N.nvmlInit()
    File "/usr/local/lib/python2.7/dist-packages/pynvml.py", line 747, in nvmlInit
    _LoadNvmlLibrary()
    File "/usr/local/lib/python2.7/dist-packages/pynvml.py", line 785, in _LoadNvmlLibrary
    _nvmlCheckReturn(NVML_ERROR_LIBRARY_NOT_FOUND)
    File "/usr/local/lib/python2.7/dist-packages/pynvml.py", line 405, in _nvmlCheckReturn
    raise NVMLError(ret)
    NVMLError_LibraryNotFound: NVML Shared Library Not Found

    Is it possible for you to give me Caffe code that runs on the GPU, which I could compare with your table?

  6. Thanks, I used GPU-graph and was able to monitor GPU usage. I trained a simple CNN and the GPU was working!!! But when I run my code the GPU is not working 😐 Is there something wrong with my code?!

  7. Hello. I’ve installed Caffe on the Jetson TX2 the way you described in this article, and also got the pycaffe interface. In the terminal, I can import the caffe package normally.
    But when I ran my Python script (a script that works on other computers), an error was raised: Check failed: status == CUDNN_STATUS_SUCCESS (4 vs. 0) CUDNN_STATUS_INTERNAL_ERROR. Then I used the ‘sudo’ command, but another error occurred: ImportError: No module named caffe.
    I have been stuck on this problem for several days, and sincerely hope someone can help me with it. Thank you.
