Research Paper: Accelerating Deep Convolutional Neural Networks Using Specialized Hardware

Microsoft Research just published a paper on using FPGAs to accelerate machine learning: Accelerating Deep Convolutional Neural Networks Using Specialized Hardware. This is an interesting alternative to using GPUs for the same task.

This brings up a point that I’ve been giving a lot of thought to recently. 15 years ago the fastest computer in the world ran at around 5,000 Gigaflops (RMAX) and used ~2,000 kW of power. 5 years ago, this had increased to ~2,500 Teraflops (RMAX) consuming 4,040 kW. Ten years had brought a speed increase of 500 times at a cost of only about 2x the power consumption. Monetarily, it was about $640 USD per Gigaflop in 2000 and around $2-3 USD per Gigaflop in 2010.

Fast forward to today, at the end of 2014, where the fastest machine runs at 33,000 Teraflops (RMAX) using 17,800 kW. Cost per Gigaflop is around $0.05 to $0.08. With these numbers, the cost of acquisition starts to look relatively small compared to the cost of the power needed to actually run the machine.
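To make the arithmetic behind these figures concrete, here is a rough sketch in Python. The performance, power, and per-Gigaflop numbers are the ones quoted above; the electricity price is purely an assumption for illustration (it is not from any source), and the function names are made up for this example.

```python
# Figures quoted in the post: Rmax in Gigaflops, power draw in kW.
machines = {
    "2000": {"gflops": 5_000, "kw": 2_000},
    "2010": {"gflops": 2_500_000, "kw": 4_040},
    "2014": {"gflops": 33_000_000, "kw": 17_800},
}

# ASSUMPTION: a flat electricity price, chosen only for illustration.
ASSUMED_USD_PER_KWH = 0.10
HOURS_PER_YEAR = 24 * 365


def gflops_per_watt(machine):
    """Sustained Gigaflops delivered per watt of power drawn."""
    return machine["gflops"] / (machine["kw"] * 1_000)


def annual_power_cost_usd(kw, usd_per_kwh=ASSUMED_USD_PER_KWH):
    """Cost of running a machine flat out for one year at the assumed price."""
    return kw * HOURS_PER_YEAR * usd_per_kwh


def acquisition_cost_usd(gflops, usd_per_gflop):
    """Purchase cost implied by a per-Gigaflop price."""
    return gflops * usd_per_gflop


# 2000 -> 2010: how much faster, and how much more power?
speedup = machines["2010"]["gflops"] / machines["2000"]["gflops"]
power_ratio = machines["2010"]["kw"] / machines["2000"]["kw"]
print(f"2000 -> 2010: {speedup:.0f}x the speed for {power_ratio:.2f}x the power")

for year, machine in machines.items():
    print(f"{year}: {gflops_per_watt(machine):.4f} Gigaflops per watt")

# 2014: one year of power versus the implied purchase price.
m2014 = machines["2014"]
power_cost = annual_power_cost_usd(m2014["kw"])
buy_low = acquisition_cost_usd(m2014["gflops"], 0.05)
buy_high = acquisition_cost_usd(m2014["gflops"], 0.08)
print(f"2014 power for one year (assumed price): ${power_cost:,.0f}")
print(f"2014 implied acquisition cost: ${buy_low:,.0f} to ${buy_high:,.0f}")
```

Taking the per-Gigaflop figure at face value and assuming $0.10 per kWh, a single year of power (about $15.6 million) comes out well above the implied purchase price ($1.65 to $2.64 million). A different electricity price changes the totals but would have to be several times lower to reverse the comparison.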

What we expect to see is exactly what Microsoft is discussing: special-purpose machines that provide equivalent performance in a much smaller power footprint. Why? Because that’s where the real money, and therefore the constraint, really lies.

There has been a lot of talk over the last few years about reaching the end of Moore’s law. In fact, the situation feels more interesting than that: there are going to be many more CPU/GPU cores or specialized computing units all running in parallel, where the actual resource constraint is the amount of power that they (don’t) draw. The raw speed of the computers doesn’t seem to be the bottleneck; rather, it is the ability of software to parallelize algorithms to take advantage of all the computing units available.

Can FPGAs actually outperform GPUs in the long run? Certainly in the big iron case, against GPUs such as the NVIDIA Kepler K40/K80, FPGAs can kick some tail. But what about the case where power use is much more optimized, such as a mobile System on a Chip (SoC)? Tegra K1 comes to mind (360 GFlops in 10 watts), or the newly announced Tegra X1 (1 Teraflop in ~10 watts), as well as Apple’s A8 and other chips in that class. While the specialized FPGA approach can probably outperform the SoC, that margin may not be worth the amount of work and the lack of flexibility that FPGAs entail. It may just be a test of faith. Do you believe that you can select the best algorithm and implement it in silicon, as in the FPGA case? Or do you believe that you can iterate through several different algorithms and come up with something even better in a more flexible SoC environment? Good questions!
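For a sense of scale, the performance-per-watt figures quoted in this post can be put side by side with a few lines of Python. This is only a back-of-the-envelope sketch: the SoC numbers are the headline figures given above, the supercomputer number is a measured RMAX, and the two are probably not measured at the same precision or under the same workload, so treat the comparison as indicative at best. The names in the code are made up for this example.

```python
# (Gigaflops, watts) as quoted in the post.
# Caveat: the SoC figures are headline numbers, the supercomputer figure is
# a measured Linpack RMAX, so this is not a like-for-like comparison.
systems = {
    "Tegra K1": (360, 10),
    "Tegra X1": (1_000, 10),
    "Fastest machine, end of 2014": (33_000_000, 17_800_000),
}


def gflops_per_watt(gflops, watts):
    """Gigaflops delivered per watt of power drawn."""
    return gflops / watts


efficiency = {
    name: gflops_per_watt(gflops, watts)
    for name, (gflops, watts) in systems.items()
}

# Print from most to least efficient.
for name, value in sorted(efficiency.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {value:.2f} Gigaflops per watt")
```

On these numbers the Tegra K1 lands at 36 Gigaflops per watt and the Tegra X1 at 100, against a little under 2 for the fastest machine of 2014. That gap is the margin a specialized FPGA design would have to beat in the mobile SoC case.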


The figures come from
