Earlier this year we introduced CEVA’s first-generation CDNN, which supported the Caffe open-source deep learning framework and translation onto the CEVA-XM4 vision engine IP. Caffe focuses on image processing, and broader use cases have driven rapid adoption of the TensorFlow deep learning framework since its public introduction last November. (There is an interesting debate developing over whether companies will want dependencies on Google moving forward – and if the initial response is any indication, the value is outweighing any perceived risks.) Its latest release lets TensorFlow run on distributed GPUs, prompting many developers to move on from the older Theano framework.
Deep learning algorithms are also proliferating. The annual ImageNet Large Scale Visual Recognition Challenge evaluates entries on classification, localization, and detection. Previous challenges have brought AlexNet, GoogLeNet, ResNet, SegNet, VGG16, VGG19, and other networks to light. GoogLeNet introduced the “Inception” module, building on the network-in-network idea – a demonstration of the increasing complexity of CNN topologies.
Complexity is the crux of the problem when it comes to embedding CNNs, yet it is what drives error rates down; advanced algorithms have pushed ImageNet classification error below 5% (according to Nervana Systems), now better than humans. As I wrote last week, “Anyone armed with server-class GPUs and gigantic FPGAs and racks of equipment in a data center can research CNNs.” That may be an exaggeration – people with advanced degrees help a great deal – but the point stands: these new algorithms drag huge processing and memory bandwidth requirements along for the ride.
Pre-training generates the coefficients, but the problem of translating CNN research into a real-time network model on embeddable IP remains. Liran Bar, CEVA’s Director of Product Marketing for its deep neural network efforts, says that while the CEVA-XM4 IP does support floating-point coefficients, most designers omit the floating-point unit to save area and power. Instead, the CDNN2 software takes the floating-point network and weight coefficients from offline pre-training and generates a fixed-point implementation without sacrificing accuracy.
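The core idea behind a float-to-fixed conversion like this can be sketched in a few lines. CDNN2’s actual conversion scheme is proprietary; the snippet below is only an illustrative symmetric linear quantizer (a common textbook approach), with the function names and the 8-bit choice being my own assumptions, not CEVA’s.

```python
import numpy as np

def quantize_weights(weights, bits=8):
    """Symmetric linear quantization of floating-point weights to
    fixed-point integers. Illustrative only -- not CDNN2's actual
    (proprietary) conversion scheme."""
    qmax = 2 ** (bits - 1) - 1                     # e.g. 127 for 8-bit
    scale = np.max(np.abs(weights)) / qmax         # one scale per tensor
    q = np.round(weights / scale).astype(np.int8)  # fixed-point values
    return q, scale

def dequantize(q, scale):
    """Recover approximate floating-point values for accuracy checks."""
    return q.astype(np.float32) * scale

# Quantize a small 3x3 filter and measure the round-trip error,
# which is bounded by half the quantization step (scale / 2).
w = np.random.randn(3, 3).astype(np.float32)
q, s = quantize_weights(w)
err = np.max(np.abs(w - dequantize(q, s)))
```

In practice, tools in this space also re-check classification accuracy on sample data after conversion, since per-layer value ranges determine how much precision survives the rounding.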
This network generator capability in CDNN2 is a “push-button conversion” that produces results for a moderately sized network in about 10 minutes, ready to run on a CEVA-XM4 FPGA prototyping platform (or, of course, the actual IP in an SoC). Bar offers an AlexNet example: 24 layers with a 224x224 network input size, running 11x11, 5x5, and 3x3 convolution filters. The pre-trained version requires 253 MB of floating-point weights and data; the CDNN2-optimized output cuts that to 16 MB of fixed-point weights and data. Running on a CEVA-XM4, that translates to a 2.8x performance improvement and a 15x bandwidth savings, with no perceptible difference in classification.
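The quoted figures are self-consistent, as a little arithmetic shows: moving from 32-bit floating point to, say, 8-bit fixed point accounts for a 4x reduction on its own, and the overall 253 MB to 16 MB shrink works out to roughly the 15x bandwidth savings CEVA cites.

```python
# Rough arithmetic on the AlexNet figures quoted above.
float_mb = 253                  # pre-trained floating-point weights and data
fixed_mb = 16                   # CDNN2 fixed-point output
ratio = float_mb / fixed_mb
print(f"compression ratio: {ratio:.1f}x")   # prints "compression ratio: 15.8x"
```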
For more on CEVA’s news from CVPR 2016, including a short demonstration video:
CEVA's 2nd Generation Neural Network Software Framework Extends Support for Artificial Intelligence Including Google's TensorFlow
I suspect that many of the researchers making stunning breakthroughs on server-class GPUs have little experience embedding their algorithms. In the past, embedded design constraints imposed difficult choices that stifled algorithmic creativity. Being able to move quickly from a Caffe or TensorFlow model onto a power-efficient embedded chip is huge; Google’s Tensor Processing Unit is powerful, but not very efficient by embedded-device standards. Pairing the CEVA-XM4 hardware IP with CDNN2 software enablement for machine learning should match advanced CNN algorithms with customized SoCs quickly.