AI News, IBM’s New Do-It-All Deep-Learning Chip

IBM’s New Do-It-All Deep-Learning Chip

“The most advanced precision that you can do for training is 16 bits, and the most advanced you can do for inference is 2 bits,” explains Kailash Gopalakrishnan, a distinguished member of the technical staff at IBM’s Yorktown Heights research center who led the effort.

“This chip potentially covers the best of training known today and the best of inference known today.” The chip’s ability to do all of this stems from two innovations that are both aimed at the same outcome—keeping all the processor components fed with data and working.

To break through these information infarctions, Gopalakrishnan’s team came up with a “customized” data flow system: a network scheme that speeds the movement of data from one processing engine to the next. It is customized according to whether the chip is handling learning or inference, and for the different scales of precision.

For example, there are certain situations where a cache would push a chunk of data out to the computer’s main memory (evict it), but if that data’s needed as part of the neural network’s inferencing or learning process, the system will then have to wait until it can be retrieved from main memory.

The resulting chip can perform all three of today’s main flavors of deep learning AI: convolutional neural networks (CNN), multilayer perceptrons (MLP), and long short-term memory (LSTM).

Gopalakrishnan points out that because the chip is made using an advanced silicon CMOS manufacturing process (GlobalFoundries’ 14-nanometer process), all those operations per second are packed into a pretty small area.

Deep Learning in Real Time — Inference Acceleration and Continuous Training

Deep learning is revolutionizing many areas of computer vision and natural language processing (NLP), infusing intelligence into a growing number of consumer and industrial products, with the potential to reshape both people’s everyday experience and standard industry practices.

Like any statistical machine learning model, the validity and effectiveness of a deep neural network critically hinge on the assumption that the distribution of the input and output data does not change significantly over time; when it does, the patterns and intricacies the model originally learned can underperform or even become unusable.

However, such an assumption rarely holds true in the real world, especially in domains such as information security, where fast-paced evolution of the underlying data-generating mechanism is the norm (in the case of security, because both players, the defender and the adversary, constantly strive to outmatch their opponent by changing their own strategies and exploiting the opponent’s unguarded vulnerabilities).

Therefore, as the prospect of using deep learning to better solve many once unsolvable problems marches into such domains, the problem of continuously training a deep neural network, and how to do it well without jeopardizing production quality guarantees, is gaining greater attention among both Machine-Learning-as-a-Service (MLaaS) providers and application architects.

The goal of this report is not to delve too deeply into the technicalities of one or two specific techniques, but to survey the broader landscape of hardware and software solutions to these two important problems, offer our readers a starting point for further study, and hopefully inspire more people with diverse expertise to join the discussion and exchange knowledge.

It is common to batch hundreds of training inputs (for example, images in a computer vision task, sentence sequences in an NLP task, or spectrograms in a speech recognition task) and perform forward or backward propagation on them simultaneously as one unit of data, to amortize the cost of loading the network weights from GPU memory across many inputs.
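
A minimal PyTorch sketch of this batching idea is shown below; the model architecture, batch size, and input dimensions are placeholders chosen only for illustration.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical classifier; stands in for any DNN whose weights sit in device memory.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(224 * 224 * 3, 1024),
    nn.ReLU(),
    nn.Linear(1024, 10),
).to(device)

# A single batched forward pass: the weights are read from memory once and
# reused across all 256 inputs, amortizing the cost of loading them.
batch = torch.randn(256, 3, 224, 224, device=device)
with torch.no_grad():
    logits = model(batch)  # shape: (256, 10)
```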

For example, the latency budget for the entire Google Autosuggest pipeline is less than 200 milliseconds, covering the frontend, load balancing, query understanding, the DNN-powered auto-complete suggestions, and a full search stack traversal to display what the results would be if the user actually searched for one of the auto-suggested queries.

Nvidia’s cutting-edge Pascal-architecture Tesla P4, P40 and P100 GPU accelerators deliver up to 33x higher throughput than a single-socket CPU server while maintaining up to 31x lower latency, according to an Nvidia study that compared the inference performance of AlexNet, GoogLeNet, ResNet-152 and VGG-19 on a CPU-only server (a single Intel Xeon E5-2690 v4 @ 2.6 GHz) versus a GPU server (the same CPU plus one P100 PCIe card).

Traditional algorithms such as precomputed implicit GEMM (generalized matrix-matrix product) are optimized for large output matrices, and their default parallelization strategy suffers from not being able to launch enough thread blocks at small batch sizes, because the batch size is a multiplicative factor in one of the output matrix dimensions.
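
As a rough illustration of why small batches hurt, the sketch below computes the output-matrix shape when a convolution is lowered to a GEMM (for example via im2col); the layer dimensions are assumptions for illustration, not figures from the study above. At batch size 1, one GEMM dimension collapses and there is too little work to fill the GPU with thread blocks.

```python
# When a convolution is lowered to a GEMM, the output matrix is
# (output channels) x (batch * output height * output width), so a small
# batch directly shrinks one GEMM dimension and limits parallelism.

def implicit_gemm_output_shape(batch, out_channels, out_h, out_w):
    m = out_channels
    n = batch * out_h * out_w
    return m, n

# A late ResNet-style layer with a 7x7 spatial output:
print(implicit_gemm_output_shape(batch=1, out_channels=512, out_h=7, out_w=7))    # (512, 49)
print(implicit_gemm_output_shape(batch=128, out_channels=512, out_h=7, out_w=7))  # (512, 6272)
```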

In recent years, researchers have found that using a lower-precision floating point representation (FP16) for storing layer activations and a higher-precision one (FP32) for computation does not sacrifice classification accuracy, while improving performance in bandwidth-limited situations and reducing the overall memory footprint required to run the DNN.
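
A minimal sketch of this mixed-precision recipe, using PyTorch’s automatic mixed precision utilities and assuming a CUDA device is available; the model, data, and learning rate are placeholders.

```python
import torch
import torch.nn as nn

# Model weights stay in FP32; autocast runs eligible ops with FP16 activations.
model = nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()     # rescales the loss to avoid FP16 gradient underflow

x = torch.randn(64, 1024, device="cuda")
target = torch.randn(64, 1024, device="cuda")

with torch.cuda.amp.autocast():          # forward pass with FP16 activations where safe
    loss = nn.functional.mse_loss(model(x), target)

scaler.scale(loss).backward()            # backward pass on the scaled loss
scaler.step(optimizer)                   # unscales gradients, then applies the FP32 update
scaler.update()
```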

With a flexible Instruction Set Architecture allowing for rich data representations such as FP16, FP32, INT16 and INT32, and fast SIMD multiply-accumulate operations, these devices provide efficient memory block loading for optimized convolution and generalized matrix-matrix multiplication, both critical for fast and energy-efficient inference on edge devices.

The Model Optimizer attempts horizontal and vertical layer fusion and redundant network branch pruning, before quantizing the network weights and feeding the reduced, quantized network to the Inference Engine, which further optimizes inference for target hardware with an emphasis on footprint reduction.
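
The weight-quantization step can be pictured with a simple symmetric INT8 scheme; the NumPy sketch below is only a conceptual stand-in, not the Model Optimizer’s actual algorithm.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor INT8 quantization: w ~ scale * q, with q in [-127, 127]."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize(q, scale)).max())  # small relative to scale
```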

The instruction set contains optimized CISC instructions for reading a data block from memory, reading a weight block from memory, multiplying or convolving the data and weight blocks and accumulating the intermediate results, applying hardwired activation functions elementwise, and writing the result back to memory.
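
The sketch below mirrors that instruction flow in plain Python; the buffer layout, function names, and the choice of ReLU as the hardwired activation are invented for illustration.

```python
import numpy as np

# A software analogue of the described sequence:
# read data, read weights, multiply-accumulate, activate, write back.
memory = {
    "data":    np.random.randn(256, 1024).astype(np.float32),
    "weights": np.random.randn(1024, 512).astype(np.float32),
}

def read_block(name):
    return memory[name]

def matmul_accumulate(data, weights):
    return data @ weights                 # multiply and accumulate partial sums

def activate(x):
    return np.maximum(x, 0.0)             # hardwired elementwise activation (ReLU here)

def write_block(name, value):
    memory[name] = value

acc = matmul_accumulate(read_block("data"), read_block("weights"))
write_block("result", activate(acc))
```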

The Matrix Multiplier Unit is a massively parallel matrix processor capable of running hundreds of thousands of matrix operations (multiplications and additions) in a single clock cycle, reusing both inputs (the data block and the weight block) across many different operations without storing them back to a register or, in this case, to the 24 MB SRAM Unified Buffer.
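
For a sense of scale, a back-of-the-envelope count for an assumed 256 x 256 multiply-accumulate array (a dimension assumed here, not stated in the text above) already lands in that range.

```python
# Rough arithmetic behind "hundreds of thousands of operations per cycle",
# assuming a 256 x 256 systolic multiply-accumulate array.
macs_per_cycle = 256 * 256            # 65,536 multiply-accumulates
ops_per_cycle = 2 * macs_per_cycle    # counting the multiply and the add separately
print(ops_per_cycle)                  # 131072
```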

A compressed model that can easily fit into on-chip SRAM cache rather than off-chip DRAM memory will facilitate the application of complex DNNs on mobile platforms or in driverless cars, where memory size, inference speed, and network bandwidth are all strictly constrained.

Therefore, as the prospect of using deep learning to better solve many once unsolvable problems marches into domains such as information security, the problem of continuous learning for a deep neural network, and how to do it well without jeopardizing production quality guarantees or raising resource consumption, is gaining greater attention among both Machine-Learning-as-a-Service (MLaaS) providers and application architects. In the second part of this report, we will formulate the continuous learning scenario and introduce an incremental fine-tuning approach.

The expectation is that the generative training stage will guide the network towards learning a good hierarchical representation of the data domain, and that the discriminative stage will take advantage of this representation and, hopefully, learn a better discriminator function more easily in the representation space. More recently, researchers have adopted a pretrain-then-fine-tune approach: pretrain a sophisticated, state-of-the-art DNN on a large general-purpose data set like ImageNet, then fine-tune the model on a smaller data set of interest.

The underlying assumption is that a reasonably good result on the large training data set already puts the network near a local optimum in the parameter space so that even a small amount of new data is able to quickly lead the network to an optimum. From the perspective of continuous learning, both methods above are extreme cases where the network is only trained twice — initial pretraining and a one-time update.
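
A minimal PyTorch sketch of this pretrain-then-fine-tune recipe; the torchvision backbone, the 20-class target task, and the learning rate are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn
from torchvision import models

# Start from an ImageNet-pretrained backbone, swap the classifier head for the
# smaller task, and update with a small learning rate so the parameters stay
# near the optimum found during pretraining.
num_target_classes = 20

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, num_target_classes)

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def fine_tune_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```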

For this reason, in the discussion below, we focus on scenarios where each update step only takes in a small amount (compared to the original full training data set) of new data, but an updated model is expected to be immediately available.

This assumption holds true much more often than the case where the input domain changes, because in most production systems, autonomous vehicles for example, we cannot expect the system to be even remotely functional if the input videos or LIDAR images vary drastically (due to extreme weather, lighting, terrain, or road conditions), but we do want the system to accommodate a new type of object that must not be hit when it shows up, and to adapt to this new class of labels during continuous training.
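
One way to picture accommodating a new class of labels is to grow the classifier head while preserving the weights already learned; the helper below is a hypothetical illustration, not a method taken from the report.

```python
import torch
import torch.nn as nn

def expand_classifier_head(old_head: nn.Linear, num_new_classes: int = 1) -> nn.Linear:
    """Add output units for newly observed classes while keeping learned weights intact."""
    new_head = nn.Linear(old_head.in_features, old_head.out_features + num_new_classes)
    with torch.no_grad():
        new_head.weight[: old_head.out_features] = old_head.weight
        new_head.bias[: old_head.out_features] = old_head.bias
        # The extra rows keep their default random initialization and are
        # learned from the newly labeled data during continuous training.
    return new_head

head = nn.Linear(512, 10)              # e.g. 10 known object classes
head = expand_classifier_head(head)    # now 11 classes, old knowledge preserved
```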

In a production scenario, labels for the new data are rarely noiseless: due to the real-time requirement, human judges oftentimes have to make quick but inaccurate decisions, and the disagreement rate among three to five judges can be as high as 75%.

Thus, robustness against labeling noise is critical for a continuous learning scheme. We perform 10 epochs at each update step with a minibatch size of 64, and randomly pollute a fixed fraction of the newly available data by replacing those examples with examples from the remaining classes while keeping their labels intact.
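
A sketch of that pollution step in NumPy, under the assumption that the new data arrive as arrays of inputs and integer labels; the array names and the helper itself are invented for illustration.

```python
import numpy as np

def pollute(new_x, new_y, pool_x, pool_y, new_classes, fraction, seed=0):
    """Replace `fraction` of the new examples with inputs drawn from the remaining
    classes while keeping the original (now incorrect) labels."""
    rng = np.random.default_rng(seed)
    x, y = new_x.copy(), new_y.copy()
    n_pollute = int(len(x) * fraction)
    victims = rng.choice(len(x), size=n_pollute, replace=False)
    donors = np.flatnonzero(~np.isin(pool_y, new_classes))      # indices of other classes
    x[victims] = pool_x[rng.choice(donors, size=n_pollute)]     # swap inputs, keep labels
    return x, y
```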

Lecture 15 | Efficient Methods and Hardware for Deep Learning

In Lecture 15, guest lecturer Song Han discusses algorithms and specialized hardware that can be used to accelerate training and inference of deep learning ...

Small Deep Neural Networks - Their Advantages, and Their Design

Deep neural networks (DNNs) have led to significant improvements to the accuracy of machine-learning applications. For many problems, such as object ...

Machine Learning Accelerated Image Classification

See how Xilinx FPGAs can accelerate machine learning, a critical data center workload, through an example of image classification. The demo accelerates ...

Neural Compute Stick (with 12 SHAVE processors) vs. Raspberry Pi's GPU

Comparison of deep learning inference acceleration by Movidius' Neural Compute Stick (MvNCS) and by Idein's software which uses Raspberry Pi's GPU ...

Memristor-Based Analog Computation and Neural Network Classification with a Dot Product Engine

Large memristor arrays composed of hafnium oxide are demonstrated with suitability for computing matrix operations at higher power efficiency than digital ...

ISSCC 2018 - 50 Years of Computer Architecture: From Mainframe CPUs to Neural-Network TPUs

David Patterson (Google, Mountain View, CA; University of California, Berkeley, CA). This talk reviews a half-century of computer architecture: We start with the ...

GTC Japan 2017 Part 6: New NVIDIA TensorRT 3

NVIDIA founder and CEO Jensen Huang describes TensorRT 3, a high-performance deep learning inference optimizer and runtime that delivers low latency, ...

Building Custom AI Models on Azure using TensorFlow and Keras : Build 2018

Learn how to simplify your Machine Learning workflow by using the experimentation, model management, and deployment services from AzureML.

Machine Learning Exposed: Introduction to Machine Learning by James Weaver and Katharine Beaumont

In the age of quantum computing, computer chip implants and artificial intelligence, it's easy to feel left behind. For example, the term "machine learning" is ...

Cloud OnAir: Cooking with TPUs: How to train your models at lightning speed on Google's ML hardware

We'll bring you into our ML kitchen and show you how to use your raw data in our recipes for creating tasty ML concoctions. We'll show you how to accelerate ...