AI News, NIPS Proceedingsβ

NIPS Proceedingsβ

Part of: Advances in Neural Information Processing Systems 29 (NIPS 2016) We present a framework for efficient inference in structured image models that explicitly reason about objects.

We show that such models learn to identify multiple objects - counting, locating and classifying the elements of a scene - without any supervision, e.g., decomposing 3D images with various numbers of objects in a single forward pass of a neural network at unprecedented speed.

What’s the Difference Between Deep Learning Training and Inference?

This is the second of a multi-part series explaining the fundamentals of deep learning by long-time tech journalist Michael Copeland.  School’s in session.

More specifically, the trained neural network is put to work out in the digital world using what it has learned — to recognize images, spoken words, a blood disease, or suggest the shoes someone is likely to buy next, you name it — in the streamlined form of an application.

And just as we don’t haul around all our teachers, a few overloaded bookshelves and a red-brick schoolhouse to read a Shakespeare sonnet, inference doesn’t require all the infrastructure of its training regimen to do its job well.

Unlike our brains, where any neuron can connect to any other neuron within a certain physical distance, artificial neural networks have separate layers, connections, and directions of data propagation.

When training a neural network, training data is put into the first layer of the network, and individual neurons assign a weighting to the input — how correct or incorrect it is — based on the task being performed.

Andrew Ng, who honed his AI chops at Google and Stanford and is now chief scientist at Baidu’s Silicon Valley Lab, says training one of Baidu’s Chinese speech recognition models requires not only four terabytes of training data, but also 20 exaflops of compute — that’s 20 billion billion math operations — across the entire training cycle.

What you had to put in place to get that sucker to learn — in our education analogy all those pencils, books, teacher’s dirty looks — is now way more than you need to get any specific task accomplished.

If anyone is going to make use of all that training in the real world, and that’s the whole point, what you need is a speedy application that can retain the learning and apply it quickly to data it’s never seen.

While this is a brand new area of the field of computer science, there are two main approaches to taking that hulking neural network and modifying it for speed and improved latency in applications that run across other networks.

Here too, GPUs — and their parallel computing capabilities — offer benefits, where they run billions of computations based on the trained network to identify known patterns or objects.

Deploying Deep Neural Networks with NVIDIA TensorRT

Power efficiency and speed of response are two key metrics for deployed deep learning applications, because they directly affect the user experience and the cost of the service provided.

Tensor RT automatically optimizes trained neural networks for run-time performance, delivering up to 16x higher energy efficiency (performance per watt) on a Tesla P100 GPU compared to common CPU-only deep learning inference systems (see Figure 1).

For example, if the target is an embedded device using the trained neural network to perceive its surroundings, then the forward inference pass through the model has a direct impact on the overall response time and the power consumed by the device.

 In this scenario, the need to minimize latency and energy used on large volumes of geographically and temporally disparate requests limits the ability to form large batches.

Tensor RT is a high-performance inference engine designed to deliver maximum inference throughput and efficiency for common deep learning applications such as image classification, segmentation, and object detection.

In the build phase, TensorRT performs optimizations on the network configuration and generates an optimized plan for computing the forward pass through the deep neural network.

The deployment phase generally takes the form of a long running service or user application that accepts batches of input data, performs inference by executing the plan on the input data and returns batches of output data (classification, object detection, etc).

You can define any parameter that varies between networks, including convolution layer weight dimensions and outputs as well as the window size and stride for pooling layers.

Horizontal layer fusion improves performance by combining layers that take the same source tensor and apply the same operations with similar parameters, resulting in a single larger layer for higher computational efficiency.

TensorRT performs its transformations during the build phase transparently to the API user after the TensorRT parser reads in the trained network and configuration file, as Listing 1 shows.

During the build phase TensorRT identifies opportunities to optimize the network, and in the deployment phase TensorRT runs the optimized network in a way that minimizes latency and maximizes throughput.

If you are running web or mobile applications that are backed by data center servers, TensorRT’s low overhead means that you can deploy more varied and complex models to add intelligence to your product that will delight your users.

If you are using deep learning to create the next generation of smart devices, TensorRT helps you deploy networks with high performance, high accuracy, and high energy efficiency.

Building Models that Learn to Discover Structure and Relations

Thomas Kipf, Ethan Fetaya, Jackson Wang, Max Welling, Rich Zemel Some argue that a key component of human intelligence is our ability to reason about objects and their relations (e.g.

To illustrate this task, let us consider a more concrete setting: we are looking at a physical simulation of balls rolling on a 2D surface (see Figure above), and some of the balls are connected by (invisible) springs that create an attractive force.

Have a look at the video to the left (or above, in case you’re viewing this on your phone) and see if you are able to correctly guess the interaction graph.You will notice that once the interaction structure is shown, it is suddenly much easier and highly intuitive to understand the dynamics.

After showing the model a wide set of simulations, it can recognize these hidden relations and give accurate predictions in 99.9% of the cases², interestingly without ever using a ground truth interaction graph example in training.

Deep learning

Deep learning (also known as deep structured learning or hierarchical learning) is part of a broader family of machine learning methods based on learning data representations, as opposed to task-specific algorithms.

Deep learning architectures such as deep neural networks, deep belief networks and recurrent neural networks have been applied to fields including computer vision, speech recognition, natural language processing, audio recognition, social network filtering, machine translation, bioinformatics, drug design and board game programs, where they have produced results comparable to and in some cases superior to human experts.[4][5][6]

Deep learning models are vaguely inspired by information processing and communication patterns in biological nervous systems yet have various differences from the structural and functional properties of biological brains (especially human brain), which make them incompatible with neuroscience evidences.[7][8][9]

Most modern deep learning models are based on an artificial neural network, although they can also include propositional formulas or latent variables organized layer-wise in deep generative models such as the nodes in deep belief networks and deep Boltzmann machines.[11]

No universally agreed upon threshold of depth divides shallow learning from deep learning, but most researchers agree that deep learning involves CAP depth >

For supervised learning tasks, deep learning methods obviate feature engineering, by translating the data into compact intermediate representations akin to principal components, and derive layered structures that remove redundancy in representation.

The universal approximation theorem concerns the capacity of feedforward neural networks with a single hidden layer of finite size to approximate continuous functions.[15][16][17][18][19]

By 1991 such systems were used for recognizing isolated 2-D hand-written digits, while recognizing 3-D objects was done by matching 2-D images with a handcrafted 3-D object model.

But while Neocognitron required a human programmer to hand-merge features, Cresceptron learned an open number of features in each layer without supervision, where each feature is represented by a convolution kernel.

In 1994, André de Carvalho, together with Mike Fairhurst and David Bisset, published experimental results of a multi-layer boolean neural network, also known as a weightless neural network, composed of a 3-layers self-organising feature extraction neural network module (SOFT) followed by a multi-layer classification neural network module (GSN), which were independently trained.

In 1995, Brendan Frey demonstrated that it was possible to train (over two days) a network containing six fully connected layers and several hundred hidden units using the wake-sleep algorithm, co-developed with Peter Dayan and Hinton.[39]

Simpler models that use task-specific handcrafted features such as Gabor filters and support vector machines (SVMs) were a popular choice in the 1990s and 2000s, because of ANNs' computational cost and a lack of understanding of how the brain wires its biological networks.

These methods never outperformed non-uniform internal-handcrafting Gaussian mixture model/Hidden Markov model (GMM-HMM) technology based on generative models of speech trained discriminatively.[45]

The principle of elevating 'raw' features over hand-crafted optimization was first explored successfully in the architecture of deep autoencoder on the 'raw' spectrogram or linear filter-bank features in the late 1990s,[48]

Many aspects of speech recognition were taken over by a deep learning method called long short-term memory (LSTM), a recurrent neural network published by Hochreiter and Schmidhuber in 1997.[50]

showed how a many-layered feedforward neural network could be effectively pre-trained one layer at a time, treating each layer in turn as an unsupervised restricted Boltzmann machine, then fine-tuning it using supervised backpropagation.[58]

The impact of deep learning in industry began in the early 2000s, when CNNs already processed an estimated 10% to 20% of all the checks written in the US, according to Yann LeCun.[67]

was motivated by the limitations of deep generative models of speech, and the possibility that given more capable hardware and large-scale data sets that deep neural nets (DNN) might become practical.

However, it was discovered that replacing pre-training with large amounts of training data for straightforward backpropagation when using DNNs with large, context-dependent output layers produced error rates dramatically lower than then-state-of-the-art Gaussian mixture model (GMM)/Hidden Markov Model (HMM) and also than more-advanced generative model-based systems.[59][70]

offering technical insights into how to integrate deep learning into the existing highly efficient, run-time speech decoding system deployed by all major speech recognition systems.[10][72][73]

In 2010, researchers extended deep learning from TIMIT to large vocabulary speech recognition, by adopting large output layers of the DNN based on context-dependent HMM states constructed by decision trees.[75][76][77][72]

In 2009, Nvidia was involved in what was called the “big bang” of deep learning, “as deep-learning neural networks were trained with Nvidia graphics processing units (GPUs).”[78]

In 2014, Hochreiter's group used deep learning to detect off-target and toxic effects of environmental chemicals in nutrients, household products and drugs and won the 'Tox21 Data Challenge' of NIH, FDA and NCATS.[87][88][89]

Although CNNs trained by backpropagation had been around for decades, and GPU implementations of NNs for years, including CNNs, fast implementations of CNNs with max-pooling on GPUs in the style of Ciresan and colleagues were needed to progress on computer vision.[80][81][34][90][2]

In November 2012, Ciresan et al.'s system also won the ICPR contest on analysis of large medical images for cancer detection, and in the following year also the MICCAI Grand Challenge on the same topic.[92]

In 2013 and 2014, the error rate on the ImageNet task using deep learning was further reduced, following a similar trend in large-scale speech recognition.

For example, in image recognition, they might learn to identify images that contain cats by analyzing example images that have been manually labeled as 'cat' or 'no cat' and using the analytic results to identify cats in other images.

Over time, attention focused on matching specific mental abilities, leading to deviations from biology such as backpropagation, or passing information in the reverse direction and adjusting the network to reflect that information.

Neural networks have been used on a variety of tasks, including computer vision, speech recognition, machine translation, social network filtering, playing board and video games and medical diagnosis.

Despite this number being several order of magnitude less than the number of neurons on a human brain, these networks can perform many tasks at a level beyond that of humans (e.g., recognizing faces, playing 'Go'[99]

The user can review the results and select which probabilities the network should display (above a certain threshold, etc.) and return the proposed label.

The extra layers enable composition of features from lower layers, potentially modeling complex data with fewer units than a similarly performing shallow network.[11]

The training process can be guaranteed to converge in one step with a new batch of data, and the computational complexity of the training algorithm is linear with respect to the number of neurons involved.[115][116]

that involve multi-second intervals containing speech events separated by thousands of discrete time steps, where one time step corresponds to about 10 ms.

All major commercial speech recognition systems (e.g., Microsoft Cortana, Xbox, Skype Translator, Amazon Alexa, Google Now, Apple Siri, Baidu and iFlyTek voice search, and a range of Nuance speech products, etc.) are based on deep learning.[10][122][123][124]

DNNs have proven themselves capable, for example, of a) identifying the style period of a given painting, b) 'capturing' the style of a given painting and applying it in a visually pleasing manner to an arbitrary photograph, and c) generating striking imagery based on random visual input fields.[128][129]

Word embedding, such as word2vec, can be thought of as a representational layer in a deep learning architecture that transforms an atomic word into a positional representation of the word relative to other words in the dataset;

Finding the appropriate mobile audience for mobile advertising is always challenging, since many data points must be considered and assimilated before a target segment can be created and used in ad serving by any ad server.[161][162]

'Deep anti-money laundering detection system can spot and recognize relationships and similarities between data and, further down the road, learn to detect anomalies or classify and predict specific events'.

Deep learning is closely related to a class of theories of brain development (specifically, neocortical development) proposed by cognitive neuroscientists in the early 1990s.[166][167][168][169]

These developmental models share the property that various proposed learning dynamics in the brain (e.g., a wave of nerve growth factor) support the self-organization somewhat analogous to the neural networks utilized in deep learning models.

Like the neocortex, neural networks employ a hierarchy of layered filters in which each layer considers information from a prior layer (or the operating environment), and then passes its output (and possibly the original input), to other layers.

Other researchers have argued that unsupervised forms of deep learning, such as those based on hierarchical generative models and deep belief networks, may be closer to biological reality.[173][174]

researchers at The University of Texas at Austin (UT) developed a machine learning framework called Training an Agent Manually via Evaluative Reinforcement, or TAMER, which proposed new methods for robots or computer programs to learn how to perform tasks by interacting with a human instructor.[165]

Such techniques lack ways of representing causal relationships (...) have no obvious ways of performing logical inferences, and they are also still a long way from integrating abstract knowledge, such as information about what objects are, what they are for, and how they are typically used.

systems, like Watson (...) use techniques like deep learning as just one element in a very complicated ensemble of techniques, ranging from the statistical technique of Bayesian inference to deductive reasoning.'[190]

As an alternative to this emphasis on the limits of deep learning, one author speculated that it might be possible to train a machine vision stack to perform the sophisticated task of discriminating between 'old master' and amateur figure drawings, and hypothesized that such a sensitivity might represent the rudiments of a non-trivial machine empathy.[191]

In further reference to the idea that artistic sensitivity might inhere within relatively low levels of the cognitive hierarchy, a published series of graphic representations of the internal states of deep (20-30 layers) neural networks attempting to discern within essentially random data the images on which they were trained[193]

Learning a grammar (visual or linguistic) from training data would be equivalent to restricting the system to commonsense reasoning that operates on concepts in terms of grammatical production rules and is a basic goal of both human language acquisition[199]

Such a manipulation is termed an “adversarial attack.” In 2016 researchers used one ANN to doctor images in trial and error fashion, identify another's focal points and thereby generate images that deceived it.

Another group showed that certain psychedelic spectacles could fool a facial recognition system into thinking ordinary people were celebrities, potentially allowing one person to impersonate another.

ANNs can however be further trained to detect attempts at deception, potentially leading attackers and defenders into an arms race similar to the kind that already defines the malware defense industry.

ANNs have been trained to defeat ANN-based anti-malware software by repeatedly attacking a defense with malware that was continually altered by a genetic algorithm until it tricked the anti-malware while retaining its ability to damage the target.[201]

How to easily Detect Objects with Deep Learning on Raspberry Pi

You can find a bunch of pretrained models here The process of training a model is unnecessarily difficult to simplify the process we created a docker image would make it easy to train.

To start training the model you can run: The docker image has a run.sh script that can be called with the following parameters You can find more details at: To train a model you need to select the right hyper parameters.

Training neural networks is done by applying many tiny nudges to the weights, and these small increments typically need floating point precision to work (though there are research efforts to use quantized representations here too).

The simplest motivation for quantization is to shrink file sizes by storing the min and max for each layer, and then compressing each float value to an eight-bit integer.The size of the files is reduced by 75%.

Lecture 15 | Efficient Methods and Hardware for Deep Learning

In Lecture 15, guest lecturer Song Han discusses algorithms and specialized hardware that can be used to accelerate training and inference of deep learning ...

How to Deploy a Tensorflow Model to Production

Once we've trained a model, we need a way of deploying it to a server so we can use it as a web or mobile app! We're going to use the Tensorflow Serving ...

Lecture 13 | Generative Models

In Lecture 13 we move beyond supervised learning, and discuss generative modeling as a form of unsupervised learning. We cover the autoregressive ...

Faster AI Deployment with NVIDIA TensorRT

Learn more about NVIDIA TensorRT, a programmable inference accelerator delivering the performance, efficiency, and responsiveness critical to powering the ...

Neural networks [10.12] : Natural language processing - merging representations

Demo: Accelerate Deep Learning Inference on Raspberry Pi

This is a demo to show that it is possible to execute Deep Learning Inference at a practical speed by utilizing full computational power of Raspberry Pi's GPU.

How To Train an Object Detection Classifier Using TensorFlow 1.5 (GPU) on Windows 10

These instructions work for newer versions of TensorFlow too! This tutorial shows you how to train your own object detector for multiple objects using Google's ...

OpenAI's robot hand learns to handle objects with human-like dexterity

Read the original article: WHY THIS MATTERS ..

Composing graphical models with neural networks

Paper: Code: How to combine the complementary strengths of probabilistic graphical models and .

Deep Learning Inference for Object Detection on Raspberry Pi

See how you can generate code from a trained deep neural network in MATLAB® for ARM processors that support the Neon instruction set architecture (ISA) ...