AI News

NIPS Proceedings

Part of: Advances in Neural Information Processing Systems 29 (NIPS 2016)

We present a framework for efficient inference in structured image models that explicitly reason about objects.

We show that such models learn to identify multiple objects - counting, locating and classifying the elements of a scene - without any supervision, e.g., decomposing 3D images with various numbers of objects in a single forward pass of a neural network at unprecedented speed.

Deep learning

Deep learning is a class of machine learning algorithms that use a cascade of multiple layers of nonlinear processing units for feature extraction and transformation, with each successive layer taking the output of the previous layer as its input.[8](pp199–200) Learning can be supervised, semi-supervised or unsupervised.[1][2][3] Most modern deep learning models are based on an artificial neural network, although they can also include propositional formulas[9] or latent variables organized layer-wise in deep generative models such as the nodes in Deep Belief Networks and Deep Boltzmann Machines. Deep learning models are loosely related to information processing and communication patterns in a biological nervous system, such as neural coding, which attempts to define a relationship between various stimuli and the associated neuronal responses in the brain.[4] Deep learning architectures such as deep neural networks, deep belief networks and recurrent neural networks have been applied to fields including computer vision, speech recognition, natural language processing, audio recognition, social network filtering, machine translation, bioinformatics and drug design,[5] where they have produced results comparable to, and in some cases superior[6] to, human experts.[7]

Examples of deep structures that can be trained in an unsupervised manner are neural history compressors[12] and deep belief networks.[1][13] Deep neural networks are generally interpreted in terms of the universal approximation theorem[14][15][16][17][18] or probabilistic inference.[8][9][1][2][13][19][20] The universal approximation theorem concerns the capacity of feedforward neural networks with a single hidden layer of finite size to approximate continuous functions.[14][15][16][17][18] The first proof was published in 1989 by Cybenko for sigmoid activation functions[15] and was generalised to feed-forward multi-layer architectures in 1991 by Hornik.[16] The probabilistic interpretation[19] derives from the field of machine learning.
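As a point of reference, the single-hidden-layer case of the theorem can be stated informally as below; the notation is chosen here for illustration rather than taken from the cited proofs.

    % Informal statement of the universal approximation theorem (one hidden layer).
    % Let K \subset \mathbb{R}^n be compact, f : K \to \mathbb{R} continuous, and
    % \sigma a non-constant, bounded, continuous activation. For any \varepsilon > 0
    % there exist N, weights w_i \in \mathbb{R}^n and scalars v_i, b_i such that
    \[
      F(x) = \sum_{i=1}^{N} v_i \, \sigma\!\left(w_i^{\top} x + b_i\right)
      \qquad\text{satisfies}\qquad
      \sup_{x \in K} \bigl\lvert F(x) - f(x) \bigr\rvert < \varepsilon .
    \]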

Each layer in the feature extraction module extracted features of growing complexity relative to the previous layer.[38] In 1995, Brendan Frey demonstrated that it was possible to train (over two days) a network containing six fully connected layers and several hundred hidden units using the wake-sleep algorithm, co-developed with Peter Dayan and Hinton.[39] Many factors contribute to the slow training speed, including the vanishing gradient problem analyzed in 1991 by Sepp Hochreiter.[40][41] Simpler models that use task-specific handcrafted features such as Gabor filters and support vector machines (SVMs) were a popular choice in the 1990s and 2000s because of ANNs' computational cost and a lack of understanding of how the brain wires its biological networks.

In 2003, LSTM started to become competitive with traditional speech recognizers on certain tasks.[55] Later it was combined with connectionist temporal classification (CTC)[56] in stacks of LSTM RNNs.[57] In 2015, Google's speech recognition reportedly experienced a dramatic performance jump of 49% through CTC-trained LSTM, which they made available through Google Voice Search.[58] In the early 2000s, CNNs processed an estimated 10% to 20% of all the checks written in the US.[59] In 2006, Hinton and Salakhutdinov showed how a many-layered feedforward neural network could be effectively pre-trained one layer at a time, treating each layer in turn as an unsupervised restricted Boltzmann machine, then fine-tuning it using supervised backpropagation.[60] Deep learning is part of state-of-the-art systems in various disciplines, particularly computer vision and automatic speech recognition (ASR).

It was believed that pre-training DNNs using generative models of deep belief nets (DBN) would overcome the main difficulties of neural nets.[51] However, researchers discovered that replacing pre-training with large amounts of training data for straightforward backpropagation, when using DNNs with large, context-dependent output layers, produced error rates dramatically lower than the then state-of-the-art Gaussian mixture model (GMM)/hidden Markov model (HMM) systems and also lower than more advanced generative-model-based systems.[49][69] The nature of the recognition errors produced by the two types of systems was characteristically different,[50][68] offering technical insights into how to integrate deep learning into the existing highly efficient, run-time speech decoding system deployed by all major speech recognition systems.[8][70][71] Analysis around 2009-2010, contrasting the GMM (and other generative speech models) with DNN models, stimulated early industrial investment in deep learning for speech recognition.

While there, Ng determined that GPUs could increase the speed of deep-learning systems by about 100 times.[76] In particular, GPUs are well-suited for the matrix/vector math involved in machine learning.[77][78] GPUs speed up training algorithms by orders of magnitude, reducing running times from weeks to days.[79][80] Specialized hardware and algorithm optimizations can be used for efficient processing.[81] In 2012, a team led by Dahl won the 'Merck Molecular Activity Challenge' using multi-task deep neural networks to predict the biomolecular target of one drug.[82][83] In 2014, Hochreiter's group used deep learning to detect off-target and toxic effects of environmental chemicals in nutrients, household products and drugs and won the 'Tox21 Data Challenge' of NIH, FDA and NCATS.[84][85][86] Significant additional impacts in image or object recognition were felt from 2011 to 2012.

Error rates, measured as percent phone error rate (PER), have been summarized over roughly the past 20 years. The debut of DNNs for speaker recognition in the late 1990s, of DNN-based speech recognition around 2009-2011, and of LSTM around 2003-2007 accelerated progress in eight major areas.[8][52][70] All major commercial speech recognition systems (e.g., Microsoft Cortana, Xbox, Skype Translator, Amazon Alexa, Google Now, Apple Siri, Baidu and iFlyTek voice search, and a range of Nuance speech products) are based on deep learning.[8][116][117][118]

DNNs have proven themselves capable, for example, of a) identifying the style period of a given painting, b) 'capturing' the style of a given painting and applying it in a visually pleasing manner to an arbitrary photograph, and c) generating striking imagery based on random visual input fields.[122][123] Neural networks have been used for implementing language models since the early 2000s.[97][124] LSTM helped to improve machine translation and language modeling.[98][99][100] Other key techniques in this field are negative sampling[125] and word embedding.

A compositional vector grammar can be thought of as a probabilistic context-free grammar (PCFG) implemented by an RNN.[126] Recursive auto-encoders built atop word embeddings can assess sentence similarity and detect paraphrasing.[126] Deep neural architectures provide the best results for constituency parsing,[127] sentiment analysis,[128] information retrieval,[129][130] spoken language understanding,[131] machine translation,[98][132] contextual entity linking,[132] writing style recognition[133] and others.[134] Google Translate (GT) uses a large end-to-end long short-term memory network.[135][136][137][138][139][140] GNMT uses an example-based machine translation method in which the system 'learns from millions of examples.'[136] It translates 'whole sentences at a time, rather than pieces.'

These failures are caused by insufficient efficacy (on-target effect), undesired interactions (off-target effects), or unanticipated toxic effects.[142][143] Research has explored the use of deep learning to predict the biomolecular targets,[82][83] off-target effects, and toxic effects of environmental chemicals in nutrients, household products and drugs.[84][85][86] AtomNet is a deep learning system for structure-based rational drug design.[144] AtomNet was used to predict novel candidate biomolecules for disease targets such as the Ebola virus[145] and multiple sclerosis.[146][147] Deep reinforcement learning has been used to approximate the value of possible direct marketing actions, defined in terms of RFM variables.

An autoencoder ANN was used in bioinformatics to predict gene ontology annotations and gene-function relationships.[151] In medical informatics, deep learning was used to predict sleep quality based on data from wearables[152][153] and to predict health complications from electronic health record data.[154] Finding the appropriate mobile audience for mobile advertising[155] is always challenging, since many data points must be considered and assimilated before a target segment can be created and used in ad serving by any ad server.

On the one hand, several variants of the backpropagation algorithm have been proposed in order to increase its processing realism.[162][163] Other researchers have argued that unsupervised forms of deep learning, such as those based on hierarchical generative models and deep belief networks, may be closer to biological reality.[164][165] In this respect, generative neural network models have been related to neurobiological evidence about sampling-based processing in the cerebral cortex.[166] Although a systematic comparison between the human brain organization and the neuronal encoding in deep networks has not yet been established, several analogies have been reported.

The most capable AI systems, 'like Watson (...) use techniques like deep learning as just one element in a very complicated ensemble of techniques, ranging from the statistical technique of Bayesian inference to deductive reasoning.'[179] As an alternative to this emphasis on the limits of deep learning, one author speculated that it might be possible to train a machine vision stack to perform the sophisticated task of discriminating between 'old master' and amateur figure drawings, and hypothesized that such a sensitivity might represent the rudiments of a non-trivial machine empathy.[180] This same author proposed that this would be in line with anthropology, which identifies a concern with aesthetics as a key element of behavioral modernity.[181] In further reference to the idea that artistic sensitivity might inhere within relatively low levels of the cognitive hierarchy, a published series of graphic representations of the internal states of deep (20-30 layers) neural networks attempting to discern, within essentially random data, the images on which they were trained[182] demonstrates a visual appeal: the original research notice received well over 1,000 comments, and was for a time the most frequently accessed article on The Guardian's website.[183]

Some deep learning architectures display problematic behaviors,[184] such as confidently classifying unrecognizable images as belonging to a familiar category of ordinary images[185] and misclassifying minuscule perturbations of correctly classified images.[186] Goertzel hypothesized that these behaviors are due to limitations in their internal representations and that these limitations would inhibit integration into heterogeneous multi-component AGI architectures.[184] These issues may possibly be addressed by deep learning architectures that internally form states homologous to image-grammar[187] decompositions of observed entities and events.[184] Learning a grammar (visual or linguistic) from training data would be equivalent to restricting the system to commonsense reasoning that operates on concepts in terms of grammatical production rules, and is a basic goal of both human language acquisition[188] and AI.[189] As deep learning moves from the lab into the world, research and experience show that artificial neural networks are vulnerable to hacks and deception.

ANNs have been trained to defeat ANN-based anti-malware software by repeatedly attacking a defense with malware that was continually altered by a genetic algorithm until it tricked the anti-malware while retaining its ability to damage the target.[190] Another group demonstrated that certain sounds could make the Google Now voice command system open a particular web address that would download malware.[190] In “data poisoning”, false data is continually smuggled into a machine learning system’s training set to prevent it from achieving mastery.[190]

Deploying Deep Neural Networks with NVIDIA TensorRT

Power efficiency and speed of response are two key metrics for deployed deep learning applications, because they directly affect the user experience and the cost of the service provided.

TensorRT automatically optimizes trained neural networks for run-time performance, delivering up to 16x higher energy efficiency (performance per watt) on a Tesla P100 GPU compared to common CPU-only deep learning inference systems.

For example, if the target is an embedded device using the trained neural network to perceive its surroundings, then the forward inference pass through the model has a direct impact on the overall response time and the power consumed by the device.

 In this scenario, the need to minimize latency and energy used on large volumes of geographically and temporally disparate requests limits the ability to form large batches.

TensorRT is a high-performance inference engine designed to deliver maximum inference throughput and efficiency for common deep learning applications such as image classification, segmentation, and object detection.

In the build phase, TensorRT performs optimizations on the network configuration and generates an optimized plan for computing the forward pass through the deep neural network.
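As a rough sketch of what the build phase can look like in code, the snippet below uses the TensorRT Python API to parse a trained network and serialize an optimized plan. The ONNX parser, file names, and default builder settings are assumptions made for illustration (the exact API varies between TensorRT versions), not the article's own listing.

    # Build phase sketch: parse a trained model and build an optimized plan.
    # Assumes a recent TensorRT Python API and an ONNX export of the network;
    # "model.onnx" and "model.plan" are placeholder file names.
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, logger)

    with open("model.onnx", "rb") as f:        # the trained network
        if not parser.parse(f.read()):
            raise RuntimeError("failed to parse the trained network")

    config = builder.create_builder_config()   # build-time options (precision, workspace, ...)
    plan = builder.build_serialized_network(network, config)  # the optimized plan

    with open("model.plan", "wb") as f:        # persisted for the deployment phase
        f.write(plan)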

The deployment phase generally takes the form of a long-running service or user application that accepts batches of input data, performs inference by executing the plan on the input data, and returns batches of output data (classification, object detection, etc.).
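A correspondingly hedged sketch of the deployment phase deserializes that plan once and then executes it for each incoming batch. Device-memory management (for example with PyCUDA or cuda-python) is omitted, and the binding layout is an assumption.

    # Deployment phase sketch: load the serialized plan and run inference on batches.
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    runtime = trt.Runtime(logger)

    with open("model.plan", "rb") as f:
        engine = runtime.deserialize_cuda_engine(f.read())

    context = engine.create_execution_context()

    def infer(device_buffers):
        # device_buffers: list of GPU pointers, one per engine binding, holding
        # the input batch and the space reserved for the output batch.
        context.execute_v2(device_buffers)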

You can define any parameter that varies between networks, including convolution layer weight dimensions and outputs as well as the window size and stride for pooling layers.

Horizontal layer fusion improves performance by combining layers that take the same source tensor and apply the same operations with similar parameters, resulting in a single larger layer for higher computational efficiency.

TensorRT performs its transformations during the build phase transparently to the API user, after the TensorRT parser reads in the trained network and configuration file.

During the build phase TensorRT identifies opportunities to optimize the network, and in the deployment phase TensorRT runs the optimized network in a way that minimizes latency and maximizes throughput.

If you are running web or mobile applications that are backed by data center servers, TensorRT’s low overhead means that you can deploy more varied and complex models to add intelligence to your product that will delight your users.

If you are using deep learning to create the next generation of smart devices, TensorRT helps you deploy networks with high performance, high accuracy, and high energy efficiency.

What’s the Difference Between Deep Learning Training and Inference?

This is the second of a multi-part series explaining the fundamentals of deep learning by long-time tech journalist Michael Copeland.  School’s in session.

More specifically, the trained neural network is put to work out in the digital world using what it has learned — to recognize images, spoken words, a blood disease, or suggest the shoes someone is likely to buy next, you name it — in the streamlined form of an application.

And just as we don’t haul around all our teachers, a few overloaded bookshelves and a red-brick schoolhouse to read a Shakespeare sonnet, inference doesn’t require all the infrastructure of its training regimen to do its job well.

Unlike our brains, where any neuron can connect to any other neuron within a certain physical distance, artificial neural networks have separate layers, connections, and directions of data propagation.

When training a neural network, training data is put into the first layer of the network, and individual neurons assign a weighting to the input — how correct or incorrect it is — based on the task being performed.
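As a toy illustration of that weighting-and-correction loop, the NumPy sketch below runs one gradient-descent step for a single-layer classifier on synthetic data; the layer size, loss, and learning rate are arbitrary choices for illustration, not anything from the article.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(32, 10))            # a batch of 32 training examples
    y = rng.integers(0, 2, size=(32, 1))     # made-up binary labels

    W = rng.normal(scale=0.1, size=(10, 1))  # the weights the layer assigns
    b = np.zeros((1, 1))
    lr = 0.1

    # Forward pass: the first layer receives the training data.
    z = X @ W + b
    p = 1.0 / (1.0 + np.exp(-z))             # sigmoid output

    # How correct or incorrect the output is, measured with cross-entropy loss.
    loss = -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))
    print("loss before update:", float(loss))

    # Backward pass: nudge the weights to reduce the error.
    grad_z = (p - y) / len(X)
    W -= lr * (X.T @ grad_z)
    b -= lr * grad_z.sum(axis=0, keepdims=True)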

Andrew Ng, who honed his AI chops at Google and Stanford and is now chief scientist at Baidu’s Silicon Valley Lab, says training one of Baidu’s Chinese speech recognition models requires not only four terabytes of training data, but also 20 exaflops of compute — that’s 20 billion billion math operations — across the entire training cycle.
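For a sense of scale, the arithmetic below restates that budget and divides it by a hypothetical sustained throughput; the 10 teraflop/s figure is an assumption chosen for illustration, not a number from Ng or the article.

    # 20 exaflops = 20 * 10^18 floating-point operations over the training cycle.
    total_ops = 20 * 10**18

    # Hypothetical sustained throughput of a single accelerator (assumption).
    sustained_flops_per_s = 10 * 10**12      # 10 teraflop/s

    seconds = total_ops / sustained_flops_per_s
    print(seconds / 86400)                   # roughly 23 days on that one device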

What you had to put in place to get that sucker to learn — in our education analogy all those pencils, books, teacher’s dirty looks — is now way more than you need to get any specific task accomplished.

If anyone is going to make use of all that training in the real world, and that’s the whole point, what you need is a speedy application that can retain the learning and apply it quickly to data it’s never seen.

While this is a brand new area of the field of computer science, there are two main approaches to taking that hulking neural network and modifying it for speed and improved latency in applications that run across other networks.

Here too, GPUs — and their parallel computing capabilities — offer benefits, where they run billions of computations based on the trained network to identify known patterns or objects.
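To contrast with the training sketch earlier, a minimal inference pass simply reapplies learned weights to unseen data with no gradients and no updates; the function below takes the weights from that toy sketch as arguments and is illustrative only.

    import numpy as np

    def predict(W, b, X_new):
        # Forward pass only: reuse the trained weights, never modify them.
        z = X_new @ W + b
        return (1.0 / (1.0 + np.exp(-z)) > 0.5).astype(int)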

Hyperparameter Optimization - The Math of Intelligence #7

Hyperparameters are the magic numbers of machine learning. We're going to learn how to find them in a more intelligent way than just trial-and-error. We'll go over grid search, random search,...
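As a hedged sketch of the grid-search and random-search ideas mentioned here, the snippet below compares the two on a stand-in objective; the objective function and the search range are placeholders for a real train-and-validate loop.

    import random

    def validation_score(lr):
        # Placeholder objective standing in for "train a model, measure accuracy".
        return -(lr - 0.03) ** 2

    # Grid search: evaluate every point on a fixed grid of learning rates.
    grid = [0.001, 0.01, 0.03, 0.1, 0.3]
    best_grid = max(grid, key=validation_score)

    # Random search: sample the same number of candidates from a log-uniform range.
    random.seed(0)
    candidates = [10 ** random.uniform(-3, 0) for _ in range(len(grid))]
    best_random = max(candidates, key=validation_score)

    print(best_grid, best_random)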

Small Deep Neural Networks - Their Advantages, and Their Design

Deep neural networks (DNNs) have led to significant improvements to the accuracy of machine-learning applications. For many problems, such as object classification and object detection, DNNs...

Neural networks [8.2] : Sparse coding - inference (ISTA algorithm)
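For context, the ISTA update named in this lecture title can be written as follows for a sparse code h, dictionary D, input x, step size \alpha and sparsity weight \lambda; the notation is chosen here for illustration.

    % ISTA minimizes \tfrac{1}{2}\lVert x - D h \rVert_2^2 + \lambda \lVert h \rVert_1 over h:
    \[
      h^{(t+1)} = \operatorname{shrink}_{\alpha\lambda}\!\Bigl( h^{(t)} - \alpha\, D^{\top}\bigl( D h^{(t)} - x \bigr) \Bigr),
      \qquad
      \operatorname{shrink}_{c}(u)_i = \operatorname{sign}(u_i)\,\max\bigl(\lvert u_i \rvert - c,\, 0\bigr).
    \]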

Computational Probability and Inference | MITx on edX | Course About Video

Learn fundamentals of probabilistic analysis and inference. Build computer programs that reason with uncertainty and make predictions. Tackle machine learning problems, from recommending movies...

Lecture 16: Dynamic Neural Networks for Question Answering

Lecture 16 addresses the question "Can all NLP tasks be seen as question answering problems?". Key phrases: Coreference Resolution, Dynamic Memory Networks for Question Answering over Text...

Lecture 15: Coreference Resolution

Lecture 15 covers what coreference is via a working example. Also includes the research highlight "Summarizing Source Code", an introduction to coreference resolution, and neural coreference resolution...

Neural Photo Editing with Introspective Adversarial Networks

A simple interface for modifying photos and exploring the latent space of generative models. Github HERE: Paper HERE:

Artificial Intelligence | Deep Learning Pt 1

Follow along with code here: and for more information, visit: In this episode we'll learn about the basics of neural networks.

How To Train an Object Detection Classifier Using TensorFlow 1.5 (GPU) on Windows 10

[These instructions work for TensorFlow 1.6 too!] This tutorial shows you how to train your own object detector for multiple objects using Google's TensorFlow Object Detection API on Windows....

NVIDIA AI Car Demonstration

In contrast to the usual approach to operating self-driving cars, we did not program any explicit object detection, mapping, path planning or control components into this car. Instead, the...