AI News, A Primer on Deep Learning

A Primer on Deep Learning

The game asks its players the following question: given a set of 2-second sound clips from buoys in the ocean, can you classify each sound clip as having a call from a North Atlantic right whale or not? The practical application of the competition is that if we can detect where the whales are migrating by picking up their calls, we can route shipping traffic to avoid them, a positive both for effective shipping and whale preservation.

In a post-competition interview, the competition's winners noted the value of focusing on feature generation, also called feature engineering.

A simple example of an engineered feature would involve subtracting two columns and including this new number as an additional descriptor of your data.
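
As a minimal sketch, this can be done in a couple of lines with pandas (the column names here are invented purely for illustration):

    # Minimal feature-engineering sketch with pandas.
    # The column names ("bid" and "ask") are hypothetical illustrations.
    import pandas as pd

    df = pd.DataFrame({"bid": [1.10, 2.05, 3.20],
                       "ask": [1.15, 2.10, 3.30]})

    # Engineered feature: the difference of two existing columns,
    # kept as an additional descriptor of each row.
    df["spread"] = df["ask"] - df["bid"]
    print(df)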

In the case of the whales, the winning team represented each sound clip in its spectrogram form and built features based on how well the spectrogram matched some example templates.
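
A rough sketch of that idea, assuming SciPy and stand-in arrays for the clip and the template (the winning team's actual pipeline was more elaborate), might look like this:

    # Rough sketch: turn a clip into a spectrogram and score how well it
    # matches an example template using 2-D cross-correlation.
    # The sample rate, clip, and template below are illustrative stand-ins.
    import numpy as np
    from scipy.signal import spectrogram, correlate2d

    fs = 2000                               # assumed sample rate in Hz
    clip = np.random.randn(2 * fs)          # stand-in for a 2-second sound clip
    _, _, spec = spectrogram(clip, fs=fs)   # frequency-by-time magnitude array

    template = np.random.randn(8, 8)        # stand-in for an example call template
    score = correlate2d(spec, template, mode="valid").max()
    print("template match score:", score)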

Rather than using data science experience, intuition, and trial-and-error, unsupervised feature learning techniques spend computational time automatically developing new ways of representing the data.

The takeaway is that deep learning excels in tasks where the basic unit, a single pixel, a single frequency, or a single word has very little meaning in and of itself, but the combination of such units has a useful meaning.

When presented with 60,000 digits, a neural network can learn that it is useful to look for loops and lines when trying to classify which digit it is looking at.
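
As a small illustration of that setup (scikit-learn's bundled 8x8 digit images stand in for the full 60,000-image MNIST set):

    # Train a small neural network to classify digit images.
    # load_digits provides 1,797 8x8 images, a small stand-in for MNIST.
    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    net = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
    net.fit(X_train, y_train)
    print("test accuracy:", net.score(X_test, y_test))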

Primate brains do a similar thing in the visual cortex, so the hope was that using more layers in a neural network could allow it to learn better models.

Finally, in 2006, three separate groups developed ways of overcoming the difficulties that many in the machine learning world encountered while trying to train deep neural networks. The leaders of these three groups are the fathers of the age of deep learning.

Using different techniques, each of these three groups was able to get these early layers to learn useful representations, which led to much more powerful neural networks.

Now that this problem has been fixed, we ask: what is it that these neural networks learn? This paper illustrates what a deep neural network is capable of learning, and I've included the above picture to make things clearer.

This application of deep neural networks has seen models that successfully learn useful representations of imagery, audio, written language, and even molecular activity.

The key takeaway is that the breakthroughs in 2006 have enabled deep neural networks that can automatically learn rich representations of data.

This unsupervised feature learning is proving extremely helpful in domains where individual data points are not very useful but many individual points taken together convey quite a bit of information.

Deep learning

Learning can be supervised, semi-supervised or unsupervised.[1][2][3] Deep learning architectures such as deep neural networks, deep belief networks and recurrent neural networks have been applied to fields including computer vision, speech recognition, natural language processing, audio recognition, social network filtering, machine translation, bioinformatics and drug design,[4] where they have produced results comparable to, and in some cases superior to,[5] human experts.[6] Deep learning models are vaguely inspired by information processing and communication patterns in biological nervous systems, yet they differ from the structural and functional properties of biological brains in ways that make them incompatible with neuroscience evidence.[7][8][9]

Deep learning is a class of machine learning algorithms that uses multiple layers of processing to learn representations of data.[10](pp199–200) Most modern deep learning models are based on an artificial neural network, although they can also include propositional formulas[11] or latent variables organized layer-wise in deep generative models, such as the nodes in deep belief networks and deep Boltzmann machines.

Examples of deep structures that can be trained in an unsupervised manner are neural history compressors[13] and deep belief networks.[1][14] Deep neural networks are generally interpreted in terms of the universal approximation theorem[15][16][17][18][19] or probabilistic inference.[10][11][1][2][14][20][21] The universal approximation theorem concerns the capacity of feedforward neural networks with a single hidden layer of finite size to approximate continuous functions.[15][16][17][18][19] In 1989, the first proof was published by George Cybenko for sigmoid activation functions[16] and was generalised to feed-forward multi-layer architectures in 1991 by Kurt Hornik.[17] The probabilistic interpretation[20] derives from the field of machine learning.
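
Returning to the universal approximation theorem for a moment: it concerns representational capacity rather than learnability, but a toy sketch of the single-hidden-layer setting (the target function and network size here are arbitrary choices) looks like this:

    # Toy sketch of the single-hidden-layer setting: fit a network with one
    # finite hidden layer to a continuous target function, here sin(x).
    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(2000, 1))
    y = np.sin(X).ravel()

    net = MLPRegressor(hidden_layer_sizes=(50,), activation="tanh",
                       max_iter=5000, random_state=0)
    net.fit(X, y)
    print("mean absolute error:", np.abs(net.predict(X) - y).mean())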

More specifically, the probabilistic interpretation considers the activation nonlinearity as a cumulative distribution function.[20] The probabilistic interpretation led to the introduction of dropout as a regularizer in neural networks.[22] It was introduced by researchers including Hopfield, Widrow and Narendra, and popularized in surveys such as the one by Bishop.[23]

The term Deep Learning was introduced to the machine learning community by Rina Dechter in 1986,[24][13] and to artificial neural networks by Igor Aizenberg and colleagues in 2000, in the context of Boolean threshold neurons.[25][26] The first general, working learning algorithm for supervised, deep, feedforward, multilayer perceptrons was published by Alexey Ivakhnenko and Lapa in 1965.[27] A 1971 paper described a deep network with 8 layers trained by the group method of data handling algorithm.[28] Other working deep learning architectures, specifically those built for computer vision, began with the Neocognitron introduced by Kunihiko Fukushima in 1980.[29] In 1989, Yann LeCun et al. applied backpropagation to a deep network for the task of recognizing handwritten ZIP codes on mail.
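
Dropout, mentioned above as the regularizer motivated by the probabilistic interpretation, amounts to randomly zeroing hidden activations during training. A minimal "inverted dropout" sketch in NumPy (the keep probability and activation array are placeholders):

    # Inverted dropout: zero each activation with probability 1 - keep_prob,
    # then rescale the survivors so the expected activation is unchanged.
    import numpy as np

    def dropout(activations, keep_prob=0.8, rng=np.random.default_rng(0)):
        mask = rng.random(activations.shape) < keep_prob
        return activations * mask / keep_prob

    h = np.ones((4, 5))   # placeholder hidden-layer activations
    print(dropout(h))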

Each layer in the feature extraction module extracted features of growing complexity relative to the previous layer.[38] In 1995, Brendan Frey demonstrated that it was possible to train (over two days) a network containing six fully connected layers and several hundred hidden units using the wake-sleep algorithm, co-developed with Peter Dayan and Hinton.[39] Many factors contribute to the slow training speed, including the vanishing gradient problem analyzed in 1991 by Sepp Hochreiter.[40][41] Simpler models that use task-specific handcrafted features such as Gabor filters and support vector machines (SVMs) were a popular choice in the 1990s and 2000s, because of ANNs' computational cost and a lack of understanding of how the brain wires its biological networks.

In 2003, LSTM started to become competitive with traditional speech recognizers on certain tasks.[51] Later it was combined with connectionist temporal classification (CTC)[52] in stacks of LSTM RNNs.[53] In 2015, Google's speech recognition reportedly experienced a dramatic performance jump of 49% through CTC-trained LSTM, which they made available through Google Voice Search.[54] In 2006, publications by Geoff Hinton, Ruslan Salakhutdinov, Osindero and Teh[55] [56][57] showed how a many-layered feedforward neural network could be effectively pre-trained one layer at a time, treating each layer in turn as an unsupervised restricted Boltzmann machine, then fine-tuning it using supervised backpropagation.[58] The papers referred to learning for deep belief nets.
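
A loose sketch of the layer-wise idea, using scikit-learn's BernoulliRBM as the unsupervised stage with a supervised classifier on top (the joint fine-tuning by backpropagation described in the 2006 papers is not shown):

    # Unsupervised pre-training followed by a supervised stage:
    # an RBM learns features, then a classifier is trained on top of them.
    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression
    from sklearn.neural_network import BernoulliRBM
    from sklearn.pipeline import Pipeline

    X, y = load_digits(return_X_y=True)
    X = X / 16.0   # scale pixel values to [0, 1] for the Bernoulli RBM

    model = Pipeline([
        ("rbm", BernoulliRBM(n_components=64, learning_rate=0.05,
                             n_iter=20, random_state=0)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    model.fit(X, y)
    print("training accuracy:", model.score(X, y))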

It was believed that pre-training DNNs using generative models of deep belief nets (DBN) would overcome the main difficulties of neural nets.[69] However, it was discovered that replacing pre-training with large amounts of training data for straightforward backpropagation, when using DNNs with large, context-dependent output layers, produced error rates dramatically lower than those of the then-state-of-the-art Gaussian mixture model (GMM)/Hidden Markov Model (HMM) systems, and also than more advanced generative model-based systems.[59][70] The nature of the recognition errors produced by the two types of systems was characteristically different,[71][68] offering technical insights into how to integrate deep learning into the existing highly efficient, run-time speech decoding system deployed by all major speech recognition systems.[10][72][73] Analysis around 2009-2010 contrasted the GMM (and other generative speech models) with DNN models.

While there, Ng determined that GPUs could increase the speed of deep-learning systems by about 100 times.[79] In particular, GPUs are well-suited for the matrix/vector math involved in machine learning.[80][81] GPUs speed up training algorithms by orders of magnitude, reducing running times from weeks to days.[82][83] Specialized hardware and algorithm optimizations can be used for efficient processing.[84] In 2012, a team led by Dahl won the 'Merck Molecular Activity Challenge' using multi-task deep neural networks to predict the biomolecular target of one drug.[85][86] In 2014, Hochreiter's group used deep learning to detect off-target and toxic effects of environmental chemicals in nutrients, household products and drugs and won the 'Tox21 Data Challenge' of NIH, FDA and NCATS.[87][88][89] Significant additional impacts in image or object recognition were felt from 2011 to 2012.

Error rates for these early results, measured as percent phone error rates (PER), have been summarized over the past 20 years.[clarification needed] The debut of DNNs for speaker recognition in the late 1990s and for speech recognition around 2009-2011, and of LSTM around 2003-2007, accelerated progress in eight major areas.[10][74][72] All major commercial speech recognition systems (e.g., Microsoft Cortana, Xbox, Skype Translator, Amazon Alexa, Google Now, Apple Siri, Baidu and iFlyTek voice search, and a range of Nuance speech products) are based on deep learning.[10][121][122][123]

DNNs have proven themselves capable, for example, of a) identifying the style period of a given painting, b) 'capturing' the style of a given painting and applying it in a visually pleasing manner to an arbitrary photograph, and c) generating striking imagery based on random visual input fields.[127][128] Neural networks have been used for implementing language models since the early 2000s.[102][129] LSTM helped to improve machine translation and language modeling.[103][104][105] Other key techniques in this field are negative sampling[130] and word embedding.

A compositional vector grammar can be thought of as a probabilistic context-free grammar (PCFG) implemented by an RNN.[131] Recursive auto-encoders built atop word embeddings can assess sentence similarity and detect paraphrasing.[131] Deep neural architectures provide the best results for constituency parsing,[132] sentiment analysis,[133] information retrieval,[134][135] spoken language understanding,[136] machine translation,[103][137] contextual entity linking,[137] writing style recognition,[138] text classification[98] and others.[139] Google Translate (GT) uses a large end-to-end long short-term memory network.[140][141][142][143][144][145] GNMT uses an example-based machine translation method in which the system 'learns from millions of examples.'[141] It translates 'whole sentences at a time, rather than pieces.'

These failures are caused by insufficient efficacy (on-target effect), undesired interactions (off-target effects), or unanticipated toxic effects.[149][150] Research has explored the use of deep learning to predict biomolecular targets,[85][86] as well as off-target and toxic effects of environmental chemicals in nutrients, household products and drugs.[87][88][89] AtomNet is a deep learning system for structure-based rational drug design.[151] AtomNet was used to predict novel candidate biomolecules for disease targets such as the Ebola virus[152] and multiple sclerosis.[153][154] Deep reinforcement learning has been used to approximate the value of possible direct marketing actions, defined in terms of RFM variables.

An autoencoder ANN was used in bioinformatics to predict gene ontology annotations and gene-function relationships.[158] In medical informatics, deep learning was used to predict sleep quality based on data from wearables[159][160] and to predict health complications from electronic health record data.[161] Deep learning has also shown efficacy in healthcare.[162][163] Finding the appropriate mobile audience for mobile advertising is always challenging, since many data points must be considered and assimilated before a target segment can be created and used in ad serving by any ad server.[164][165] Deep learning has been used to interpret large, many-dimensional advertising datasets.

On the one hand, several variants of the backpropagation algorithm have been proposed in order to increase its processing realism.[172][173] Other researchers have argued that unsupervised forms of deep learning, such as those based on hierarchical generative models and deep belief networks, may be closer to biological reality.[174][175] In this respect, generative neural network models have been related to neurobiological evidence about sampling-based processing in the cerebral cortex.[176] Although a systematic comparison between the human brain organization and the neuronal encoding in deep networks has not yet been established, several analogies have been reported.

'... systems, like Watson (...), use techniques like deep learning as just one element in a very complicated ensemble of techniques, ranging from the statistical technique of Bayesian inference to deductive reasoning.'[189] As an alternative to this emphasis on the limits of deep learning, one author speculated that it might be possible to train a machine vision stack to perform the sophisticated task of discriminating between 'old master' and amateur figure drawings, and hypothesized that such a sensitivity might represent the rudiments of a non-trivial machine empathy.[190] This same author proposed that this would be in line with anthropology, which identifies a concern with aesthetics as a key element of behavioral modernity.[191] In further reference to the idea that artistic sensitivity might inhere within relatively low levels of the cognitive hierarchy, a published series of graphic representations of the internal states of deep (20-30 layer) neural networks attempting to discern, within essentially random data, the images on which they were trained[192] demonstrates a visual appeal: the original research notice received well over 1,000 comments, and was for a time the most frequently accessed article on The Guardian's[193] web site.

Some deep learning architectures display problematic behaviors,[194] such as confidently classifying unrecognizable images as belonging to a familiar category of ordinary images[195] and misclassifying minuscule perturbations of correctly classified images.[196] Goertzel hypothesized that these behaviors are due to limitations in their internal representations and that these limitations would inhibit integration into heterogeneous multi-component AGI architectures.[194] These issues may possibly be addressed by deep learning architectures that internally form states homologous to image-grammar[197] decompositions of observed entities and events.[194] Learning a grammar (visual or linguistic) from training data would be equivalent to restricting the system to commonsense reasoning that operates on concepts in terms of grammatical production rules, and is a basic goal of both human language acquisition[198] and AI.[199] As deep learning moves from the lab into the world, research and experience show that artificial neural networks are vulnerable to hacks and deception.

Feature learning

In machine learning, feature learning or representation learning[1] is a set of techniques that allows a system to automatically discover the representations needed for feature detection or classification from raw data.

For example, a supervised dictionary learning technique[6] applied dictionary learning on classification problems by jointly optimizing the dictionary elements, weights for representing data points, and parameters of the classifier based on the input data.

In particular, a minimization problem is formulated, where the objective function consists of the classification error, the representation error, an L1 regularization on the representing weights for each data point (to enable sparse representation of data), and an L2 regularization on the parameters of the classifier.
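
In rough symbols (the notation below is mine, not the original authors'), that joint objective has the form

    \min_{D,\,\{w_i\},\,\theta} \; \sum_{i=1}^{n} \Big[ \ell\big(y_i, f(w_i;\theta)\big)
        + \lVert x_i - D w_i \rVert_2^2
        + \lambda \lVert w_i \rVert_1 \Big]
        + \mu \lVert \theta \rVert_2^2,

where \ell is the classification loss, D is the dictionary, w_i are the representing weights for data point x_i, and \theta are the classifier parameters.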

The simplest approach is to add k binary features to each sample, where each feature j has value one iff the jth centroid learned by k-means is the closest to the sample under consideration.[3] It is also possible to use the distances to the clusters as features, perhaps after transforming them through a radial basis function (a technique that has been used to train RBF networks[9]).
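
A small sketch of both variants with scikit-learn, using synthetic data in place of a real dataset:

    # k-means feature learning on synthetic data:
    # (1) k binary features, one per centroid (1 iff that centroid is closest);
    # (2) distances to all centroids as continuous features.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))   # synthetic unlabeled samples

    k = 8
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

    closest = km.predict(X)         # index of the nearest centroid per sample
    one_hot = np.eye(k)[closest]    # variant 1: k binary indicator features
    distances = km.transform(X)     # variant 2: distance to each centroid
    print(one_hot.shape, distances.shape)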

Coates and Ng note that certain variants of k-means behave similarly to sparse coding algorithms.[10] In a comparative evaluation of unsupervised feature learning methods, Coates, Lee and Ng found that k-means clustering with an appropriate transformation outperforms the more recently invented auto-encoders and RBMs on an image classification task.[3] K-means also improves performance in the domain of NLP, specifically for named-entity recognition;[11] there, it competes with Brown clustering, as well as with distributed word representations (also known as neural word embeddings).[8] Principal component analysis (PCA) is often used for dimension reduction.

Given an unlabeled set of n input data vectors, PCA generates p (which is much smaller than the dimension of the input data) right singular vectors corresponding to the p largest singular values of the data matrix, where the kth row of the data matrix is the kth input data vector shifted by the sample mean of the input (i.e., subtracting the sample mean from the data vector).
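
Equivalently, in NumPy terms (a minimal sketch on synthetic data):

    # Minimal PCA sketch: center the data, take the SVD, and keep the p right
    # singular vectors associated with the p largest singular values.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 20))          # n = 100 samples in 20 dimensions
    p = 3

    X_centered = X - X.mean(axis=0)         # subtract the sample mean
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:p]                     # the p right singular vectors
    X_reduced = X_centered @ components.T   # project onto those p directions
    print(X_reduced.shape)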

The first step is 'neighbor preserving': each input data point Xi is reconstructed as a weighted sum of its K nearest neighbor data points, and the optimal weights are found by minimizing the average squared reconstruction error (i.e., the difference between an input point and its reconstruction) under the constraint that the weights associated with each point sum to one.
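
In symbols (my notation; this is the neighbor-reconstruction step of locally linear embedding), the first step solves roughly

    \min_{W} \; \sum_{i} \Big\lVert x_i - \sum_{j \in N(i)} W_{ij}\, x_j \Big\rVert^2
    \quad \text{subject to} \quad \sum_{j \in N(i)} W_{ij} = 1 \ \text{for each } i,

where N(i) indexes the K nearest neighbors of x_i and W_{ij} are the reconstruction weights.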

The reconstruction weights obtained in the first step capture the 'intrinsic geometric properties' of a neighborhood in the input data.[13] It is assumed that the original data lie on a smooth lower-dimensional manifold, and that the 'intrinsic geometric properties' captured by the weights of the original data also hold on the manifold.

Aharon et al. proposed the K-SVD algorithm for learning a dictionary of elements that enables sparse representation.[16] The hierarchical architecture of the biological neural system inspires deep learning architectures for feature learning by stacking multiple layers of learning nodes.[17] These architectures are often designed based on the assumption of distributed representation: observed data is generated by the interactions of many different factors on multiple levels.

Restricted Boltzmann machines (RBMs) are often used as a building block for multilayer learning architectures.[3][18] An RBM can be represented by an undirected bipartite graph consisting of a group of binary hidden variables, a group of visible variables, and edges connecting the hidden and visible nodes.
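
The joint configuration of visible units v and hidden units h is typically scored by an energy function of the standard form (notation mine, not from the text above)

    E(v, h) = -\,a^{\top} v \;-\; b^{\top} h \;-\; v^{\top} W h,

where W holds the weights on the edges between visible and hidden nodes and a, b are bias vectors; the model assigns each configuration a probability proportional to e^{-E(v, h)}.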

An example is provided by Hinton and Salakhutdinov,[18] where the encoder uses raw data (e.g., an image) as input and produces a feature or representation as output, and the decoder uses the extracted feature from the encoder as input and reconstructs the original raw input data as output.
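
A compact sketch of that encoder/decoder pairing, here with Keras and random stand-in "images" (Hinton and Salakhutdinov's original model was deeper and used layer-wise pre-training):

    # Compact autoencoder sketch: the encoder maps raw input to a short code,
    # and the decoder reconstructs the input from that code.
    # Random vectors stand in for real images.
    import numpy as np
    from tensorflow import keras

    X = np.random.rand(1000, 784).astype("float32")    # stand-in flattened images

    autoencoder = keras.Sequential([
        keras.Input(shape=(784,)),
        keras.layers.Dense(64, activation="relu"),      # encoder: 64-dim feature
        keras.layers.Dense(784, activation="sigmoid"),  # decoder: reconstruction
    ])
    autoencoder.compile(optimizer="adam", loss="mse")
    autoencoder.fit(X, X, epochs=3, batch_size=64, verbose=0)
    print("reconstruction MSE:", autoencoder.evaluate(X, X, verbose=0))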


Supervised neural networks are trained to produce desired outputs in response to sample inputs, making them particularly well suited for modeling and controlling dynamic systems, classifying noisy data, and predicting future events.

Neural Network Toolbox includes four types of supervised networks: feedforward, radial basis, dynamic, and learning vector quantization.

Supported feedforward networks include feedforward backpropagation, cascade-forward backpropagation, feedforward input-delay backpropagation, linear, and perceptron networks.

Lecture 6 | Training Neural Networks I

In Lecture 6 we discuss many practical issues for training modern neural networks. We discuss different activation functions, the importance of data ...

What is a Neural Network - Ep. 2 (Deep Learning SIMPLIFIED)

With plenty of machine learning tools currently available, why would you ever choose an artificial neural network over all the rest? This clip and the next could ...

Deep Learning with MATLAB: Using Feature Extraction with Neural Networks in MATLAB

This demo uses MATLAB® to train a SVM classifier with features extracted, using a pretrained CNN for classifying images of four different animal types: cat, dog, ...

Gradient descent, how neural networks learn | Chapter 2, deep learning


The Best Way to Prepare a Dataset Easily

In this video, I go over the 3 steps you need to prepare a dataset to be fed into a machine learning model (selecting the data, processing it, and transforming it).

Lecture 7 | Training Neural Networks II

Lecture 7 continues our discussion of practical issues for training neural networks. We discuss different update rules commonly used to optimize neural networks ...

Neural networks [7.3] : Deep learning - unsupervised pre-training

Processing our own Data - Deep Learning with Neural Networks and TensorFlow part 5

Welcome to part five of the Deep Learning with Neural Networks and TensorFlow tutorials. Now that we've covered a simple example of an artificial neural ...

But what *is* a Neural Network? | Chapter 1, deep learning


Andrew Ng: Deep Learning, Self-Taught Learning and Unsupervised Feature Learning

Graduate Summer School: Deep Learning, Feature Learning. "Deep Learning, Self-Taught Learning and Unsupervised Feature Learning" (Part 1, Slides 1-68; Part ...