AI News, The Input Layer, Hidden Layers, and Output Layer


January 27, 2015: Based on the feedback from commenters, I have updated the source code in the download to include the original MNIST dataset!

Do not install the current version without first checking out the 0.5b1 version! In the future I will post an update on how to use the updated nolearn package!

Now I’m not exactly a wagering man, but I bet that after my long-winded rant on getting off the deep learning bandwagon, the last thing you would expect me to do is write a post on Deep Learning, right?

The models included in nolearn have implemented the fit and predict functions just like scikit-learn, and the output predictions are even compatible with the scikit-learn metric functions.

So in this blog post we’ll review an example of using a Deep Belief Network to classify images from the MNIST dataset, a dataset consisting of handwritten digits.

In general, the goal of deep learning is to take low level inputs (feature vectors) and then construct higher and higher level abstract “concepts”.

The assumption here is that the data follows some sort of underlying pattern generated by many interactions between different nodes on many different layers of the network.

The seminal 2006 work of Hinton, Osindero, and Teh demonstrated that each of the hidden layers in a neural net can be treated as an unsupervised Restricted Boltzmann Machine, with a supervised back-propagation step for fine-tuning.

This efficiency was pushed further in the following years, when Deep Nets were trained on GPUs rather than CPUs, reducing training time by over an order of magnitude.

If we use the raw pixel intensities for the images, our feature vector would be of length 28 x 28 = 784, thus there would be 784 nodes in the input layer.
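As a rough illustration of where the 784 comes from, the sketch below flattens a 28 x 28 grid of pixel intensities into one feature vector in row-major order. The helper name `flatten_image` and the blank toy image are assumptions made for this example; they are not taken from the original post.

```python
def flatten_image(image):
    """Flatten a 2-D grid of pixel intensities into a 1-D feature vector (row-major)."""
    return [pixel for row in image for pixel in row]

# A blank 28 x 28 image standing in for one MNIST digit (toy data, not real MNIST).
image = [[0] * 28 for _ in range(28)]
image[5][10] = 255  # light up a single pixel at row 5, column 10

features = flatten_image(image)
print(len(features))          # 784, one input node per pixel
print(features[5 * 28 + 10])  # 255, the pixel we set above
```

Each position in `features` would correspond to one node in the input layer.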

In the simplest terms, each hidden layer is an unsupervised Restricted Boltzmann Machine where the output of each RBM in the hidden layer sequence is used as input to the next.

Of course, we could always sort the output probabilities and choose all class labels that fall within some epsilon of the largest probability.

My goal here is simply to take the example, tweak it slightly, and throw in a few extra demonstrations.
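One possible reading of the epsilon idea is sketched below: rank the labels by probability and keep every label whose probability is within `epsilon` of the best one. The function name `labels_within_epsilon`, the threshold of 0.05, and the probability values are all made up for illustration.

```python
def labels_within_epsilon(probabilities, epsilon=0.05):
    """Return class labels, most likely first, whose probability is within epsilon of the maximum."""
    best = max(probabilities)
    ranked = sorted(range(len(probabilities)),
                    key=lambda label: probabilities[label],
                    reverse=True)
    return [label for label in ranked if best - probabilities[label] <= epsilon]

# Hypothetical output-layer probabilities for the digits 0-9.
probs = [0.01, 0.02, 0.01, 0.40, 0.01, 0.02, 0.01, 0.08, 0.42, 0.02]
candidates = labels_within_epsilon(probs, epsilon=0.05)
print(candidates)  # [8, 3]: the network is torn between an 8 and a 3
```

With `epsilon=0.0` the function falls back to the usual single most probable label (plus any exact ties).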

We’ll import train_test_split (to generate our training and testing splits of the MNIST dataset) and classification_report (to display a nicely formatted table of accuracies) from the scikit-learn package.

Let’s go ahead and download the MNIST dataset. We make a call to the fetch_mldata function on Line 13, which downloads the original MNIST dataset from the repository.

If you take the time to examine the data, you’ll notice that each feature vector contains 784 entries in the range [0, 255].
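Before training, raw intensities in [0, 255] are typically scaled to [0, 1] and the samples divided into training and testing sets. The sketch below does both in plain Python as a stand-in for scikit-learn's train_test_split. The helper names, the 0.33 test fraction, the seed, and the synthetic data are assumptions for this example.

```python
import random

def scale_pixels(vector):
    """Map raw intensities in [0, 255] to floats in [0.0, 1.0]."""
    return [value / 255.0 for value in vector]

def split_train_test(samples, labels, test_fraction=0.33, seed=42):
    """Shuffle the indices once, then carve off a held-out test portion."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(samples) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train_x = [samples[i] for i in train_idx]
    train_y = [labels[i] for i in train_idx]
    test_x = [samples[i] for i in test_idx]
    test_y = [labels[i] for i in test_idx]
    return train_x, train_y, test_x, test_y

# Synthetic stand-in for MNIST: 100 vectors of 784 values in [0, 255].
raw = [[(i * 7 + j) % 256 for j in range(784)] for i in range(100)]
labels = [i % 10 for i in range(100)]

scaled = [scale_pixels(vector) for vector in raw]
train_x, train_y, test_x, test_y = split_train_test(scaled, labels)
print(len(train_x), len(test_x))  # 67 33
```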

Background pixels are black (0) whereas foreground pixels appear to be lighter shades of gray or white.

We’ll want to have an input node for each entry in our feature vector list, so we’ll specify the length of the feature vector for this value.

Finally, the output of the 300 node hidden layer will be fed into the output layer, which consists of an output for each of the class labels.
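To get a feel for the size of such a network, the sketch below counts the trainable parameters of a 784-300-10 stack. It assumes fully connected layers with one bias per non-input unit and 10 output classes for the digits 0-9. The visible-unit biases used during RBM pre-training are not counted, and the helper name is hypothetical.

```python
def count_parameters(layer_sizes):
    """Count weights plus biases for a fully connected stack of layers."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # one weight per connection between adjacent layers
        total += n_out         # one bias per unit in the receiving layer
    return total

layer_sizes = [784, 300, 10]  # input, hidden, output
print(count_parameters(layer_sizes))  # 238510
```

Almost all of the parameters sit between the input and the hidden layer (784 x 300 = 235,200 weights).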

We can then define our learn_rates, which is the learning rate of the algorithm; the decay of the learn rate (learn_rate_decays); the number of epochs, or iterations over the training data; and the verbosity level.

Both learn_rates and learn_rate_decays can be specified as a single floating point value or a list of floating point values.
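The effect of a decay factor can be pictured with a small schedule calculation. This sketch assumes the decay is applied multiplicatively once per epoch, which is one common convention and may not match the library's exact behavior. The function name and the values 0.3 and 0.9 are illustrative.

```python
def learning_rate_schedule(learn_rate, learn_rate_decay, epochs):
    """Return the learning rate used in each epoch under per-epoch multiplicative decay."""
    return [learn_rate * learn_rate_decay ** epoch for epoch in range(epochs)]

rates = learning_rate_schedule(learn_rate=0.3, learn_rate_decay=0.9, epochs=10)
for epoch, rate in enumerate(rates, start=1):
    print("epoch %2d: learn rate %.4f" % (epoch, rate))
```

Under this assumption the rate shrinks geometrically, so early epochs take large steps and later epochs make finer adjustments.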

Now that our Deep Belief Network is trained, let’s go ahead and evaluate it. Here we make a call to the predict method of the network on Line 33, which takes our testing data and makes predictions regarding which digit each image contains. If you have worked with scikit-learn at all, then this should feel very natural and comfortable.

Finally, I thought it might be interesting to inspect images individually rather than in aggregate as a further demonstration of the network. On Line 37 we loop over 10 randomly chosen feature vectors from the test data.
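To make the evaluation step concrete, here is a minimal sketch of the kind of per-class figures a classification report summarizes, computed by hand from true and predicted labels. It is not scikit-learn's implementation. The function name and the toy labels are invented for illustration.

```python
from collections import Counter

def per_class_report(y_true, y_pred):
    """Return {label: (precision, recall)} for every label present in y_true."""
    actual = Counter(y_true)
    predicted = Counter(y_pred)
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    report = {}
    for label in sorted(actual):
        precision = true_pos[label] / predicted[label] if predicted[label] else 0.0
        recall = true_pos[label] / actual[label]
        report[label] = (precision, recall)
    return report

# Toy labels standing in for test-set digits and network predictions.
y_true = [0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 2, 2, 2, 1]

report = per_class_report(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
for label, (precision, recall) in report.items():
    print("class %d: precision %.2f, recall %.2f" % (label, precision, recall))
print("accuracy %.2f" % accuracy)
```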

Since our data is in the range [0, 1.0], we first multiply by 255 to put it back in the range [0, 255], change the shape to be a 28 x 28 pixel image, and then change the data type from floating point to an unsigned 8-bit integer.
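The conversion described here can be sketched in plain Python as follows. The helper name `to_uint8_image` and the toy vector are assumptions. Truncating with `int` mirrors what a cast to an unsigned 8-bit integer would do for non-negative values.

```python
def to_uint8_image(vector, width=28, height=28):
    """Rescale [0, 1] floats to [0, 255] integers and reshape into rows of pixels."""
    pixels = [min(255, max(0, int(value * 255))) for value in vector]
    return [pixels[row * width:(row + 1) * width] for row in range(height)]

# Toy feature vector: all background except two pixels.
vector = [0.0] * 784
vector[0] = 1.0    # row 0, column 0
vector[29] = 0.5   # row 1, column 1

image = to_uint8_image(vector)
print(len(image), len(image[0]))  # 28 28
print(image[0][0], image[1][1])   # 255 127
```

Because of truncation, a value that was scaled down and back up may land one intensity level below the original, which is usually invisible when the image is displayed.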

Fire up a shell, navigate to your file, and run the script. If all goes well, you should see output similar to mine, showing that our Deep Belief Network is trained over 10 epochs (iterations over the training data).


Deep belief network

In machine learning, a deep belief network (DBN) is a generative graphical model, or alternatively a class of deep neural network, composed of multiple layers of latent variables ('hidden units'), with connections between the layers but not between units within each layer.[1] When trained on a set of examples without supervision, a DBN can learn to probabilistically reconstruct its inputs.

The layers then act as feature detectors.[1] After this learning step, a DBN can be further trained with supervision to perform classification.[2] DBNs can be viewed as a composition of simple, unsupervised networks such as restricted Boltzmann machines (RBMs)[1] or autoencoders,[3] where each sub-network's hidden layer serves as the visible layer for the next.

An RBM is an undirected, generative energy-based model with a 'visible' input layer and a hidden layer and connections between but not within layers.

This composition leads to a fast, layer-by-layer unsupervised training procedure, where contrastive divergence is applied to each sub-network in turn, starting from the 'lowest' pair of layers (the lowest visible layer is a training set).

Teh's observation[2] that DBNs can be trained greedily, one layer at a time, led to one of the first effective deep learning algorithms.[4]:6 Overall, there are many attractive implementations and uses of DBNs in real-life applications and scenarios (e.g., electroencephalography[5], drug discovery[6]).

The training method for RBMs proposed by Geoffrey Hinton for use with training 'Product of Expert' models is called contrastive divergence (CD).[7] CD provides an approximation to the maximum likelihood method that would ideally be applied for learning the weights.[8][9] In training a single RBM, weight updates are performed with gradient descent via the following equation:

$$w_{ij}(t+1) = w_{ij}(t) + \eta \frac{\partial \log(p(v))}{\partial w_{ij}}$$

where $p(v)$ is the probability of a visible vector, which is given by

$$p(v) = \frac{1}{Z} \sum_{h} e^{-E(v,h)}$$

$Z$ is the partition function (used for normalizing) and $E(v,h)$ is the energy function assigned to the state of the network.

A lower energy indicates the network is in a more 'desirable' configuration.

The gradient $\frac{\partial \log(p(v))}{\partial w_{ij}}$ has the simple form

$$\langle v_{i}h_{j}\rangle_{\text{data}} - \langle v_{i}h_{j}\rangle_{\text{model}}$$

where $\langle \cdots \rangle_{p}$ represent averages with respect to distribution $p$.

The issue arises in sampling $\langle v_{i}h_{j}\rangle_{\text{model}}$ because this requires extended alternating Gibbs sampling.

CD replaces this step by running alternating Gibbs sampling for $n$ steps (values of $n=1$ perform well). After $n$ steps, the data are sampled and that sample is used in place of $\langle v_{i}h_{j}\rangle_{\text{model}}$.
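The CD procedure works as follows:[8] the visible units are initialized to a training vector; the hidden units are updated in parallel given the visible units; the visible units are then updated in parallel given the hidden units (the 'reconstruction' step); the hidden units are re-updated given the reconstructed visible units; and the weights are updated in proportion to $\langle v_{i}h_{j}\rangle_{\text{data}} - \langle v_{i}h_{j}\rangle_{\text{reconstruction}}$.

Once an RBM is trained, another RBM is 'stacked' atop it, taking its input from the final trained layer.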

The new visible layer is initialized to a training vector, and values for the units in the already-trained layers are assigned using the current weights and biases.

The new RBM is then trained with the procedure above.

This whole process is repeated until the desired stopping criterion is met.[10] Although the approximation of CD to maximum likelihood is crude (does not follow the gradient of any function), it is empirically effective.[8]
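A single contrastive divergence update with n = 1 for a small binary RBM can be sketched as follows. The code makes several assumptions not stated in the text: sigmoid units, hidden states sampled in the first pass, probabilities rather than samples for the reconstruction and the statistics, and bias updates alongside the weight update. All function and variable names are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hidden_given_visible(v, W, b, rng):
    """Return p(h_j = 1 | v) for every hidden unit, plus sampled binary states."""
    probs = [sigmoid(b[j] + sum(v[i] * W[i][j] for i in range(len(v))))
             for j in range(len(b))]
    states = [1.0 if rng.random() < p else 0.0 for p in probs]
    return probs, states

def visible_given_hidden(h, W, a):
    """Return p(v_i = 1 | h) for every visible unit (the reconstruction)."""
    return [sigmoid(a[i] + sum(h[j] * W[i][j] for j in range(len(h))))
            for i in range(len(a))]

def cd1_step(v0, W, a, b, eta, rng):
    """Apply one CD-1 update in place and return the reconstruction."""
    h0_probs, h0_states = hidden_given_visible(v0, W, b, rng)  # data phase
    v1_probs = visible_given_hidden(h0_states, W, a)           # reconstruction
    h1_probs, _ = hidden_given_visible(v1_probs, W, b, rng)    # re-update hidden units
    for i in range(len(a)):
        for j in range(len(b)):
            # <v_i h_j>_data minus <v_i h_j>_reconstruction
            W[i][j] += eta * (v0[i] * h0_probs[j] - v1_probs[i] * h1_probs[j])
    for i in range(len(a)):
        a[i] += eta * (v0[i] - v1_probs[i])
    for j in range(len(b)):
        b[j] += eta * (h0_probs[j] - h1_probs[j])
    return v1_probs

rng = random.Random(0)
n_visible, n_hidden = 4, 2
W = [[0.0] * n_hidden for _ in range(n_visible)]  # weights start at zero
a = [0.0] * n_visible                             # visible biases
b = [0.0] * n_hidden                              # hidden biases

reconstruction = cd1_step([1.0, 0.0, 1.0, 0.0], W, a, b, eta=0.1, rng=rng)
print(W[0], W[1])
```

Starting from all-zero weights, every probability in the first step is 0.5, so weights attached to active visible units rise by eta x 0.25 and those attached to inactive units fall by the same amount. Stacking would repeat this training on the hidden activations of the trained layer.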

Deep learning

Learning can be supervised, semi-supervised or unsupervised.[1][2][3]

Deep learning architectures such as deep neural networks, deep belief networks and recurrent neural networks have been applied to fields including computer vision, speech recognition, natural language processing, audio recognition, social network filtering, machine translation, bioinformatics and drug design,[4] where they have produced results comparable to and in some cases superior[5] to human experts.[6]

Deep learning models are vaguely inspired by information processing and communication patterns in biological nervous systems, yet they differ in various ways from the structural and functional properties of biological brains, which makes them incompatible with neuroscience evidence.[7][8][9]

Deep learning is a class of machine learning algorithms that:[10](pp199–200)

Most modern deep learning models are based on an artificial neural network, although they can also include propositional formulas[11] or latent variables organized layer-wise in deep generative models such as the nodes in Deep Belief Networks and Deep Boltzmann Machines.

Examples of deep structures that can be trained in an unsupervised manner are neural history compressors[13] and deep belief networks.[1][14] Deep neural networks are generally interpreted in terms of the universal approximation theorem[15][16][17][18][19] or probabilistic inference.[10][11][1][2][14][20][21] The universal approximation theorem concerns the capacity of feedforward neural networks with a single hidden layer of finite size to approximate continuous functions.[15][16][17][18][19] In 1989, the first proof was published by George Cybenko for sigmoid activation functions[16] and was generalised to feed-forward multi-layer architectures in 1991 by Kurt Hornik.[17] The probabilistic interpretation[20] derives from the field of machine learning.

More specifically, the probabilistic interpretation considers the activation nonlinearity as a cumulative distribution function.[20] The probabilistic interpretation led to the introduction of dropout as regularizer in neural networks.[22] The probabilistic interpretation was introduced by researchers including Hopfield, Widrow and Narendra and popularized in surveys such as the one by Bishop.[23] The term Deep Learning was introduced to the machine learning community by Rina Dechter in 1986,[24][13] and to Artificial Neural Networks by Igor Aizenberg and colleagues in 2000, in the context of Boolean threshold neurons.[25][26] The first general, working learning algorithm for supervised, deep, feedforward, multilayer perceptrons was published by Alexey Ivakhnenko and Lapa in 1965.[27] A 1971 paper described a deep network with 8 layers trained by the group method of data handling algorithm.[28] Other deep learning working architectures, specifically those built for computer vision, began with the Neocognitron introduced by Kunihiko Fukushima in 1980.[29] In 1989, Yann LeCun et al.

Each layer in the feature extraction module extracted features with growing complexity regarding the previous layer.[38] In 1995, Brendan Frey demonstrated that it was possible to train (over two days) a network containing six fully connected layers and several hundred hidden units using the wake-sleep algorithm, co-developed with Peter Dayan and Hinton.[39] Many factors contribute to the slow speed, including the vanishing gradient problem analyzed in 1991 by Sepp Hochreiter.[40][41] Simpler models that use task-specific handcrafted features such as Gabor filters and support vector machines (SVMs) were a popular choice in the 1990s and 2000s, because of ANNs' computational cost and a lack of understanding of how the brain wires its biological networks.

In 2003, LSTM started to become competitive with traditional speech recognizers on certain tasks.[51] Later it was combined with connectionist temporal classification (CTC)[52] in stacks of LSTM RNNs.[53] In 2015, Google's speech recognition reportedly experienced a dramatic performance jump of 49% through CTC-trained LSTM, which they made available through Google Voice Search.[54] In 2006, publications by Geoff Hinton, Ruslan Salakhutdinov, Osindero and Teh[55] [56][57] showed how a many-layered feedforward neural network could be effectively pre-trained one layer at a time, treating each layer in turn as an unsupervised restricted Boltzmann machine, then fine-tuning it using supervised backpropagation.[58] The papers referred to learning for deep belief nets.

It was believed that pre-training DNNs using generative models of deep belief nets (DBN) would overcome the main difficulties of neural nets.[69] However, it was discovered that replacing pre-training with large amounts of training data for straightforward backpropagation when using DNNs with large, context-dependent output layers produced error rates dramatically lower than then-state-of-the-art Gaussian mixture model (GMM)/Hidden Markov Model (HMM) and also than more-advanced generative model-based systems.[59][70] The nature of the recognition errors produced by the two types of systems was characteristically different,[71][68] offering technical insights into how to integrate deep learning into the existing highly efficient, run-time speech decoding system deployed by all major speech recognition systems.[10][72][73] Analysis around 2009-2010, contrasted the GMM (and other generative speech models) vs.

While there, Ng determined that GPUs could increase the speed of deep-learning systems by about 100 times.[79] In particular, GPUs are well-suited for the matrix/vector math involved in machine learning.[80][81] GPUs speed up training algorithms by orders of magnitude, reducing running times from weeks to days.[82][83] Specialized hardware and algorithm optimizations can be used for efficient processing.[84] In 2012, a team led by Dahl won the 'Merck Molecular Activity Challenge' using multi-task deep neural networks to predict the biomolecular target of one drug.[85][86] In 2014, Hochreiter's group used deep learning to detect off-target and toxic effects of environmental chemicals in nutrients, household products and drugs and won the 'Tox21 Data Challenge' of NIH, FDA and NCATS.[87][88][89] Significant additional impacts in image or object recognition were felt from 2011 to 2012.

The error rates, including these early results and measured as percent phone error rates (PER), have been summarized over the past 20 years.

The debut of DNNs for speaker recognition in the late 1990s and speech recognition around 2009-2011 and of LSTM around 2003-2007 accelerated progress in eight major areas.[10][74][72]

All major commercial speech recognition systems (e.g., Microsoft Cortana, Xbox, Skype Translator, Amazon Alexa, Google Now, Apple Siri, Baidu and iFlyTek voice search, and a range of Nuance speech products, etc.) are based on deep learning.[10][121][122][123]

DNNs have proven themselves capable, for example, of a) identifying the style period of a given painting, b) 'capturing' the style of a given painting and applying it in a visually pleasing manner to an arbitrary photograph, and c) generating striking imagery based on random visual input fields.[127][128] Neural networks have been used for implementing language models since the early 2000s.[102][129] LSTM helped to improve machine translation and language modeling.[103][104][105] Other key techniques in this field are negative sampling[130] and word embedding.

A compositional vector grammar can be thought of as a probabilistic context-free grammar (PCFG) implemented by an RNN.[131] Recursive auto-encoders built atop word embeddings can assess sentence similarity and detect paraphrasing.[131] Deep neural architectures provide the best results for constituency parsing,[132] sentiment analysis,[133] information retrieval,[134][135] spoken language understanding,[136] machine translation,[103][137] contextual entity linking,[137] writing style recognition,[138] text classification[98] and others.[139] Google Translate (GT) uses a large end-to-end long short-term memory network.[140][141][142][143][144][145] GNMT uses an example-based machine translation method in which the system 'learns from millions of examples.'[141] It translates 'whole sentences at a time, rather than pieces.'

These failures are caused by insufficient efficacy (on-target effect), undesired interactions (off-target effects), or unanticipated toxic effects.[149][150] Research has explored use of deep learning to predict biomolecular target,[85][86] off-target and toxic effects of environmental chemicals in nutrients, household products and drugs.[87][88][89] AtomNet is a deep learning system for structure-based rational drug design.[151] AtomNet was used to predict novel candidate biomolecules for disease targets such as the Ebola virus[152] and multiple sclerosis.[153][154] Deep reinforcement learning has been used to approximate the value of possible direct marketing actions, defined in terms of RFM variables.

An autoencoder ANN was used in bioinformatics to predict gene ontology annotations and gene-function relationships.[158] In medical informatics, deep learning was used to predict sleep quality based on data from wearables[159][160] and to predict health complications from electronic health record data.[161] Deep learning has also shown efficacy in healthcare.[162][163] Finding the appropriate mobile audience for mobile advertising is always challenging, since many data points must be considered and assimilated before a target segment can be created and used in ad serving by any ad server.[164][165] Deep learning has been used to interpret large, many-dimensioned advertising datasets.

On the one hand, several variants of the backpropagation algorithm have been proposed in order to increase its processing realism.[172][173] Other researchers have argued that unsupervised forms of deep learning, such as those based on hierarchical generative models and deep belief networks, may be closer to biological reality.[174][175] In this respect, generative neural network models have been related to neurobiological evidence about sampling-based processing in the cerebral cortex.[176] Although a systematic comparison between the human brain organization and the neuronal encoding in deep networks has not yet been established, several analogies have been reported.

systems, like Watson (...) use techniques like deep learning as just one element in a very complicated ensemble of techniques, ranging from the statistical technique of Bayesian inference to deductive reasoning.'[189] As an alternative to this emphasis on the limits of deep learning, one author speculated that it might be possible to train a machine vision stack to perform the sophisticated task of discriminating between 'old master' and amateur figure drawings, and hypothesized that such a sensitivity might represent the rudiments of a non-trivial machine empathy.[190] This same author proposed that this would be in line with anthropology, which identifies a concern with aesthetics as a key element of behavioral modernity.[191] In further reference to the idea that artistic sensitivity might inhere within relatively low levels of the cognitive hierarchy, a published series of graphic representations of the internal states of deep (20-30 layers) neural networks attempting to discern within essentially random data the images on which they were trained[192] demonstrate a visual appeal: the original research notice received well over 1,000 comments, and was the subject of what was for a time the most frequently accessed article on The Guardian's[193] web site.

Some deep learning architectures display problematic behaviors,[194] such as confidently classifying unrecognizable images as belonging to a familiar category of ordinary images[195] and misclassifying minuscule perturbations of correctly classified images.[196] Goertzel hypothesized that these behaviors are due to limitations in their internal representations and that these limitations would inhibit integration into heterogeneous multi-component AGI architectures.[194] These issues may possibly be addressed by deep learning architectures that internally form states homologous to image-grammar[197] decompositions of observed entities and events.[194] Learning a grammar (visual or linguistic) from training data would be equivalent to restricting the system to commonsense reasoning that operates on concepts in terms of grammatical production rules and is a basic goal of both human language acquisition[198] and AI.[199] As deep learning moves from the lab into the world, research and experience show that artificial neural networks are vulnerable to hacks and deception.
