AI News, Probabilistic models: restricted Boltzmann machine (RBM)

Probabilistic models: restricted Boltzmann machine (RBM)

Deep learning has become something of a buzzword in recent years with the explosion of 'big data', 'data science', and their derivatives mentioned in the media.

My goal is to give you a layman's understanding of what deep learning actually is, so you can follow some of my thesis research this year as well as mentally filter out news articles that sensationalize these buzzwords.

Think of recognizing a handwritten digit. It feels instant, but what it took for you to correctly recognize the digit is an impressive display of fitting smaller features together to make the whole: noticing contrasted edges to make lines, seeing a horizontal versus a vertical stroke, and piecing those parts together into the full digit.

Ultimately, this is what deep learning or representation learning is meant to do: discover multiple levels of features that work together to define increasingly abstract aspects of the data (in our case, from raw image pixels to lines to full-blown digits).

This post is going to be a rough summary of two main survey papers. Current machine learning algorithms' performance depends heavily on the particular features of the data chosen as inputs.

Choosing the correct feature representation of input data, or feature engineering, is a way that people can bring prior knowledge of a domain to increase an algorithm's computational performance and accuracy.

To move towards general artificial intelligence, algorithms need to be less dependent on this feature engineering and better learn to identify the explanatory factors of input data on their own.

It turns out that deep learning approaches can find useful abstract representations of data across many domains: they have had great commercial success powering much of Google's and Microsoft's current speech recognition, image classification, natural language processing, and object recognition systems.

One good analogue for deep representations is neurons in the brain (a motivation for artificial neural networks) - the output of a group of neurons is agglomerated as the input to more neurons to form a hierarchical layer structure.

There are two main ways to interpret the computation performed by these layered deep architectures. To get started, principal component analysis (PCA) is a simple feature extraction algorithm that can span both of these interpretations.

The columns of the d_x × d_h matrix W form an orthogonal basis for the d_h orthogonal directions of greatest variance in the input training data x.
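
As a rough illustration of PCA as a one-layer feature extractor, here is a minimal numpy sketch; the data X and the number of components d_h are placeholders, and real use would typically rely on a library implementation such as scikit-learn's PCA.

```python
# Minimal sketch: PCA as a one-layer feature extractor.
# X is an (n_samples, d_x) array of stand-in data; d_h is chosen by the user.
import numpy as np

def pca_features(X, d_h):
    """Project data onto the d_h directions of greatest variance."""
    X_centered = X - X.mean(axis=0)                  # PCA assumes zero-mean data
    # Columns of W (d_x x d_h) are the top d_h principal directions.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    W = Vt[:d_h].T                                   # (d_x, d_h) orthogonal basis
    H = X_centered @ W                               # (n_samples, d_h) feature values
    return W, H

# Example usage on random data standing in for flattened 10x10 images.
X = np.random.rand(500, 100)
W, H = pca_features(X, d_h=20)
```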

Very successful deep learning algorithms stack multiple RBMs together, where the hiddens h from the visible input data x become the new input data for another RBM for an arbitrary number of layers.
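
As a sketch of this layer-by-layer stacking, the following uses scikit-learn's BernoulliRBM; the layer sizes, learning rate, and iteration counts are illustrative rather than tuned, and the random data stands in for real inputs scaled to [0, 1].

```python
# Sketch of stacking two RBMs, assuming input data X scaled to [0, 1].
import numpy as np
from sklearn.neural_network import BernoulliRBM

X = np.random.rand(500, 100)        # stand-in for 500 flattened 10x10 images

rbm1 = BernoulliRBM(n_components=64, learning_rate=0.05, n_iter=20, random_state=0)
H1 = rbm1.fit_transform(X)          # hidden activations of the first RBM

rbm2 = BernoulliRBM(n_components=32, learning_rate=0.05, n_iter=20, random_state=0)
H2 = rbm2.fit_transform(H1)         # hiddens of layer 1 become the visibles of layer 2
```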

There are a few drawbacks to the probabilistic approach to deep architectures. To get around the problem of deriving useful feature values, an auto-encoder is a non-probabilistic alternative approach to deep learning in which the hidden units produce usable numeric feature values.
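
To make the idea concrete, here is a minimal numpy sketch of an autoencoder's forward pass; the weights are random placeholders (in practice they are learned by minimizing reconstruction error), so this only illustrates how the hidden units yield numeric feature values.

```python
# Minimal sketch of an autoencoder's forward pass in numpy.
# Weights are random here purely for illustration; in practice they are learned
# by minimizing the reconstruction error (e.g. with gradient descent).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d_x, d_h = 100, 50                      # e.g. 10x10 pixel inputs, 50 hidden units
rng = np.random.default_rng(0)
W1, b1 = rng.normal(scale=0.1, size=(d_h, d_x)), np.zeros(d_h)   # encoder
W2, b2 = rng.normal(scale=0.1, size=(d_x, d_h)), np.zeros(d_x)   # decoder

x = rng.random(d_x)                     # one input example
h = sigmoid(W1 @ x + b1)                # usable numeric feature values
x_hat = sigmoid(W2 @ h + b2)            # reconstruction of the input
reconstruction_error = np.sum((x - x_hat) ** 2)
```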

Many problems stand in the way of reaching more general AI-level performance. Scaling computations: the more complex the input space (as with harder AI problems), the larger the deep networks have to be to capture its representation.

These computations scale much worse than linearly, and current research in parallelizing the training algorithms and creating convolutional architectures is meant to make these algorithms useful in practice.

Inference and sampling: all probabilistic models except for the RBM require a non-trivial form of inference (estimating the values of the latent variables h from their conditional distribution given the observed x).

So far, we have described the application of neural networks to supervised learning, in which we have labeled training examples.

As a concrete example, suppose the inputs \textstyle x are the pixel intensity values from a \textstyle 10 \times 10 image (100 pixels) so \textstyle n=100, and there are \textstyle s_2=50 hidden units in layer \textstyle L_2.

But even when the number of hidden units is large (perhaps even greater than the number of input pixels), we can still discover interesting structure, by imposing other constraints on the network.

In particular, if we impose a "sparsity" constraint on the hidden units, then the autoencoder will still discover interesting structure in the data, even if the number of hidden units is large.

We would like to (approximately) enforce the constraint \textstyle \hat\rho_j = \rho, where \textstyle \hat\rho_j denotes the average activation of hidden unit \textstyle j over the training set and \textstyle \rho is a "sparsity parameter", typically a small value close to zero (say \textstyle \rho = 0.05).

We will choose the following penalty term: \textstyle \sum_{j=1}^{s_2} {\rm KL}(\rho || \hat\rho_j), which is added to our cost function weighted by a parameter \textstyle \beta. Here, \textstyle s_2 is the number of neurons in the hidden layer, and the index \textstyle j sums over the hidden units in our network.

\textstyle {\rm KL}(\rho || \hat\rho_j) = \rho \log \frac{\rho}{\hat\rho_j} + (1-\rho) \log \frac{1-\rho}{1-\hat\rho_j} is the Kullback-Leibler (KL) divergence between a Bernoulli random variable with mean \textstyle \rho and a Bernoulli random variable with mean \textstyle \hat\rho_j.

\textstyle {\rm KL}(\rho || \hat\rho_j) = 0 if \textstyle \hat\rho_j = \rho, and otherwise it increases monotonically as \textstyle \hat\rho_j diverges from \textstyle \rho.

The term \textstyle \hat\rho_j (implicitly) depends on \textstyle W,b also, because it is the average activation of hidden unit \textstyle j, and the activation of a hidden unit depends on the parameters \textstyle W,b.

Specifically, where previously for the second layer (\textstyle l=2), during backpropagation you would have computed \textstyle \delta^{(2)}_i = \left( \sum_{j=1}^{s_3} W^{(2)}_{ji} \delta^{(3)}_j \right) f'(z^{(2)}_i), now instead compute \textstyle \delta^{(2)}_i = \left( \left( \sum_{j=1}^{s_3} W^{(2)}_{ji} \delta^{(3)}_j \right) + \beta \left( - \frac{\rho}{\hat\rho_i} + \frac{1-\rho}{1-\hat\rho_i} \right) \right) f'(z^{(2)}_i), where \textstyle \beta controls the weight of the sparsity penalty term. One subtlety is that you’ll need to know \textstyle \hat\rho_i to compute this term.
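
The following is a small numpy sketch of the sparsity penalty and of the extra gradient term described above, assuming A2 holds the hidden-layer activations for a batch of examples; the variable names are illustrative, not from the original tutorial code.

```python
# Sketch: KL sparsity penalty and its gradient term, assuming A2 is an
# (n_examples, s_2) array of hidden-layer activations a^(2).
import numpy as np

def kl_sparsity_penalty(A2, rho=0.05):
    rho_hat = A2.mean(axis=0)    # average activation of each hidden unit
    kl = rho * np.log(rho / rho_hat) + (1 - rho) * np.log((1 - rho) / (1 - rho_hat))
    return np.sum(kl)            # summed over the s_2 hidden units

def kl_sparsity_grad(rho_hat, rho=0.05):
    # Term that (multiplied by beta) is added inside the layer-2 deltas
    # during backpropagation, as in the modified formula above.
    return -rho / rho_hat + (1 - rho) / (1 - rho_hat)
```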

If your training set is small enough to fit comfortably in computer memory (this will be the case for the programming assignment), you can compute forward passes on all your examples, keep the resulting activations in memory, and compute the \textstyle \hat\rho_i values.

If your data is too large to fit in memory, you may have to scan through your examples computing a forward pass on each to accumulate (sum up) the activations and compute \textstyle \hat\rho_i (discarding the result of each forward pass after you have taken its activations \textstyle a^{(2)}_i into account for computing \textstyle \hat\rho_i).
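
For the large-data case just described, the streaming computation might look like the following sketch; forward_hidden is a hypothetical helper that returns the hidden activations \textstyle a^{(2)} for one minibatch, and minibatches is any iterable of input arrays.

```python
# Sketch: accumulating rho_hat over data that does not fit in memory.
# `forward_hidden` is a hypothetical helper returning the hidden activations
# a^(2) for one minibatch; `minibatches` is any iterable of input arrays.
import numpy as np

def estimate_rho_hat(minibatches, forward_hidden, s2):
    total = np.zeros(s2)
    count = 0
    for X_batch in minibatches:
        A2 = forward_hidden(X_batch)     # shape: (batch_size, s2); discarded after summing
        total += A2.sum(axis=0)
        count += X_batch.shape[0]
    return total / count                 # rho_hat_i, the average activation of unit i
```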

Each hidden unit \textstyle i computes a function of the input: \textstyle a^{(2)}_i = f\left( \sum_{j=1}^{100} W^{(1)}_{ij} x_j + b^{(1)}_i \right). We will visualize the function computed by hidden unit \textstyle i, which depends on the parameters \textstyle W^{(1)}_{ij} (ignoring the bias term for now), using a 2D image.

If we suppose that the input is norm constrained by \textstyle ||x||^2 = \sum_{i=1}^{100} x_i^2 \leq 1, then one can show (try doing this yourself) that the input which maximally activates hidden unit \textstyle i is given by setting pixel \textstyle x_j (for all 100 pixels, \textstyle j=1,\ldots,100) to \textstyle x_j = \frac{W^{(1)}_{ij}}{\sqrt{\sum_{j=1}^{100} (W^{(1)}_{ij})^2}}. By displaying the image formed by these pixel intensity values, we can begin to understand what feature hidden unit \textstyle i is looking for.
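
A short sketch of that visualization, under the assumption that W1 is the (s_2 x 100) first-layer weight matrix with one row of weights per hidden unit (the random W1 below is only a stand-in for learned weights):

```python
# Sketch: the norm-bounded input that maximally activates hidden unit i is its
# normalized row of W1, reshaped back to the 10x10 image grid.
import numpy as np
import matplotlib.pyplot as plt

def visualize_hidden_unit(W1, i):
    w = W1[i]
    x_max = w / np.sqrt(np.sum(w ** 2))          # pixel values maximizing unit i's activation
    plt.imshow(x_max.reshape(10, 10), cmap="gray")
    plt.title(f"Feature learned by hidden unit {i}")
    plt.show()

W1 = np.random.default_rng(0).normal(size=(50, 100))   # stand-in for learned weights
visualize_hidden_unit(W1, i=0)
```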

Deep learning

Learning can be supervised, semi-supervised or unsupervised.[1][2][3] Deep learning architectures such as deep neural networks, deep belief networks and recurrent neural networks have been applied to fields including computer vision, speech recognition, natural language processing, audio recognition, social network filtering, machine translation, bioinformatics, drug design and board game programs, where they have produced results comparable to, and in some cases superior to, human experts.[4][5][6]

Deep learning models are vaguely inspired by information processing and communication patterns in biological nervous systems, yet have various differences from the structural and functional properties of biological brains, which make them inconsistent with the evidence from neuroscience.[7][8][9]

Deep learning is a class of machine learning algorithms that learns multiple levels of representation of the data.[10](pp199–200) Most modern deep learning models are based on an artificial neural network, although they can also include propositional formulas or latent variables organized layer-wise in deep generative models, such as the nodes in Deep Belief Networks and Deep Boltzmann Machines.[11] In deep learning, each level learns to transform its input data into a slightly more abstract and composite representation.

Examples of deep structures that can be trained in an unsupervised manner are neural history compressors[13] and deep belief networks.[1][14] Deep neural networks are generally interpreted in terms of the universal approximation theorem[15][16][17][18][19] or probabilistic inference.[10][11][1][2][14][20][21] The universal approximation theorem concerns the capacity of feedforward neural networks with a single hidden layer of finite size to approximate continuous functions.[15][16][17][18][19] In 1989, the first proof was published by George Cybenko for sigmoid activation functions[16] and was generalised to feed-forward multi-layer architectures in 1991 by Kurt Hornik.[17] The probabilistic interpretation[20] derives from the field of machine learning.

More specifically, the probabilistic interpretation considers the activation nonlinearity as a cumulative distribution function.[20] The probabilistic interpretation led to the introduction of dropout as a regularizer in neural networks.[22] The probabilistic interpretation was introduced by researchers including Hopfield, Widrow and Narendra and popularized in surveys such as the one by Bishop.[23]

The term Deep Learning was introduced to the machine learning community by Rina Dechter in 1986,[24][13] and to Artificial Neural Networks by Igor Aizenberg and colleagues in 2000, in the context of Boolean threshold neurons.[25][26] The first general, working learning algorithm for supervised, deep, feedforward, multilayer perceptrons was published by Alexey Ivakhnenko and Lapa in 1965.[27] A 1971 paper described a deep network with 8 layers trained by the group method of data handling algorithm.[28] Other working deep learning architectures, specifically those built for computer vision, began with the Neocognitron introduced by Kunihiko Fukushima in 1980.[29] In 1989, Yann LeCun et al. applied the standard backpropagation algorithm to a deep neural network for the purpose of recognizing handwritten ZIP codes on mail.

Each layer in the feature extraction module extracted features of growing complexity relative to the previous layer.[38] In 1995, Brendan Frey demonstrated that it was possible to train (over two days) a network containing six fully connected layers and several hundred hidden units using the wake-sleep algorithm, co-developed with Peter Dayan and Hinton.[39] Many factors contribute to the slow speed, including the vanishing gradient problem analyzed in 1991 by Sepp Hochreiter.[40][41] Simpler models that use task-specific handcrafted features such as Gabor filters and support vector machines (SVMs) were a popular choice in the 1990s and 2000s, because of ANNs' computational cost and a lack of understanding of how the brain wires its biological networks.

In 2003, LSTM started to become competitive with traditional speech recognizers on certain tasks.[51] Later it was combined with connectionist temporal classification (CTC)[52] in stacks of LSTM RNNs.[53] In 2015, Google's speech recognition reportedly experienced a dramatic performance jump of 49% through CTC-trained LSTM, which they made available through Google Voice Search.[54]

In 2006, publications by Geoff Hinton, Ruslan Salakhutdinov, Osindero and Teh[55][56][57] showed how a many-layered feedforward neural network could be effectively pre-trained one layer at a time, treating each layer in turn as an unsupervised restricted Boltzmann machine, then fine-tuning it using supervised backpropagation.[58] The papers referred to learning for deep belief nets.

It was believed that pre-training DNNs using generative models of deep belief nets (DBN) would overcome the main difficulties of neural nets.[69] However, it was discovered that replacing pre-training with large amounts of training data for straightforward backpropagation when using DNNs with large, context-dependent output layers produced error rates dramatically lower than then-state-of-the-art Gaussian mixture model (GMM)/Hidden Markov Model (HMM) and also than more-advanced generative model-based systems.[59][70] The nature of the recognition errors produced by the two types of systems was characteristically different,[71][68] offering technical insights into how to integrate deep learning into the existing highly efficient, run-time speech decoding system deployed by all major speech recognition systems.[10][72][73] Analysis around 2009-2010 contrasted the GMM (and other generative speech models) with DNN models, stimulating early industrial investment in deep learning for speech recognition.

Andrew Ng determined that GPUs could increase the speed of deep-learning systems by about 100 times.[79] In particular, GPUs are well-suited for the matrix/vector math involved in machine learning.[80][81] GPUs speed up training algorithms by orders of magnitude, reducing running times from weeks to days.[82][83] Specialized hardware and algorithm optimizations can be used for efficient processing.[84]

In 2012, a team led by Dahl won the 'Merck Molecular Activity Challenge' using multi-task deep neural networks to predict the biomolecular target of one drug.[85][86] In 2014, Hochreiter's group used deep learning to detect off-target and toxic effects of environmental chemicals in nutrients, household products and drugs and won the 'Tox21 Data Challenge' of NIH, FDA and NCATS.[87][88][89] Significant additional impacts in image or object recognition were felt from 2011 to 2012.

DNNs have proven themselves capable, for example, of a) identifying the style period of a given painting, b) 'capturing' the style of a given painting and applying it in a visually pleasing manner to an arbitrary photograph, and c) generating striking imagery based on random visual input fields.[128][129] Neural networks have been used for implementing language models since the early 2000s.[101][130] LSTM helped to improve machine translation and language modeling.[102][103][104] Other key techniques in this field are negative sampling[131] and word embedding.

A compositional vector grammar can be thought of as a probabilistic context-free grammar (PCFG) implemented by an RNN.[132] Recursive auto-encoders built atop word embeddings can assess sentence similarity and detect paraphrasing.[132] Deep neural architectures provide the best results for constituency parsing,[133] sentiment analysis,[134] information retrieval,[135][136] spoken language understanding,[137] machine translation,[102][138] contextual entity linking,[138] writing style recognition,[139] text classification and others.[140] Google Translate (GT) uses a large end-to-end long short-term memory network.[141][142][143][144][145][146] GNMT uses an example-based machine translation method in which the system 'learns from millions of examples.'[142] It translates 'whole sentences at a time, rather than pieces.'

These failures are caused by insufficient efficacy (on-target effect), undesired interactions (off-target effects), or unanticipated toxic effects.[148][149] Research has explored use of deep learning to predict biomolecular target,[85][86] off-target and toxic effects of environmental chemicals in nutrients, household products and drugs.[87][88][89] AtomNet is a deep learning system for structure-based rational drug design.[150] AtomNet was used to predict novel candidate biomolecules for disease targets such as the Ebola virus[151] and multiple sclerosis.[152][153] Deep reinforcement learning has been used to approximate the value of possible direct marketing actions, defined in terms of RFM variables.

An autoencoder ANN was used in bioinformatics to predict gene ontology annotations and gene-function relationships.[157] In medical informatics, deep learning was used to predict sleep quality based on data from wearables[158][159] and to predict health complications from electronic health record data.[160] Deep learning has also shown efficacy in healthcare.[161][162] Finding the appropriate mobile audience for mobile advertising is always challenging, since many data points must be considered and assimilated before a target segment can be created and used in ad serving by any ad server.[163][164] Deep learning has been used to interpret large, many-dimensional advertising datasets.

On the one hand, several variants of the backpropagation algorithm have been proposed in order to increase its processing realism.[171][172] Other researchers have argued that unsupervised forms of deep learning, such as those based on hierarchical generative models and deep belief networks, may be closer to biological reality.[173][174] In this respect, generative neural network models have been related to neurobiological evidence about sampling-based processing in the cerebral cortex.[175] Although a systematic comparison between the human brain organization and the neuronal encoding in deep networks has not yet been established, several analogies have been reported.

systems, like Watson (...) use techniques like deep learning as just one element in a very complicated ensemble of techniques, ranging from the statistical technique of Bayesian inference to deductive reasoning.'[188] As an alternative to this emphasis on the limits of deep learning, one author speculated that it might be possible to train a machine vision stack to perform the sophisticated task of discriminating between 'old master' and amateur figure drawings, and hypothesized that such a sensitivity might represent the rudiments of a non-trivial machine empathy.[189] This same author proposed that this would be in line with anthropology, which identifies a concern with aesthetics as a key element of behavioral modernity.[190] In further reference to the idea that artistic sensitivity might inhere within relatively low levels of the cognitive hierarchy, a published series of graphic representations of the internal states of deep (20-30 layers) neural networks attempting to discern within essentially random data the images on which they were trained[191] demonstrate a visual appeal: the original research notice received well over 1,000 comments, and was the subject of what was for a time the most frequently accessed article on The Guardian's[192] web site.

Some deep learning architectures display problematic behaviors,[193] such as confidently classifying unrecognizable images as belonging to a familiar category of ordinary images[194] and misclassifying minuscule perturbations of correctly classified images.[195] Goertzel hypothesized that these behaviors are due to limitations in their internal representations and that these limitations would inhibit integration into heterogeneous multi-component AGI architectures.[193] These issues may possibly be addressed by deep learning architectures that internally form states homologous to image-grammar[196] decompositions of observed entities and events.[193] Learning a grammar (visual or linguistic) from training data would be equivalent to restricting the system to commonsense reasoning that operates on concepts in terms of grammatical production rules and is a basic goal of both human language acquisition[197] and AI.[198] As deep learning moves from the lab into the world, research and experience shows that artificial neural networks are vulnerable to hacks and deception.

ANNs have been trained to defeat ANN-based anti-malware software by repeatedly attacking a defense with malware that was continually altered by a genetic algorithm until it tricked the anti-malware while retaining its ability to damage the target.[199] Another group demonstrated that certain sounds could make the Google Now voice command system open a particular web address that would download malware.[199] In “data poisoning”, false data is continually smuggled into a machine learning system’s training set to prevent it from achieving mastery.[199]

Deep Learning with Tensorflow - Convolution and Feature Learning

Enroll in the course for free at: Deep Learning with TensorFlow Introduction The majority of data ..

Deep Learning Overview

See the full course: Follow along with the course eBook: Deep nets are the current state of the art in pattern .

Neural Program Learning from Input-Output Examples

Most deep learning research focuses on learning a single task at a time - on a fixed problem, given an input, predict the corresponding output. How should we ...

Deep Learning with Tensorflow - Activation Functions

Enroll in the course for free at: Deep Learning with TensorFlow Introduction The majority of data ..

Using Deep Learning to Extract Feature Data from Imagery

Vector data collection is the most tedious task in a GIS workflow. Digitizing features from imagery or scanned maps is a manual process that is costly, requiring ...

Neural Networks 6: solving XOR with a hidden layer

Nonlinear ICA using temporal structure: a principled framework for unsupervised deep learning

Unsupervised learning, in particular learning general nonlinear representations, is one of the deepest problems in machine learning. Estimating latent quantities ...

25. Computing a Neural Network's Output

Deep Learning Approach for Extreme Multi-label Text Classification

Extreme classification is a rapidly growing research area focusing on multi-class and multi-label problems involving an extremely large number of labels.

PyTorch in 5 Minutes

I'll explain PyTorch's key features and compare it to the current most popular deep learning framework in the world (Tensorflow). We'll then write out a short ...