AI News, MachineLearning

I want to compile a comprehensive list of all the available code repos for NIPS 2016's top papers.

Learning to learn by gradient descent by gradient descent (https://arxiv.org/abs/1606.04474) Repo: https://github.com/deepmind/learning-to-learn

R-FCN: Object Detection via Region-based Fully Convolutional Networks (https://arxiv.org/abs/1605.06409) Repo: https://github.com/Orpine/py-R-FCN

Phased LSTM: Accelerating Recurrent Network Training for Long or Event-based Sequences (https://arxiv.org/abs/1610.09513) Repo: https://github.com/dannyneil/public_plstm

Composing graphical models with neural networks for structured representations and fast inference (https://arxiv.org/abs/1603.06277) Repo: https://github.com/mattjj/svae

Fast ε-free Inference of Simulation Models with Bayesian Conditional Density Estimation (https://arxiv.org/abs/1605.06376) Repo: https://github.com/gpapamak/epsilon_free_inference

Fast Artificial Neural Network Library

Fast Artificial Neural Network (FANN) Library is a free open source neural network library, which implements multilayer artificial neural networks in C with support for both fully connected and sparsely connected networks.

An easy-to-read introduction article and a reference manual accompany the library, with examples and recommendations on how to use it.

First you'll want to clone the repository: git clone https://github.com/libfann/fann.git Once that's finished, navigate to the root directory of the checkout.

Why are deep neural networks hard to train?

In practice, when solving circuit design problems (or most any kind of algorithmic problem), we usually start by figuring out how to solve sub-problems, and then gradually integrate the solutions.

See Johan Håstad's 2012 paper On the correlation of parity and small-depth circuits for an account of the early history and references.

On the other hand, if you use deeper circuits it's easy to compute the parity using a small circuit: you just compute the parity of pairs of bits, then use those results to compute the parity of pairs of pairs of bits, and so on, building up quickly to the overall parity.
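
To make the pairwise construction concrete, here is a small Python sketch (my own illustration, not from the text) that computes the parity of a list of bits by repeatedly XOR-ing adjacent pairs; the number of passes grows only logarithmically with the number of bits, mirroring the shallow layered circuit described above.

def parity_pairwise(bits):
    """Compute the parity of a list of 0/1 bits by repeatedly
    XOR-ing adjacent pairs, as in a log-depth circuit."""
    while len(bits) > 1:
        # XOR each adjacent pair; carry the last bit forward if the length is odd.
        paired = [bits[i] ^ bits[i + 1] for i in range(0, len(bits) - 1, 2)]
        if len(bits) % 2 == 1:
            paired.append(bits[-1])
        bits = paired
    return bits[0]

print(parity_pairwise([1, 0, 1, 1, 0, 1]))   # prints 0, since four bits are set

Each pass of the while loop corresponds to one layer of the circuit, so a string of n bits needs only about log2(n) layers.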

These simple networks have been remarkably useful: in earlier chapters we used networks like this to classify handwritten digits with better than 98 percent accuracy!

For instance, if we're doing visual pattern recognition, then the neurons in the first layer might learn to recognize edges, the neurons in the second layer could learn to recognize more complex shapes, say, triangles or rectangles, built up from edges.

These multiple layers of abstraction seem likely to give deep networks a compelling advantage in learning to solve complex pattern recognition problems.

Moreover, just as in the case of circuits, there are theoretical results suggesting that deep networks are intrinsically more powerful than shallow networks* *For certain problems and network architectures this is proved in On the number of response regions of deep feed forward networks with piece-wise linear activations, by Razvan Pascanu, Guido Montúfar, and Yoshua Bengio (2014).

In this chapter, we'll try training deep networks using our workhorse learning algorithm - stochastic gradient descent by backpropagation.

As per usual, we'll use the MNIST digit classification problem as our playground for learning and experimentation* *I introduced the MNIST problem and data here and here..

If you do wish to follow live, then you'll need Python 2.7, Numpy, and a copy of the code, which you can get by cloning the relevant repository from the command line: git clone https://github.com/mnielsen/neural-networks-and-deep-learning.git

We use 30 hidden neurons, as well as 10 output neurons, corresponding to the 10 possible classifications for the MNIST digits ('0', '1', '2', $\ldots$, '9').

Let's try training our network for 30 complete epochs, using mini-batches of 10 training examples at a time, a learning rate $\eta = 0.1$, and regularization parameter $\lambda = 5.0$.

As we train we'll monitor the classification accuracy on the validation_data* *Note that the network is likely to take some minutes to train, depending on the speed of your machine.

So if you're running the code you may wish to continue reading and return later, rather than wait for the code to finish executing.
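
The training run is kicked off from a Python shell. Here is roughly what that interactive session looks like, assuming the mnist_loader and network2 modules from the repository cloned above (the module, class, and argument names follow the book's accompanying code, so treat this as a sketch):

>>> import mnist_loader
>>> training_data, validation_data, test_data = mnist_loader.load_data_wrapper()
>>> import network2
>>> net = network2.Network([784, 30, 10])   # 784 input pixels, 30 hidden neurons, 10 output classes
>>> net.SGD(training_data, 30, 10, 0.1, lmbda=5.0,
...         evaluation_data=validation_data,
...         monitor_evaluation_accuracy=True)

The positional arguments to SGD are the training data, 30 epochs, a mini-batch size of 10, and the learning rate $\eta = 0.1$; lmbda is the regularization parameter $\lambda = 5.0$, and the final two keyword arguments make the run report accuracy on the validation data after each epoch.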

We get a classification accuracy of 96.48 percent (or thereabouts - it'll vary a bit from run to run), comparable to our earlier results with a similar configuration.

In an attempt to improve on this, let's add a second hidden layer of 30 neurons, giving a $[784, 30, 30, 10]$ network, and retrain using the same hyper-parameters. Certainly, things shouldn't get worse, since the extra layers can, in the worst case, simply do nothing* *See this later problem to understand how to build a hidden layer that does nothing..

Below, I've plotted part of a $[784, 30, 30, 10]$ network, i.e., a network with two hidden layers, each containing $30$ hidden neurons.

A big bar means the neuron's weights and bias are changing rapidly, while a small bar means the weights and bias are changing slowly.

More precisely, the bars denote the gradient $\partial C / \partial b$ for each neuron, i.e., the rate of change of the cost with respect to the neuron's bias.

Back in Chapter 2 we saw that this gradient quantity controlled not just how rapidly the bias changes during learning, but also how rapidly the weights input to the neuron change, too.

Don't worry if you don't recall the details: the thing to keep in mind is simply that these bars show how quickly each neuron's weights and bias are changing as the network learns.

To do this, let's denote the gradient as $\delta^l_j = \partial C / \partial b^l_j$, i.e., the gradient for the $j$th neuron in the $l$th layer* *Back in Chapter 2 we referred to this as the error, but here we'll adopt the informal term 'gradient'.

I say 'informal' because of course this doesn't explicitly include the partial derivatives of the cost with respect to the weights, $\partial C / \partial w$..

We can think of the gradient $\delta^1$ as a vector whose entries determine how quickly the first hidden layer learns, and $\delta^2$ as a vector whose entries determine how quickly the second hidden layer learns. We'll then use the lengths $\|\delta^1\|$ and $\|\delta^2\|$ as (rough) global measures of the speed at which each layer is learning.

If we have three hidden layers, in a $[784, 30, 30, 30, 10]$ network, then the respective speeds of learning turn out to be 0.012, 0.060, and 0.283, ordered from the first hidden layer to the third. Again, the earlier hidden layers learn much more slowly than the later hidden layers.

This is a bit different than the way we usually train - I've used no mini-batches, and just 1,000 training images, rather than the full 50,000 image training set.

I'm not trying to do anything sneaky, or pull the wool over your eyes, but it turns out that using mini-batch stochastic gradient descent gives much noisier (albeit very similar, when you average away the noise) results.

The phenomenon is known as the vanishing gradient problem* *See Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, by Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, and Jürgen Schmidhuber (2001).

Unstable gradients in deep neural nets

To get insight into why the vanishing gradient problem occurs, let's consider the simplest deep neural network: one with just a single neuron in each layer.

Here's a network with three hidden layers, where $w_1, w_2, \ldots$ are the weights, $b_1, b_2, \ldots$ are the biases, and $C$ is some cost function.

Just to remind you how this works, the output $a_j$ from the $j$th neuron is $\sigma(z_j)$, where $\sigma$ is the usual sigmoid activation function, and $z_j = w_{j} a_{j-1}+b_j$ is the weighted input to the neuron.

I've drawn the cost $C$ at the end to emphasize that the cost is a function of the network's output, $a_4$: if the actual output from the network is close to the desired output, then the cost will be low, while if it's far away, the cost will be high.

To understand how the gradient for $b_1$ behaves, imagine we make a small change $\Delta b_1$ in the bias $b_1$, and trace how that change propagates through the network. First, it causes a change $\Delta a_1$ in the output of the first hidden neuron. We have $a_1 = \sigma(z_1) = \sigma(w_1 a_0 + b_1)$, so \begin{eqnarray} \Delta a_1 & \approx & \frac{\partial \sigma(w_1 a_0+b_1)}{\partial b_1} \Delta b_1 \tag{115}\\ & = & \sigma'(z_1) \Delta b_1. \tag{116}\end{eqnarray}

That change $\Delta a_1$ in turn causes a change in the weighted input $z_2 = w_2 a_1 + b_2$ to the second hidden neuron: \begin{eqnarray} \Delta z_2 & \approx & \frac{\partial z_2}{\partial a_1} \Delta a_1 \tag{117}\\ & = & w_2 \Delta a_1. \tag{118}\end{eqnarray}

Combining our expressions for $\Delta z_2$ and $\Delta a_1$, we see how the change in the bias $b_1$ propagates along the network to affect $z_2$: \begin{eqnarray} \Delta z_2 & \approx & \sigma'(z_1) w_2 \Delta b_1. \tag{119}\end{eqnarray}

The end result is an expression relating the final change $\Delta C$ in cost to the initial change $\Delta b_1$ in the bias: \begin{eqnarray} \Delta C & \approx & \sigma'(z_1) w_2 \sigma'(z_2) \ldots \sigma'(z_4) \frac{\partial C}{\partial a_4} \Delta b_1. \tag{120}\end{eqnarray}

Dividing by $\Delta b_1$ we do indeed get the desired expression for the gradient: \begin{eqnarray} \frac{\partial C}{\partial b_1} = \sigma'(z_1) w_2 \sigma'(z_2) \ldots \sigma'(z_4) \frac{\partial C}{\partial a_4}. \tag{121}\end{eqnarray}

Why the vanishing gradient problem occurs: To understand why the vanishing gradient problem occurs, let's explicitly write out the entire expression for the gradient: \begin{eqnarray} \frac{\partial C}{\partial b_1} = \sigma'(z_1) \, w_2 \sigma'(z_2) \, w_3 \sigma'(z_3) \, w_4 \sigma'(z_4) \, \frac{\partial C}{\partial a_4}. \tag{122}\end{eqnarray}
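
To get a feel for the size of this product: each $\sigma'(z_j)$ is at most $\sigma'(0) = 1/4$, and with the usual Gaussian weight initialization the weights mostly satisfy $|w_j| < 1$, so each factor $|w_j \sigma'(z_j)|$ tends to be below $1/4$. Here is a throwaway Python check (the weights and weighted inputs are invented purely for illustration) that multiplies a few such factors together:

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Invented weights and weighted inputs for a chain with one neuron per layer.
weights = [0.8, -0.6, 0.9, 0.7]
zs = [0.5, -0.3, 0.2, 0.1]

# The gradient with respect to the earliest bias picks up roughly one factor
# of w * sigma'(z) for each layer the signal has to pass through.
product = 1.0
for w, z in zip(weights, zs):
    product *= abs(w) * sigmoid_prime(z)

print(product)   # about 0.001 here -- each extra layer multiplies in another small factor

So with four such factors the gradient for $b_1$ ends up far smaller than the gradient for a bias near the output, and each additional layer multiplies in another factor that is typically less than $1/4$.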

To make this all a bit more explicit, let's compare the expression for $\partial C / \partial b_1$ to an expression for the gradient with respect to a later bias, say $\partial C / \partial b_3$.

Of course, we haven't explicitly worked out an expression for $\partial C / \partial b_3$, but it follows the same pattern described above for $\partial C / \partial b_1$. The expression for $\partial C / \partial b_1$ contains two extra terms of the form $w_j \sigma'(z_j)$, and since each such term is typically smaller than $1/4$, $\partial C / \partial b_1$ will usually be a factor of $16$ (or more) smaller than $\partial C / \partial b_3$. This is the essential origin of the vanishing gradient problem.

The gradient can also explode, rather than vanish. To see how, consider an explicit example: choose all the weights in the network to be large, say $w_1 = w_2 = w_3 = w_4 = 100$, and choose the biases so that the $\sigma'(z_j)$ terms are not too small. That's actually pretty easy to do: all we need do is choose the biases to ensure that the weighted input to each neuron is $z_j = 0$ (and so $\sigma'(z_j) = 1/4$).
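
Just to spell out where that $1/4$ comes from (a one-line check, not from the original text): the sigmoid derivative satisfies $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, so \begin{eqnarray} \sigma'(0) = \sigma(0)\left(1 - \sigma(0)\right) = \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{4}, \end{eqnarray} which is in fact the largest value $\sigma'$ ever takes. With weights of size $100$ each term $|w_j \sigma'(z_j)|$ is then $100 \cdot 1/4 = 25$, so the gradient grows by a factor of roughly $25$ for every layer we move back through: an exploding gradient.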

Of course, making the weights large also tends to push the weighted input $z = wa + b$ into the regime where the sigmoid saturates, making $\sigma'(z)$ tiny again. The only way to avoid this is if the input activation falls within a fairly narrow range of values (this qualitative explanation is made quantitative in the first problem below).

Problem: (1) Consider the product $|w \sigma'(wa+b)|$, and suppose $|w \sigma'(wa+b)| \geq 1$. Argue that this can only ever occur if $|w| \geq 4$. (2) Supposing that $|w| \geq 4$, consider the set of input activations $a$ for which $|w \sigma'(wa+b)| \geq 1$. Show that the set of $a$ satisfying that constraint can range over an interval no greater in width than \begin{eqnarray} \frac{2}{|w|} \ln\left( \frac{|w|(1+\sqrt{1-4/|w|})}{2}-1\right). \tag{123}\end{eqnarray} (3) Show numerically that the above expression bounding the width of the range is greatest at $|w| \approx 6.9$, where it takes a value $\approx 0.45$.

And so even given that everything lines up just perfectly, we still have a fairly narrow range of input activations which can avoid the vanishing gradient problem.

Identity neuron: Consider a neuron with a single input, $x$, a corresponding weight, $w_1$, a bias $b$, and a weight $w_2$ on the output.

Show that by choosing the weights and bias appropriately, we can ensure $w_2 \sigma(w_1 x+b) \approx x$ for $x \in [0, 1]$.

Such a neuron can thus be used as a kind of identity neuron, that is, a neuron whose output is the same (up to rescaling by a weight factor) as its input.

Hint: It helps to rewrite $x = 1/2+\Delta$, to assume $w_1$ is small, and to use a Taylor series expansion in $w_1 \Delta$.

Unstable gradients in more complex networks

In the earlier chapter on backpropagation we saw that the gradient in the $l$th layer of an $L$ layer network is given by: \begin{eqnarray} \delta^l = \Sigma'(z^l) (w^{l+1})^T \Sigma'(z^{l+1}) (w^{l+2})^T \ldots \Sigma'(z^L) \nabla_a C \tag{124}\end{eqnarray} Here, $\Sigma'(z^l)$ is a diagonal matrix whose entries are the $\sigma'(z)$ values for the weighted inputs to the $l$th layer.
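
As an illustration of this formula (my own sketch, not code from the book or its repository), here is a short NumPy script that builds a randomly initialized sigmoid network, runs one forward pass, and then evaluates $\delta^l$ layer by layer, printing the length $\|\delta^l\|$ of each; with Gaussian-initialized weights the lengths typically shrink dramatically as we move toward the earlier layers. The quadratic cost and the random input and target used here are assumptions made just for the demonstration.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

rng = np.random.default_rng(0)
sizes = [784, 30, 30, 30, 10]   # the [784, 30, 30, 30, 10] network discussed above

# Gaussian-initialized weights and biases, one weight matrix and bias vector per layer.
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal((m, 1)) for m in sizes[1:]]

# Forward pass on a random input, recording the weighted inputs z^l and activations a^l.
a = rng.random((sizes[0], 1))
zs, activations = [], [a]
for w, b in zip(weights, biases):
    z = w @ a + b
    zs.append(z)
    a = sigmoid(z)
    activations.append(a)

# Output-layer gradient: delta^L = Sigma'(z^L) * grad_a C.  For a quadratic cost
# C = ||a^L - y||^2 / 2 against a random target y, grad_a C = a^L - y.
y = rng.random((sizes[-1], 1))
delta = sigmoid_prime(zs[-1]) * (activations[-1] - y)
deltas = [delta]

# Work backwards: delta^l = Sigma'(z^l) (w^{l+1})^T delta^{l+1}.
for l in range(2, len(sizes)):
    delta = sigmoid_prime(zs[-l]) * (weights[-l + 1].T @ delta)
    deltas.insert(0, delta)

for l, d in enumerate(deltas, start=1):
    print("layer", l, "||delta|| =", np.linalg.norm(d))

The backward loop is just equation (124) applied one layer at a time: each step multiplies by $(w^{l+1})^T$ and by the diagonal matrix $\Sigma'(z^l)$, implemented here as an elementwise product with $\sigma'(z^l)$.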

There has, in fact, been much ongoing research into what makes deep networks hard to train, beyond the unstable gradients we've just discussed. I won't comprehensively summarize that work here, but just want to briefly mention a couple of papers, to give you the flavor of some of the questions people are asking.

As a first example, in 2010 Glorot and Bengio* *Understanding the difficulty of training deep feedforward neural networks, by Xavier Glorot and Yoshua Bengio (2010). found evidence suggesting that the use of sigmoid activation functions can cause problems training deep networks. In particular, they found evidence that the use of sigmoids will cause the activations in the final hidden layer to saturate near $0$ early in training, substantially slowing down learning.

As a second example, in 2013 Sutskever, Martens, Dahl and Hinton* *On the importance of initialization and momentum in deep learning, by Ilya Sutskever, James Martens, George Dahl and Geoffrey Hinton (2013). studied the impact on deep learning of both the random weight initialization and the momentum schedule in momentum-based stochastic gradient descent. In both cases, making good choices made a substantial difference in the ability to train deep networks.

The results in the last two paragraphs suggest that there is also a role played by the choice of activation function, the way weights are initialized, and even details of how learning by gradient descent is implemented.

The Best Way to Prepare a Dataset Easily

In this video, I go over the 3 steps you need to prepare a dataset to be fed into a machine learning model: selecting the data, processing it, and transforming it.

Neural Arithmetic Logic Units

DeepMind released a paper just a few days ago describing a module for neural networks called the Neural Arithmetic Logic Unit (NALU). Although deep neural ...

How to Make a Text Summarizer - Intro to Deep Learning #10

I'll show you how you can turn an article into a one-sentence summary in Python with the Keras machine learning library. We'll go over word embeddings, ...

Learn Machine Learning in 3 Months (with curriculum)

How is a total beginner supposed to get started learning machine learning? I'm going to describe a 3 month curriculum to help you go from beginner to ...

Practical 3.0 – CNN basics

Convolutional Neural Networks – Basics. Full project: Torch7-profiling repo: ..

One-Shot Learning - Fresh Machine Learning #1

Welcome to Fresh Machine Learning! This is my new course dedicated to making bleeding edge machine learning accessible to developers everywhere.

How To Train an Object Detection Classifier Using TensorFlow 1.5 (GPU) on Windows 10

These instructions work for newer versions of TensorFlow too! This tutorial shows you how to train your own object detector for multiple objects using Google's ...

Deep Learning Frameworks Compared

In this video, I compare 5 of the most popular deep learning frameworks (SciKit Learn, TensorFlow, Theano, Keras, and Caffe). We go through the pros and cons ...

Predicting the Winning Team with Machine Learning

Can we predict the outcome of a football game given a dataset of past games? That's the question that we'll answer in this episode by using the scikit-learn ...

Introduction to Spell (for Machine Learning in the Cloud)

Spell is a command line and web interface for sending experiments out to the cloud with a single command. There are ..