AI News, What should be my approach to learn Deep learning?

What should be my approach to learn Deep learning?

Deep learning is a subset of machine learning in Artificial Intelligence (AI) that has networks which are capable of learning unsupervised from data that is unstructured or unlabeled.

Deep learning combines advances in computing power and special types of neural networks to learn complicated patterns in large amounts of data.

Deep learning techniques are now currently state of the arts for identifying objects in images and sounds in words.

Apply Auto Encoders in practice You can also refer some text books: Books recommended for deep learning are below(for reference purpose only) ·

Understanding Hinton’s Capsule Networks. Part I: Intuition.

Layers that are deeper (closer to the input) will learn to detect simple features such as edges and color gradients, whereas higher layers will combine simple features into more complex features.

An important thing to understand is that higher-level features combine lower-level features as a weighted sum: activations of a preceding layer are multiplied by the following layer neuron’s weights and added, before being passed to activation nonlinearity.

CNN approach to solve this issue is to use max pooling or successive convolutional layers that reduce spacial size of the data flowing through the network and therefore increase the “field of view” of higher layer’s neurons, thus allowing them to detect higher order features in a larger region of the input image.

Hinton himself stated that the fact that max pooling is working so well is a big mistake and a disaster: Of course, you can do away with max pooling and still get good results with traditional CNNs, but they still do not solve the key problem: In the example above, a mere presence of 2 eyes, a mouth and a nose in a picture does not mean there is a face, we also need to know how these objects are oriented relative to each other.

He calls it inverse graphics: from visual information received by eyes, they deconstruct a hierarchical representation of the world around us and try to match it with already learned patterns and relationships stored in the brain.

Hacker's guide to Neural Networks

Javascript allows one to nicely visualize what’s going on and to play around with the various hyperparameter settings, but I still regularly hear from people who ask for a more thorough treatment of the topic.

In my opinion, the best way to think of Neural Networks is as real-valued circuits, where real values (instead of boolean values {0,1}) “flow” along edges and interact in gates.

Javascript version of this would very simply look something like this: And in math form we can think of this gate as implementing the real-valued function: As with this example, all of our gates will take one or two inputs and produce a single output value.

Why don’t we tweak x and y randomly and keep track of the tweak that works best: When I run this, I get best_x = -1.9928, best_y = 2.9901, and best_out = -5.9588.

Not quite: This is a perfectly fine strategy for tiny problems with a few gates if you can afford the compute time, but it won’t do if we want to eventually consider huge circuits with millions of inputs.

On the other hand, we’d expect a negative force induced on y that pushes it to become lower (since a lower y, such as y = 2, down from the original y = 3 would make output higher: 2 x -2 = -4, again, larger than -6).

Also, if you’re not very familiar with calculus it is important to note that in the left-hand side of the equation above, the horizontal line does not indicate division.

The entire symbol \( \frac{\partial f(x,y)}{\partial x} \) is a single thing: the derivative of the function \( f(x,y) \) with respect to \( x \).

Anyway, I hope it doesn’t look too scary because it isn’t: The circuit was giving some initial output \( f(x,y) \), and then we changed one of the inputs by a tiny amount \(h \) and read the new output \( f(x+h, y) \).

We turned the knob from x to x + h and the circuit responded by giving a higher value (note again that yes, -5.9997 is higher than -6: -5.9997 >

Technically, you want the value of h to be infinitesimal (the precise mathematical definition of the gradient is defined as the limit of the expression as h goes to zero), but in practice h=0.00001 or so works fine in most cases to get a good approximation.

we just add the derivative on top of every input), we can see that the value increases, as expected: As expected, we changed the inputs by the gradient and the circuit now gives a slightly higher value (-5.87 >

Evaluating the gradient requires just three evaluations of the forward pass of our circuit instead of hundreds, and gives the best tug you can hope for (locally) if you are interested in increasing the value of the output.

For example, step_size = 1.0 gives output -1 (higer, better!), and indeed infinite step size would give infinitely good results.

The gradient guarantees that if you have a very small (indeed, infinitesimally small) step size, then you will definitely get a higher number when you follow its direction, and for that infinitesimally small step size there is no other direction that would have worked better.

But in practice we will have hundreds, thousands or (for neural networks) even tens to hundreds of millions of inputs, and the circuits aren’t just one multiply gate but huge expressions that can be expensive to compute.

You may have seen other people who teach Neural Networks derive the gradient in huge and, frankly, scary and confusing mathematical equations (if you’re not well-versed in maths).

That is because we will only ever derive the gradient for very small and simple expressions (think of it as the base case) and then I will show you how we can compose these very simply with chain rule to evaluate the full gradient (think inductive/recursive case).

We invoked powerful mathematics and can now transform our derivative calculation into the following code: To compute the gradient we went from forwarding the circuit hundreds of times (Strategy #1) to forwarding it only on order of number of times twice the number of inputs (Strategy #2), to forwarding it a single time!

And it gets EVEN better, since the more expensive strategies (#1 and #2) only give an approximation of the gradient, while #3 (the fastest one by far) gives you the exact gradient.

That’s because the numerical gradient is very easy to evaluate (but can be a bit expensive to compute), while the analytic gradient can contain bugs at times, but is usually extremely efficient to compute.

Lets structure the code as follows to make the gates explicit as functions: In the above, I am using a and b as the local variables in the gate functions so that we don’t get these confused with our circuit inputs x,y,z.

If we don’t worry about x and y but only about q and z, then we are back to having only a single gate, and as far as that single * gate is concerned, we know what the (analytic) derivates are from previous section.

“Pulling” upwards on this output value induced a force on both q and z: To increase the output value, the circuit “wants” z to increase, as can be seen by the positive value of the derivative(derivative_f_wrt_z = +3).

The multiplication by -4 seen in the chain rule achieves exactly this: instead of applying a positive force of +1 on both x and y (the local derivative), the full circuit’s gradient on both x and y becomes 1 x -4 = -4.

The only difference between the case of a single gate and multiple interacting gates that compute arbitrarily complex expressions is this additional multipy operation that now happens in each gate.

For example, the + gate always takes the gradient on top and simply passes it on to all of its inputs (notice the example with -4 simply passed on to both of the inputs of + gate).

Since the gradient of max(x,y) with respect to its input is +1 for whichever one of x, y is larger and 0 for the other, this gate is during backprop effectively just a gradient “switch”: it will take the gradient from above and “route” it to the input that had a higher value during the forward pass.

Its best thought of as a “squashing function”, because it takes the input and squashes it to be between zero and one: Very negative values are squashed towards zero and positive values get squashed towards one.

Sigmoid function is defined as: The gradient with respect to its single input, as you can check on Wikipedia or derive yourself if you know some calculus is given by this expression: For example, if the input to the sigmoid gate is x = 3, the gate will compute output f = 1.0 / (1.0 + Math.exp(-x)) = 0.95, and then the (local) gradient on its input will simply be dx = (0.95) * (1 - 0.95) = 0.0475.

Another thing to note is that technically, the sigmoid function is made up of an entire series of gates in a line that compute more atomic functions: an exponentiation gate, an addition gate and a division gate.

Treating it so would work perfectly fine but for this example I chose to collapse all of these gates into a single gate that just computes sigmoid in one shot, because the gradient expression turns out to be simple.

If you’re not a Javascript - familiar person, all that’s going on here is that I’m defining a class that has certain properties (accessed with use of this keyword), and some methods (which in Javascript are placed into the function’s prototype).

Then notice that in the backward function call we get the gradient from the output unit we produced during the forward pass (which will by now hopefully have its gradient filled in) and multiply it with the local gradient for this gate (chain rule!).

This gate computes multiplication (u0.value * u1.value) during forward pass, so recall that the gradient w.r.t u0 is u1.value and w.r.t u1 is u0.value.

This will allow us to possibly use the output of one gate multiple times (think of it as a wire branching out), since it turns out that the gradients from these different branches just add up when computing the final gradient with respect to the circuit output.

To fully specify everything lets finally write out the forward and backward flow for our 2-dimensional neuron with some example values: And now lets compute the gradient: Simply iterate in reverse order and call the backward function!

Finally, lets verify that we implemented the backpropagation correctly by checking the numerical gradient: Indeed, these all give the same values as the backpropagated gradients [-0.105, 0.315, 0.105, 0.105, 0.210].

hope it is clear that even though we only looked at an example of a single neuron, the code I gave above generalizes in a very straight-forward way to compute gradients of arbitrary expressions (including very deep expressions #foreshadowing).

All you have to do is write small gates that compute local, simple derivatives w.r.t their inputs, wire it up in a graph, do a forward pass to compute the output value and then a backward pass that chains the gradients all the way to the input.

Our first example was the * gate: In the code above, I’m assuming that the variable dx is given, coming from somewhere above us in the circuit while we’re doing backprop (or it is +1 by default otherwise).

If you remember the backward flow diagram, the + gate simply takes the gradient on top and routes it equally to all of its inputs (because its local gradient is always simply 1.0 for all its inputs, regardless of their actual values).

So we can do it much faster: Okay, how about combining gates?: If you don’t see how the above happened, introduce a temporary variable q = a * b and then compute x = q + c to convince yourself.

In other words nothing changes: In fact, if you know your power rule from calculus you would also know that if you have \( f(a) = a^2 \) then \( \frac{\partial f(a)}{\partial a} = 2a \), which is exactly what we get if we think of it as wire splitting up and being two inputs to a gate.

Lets do another one: Okay now lets start to get more complex: When more complex cases like this come up in practice, I like to split the expression into manageable chunks which are almost always composed of simpler expressions and then I chain them together with chain rule: That wasn’t too difficult!

Here are a few more useful functions and their local gradients that are useful in practice: Here’s what division might look like in practice then: Hopefully you see that we are breaking down expressions, doing the forward pass, and then for every variable (such as a) we derive its gradient da as we go backwards, one by one, applying the simple local gradients and chaining them with gradients from above.

Everything we’ve done in this chapter comes down to this: We saw that we can feed some input through arbitrarily complex real-valued circuit, tug at the end of the circuit with some force, and backpropagation distributes that tug through the entire circuit all the way back to the inputs.

In the last chapter we were concerned with real-valued circuits that computed possibly complex expressions of their inputs (the forward pass), and also we could compute the gradients of these expressions on the original inputs (backward pass).

This is a silly toy example, but in practice a +1/-1 dataset could be very useful things indeed: For example spam/no spam emails, where the vectors somehow measure various features of the content of the email, such as the number of times certain enhancement drugs are mentioned.

We will eventually build up to entire neural networks and complex expressions, but lets start out simple and train a linear classifier very similar to the single neuron we saw at the end of Chapter 1.

The only difference is that we’ll get rid of the sigmoid because it makes things unnecessarily complicated (I only used it as an example in Chapter 1 because sigmoid neurons are historically popular but modern Neural Networks rarely, if ever, use sigmoid non-linearities).

For example, if a = 1, b = -2, c = -1, then the function will take the first datapoint ([1.2, 0.7]) and output 1 * 1.2 + (-2) * 0.7 + (-1) = -1.2.

Over time, the pulls on these parameters will tune these values in such a way that the function outputs high scores for positive examples and low scores for negative examples.

At this point, if you’ve seen an explanation of SVMs you’re probably expecting me to define the SVM loss function and plunge into an explanation of slack variables, geometrical intuitions of large margins, kernels, duality, etc.

Lets write the SVM code and take advantage of the circuit machinery we have from Chapter 1: That’s a circuit that simply computes a*x + b*y + c and can also compute the gradient.

Now lets train the SVM with Stochastic Gradient Descent: This code prints the following output: We see that initially our classifier only had 33% training accuracy, but by the end all training examples are correctly classifier as the parameters a,b,c adjusted their values according to the pulls we exerted.

a negative data point that gets a score +100, its influence will be relatively minor on our classifier because we will only pull with force of -1 regardless of how bad the mistake was.

Of interest is the fact that an SVM is just a particular type of a very simple circuit (circuit that computes score = a*x + b*y + c where a,b,c are weights and x,y are data points).

The forward pass will look like this: The specification above is a 2-layer Neural Network with 3 hidden neurons (n1, n2, n3) that uses Rectified Linear Unit (ReLU) non-linearity on each hidden neuron.

But for now, I hope your takeaway is that a 2-layer Neural Net is really not such a scary thing: we write a forward pass expression, interpret the value at the end as a score, and then we pull on that value in a positive or negative direction depending on what we want that value to be for our current particular example.

We are given a dataset of \( N \) examples \( (x_{i0}, x_{i1}) \) and their corresponding labels \( y_{i} \) which are allowed to be either \( +1/-1 \) for positive or negative example respectively.

Due to this term the cost will never actually become zero (because this would mean all parameters of the model except the bias are exactly zero), but the closer we get, the better our classifier will become.

Deep learning

In the last chapter we learned that deep neural networks are often much harder to train than shallow neural networks.

We'll also look at the broader picture, briefly reviewing recent progress on using deep nets for image recognition, speech recognition, and other applications.

We'll work through a detailed example - code and all - of using convolutional nets to solve the problem of classifying handwritten digits from the MNIST data set:

As we go we'll explore many powerful techniques: convolutions, pooling, the use of GPUs to do far more training than we did with our shallow networks, the algorithmic expansion of our training data (to reduce overfitting), the use of the dropout technique (also to reduce overfitting), the use of ensembles of networks, and others.

We conclude our discussion of image recognition with a survey of some of the spectacular recent progress using networks (particularly convolutional nets) to do image recognition.

We'll briefly survey other models of neural networks, such as recurrent neural nets and long short-term memory units, and how such models can be applied to problems in speech recognition, natural language processing, and other areas.

And we'll speculate about the future of neural networks and deep learning, ranging from ideas like intention-driven user interfaces, to the role of deep learning in artificial intelligence.

For the $28 \times 28$ pixel images we've been using, this means our network has $784$ ($= 28 \times 28$) input neurons.

Our earlier networks work pretty well: we've obtained a classification accuracy better than 98 percent, using training and test data from the MNIST handwritten digit data set.

But the seminal paper establishing the modern subject of convolutional networks was a 1998 paper, 'Gradient-based learning applied to document recognition', by Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner.

LeCun has since made an interesting remark on the terminology for convolutional nets: 'The [biological] neural inspiration in models like convolutional nets is very tenuous.

That's why I call them 'convolutional nets' not 'convolutional neural nets', and why we call the nodes 'units' and not 'neurons' '.

Despite this remark, convolutional nets use many of the same ideas as the neural networks we've studied up to now: ideas such as backpropagation, gradient descent, regularization, non-linear activation functions, and so on.

In a convolutional net, it'll help to think instead of the inputs as a $28 \times 28$ square of neurons, whose values correspond to the $28 \times 28$ pixel intensities we're using as inputs:

To be more precise, each neuron in the first hidden layer will be connected to a small region of the input neurons, say, for example, a $5 \times 5$ region, corresponding to $25$ input pixels.

So, for a particular hidden neuron, we might have connections that look like this: That region in the input image is called the local receptive field for the hidden neuron.

To illustrate this concretely, let's start with a local receptive field in the top-left corner: Then we slide the local receptive field over by one pixel to the right (i.e., by one neuron), to connect to a second hidden neuron:

Note that if we have a $28 \times 28$ input image, and $5 \times 5$ local receptive fields, then there will be $24 \times 24$ neurons in the hidden layer.

This is because we can only move the local receptive field $23$ neurons across (or $23$ neurons down), before colliding with the right-hand side (or bottom) of the input image.

In this chapter we'll mostly stick with stride length $1$, but it's worth knowing that people sometimes experiment with different stride lengths* *As was done in earlier chapters, if we're interested in trying different stride lengths then we can use validation data to pick out the stride length which gives the best performance.

The same approach may also be used to choose the size of the local receptive field - there is, of course, nothing special about using a $5 \times 5$ local receptive field.

In general, larger local receptive fields tend to be helpful when the input images are significantly larger than the $28 \times 28$ pixel MNIST images..

In other words, for the $j, k$th hidden neuron, the output is: \begin{eqnarray} \sigma\left(b + \sum_{l=0}^4 \sum_{m=0}^4 w_{l,m} a_{j+l, k+m} \right).

Informally, think of the feature detected by a hidden neuron as the kind of input pattern that will cause the neuron to activate: it might be an edge in the image, for instance, or maybe some other type of shape.

To see why this makes sense, suppose the weights and bias are such that the hidden neuron can pick out, say, a vertical edge in a particular local receptive field.

To put it in slightly more abstract terms, convolutional networks are well adapted to the translation invariance of images: move a picture of a cat (say) a little ways, and it's still an image of a cat* *In fact, for the MNIST digit classification problem we've been studying, the images are centered and size-normalized.

One of the early convolutional networks, LeNet-5, used $6$ feature maps, each associated to a $5 \times 5$ local receptive field, to recognize MNIST digits.

Let's take a quick peek at some of the features which are learned* *The feature maps illustrated come from the final convolutional network we train, see here.:

Each map is represented as a $5 \times 5$ block image, corresponding to the $5 \times 5$ weights in the local receptive field.

By comparison, suppose we had a fully connected first layer, with $784 = 28 \times 28$ input neurons, and a relatively modest $30$ hidden neurons, as we used in many of the examples earlier in the book.

That, in turn, will result in faster training for the convolutional model, and, ultimately, will help us build deep networks using convolutional layers.

Incidentally, the name convolutional comes from the fact that the operation in Equation (125)\begin{eqnarray} \sigma\left(b + \sum_{l=0}^4 \sum_{m=0}^4 w_{l,m} a_{j+l, k+m} \right) \nonumber\end{eqnarray}$('#margin_15061212923_reveal').click(function() {$('#margin_15061212923').toggle('slow', function() {});});

A little more precisely, people sometimes write that equation as $a^1 = \sigma(b + w * a^0)$, where $a^1$ denotes the set of output activations from one feature map, $a^0$ is the set of input activations, and $*$ is called a convolution operation.

In particular, I'm using 'feature map' to mean not the function computed by the convolutional layer, but rather the activation of the hidden neurons output from the layer.

In max-pooling, a pooling unit simply outputs the maximum activation in the $2 \times 2$ input region, as illustrated in the following diagram:

Note that since we have $24 \times 24$ neurons output from the convolutional layer, after pooling we have $12 \times 12$ neurons.

So if there were three feature maps, the combined convolutional and max-pooling layers would look like:

Here, instead of taking the maximum activation of a $2 \times 2$ region of neurons, we take the square root of the sum of the squares of the activations in the $2 \times 2$ region.

It's similar to the architecture we were just looking at, but has the addition of a layer of $10$ output neurons, corresponding to the $10$ possible values for MNIST digits ('0', '1', '2', etc):

Problem Backpropagation in a convolutional network The core equations of backpropagation in a network with fully-connected layers are (BP1)\begin{eqnarray} \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}$('#margin_518328611346_reveal').click(function() {$('#margin_518328611346').toggle('slow', function() {});});-(BP4)\begin{eqnarray} \frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j \nonumber\end{eqnarray}$('#margin_493404030989_reveal').click(function() {$('#margin_493404030989').toggle('slow', function() {});});

Suppose we have a network containing a convolutional layer, a max-pooling layer, and a fully-connected output layer, as in the network discussed above.

The program we'll use to do this is called, and it's an improved version of the programs and developed in earlier chapters* *Note also that incorporates ideas from the Theano library's documentation on convolutional neural nets (notably the implementation of LeNet-5), from Misha Denil's implementation of dropout, and from Chris Olah..

But now that we understand those details, for we're going to use a machine learning library known as Theano* *See Theano: A CPU and GPU Math Expression Compiler in Python, by James Bergstra, Olivier Breuleux, Frederic Bastien, Pascal Lamblin, Ravzan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio (2010).

The examples which follow were run using Theano 0.6* *As I release this chapter, the current version of Theano has changed to version 0.7.

Note that the code in the script simply duplicates and parallels the discussion in this section.Note also that throughout the section I've explicitly specified the number of training epochs.

In practice, it's worth using early stopping, that is, tracking accuracy on the validation set, and stopping training when we are confident the validation accuracy has stopped improving.: >>>

Using the validation data to decide when to evaluate the test accuracy helps avoid overfitting to the test data (see this earlier discussion of the use of validation data).

Your results may vary slightly, since the network's weights and biases are randomly initialized* *In fact, in this experiment I actually did three separate runs training a network with this architecture.

This $97.80$ percent accuracy is close to the $98.04$ percent accuracy obtained back in Chapter 3, using a similar network architecture and learning hyper-parameters.

Second, while the final layer in the earlier network used sigmoid activations and the cross-entropy cost function, the current network uses a softmax final layer, and the log-likelihood cost function.

I haven't made this switch for any particularly deep reason - mostly, I've done it because softmax plus log-likelihood cost is more common in modern image classification networks.

In this architecture, we can think of the convolutional and pooling layers as learning about local spatial structure in the input training image, while the later, fully-connected layer learns at a more abstract level, integrating global information from across the entire image.

filter_shape=(20, 1, 5, 5),

poolsize=(2, 2)),

validation_data, test_data)

Can we improve on the $98.78$ percent classification accuracy?

filter_shape=(20, 1, 5, 5),

poolsize=(2, 2)),

filter_shape=(40, 20, 5, 5),

poolsize=(2, 2)),

validation_data, test_data)

In fact, you can think of the second convolutional-pooling layer as having as input $12 \times 12$ 'images', whose 'pixels' represent the presence (or absence) of particular localized features in the original input image.

The output from the previous layer involves $20$ separate feature maps, and so there are $20 \times 12 \times 12$ inputs to the second convolutional-pooling layer.

In fact, we'll allow each neuron in this layer to learn from all $20 \times 5 \times 5$ input neurons in its local receptive field.

More informally: the feature detectors in the second convolutional-pooling layer have access to all the features from the previous layer, but only within their particular local receptive field* *This issue would have arisen in the first layer if the input images were in color.

In that case we'd have 3 input features for each pixel, corresponding to red, green and blue channels in the input image.

So we'd allow the feature detectors to have access to all color information, but only within a given local receptive field..

Problem Using the tanh activation function Several times earlier in the book I've mentioned arguments that the tanh function may be a better activation function than the sigmoid function.

Try training the network with tanh activations in the convolutional and fully-connected layers* *Note that you can pass activation_fn=tanh as a parameter to the ConvPoolLayer and FullyConnectedLayer classes..

Try plotting the per-epoch validation accuracies for both tanh- and sigmoid-based networks, all the way out to $60$ epochs.

If your results are similar to mine, you'll find the tanh networks train a little faster, but the final accuracies are very similar.

Can you get a similar training speed with the sigmoid, perhaps by changing the learning rate, or doing some rescaling* *You may perhaps find inspiration in recalling that $\sigma(z) = (1+\tanh(z/2))/2$.?

Try a half-dozen iterations on the learning hyper-parameters or network architecture, searching for ways that tanh may be superior to the sigmoid.

Personally, I did not find much advantage in switching to tanh, although I haven't experimented exhaustively, and perhaps you may find a way.

In any case, in a moment we will find an advantage in switching to the rectified linear activation function, and so we won't go any deeper into the use of tanh.

Using rectified linear units: The network we've developed at this point is actually a variant of one of the networks used in the seminal 1998 paper* *'Gradient-based learning applied to document recognition', by Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner (1998).

filter_shape=(20, 1, 5, 5),

poolsize=(2, 2),


filter_shape=(40, 20, 5, 5),

poolsize=(2, 2),


However, across all my experiments I found that networks based on rectified linear units consistently outperformed networks based on sigmoid activation functions.

The reason for that recent adoption is empirical: a few people tried rectified linear units, often on the basis of hunches or heuristic arguments* *A common justification is that $\max(0, z)$ doesn't saturate in the limit of large $z$, unlike sigmoid neurons, and this helps rectified linear units continue learning.

A simple way of expanding the training data is to displace each training image by a single pixel, either up one pixel, down one pixel, left one pixel, or right one pixel.

filter_shape=(20, 1, 5, 5),

poolsize=(2, 2),


filter_shape=(40, 20, 5, 5),

poolsize=(2, 2),


Just to remind you of the flavour of some of the results in that earlier discussion: in 2003 Simard, Steinkraus and Platt* *Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis, by Patrice Simard, Dave Steinkraus, and John Platt (2003).

improved their MNIST performance to $99.6$ percent using a neural network otherwise very similar to ours, using two convolutional-pooling layers, followed by a hidden fully-connected layer with $100$ neurons.

There were a few differences of detail in their architecture - they didn't have the advantage of using rectified linear units, for instance - but the key to their improved performance was expanding the training data.

filter_shape=(20, 1, 5, 5),

poolsize=(2, 2),


filter_shape=(40, 20, 5, 5),

poolsize=(2, 2),


filter_shape=(20, 1, 5, 5),

poolsize=(2, 2),


filter_shape=(40, 20, 5, 5),

poolsize=(2, 2),


Using this, we obtain an accuracy of $99.60$ percent, which is a substantial improvement over our earlier results, especially our main benchmark, the network with $100$ hidden neurons, where we achieved $99.37$ percent.

In fact, I tried experiments with both $300$ and $1,000$ hidden neurons, and obtained (very slightly) better validation performance with $1,000$ hidden neurons.

Why we only applied dropout to the fully-connected layers: If you look carefully at the code above, you'll notice that we applied dropout only to the fully-connected section of the network, not to the convolutional layers.

But apart from that, they used few other tricks, including no convolutional layers: it was a plain, vanilla network, of the kind that, with enough patience, could have been trained in the 1980s (if the MNIST data set had existed), given enough computing power.

In particular, we saw that the gradient tends to be quite unstable: as we move from the output layer to earlier layers the gradient tends to either vanish (the vanishing gradient problem) or explode (the exploding gradient problem).

In particular, in our final experiments we trained for $40$ epochs using a data set $5$ times larger than the raw MNIST training data.

I've occasionally heard people adopt a deeper-than-thou attitude, holding that if you're not keeping-up-with-the-Joneses in terms of number of hidden layers, then you're not really doing deep learning.

To speed that process up you may find it helpful to revisit Chapter 3's discussion of how to choose a neural network's hyper-parameters, and perhaps also to look at some of the further reading suggested in that section.

Here's the code (discussion below)* *Note added November 2016: several readers have noted that in the line initializing self.w, I set scale=np.sqrt(1.0/n_out), when the arguments of Chapter 3 suggest a better initialization may be scale=np.sqrt(1.0/n_in).


loc=0.0, scale=np.sqrt(1.0/n_out), size=(n_in, n_out)),



I use the name inpt rather than input because input is a built-in function in Python, and messing with built-ins tends to cause unpredictable behavior and difficult-to-diagnose bugs.

So self.inpt_dropout and self.output_dropout are used during training, while self.inpt and self.output are used for all other purposes, e.g., evaluating accuracy on the validation and test data.

prev_layer, layer = self.layers[j-1], self.layers[j]

prev_layer.output, prev_layer.output_dropout, self.mini_batch_size)

Now, this isn't a Theano tutorial, and so we won't get too deeply into what it means that these are symbolic variables* *The Theano documentation provides a good introduction to Theano.

# define the (regularized) cost function, symbolic gradients, and updates


for param, grad in zip(self.params, grads)]


training_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size],


training_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size]


validation_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size],


validation_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size]


test_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size],


test_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size]


test_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size]

iteration = num_training_batches*epoch+minibatch_index

if iteration

print("Training mini-batch number {0}".format(iteration))

cost_ij = train_mb(minibatch_index)

if (iteration+1)

validation_accuracy = np.mean(

[validate_mb_accuracy(j) for j in xrange(num_validation_batches)])

print("Epoch {0}: validation accuracy {1:.2

epoch, validation_accuracy))

if validation_accuracy >= best_validation_accuracy:

print("This is the best validation accuracy to date.")

best_validation_accuracy = validation_accuracy

best_iteration = iteration

if test_data:

test_accuracy = np.mean(

[test_mb_accuracy(j) for j in xrange(num_test_batches)])

print('The corresponding test accuracy is {0:.2


# define the (regularized) cost function, symbolic gradients, and updates


for param, grad in zip(self.params, grads)]

In these lines we symbolically set up the regularized log-likelihood cost function, compute the corresponding derivatives in the gradient function, as well as the corresponding parameter updates.

With all these things defined, the stage is set to define the train_mb function, a Theano symbolic function which uses the updates to update the Network parameters, given a mini-batch index.

The remainder of the SGD method is self-explanatory - we simply iterate over the epochs, repeatedly training the network on mini-batches of training data, and computing the validation and test accuracies.

prev_layer, layer = self.layers[j-1], self.layers[j]

prev_layer.output, prev_layer.output_dropout, self.mini_batch_size)

# define the (regularized) cost function, symbolic gradients, and updates


for param, grad in zip(self.params, grads)]


training_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size],


training_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size]


validation_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size],


validation_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size]


test_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size],


test_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size]


test_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size]

iteration = num_training_batches*epoch+minibatch_index

if iteration % 1000 == 0:

print("Training mini-batch number {0}".format(iteration))

cost_ij = train_mb(minibatch_index)

if (iteration+1) % num_training_batches == 0:

validation_accuracy = np.mean(

[validate_mb_accuracy(j) for j in xrange(num_validation_batches)])

print("Epoch {0}: validation accuracy {1:.2%}".format(

epoch, validation_accuracy))

if validation_accuracy >= best_validation_accuracy:

print("This is the best validation accuracy to date.")

best_validation_accuracy = validation_accuracy

best_iteration = iteration

if test_data:

test_accuracy = np.mean(

[test_mb_accuracy(j) for j in xrange(num_test_batches)])

print('The corresponding test accuracy is {0:.2%}'.format(



of filters, the number of input feature maps, the filter height, and the

`poolsize` is a tuple of length 2, whose entries are the y and

np.random.normal(loc=0, scale=np.sqrt(1.0/n_out), size=filter_shape),


np.random.normal(loc=0, scale=1.0, size=(filter_shape[0],)),


pooled_out + self.b.dimshuffle('x', 0, 'x', 'x'))


loc=0.0, scale=np.sqrt(1.0/n_out), size=(n_in, n_out)),



Earlier in the book we discussed an automated way of selecting the number of epochs to train for, known as early stopping.

Hint: After working on this problem for a while, you may find it useful to see the discussion at this link.

Earlier in the chapter I described a technique for expanding the training data by applying (small) rotations, skewing, and translation.

Note: Unless you have a tremendous amount of memory, it is not practical to explicitly generate the entire expanded data set.

Show that rescaling all the weights in the network by a constant factor $c > 0$ simply rescales the outputs by a factor $c^{L-1}$, where $L$ is the number of layers.

Still, considering the problem will help you better understand networks containing rectified linear units.

Note: The word good in the second part of this makes the problem a research problem.

In 1998, the year MNIST was introduced, it took weeks to train a state-of-the-art workstation to achieve accuracies substantially worse than those we can achieve using a GPU and less than an hour of training.

With that said, the past few years have seen extraordinary improvements using deep nets to attack extremely difficult image recognition tasks.

They will identify the years 2011 to 2015 (and probably a few years beyond) as a time of huge breakthroughs, driven by deep convolutional nets.

The 2012 LRMD paper: Let me start with a 2012 paper* *Building high-level features using large scale unsupervised learning, by Quoc Le, Marc'Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai Chen, Greg Corrado, Jeff Dean, and Andrew Ng (2012).

Note that the detailed architecture of the network used in the paper differed in many details from the deep convolutional networks we've been studying.

Details about ImageNet are available in the original ImageNet paper, ImageNet: a large-scale hierarchical image database, by Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei (2009).:

If you're looking for a challenge, I encourage you to visit ImageNet's list of hand tools, which distinguishes between beading planes, block planes, chamfer planes, and about a dozen other types of plane, amongst other categories.

The 2012 KSH paper: The work of LRMD was followed by a 2012 paper of Krizhevsky, Sutskever and Hinton (KSH)* *ImageNet classification with deep convolutional neural networks, by Alex Krizhevsky, Ilya Sutskever, and Geoffrey E.

By this top-$5$ criterion, KSH's deep convolutional network achieved an accuracy of $84.7$ percent, vastly better than the next-best contest entry, which achieved an accuracy of $73.8$ percent.

The input layer contains $3 \times 224 \times 224$ neurons, representing the RGB values for a $224 \times 224$ image.

The feature maps are split into two groups of $48$ each, with the first $48$ feature maps residing on one GPU, and the second $48$ feature maps residing on the other GPU.

Their respectives parameters are: (3) $384$ feature maps, with $3 \times 3$ local receptive fields, and $256$ input channels;

A Theano-based implementation has also been developed* *Theano-based large-scale visual recognition with multiple GPUs, by Weiguang Ding, Ruoyan Wang, Fei Mao, and Graham Taylor (2014)., with the code available here.

As in 2012, it involved a training set of $1.2$ million images, in $1,000$ categories, and the figure of merit was whether the top $5$ predictions included the correct category.

The winning team, based primarily at Google* *Going deeper with convolutions, by Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich (2014)., used a deep convolutional network with $22$ layers of neurons.

GoogLeNet achieved a top-5 accuracy of $93.33$ percent, a giant improvement over the 2013 winner (Clarifai, with $88.3$ percent), and the 2012 winner (KSH, with $84.7$ percent).

In 2014 a team of researchers wrote a survey paper about the ILSVRC competition* *ImageNet large scale visual recognition challenge, by Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C.

...the task of labeling images with 5 out of 1000 categories quickly turned out to be extremely challenging, even for some friends in the lab who have been working on ILSVRC and its classes for a while.

In the end I realized that to get anywhere competitively close to GoogLeNet, it was most efficient if I sat down and went through the painfully long training process and the subsequent careful annotation process myself...

Some images are easily recognized, while some images (such as those of fine-grained breeds of dogs, birds, or monkeys) can require multiple minutes of concentrated effort.

In other words, an expert human, working painstakingly, was with great effort able to narrowly beat the deep neural network.

In fact, Karpathy reports that a second human expert, trained on a smaller sample of images, was only able to attain a $12.0$ percent top-5 error rate, significantly below GoogLeNet's performance.

One encouraging practical set of results comes from a team at Google, who applied deep convolutional networks to the problem of recognizing street numbers in Google's Street View imagery* *Multi-digit Number Recognition from Street View Imagery using Deep Convolutional Neural Networks, by Ian J.

And they go on to make the broader claim: 'We believe with this model we have solved [optical character recognition] for short sequences [of characters] for many applications.'

For instance, a 2013 paper* *Intriguing properties of neural networks, by Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus (2013) showed that deep networks may suffer from what are effectively blind spots.

The existence of the adversarial negatives appears to be in contradiction with the network’s ability to achieve high generalization performance.

The explanation is that the set of adversarial negatives is of extremely low probability, and thus is never (or rarely) observed in the test set, yet it is dense (much like the rational numbers), and so it is found near virtually every test case.

For example, one recent paper* *Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images, by Anh Nguyen, Jason Yosinski, and Jeff Clune (2014).

shows that given a trained network it's possible to generate images which look to a human like white noise, but which the network classifies as being in a known category with a very high degree of confidence.

If you read the neural networks literature, you'll run into many ideas we haven't discussed: recurrent neural networks, Boltzmann machines, generative models, transfer learning, reinforcement learning, and so on, on and on $\ldots$ and on!

One way RNNs are currently being used is to connect neural networks more closely to traditional ways of thinking about algorithms, ways of thinking based on concepts such as Turing machines and (conventional) programming languages.

A 2014 paper developed an RNN which could take as input a character-by-character description of a (very, very simple!) Python program, and use that description to predict the output.

For example, an approach based on deep nets has achieved outstanding results on large vocabulary continuous speech recognition.

And another system based on deep nets has been deployed in Google's Android operating system (for related technical work, see Vincent Vanhoucke's 2012-2015 papers).

Many other ideas used in feedforward nets, ranging from regularization techniques to convolutions to the activation and cost functions used, are also useful in recurrent nets.

Deep belief nets, generative models, and Boltzmann machines: Modern interest in deep learning began in 2006, with papers explaining how to train a type of neural network known as a deep belief network (DBN)* *See A fast learning algorithm for deep belief nets, by Geoffrey Hinton, Simon Osindero, and Yee-Whye Teh (2006), as well as the related work in Reducing the dimensionality of data with neural networks, by Geoffrey Hinton and Ruslan Salakhutdinov (2006)..

A generative model like a DBN can be used in a similar way, but it's also possible to specify the values of some of the feature neurons and then 'run the network backward', generating values for the input activations.

And the ability to do unsupervised learning is extremely interesting both for fundamental scientific reasons, and - if it can be made to work well enough - for practical applications.

Active areas of research include using neural networks to do natural language processing (see also this informative review paper), machine translation, as well as perhaps more surprising applications such as music informatics.

In many cases, having read this book you should be able to begin following recent work, although (of course) you'll need to fill in gaps in presumed background knowledge.

It combines deep convolutional networks with a technique known as reinforcement learning in order to learn to play video games well (see also this followup).

The idea is to use the convolutional network to simplify the pixel data from the game screen, turning it into a simpler set of features, which can be used to decide which action to take: 'go left', 'go down', 'fire', and so on.

What is particularly interesting is that a single network learned to play seven different classic video games pretty well, outperforming human experts on three of the games.

But looking past the surface gloss, consider that this system is taking raw pixel data - it doesn't even know the game rules!

Google CEO Larry Page once described the perfect search engine as understanding exactly what [your queries] mean and giving you back exactly what you want.

In this vision, instead of responding to users' literal queries, search will use machine learning to take vague user input, discern precisely what was meant, and take action on the basis of those insights.

Over the next few decades, thousands of companies will build products which use machine learning to make user interfaces that can tolerate imprecision, while discerning and acting on the user's true intent.

Inspired user interface design is hard, and I expect many companies will take powerful machine learning technology and use it to build insipid user interfaces.

Machine learning, data science, and the virtuous circle of innovation: Of course, machine learning isn't just being used to build intention-driven interfaces.

But I do want to mention one consequence of this fashion that is not so often remarked: over the long run it's possible the biggest breakthrough in machine learning won't be any single conceptual breakthrough.

If a company can invest 1 dollar in machine learning research and get 1 dollar and 10 cents back reasonably rapidly, then a lot of money will end up in machine learning research.

So, for example, Conway's law suggests that the design of a Boeing 747 aircraft will mirror the extended organizational structure of Boeing and its contractors at the time the 747 was designed.

If the application's dashboard is supposed to be integrated with some machine learning algorithm, the person building the dashboard better be talking to the company's machine learning expert.

I won't define 'deep ideas' precisely, but loosely I mean the kind of idea which is the basis for a rich field of enquiry.

The backpropagation algorithm and the germ theory of disease are both good examples.: think of things like the germ theory of disease, for instance, or the understanding of how antibodies work, or the understanding that the heart, lungs, veins and arteries form a complete cardiovascular system.

Instead of a monolith, we have fields within fields within fields, a complex, recursive, self-referential social structure, whose organization mirrors the connections between our deepest insights.

Deep learning is the latest super-special weapon I've heard used in such arguments* *Interestingly, often not by leading experts in deep learning, who have been quite restrained.

And there is paper after paper leveraging the same basic set of ideas: using stochastic gradient descent (or a close variation) to optimize a cost function.

An Intuitive Guide to Deep Network Architectures

Over the past few years, much of the progress in deep learning for computer vision can be boiled down to just a handful of neural network architectures.

At the time of writing, Keras ships with six of these pre-trained models already built into the library: The VGG networks, along with the earlier AlexNet from 2012, follow the now archetypal layout of basic conv nets: a series of convolutional, max-pooling, and activation layers before some fully-connected classification layers at the end.

This gives rise to the famous ResNet (or “residual network”) block you’ve probably seen: Each “block” in ResNet consists of a series of layers and a “shortcut” connection adding the input of the block to its output.

The “add” operation is performed element-wise, and if the input and output are of different sizes, zero-padding or projections (via 1x1 convolutions) can be used to create matching dimensions.

Previously, deep neural nets often suffered from the problem of vanishing gradients, in which gradient signals from the error function decreased exponentially as they backpropogated to earlier layers.

However, because the gradient signal in ResNets could travel back directly to early layers via shortcut connections, we could suddenly build 50-layer, 101-layer, 152-layer, and even (apparently) 1000+ layer nets that still performed well.

So many deep learning papers come out with minor improvements from hacking away at the math, the optimizations, and the training process without thought to the underlying task of the model.

The original paper focused on a new building block for deep nets, a block now known as the “Inception module.” At its core, this module is the product of two key insights.

5x5) convolutional filters inherently expensive to compute, stacking multiple different filters side by side greatly increases the number of feature maps per layer.

In other words, as the authors note, “any uniform increase in the number of [filters] results in a quadratic increase of computation.” Our naive Inception module just tripled or quadrupled the number of filters.

By reducing the number of input maps, the authors of Inception were able to stack different layer transformations in parallel, resulting in nets that were simultaneously deep (many layers) and “wide” (many parallel operations).

But what *is* a Neural Network? | Deep learning, chapter 1

Subscribe to stay notified about new videos: Support more videos like this on Patreon: Or don'

Weight Initialization in a Deep Network (C2W1L11)

Understanding and Applying Self-Attention for NLP - Ivan Bilan

PyData Berlin 2018 Understanding attention mechanisms and self-attention, presented in Google's "Attention is all you need" paper, is a beneficial skill for ...

How Convolutional Neural Networks work

A gentle guided tour of Convolutional Neural Networks. Come lift the curtain and see how the magic is done. For slides and text, check out the accompanying ...

A visual and intuitive understanding of deep learning

Presentation at O'Reilly AI conference in San Francisco, September 19 2017.

Lecture 12 | Visualizing and Understanding

In Lecture 12 we discuss methods for visualizing and understanding the internal mechanisms of convolutional networks. We also discuss the use of ...

Lecture 4 | Introduction to Neural Networks

In Lecture 4 we progress from linear classifiers to fully-connected neural networks. We introduce the backpropagation algorithm for computing gradients and ...

Beginner Intro to Neural Networks 10: Finding Linear Regression Parameters

Hey everyone! In the last video we came up with this cost function: cost = (wags(1)-2)^2 + (wags(2)-4)^2 + (wags(4)-5)^2 It changes when we change w and b.

Building a Gesture Recognition System using Deep Learning - Joanna Materzyńska

Description In this talk I will introduce a Python-based, deep learning gesture recognition model. The model is deployed on an embedded system, works in ...

Convolutional Neural Networks - The Math of Intelligence (Week 4)

Convolutional Networks allow us to classify images, generate them, and can even be applied to other types of data. We're going to build one in numpy that can ...