# AI News, Improving the way neural networks learn

- On Monday, June 4, 2018
- By Read More

## Improving the way neural networks learn

When a golf player is first learning to play golf, they usually spend most of their time developing a basic swing.

The techniques we'll develop in this chapter include: a better choice of cost function, known as the cross-entropy cost function;

four so-called 'regularization' methods (L1 and L2 regularization, dropout, and artificial expansion of the training data), which make our networks better at generalizing beyond the training data;

As you can see, the neuron rapidly learns a weight and bias that drives down the cost, and gives an output from the neuron of about $0.09$.

Although this example uses the same learning rate ($\eta = 0.15$), we can see that learning starts out much more slowly.

To understand the origin of the problem, consider that our neuron learns by changing the weight and bias at a rate determined by the partial derivatives of the cost function, $\partial C/\partial w$ and $\partial C / \partial b$.

y(x) - a\|^2 \nonumber\end{eqnarray}$('#margin_761523280146_reveal').click(function() {$('#margin_761523280146').toggle('slow', function() {});});, is given by \begin{eqnarray} C = \frac{(y-a)^2}{2}, \tag{54}\end{eqnarray} where $a$ is the neuron's output when the training input $x = 1$ is used, and $y = 0$ is the corresponding desired output.

Using the chain rule to differentiate with respect to the weight and bias we get \begin{eqnarray} \frac{\partial C}{\partial w} & = & (a-y)\sigma'(z) x = a \sigma'(z) \tag{55}\\ \frac{\partial C}{\partial b} & = & (a-y)\sigma'(z) = a \sigma'(z), \tag{56}\end{eqnarray} where I have substituted $x = 1$ and $y = 0$.

var data = d3.range(sample).map(function(d){ return { x: x1(d), y: s(x1(d))};

var y = d3.scale.linear() .domain([0, 1]) .range([height, 0]);

}) var graph = d3.select('#sigmoid_graph') .append('svg') .attr('width', width + m[1] + m[3]) .attr('height', height + m[0] + m[2]) .append('g') .attr('transform', 'translate(' + m[3] + ',' + m[0] + ')');

var xAxis = d3.svg.axis() .scale(x) .tickValues(d3.range(-4, 5, 1)) .orient('bottom') graph.append('g') .attr('class', 'x axis') .attr('transform', 'translate(0, ' + height + ')') .call(xAxis);

var yAxis = d3.svg.axis() .scale(y) .tickValues(d3.range(0, 1.01, 0.2)) .orient('left') .ticks(5) graph.append('g') .attr('class', 'y axis') .call(yAxis);

graph.append('text') .attr('class', 'x label') .attr('text-anchor', 'end') .attr('x', width/2) .attr('y', height+35) .text('z');

graph.append('text') .attr('x', (width / 2)) .attr('y', -10) .attr('text-anchor', 'middle') .style('font-size', '16px') .text('sigmoid function');

Equations (55)\begin{eqnarray} \frac{\partial C}{\partial w} & = & (a-y)\sigma'(z) x = a \sigma'(z) \nonumber\end{eqnarray}$('#margin_631292671952_reveal').click(function() {$('#margin_631292671952').toggle('slow', function() {});});

and (56)\begin{eqnarray} \frac{\partial C}{\partial b} & = & (a-y)\sigma'(z) = a \sigma'(z) \nonumber\end{eqnarray}$('#margin_485541744696_reveal').click(function() {$('#margin_485541744696').toggle('slow', function() {});});

We'll suppose instead that we're trying to train a neuron with several input variables, $x_1, x_2, \ldots$, corresponding weights $w_1, w_2, \ldots$, and a bias, $b$: The output from the neuron is, of course, $a = \sigma(z)$, where $z = \sum_j w_j x_j+b$ is the weighted sum of the inputs.

We define the cross-entropy cost function for this neuron by \begin{eqnarray} C = -\frac{1}{n} \sum_x \left[y \ln a + (1-y ) \ln (1-a) \right], \tag{57}\end{eqnarray} where $n$ is the total number of items of training data, the sum is over all training inputs, $x$, and $y$ is the corresponding desired output.

It's not obvious that the expression (57)\begin{eqnarray} C = -\frac{1}{n} \sum_x \left[y \ln a + (1-y ) \ln (1-a) \right] \nonumber\end{eqnarray}$('#margin_243826766400_reveal').click(function() {$('#margin_243826766400').toggle('slow', function() {});});

To see this, notice that: (a) all the individual terms in the sum in (57)\begin{eqnarray} C = -\frac{1}{n} \sum_x \left[y \ln a + (1-y ) \ln (1-a) \right] \nonumber\end{eqnarray}$('#margin_693180266972_reveal').click(function() {$('#margin_693180266972').toggle('slow', function() {});});

Second, if the neuron's actual output is close to the desired output for all training inputs, $x$, then the cross-entropy will be close to zero* *To prove this I will need to assume that the desired outputs $y$ are all either $0$ or $1$.

We see that the first term in the expression (57)\begin{eqnarray} C = -\frac{1}{n} \sum_x \left[y \ln a + (1-y ) \ln (1-a) \right] \nonumber\end{eqnarray}$('#margin_577453671937_reveal').click(function() {$('#margin_577453671937').toggle('slow', function() {});});

We substitute $a = \sigma(z)$ into (57)\begin{eqnarray} C = -\frac{1}{n} \sum_x \left[y \ln a + (1-y ) \ln (1-a) \right] \nonumber\end{eqnarray}$('#margin_799354263895_reveal').click(function() {$('#margin_799354263895').toggle('slow', function() {});});, and apply the chain rule twice, obtaining: \begin{eqnarray} \frac{\partial C}{\partial w_j} & = & -\frac{1}{n} \sum_x \left( \frac{y }{\sigma(z)} -\frac{(1-y)}{1-\sigma(z)} \right) \frac{\partial \sigma}{\partial w_j} \tag{58}\\ & = & -\frac{1}{n} \sum_x \left( \frac{y}{\sigma(z)} -\frac{(1-y)}{1-\sigma(z)} \right)\sigma'(z) x_j.

\tag{59}\end{eqnarray} Putting everything over a common denominator and simplifying this becomes: \begin{eqnarray} \frac{\partial C}{\partial w_j} & = & \frac{1}{n} \sum_x \frac{\sigma'(z) x_j}{\sigma(z) (1-\sigma(z))} (\sigma(z)-y).

We see that the $\sigma'(z)$ and $\sigma(z)(1-\sigma(z))$ terms cancel in the equation just above, and it simplifies to become: \begin{eqnarray} \frac{\partial C}{\partial w_j} = \frac{1}{n} \sum_x x_j(\sigma(z)-y).

In particular, it avoids the learning slowdown caused by the $\sigma'(z)$ term in the analogous equation for the quadratic cost, Equation (55)\begin{eqnarray} \frac{\partial C}{\partial w} & = & (a-y)\sigma'(z) x = a \sigma'(z) \nonumber\end{eqnarray}$('#margin_171199604282_reveal').click(function() {$('#margin_171199604282').toggle('slow', function() {});});.

\tag{62}\end{eqnarray} Again, this avoids the learning slowdown caused by the $\sigma'(z)$ term in the analogous equation for the quadratic cost, Equation (56)\begin{eqnarray} \frac{\partial C}{\partial b} & = & (a-y)\sigma'(z) = a \sigma'(z) \nonumber\end{eqnarray}$('#margin_35523854850_reveal').click(function() {$('#margin_35523854850').toggle('slow', function() {});});.

It's that steepness which the cross-entropy buys us, preventing us from getting stuck just when we'd expect our neuron to learn fastest, i.e., when the neuron starts out badly wrong.

In particular, suppose $y = y_1, y_2, \ldots$ are the desired values at the output neurons, i.e., the neurons in the final layer, while $a^L_1, a^L_2, \ldots$ are the actual output values.

Then we define the cross-entropy by \begin{eqnarray} C = -\frac{1}{n} \sum_x \sum_j \left[y_j \ln a^L_j + (1-y_j) \ln (1-a^L_j) \right].

\tag{63}\end{eqnarray} This is the same as our earlier expression, Equation (57)\begin{eqnarray} C = -\frac{1}{n} \sum_x \left[y \ln a + (1-y ) \ln (1-a) \right] \nonumber\end{eqnarray}$('#margin_585805985912_reveal').click(function() {$('#margin_585805985912').toggle('slow', function() {});});, except now we've got the $\sum_j$ summing over all the output neurons.

I won't explicitly work through a derivation, but it should be plausible that using the expression (63)\begin{eqnarray} C = -\frac{1}{n} \sum_x \sum_j \left[y_j \ln a^L_j + (1-y_j) \ln (1-a^L_j) \right] \nonumber\end{eqnarray}$('#margin_236533012867_reveal').click(function() {$('#margin_236533012867').toggle('slow', function() {});});

This definition may be connected to (57)\begin{eqnarray} C = -\frac{1}{n} \sum_x \left[y \ln a + (1-y ) \ln (1-a) \right] \nonumber\end{eqnarray}$('#margin_282511343902_reveal').click(function() {$('#margin_282511343902').toggle('slow', function() {});});, if we treat a single sigmoid neuron as outputting a probability distribution consisting of the neuron's activation $a$ and its complement $1-a$.

Instead, you can think of (63)\begin{eqnarray} C = -\frac{1}{n} \sum_x \sum_j \left[y_j \ln a^L_j + (1-y_j) \ln (1-a^L_j) \right] \nonumber\end{eqnarray}$('#margin_471619892614_reveal').click(function() {$('#margin_471619892614').toggle('slow', function() {});});

as a summed set of per-neuron cross-entropies, with the activation of each neuron being interpreted as part of a two-element probability distribution* *Of course, in our networks there are no probabilistic elements, so they're not really probabilities..

In this sense, (63)\begin{eqnarray} C = -\frac{1}{n} \sum_x \sum_j \left[y_j \ln a^L_j + (1-y_j) \ln (1-a^L_j) \right] \nonumber\end{eqnarray}$('#margin_555495604283_reveal').click(function() {$('#margin_555495604283').toggle('slow', function() {});});

It may happen that those initial choices result in the network being decisively wrong for some training input - that is, an output neuron will have saturated near $1$, when it should be $0$, or vice versa.

It's easy to get confused about whether the right form is $-[y \ln a + (1-y) \ln (1-a)]$ or $-[a \ln y + (1-a) \ln (1-y)]$.

In the single-neuron discussion at the start of this section, I argued that the cross-entropy is small if $\sigma(z) \approx y$ for all training inputs.

This is usually true in classification problems, but for other problems (e.g., regression problems) $y$ can sometimes take values intermediate between $0$ and $1$.

When this is the case the cross-entropy has the value: \begin{eqnarray} C = -\frac{1}{n} \sum_x [y \ln y+(1-y) \ln(1-y)].

\tag{64}\end{eqnarray} The quantity $-[y \ln y+(1-y)\ln(1-y)]$ is sometimes known as the binary entropy.

Problems Many-layer multi-neuron networks In the notation introduced in the last chapter, show that for the quadratic cost the partial derivative with respect to weights in the output layer is \begin{eqnarray} \frac{\partial C}{\partial w^L_{jk}} & = & \frac{1}{n} \sum_x a^{L-1}_k (a^L_j-y_j) \sigma'(z^L_j).

\tag{65}\end{eqnarray} The term $\sigma'(z^L_j)$ causes a learning slowdown whenever an output neuron saturates on the wrong value.

Show that for the cross-entropy cost the output error $\delta^L$ for a single training example $x$ is given by \begin{eqnarray} \delta^L = a^L-y.

\tag{66}\end{eqnarray} Use this expression to show that the partial derivative with respect to the weights in the output layer is given by \begin{eqnarray} \frac{\partial C}{\partial w^L_{jk}} & = & \frac{1}{n} \sum_x a^{L-1}_k (a^L_j-y_j).

\tag{67}\end{eqnarray} The $\sigma'(z^L_j)$ term has vanished, and so the cross-entropy avoids the problem of learning slowdown, not just when used with a single neuron, as we saw earlier, but also in many-layer multi-neuron networks.

If this is not obvious to you, then you should work through that analysis as well.Using the quadratic cost when we have linear neurons in the output layer Suppose that we have a many-layer multi-neuron network.

Suppose all the neurons in the final layer are linear neurons, meaning that the sigmoid activation function is not applied, and the outputs are simply $a^L_j = z^L_j$.

Show that if we use the quadratic cost function then the output error $\delta^L$ for a single training example $x$ is given by \begin{eqnarray} \delta^L = a^L-y.

\tag{68}\end{eqnarray} Similarly to the previous problem, use this expression to show that the partial derivatives with respect to the weights and biases in the output layer are given by \begin{eqnarray} \frac{\partial C}{\partial w^L_{jk}} & = & \frac{1}{n} \sum_x a^{L-1}_k (a^L_j-y_j) \tag{69}\\ \frac{\partial C}{\partial b^L_{j}} & = & \frac{1}{n} \sum_x (a^L_j-y_j).

\tag{70}\end{eqnarray} This shows that if the output neurons are linear neurons then the quadratic cost will not give rise to any problems with a learning slowdown.

We'll do that later in the chapter, developing an improved version of our earlier program for classifying the MNIST handwritten digits, network.py.

The new program is called network2.py, and incorporates not just the cross-entropy, but also several other techniques developed in this chapter* *The code is available on GitHub..

We set the learning rate to $\eta = 0.5$* *In Chapter 1 we used the quadratic cost and a learning rate of $\eta = 3.0$.

For both cost functions I experimented to find a learning rate that provides near-optimal performance, given the other hyper-parameter choices.

There is, incidentally, a very rough general heuristic for relating the learning rate for the cross-entropy and the quadratic cost.

As we saw earlier, the gradient terms for the quadratic cost have an extra $\sigma' = \sigma(1-\sigma)$ term in them.

Suppose we'd discovered the learning slowdown described earlier, and understood that the origin was the $\sigma'(z)$ terms in Equations (55)\begin{eqnarray} \frac{\partial C}{\partial w} & = & (a-y)\sigma'(z) x = a \sigma'(z) \nonumber\end{eqnarray}$('#margin_56695342286_reveal').click(function() {$('#margin_56695342286').toggle('slow', function() {});});

and (56)\begin{eqnarray} \frac{\partial C}{\partial b} & = & (a-y)\sigma'(z) = a \sigma'(z) \nonumber\end{eqnarray}$('#margin_816298921133_reveal').click(function() {$('#margin_816298921133').toggle('slow', function() {});});.

In that case, the cost $C = C_x$ for a single training example $x$ would satisfy \begin{eqnarray} \frac{\partial C}{\partial w_j} & = & x_j(a-y) \tag{71}\\ \frac{\partial C}{\partial b } & = & (a-y).

\tag{73}\end{eqnarray} Using $\sigma'(z) = \sigma(z)(1-\sigma(z)) = a(1-a)$ the last equation becomes \begin{eqnarray} \frac{\partial C}{\partial b} = \frac{\partial C}{\partial a} a(1-a).

\tag{74}\end{eqnarray} Comparing to Equation (72)\begin{eqnarray} \frac{\partial C}{\partial b } & = & (a-y) \nonumber\end{eqnarray}$('#margin_420607753552_reveal').click(function() {$('#margin_420607753552').toggle('slow', function() {});});

\tag{75}\end{eqnarray} Integrating this expression with respect to $a$ gives \begin{eqnarray} C = -[y \ln a + (1-y) \ln (1-a)]+ {\rm constant}, \tag{76}\end{eqnarray} for some constant of integration.

To get the full cost function we must average over training examples, obtaining \begin{eqnarray} C = -\frac{1}{n} \sum_x [y \ln a +(1-y) \ln(1-a)] + {\rm constant}, \tag{77}\end{eqnarray} where the constant here is the average of the individual constants for each training example.

And so we see that Equations (71)\begin{eqnarray} \frac{\partial C}{\partial w_j} & = & x_j(a-y) \nonumber\end{eqnarray}$('#margin_241436552498_reveal').click(function() {$('#margin_241436552498').toggle('slow', function() {});});

and (72)\begin{eqnarray} \frac{\partial C}{\partial b } & = & (a-y) \nonumber\end{eqnarray}$('#margin_449702528350_reveal').click(function() {$('#margin_449702528350').toggle('slow', function() {});});

Problem We've discussed at length the learning slowdown that can occur when output neurons saturate, in networks using the quadratic cost to train.

Another factor that may inhibit learning is the presence of the $x_j$ term in Equation (61)\begin{eqnarray} \frac{\partial C}{\partial w_j} = \frac{1}{n} \sum_x x_j(\sigma(z)-y) \nonumber\end{eqnarray}$('#margin_302567879556_reveal').click(function() {$('#margin_302567879556').toggle('slow', function() {});});.

However, softmax is still worth understanding, in part because it's intrinsically interesting, and in part because we'll use softmax layers in Chapter 6, in our discussion of deep neural networks.

It begins in the same way as with a sigmoid layer, by forming the weighted inputs* *In describing the softmax we'll make frequent use of notation introduced in the last chapter.

According to this function, the activation $a^L_j$ of the $j$th output neuron is \begin{eqnarray} a^L_j = \frac{e^{z^L_j}}{\sum_k e^{z^L_k}}, \tag{78}\end{eqnarray} where in the denominator we sum over all the output neurons.

To better understand Equation (78)\begin{eqnarray} a^L_j = \frac{e^{z^L_j}}{\sum_k e^{z^L_k}} \nonumber\end{eqnarray}$('#margin_539761460435_reveal').click(function() {$('#margin_539761460435').toggle('slow', function() {});});, suppose we have a network with four output neurons, and four corresponding weighted inputs, which we'll denote $z^L_1, z^L_2, z^L_3$, and $z^L_4$.

$z^L_1 = $ $a^L_1 = $ $z^L_2$ = $a^L_2 = $ $z^L_3$ = $a^L_3 = $ $z^L_4$ = $a^L_4 = $ As you increase $z^L_4$, you'll see an increase in the corresponding output activation, $a^L_4$, and a decrease in the other output activations.

The reason is that the output activations are guaranteed to always sum up to $1$, as we can prove using Equation (78)\begin{eqnarray} a^L_j = \frac{e^{z^L_j}}{\sum_k e^{z^L_k}} \nonumber\end{eqnarray}$('#margin_321011740331_reveal').click(function() {$('#margin_321011740331').toggle('slow', function() {});});

Exercise Construct an example showing explicitly that in a network with a sigmoid output layer, the output activations $a^L_j$ won't always sum to $1$.

As a consequence, increasing $z^L_j$ is guaranteed to increase the corresponding output activation, $a^L_j$, and will decrease all the other output activations.

We already saw this empirically with the sliders, but this is a rigorous proof.Non-locality of softmax A nice thing about sigmoid layers is that the output $a^L_j$ is a function of the corresponding weighted input, $a^L_j = \sigma(z^L_j)$.

I won't go through the derivation explicitly - I'll ask you to do in the problems, below - but with a little algebra you can show that* *Note that I'm abusing notation here, using $y$ in a slightly different way to last paragraph.

In the last paragraph we used $y$ to denote the desired output from the network - e.g., output a '$7$' if an image of a $7$ was input.

But in the equations which follow I'm using $y$ to denote the vector of output activations which corresponds to $7$, that is, a vector which is all $0$s, except for a $1$ in the $7$th location.

\begin{eqnarray} \frac{\partial C}{\partial b^L_j} & = & a^L_j-y_j \tag{81}\\ \frac{\partial C}{\partial w^L_{jk}} & = & a^{L-1}_k (a^L_j-y_j) \tag{82}\end{eqnarray} These equations are the same as the analogous expressions obtained in our earlier analysis of the cross-entropy.

to Equation (67)\begin{eqnarray} \frac{\partial C}{\partial w^L_{jk}} & = & \frac{1}{n} \sum_x a^{L-1}_k (a^L_j-y_j).

Suppose we change the softmax function so the output activations are given by \begin{eqnarray} a^L_j = \frac{e^{c z^L_j}}{\sum_k e^{c z^L_k}}, \tag{83}\end{eqnarray} where $c$ is a positive constant.

This is the origin of the term 'softmax'.Backpropagation with softmax and the log-likelihood cost In the last chapter we derived the backpropagation algorithm for a network containing sigmoid layers.

To apply the algorithm to a network with a softmax layer we need to figure out an expression for the error $\delta^L_j \equiv \partial C / \partial z^L_j$ in the final layer.

\tag{84}\end{eqnarray} Using this expression we can apply the backpropagation algorithm to a network using a softmax output layer and the log-likelihood cost.

Fermi replied* *The quote comes from a charming article by Freeman Dyson, who is one of the people who proposed the flawed model.

Our 100 hidden neuron network has nearly 80,000 parameters, and state-of-the-art deep neural nets sometimes contain millions or even billions of parameters.

Using the results we can plot the way the cost changes as the network learns* *This and the next four graphs were generated by the program overfitting.py.

We can see that the cost on the test data improves until around epoch 15, but after that it actually starts to get worse, even though the cost on the training data is continuing to get better.

From a practical point of view, what we really care about is improving classification accuracy on the test data, while the cost on the test data is no more than a proxy for classification accuracy.

I wouldn't be surprised if more learning could have occurred even after epoch 400, although the magnitude of any further improvement would likely be small.

In fact, this is part of a more general strategy, which is to use the validation_data to evaluate different trial choices of hyper-parameters such as the number of epochs to train for, the learning rate, the best network architecture, and so on.

Now, in practice, even after evaluating performance on the test_data we may change our minds and want to try another approach - perhaps a different network architecture - which will involve finding a new set of hyper-parameters.

We'll keep all the other parameters the same (30 hidden neurons, learning rate 0.5, mini-batch size of 10), but train using all 50,000 images for 30 epochs.

In particular, the best classification accuracy of $97.86$ percent on the training data is only $2.53$ percent higher than the $95.33$ percent on the test data.

This is scaled by a factor $\lambda / 2n$, where $\lambda > 0$ is known as the regularization parameter, and $n$ is, as usual, the size of our training set.

\tag{86}\end{eqnarray} In both cases we can write the regularized cost function as \begin{eqnarray} C = C_0 + \frac{\lambda}{2n} \sum_w w^2, \tag{87}\end{eqnarray} where $C_0$ is the original, unregularized cost function.

The relative importance of the two elements of the compromise depends on the value of $\lambda$: when $\lambda$ is small we prefer to minimize the original cost function, but when $\lambda$ is large we prefer small weights.

In particular, we need to know how to compute the partial derivatives $\partial C / \partial w$ and $\partial C / \partial b$ for all the weights and biases in the network.

gives \begin{eqnarray} \frac{\partial C}{\partial w} & = & \frac{\partial C_0}{\partial w} + \frac{\lambda}{n} w \tag{88}\\ \frac{\partial C}{\partial b} & = & \frac{\partial C_0}{\partial b}.

\tag{89}\end{eqnarray} The $\partial C_0 / \partial w$ and $\partial C_0 / \partial b$ terms can be computed using backpropagation, as described in the last chapter.

And so we see that it's easy to compute the gradient of the regularized cost function: just use backpropagation, as usual, and then add $\frac{\lambda}{n} w$ to the partial derivative of all the weight terms.

The partial derivatives with respect to the biases are unchanged, and so the gradient descent learning rule for the biases doesn't change from the usual rule: \begin{eqnarray} b & \rightarrow & b -\eta \frac{\partial C_0}{\partial b}.

\tag{90}\end{eqnarray} The learning rule for the weights becomes: \begin{eqnarray} w & \rightarrow & w-\eta \frac{\partial C_0}{\partial w}-\frac{\eta \lambda}{n} w \tag{91}\\ & = & \left(1-\frac{\eta \lambda}{n}\right) w -\eta \frac{\partial C_0}{\partial w}.

\tag{92}\end{eqnarray} This is exactly the same as the usual gradient descent learning rule, except we first rescale the weight $w$ by a factor $1-\frac{\eta \lambda}{n}$.

Equation (20)\begin{eqnarray} w_k & \rightarrow & w_k' = w_k-\frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial w_k} \nonumber\end{eqnarray}$('#margin_408706750470_reveal').click(function() {$('#margin_408706750470').toggle('slow', function() {});});) \begin{eqnarray} w \rightarrow \left(1-\frac{\eta \lambda}{n}\right) w -\frac{\eta}{m} \sum_x \frac{\partial C_x}{\partial w}, \tag{93}\end{eqnarray} where the sum is over training examples $x$ in the mini-batch, and $C_x$ is the (unregularized) cost for each training example.

Equation (21)\begin{eqnarray} b_l & \rightarrow & b_l' = b_l-\frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial b_l} \nonumber\end{eqnarray}$('#margin_927410667596_reveal').click(function() {$('#margin_927410667596').toggle('slow', function() {});});), \begin{eqnarray} b \rightarrow b - \frac{\eta}{m} \sum_x \frac{\partial C_x}{\partial b}, \tag{94}\end{eqnarray} where the sum is over training examples $x$ in the mini-batch.

The cost on the training data decreases over the whole time, much as it did in the earlier, unregularized case* *This and the next two graphs were produced with the program overfitting.py.:

What's more, the accuracy is considerably higher, with a peak classification accuracy of $87.1$ percent, compared to the peak of $82.27$ percent obtained in the unregularized case.

The reason is because the size $n$ of the training set has changed from $n = 1,000$ to $n = 50,000$, and this changes the weight decay factor $1 - \frac{\eta \lambda}{n}$.

In fact, tuning just a little more, to run for 60 epochs at $\eta = 0.1$ and $\lambda = 5.0$ we break the $98$ percent barrier, achieving $98.04$ percent classification accuracy on the validation data.

Empirically, when doing multiple runs of our MNIST networks, but with different (random) weight initializations, I've found that the unregularized runs will occasionally get 'stuck', apparently caught in local minima of the cost function.

This can cause the weight vector to get stuck pointing in more or less the same direction, since changes due to gradient descent only make tiny changes to the direction, when the length is long.

A standard story people tell to explain what's going on is along the following lines: smaller weights are, in some sense, lower complexity, and so provide a simpler and more powerful explanation for the data, and should thus be preferred.

Now, there are ten points in the graph above, which means we can find a unique $9$th-order polynomial $y = a_0 x^9 + a_1 x^8 + \ldots + a_9$ which fits the data exactly.

Here's the graph of that polynomial* *I won't show the coefficients explicitly, although they are easy to find using a routine such as Numpy's polyfit.

But let's consider two possibilities: (1) the $9$th order polynomial is, in fact, the model which truly describes the real-world phenomenon, and the model will therefore generalize perfectly;

If we try to do that there will be a dramatic difference between the predictions of the two models, as the $9$th order polynomial model comes to be dominated by the $x^9$ term, while the linear model remains, well, linear.

In the case at hand, the model $y = 2x+{\rm noise}$ seems much simpler than $y = a_0 x^9 + a_1 x^8 + \ldots$.

And so while the 9th order model works perfectly for these particular data points, the model will fail to generalize to other data points, and the noisy linear model will have greater predictive power.

In a nutshell, regularized networks are constrained to build relatively simple models based on patterns seen often in the training data, and are resistant to learning peculiarities of the noise in the training data.

Subsequent work confirmed that Nature agreed with Bethe, and Schein's particle is no more* *The story is related by the physicist Richard Feynman in an interview with the historian Charles Weiner..

I've included the stories above merely to help convey why no-one has yet developed an entirely convincing theoretical explanation for why regularization helps networks generalize.

Indeed, researchers continue to write papers where they try different approaches to regularization, compare them to see which works better, and attempt to understand why different approaches work better or worse.

Regularization may give us a computational magic wand that helps our networks generalize better, but it doesn't give us a principled understanding of how generalization works, nor of what the best approach is* *These issues go back to the problem of induction, famously discussed by the Scottish philosopher David Hume in 'An Enquiry Concerning Human Understanding' (1748).

The problem of induction has been given a modern machine learning form in the no-free lunch theorem (link) of David Wolpert and William Macready (1997)..

I expect that in years to come we will develop more powerful techniques for regularization in artificial neural networks, techniques that will ultimately enable neural nets to generalize well even from small data sets.

At the same time, allowing large biases gives our networks more flexibility in behaviour - in particular, large biases make it easier for neurons to saturate, which is sometimes desirable.

L1 regularization: In this approach we modify the unregularized cost function by adding the sum of the absolute values of the weights: \begin{eqnarray} C = C_0 + \frac{\lambda}{n} \sum_w |w|.

we obtain: \begin{eqnarray} \frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n} \, {\rm sgn}(w), \tag{96}\end{eqnarray} where ${\rm sgn}(w)$ is the sign of $w$, that is, $+1$ if $w$ is positive, and $-1$ if $w$ is negative.

The resulting update rule for an L1 regularized network is \begin{eqnarray} w \rightarrow w' = w-\frac{\eta \lambda}{n} \mbox{sgn}(w) - \eta \frac{\partial C_0}{\partial w}, \tag{97}\end{eqnarray} where, as per usual, we can estimate $\partial C_0 / \partial w$ using a mini-batch average, if we wish.

Equation (93)\begin{eqnarray} w \rightarrow \left(1-\frac{\eta \lambda}{n}\right) w -\frac{\eta}{m} \sum_x \frac{\partial C_x}{\partial w}, \nonumber\end{eqnarray}$('#margin_50828971379_reveal').click(function() {$('#margin_50828971379').toggle('slow', function() {});});), \begin{eqnarray} w \rightarrow w' = w\left(1 - \frac{\eta \lambda}{n} \right) - \eta \frac{\partial C_0}{\partial w}.

The net result is that L1 regularization tends to concentrate the weight of the network in a relatively small number of high-importance connections, while the other weights are driven toward zero.

To put it more precisely, we'll use Equations (96)\begin{eqnarray} \frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n} \, {\rm sgn}(w) \nonumber\end{eqnarray}$('#margin_791839509784_reveal').click(function() {$('#margin_791839509784').toggle('slow', function() {});});

and (97)\begin{eqnarray} w \rightarrow w' = w-\frac{\eta \lambda}{n} \mbox{sgn}(w) - \eta \frac{\partial C_0}{\partial w} \nonumber\end{eqnarray}$('#margin_31147978331_reveal').click(function() {$('#margin_31147978331').toggle('slow', function() {});});

We then repeat the process, first restoring the dropout neurons, then choosing a new random subset of hidden neurons to delete, estimating the gradient for a different mini-batch, and updating the weights and biases in the network.

related heuristic explanation for dropout is given in one of the earliest papers to use the technique* *ImageNet Classification with Deep Convolutional Neural Networks, by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton (2012).: 'This technique reduces complex co-adaptations of neurons, since a neuron cannot rely on the presence of particular other neurons.

The original paper* *Improving neural networks by preventing co-adaptation of feature detectors by Geoffrey Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov (2012).

We train using a mini-batch size of 10, a learning rate $\eta = 0.5$, a regularization parameter $\lambda = 5.0$, and the cross-entropy cost function.

We will train for 30 epochs when the full training data set is used, and scale up the number of epochs proportionally when smaller training sets are used.

To ensure the weight decay factor remains the same across training sets, we will use a regularization parameter of $\lambda = 5.0$ when the full training data set is used, and scale down $\lambda$ proportionally when smaller training sets are used* *This and the next two graph are produced with the program more_data.py..

This suggests that if we used vastly more training data - say, millions or even billions of handwriting samples, instead of just 50,000 - then we'd likely get considerably better performance, even from this very small network.

One of the neural network architectures they considered was along similar lines to what we've been using, a feedforward network with 800 hidden neurons and using the cross-entropy cost function.

They also experimented with what they called 'elastic distortions', a special type of image distortion intended to emulate the random oscillations found in hand muscles.

These techniques are not always used - for instance, instead of expanding the training data by adding noise, it may well be more efficient to clean up the input to the network by first applying a noise reduction filter.

An aside on big data and what it means to compare classification accuracies: Let's look again at how our neural network's accuracy varies with training set size:

I've plotted the neural net results as well, to make comparison easy* *This graph was produced with the program more_data.py (as were the last few graphs).:

A more subtle but more interesting fact about the graph is that if we train our SVM using 50,000 images then it actually has better performance (94.48 percent accuracy) than our neural network does when trained using 5,000 images (93.24 percent accuracy).

We don't see that above - it would require the two graphs to cross - but it does happen* *Striking examples may be found in Scaling to very very large corpora for natural language disambiguation, by Michele Banko and Eric Brill (2001)..

It's fine to look for better algorithms, but make sure you're not focusing on better algorithms to the exclusion of easy wins getting more or better training data.

For any given algorithm it's natural to attempt to define a notion of asymptotic performance in the limit of truly big data.

A quick-and-dirty approach to this problem is to simply try fitting curves to graphs like those shown above, and then to extrapolate the fitted curves out to infinity.

Just to remind you, that prescription was to choose both the weights and biases using independent Gaussian random variables, normalized to have mean $0$ and standard deviation $1$.

While this approach has worked well, it was quite ad hoc, and it's worth revisiting to see if we can find a better way of setting our initial weights and biases, and perhaps help our neural networks learn faster.

We'll suppose for simplicity that we're trying to train using a training input $x$ in which half the input neurons are on, i.e., set to $1$, and half the input neurons are off, i.e., set to $0$.

And so $z$ is a sum over a total of $501$ normalized Gaussian random variables, accounting for the $500$ weight terms and the $1$ extra bias term.

That miniscule change in the activation of the hidden neuron will, in turn, barely affect the rest of the neurons in the network at all, and we'll see a correspondingly miniscule change in the cost function.

As a result, those weights will only learn very slowly when we use the gradient descent algorithm* *We discussed this in more detail in Chapter 2, where we used the equations of backpropagation to show that weights input to saturated neurons learned slowly..

Of course, similar arguments apply also to later hidden layers: if the weights in later hidden layers are initialized using normalized Gaussians, then activations will often be very close to $0$ or $1$, and learning will proceed very slowly.

With these choices, the weighted sum $z = \sum_j w_j x_j + b$ will again be a Gaussian random variable with mean $0$, but it'll be much more sharply peaked than it was before.

It may help to know that: (a) the variance of a sum of independent random variables is the sum of the variances of the individual random variables;

As before, we'll use $30$ hidden neurons, a mini-batch size of $10$, a regularization parameter $\lambda = 5.0$, and the cross-entropy cost function.

At the end of the first epoch of training the old approach to weight initialization has a classification accuracy under 87 percent, while the new approach is already almost 93 percent.

If you're interested in looking further, I recommend looking at the discussion on pages 14 and 15 of a 2012 paper by Yoshua Bengio* *Practical Recommendations for Gradient-Based Training of Deep Architectures, by Yoshua Bengio (2012)., as well as the references therein.

Problem Connecting regularization and the improved method of weight initialization L2 regularization sometimes automatically gives us something similar to the new approach to weight initialization.

Sketch a heuristic argument that: (1) supposing $\lambda$ is not too small, the first epochs of training will be dominated almost entirely by weight decay;

(2) provided $\eta \lambda \ll n$ the weights will decay by a factor of $\exp(-\eta \lambda / m)$ per epoch;

and (3) supposing $\lambda$ is not too large, the weight decay will tail off when the weights are down to a size around $1/\sqrt{n}$, where $n$ is the total number of weights in the network.

As we've seen, in that approach the weights input to a neuron are initialized as Gaussian random variables with mean 0 and standard deviation $1$ divided by the square root of the number of connections input to the neuron.

for x, y in zip(self.sizes[:-1], self.sizes[1:])]

This method initializes the weights and biases using the old approach from Chapter 1, with both weights and biases initialized as Gaussian random variables with mean $0$ and standard deviation $1$.

for x, y in zip(self.sizes[:-1], self.sizes[1:])]

To understand how that works, let's look at the class we use to represent the cross-entropy cost* *If you're not familiar with Python's static methods you can ignore the @staticmethod decorators, and just treat fn and delta as ordinary methods.

If you're curious about details, all @staticmethod does is tell the Python interpreter that the method which follows doesn't depend on the object in any way.

(Note, by the way, that the np.nan_to_num call inside CrossEntropyCost.fn ensures that Numpy deals correctly with the log of numbers very close to zero.) But there's also a second way the cost function enters our network.

You don't need to read all the code in detail, but it is worth understanding the broad structure, and in particular reading the documentation strings, so you understand what each piece of the program is doing.

for x, y in zip(self.sizes[:-1], self.sizes[1:])]

for x, y in zip(self.sizes[:-1], self.sizes[1:])]

training_data[k:k+mini_batch_size]

for k in xrange(0, n, mini_batch_size)]

self.update_mini_batch(

mini_batch, eta, lmbda, len(training_data))

cost = self.total_cost(training_data, lmbda)

training_cost.append(cost)

print "Cost on training data: {}".format(cost)

accuracy = self.accuracy(training_data, convert=True)

training_accuracy.append(accuracy)

print "Accuracy on training data: {} / {}".format(

accuracy, n)

cost = self.total_cost(evaluation_data, lmbda, convert=True)

evaluation_cost.append(cost)

print "Cost on evaluation data: {}".format(cost)

accuracy = self.accuracy(evaluation_data)

evaluation_accuracy.append(accuracy)

print "Accuracy on evaluation data: {} / {}".format(

self.accuracy(evaluation_data), n_data)

delta_nabla_b, delta_nabla_w = self.backprop(x, y)

nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]

nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]

for w, nw in zip(self.weights, nabla_w)]

for b, nb in zip(self.biases, nabla_b)]

activations = [x] # list to store all the activations, layer by layer

zs = [] # list to store all the z vectors, layer by layer

# l = 1 means the last layer of neurons, l = 2 is the

delta = np.dot(self.weights[-l+1].transpose(), delta) * sp

for (x, y) in data]

for (x, y) in data]

"weights": [w.tolist() for w in self.weights],

"biases": [b.tolist() for b in self.biases],

"cost": str(self.cost.__name__)}

Problems Modify the code above to implement L1 regularization, and use L1 regularization to classify MNIST digits using a $30$ hidden neuron network.

In network2.py we've eliminated the Network.cost_derivative method entirely, instead incorporating its functionality into the CrossEntropyCost.delta method.

Let's suppose that by good fortune in our first experiments we choose many of the hyper-parameters in the same way as was done earlier this chapter: 30 hidden neurons, a mini-batch size of 10, training for 30 epochs using the cross-entropy.

Broad strategy: When using neural networks to attack a new problem the first challenge is to get any non-trivial learning, i.e., for the network to achieve results better than chance.

Not only is that an inherently easier problem than distinguishing all ten digits, it also reduces the amount of training data by 80 percent, speeding up training by a factor of 5.

With 50,000 images per epoch, that means waiting a little while - about ten seconds per epoch, on my laptop, when training a [784, 30, 10] network - before getting feedback on how well the network is learning.

Of course, ten seconds isn't very long, but if you want to trial dozens of hyper-parameter choices it's annoying, and if you want to trial hundreds or thousands of choices it starts to get debilitating.

Furthermore, instead of using the full 10,000 image validation set to monitor performance, we can get a much faster estimate using just 100 validation images.

(As you perhaps realize, that's a silly guess, for reasons we'll discuss shortly, but please bear with me.) So to test our guess we try dialing $\eta$ up to $100.0$: >>>

However, many of the remarks apply also to other hyper-parameters, including those associated to network architecture, other forms of regularization, and some hyper-parameters we'll meet later in the book, such as the momentum co-efficient.

Learning rate: Suppose we run three MNIST networks with three different learning rates, $\eta = 0.025$, $\eta = 0.25$ and $\eta = 2.5$, respectively.

With $\eta = 0.25$ the cost initially decreases, but after about $20$ epochs it is near saturation, and thereafter most of the changes are merely small and apparently random oscillations.

That's likely* *This picture is helpful, but it's intended as an intuition-building illustration of what may go on, not as a complete, exhaustive explanation.

Briefly, a more complete explanation is as follows: gradient descent uses a first-order approximation to the cost function as a guide to how to decrease the cost.

For large $\eta$, higher-order terms in the cost function become more important, and may dominate the behaviour, causing gradient descent to break down.

This is especially likely as we approach minima and quasi-minima of the cost function, since near such points the gradient becomes small, making it easier for higher-order terms to dominate behaviour.

If the cost decreases during the first few epochs, then you should successively try $\eta = 0.1, 1.0, \ldots$ until you find a value for $\eta$ where the cost oscillates or increases during the first few epochs.

Alternately, if the cost oscillates or increases during the first few epochs when $\eta = 0.01$, then try $\eta = 0.001, 0.0001, \ldots$ until you find a value for $\eta$ where the cost decreases during the first few epochs.

You may optionally refine your estimate, to pick out the largest value of $\eta$ at which the cost decreases during the first few epochs, say $\eta = 0.5$ or $\eta = 0.2$ (there's no need for this to be super-accurate).

In fact, I found that using $\eta = 0.5$ worked well enough over $30$ epochs that for the most part I didn't worry about using a lower value of $\eta$.

However, using the training cost to pick $\eta$ appears to contradict what I said earlier in this section, namely, that we'd pick hyper-parameters by evaluating performance using our held-out validation data.

In fact, we'll use validation accuracy to pick the regularization hyper-parameter, the mini-batch size, and network parameters such as the number of layers and hidden neurons, and so on.

Its primary purpose is really to control the step size in gradient descent, and monitoring the training cost is the best way to detect if the step size is too big.

Use early stopping to determine the number of training epochs: As we discussed earlier in the chapter, early stopping means that at the end of each epoch we should compute the classification accuracy on the validation data.

In that case, I suggest using the no-improvement-in-ten rule for initial experimentation, and gradually adopting more lenient rules, as you better understand the way your network trains: no-improvement-in-twenty, no-improvement-in-fifty, and so on.

However, it's well worth modifying network2.py to implement early stopping: Problem Modify network2.py so that it implements early stopping using a no-improvement-in-$n$ epochs strategy, where $n$ is a parameter that can be set.

Add your rule to network2.py, and run three experiments comparing the validation accuracies and number of epochs of training to no-improvement-in-$10$.

Later, if you want to obtain the best performance from your network, it's worth experimenting with a learning schedule, along the lines I've described* *A readable recent paper which demonstrates the benefits of variable learning rates in attacking MNIST is Deep, Big, Simple Neural Nets Excel on Handwritten Digit Recognition, by Dan Claudiu Cireșan, Ueli Meier, Luca Maria Gambardella, and Jürgen Schmidhuber (2010)..

Exercise Modify network2.py so that it implements a learning schedule that: halves the learning rate each time the validation accuracy satisfies the no-improvement-in-$10$ rule;

If anyone knows of a good principled discussion of where to start with $\lambda$, I'd appreciate hearing it (mn@michaelnielsen.org)., and then increase or decrease by factors of $10$, as needed to improve performance on the validation data.

How I selected hyper-parameters earlier in this book: If you use the recommendations in this section you'll find that you get values for $\eta$ and $\lambda$ which don't always exactly match the values I've used earlier in the book.

Think of all the comparisons we've made of different approaches to learning, e.g., comparing the quadratic and cross-entropy cost functions, comparing the old and new methods of weight initialization, running with and without regularization, and so on.

In a problem in the last chapter I pointed out that it's possible to use matrix techniques to compute the gradient update for all examples in a mini-batch simultaneously, rather than looping over them.

Depending on the details of your hardware and linear algebra library this can make it quite a bit faster to compute the gradient estimate for a mini-batch of (for example) size $100$, rather than computing the mini-batch gradient estimate by looping over the $100$ training examples separately.

With our mini-batch of size $100$ the learning rule for the weights looks like: \begin{eqnarray} w \rightarrow w' = w-\eta \frac{1}{100} \sum_x \nabla C_x, \tag{100}\end{eqnarray} where the sum is over training examples in the mini-batch.

Suppose, however, that in the mini-batch case we increase the learning rate by a factor $100$, so the update rule becomes \begin{eqnarray} w \rightarrow w' = w-\eta \sum_x \nabla C_x.

Of course, it's not truly the same as $100$ instances of online learning, since in the mini-batch the $\nabla C_x$'s are all evaluated for the same set of weights, as opposed to the cumulative learning that occurs in the online case.

Fortunately, the choice of mini-batch size at which the speed is maximized is relatively independent of the other hyper-parameters (apart from the overall architecture), so you don't need to have optimized those hyper-parameters in order to find a good mini-batch size.

Plot the validation accuracy versus time (as in, real elapsed time, not epoch!), and choose whichever mini-batch size gives you the most rapid improvement in performance.

In practical implementations, however, we would most certainly implement the faster approach to mini-batch updates, and then make an effort to optimize the mini-batch size, in order to maximize our overall speed.

A review of both the achievements and the limitations of grid search (with suggestions for easily-implemented alternatives) may be found in a 2012 paper* *Random search for hyper-parameter optimization, by James Bergstra and Yoshua Bengio (2012).

I won't review all that work here, but do want to mention a particularly promising 2012 paper which used a Bayesian approach to automatically optimize hyper-parameters* *Practical Bayesian optimization of machine learning algorithms, by Jasper Snoek, Hugo Larochelle, and Ryan Adams..

The difficulty of choosing hyper-parameters is exacerbated by the fact that the lore about how to choose hyper-parameters is widely spread, across many research papers and software programs, and often is only available inside the heads of individual practitioners.

Yoshua Bengio has a 2012 paper* *Practical recommendations for gradient-based training of deep architectures, by Yoshua Bengio (2012).

Both these papers appear in an extremely useful 2012 book that collects many tricks commonly used in neural nets* *Neural Networks: Tricks of the Trade, edited by Grégoire Montavon, Geneviève Orr, and Klaus-Robert Müller..

Instead, we're just going to consider the abstract problem of minimizing a cost function $C$ which is a function of many variables, $w = w_1, w_2, \ldots$, so $C = C(w)$.

By Taylor's theorem, the cost function can be approximated near a point $w$ by \begin{eqnarray} C(w+\Delta w) & = & C(w) + \sum_j \frac{\partial C}{\partial w_j} \Delta w_j \nonumber \\ & & + \frac{1}{2} \sum_{jk} \Delta w_j \frac{\partial^2 C}{\partial w_j \partial w_k} \Delta w_k + \ldots \tag{103}\end{eqnarray} We can rewrite this more compactly as \begin{eqnarray} C(w+\Delta w) = C(w) + \nabla C \cdot \Delta w + \frac{1}{2} \Delta w^T H \Delta w + \ldots, \tag{104}\end{eqnarray} where $\nabla C$ is the usual gradient vector, and $H$ is a matrix known as the Hessian matrix, whose $jk$th entry is $\partial^2 C / \partial w_j \partial w_k$.

Suppose we approximate $C$ by discarding the higher-order terms represented by $\ldots$ above, \begin{eqnarray} C(w+\Delta w) \approx C(w) + \nabla C \cdot \Delta w + \frac{1}{2} \Delta w^T H \Delta w.

\tag{105}\end{eqnarray} Using calculus we can show that the expression on the right-hand side can be minimized* *Strictly speaking, for this to be a minimum, and not merely an extremum, we need to assume that the Hessian matrix is positive definite.

\tag{106}\end{eqnarray} Provided (105)\begin{eqnarray} C(w+\Delta w) \approx C(w) + \nabla C \cdot \Delta w + \frac{1}{2} \Delta w^T H \Delta w \nonumber\end{eqnarray}$('#margin_111149152883_reveal').click(function() {$('#margin_111149152883').toggle('slow', function() {});});

$\ldots$ In practice, (105)\begin{eqnarray} C(w+\Delta w) \approx C(w) + \nabla C \cdot \Delta w + \frac{1}{2} \Delta w^T H \Delta w \nonumber\end{eqnarray}$('#margin_298650340883_reveal').click(function() {$('#margin_298650340883').toggle('slow', function() {});});

We introduce velocity variables $v = v_1, v_2, \ldots$, one for each corresponding $w_j$ variable* *In a neural net the $w_j$ variables would, of course, include all weights and biases..

Then we replace the gradient descent update rule $w \rightarrow w'= w-\eta \nabla C$ by \begin{eqnarray} v & \rightarrow & v' = \mu v - \eta \nabla C \tag{107}\\ w & \rightarrow & w' = w+v'.

That's the reason for the $\mu$ hyper-parameter in (107)\begin{eqnarray} v & \rightarrow & v' = \mu v - \eta \nabla C \nonumber\end{eqnarray}$('#margin_447453385169_reveal').click(function() {$('#margin_447453385169').toggle('slow', function() {});});.

By contrast, when $\mu = 0$ there's a lot of friction, the velocity can't build up, and Equations (107)\begin{eqnarray} v & \rightarrow & v' = \mu v - \eta \nabla C \nonumber\end{eqnarray}$('#margin_825904886358_reveal').click(function() {$('#margin_825904886358').toggle('slow', function() {});});

and (108)\begin{eqnarray} w & \rightarrow & w' = w+v' \nonumber\end{eqnarray}$('#margin_555404039918_reveal').click(function() {$('#margin_555404039918').toggle('slow', function() {});});

Another technique which has recently shown promising results* *See, for example, On the importance of initialization and momentum in deep learning, by Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton (2012).

However, for many problems, plain stochastic gradient descent works well, especially if momentum is used, and so we'll stick to stochastic gradient descent through the remainder of this book.

The output of a tanh neuron with input $x$, weight vector $w$, and bias $b$ is given by \begin{eqnarray} \tanh(w \cdot x+b), \tag{109}\end{eqnarray} where $\tanh$ is, of course, the hyperbolic tangent function.

\tag{110}\end{eqnarray} With a little algebra it can easily be verified that \begin{eqnarray} \sigma(z) = \frac{1+\tanh(z/2)}{2}, \tag{111}\end{eqnarray} that is, $\tanh$ is just a rescaled version of the sigmoid function.

var data = d3.range(sample).map(function(d){ return { x: x1(d), y: s(x1(d))};

var y = d3.scale.linear() .domain([-1,1]) .range([height, 0]);

}) var graph = d3.select('#tanh') .append('svg') .attr('width', width + m[1] + m[3]) .attr('height', height + m[0] + m[2]) .append('g') .attr('transform', 'translate(' + m[3] + ',' + m[0] + ')');

var xAxis = d3.svg.axis() .scale(x) .tickValues(d3.range(-4, 5, 1)) .orient('bottom') graph.append('g') .attr('class', 'x axis') .attr('transform', 'translate(0, ' + height/2 + ')') .call(xAxis);

var yAxis = d3.svg.axis() .scale(y) .tickValues(d3.range(-1, 1.01, 0.5)) .orient('left') graph.append('g') .attr('class', 'y axis') .call(yAxis);

graph.append('text') .attr('class', 'x label') .attr('text-anchor', 'end') .attr('x', width+20) .attr('y', height/2+8) .text('z');

graph.append('text') .attr('x', (width / 2)) .attr('y', -10) .attr('text-anchor', 'middle') .style('font-size', '16px') .text('tanh function');

This means that if you're going to build a network based on tanh neurons you may need to normalize your outputs (and, depending on the details of the application, possibly your inputs) a little differently than in sigmoid networks.

Similar to sigmoid neurons, a network of tanh neurons can, in principle, compute any function* *There are some technical caveats to this statement for both tanh and sigmoid neurons, as well as for the rectified linear neurons discussed below.

However, informally it's usually fine to think of neural networks as being able to approximate any function to arbitrary accuracy.

However, there are theoretical arguments and some empirical evidence to suggest that the tanh sometimes performs better* *See, for example, Efficient BackProp, by Yann LeCun, Léon Bottou, Genevieve Orr and Klaus-Robert Müller (1998), and Understanding the difficulty of training deep feedforward networks, by Xavier Glorot and Yoshua Bengio (2010)..

What this means is that if $\delta^{l+1}_j$ is positive then all the weights $w^{l+1}_{jk}$ will decrease during gradient descent, while if $\delta^{l+1}_j$ is negative then all the weights $w^{l+1}_{jk}$ will increase during gradient descent.

Indeed, because $\tanh$ is symmetric about zero, $\tanh(-z) = -\tanh(z)$, we might even expect that, roughly speaking, the activations in hidden layers would be equally balanced between positive and negative.

The output of a rectified linear unit with input $x$, weight vector $w$, and bias $b$ is given by \begin{eqnarray} \max(0, w \cdot x+b).

var data = d3.range(sample).map(function(d){ return { x: x1(d), y: s(x1(d))};

var y = d3.scale.linear() .domain([-5, 5]) .range([height, 0]);

}) var graph = d3.select('#relu') .append('svg') .attr('width', width + m[1] + m[3]) .attr('height', height + m[0] + m[2]) .append('g') .attr('transform', 'translate(' + m[3] + ',' + m[0] + ')');

var xAxis = d3.svg.axis() .scale(x) .tickValues(d3.range(-4, 5.01, 1)) .orient('bottom') graph.append('g') .attr('class', 'x axis') .attr('transform', 'translate(0, ' + height/2 + ')') .call(xAxis);

var yAxis = d3.svg.axis() .scale(y) .tickValues(d3.range(-4, 5.01, 1)) .orient('left') graph.append('g') .attr('class', 'y axis') .call(yAxis);

graph.append('text') .attr('class', 'x label') .attr('text-anchor', 'end') .attr('x', width+20) .attr('y', height/2+8) .text('z');

graph.append('text') .attr('x', (width / 2)) .attr('y', -10) .attr('text-anchor', 'middle') .style('font-size', '16px') .text('max(0, z)');

However, like the sigmoid and tanh neurons, rectified linear units can be used to compute any function, and they can be trained using ideas such as backpropagation and stochastic gradient descent.

Some recent work on image recognition* *See, for example, What is the Best Multi-Stage Architecture for Object Recognition?, by Kevin Jarrett, Koray Kavukcuoglu, Marc'Aurelio Ranzato and Yann LeCun (2009), Deep Sparse Rectiﬁer Neural Networks, by Xavier Glorot, Antoine Bordes, and Yoshua Bengio (2011), and ImageNet Classification with Deep Convolutional Neural Networks, by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton (2012).

Note that these papers fill in important details about how to set up the output layer, cost function, and regularization in networks using rectified linear units.

Another informative paper is Rectified Linear Units Improve Restricted Boltzmann Machines, by Vinod Nair and Geoffrey Hinton (2010), which demonstrates the benefits of using rectified linear units in a somewhat different approach to neural networks.

Through the remainder of this book I'll continue to use sigmoid neurons as our go-to neuron, since they're powerful and provide concrete illustrations of the core ideas about neural nets.

On stories in neural networks Question: How do you approach utilizing and researching machine learning techniques that are supported almost entirely empirically, as opposed to mathematically?

Sometimes our intuition ends up being wrong [...] The questions become: how well does my method work on this particular problem, and how large is the set of problems on which it works well.

Sometimes our intuition ends up being wrong [...] The questions become: how well does my method work on this particular problem, and how large is the set of problems on which it works well.

Once, attending a conference on the foundations of quantum mechanics, I noticed what seemed to me a most curious verbal habit: when talks finished, questions from the audience often began with 'I'm very sympathetic to your point of view, but [...]'.

Quantum foundations was not my usual field, and I noticed this style of questioning because at other scientific conferences I'd rarely or never heard a questioner express their sympathy for the point of view of the speaker.

If you look through the research literature you'll see that stories in a similar style appear in many research papers on neural nets, often with thin supporting evidence.

In many parts of science - especially those parts that deal with simple phenomena - it's possible to obtain very solid, very reliable evidence for quite general hypotheses.

For example, consider the statement I quoted earlier, explaining why dropout works* *From ImageNet Classification with Deep Convolutional Neural Networks by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton (2012).: 'This technique reduces complex co-adaptations of neurons, since a neuron cannot rely on the presence of particular other neurons.

It's going to take decades (or longer) for the community of neural networks researchers to develop a really powerful, evidence-based theory of how neural networks learn.

When you understand something poorly - as the explorers understood geography, and as we understand neural nets today - it's more important to explore boldly than it is to be rigorously correct in every step of your thinking.

And so you should view these stories as a useful guide to how to think about neural nets, while retaining a healthy awareness of the limitations of such stories, and carefully keeping track of just how strong the evidence is for any given line of reasoning.

- On Sunday, January 20, 2019

**Gradient descent, how neural networks learn | Chapter 2, deep learning**

Subscribe for more (part 3 will be on backpropagation): Thanks to everybody supporting on Patreon

**Lecture 9.1 — Neural Networks Learning | Cost Function — [ Machine Learning | Andrew Ng]**

Copyright Disclaimer Under Section 107 of the Copyright Act 1976, allowance is made for "FAIR USE" for purposes such as criticism, comment, news reporting, ...

**Loss in a Neural Network explained**

In this video, we explain the concept of loss in an artificial neural network and show how to specify the loss function in code with Keras. Follow deeplizard on ...

**Neural networks [7.3] : Deep learning - unsupervised pre-training**

**How to Predict Stock Prices Easily - Intro to Deep Learning #7**

We're going to predict the closing price of the S&P 500 using a special type of recurrent neural network called an LSTM network. I'll explain why we use ...

**Neural networks [2.2] : Training neural networks - loss function**

**Beginner Intro to Neural Networks 7: Slope of Cost + Simple Train in Python**

In the last video we saw three ways of finding the derivative of our cost function Let's look at those ways using python Start up the environment Let's define our ...

**Forecasting with Neural Networks: Part A**

What is a neural network, neural network terminology, and setting up a network for time series forecasting This video supports the textbook Practical Time Series ...

**Neural Network Fundamentals (Part 4): Prediction**

From In this part we see how to present data to a neural network to predict data

**Neural Network Stock Price Prediction in Excel**

This demo shows an example of forecasting stock prices using NeuroXL Predictor excel add-in