AI News, BOOK REVIEW: Deep Learning Basics: Neural Networks, Backpropagation and Stochastic Gradient Descent

Deep Learning Basics: Neural Networks, Backpropagation and Stochastic Gradient Descent

They are called neurons because they share some similarities with the behaviour of the neurons present in the human brain (though this comparison has drawn a lot of criticism from neuroscientists).

neuron is a function that maps an input vector \(\{x_1,...,x_K\}\) to a scalar output \(y\) via a weight vector \(\{w_1,...,w_K\}\) and a nonlinear function \(f\).

A common choice for this link function is the logistic function which is defined as $$f(u)=\frac{1}{1+e^{u}}$$ With the appropriate substitutions the final formula for the single neuron model becomes $$y=\frac{1}{1+e^{\mathbf{w^Tx}}}$$ If you plot the logistic function,

Our objective function \(E\) is already defined in terms of a single data point so let's procede to compute its gradient with respect to an aribtrary element of our weight vector \(w_i\).

Now we are able to obtain the stochastic gradient descent update equation (in vector notation) $$\mathbf{w}^{new}=\mathbf{w}^{old}- \eta \cdot (y-t) \cdot y(1-y) \cdot \mathbf{x}$$

As stated previously, \((\mathbf{x},y)\) data points are sequentially fed into this update equation until the weights \(\mathbf{w}\) converge to their optimal value.

Every element in the input layer is connected to every neuron in the hidden layer with \(w_{ki}\) indicating the weight associated with the connection between the \(k^{th}\) input element and the \(i^{th}\) hidden neuron.

The same connection structure is present between the hidden and output layers with \(w'_{ij}\) indicating the weight associated with the connection between the \(i^{th}\) hidden neuron and the \(j^{th}\) output neuron.

It is helpful to think of the weight \(w_{ki}\) as the the \((k,i)^{th}\) entry in a \(K \times N\) weight matrix \(\mathbf{W}\) and similarly weight \(w'_{ij}\) as the \((i,j)^{th}\) entry in a \(N \times M\) weight matrix \(\mathbf{W'}\).

For example, the output of an arbitrary neuron in the hidden layer \(h_i\) is $$h_i=f(u_i)=f(\sum^K_{k=1}w_{ki}x_k)$$ and similarly for the output of an arbitrary output neuron \(y_j\) is $$y_j=f(u'_j)=f(\sum^N_{i=1}w'_{ij}h_i)$$ The objective function is also the same as before except now it is summed over all elements in the output layer.

We first take the derivative of the logistic function with respect to its input \(u'_j\), then finally we can take the derivative of this input with respect to \(w'_{ij}\) and we arrive at our desired value.

Using the chain rule, the full gradient is $$\frac{\partial E}{\partial w_{ki}}=\sum^M_{j=1}(\frac{\partial E}{\partial y_j}\cdot \frac{\partial y_j}{\partial u'_j} \cdot \frac{\partial u'_j}{\partial h_i} )\cdot \frac{\partial h_i}{\partial u_i} \cdot \frac{\partial u_i}{\partial w_{ki}}$$

We have already computed both \(\frac{\partial E}{\partial y_j}\) and \(\frac{\partial y_j}{\partial u'_j}\) which means that $$\frac{\partial E}{\partial y_j}\cdot \frac{\partial y_j}{\partial u'_j} = (y_j-t_j) \cdot y_j(1-y_j)$$

And the update equation becomes $$w^{new}_{ki}=w^{old}_{ki} - \eta \cdot \sum^M_{j=1}[(y_j-t_j) \cdot y_j(1-y_j) \cdot w'_{ij}] \cdot h_i(1-h_i) \cdot x_k$$

This process is known as backpropagation because we begin with the final output error \(y_j-t_j\) for the output neuron \(j\) and this error gets propagated backwards throughout the network in order to update the weights.

In this blog post we started with the simple single neuron model and we learned the model weights by computing the gradient of the objective function and using it in the stochastic gradient descent update equation.

Neural Nets

With distributed representations, we instead have neurons that learn the different colors and other neurons which learn the different types (for instance, with three types and three colors, we have six neurons).

One challenge with neural networks is that it is often hard to understand what a neural network has learned - they may be black boxes which do what we want, but we can't peer in and clearly see how it's doing it.

Note that the term 'unit' is often used instead of 'neuron' when discussing artificial neural networks to dissociate these from the biological version - while there is some basis in biological neural networks, there are vast differences, so it is a deceit to present them as analogous.

In the simplest artificial neuron, a 'binary' or 'classic spin' neuron, the neuron 'fires' an output of '1' if the weighted sum of these inputs is above some threshold, or '-1' if otherwise.

However, while the perceptron has a binary output, the sigmoid neuron has a continuous output, $\sigma(w \cdot x+b)$, defined by a special activation function known as the sigmoid function $\sigma$ (also known as the logistic function): $$\begin{aligned} \sigma(z) \equiv \frac{1}{1+e^{-z}}. \end{aligned} $$

Note that if $z = w \cdot x+b$ is a large positive number, then $e^{-z} \approx 0$ and thus $\sigma(z) \approx 1$.

Here is the sigmoid function visualized: Which is a smoothed out step function (which is how a perceptron operates): Sigmoid neurons are useful because small changes in weights and biases will only produce small changes in output from a given neuron (rather than switching between binary output values as is the case with the step function, which is typically too drastic).

An activation function can generally be described as some function: $$\text{output} = f(w \cdot x + b)$$ where $b$ is the bias (see below).

common activation function is the sigmoid function, which takes input and squashes it to be in $[0,1]$, it has the form: $$\sigma(x) = \frac{1}{1 + e^{-x}} $$

In backpropagation, this local gradient is multiplied with the gradient of the node's output against the total error - if this local gradient is near 0, it 'kills' the gradient preventing any signal from going further backwards in the network.

For this reason, when using the sigmoid activation function you must be careful of how you initialize the weights - if they are too large, you will 'saturate' the network and kill the gradient in this way.

Furthermore, sigmoid outputs are not zero-centered: This is undesirable since neurons in later layers of processing in a Neural Network (more on this soon) would be receiving data that is not zero-centered.

x>0 elementwise in f=wTx+b)), then the gradient on the weights w will during backpropagation become either all be positive, or all negative (depending on the gradient of the whole expression f).

However, notice that once these gradients are added up across a batch of data the final update for the weights can have variable signs, somewhat mitigating this issue.

Though there is not the same saturation issue as with the sigmoid/tanh functions, ReLUs can still 'die' in a different sense - their weights can be updated such that the neuron never activates again, which causes the gradient through that neuron to be zero from then on, thus resulting in the same 'killing' of the gradient as with sigmoid/tanh.

It has the benefits of ReLU but does not suffer the dying ReLU problem, but it's main drawback is that it doubles the number of parameters for each neuron (since there are two weight vectors and two bias units).

feed-forward neural network is a simple neural network with an input layer, and output layer, and one or more intermediate layers of neurons.

Backpropagation is just the calculation of partial derivatives (the gradient) by moving backwards through the network (from output to input), accumulating them by applying the chain rule.

We compute the total error for the network on the training data and then want to know how a change in an individual weight within the network affects this total error (i.e.

This is because a neural network can be thought of as a composition of functions, in which case to compute the derivative of the overall function, you just apply the chain rule for computing derivatives.

To elaborate on thinking of neural network as a 'composition of functions': each layer represents a function taking in the inputs of the previous layer's output, e.g.

The general procedure for training a neural network with backpropagation is: Consider the following simple neural net: Here's a single neuron expanded: Remember that a neuron processes its inputs by computing the dot product of its weights and inputs (i.e.

To update the weight $w_{2,1}$, for example, we are looking for the partial derivative $\frac{\partial E_{\text{total}}}{\partial w_{2,1}}$, which by the chain rule is equal to: $$\frac{\partial E_{\text{total}}}{\partial w_{2,1}} = \frac{\partial E_\text{total}}{\partial o_{2,1}} \times \frac{\partial o_{2,1}}{\partial i_{2,1}} \times \frac{\partial i_{2,1}}{\partial w_{2,1}} $$

Then we take this value and subtract it, multiplied by a learning rate $\eta$ (sometimes notated $\alpha$), from the current weight $w_{2,1}$ to get $w_{2,1}$'s updated weight, though updates are only actually applied after these update values have been computed for all of the network's weights.

If we wanted to calculate the update value for $w_{1,1}$, we do something similar: $$\frac{\partial E_{\text{total}}}{\partial w_{1,1}} = \frac{\partial E_\text{total}}{\partial o_{1,1}} \times \frac{\partial o_{1,1}}{\partial i_{1,1}} \times \frac{\partial i_{1,1}}{\partial w_{1,1}} $$

The derivative of $f$ with respect to $x$ is called a partial derivative, since it is only with respect to one of the variables, is notated $\frac{\partial f}{\partial x}$ and is just a function that tells you how much $f$ changes due to $x$ at any point.

This means that, at any given point, increasing $x$ by a infinitesimal amount will change the output of the function by $y$ times the amount that $x$ changed.

We can compute the gradient of $f$ in this form (note that it is the same as $f(x,y) = xy$ from before): $\frac{\partial f}{\partial q} = z, \frac{\partial f}{\partial z} = q$.

We can get the missing partial derivatives wrt to $x$ and $y$ by using the chain rule, which just requires that we multiply the appropriate partials: $$\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q} \frac{\partial q}{\partial x}, \frac{\partial f}{\partial y} = \frac{\partial f}{\partial q} \frac{\partial q}{\partial y} $$

In code (adapted from the CS231 notes cited below) So essentially you can decompose any function into smaller, simpler functions, compute the gradients for those, then use the chain rule to aggregate them into the original function's gradient.

For instance, we could manually calculate the partial derivatives of the cost function with respect to each individual weight, which, if we had $w$ weights, would require computing the cost function $w$ times, which requires $w$ forward passes.

Note that for $\frac{\partial J}{\partial \text{OUT}^N}$, we are computing the derivative of the cost function $J$ with respect to each training example's corresponding output, and then we average them (in some situations, such as when the total number of training examples is not fixed, their sum is used).

It is costly to do this across all training examples if you have a large training set, in which case, the minibatch stochastic variant of gradient descent may be more appropriate.

(TODO this may need clarification/revision) For the hidden layer prior to the output, $L^{N-1}$, we would need to connect that layer's net input, $\text{NET}^{N-1}$, to the cost function $J$: $$\delta^{N-1} = \frac{\partial J}{\partial \text{NET}^{N-1}} = \frac{\partial J}{\partial \text{OUT}^N} \frac{\partial \text{OUT}^N}{\partial \text{NET}^N} \frac{\partial \text{NET}^N}{\partial \text{OUT}^{N-1}} \frac{\partial \text{OUT}^{N-1}}{\partial \text{NET}^{N-1}} $$

This is because with many dimensions, it is unlikely that a point is a minimum is all dimensions (if we consider that a point is a minimum in one dimensions with probability $p$, then it has probability $p^n$ to be a minimum in all $n$ dimensions);

Statistical (or 'stochastic') training methods, contrasted with deterministic training methods (such as backpropagation as described abov), involve some randomness to avoid local minima.

Simulated annealing applied as a training method to a neural network is called Boltzmann training (neural networks trained in this way are called Boltzmann machines): $$P(c) = exp(\frac{-c}{kT}) $$

The random weight change can be selected in a few ways, but one is just choosing it from a Gaussian distribution, $P(w) = exp(\frac{-w^2}{T^2})$, where $P(w)$ is the probability of a weight change of size $w$ and $T$ is the artificial temperature.

Here we can just select a random number from a uniform distribution in $(-\frac{\pi}{2}, \frac{\pi}{2})$, then substitute this for $P(x)$ and solve for $x$ in the above, using the current temperature.

Cauchy training still may be slow so we can also use a method based on artificial specific heat (in annealing, there are discrete energy levels where phase changes occur, at which abrupt changes in the 'specific heat' occur).

The idea is that there are parts where the objective function is sensitive to small changes in temperature, where the average value of the objective function makes an abrupt change, so the temperature must be changed slowly here so as not to get stuck in a local minima.

Cauchy training may be combined with backprop to get the best of both worlds - it simply involves computing both the backprop and Cauchy weight updates and applying their weighted sum as the update.

Then, the objective function's change is computed, and like with Cauchy training, if there is an improvement, the weight change is kept, otherwise, it is kept with a probability determined by the Boltzmann distribution.

The weighted sum of the individual weight updates is controlled by a coefficient $\eta$, such that the sum is $\eta [\alpha \Delta W_{t-1} + (1-\alpha)\delta \text{OUT}] + (1 - \eta) x_c$, so that if $\eta=0$, the training is purely Cauchy, and if $\eta=1$, it becomes purely backprop.

There is still the issue of the possibility of retaining a massive weight change due to the Cauchy distribution's infinite variance, which creates the possibility of network paralysis.

The recommended approach here is to detect saturated neurons by looking at their $\text{OUT}$ values - if it is approaching the saturation point (positive or negative), apply some squashing function to its weights (note that this squashing function is not restricted to the range $[-1, 1]$ and in fact may work better with a larger range).

The amount weights and biases are adjusted is determined by a learning rate $\eta$ (sometimes called a delta rule or delta function).

This learning constant can help 'jiggle' the network out of local optima, but you want to take care that it isn't set so high that the network will also jiggle out of the global optima.

As a simple example: In some cases, a simulated annealing approach is used, where the learning constant may be tempered (made smaller, less drastic) as the network evolves, to avoid jittering the network out of the global optima.

The fan-in of a neuron (number of inputs) also has an effect, determining the size of 'overshoot' effects (the more inputs there are, the more weights are changed simultaneously, all to adjust the same error, which is what can cause the overshooting).

So what you can do is manually set a global learning rate, then for each weight multiply this global learning rate by a local gain, determined empirically per weight.

We incorporate a matrix of velocity values $V$, with the same shape as the matrix of weights and biases for the network (for simplicity, we will roll the weights and matrices together into a matrix $W$).

Another hyperparameter, $\mu \in [0, 1]$, is also introduced - this controls the 'friction' of the system ($\mu = 1$ is no friction).

This allows it to identify frequently updated parameters and infrequently updated parameters - as a result, learning rates can be adapted per parameter over time (e.g.

More formally, for each parameter $\theta_i$ we have $g_{t,i} = \nabla J(\theta_i)$, so we re-write the update for a parameter $\theta_i$ at time $t$ to be: $$\theta_{t+1, i} = \theta_{t, i} - \eta \cdot g_{t, i} $$

In particular, we divide $\eta$ by the square root of this matrix (empirically, the square root improves performance): $$\theta_{t+1, i} = \theta_{t, i} - \frac{\eta}{\sqrt{G_{t, ii} + \epsilon}} \cdot g_{t, i} $$

Note that as training progresses, the denominator term $G_t$ will grow very large (the sum of squared gradients accumulate), such that the learning rate eventually becomes very small and learning virtually ceases.

We keep a running average $E[g^2]_t$ ($\gamma$ is like a momentum term, usually around 0.9): $$E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma) g^2_t $$

And then update the Adagrad equation to replace the matrix $G_t$ with this running (exponentially decaying) average: $$\Delta \theta_t = - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_{t} $$

This is resolved by defining another exponentially decaying average, this one of squared parameter updates: $$E[\Delta \theta^2]_t = \gamma E[\Delta \theta^2]_{t-1} + (1 - \gamma) \Delta \theta^2_t $$

Like Adadelta and RMSprop, Adam keeps track of an exponentially decaying average of past squared gradients (here, it is $v_t$), but it also keeps track of an exponentially decaying average of past (non-squared) gradients $m_t$, similar to momentum: $$\begin{aligned}m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t \\ v_t

As such, $m_t, v_t$ tend to be a biased towards zero, so bias-corrected versions of each are computed: $$\begin{aligned}\hat{m}_t = \frac{m_t}{1 - \beta^t_1} \\ \hat{v}_t

Batch normalization is a normalization method for mini-batch training, which can improve training time (it allows for higher learning rates), act as a regularizer, and reduce the importance of proper parameter initialization.

For a mini-batch $x$ (of size $m$), the sample mean and variance for each feature $k$ is computed: $$\begin{aligned}\bar x_k = \frac{1}{m} \sum_{i=1}^m x_{i,k} \\ \sigma_k^2

Standardizing intermediate representations in this way can weaken the representational power of the layer, so two additional learnable parameters $\gamma$ and $\beta$ are introduced to scale and/or shift the data.

Given a layer with some activation function $\phi$, which would typically be defined as $\phi(Wx+b)$, we can redefine it with batch normalization: $$\phi(\text{BN}(Wx))$$ The bias is dropped because its effect is cancelled by the standardization.

Thus, when the output layer's function has this property, and outputs values near 0 and 1 (in the case of sigmoid), this reduces the entire partial derivative, leading to small updates, which has the effect of slow learning.

When slow learning of this sort occurs (that is, the kind caused by the activation functions outputting at their minimum or maximum), it is called saturation, and it is a common problem with neural networks.

For binary classification, a common cost function is the cross-entropy cost, also known as 'log loss' or 'logistic loss': $$J(\theta) = -\frac{1}{m} \sum_i^m \sum_j^k [y^{(i)} \ln h_{\theta}(X^{(i)}) + (1 - y^{(i)}) \ln(1 - h_{\theta}(X^{(i)}))] $$

The partial derivatives of the cross-entropy cost with respect to $W^N$ and $b^N$ are (for brevity, we'll notate $f^N(\text{NET}^N)$ as simply $f(n)$): $$\begin{aligned}\frac{\partial J}{\partial W^N} = \sum_j^k \frac{(y-f(n))}{f(n)(1-f(n))} f'(n) X \\ \frac{\partial

This has the advantage that for some activation functions $f$, such as the sigmoid function, the activation function's derivative $f'$ cancels out, thus avoiding the training slowdown that can occur with the MSE.

This isn't a problem, for instance, with linear activation functions, in which case quadratic cost is appropriate (though neural nets with linear activation functions are limited in what they can learn).

That is, given an example that belongs to class $y$, we take the natural log of the value outputted by the output node corresponding to the class $y$ (typically this is the $y$th node, since you'd have an output node for each class).

Thus softmax output activations and the log-likelihood cost functions are a good pairing for problems requiring probability-like outputs (such as with classification problems).

Alternatively, we could set each neuron's initial weights to be a random vector from a standard multidimensional normal distribution (mean of 0, standard deviation of 1), scaled by some value, e.g.

This can be controlled for (calibrated) by scaling its weight vector by the square root of its 'fan-in' (its number of inputs), so you should divide the standard multidimensional distribution sampled random vector by $\sqrt{n}$, where $n$ is the number of the neuron's inputs.

(Karpathy's CS231n notes provides more detail on why this is.) An alternative to this fan-in scaling for the uncalibrated variances problem is sparse initialization, which is to set all weights to 0, and then break symmetry by randomly connecting every neuron to some fixed number (e.g.

Biases are commonly initialized to be zero, though if using ReLUs, then you can set them to a small value like 0.01 so all the ReLUs fire at the start and are included in the gradient backpropagation update.

They suspect that the added noise gives the model more chances to escape and find new local minima, which are more frequent for deeper models.'

For instance, if using an image classification net trained for one classification task, you can use that same network for another, truncating the output layer, that is, take the vectors from the second-to-last layer and use those as feature vectors for other tasks.

However with softmax (more than just two classes) you have one output node per class label, with each node outputting the probability the input is of the class associated with the node.

rule of thumb for deciding the size of the hidden layer is that the size should be between the size between the input size and output size (for example, the mean of their sizes).

There is a great deal of variance across these local minima, so the outcome is quite sensitive to the random initialization - some times you land in a good local minima, sometimes not.

Another simple, but possibly expensive way of reducing overfitting is by increasing the amount of training data - it's unlikely to overfit many, many examples.

we add $\sum \lambda w^2$ to the objective function (this additional term is called the regularization term, and $\lambda$ is an additional hyperparameter, the regularization parameter).

use $\frac{1}{2} \sum \lambda w^2$, so the gradient of this term wrt to $w$ is just $\lambda w$ instead of $2 \lambda w$$.

L2 regularization is sometimes called weight decay since the added regularization term penalizes large weights, favoring smaller weights.

This affects the partial derivative of the cost function with respect to weights in a simple way (again, biases are not included, so it does not change that partial derivative): $$\frac{\partial J}{\partial w} = \frac{\partial J_0}{\partial w} + \frac{\lambda}{m} w $$

The main difference between L1 and L2 regularization is that L1 regularization shrinks weights by a constant amount, whereas L2 regularization shrinks weights by an amount proportional to the weights themselves.

This ends up leading to the following update rule: $$w \to w' = w - \frac{\eta \lambda}{m}\text{sign}(w) - \frac{\eta}{m} \sum_i^m \frac{\partial J_i}{\partial w} $$

Compare this with the update rule for L2 regularization: $$w \to w' = w - \frac{\eta \lambda}{m}w - \frac{\eta}{m} \sum_i^m \frac{\partial J_i}{\partial w} $$

With dropout, the neuron's output has a chance $p$ of being set to 0, so its expected output becomes $px$ (more verbosely, it has $1-p$ chance of becoming 0, so its output is $px + (1-p)0$, which simplifies to $px$).

For comparison, here is an implementation of regular dropout and an implementation of inverted dropout (source from: It is most common to use a single, global L2 regularization strength that is cross-validated.

related technique is training on adversarial examples (detailed elsewhere), in which training examples are modified to be deliberately hard for the network to classify, so that it can be trained on more ambiguous/difficult examples.

This method for selecting hyperparameters is called Bayesian (hyperparameter) optimization, and is a better approach than by picking hyperparameters by hand (less prone to human error).

Some places recommend using a learning rate in the form: $$\eta_t = \eta_0 (1 + \eta_0 \lambda t)^{-1} $$

The network computes $(XW_1)W_2$, which is equivalent to $X(W_1W_2)$, so the network is equivalent to a single layer network with weight vectors $W_1W_2$ Training deep neural networks (that is, neural networks with more than one hidden layer) is not as straightforward as it is with a single hidden layer - a simple stochastic gradient descent + backpropagation approach is not as effective or quick.

This has two ways of showing up: These unstable gradients occur because gradients in earlier layers are the products of the later layers (refer to backpropagation for details, but remember that the $\delta^i$ for layer $i$ is computed from $\delta^{i+1}$).

Certain neural networks, such as RNNs, can have unstable gradients, in which gradients may grow exponentially (an exploding gradient) or shrink exponentially until it reaches zero (a vanishing gradient).

Unstable gradients can occur as a result of drastic changes in the cost surface, as illustrated in the accompanying figure (from Pascanu et al via

Intuitively this means that frequently occurring features get a smaller learning rate (because the sum of their gradients is larger), and rare features get a larger learning rate. In a regular neural network, the relationship between a pixel and one that is next to it is the same as its relationship with a pixel far away - the structural information of the image is totally lost.

(Note: the following images are from TODO replace the graphics) We do not fully connect this input layer to the hidden layer (which we'll call a convolutional layer).

Note that, as depicted above, narrow convolution will yield a smaller feature map of size: input shape - filter shape + 1.

For a border mode of 'half'/'same', we pad the input with a symmetric border of $r//2$ rows and $c//2$ columns (where $//$ indicates integer division).

This is equivalent to applying the filter anywhere it overlaps with a pixel and yields a feature map of size: input shape + filter shape - 1.

For a border mode of 'half'/'same', we the padded image would look like (padding is indicated with o): For a border mode of 'full', the padded image would instead be: Another change here is that the hidden layer has one set of weights and a bias that is shared across the entire layer (these weights and biases are accordingly referred to as shared weights and the shared bias, and together, they define a filter or a kernel).

As a result, if we have receptive fields of $m \times m$ size, the output of the $i,j$th neuron in the hidden layer looks like: $$f(b + \sum^{m-1}_{k=0} \sum^{m-1}_{l=0} W_{k,l} \text{OUT}^0_{i+k,j+l}) $$

If an edge shows up in the lower-left part of the image, the corresponding input neuron for that receptive field will also fire, due to the fact that they all share weights and a bias.

Each position in the filter lines up with a pixel as the filter slides across the image (as depicted above) Let's say that the weights learned by the filter end up being the following: $$\begin{bmatrix}-1, -1, -1 \\ -1,

We do so by convolving them as the sum of the element-wise product: $$(-1 \times 0) + (-1 \times 0) + (-1 \times 0) + (-1

Another benefit to sharing weights and biases across the layer is that it introduces some resilience to overfitting - the sharing of weights means that the layer cannot favor peculiarities in particular parts of the training data;

Pooling layers produced a condensed version of the feature map they are given (for this reason, this process is also known as subsampling, so pooling layers are sometimes called subsampling layers).

Each of this new layer's input neurons (that is, the neurons in its first set of convolutional layers) takes as its input all of the outputs (within its local receptive field) from the preceding convolutional-pooling layer.

For example, if the preceding convolutional-pooling layer has 20 layers in it, and we have receptive fields of size $5 \times 5$, then each of the input neurons for the new convolutional-pooling layer would have $20 \times 5 \times 5$ inputs.

Another way of putting this is that the core difference of an RNN from a regular feedforward network is that the output of a neuron is a function of its inputs and of its past state, e.g.

In the most basic RNN, the hidden layer have two inputs: the input from the previous layer, and the layer's own output from the previous time step (so it loops back onto itself): This simple network can be visualized over time as well: Say we have a hidden layer $L_1$ of size 3 and another hidden layer $L_2$ of size 2.

they train significantly faster on GPUs.) RNNs are trained using a variant of backpropagation called backpropagation through time, which just involves unfolding the RNN a certain number of time steps, which results in what is essentially a regular feedforward network, and then applying backpropagation: $$\frac{\partial E}{\partial \text{OUT}_{t-1}} = \frac{\partial E}{\partial \text{OUT}_t} \frac{\partial \text{OUT}_t}{\partial \text{OUT}_{t-1}} = \frac{\partial E}{\partial \text{OUT}_t}W_r $$

The gradients of the cost function wrt to the weights is computed by summing the weight gradients in each layer: $$\begin{aligned}\frac{\partial E}{\partial W_x} = \sum^n_{k=0} \frac{\partial E}{\partial \text{OUT}_t} X_t \\ \frac{\partial

This summing of the weight gradients at each time step is the main difference from regular feedforward networks, aside from that BPTT is basically just backpropagation on an RNN unrolled up to some time step $t$.

The vanishing gradient problem in RNNs means long-term dependencies won't be learned - the effect of earlier steps 'vanish' over time steps (this is the same problem of vanishing gradients in deep feedforward networks, given that an RNN is basically a deep neural net).

Exploding gradients are more easily dealt with - it's obvious when they occur (you'll see NaNs, for instance), and you can clip them at some maximum value, which can be quite effective (refer to this paper) Some strategies for dealing with vanishing gradients: Generally, however, Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) architectures are used instead of vanilla RNNs, which were designed for mitigating vanishing gradients (for the purpose of better learning long-range dependencies).

These LSTM units have a three gates (in contrast to the single activation function vanilla RNNs have): TODO include Chris Olah's LSTM diagrams: These gates are sigmoid functions combined with a pointwise multiplication operation.

Like the forget gate, this is a vector of values in $[0, 1]$ which determine how much information gets through - 0 means none, 1 means all of it.

Rather, we have yet another gate, the output gate (sometimes called a read gate) that outputs another vector with values in $[0, 1]$, $o_t$, which determines how much of the cell state is outputted.

This LSTM variant just passes on the previous cell state, $C_{t-1}$, to the forget and input gates, and the new cell state, $C_t$, to the output gate, that is, all that is changed is that: $$\begin{aligned}f_t = \text{sigmoid}(W_f \dot [C_{t-1}, \text{OUT}_{t-1}, X_t] + b_f) \\ i_t

The forward state $s_i^f$ is based on $x_1, x_2, \dots, x_i$, whereas the backward state $s_i^b$ is based on $x_n, x_{n-1}, \dots, x_i$.

In people, 'attention' is a mechanism by which we focus on one particular element of our environment, such that our perception of the focused element is in high-fidelity/resolution, whereas surrounding elements are at a lower resolution.

There are some architectures such as the bidirectional variant that help with this, but attention mechanisms can help so that the decoder has access to the full spread of inputs and can 'focus' more on translating individual parts when appropriate.

each unit activates only rarely), or feed the network corrupted versions of its inputs and make it reconstruct the clean ones (this is known as a denoising autoencoder).

Essentially what happens is the hidden layer learns a compressed representation of the input (given that it is a smaller size than the input/output layers, this is called an undercomplete hidden layer, the learned representation is called an undercomplete representation), since it needs to be reconstructed by the decoder back to its original form.

On the other hand, the hidden layer may be larger than the input/output layers, in which case it is called an overcomplete hidden layer and the learned representation of the input is an overcomplete representation.

Rather counterintuitively, a larger hidden layer helps, where some hidden units are randomly turned off during a training iteration - that way, the output isn't a mere copy of the input, and learning is easier since there is more 'room' to represent the input.

So instead of inputting $x$, we input $\tilde x$, which is just $x$ with noise added (sometimes called a corrupted input), and the network tries to reconstruct the noiseless $x$ as its output.

For binary inputs, we can use cross-entropy (more precisely, the sum of Bernoulli cross-entropies): $$l(f(x)) = - \sum_k (x_k \log(\hat x_k)) + (1 - x_k)(\log(1-\hat x_k)) $$

Intuitively, the term we're adding to the loss (the squared Frobenius norm of the Jacobian) increases the loss if we have non-zero partial derivatives with the encoder $h(x^{(t)})$ with respect to the input;

We balance this out with the original loss function which, as usual, encourages the encoder to keep good information (information that is useful for reconstructing the original input).

By combining these two conflicting priorities, the result is that the encoder keeps only the good information (the latter term encourages it to throw all information away, the former term encourages it to keep only the good stuff).

We can train an RBM (or another unsupervised learning method) on the unlabeled data to learn useful features to use with the supervised training set - this approach is called semi-supervised learning.

Instead we might want to group words in some way: This kind of model is a 'recursive neural network' (sometimes 'tree-structured neural network') because it has modules feeding into modules of the same type.

Like a Turing machine, it can simulate any arbitrary procedure - in fact, given an input sequence and a target output sequence, it can learn a procedure to map between the two on its own, trainable via gradient descent (as the entire thing is differentiable).

LSTM, or a standard feedforward network), read/write heads (the write 'head' actually consists of two heads, an erase and an add head, but referred to as a single head), and a memory matrix $M_t \in \mathcal R^{N \times M}$.

Unlike a normal Turing machine, the read and write operations are 'blurry' in that they interact in some way with all elements in memory (normal Turing machines address one element at a time).

There is an attentional 'focus' mechanism that constrains the memory interaction to a smaller portion - each head outputs a weighting vector which determines how much it interacts (i.e.

At time $t$, the write head emits a weighting vector $w_t$ (note that the write and read heads each emit their own $w_t$ that is used in the context of that head) and an erase vector $e_t$ that have $M$ elements which line in the range (0,1)$.

The write head also produces an $M$ length add vector $a_t$, which is added to the memory after the erase step: $$M_t(i) = \tilde M_t(i) + w_t(i) a_t $$

(TODO not totally clear on this part) Next, the head also emits a shift weighting $s_t$ which specifies a normalized distribution over the allowed integer shifts.

For example, if shifts between -1 and 1 are allowed, $s_t$ has three elements describing how much the shifts of -1, 0, and 1 are performed.

For example, with permitted shifts of -1, 0, 1 and $s_t = [0.1, 0.8, 0.1]$, the single point gets slightly blurred across the three points To counter this, each head also emits a scalar $\gamma_t \geq 1$ that is used to (re)sharpen the final weighting: $$w_t(i) = \frac{\tilde w_t(i)^{\gamma_t}}{\sum_j \tilde w_t(j)^{\gamma_t}} $$

images), CPPNs may be used to evolved indirectly encoded neural networks - they 'exploit geometric domain properties to compactly describe the connectivity pattern of a large-scale ANN' (Riesi

Instead, staging (also called incremental evolution) may be preferred, where the network is evolved on simpler problems that gradually increase towards the original complex task.

similar method called cooperative coevolution, where fitness is instead based on its performance in collaboration with other players, may make more sense in other contexts.

as a linear combination - but another way is cascading elitism, where 'each generation contains separate selection events for each fitness function, ensuring equal selection pressure' (Riesi

These algorithms try to satisfy all their given objectives (fitness functions) and can also manage conflicts between objectives by identifying (mapping) them and deciding on tradeoffs.

Other ways humans can intervene include shaping, where the human can shape the environment to influence training, and demonstration, in which the human takes direct control and the network learns from that example.

Denoising autoencoders learn to reconstruct empirical data $X$ from noised inputs $\tilde X$ and can be sampled from by using a Markov chain, alternating between sampling reconstructed values $P(X |

The conditional GAN objective function becomes: $$\min_G \max_D (E_{x,y \sim p_{\text{data}}(x,y)} [\log D(x,y)] + E_{y \sim p_y, z \sim p_z(z)} [\log(1-D(G(z,y), y))]) $$

The cost equation for the discriminator $D$ is a simple logistic cost expression (to give a positive label to input truly from the data distribution and a negative label to counterfeit examples): $$J_D = -\frac{1}{2n} (\sum_{i=1}^n \log D(x_i, y_i) + \sum_{i=1}^n \log (1-D(G(z_i, y_i), y_i))) $$

Using neural nets to recognize handwritten digits

Simple intuitions about how we recognize shapes - 'a 9 has a loop at the top, and a vertical stroke in the bottom right' - turn out to be not so simple to express algorithmically.

As a prototype it hits a sweet spot: it's challenging - it's no small feat to recognize handwritten digits - but it's not so difficult as to require an extremely complicated solution, or tremendous computational power.

But along the way we'll develop many key ideas about neural networks, including two important types of artificial neuron (the perceptron and the sigmoid neuron), and the standard learning algorithm for neural networks, known as stochastic gradient descent.

Today, it's more common to use other models of artificial neurons - in this book, and in much modern work on neural networks, the main neuron model used is one called the sigmoid neuron.

A perceptron takes several binary inputs, $x_1, x_2, \ldots$, and produces a single binary output: In the example shown the perceptron has three inputs, $x_1, x_2, x_3$.

The neuron's output, $0$ or $1$, is determined by whether the weighted sum $\sum_j w_j x_j$ is less than or greater than some threshold value.

To put it in more precise algebraic terms: \begin{eqnarray} \mbox{output} & = & \left\{ \begin{array}{ll} 0 & \mbox{if } \sum_j w_j x_j \leq \mbox{ threshold} \\ 1 & \mbox{if } \sum_j w_j x_j > \mbox{ threshold} \end{array} \right.

And it should seem plausible that a complex network of perceptrons could make quite subtle decisions: In this network, the first column of perceptrons - what we'll call the first layer of perceptrons - is making three very simple decisions, by weighing the input evidence.

The first change is to write $\sum_j w_j x_j$ as a dot product, $w \cdot x \equiv \sum_j w_j x_j$, where $w$ and $x$ are vectors whose components are the weights and inputs, respectively.

Using the bias instead of the threshold, the perceptron rule can be rewritten: \begin{eqnarray} \mbox{output} = \left\{ \begin{array}{ll} 0 & \mbox{if } w\cdot x + b \leq 0 \\ 1 & \mbox{if } w\cdot x + b > 0 \end{array} \right.

This requires computing the bitwise sum, $x_1 \oplus x_2$, as well as a carry bit which is set to $1$ when both $x_1$ and $x_2$ are $1$, i.e., the carry bit is just the bitwise product $x_1 x_2$: To get an equivalent network of perceptrons we replace all the NAND gates by perceptrons with two inputs, each with weight $-2$, and an overall bias of $3$.

Note that I've moved the perceptron corresponding to the bottom right NAND gate a little, just to make it easier to draw the arrows on the diagram: One notable aspect of this network of perceptrons is that the output from the leftmost perceptron is used twice as input to the bottommost perceptron.

(If you don't find this obvious, you should stop and prove to yourself that this is equivalent.) With that change, the network looks as follows, with all unmarked weights equal to -2, all biases equal to 3, and a single weight of -4, as marked: Up to now I've been drawing inputs like $x_1$ and $x_2$ as variables floating to the left of the network of perceptrons.

In fact, it's conventional to draw an extra layer of perceptrons - the input layer - to encode the inputs: This notation for input perceptrons, in which we have an output, but no inputs, is a shorthand.

Then the weighted sum $\sum_j w_j x_j$ would always be zero, and so the perceptron would output $1$ if $b > 0$, and $0$ if $b \leq 0$.

Instead of explicitly laying out a circuit of NAND and other gates, our neural networks can simply learn to solve problems, sometimes problems where it would be extremely difficult to directly design a conventional circuit.

If it were true that a small change in a weight (or bias) causes only a small change in output, then we could use this fact to modify the weights and biases to get our network to behave more in the manner we want.

In fact, a small change in the weights or bias of any single perceptron in the network can sometimes cause the output of that perceptron to completely flip, say from $0$ to $1$.

We'll depict sigmoid neurons in the same way we depicted perceptrons: Just like a perceptron, the sigmoid neuron has inputs, $x_1, x_2, \ldots$.

Instead, it's $\sigma(w \cdot x+b)$, where $\sigma$ is called the sigmoid function* *Incidentally, $\sigma$ is sometimes called the logistic function, and this new class of neurons called logistic neurons.

\tag{3}\end{eqnarray} To put it all a little more explicitly, the output of a sigmoid neuron with inputs $x_1,x_2,\ldots$, weights $w_1,w_2,\ldots$, and bias $b$ is \begin{eqnarray} \frac{1}{1+\exp(-\sum_j w_j x_j-b)}.

In fact, there are many similarities between perceptrons and sigmoid neurons, and the algebraic form of the sigmoid function turns out to be more of a technical detail than a true barrier to understanding.

var data = d3.range(sample).map(function(d){ return { x: x1(d), y: s(x1(d))};

var y = d3.scale.linear() .domain([0, 1]) .range([height, 0]);

}) var graph ='#sigmoid_graph') .append('svg') .attr('width', width + m[1] + m[3]) .attr('height', height + m[0] + m[2]) .append('g') .attr('transform', 'translate(' + m[3] + ',' + m[0] + ')');

var xAxis = d3.svg.axis() .scale(x) .tickValues(d3.range(-4, 5, 1)) .orient('bottom') graph.append('g') .attr('class', 'x axis') .attr('transform', 'translate(0, ' + height + ')') .call(xAxis);

var yAxis = d3.svg.axis() .scale(y) .tickValues(d3.range(0, 1.01, 0.2)) .orient('left') .ticks(5) graph.append('g') .attr('class', 'y axis') .call(yAxis);

graph.append('text') .attr('class', 'x label') .attr('text-anchor', 'end') .attr('x', width/2) .attr('y', height+35) .text('z');

graph.append('text') .attr('x', (width / 2)) .attr('y', -10) .attr('text-anchor', 'middle') .style('font-size', '16px') .text('sigmoid function');

var data = d3.range(sample).map(function(d){ return { x: x1(d), y: s(x1(d))};

var y = d3.scale.linear() .domain([0,1]) .range([height, 0]);

}) var graph ='#step_graph') .append('svg') .attr('width', width + m[1] + m[3]) .attr('height', height + m[0] + m[2]) .append('g') .attr('transform', 'translate(' + m[3] + ',' + m[0] + ')');

var xAxis = d3.svg.axis() .scale(x) .tickValues(d3.range(-4, 5, 1)) .orient('bottom') graph.append('g') .attr('class', 'x axis') .attr('transform', 'translate(0, ' + height + ')') .call(xAxis);

var yAxis = d3.svg.axis() .scale(y) .tickValues(d3.range(0, 1.01, 0.2)) .orient('left') .ticks(5) graph.append('g') .attr('class', 'y axis') .call(yAxis);

graph.append('text') .attr('class', 'x label') .attr('text-anchor', 'end') .attr('x', width/2) .attr('y', height+35) .text('z');

graph.append('text') .attr('x', (width / 2)) .attr('y', -10) .attr('text-anchor', 'middle') .style('font-size', '16px') .text('step function');

If $\sigma$ had in fact been a step function, then the sigmoid neuron would be a perceptron, since the output would be $1$ or $0$ depending on whether $w\cdot x+b$ was positive or negative* *Actually, when $w \cdot x +b = 0$ the perceptron outputs $0$, while the step function outputs $1$.

The smoothness of $\sigma$ means that small changes $\Delta w_j$ in the weights and $\Delta b$ in the bias will produce a small change $\Delta \mbox{output}$ in the output from the neuron.

In fact, calculus tells us that $\Delta \mbox{output}$ is well approximated by \begin{eqnarray} \Delta \mbox{output} \approx \sum_j \frac{\partial \, \mbox{output}}{\partial w_j} \Delta w_j + \frac{\partial \, \mbox{output}}{\partial b} \Delta b, \tag{5}\end{eqnarray} where the sum is over all the weights, $w_j$, and $\partial \, \mbox{output} / \partial w_j$ and $\partial \, \mbox{output} /\partial b$ denote partial derivatives of the $\mbox{output}$ with respect to $w_j$ and $b$, respectively.

While the expression above looks complicated, with all the partial derivatives, it's actually saying something very simple (and which is very good news): $\Delta \mbox{output}$ is a linear function of the changes $\Delta w_j$ and $\Delta b$ in the weights and bias.

If it's the shape of $\sigma$ which really matters, and not its exact form, then why use the particular form used for $\sigma$ in Equation (3)\begin{eqnarray} \sigma(z) \equiv \frac{1}{1+e^{-z}} \nonumber\end{eqnarray}$('#margin_778862672352_reveal').click(function() {$('#margin_778862672352').toggle('slow', function() {});});?

In fact, later in the book we will occasionally consider neurons where the output is $f(w \cdot x + b)$ for some other activation function $f(\cdot)$.

The main thing that changes when we use a different activation function is that the particular values for the partial derivatives in Equation (5)\begin{eqnarray} \Delta \mbox{output} \approx \sum_j \frac{\partial \, \mbox{output}}{\partial w_j} \Delta w_j + \frac{\partial \, \mbox{output}}{\partial b} \Delta b \nonumber\end{eqnarray}$('#margin_726336021933_reveal').click(function() {$('#margin_726336021933').toggle('slow', function() {});});

It turns out that when we compute those partial derivatives later, using $\sigma$ will simplify the algebra, simply because exponentials have lovely properties when differentiated.

But in practice we can set up a convention to deal with this, for example, by deciding to interpret any output of at least $0.5$ as indicating a '9', and any output less than $0.5$ as indicating 'not a 9'.

Exercises Sigmoid neurons simulating perceptrons, part I $\mbox{}$ Suppose we take all the weights and biases in a network of perceptrons, and multiply them by a positive constant, $c > 0$.

Show that the behaviour of the network doesn't change.Sigmoid neurons simulating perceptrons, part II $\mbox{}$ Suppose we have the same setup as the last problem - a network of perceptrons.

Suppose the weights and biases are such that $w \cdot x + b \neq 0$ for the input $x$ to any particular perceptron in the network.

Now replace all the perceptrons in the network by sigmoid neurons, and multiply the weights and biases by a positive constant $c > 0$.

Suppose we have the network: As mentioned earlier, the leftmost layer in this network is called the input layer, and the neurons within the layer are called input neurons.

The term 'hidden' perhaps sounds a little mysterious - the first time I heard the term I thought it must have some deep philosophical or mathematical significance - but it really means nothing more than 'not an input or an output'.

For example, the following four-layer network has two hidden layers: Somewhat confusingly, and for historical reasons, such multiple layer networks are sometimes called multilayer perceptrons or MLPs, despite being made up of sigmoid neurons, not perceptrons.

If the image is a $64$ by $64$ greyscale image, then we'd have $4,096 = 64 \times 64$ input neurons, with the intensities scaled appropriately between $0$ and $1$.

The output layer will contain just a single neuron, with output values of less than $0.5$ indicating 'input image is not a 9', and values greater than $0.5$ indicating 'input image is a 9 '.

A trial segmentation gets a high score if the individual digit classifier is confident of its classification in all segments, and a low score if the classifier is having a lot of trouble in one or more segments.

So instead of worrying about segmentation we'll concentrate on developing a neural network which can solve the more interesting and difficult problem, namely, recognizing individual handwritten digits.

As discussed in the next section, our training data for the network will consist of many $28$ by $28$ pixel images of scanned handwritten digits, and so the input layer contains $784 = 28 \times 28$ neurons.

The input pixels are greyscale, with a value of $0.0$ representing white, a value of $1.0$ representing black, and in between values representing gradually darkening shades of grey.

A seemingly natural way of doing that is to use just $4$ output neurons, treating each neuron as taking on a binary value, depending on whether the neuron's output is closer to $0$ or to $1$.

The ultimate justification is empirical: we can try out both network designs, and it turns out that, for this particular problem, the network with $10$ output neurons learns to recognize digits better than the network with $4$ output neurons.

In a similar way, let's suppose for the sake of argument that the second, third, and fourth neurons in the hidden layer detect whether or not the following images are present:

Of course, that's not the only sort of evidence we can use to conclude that the image was a $0$ - we could legitimately get a $0$ in many other ways (say, through translations of the above images, or slight distortions).

Assume that the first $3$ layers of neurons are such that the correct output in the third layer (i.e., the old output layer) has activation at least $0.99$, and incorrect outputs have activation less than $0.01$.

We'll use the MNIST data set, which contains tens of thousands of scanned images of handwritten digits, together with their correct classifications.

To make this a good test of performance, the test data was taken from a different set of 250 people than the original training data (albeit still a group split between Census Bureau employees and high school students).

For example, if a particular training image, $x$, depicts a $6$, then $y(x) = (0, 0, 0, 0, 0, 0, 1, 0, 0, 0)^T$ is the desired output from the network.

We use the term cost function throughout this book, but you should note the other terminology, since it's often used in research papers and other discussions of neural networks.

\tag{6}\end{eqnarray} Here, $w$ denotes the collection of all weights in the network, $b$ all the biases, $n$ is the total number of training inputs, $a$ is the vector of outputs from the network when $x$ is input, and the sum is over all training inputs, $x$.

If we instead use a smooth cost function like the quadratic cost it turns out to be easy to figure out how to make small changes in the weights and biases so as to get an improvement in the cost.

Even given that we want to use a smooth cost function, you may still wonder why we choose the quadratic function used in Equation (6)\begin{eqnarray} C(w,b) \equiv \frac{1}{2n} \sum_x \|

This is a well-posed problem, but it's got a lot of distracting structure as currently posed - the interpretation of $w$ and $b$ as weights and biases, the $\sigma$ function lurking in the background, the choice of network architecture, MNIST, and so on.

And for neural networks we'll often want far more variables - the biggest neural networks have cost functions which depend on billions of weights and biases in an extremely complicated way.

We could do this simulation simply by computing derivatives (and perhaps some second derivatives) of $C$ - those derivatives would tell us everything we need to know about the local 'shape' of the valley, and therefore how our ball should roll.

So rather than get into all the messy details of physics, let's simply ask ourselves: if we were declared God for a day, and could make up our own laws of physics, dictating to the ball how it should roll, what law or laws of motion could we pick that would make it so the ball always rolled to the bottom of the valley?

To make this question more precise, let's think about what happens when we move the ball a small amount $\Delta v_1$ in the $v_1$ direction, and a small amount $\Delta v_2$ in the $v_2$ direction.

Calculus tells us that $C$ changes as follows: \begin{eqnarray} \Delta C \approx \frac{\partial C}{\partial v_1} \Delta v_1 + \frac{\partial C}{\partial v_2} \Delta v_2.

To figure out how to make such a choice it helps to define $\Delta v$ to be the vector of changes in $v$, $\Delta v \equiv (\Delta v_1, \Delta v_2)^T$, where $T$ is again the transpose operation, turning row vectors into column vectors.

We denote the gradient vector by $\nabla C$, i.e.: \begin{eqnarray} \nabla C \equiv \left( \frac{\partial C}{\partial v_1}, \frac{\partial C}{\partial v_2} \right)^T.

In fact, it's perfectly fine to think of $\nabla C$ as a single mathematical object - the vector defined above - which happens to be written using two symbols.

With these definitions, the expression (7)\begin{eqnarray} \Delta C \approx \frac{\partial C}{\partial v_1} \Delta v_1 + \frac{\partial C}{\partial v_2} \Delta v_2 \nonumber\end{eqnarray}$('#margin_832985330775_reveal').click(function() {$('#margin_832985330775').toggle('slow', function() {});});

\tag{9}\end{eqnarray} This equation helps explain why $\nabla C$ is called the gradient vector: $\nabla C$ relates changes in $v$ to changes in $C$, just as we'd expect something called a gradient to do.

In particular, suppose we choose \begin{eqnarray} \Delta v = -\eta \nabla C, \tag{10}\end{eqnarray} where $\eta$ is a small, positive parameter (known as the learning rate).

\nabla C \|^2 \geq 0$, this guarantees that $\Delta C \leq 0$, i.e., $C$ will always decrease, never increase, if we change $v$ according to the prescription in (10)\begin{eqnarray} \Delta v = -\eta \nabla C \nonumber\end{eqnarray}$('#margin_39079991636_reveal').click(function() {$('#margin_39079991636').toggle('slow', function() {});});.

to compute a value for $\Delta v$, then move the ball's position $v$ by that amount: \begin{eqnarray} v \rightarrow v' = v -\eta \nabla C.

To make gradient descent work correctly, we need to choose the learning rate $\eta$ to be small enough that Equation (9)\begin{eqnarray} \Delta C \approx \nabla C \cdot \Delta v \nonumber\end{eqnarray}$('#margin_663076476028_reveal').click(function() {$('#margin_663076476028').toggle('slow', function() {});});

In practical implementations, $\eta$ is often varied so that Equation (9)\begin{eqnarray} \Delta C \approx \nabla C \cdot \Delta v \nonumber\end{eqnarray}$('#margin_362932327599_reveal').click(function() {$('#margin_362932327599').toggle('slow', function() {});});

Then the change $\Delta C$ in $C$ produced by a small change $\Delta v = (\Delta v_1, \ldots, \Delta v_m)^T$ is \begin{eqnarray} \Delta C \approx \nabla C \cdot \Delta v, \tag{12}\end{eqnarray} where the gradient $\nabla C$ is the vector \begin{eqnarray} \nabla C \equiv \left(\frac{\partial C}{\partial v_1}, \ldots, \frac{\partial C}{\partial v_m}\right)^T.

\tag{13}\end{eqnarray} Just as for the two variable case, we can choose \begin{eqnarray} \Delta v = -\eta \nabla C, \tag{14}\end{eqnarray} and we're guaranteed that our (approximate) expression (12)\begin{eqnarray} \Delta C \approx \nabla C \cdot \Delta v \nonumber\end{eqnarray}$('#margin_398945612724_reveal').click(function() {$('#margin_398945612724').toggle('slow', function() {});});

This gives us a way of following the gradient to a minimum, even when $C$ is a function of many variables, by repeatedly applying the update rule \begin{eqnarray} v \rightarrow v' = v-\eta \nabla C.

The rule doesn't always work - several things can go wrong and prevent gradient descent from finding the global minimum of $C$, a point we'll return to explore in later chapters.

But, in practice gradient descent often works extremely well, and in neural networks we'll find that it's a powerful way of minimizing the cost function, and so helping the net learn.

It can be proved that the choice of $\Delta v$ which minimizes $\nabla C \cdot \Delta v$ is $\Delta v = - \eta \nabla C$, where $\eta = \epsilon / \|\nabla C\|$ is determined by the size constraint $\|\Delta v\|

Hint: If you're not already familiar with the Cauchy-Schwarz inequality, you may find it helpful to familiarize yourself with it.

If there are a million such $v_j$ variables then we'd need to compute something like a trillion (i.e., a million squared) second partial derivatives* *Actually, more like half a trillion, since $\partial^2 C/ \partial v_j \partial v_k = \partial^2 C/ \partial v_k \partial v_j$.

The idea is to use gradient descent to find the weights $w_k$ and biases $b_l$ which minimize the cost in Equation (6)\begin{eqnarray} C(w,b) \equiv \frac{1}{2n} \sum_x \|

In other words, our 'position' now has components $w_k$ and $b_l$, and the gradient vector $\nabla C$ has corresponding components $\partial C / \partial w_k$ and $\partial C / \partial b_l$.

Writing out the gradient descent update rule in terms of components, we have \begin{eqnarray} w_k & \rightarrow & w_k' = w_k-\eta \frac{\partial C}{\partial w_k} \tag{16}\\ b_l & \rightarrow & b_l' = b_l-\eta \frac{\partial C}{\partial b_l}.

In practice, to compute the gradient $\nabla C$ we need to compute the gradients $\nabla C_x$ separately for each training input, $x$, and then average them, $\nabla C = \frac{1}{n} \sum_x \nabla C_x$.

To make these ideas more precise, stochastic gradient descent works by randomly picking out a small number $m$ of randomly chosen training inputs.

Provided the sample size $m$ is large enough we expect that the average value of the $\nabla C_{X_j}$ will be roughly equal to the average over all $\nabla C_x$, that is, \begin{eqnarray} \frac{\sum_{j=1}^m \nabla C_{X_{j}}}{m} \approx \frac{\sum_x \nabla C_x}{n} = \nabla C, \tag{18}\end{eqnarray} where the second sum is over the entire set of training data.

Swapping sides we get \begin{eqnarray} \nabla C \approx \frac{1}{m} \sum_{j=1}^m \nabla C_{X_{j}}, \tag{19}\end{eqnarray} confirming that we can estimate the overall gradient by computing gradients just for the randomly chosen mini-batch.

Then stochastic gradient descent works by picking out a randomly chosen mini-batch of training inputs, and training with those, \begin{eqnarray} w_k & \rightarrow & w_k' = w_k-\frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial w_k} \tag{20}\\ b_l & \rightarrow & b_l' = b_l-\frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial b_l}, \tag{21}\end{eqnarray} where the sums are over all the training examples $X_j$ in the current mini-batch.

And, in a similar way, the mini-batch update rules (20)\begin{eqnarray} w_k & \rightarrow & w_k' = w_k-\frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial w_k} \nonumber\end{eqnarray}$('#margin_255037324417_reveal').click(function() {$('#margin_255037324417').toggle('slow', function() {});});

and (21)\begin{eqnarray} b_l & \rightarrow & b_l' = b_l-\frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial b_l} \nonumber\end{eqnarray}$('#margin_141169455106_reveal').click(function() {$('#margin_141169455106').toggle('slow', function() {});});

We can think of stochastic gradient descent as being like political polling: it's much easier to sample a small mini-batch than it is to apply gradient descent to the full batch, just as carrying out a poll is easier than running a full election.

For example, if we have a training set of size $n = 60,000$, as in MNIST, and choose a mini-batch size of (say) $m = 10$, this means we'll get a factor of $6,000$ speedup in estimating the gradient!

Of course, the estimate won't be perfect - there will be statistical fluctuations - but it doesn't need to be perfect: all we really care about is moving in a general direction that will help decrease $C$, and that means we don't need an exact computation of the gradient.

In practice, stochastic gradient descent is a commonly used and powerful technique for learning in neural networks, and it's the basis for most of the learning techniques we'll develop in this book.

That is, given a training input, $x$, we update our weights and biases according to the rules $w_k \rightarrow w_k' = w_k - \eta \partial C_x / \partial w_k$ and $b_l \rightarrow b_l' = b_l - \eta \partial C_x / \partial b_l$.

Name one advantage and one disadvantage of online learning, compared to stochastic gradient descent with a mini-batch size of, say, $20$.

In neural networks the cost $C$ is, of course, a function of many variables - all the weights and biases - and so in some sense defines a surface in a very high-dimensional space.

I won't go into more detail here, but if you're interested then you may enjoy reading this discussion of some of the techniques professional mathematicians use to think in high dimensions.

We'll leave the test images as is, but split the 60,000-image MNIST training set into two parts: a set of 50,000 images, which we'll use to train our neural network, and a separate 10,000 image validation set.

We won't use the validation data in this chapter, but later in the book we'll find it useful in figuring out how to set certain hyper-parameters of the neural network - things like the learning rate, and so on, which aren't directly selected by our learning algorithm.

When I refer to the 'MNIST training data' from now on, I'll be referring to our 50,000 image data set, not the original 60,000 image data set* *As noted earlier, the MNIST data set is based on two data sets collected by NIST, the United States' National Institute of Standards and Technology.

for x, y in zip(sizes[:-1], sizes[1:])]

So, for example, if we want to create a Network object with 2 neurons in the first layer, 3 neurons in the second layer, and 1 neuron in the final layer, we'd do this with the code: net = Network([2, 3, 1])

The biases and weights in the Network object are all initialized randomly, using the Numpy np.random.randn function to generate Gaussian distributions with mean $0$ and standard deviation $1$.

Note that the Network initialization code assumes that the first layer of neurons is an input layer, and omits to set any biases for those neurons, since biases are only ever used in computing the outputs from later layers.

The big advantage of using this ordering is that it means that the vector of activations of the third layer of neurons is: \begin{eqnarray} a' = \sigma(w a + b).

(This is called vectorizing the function $\sigma$.) It's easy to verify that Equation (22)\begin{eqnarray} a' = \sigma(w a + b) \nonumber\end{eqnarray}$('#margin_552886241220_reveal').click(function() {$('#margin_552886241220').toggle('slow', function() {});});

gives the same result as our earlier rule, Equation (4)\begin{eqnarray} \frac{1}{1+\exp(-\sum_j w_j x_j-b)} \nonumber\end{eqnarray}$('#margin_7421600236_reveal').click(function() {$('#margin_7421600236').toggle('slow', function() {});});, for computing the output of a sigmoid neuron.

in component form, and verify that it gives the same result as the rule (4)\begin{eqnarray} \frac{1}{1+\exp(-\sum_j w_j x_j-b)} \nonumber\end{eqnarray}$('#margin_347257101140_reveal').click(function() {$('#margin_347257101140').toggle('slow', function() {});});

We then add a feedforward method to the Network class, which, given an input a for the network, returns the corresponding output* *It is assumed that the input a is an (n, 1) Numpy ndarray, not a (n,) vector.

Although using an (n,) vector appears the more natural choice, using an (n, 1) ndarray makes it particularly easy to modify the code to feedforward multiple inputs at once, and that is sometimes convenient.

All the method does is applies Equation (22)\begin{eqnarray} a' = \sigma(w a + b) \nonumber\end{eqnarray}$('#margin_335258165235_reveal').click(function() {$('#margin_335258165235').toggle('slow', function() {});});


for k in xrange(0, n, mini_batch_size)]

self.update_mini_batch(mini_batch, eta)

print "Epoch {0}: {1} / {2}".format(

j, self.evaluate(test_data), n_test)

print "Epoch {0} complete".format(j)

This is done by the code self.update_mini_batch(mini_batch, eta), which updates the network weights and biases according to a single iteration of gradient descent, using just the training data in mini_batch.

delta_nabla_b, delta_nabla_w = self.backprop(x, y)

nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]

nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]

for w, nw in zip(self.weights, nabla_w)]

for b, nb in zip(self.biases, nabla_b)]

Most of the work is done by the line delta_nabla_b, delta_nabla_w = self.backprop(x, y)

The self.backprop method makes use of a few extra functions to help in computing the gradient, namely sigmoid_prime, which computes the derivative of the $\sigma$ function, and self.cost_derivative, which I won't describe here.

for x, y in zip(sizes[:-1], sizes[1:])]


for k in xrange(0, n, mini_batch_size)]

self.update_mini_batch(mini_batch, eta)

print "Epoch {0}: {1} / {2}".format(

j, self.evaluate(test_data), n_test)

print "Epoch {0} complete".format(j)

delta_nabla_b, delta_nabla_w = self.backprop(x, y)

nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]

nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]

for w, nw in zip(self.weights, nabla_w)]

for b, nb in zip(self.biases, nabla_b)]

activations = [x] # list to store all the activations, layer by layer

zs = [] # list to store all the z vectors, layer by layer

# l = 1 means the last layer of neurons, l = 2 is the

delta =[-l+1].transpose(), delta) * sp

for (x, y) in test_data]

Finally, we'll use stochastic gradient descent to learn from the MNIST training_data over 30 epochs, with a mini-batch size of 10, and a learning rate of $\eta = 3.0$, >>>

As was the case earlier, if you're running the code as you read along, you should be warned that it takes quite a while to execute (on my machine this experiment takes tens of seconds for each training epoch), so it's wise to continue reading in parallel while the code executes.

At least in this case, using more hidden neurons helps us get better results* *Reader feedback indicates quite some variation in results for this experiment, and some training runs give results quite a bit worse.

Using the techniques introduced in chapter 3 will greatly reduce the variation in performance across different training runs for our networks..

(If making a change improves things, try doing more!) If we do that several times over, we'll end up with a learning rate of something like $\eta = 1.0$ (and perhaps fine tune to $3.0$), which is close to our earlier experiments.

Exercise Try creating a network with just two layers - an input and an output layer, no hidden layer - with 784 and 10 neurons, respectively.

The data structures used to store the MNIST data are described in the documentation strings - it's straightforward stuff, tuples and lists of Numpy ndarray objects (think of them as vectors if you're not familiar with ndarrays): """mnist_loader~~~~~~~~~~~~A library to load the MNIST image data.

In some sense, the moral of both our results and those in more sophisticated papers, is that for some problems: sophisticated algorithm $\leq$ simple learning algorithm + good training data.

We could attack this problem the same way we attacked handwriting recognition - by using the pixels in the image as input to a neural network, with the output from the network a single neuron indicating either 'Yes, it's a face' or 'No, it's not a face'.

The end result is a network which breaks down a very complicated question - does this image show a face or not - into very simple questions answerable at the level of single pixels.

It does this through a series of many layers, with early layers answering very simple and specific questions about the input image, and later layers building up a hierarchy of ever more complex and abstract concepts.

Comparing a deep network to a shallow network is a bit like comparing a programming language with the ability to make function calls to a stripped down language with no ability to make such calls.

Radial basis function network

In the field of mathematical modeling, a radial basis function network is an artificial neural network that uses radial basis functions as activation functions.

The output of the network is a linear combination of radial basis functions of the inputs and neuron parameters.

Radial basis function networks have many uses, including function approximation, time series prediction, classification, and system control.

They were first formulated in a 1988 paper by Broomhead and Lowe, both researchers at the Royal Signals and Radar Establishment.[1][2][3]

Radial basis function (RBF) networks typically have three layers: an input layer, a hidden layer with a non-linear RBF activation function and a linear output layer.

The input can be modeled as a vector of real numbers

{\displaystyle \mathbf {x} \in \mathbb {R} ^{n}}

The output of the network is then a scalar function of the input vector,

{\displaystyle \varphi :\mathbb {R} ^{n}\to \mathbb {R} }

and is given by where

is the number of neurons in the hidden layer,

{\displaystyle \mathbf {c} _{i}}

is the center vector for neuron

{\displaystyle i}

{\displaystyle a_{i}}

is the weight of neuron

{\displaystyle i}

in the linear output neuron.

Functions that depend only on the distance from a center vector are radially symmetric about that vector, hence the name radial basis function.

In the basic form all inputs are connected to each hidden neuron.

The norm is typically taken to be the Euclidean distance (although the Mahalanobis distance appears to perform better in general[citation needed]) and the radial basis function is commonly taken to be Gaussian The Gaussian basis functions are local to the center vector in the sense that i.e.

changing parameters of one neuron has only a small effect for input values that are far away from the center of that neuron.

Given certain mild conditions on the shape of the activation function, RBF networks are universal approximators on a compact subset of

{\displaystyle \mathbb {R} ^{n}}

.[4] This means that an RBF network with enough hidden neurons can approximate any continuous function on a closed, bounded set with arbitrary precision.

The parameters

{\displaystyle a_{i}}

{\displaystyle \mathbf {c} _{i}}

{\displaystyle \beta _{i}}

are determined in a manner that optimizes the fit between

{\displaystyle \varphi }

In addition to the above unnormalized architecture, RBF networks can be normalized.

In this case the mapping is where is known as a 'normalized radial basis function'.

There is theoretical justification for this architecture in the case of stochastic data flow.

Assume a stochastic kernel approximation for the joint probability density where the weights

{\displaystyle \mathbf {c} _{i}}

{\displaystyle e_{i}}

are exemplars from the data and we require the kernels to be normalized and The probability densities in the input and output spaces are and The expectation of y given an input

{\displaystyle \mathbf {x} }

is where is the conditional probability of y given

{\displaystyle \mathbf {x} }

The conditional probability is related to the joint probability through Bayes theorem which yields This becomes when the integrations are performed.

It is sometimes convenient to expand the architecture to include local linear models.

In that case the architectures become, to first order, and in the unnormalized and normalized cases, respectively.

{\displaystyle \mathbf {b} _{i}}

are weights to be determined.

Higher order linear terms are also possible.

This result can be written where and in the unnormalized case and in the normalized case.

{\displaystyle \delta _{ij}}

is a Kronecker delta function defined as RBF networks are typically trained from pairs of input and target values

{\displaystyle \mathbf {x} (t),y(t)}

{\displaystyle t=1,\dots ,T}

by a two-step algorithm.

In the first step, the center vectors

{\displaystyle \mathbf {c} _{i}}

of the RBF functions in the hidden layer are chosen.

This step can be performed in several ways;

centers can be randomly sampled from some set of examples, or they can be determined using k-means clustering.

Note that this step is unsupervised.

A third backpropagation step can be performed to fine-tune all of the RBF net's parameters.[3] The second step simply fits a linear model with coefficients

{\displaystyle w_{i}}

to the hidden layer's outputs with respect to some objective function.

A common objective function, at least for regression/function estimation, is the least squares function: where We have explicitly included the dependence on the weights.

Minimization of the least squares objective function by optimal choice of weights optimizes accuracy of fit.

There are occasions in which multiple objectives, such as smoothness as well as accuracy, must be optimized.

In that case it is useful to optimize a regularized objective function such as where and where optimization of S maximizes smoothness and

{\displaystyle \lambda }

is known as a regularization parameter.

RBF networks can be used to interpolate a function

{\displaystyle y:\mathbb {R} ^{n}\to \mathbb {R} }

when the values of that function are known on finite number of points:

{\displaystyle y(\mathbf {x} _{i})=b_{i},i=1,\ldots ,N}

Taking the known points

{\displaystyle \mathbf {x} _{i}}

to be the centers of the radial basis functions and evaluating the values of the basis functions at the same points

{\displaystyle g_{ij}=\rho (||\mathbf {x} _{j}-\mathbf {x} _{i}||)}

the weights can be solved from the equation It can be shown that the interpolation matrix in the above equation is non-singular, if the points

{\displaystyle \mathbf {x} _{i}}

are distinct, and thus the weights

{\displaystyle w}

can be solved by simple linear algebra: If the purpose is not to perform strict interpolation but instead more general function approximation or classification the optimization is somewhat more complex because there is no obvious choice for the centers.

The training is typically done in two phases first fixing the width and centers and then the weights.

This can be justified by considering the different nature of the non-linear hidden neurons versus the linear output neuron.

Basis function centers can be randomly sampled among the input instances or obtained by Orthogonal Least Square Learning Algorithm or found by clustering the samples and choosing the cluster means as the centers.

The RBF widths are usually all fixed to same value which is proportional to the maximum distance between the chosen centers.

After the centers

{\displaystyle c_{i}}

have been fixed, the weights that minimize the error at the output are computed with a linear pseudoinverse solution: where the entries of G are the values of the radial basis functions evaluated at the points

{\displaystyle x_{i}}

{\displaystyle g_{ji}=\rho (||x_{j}-c_{i}||)}

The existence of this linear solution means that unlike multi-layer perceptron (MLP) networks, RBF networks have a unique local minimum (when the centers are fixed).

Another possible training algorithm is gradient descent.

In gradient descent training, the weights are adjusted at each time step by moving them in a direction opposite from the gradient of the objective function (thus allowing the minimum of the objective function to be found), where

{\displaystyle \nu }

For the case of training the linear weights,

the algorithm becomes in the unnormalized case and in the normalized case.

For local-linear-architectures gradient-descent training is For the case of training the linear weights,

the algorithm becomes in the unnormalized case and in the normalized case and in the local-linear case.

For one basis function, projection operator training reduces to Newton's method.

The basic properties of radial basis functions can be illustrated with a simple mathematical map, the logistic map, which maps the unit interval onto itself.

It can be used to generate a convenient prototype data stream.

The logistic map can be used to explore function approximation, time series prediction, and control theory.

The map originated from the field of population dynamics and became the prototype for chaotic time series.

The map, in the fully chaotic regime, is given by where t is a time index.

The value of x at time t+1 is a parabolic function of x at time t.

This equation represents the underlying geometry of the chaotic time series generated by the logistic map.

Generation of the time series from this equation is the forward problem.

The examples here illustrate the inverse problem;

identification of the underlying dynamics, or fundamental equation, of the logistic map from exemplars of the time series.

The goal is to find an estimate for f.

The architecture is where Since the input is a scalar rather than a vector, the input dimension is one.

We choose the number of basis functions as N=5 and the size of the training set to be 100 exemplars generated by the chaotic time series.

{\displaystyle \beta }

is taken to be a constant equal to 5.

are five exemplars from the time series.

are trained with projection operator training: where the learning rate

{\displaystyle \nu }

The training is performed with one pass through the 100 training points.

The rms error is 0.15.

The normalized RBF architecture is where Again: Again, we choose the number of basis functions as five and the size of the training set to be 100 exemplars generated by the chaotic time series.

{\displaystyle \beta }

is taken to be a constant equal to 6.

are five exemplars from the time series.

are trained with projection operator training: where the learning rate

{\displaystyle \nu }

The training is performed with one pass through the 100 training points.

The rms error on a test set of 100 exemplars is 0.084, smaller than the unnormalized error.

Normalization yields accuracy improvement.

Typically accuracy with normalized basis functions increases even more over unnormalized functions as input dimensionality increases.

Once the underlying geometry of the time series is estimated as in the previous examples, a prediction for the time series can be made by iteration: A

comparison of the actual and estimated time series is displayed in the figure.

The estimated times series starts out at time zero with an exact knowledge of x(0).

It then uses the estimate of the dynamics to update the time series estimate for several time steps.

Note that the estimate is accurate for only a few time steps.

This is a general characteristic of chaotic time series.

This is a property of the sensitive dependence on initial conditions common to chaotic time series.

A small initial error is amplified with time.

A measure of the divergence of time series with nearly identical initial conditions is known as the Lyapunov exponent.

We assume the output of the logistic map can be manipulated through a control parameter

such that The goal is to choose the control parameter in such a way as to drive the time series to a desired output

This can be done if we choose the control paramer to be where is an approximation to the underlying natural dynamics of the system.

The learning algorithm is given by where

Deriving gradients using the backpropagation idea

In the section on the backpropagation algorithm, you were briefly introduced to backpropagation as a means of deriving gradients for learning in the sparse autoencoder.

It turns out that together with matrix calculus, this provides a powerful method and intuition for deriving gradients for more complex matrix functions (functions from matrices to the reals, or symbolically, from ).

To illustrate the use of the backpropagation idea to compute derivatives with respect to the inputs, we will use two functions from the section on sparse coding, in examples 1 and 2.

In example 3, we use a function from independent component analysis to illustrate the use of this idea to compute derivates with respect to weights, and in this specific case, what to do in the case of tied or repeated weights.

The first term, , can be seen as an instantiation of neural network taking s as an input, and proceeding in four steps, as described and illustrated in the paragraph and diagram below:

Fortunately, it turns out that if W appears multiple times in the network, the gradient with respect to W is simply the sum of gradients for each instance of W in the network (you may wish to work out a formal proof of this fact to convince yourself).

Taking sums, noting that we need to transpose the gradient with respect to WT to get the gradient with respect to W, yields the final gradient with respect to W (pardon the slight abuse of notation here):