AI News

(If you are interested in leveraging pre-existing Go packaging for machine learning, check out all the great existing packages, and be sure to watch Chris Benson's recent talk at GolangUK about Deep Learning in Go.)

There are a whole variety of ways to accomplish the task of building a neural net in Go, but I wanted to adhere to the following guidelines. The basic network architecture that we will utilize in this example includes an input layer, a single hidden layer, and an output layer.

By optimizing the weights and the biases, with a process called backpropagation, we will be able to mimic the relationships between our inputs (measurements of flowers) and what we are trying to predict (species of flowers).

(If you are new to neural nets, you might also check out this great intro, or, of course, you can read the relevant section in Machine Learning with Go.) Before diving into backpropagation and feeding forward, let's define a couple of types that will help us as we work with our model. We also need to define our activation function and its derivative, which we will utilize during backpropagation.
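The original post implements these in Go; purely for illustration, here is a minimal sketch of a sigmoid activation and its derivative in Python/NumPy (the function names are hypothetical, not the post's identifiers):

import numpy as np

def sigmoid(x):
    # Logistic sigmoid: squashes any real value into (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    # Derivative of the sigmoid, used when backpropagating errors.
    s = sigmoid(x)
    return s * (1.0 - s)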

Here we have utilized a helper function that allows us to sum values along one dimension of a matrix, keeping the other dimension intact. After training our neural net, we are going to want to use it to make predictions.

This data set includes sets of four iris flower measurements (what will become our x values) along with a corresponding indication of iris species (what will become our y values).

To utilize this data set with our neural net, I have slightly transformed the data set, such that the species values are represented by three binary columns (1 if the row corresponds to that species, 0 otherwise).
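As an illustration of that encoding (a hypothetical Python sketch, not the post's preprocessing code), each species label becomes three binary columns:

species = ["setosa", "versicolor", "virginica"]

def one_hot(label):
    # 1 for the matching species column, 0 for the other two.
    return [1 if label == s else 0 for s in species]

print(one_hot("virginica"))  # [0, 0, 1]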

We can then parse the test data into matrices testInputs and testLabels (I'll spare you these details as they are the same as above), use our predict() method to make predictions for the flower species, and then compare the predictions to the actual species.

Backpropagation

Backpropagation is a method used in artificial neural networks to calculate a gradient that is needed in the calculation of the weights to be used in the network.[1] It is commonly used to train deep neural networks,[2] a term referring to neural networks with more than one hidden layer.[3] Backpropagation is a special case of an older and more general technique called automatic differentiation.

In the context of learning, backpropagation is commonly used by the gradient descent optimization algorithm to adjust the weight of neurons by calculating the gradient of the loss function.

This technique is also sometimes called backward propagation of errors, because the error is calculated at the output and distributed back through the network layers.

The backpropagation algorithm has been repeatedly rediscovered and is equivalent to automatic differentiation in reverse accumulation mode.

Backpropagation requires the derivative of the loss function with respect to the network output to be known, which typically (but not necessarily) means that a desired target value is known.

For this reason it is considered to be a supervised learning method, although it is used in some unsupervised networks such as autoencoders.

Backpropagation is also a generalization of the delta rule to multi-layered feedforward networks, made possible by using the chain rule to iteratively compute gradients for each layer.

It is closely related to the Gauss–Newton algorithm, and is part of continuing research in neural backpropagation.

Backpropagation can be used with any gradient-based optimizer, such as L-BFGS or truncated Newton.

The goal of any supervised learning algorithm is to find a function that best maps a set of inputs to their correct output.

An example would be a classification task, where the input is an image of an animal, and the correct output is the name of the animal.

The motivation for backpropagation is to train a multi-layered neural network such that it can learn the appropriate internal representations to allow it to learn any arbitrary mapping of input to output.[4]

Sometimes referred to as the cost function or error function (not to be confused with the Gauss error function), the loss function is a function that maps values of one or more variables onto a real number intuitively representing some 'cost' associated with those values.

For backpropagation, the loss function calculates the difference between the network output and its expected output, after a case propagates through the network.

Two assumptions must be made about the form of the error function.[5] The first is that it can be written as an average $E = \frac{1}{n}\sum_x E_x$ over error functions $E_x$, for $n$ individual training examples, $x$.

The reason for this assumption is that the backpropagation algorithm calculates the gradient of the error function for a single training example, which needs to be generalized to the overall error function.

The second assumption is that it can be written as a function of the outputs from the neural network.

Let $y, y'$ be vectors in $\mathbb{R}^n$.

Select an error function $E(y, y')$ measuring the difference between two outputs.

The standard choice is the square of the Euclidean distance between the vectors $y$ and $y'$: $E(y, y') = \tfrac{1}{2}\lVert y - y'\rVert^2$.

Note that the factor of $\tfrac{1}{2}$ conveniently cancels the exponent when the error function is subsequently differentiated.

The error function over $n$ training examples can simply be written as an average of losses over individual examples: $E = \frac{1}{2n}\sum_x \lVert y(x) - y'(x)\rVert^2$,

and therefore, the partial derivative with respect to the outputs is $\frac{\partial E}{\partial y'} = y' - y$.

The optimization algorithm repeats a two phase cycle, propagation and weight update.

When an input vector is presented to the network, it is propagated forward through the network, layer by layer, until it reaches the output layer.

The output of the network is then compared to the desired output, using a loss function.

The resulting error value is calculated for each of the neurons in the output layer.

The error values are then propagated from the output back through the network, until each neuron has an associated error value that reflects its contribution to the original output.

Backpropagation uses these error values to calculate the gradient of the loss function.

In the second phase, this gradient is fed to the optimization method, which in turn uses it to update the weights, in an attempt to minimize the loss function.

Let $N$ be a neural network with $e$ connections, $m$ inputs, and $n$ outputs. Below, $x_1, x_2, \dots$ will denote vectors in $\mathbb{R}^m$, $y_1, y_2, \dots$ vectors in $\mathbb{R}^n$, and $w_0, w_1, w_2, \ldots$ vectors in $\mathbb{R}^e$. These are called inputs, outputs and weights respectively.

The neural network corresponds to a function $y = f_N(w, x)$ which, given a weight $w$, maps an input $x$ to an output $y$.

The optimization takes as input a sequence of training examples $(x_1, y_1), \dots, (x_p, y_p)$ and produces a sequence of weights $w_0, w_1, \dots, w_p$, starting from some initial weight $w_0$, usually chosen at random.

These weights are computed in turn: first compute $w_i$ using only $(x_i, y_i, w_{i-1})$ for $i = 1, \dots, p$. The output of the algorithm is then $w_p$, giving us a new function $x \mapsto f_N(w_p, x)$.

The computation is the same in each step, hence only the case $i = 1$ is described. Calculating $w_1$ from $(x_1, y_1, w_0)$ is done by considering a variable weight $w$ and applying gradient descent to the function $w \mapsto E(f_N(w, x_1), y_1)$ to find a local minimum, starting at $w = w_0$. This makes $w_1$ the minimizing weight found by gradient descent.

To implement the algorithm above, explicit formulas are required for the gradient of the function $w \mapsto E(f_N(w, x), y)$, where $E$ is the error function defined above.

The learning algorithm can be divided into two phases: propagation and weight update.

Each propagation involves a forward pass, which produces the network's output activations, and a backward pass, which produces the error terms (deltas) for each neuron. For each weight, the gradient is then formed as the product of the weight's input activation and its output delta, and a ratio (percentage) of this gradient is subtracted from the weight. This ratio influences the speed and quality of learning;

it is called the learning rate.

The greater the ratio, the faster the neuron trains, but the lower the ratio, the more accurate the training is.

The sign of the gradient of a weight indicates whether the error varies directly with, or inversely to, the weight.

Therefore, the weight must be updated in the opposite direction, 'descending' the gradient.

Learning is repeated (on new batches) until the network performs adequately.

The following is pseudocode for a stochastic gradient descent algorithm for training a three-layer network (only one hidden layer). The lines labeled 'backward pass' can be implemented using the backpropagation algorithm, which calculates the gradient of the error of the network with respect to the network's modifiable weights.[6] To understand the mathematical derivation of the backpropagation algorithm, it helps to first develop some intuitions about the relationship between the actual output of a neuron and the correct output for a particular training case.
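The pseudocode itself did not survive extraction. As a stand-in, the following is a minimal Python sketch of stochastic gradient descent for a network with a single hidden layer; the variable names, the sigmoid non-linearity, and the omission of bias terms are simplifying assumptions, not the original listing.

import numpy as np

def sgd_train(X, T, W1, W2, eta=0.5, epochs=100):
    # W1: hidden-layer weights (H x D), W2: output-layer weights (K x H).
    for _ in range(epochs):
        for x, t in zip(X, T):                    # one training example at a time
            # forward pass
            h = 1.0 / (1.0 + np.exp(-(W1 @ x)))   # hidden activations
            y = 1.0 / (1.0 + np.exp(-(W2 @ h)))   # output activations
            # backward pass: output and hidden error terms (deltas)
            delta_out = (y - t) * y * (1.0 - y)
            delta_hid = (W2.T @ delta_out) * h * (1.0 - h)
            # weight update: step down the gradient of the squared error
            W2 -= eta * np.outer(delta_out, h)
            W1 -= eta * np.outer(delta_hid, x)
    return W1, W2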

Consider a simple neural network with two input units, one output unit and no hidden units.

Each neuron uses a linear output[note 1] that is the weighted sum of its input.

Initially, before training, the weights will be set randomly.

Then the neuron learns from training examples, which in this case consist of a set of tuples $(x_1, x_2, t)$, where $x_1$ and $x_2$ are the inputs to the network and $t$ is the correct output (the output the network should eventually produce given those inputs).

The initial network, given $x_1$ and $x_2$, will compute an output $y$ that likely differs from $t$ (given random weights).

A common method for measuring the discrepancy between the expected output $t$ and the actual output $y$ is the squared error measure $E = (t - y)^2$, where $E$ is the discrepancy or error.

As an example, consider the network on a single training case: $(1, 1, 0)$; thus the inputs $x_1$ and $x_2$ are 1 and 1 respectively, and the correct output $t$ is 0.

Now if the actual output y is plotted on the horizontal axis against the error E on the vertical axis, the result is a parabola.

The minimum of the parabola corresponds to the output y which minimizes the error E.

For a single training case, the minimum also touches the horizontal axis, which means the error will be zero and the network can produce an output y that exactly matches the expected output t.

Therefore, the problem of mapping inputs to outputs can be reduced to an optimization problem of finding a function that will produce the minimal error.

However, the output of a neuron depends on the weighted sum of all its inputs, $y = x_1 w_1 + x_2 w_2$, where $w_1$ and $w_2$ are the weights on the connections from the input units to the output unit.

Therefore, the error also depends on the incoming weights to the neuron, which is ultimately what needs to be changed in the network to enable learning.

If each weight is plotted on a separate horizontal axis and the error on the vertical axis, the result is a parabolic bowl.

For a neuron with $k$ weights, the same plot would require an elliptic paraboloid of $k + 1$ dimensions.

One commonly used algorithm to find the set of weights that minimizes the error is gradient descent.

Backpropagation is then used to calculate the steepest descent direction.

The gradient descent method involves calculating the derivative of the squared error function with respect to the weights of the network.

This is normally done using backpropagation.

Assuming one output neuron,[note 2] the squared error function is $E = \tfrac{1}{2}(t - y)^2$, where $t$ is the target output and $y$ is the actual output of the output neuron. The factor of $\textstyle\frac{1}{2}$ is included to cancel the exponent when differentiating.

Later, the expression will be multiplied with an arbitrary learning rate, so that it doesn't matter if a constant coefficient is introduced now.

For each neuron $j$, its output $o_j$ is defined as $o_j = \varphi(\text{net}_j) = \varphi\!\left(\sum_{k=1}^{n} w_{kj}\, o_k\right)$. The input $\text{net}_j$ to a neuron is the weighted sum of the outputs $o_k$ of the previous neurons. If the neuron is in the first layer after the input layer, the $o_k$ of the input layer are simply the inputs $x_k$. The number of input units to the neuron is $n$.

The variable $w_{kj}$ denotes the weight between neurons $k$ and $j$. The activation function $\varphi$ is non-linear and differentiable.

A commonly used activation function is the logistic function $\varphi(z) = \frac{1}{1+e^{-z}}$, which has the convenient derivative $\varphi'(z) = \varphi(z)(1 - \varphi(z))$. Calculating the partial derivative of the error with respect to a weight $w_{ij}$ is done using the chain rule twice: $\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial o_j}\,\frac{\partial o_j}{\partial \text{net}_j}\,\frac{\partial \text{net}_j}{\partial w_{ij}}$. In the last factor of the right-hand side, only one term in the sum $\text{net}_j$ depends on $w_{ij}$, so that $\frac{\partial \text{net}_j}{\partial w_{ij}} = o_i$. If the neuron is in the first layer after the input layer, $o_i$ is simply $x_i$.

The derivative of the output of neuron $j$ with respect to its input is simply the partial derivative of the activation function (assuming here that the logistic function is used): $\frac{\partial o_j}{\partial \text{net}_j} = o_j(1 - o_j)$. This is the reason why backpropagation requires the activation function to be differentiable. (Nevertheless, the ReLU activation function, which is not differentiable at zero, has become quite popular, e.g. in AlexNet.) The first factor is straightforward to evaluate if the neuron is in the output layer, because then $o_j = y$ and $\frac{\partial E}{\partial o_j} = \frac{\partial E}{\partial y} = y - t$. However, if $j$ is in an arbitrary inner layer of the network, finding the derivative of $E$ with respect to $o_j$ is less obvious. Considering $E$ as a function of the inputs of all neurons $L = \{u, v, \dots, w\}$ receiving input from neuron $j$, and taking the total derivative with respect to $o_j$, a recursive expression for the derivative is obtained: $\frac{\partial E}{\partial o_j} = \sum_{\ell \in L} \frac{\partial E}{\partial o_\ell}\,\frac{\partial o_\ell}{\partial \text{net}_\ell}\, w_{j\ell}$. Therefore, the derivative with respect to $o_j$ can be calculated if all the derivatives with respect to the outputs $o_\ell$ of the next layer – the one closer to the output neuron – are known.

Putting it all together: $\frac{\partial E}{\partial w_{ij}} = \delta_j\, o_i$, with $\delta_j = \frac{\partial E}{\partial o_j}\frac{\partial o_j}{\partial \text{net}_j}$. To update the weight $w_{ij}$ using gradient descent, one must choose a learning rate $\eta > 0$. The change in weight needs to reflect the impact on $E$ of an increase or decrease in $w_{ij}$: if $\frac{\partial E}{\partial w_{ij}} > 0$, an increase in $w_{ij}$ increases $E$; conversely, if $\frac{\partial E}{\partial w_{ij}} < 0$, an increase in $w_{ij}$ decreases $E$. The new $\Delta w_{ij}$ is added to the old weight, and the product of the learning rate and the gradient, multiplied by $-1$, guarantees that $w_{ij}$ changes in a way that always decreases $E$. In other words, in the equation immediately below, $-\eta \frac{\partial E}{\partial w_{ij}}$ always changes $w_{ij}$ in such a way that $E$ is decreased: $\Delta w_{ij} = -\eta \frac{\partial E}{\partial w_{ij}} = -\eta\, \delta_j\, o_i$. For a single-layer network, this expression becomes the Delta Rule.[7] The choice of learning rate $\eta$ is important, since a high value can cause too strong a change, causing the minimum to be missed, while a too low learning rate slows the training unnecessarily.

Optimizations such as Quickprop are primarily aimed at speeding up error minimization;

other improvements mainly try to increase reliability.

In order to avoid oscillation inside the network, such as alternating connection weights, and to improve the rate of convergence, refinements of this algorithm use an adaptive learning rate.[8] By using a variable inertia term (momentum) $\alpha$, the gradient and the last change can be weighted such that the weight adjustment additionally depends on the previous change. When $\alpha$ is equal to 0, the change depends solely on the gradient, while a value of 1 means it depends only on the last change.

Similar to a ball rolling down a mountain, whose current speed is determined not only by the current slope of the mountain but also by its own inertia, inertia can be added:

{\displaystyle \Delta w_{ij}(t+1)=(1-\alpha )\eta \delta _{j}o_{i}+\alpha \,\Delta w_{ij}(t)}

where $\delta_j$ is the error term of neuron $j$ and $o_i$ the output of neuron $i$. Inertia makes the weight change at time $(t+1)$ depend both on the current gradient of the error function (slope of the mountain, 1st summand) and on the weight change from the previous point in time (inertia, 2nd summand).

With inertia, the problems of getting stuck (in steep ravines and flat plateaus) are avoided.

Since, for example, the gradient of the error function becomes very small in flat plateaus, a plain gradient step would immediately lead to a 'deceleration' of the descent there; this deceleration is delayed by the addition of the inertia term, so that a flat plateau can be escaped more quickly.
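A small Python sketch of a momentum-style update consistent with the formula above, written in terms of the partial derivative $\partial E/\partial w$ so the descent sign is explicit (the names are illustrative):

def momentum_step(w, dE_dw, prev_dw, eta=0.1, alpha=0.9):
    # Blend the current gradient-descent step with the previous change:
    # alpha near 0 relies on the current gradient, alpha near 1 on inertia.
    dw = (1.0 - alpha) * (-eta * dE_dw) + alpha * prev_dw
    return w + dw, dw   # updated weight and the change to remember for next time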

Two modes of learning are available: stochastic and batch.

In stochastic learning, each input creates a weight adjustment.

In batch learning weights are adjusted based on a batch of inputs, accumulating errors over the batch.

Stochastic learning introduces 'noise' into the gradient descent process, using the local gradient calculated from one data point;

this reduces the chance of the network getting stuck in local minima.

However, batch learning typically yields a faster, more stable descent to a local minimum, since each update is performed in the direction of the average error of the batch.

A common compromise is to use 'mini-batches': small batches whose samples are selected stochastically from the entire data set.

According to various sources,[11][12][13][14][15] the basics of continuous backpropagation were derived in the context of control theory by Henry J. Kelley in 1960 and by Arthur E. Bryson in 1961.[17] They used principles of dynamic programming.

In 1962, Stuart Dreyfus published a simpler derivation based only on the chain rule.[18] Bryson and Ho described it as a multi-stage dynamic system optimization method in 1969.[19][20] In 1970 Linnainmaa published the general method for automatic differentiation (AD) of discrete connected networks of nested differentiable functions.[21][22] This corresponds to backpropagation, which is efficient even for sparse networks.[14][15][23][24] In 1973 Dreyfus used backpropagation to adapt parameters of controllers in proportion to error gradients.[25] In 1974 Werbos mentioned the possibility of applying this principle to artificial neural networks,[26] and in 1982 he applied Linnainmaa's AD method to neural networks in the way that is used today.[15][27] In 1986 Rumelhart, Hinton and Williams showed experimentally that this method can generate useful internal representations of incoming data in hidden layers of neural networks.[4][28] In 1993, Wan was the first[14] to win an international pattern recognition contest through backpropagation.[29] During the 2000s it fell out of favour, but returned in the 2010s, benefitting from cheap, powerful GPU-based computing systems.

Using neural nets to recognize handwritten digits

Simple intuitions about how we recognize shapes - 'a 9 has a loop at the top, and a vertical stroke in the bottom right' - turn out to be not so simple to express algorithmically.

As a prototype it hits a sweet spot: it's challenging - it's no small feat to recognize handwritten digits - but it's not so difficult as to require an extremely complicated solution, or tremendous computational power.

But along the way we'll develop many key ideas about neural networks, including two important types of artificial neuron (the perceptron and the sigmoid neuron), and the standard learning algorithm for neural networks, known as stochastic gradient descent.

Today, it's more common to use other models of artificial neurons - in this book, and in much modern work on neural networks, the main neuron model used is one called the sigmoid neuron.

A perceptron takes several binary inputs, $x_1, x_2, \ldots$, and produces a single binary output: In the example shown the perceptron has three inputs, $x_1, x_2, x_3$.

The neuron's output, $0$ or $1$, is determined by whether the weighted sum $\sum_j w_j x_j$ is less than or greater than some threshold value.

To put it in more precise algebraic terms: \begin{eqnarray} \mbox{output} & = & \left\{ \begin{array}{ll} 0 & \mbox{if } \sum_j w_j x_j \leq \mbox{ threshold} \\ 1 & \mbox{if } \sum_j w_j x_j > \mbox{ threshold} \end{array} \right. \end{eqnarray}

And it should seem plausible that a complex network of perceptrons could make quite subtle decisions: In this network, the first column of perceptrons - what we'll call the first layer of perceptrons - is making three very simple decisions, by weighing the input evidence.

The first change is to write $\sum_j w_j x_j$ as a dot product, $w \cdot x \equiv \sum_j w_j x_j$, where $w$ and $x$ are vectors whose components are the weights and inputs, respectively.

Using the bias instead of the threshold, the perceptron rule can be rewritten: \begin{eqnarray} \mbox{output} = \left\{ \begin{array}{ll} 0 & \mbox{if } w\cdot x + b \leq 0 \\ 1 & \mbox{if } w\cdot x + b > 0 \end{array} \right. \end{eqnarray}

This requires computing the bitwise sum, $x_1 \oplus x_2$, as well as a carry bit which is set to $1$ when both $x_1$ and $x_2$ are $1$, i.e., the carry bit is just the bitwise product $x_1 x_2$: To get an equivalent network of perceptrons we replace all the NAND gates by perceptrons with two inputs, each with weight $-2$, and an overall bias of $3$.
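As a quick sanity check of that claim (a small illustrative sketch, not code from the book), a perceptron with weights $-2, -2$ and bias $3$ reproduces the NAND truth table:

def perceptron(x, w, b):
    # Fires (outputs 1) exactly when the weighted sum plus bias is positive.
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, perceptron((x1, x2), (-2, -2), 3))
# prints 1 for every input pair except (1, 1), i.e. the NAND truth table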

Note that I've moved the perceptron corresponding to the bottom right NAND gate a little, just to make it easier to draw the arrows on the diagram: One notable aspect of this network of perceptrons is that the output from the leftmost perceptron is used twice as input to the bottommost perceptron.

(If you don't find this obvious, you should stop and prove to yourself that this is equivalent.) With that change, the network looks as follows, with all unmarked weights equal to -2, all biases equal to 3, and a single weight of -4, as marked: Up to now I've been drawing inputs like $x_1$ and $x_2$ as variables floating to the left of the network of perceptrons.

In fact, it's conventional to draw an extra layer of perceptrons - the input layer - to encode the inputs: This notation for input perceptrons, in which we have an output, but no inputs, is a shorthand.

Then the weighted sum $\sum_j w_j x_j$ would always be zero, and so the perceptron would output $1$ if $b > 0$, and $0$ if $b \leq 0$.

Instead of explicitly laying out a circuit of NAND and other gates, our neural networks can simply learn to solve problems, sometimes problems where it would be extremely difficult to directly design a conventional circuit.

If it were true that a small change in a weight (or bias) causes only a small change in output, then we could use this fact to modify the weights and biases to get our network to behave more in the manner we want.

In fact, a small change in the weights or bias of any single perceptron in the network can sometimes cause the output of that perceptron to completely flip, say from $0$ to $1$.

We'll depict sigmoid neurons in the same way we depicted perceptrons: Just like a perceptron, the sigmoid neuron has inputs, $x_1, x_2, \ldots$.

Instead, it's $\sigma(w \cdot x+b)$, where $\sigma$ is called the sigmoid function* *Incidentally, $\sigma$ is sometimes called the logistic function, and this new class of neurons called logistic neurons.

It is defined by \begin{eqnarray} \sigma(z) \equiv \frac{1}{1+e^{-z}}. \tag{3}\end{eqnarray} To put it all a little more explicitly, the output of a sigmoid neuron with inputs $x_1,x_2,\ldots$, weights $w_1,w_2,\ldots$, and bias $b$ is \begin{eqnarray} \frac{1}{1+\exp(-\sum_j w_j x_j-b)}. \tag{4}\end{eqnarray}

In fact, there are many similarities between perceptrons and sigmoid neurons, and the algebraic form of the sigmoid function turns out to be more of a technical detail than a true barrier to understanding.

[Plot: the sigmoid function, $\sigma(z)$ as a function of $z$.]

[Plot: the step function as a function of $z$.]

If $\sigma$ had in fact been a step function, then the sigmoid neuron would be a perceptron, since the output would be $1$ or $0$ depending on whether $w\cdot x+b$ was positive or negative* *Actually, when $w \cdot x +b = 0$ the perceptron outputs $0$, while the step function outputs $1$.

The smoothness of $\sigma$ means that small changes $\Delta w_j$ in the weights and $\Delta b$ in the bias will produce a small change $\Delta \mbox{output}$ in the output from the neuron.

In fact, calculus tells us that $\Delta \mbox{output}$ is well approximated by \begin{eqnarray} \Delta \mbox{output} \approx \sum_j \frac{\partial \, \mbox{output}}{\partial w_j} \Delta w_j + \frac{\partial \, \mbox{output}}{\partial b} \Delta b, \tag{5}\end{eqnarray} where the sum is over all the weights, $w_j$, and $\partial \, \mbox{output} / \partial w_j$ and $\partial \, \mbox{output} /\partial b$ denote partial derivatives of the $\mbox{output}$ with respect to $w_j$ and $b$, respectively.

While the expression above looks complicated, with all the partial derivatives, it's actually saying something very simple (and which is very good news): $\Delta \mbox{output}$ is a linear function of the changes $\Delta w_j$ and $\Delta b$ in the weights and bias.

If it's the shape of $\sigma$ which really matters, and not its exact form, then why use the particular form used for $\sigma$ in Equation (3)?

In fact, later in the book we will occasionally consider neurons where the output is $f(w \cdot x + b)$ for some other activation function $f(\cdot)$.

The main thing that changes when we use a different activation function is that the particular values for the partial derivatives in Equation (5) change.

It turns out that when we compute those partial derivatives later, using $\sigma$ will simplify the algebra, simply because exponentials have lovely properties when differentiated.

But in practice we can set up a convention to deal with this, for example, by deciding to interpret any output of at least $0.5$ as indicating a '9', and any output less than $0.5$ as indicating 'not a 9'.

Exercises Sigmoid neurons simulating perceptrons, part I $\mbox{}$ Suppose we take all the weights and biases in a network of perceptrons, and multiply them by a positive constant, $c > 0$.

Show that the behaviour of the network doesn't change. Sigmoid neurons simulating perceptrons, part II $\mbox{}$ Suppose we have the same setup as the last problem - a network of perceptrons.

Suppose the weights and biases are such that $w \cdot x + b \neq 0$ for the input $x$ to any particular perceptron in the network.

Now replace all the perceptrons in the network by sigmoid neurons, and multiply the weights and biases by a positive constant $c > 0$.

Suppose we have the network: As mentioned earlier, the leftmost layer in this network is called the input layer, and the neurons within the layer are called input neurons.

The term 'hidden' perhaps sounds a little mysterious - the first time I heard the term I thought it must have some deep philosophical or mathematical significance - but it really means nothing more than 'not an input or an output'.

For example, the following four-layer network has two hidden layers: Somewhat confusingly, and for historical reasons, such multiple layer networks are sometimes called multilayer perceptrons or MLPs, despite being made up of sigmoid neurons, not perceptrons.

If the image is a $64$ by $64$ greyscale image, then we'd have $4,096 = 64 \times 64$ input neurons, with the intensities scaled appropriately between $0$ and $1$.

The output layer will contain just a single neuron, with output values of less than $0.5$ indicating 'input image is not a 9', and values greater than $0.5$ indicating 'input image is a 9'.

A trial segmentation gets a high score if the individual digit classifier is confident of its classification in all segments, and a low score if the classifier is having a lot of trouble in one or more segments.

So instead of worrying about segmentation we'll concentrate on developing a neural network which can solve the more interesting and difficult problem, namely, recognizing individual handwritten digits.

As discussed in the next section, our training data for the network will consist of many $28$ by $28$ pixel images of scanned handwritten digits, and so the input layer contains $784 = 28 \times 28$ neurons.

The input pixels are greyscale, with a value of $0.0$ representing white, a value of $1.0$ representing black, and in between values representing gradually darkening shades of grey.

A seemingly natural way of doing that is to use just $4$ output neurons, treating each neuron as taking on a binary value, depending on whether the neuron's output is closer to $0$ or to $1$.

The ultimate justification is empirical: we can try out both network designs, and it turns out that, for this particular problem, the network with $10$ output neurons learns to recognize digits better than the network with $4$ output neurons.

In a similar way, let's suppose for the sake of argument that the second, third, and fourth neurons in the hidden layer detect whether or not the following images are present:

Of course, that's not the only sort of evidence we can use to conclude that the image was a $0$ - we could legitimately get a $0$ in many other ways (say, through translations of the above images, or slight distortions).

Assume that the first $3$ layers of neurons are such that the correct output in the third layer (i.e., the old output layer) has activation at least $0.99$, and incorrect outputs have activation less than $0.01$.

We'll use the MNIST data set, which contains tens of thousands of scanned images of handwritten digits, together with their correct classifications.

To make this a good test of performance, the test data was taken from a different set of 250 people than the original training data (albeit still a group split between Census Bureau employees and high school students).

For example, if a particular training image, $x$, depicts a $6$, then $y(x) = (0, 0, 0, 0, 0, 0, 1, 0, 0, 0)^T$ is the desired output from the network.

We use the term cost function throughout this book, but you should note the other terminology, since it's often used in research papers and other discussions of neural networks.

\begin{eqnarray} C(w,b) \equiv \frac{1}{2n} \sum_x \| y(x) - a\|^2. \tag{6}\end{eqnarray} Here, $w$ denotes the collection of all weights in the network, $b$ all the biases, $n$ is the total number of training inputs, $a$ is the vector of outputs from the network when $x$ is input, and the sum is over all training inputs, $x$.

If we instead use a smooth cost function like the quadratic cost it turns out to be easy to figure out how to make small changes in the weights and biases so as to get an improvement in the cost.

Even given that we want to use a smooth cost function, you may still wonder why we choose the quadratic function used in Equation (6).

This is a well-posed problem, but it's got a lot of distracting structure as currently posed - the interpretation of $w$ and $b$ as weights and biases, the $\sigma$ function lurking in the background, the choice of network architecture, MNIST, and so on.

And for neural networks we'll often want far more variables - the biggest neural networks have cost functions which depend on billions of weights and biases in an extremely complicated way.

We could do this simulation simply by computing derivatives (and perhaps some second derivatives) of $C$ - those derivatives would tell us everything we need to know about the local 'shape' of the valley, and therefore how our ball should roll.

So rather than get into all the messy details of physics, let's simply ask ourselves: if we were declared God for a day, and could make up our own laws of physics, dictating to the ball how it should roll, what law or laws of motion could we pick that would make it so the ball always rolled to the bottom of the valley?

To make this question more precise, let's think about what happens when we move the ball a small amount $\Delta v_1$ in the $v_1$ direction, and a small amount $\Delta v_2$ in the $v_2$ direction.

Calculus tells us that $C$ changes as follows: \begin{eqnarray} \Delta C \approx \frac{\partial C}{\partial v_1} \Delta v_1 + \frac{\partial C}{\partial v_2} \Delta v_2. \tag{7}\end{eqnarray}

To figure out how to make such a choice it helps to define $\Delta v$ to be the vector of changes in $v$, $\Delta v \equiv (\Delta v_1, \Delta v_2)^T$, where $T$ is again the transpose operation, turning row vectors into column vectors.

We denote the gradient vector by $\nabla C$, i.e.: \begin{eqnarray} \nabla C \equiv \left( \frac{\partial C}{\partial v_1}, \frac{\partial C}{\partial v_2} \right)^T. \end{eqnarray}

In fact, it's perfectly fine to think of $\nabla C$ as a single mathematical object - the vector defined above - which happens to be written using two symbols.

With these definitions, the expression (7) for $\Delta C$ can be rewritten as \begin{eqnarray} \Delta C \approx \nabla C \cdot \Delta v. \tag{9}\end{eqnarray} This equation helps explain why $\nabla C$ is called the gradient vector: $\nabla C$ relates changes in $v$ to changes in $C$, just as we'd expect something called a gradient to do.

In particular, suppose we choose \begin{eqnarray} \Delta v = -\eta \nabla C, \tag{10}\end{eqnarray} where $\eta$ is a small, positive parameter (known as the learning rate).

Then Equation (9) tells us that $\Delta C \approx -\eta \nabla C \cdot \nabla C = -\eta \|\nabla C\|^2$. Because $\|\nabla C\|^2 \geq 0$, this guarantees that $\Delta C \leq 0$, i.e., $C$ will always decrease, never increase, if we change $v$ according to the prescription in (10).

We then use Equation (10) to compute a value for $\Delta v$, and move the ball's position $v$ by that amount: \begin{eqnarray} v \rightarrow v' = v -\eta \nabla C. \end{eqnarray}

To make gradient descent work correctly, we need to choose the learning rate $\eta$ to be small enough that Equation (9) is a good approximation.

In practical implementations, $\eta$ is often varied so that Equation (9) remains a good approximation, but the algorithm isn't too slow.

Then the change $\Delta C$ in $C$ produced by a small change $\Delta v = (\Delta v_1, \ldots, \Delta v_m)^T$ is \begin{eqnarray} \Delta C \approx \nabla C \cdot \Delta v, \tag{12}\end{eqnarray} where the gradient $\nabla C$ is the vector \begin{eqnarray} \nabla C \equiv \left(\frac{\partial C}{\partial v_1}, \ldots, \frac{\partial C}{\partial v_m}\right)^T.

\tag{13}\end{eqnarray} Just as for the two variable case, we can choose \begin{eqnarray} \Delta v = -\eta \nabla C, \tag{14}\end{eqnarray} and we're guaranteed that our (approximate) expression (12) for $\Delta C$ will be negative.

This gives us a way of following the gradient to a minimum, even when $C$ is a function of many variables, by repeatedly applying the update rule \begin{eqnarray} v \rightarrow v' = v-\eta \nabla C. \end{eqnarray}
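As a toy illustration of that update rule (a made-up quadratic cost, not an example from the book), repeatedly stepping $v \rightarrow v' = v - \eta \nabla C$ drives $C$ toward its minimum:

import numpy as np

# Toy cost C(v) = v_1^2 + v_2^2, whose gradient is 2v; the minimum is at the origin.
def grad_C(v):
    return 2.0 * v

v = np.array([3.0, -4.0])
eta = 0.1
for _ in range(100):
    v = v - eta * grad_C(v)   # the update rule v -> v' = v - eta * grad C
print(v)                      # very close to [0, 0]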

The rule doesn't always work - several things can go wrong and prevent gradient descent from finding the global minimum of $C$, a point we'll return to explore in later chapters.

But, in practice gradient descent often works extremely well, and in neural networks we'll find that it's a powerful way of minimizing the cost function, and so helping the net learn.

It can be proved that the choice of $\Delta v$ which minimizes $\nabla C \cdot \Delta v$ is $\Delta v = - \eta \nabla C$, where $\eta = \epsilon / \|\nabla C\|$ is determined by the size constraint $\|\Delta v\| = \epsilon$.

Hint: If you're not already familiar with the Cauchy-Schwarz inequality, you may find it helpful to familiarize yourself with it.

If there are a million such $v_j$ variables then we'd need to compute something like a trillion (i.e., a million squared) second partial derivatives* *Actually, more like half a trillion, since $\partial^2 C/ \partial v_j \partial v_k = \partial^2 C/ \partial v_k \partial v_j$.

The idea is to use gradient descent to find the weights $w_k$ and biases $b_l$ which minimize the cost in Equation (6).

In other words, our 'position' now has components $w_k$ and $b_l$, and the gradient vector $\nabla C$ has corresponding components $\partial C / \partial w_k$ and $\partial C / \partial b_l$.

Writing out the gradient descent update rule in terms of components, we have \begin{eqnarray} w_k & \rightarrow & w_k' = w_k-\eta \frac{\partial C}{\partial w_k} \tag{16}\\ b_l & \rightarrow & b_l' = b_l-\eta \frac{\partial C}{\partial b_l}. \tag{17}\end{eqnarray}

In practice, to compute the gradient $\nabla C$ we need to compute the gradients $\nabla C_x$ separately for each training input, $x$, and then average them, $\nabla C = \frac{1}{n} \sum_x \nabla C_x$.

To make these ideas more precise, stochastic gradient descent works by randomly picking out a small number $m$ of randomly chosen training inputs.

Provided the sample size $m$ is large enough we expect that the average value of the $\nabla C_{X_j}$ will be roughly equal to the average over all $\nabla C_x$, that is, \begin{eqnarray} \frac{\sum_{j=1}^m \nabla C_{X_{j}}}{m} \approx \frac{\sum_x \nabla C_x}{n} = \nabla C, \tag{18}\end{eqnarray} where the second sum is over the entire set of training data.

Swapping sides we get \begin{eqnarray} \nabla C \approx \frac{1}{m} \sum_{j=1}^m \nabla C_{X_{j}}, \tag{19}\end{eqnarray} confirming that we can estimate the overall gradient by computing gradients just for the randomly chosen mini-batch.

Then stochastic gradient descent works by picking out a randomly chosen mini-batch of training inputs, and training with those, \begin{eqnarray} w_k & \rightarrow & w_k' = w_k-\frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial w_k} \tag{20}\\ b_l & \rightarrow & b_l' = b_l-\frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial b_l}, \tag{21}\end{eqnarray} where the sums are over all the training examples $X_j$ in the current mini-batch.

And, in a similar way, the mini-batch update rules (20) and (21) sometimes omit the $\frac{1}{m}$ term out the front of the sums.

We can think of stochastic gradient descent as being like political polling: it's much easier to sample a small mini-batch than it is to apply gradient descent to the full batch, just as carrying out a poll is easier than running a full election.

For example, if we have a training set of size $n = 60,000$, as in MNIST, and choose a mini-batch size of (say) $m = 10$, this means we'll get a factor of $6,000$ speedup in estimating the gradient!

Of course, the estimate won't be perfect - there will be statistical fluctuations - but it doesn't need to be perfect: all we really care about is moving in a general direction that will help decrease $C$, and that means we don't need an exact computation of the gradient.

In practice, stochastic gradient descent is a commonly used and powerful technique for learning in neural networks, and it's the basis for most of the learning techniques we'll develop in this book.

That is, given a training input, $x$, we update our weights and biases according to the rules $w_k \rightarrow w_k' = w_k - \eta \partial C_x / \partial w_k$ and $b_l \rightarrow b_l' = b_l - \eta \partial C_x / \partial b_l$.

Name one advantage and one disadvantage of online learning, compared to stochastic gradient descent with a mini-batch size of, say, $20$.

In neural networks the cost $C$ is, of course, a function of many variables - all the weights and biases - and so in some sense defines a surface in a very high-dimensional space.

I won't go into more detail here, but if you're interested then you may enjoy reading this discussion of some of the techniques professional mathematicians use to think in high dimensions.

We'll leave the test images as is, but split the 60,000-image MNIST training set into two parts: a set of 50,000 images, which we'll use to train our neural network, and a separate 10,000 image validation set.

We won't use the validation data in this chapter, but later in the book we'll find it useful in figuring out how to set certain hyper-parameters of the neural network - things like the learning rate, and so on, which aren't directly selected by our learning algorithm.

When I refer to the 'MNIST training data' from now on, I'll be referring to our 50,000 image data set, not the original 60,000 image data set* *As noted earlier, the MNIST data set is based on two data sets collected by NIST, the United States' National Institute of Standards and Technology.

for x, y in zip(sizes[:-1], sizes[1:])]

So, for example, if we want to create a Network object with 2 neurons in the first layer, 3 neurons in the second layer, and 1 neuron in the final layer, we'd do this with the code: net = Network([2, 3, 1])

The biases and weights in the Network object are all initialized randomly, using the Numpy np.random.randn function to generate Gaussian distributions with mean $0$ and standard deviation $1$.

Note that the Network initialization code assumes that the first layer of neurons is an input layer, and omits to set any biases for those neurons, since biases are only ever used in computing the outputs from later layers.
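A minimal initialization sketch consistent with that description (a reconstruction for readability, not necessarily the book's exact listing):

import numpy as np

class Network(object):
    def __init__(self, sizes):
        self.num_layers = len(sizes)
        self.sizes = sizes
        # One bias vector per non-input layer, drawn from a standard Gaussian.
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        # One weight matrix per pair of adjacent layers.
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(sizes[:-1], sizes[1:])]

With this, net = Network([2, 3, 1]) produces a 3x2 and a 1x3 weight matrix, plus bias vectors for the hidden and output layers.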

The big advantage of using this ordering is that it means that the vector of activations of the third layer of neurons is: \begin{eqnarray} a' = \sigma(w a + b).

(This is called vectorizing the function $\sigma$.) It's easy to verify that Equation (22) gives the same result as our earlier rule, Equation (4), for computing the output of a sigmoid neuron.

As an exercise, write Equation (22) out in component form, and verify that it gives the same result as the rule (4) for computing the output of a sigmoid neuron.

We then add a feedforward method to the Network class, which, given an input a for the network, returns the corresponding output* *It is assumed that the input a is an (n, 1) Numpy ndarray, not a (n,) vector.

Although using an (n,) vector appears the more natural choice, using an (n, 1) ndarray makes it particularly easy to modify the code to feedforward multiple inputs at once, and that is sometimes convenient.

All the method does is apply Equation (22) for each layer.
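A sketch of such a feedforward method under the same assumptions (again a reconstruction, paired with a vectorized sigmoid helper):

import numpy as np

def sigmoid(z):
    # Vectorized: NumPy applies exp elementwise, so z may be an ndarray.
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(self, a):
    """Return the network output for input a, an (n, 1) ndarray."""
    for b, w in zip(self.biases, self.weights):
        a = sigmoid(np.dot(w, a) + b)   # Equation (22): a' = sigma(w a + b)
    return a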

training_data[k:k+mini_batch_size]

for k in xrange(0, n, mini_batch_size)]

self.update_mini_batch(mini_batch, eta)

print "Epoch {0}: {1} / {2}".format(

j, self.evaluate(test_data), n_test)

print "Epoch {0} complete".format(j)

This is done by the code self.update_mini_batch(mini_batch, eta), which updates the network weights and biases according to a single iteration of gradient descent, using just the training data in mini_batch.

delta_nabla_b, delta_nabla_w = self.backprop(x, y)

nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]

nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]

for w, nw in zip(self.weights, nabla_w)]

for b, nb in zip(self.biases, nabla_b)]

Most of the work is done by the line delta_nabla_b, delta_nabla_w = self.backprop(x, y)

The self.backprop method makes use of a few extra functions to help in computing the gradient, namely sigmoid_prime, which computes the derivative of the $\sigma$ function, and self.cost_derivative, which I won't describe here.
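For context, here is a sketch of an update_mini_batch consistent with the fragments quoted above (a reconstruction, assuming self.backprop returns per-example gradient lists for the biases and weights):

def update_mini_batch(self, mini_batch, eta):
    # Accumulate the gradient over the mini-batch, then take one descent step.
    nabla_b = [np.zeros(b.shape) for b in self.biases]
    nabla_w = [np.zeros(w.shape) for w in self.weights]
    for x, y in mini_batch:
        delta_nabla_b, delta_nabla_w = self.backprop(x, y)
        nabla_b = [nb + dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
        nabla_w = [nw + dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
    self.weights = [w - (eta / len(mini_batch)) * nw
                    for w, nw in zip(self.weights, nabla_w)]
    self.biases = [b - (eta / len(mini_batch)) * nb
                   for b, nb in zip(self.biases, nabla_b)]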


activations = [x] # list to store all the activations, layer by layer

zs = [] # list to store all the z vectors, layer by layer

# l = 1 means the last layer of neurons, l = 2 is the second-to-last layer, and so on.

delta = np.dot(self.weights[-l+1].transpose(), delta) * sp

for (x, y) in test_data]

Finally, we'll use stochastic gradient descent to learn from the MNIST training_data over 30 epochs, with a mini-batch size of 10, and a learning rate of $\eta = 3.0$.

As was the case earlier, if you're running the code as you read along, you should be warned that it takes quite a while to execute (on my machine this experiment takes tens of seconds for each training epoch), so it's wise to continue reading in parallel while the code executes.

At least in this case, using more hidden neurons helps us get better results* *Reader feedback indicates quite some variation in results for this experiment, and some training runs give results quite a bit worse.

Using the techniques introduced in chapter 3 will greatly reduce the variation in performance across different training runs for our networks.

(If making a change improves things, try doing more!) If we do that several times over, we'll end up with a learning rate of something like $\eta = 1.0$ (and perhaps fine tune to $3.0$), which is close to our earlier experiments.

Exercise Try creating a network with just two layers - an input and an output layer, no hidden layer - with 784 and 10 neurons, respectively.

The data structures used to store the MNIST data are described in the documentation strings - it's straightforward stuff, tuples and lists of Numpy ndarray objects (think of them as vectors if you're not familiar with ndarrays): """mnist_loader~~~~~~~~~~~~A library to load the MNIST image data.

In some sense, the moral of both our results and those in more sophisticated papers, is that for some problems: sophisticated algorithm $\leq$ simple learning algorithm + good training data.

We could attack this problem the same way we attacked handwriting recognition - by using the pixels in the image as input to a neural network, with the output from the network a single neuron indicating either 'Yes, it's a face' or 'No, it's not a face'.

The end result is a network which breaks down a very complicated question - does this image show a face or not - into very simple questions answerable at the level of single pixels.

It does this through a series of many layers, with early layers answering very simple and specific questions about the input image, and later layers building up a hierarchy of ever more complex and abstract concepts.

Comparing a deep network to a shallow network is a bit like comparing a programming language with the ability to make function calls to a stripped down language with no ability to make such calls.

It involves subtracting the mean across every individual feature in the data, and has the geometric interpretation of centering the cloud of data around the origin along every dimension.

It only makes sense to apply this preprocessing if you have a reason to believe that different input features have different scales (or units), but they should be of approximately equal importance to the learning algorithm. In the case of images, the relative scales of pixels are already approximately equal (and in range from 0 to 255), so it is not strictly necessary to perform this additional preprocessing step.
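A minimal NumPy sketch of mean subtraction and normalization (the data matrix here is a random stand-in, one example per row):

import numpy as np

X = np.random.randn(100, 5) * 7.0 + 3.0   # stand-in [N x D] data matrix
X -= np.mean(X, axis=0)                   # zero-center each feature
X /= (np.std(X, axis=0) + 1e-8)           # scale each feature to unit standard deviation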

Then, we can compute the covariance matrix that tells us about the correlation structure in the data: The (i,j) element of the data covariance matrix contains the covariance between i-th and j-th dimension of the data.

To decorrelate the data, we project the original (but zero-centered) data into the eigenbasis: Notice that the columns of U are a set of orthonormal vectors (norm of 1, and orthogonal to each other), so they can be regarded as basis vectors.

This is also sometimes referred to as Principal Component Analysis (PCA) dimensionality reduction: After this operation, we would have reduced the original dataset of size [N x D] to one of size [N x 100], keeping the 100 dimensions of the data that contain the most variance.

The geometric interpretation of this transformation is that if the input data is a multivariable gaussian, then the whitened data will be a gaussian with zero mean and identity covariance matrix.

One weakness of this transformation is that it can greatly exaggerate the noise in the data, since it stretches all dimensions (including the irrelevant dimensions of tiny variance that are mostly noise) to be of equal size in the input.
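A sketch of the covariance, PCA, and whitening steps just described (the data matrix is again a stand-in; the small epsilon prevents division by near-zero variances, which is the same stretching of low-variance directions that causes the noise amplification mentioned above):

import numpy as np

X = np.dot(np.random.randn(100, 5), np.random.randn(5, 5))  # correlated stand-in data
X -= np.mean(X, axis=0)              # zero-center first
cov = np.dot(X.T, X) / X.shape[0]    # [D x D] data covariance matrix
U, S, Vt = np.linalg.svd(cov)        # columns of U form an orthonormal eigenbasis
Xrot = np.dot(X, U)                  # decorrelate: project onto the eigenbasis
Xreduced = np.dot(X, U[:, :2])       # PCA: keep only the top-variance components
Xwhite = Xrot / np.sqrt(S + 1e-5)    # whiten: equalize variance along every dimension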

Note that we do not know what the final value of every weight should be in the trained network, but with proper data normalization it is reasonable to assume that approximately half of the weights will be positive and half of them will be negative.

The idea is that the neurons are all random and unique in the beginning, so they will compute distinct updates and integrate themselves as diverse parts of the full network.

The implementation for one weight matrix might look like W = 0.01* np.random.randn(D,H), where randn samples from a zero mean, unit standard deviation gaussian.

With this formulation, every neuron’s weight vector is initialized as a random vector sampled from a multi-dimensional gaussian, so the neurons point in random direction in the input space.

That is, the recommended heuristic is to initialize each neuron’s weight vector as: w = np.random.randn(n) / sqrt(n), where n is the number of its inputs.

The sketch of the derivation is as follows: Consider the inner product \(s = \sum_i^n w_i x_i\) between the weights \(w\) and input \(x\), which gives the raw activation of a neuron before the non-linearity.

And since \(\text{Var}(aX) = a^2\text{Var}(X)\) for a random variable \(X\) and a scalar \(a\), this implies that we should draw from unit gaussian and then scale it by \(a = \sqrt{1/n}\), to make its variance \(1/n\).

In this paper, the authors end up recommending an initialization of the form \( \text{Var}(w) = 2/(n_{in} + n_{out}) \) where \(n_{in}, n_{out}\) are the number of units in the previous layer and the next layer.

A more recent paper on this topic, Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification by He et al., derives an initialization specifically for ReLU neurons, reaching the conclusion that the variance of neurons in the network should be \(2.0/n\).

This gives the initialization w = np.random.randn(n) * sqrt(2.0/n), and is the current recommendation for use in practice in the specific case of neural networks with ReLU neurons.

Another way to address the uncalibrated variances problem is to set all weight matrices to zero, but to break symmetry every neuron is randomly connected (with weights sampled from a small gaussian as above) to a fixed number of neurons below it.

For ReLU non-linearities, some people like to use small constant value such as 0.01 for all biases because this ensures that all ReLU units fire in the beginning and therefore obtain and propagate some gradient.

However, it is not clear if this provides a consistent improvement (in fact some results seem to indicate that this performs worse) and it is more common to simply use 0 bias initialization.

A recently developed technique by Ioffe and Szegedy called Batch Normalization alleviates a lot of headaches with properly initializing neural networks by explicitly forcing the activations throughout a network to take on a unit gaussian distribution at the beginning of the training.

In the implementation, applying this technique usually amounts to inserting the BatchNorm layer immediately after fully connected layers (or convolutional layers, as we'll soon see), and before non-linearities.
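For orientation, a minimal sketch of the train-time batch normalization computation itself (per-feature normalization over the mini-batch followed by a learned scale and shift; the running averages used at test time are omitted):

import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x: [batch x features]; normalize every feature over the mini-batch.
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)   # approximately unit-Gaussian activations
    return gamma * x_hat + beta             # learned scale (gamma) and shift (beta)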

It is common to see the factor of \(\frac{1}{2}\) in front because then the gradient of this term with respect to the parameter \(w\) is simply \(\lambda w\) instead of \(2 \lambda w\).

Lastly, notice that during gradient descent parameter update, using the L2 regularization ultimately means that every weight is decayed linearly: W += -lambda * W towards zero.

L1 regularization is another relatively common form of regularization, where for each weight \(w\) we add the term \(\lambda \mid w \mid\) to the objective.

Another form of regularization is to enforce an absolute upper bound on the magnitude of the weight vector for every neuron and use projected gradient descent to enforce the constraint.

In practice, this corresponds to performing the parameter update as normal, and then enforcing the constraint by clamping the weight vector \(\vec{w}\) of every neuron to satisfy \(\Vert \vec{w} \Vert_2 < c\).

Vanilla dropout in an example 3-layer Neural Network would be implemented as follows: In the code above, inside the train_step function we have performed dropout twice: on the first hidden layer and on the second hidden layer.

It can also be shown that performing this attenuation at test time can be related to the process of iterating over all the possible binary masks (and therefore all the exponentially many sub-networks) and computing their ensemble prediction.

Since test-time performance is so critical, it is always preferable to use inverted dropout, which performs the scaling at train time, leaving the forward pass at test time untouched.

There has been a large amount of research since the first introduction of dropout that tries to understand the source of its power in practice, and its relation to other regularization techniques; a sketch of inverted dropout follows below.
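The original code example was lost in extraction; as a stand-in, here is a small sketch of inverted dropout applied to one hidden layer's activations (p is the keep probability, and dividing by p at train time is what leaves the test-time forward pass untouched):

import numpy as np

p = 0.5   # probability of keeping a unit active

def dropout_train(h):
    # Inverted dropout: drop units and rescale at train time only.
    mask = (np.random.rand(*h.shape) < p) / p
    return h * mask

def dropout_predict(h):
    # Test time: no dropout and no extra scaling needed.
    return h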

As we already mentioned in the Linear Classification section, it is not common to regularize the bias parameters because they do not interact with the data through multiplicative interactions, and therefore do not have the interpretation of controlling the influence of a data dimension on the final objective.

For example, a binary classifier for each category independently would take the form: where the sum is over all categories \(j\), and \(y_{ij}\) is either +1 or -1 depending on whether the i-th example is labeled with the j-th attribute, and the score vector \(f_j\) will be positive when the class is predicted to be present and negative otherwise.

A binary logistic regression classifier has only two classes (0,1), and calculates the probability of class 1 as the sigmoid of the score, \(\sigma(f)\). Since the probabilities of class 1 and 0 sum to one, the probability for class 0 is \(P(y = 0 \mid x; w) = 1 - P(y = 1 \mid x; w)\).

The expression above can look scary but the gradient on \(f\) is in fact extremely simple and intuitive: \(\partial{L_i} / \partial{f_j} = y_{ij} - \sigma(f_j)\) (as you can double check yourself by taking the derivatives).

The L2 norm squared would compute the loss for a single example of the form: The reason the L2 norm is squared in the objective is that the gradient becomes much simpler, without changing the optimal parameters since squaring is a monotonic operation.

For example, if you are predicting star rating for a product, it might work much better to use 5 independent classifiers for ratings of 1-5 stars instead of a regression loss.

If you’re certain that classification is not appropriate, use the L2 but be careful: For example, the L2 is more fragile and applying dropout in the network (especially in the layer right before the L2 loss) is not a great idea.

Lecture 6 | Training Neural Networks I

In Lecture 6 we discuss many practical issues for training modern neural networks. We discuss different activation functions, the importance of data ...

Neural Network train in MATLAB

This video explain how to design and train a Neural Network in MATLAB.

Artificial Neural Network Tutorial | Deep Learning With Neural Networks | Edureka

This Edureka "Neural Network Tutorial" video is part of their TensorFlow training series.

Gradient descent, how neural networks learn | Chapter 2, deep learning

(Part 3 of this series will be on backpropagation.)

Neural networks [2.7] : Training neural networks - backpropagation

The Best Way to Prepare a Dataset Easily

In this video, I go over the 3 steps you need to prepare a dataset.

Artificial Neural Networks in R (a Regression example)

This tutorial covers the implementation of ANN models (using default algorithm: feed-forward back-propagation) and also discusses NID (Neural Interpretation ...

Train, Test, & Validation Sets explained

In this video, we explain the concept of the different data sets used for training and testing an artificial neural network, including the training set, testing set, and validation set.

Neural networks [9.9] : Computer vision - data set expansion

R-Session 11 - Statistical Learning - Neural Networks

Source: neuralnet: Training of Neural Network by Frauke Gunther and Stefan Fritsch - The R Journal Vol. 2/1, June 2010.