# AI News, Hacker's guide to Neural Networks

## Hacker's guide to Neural Networks

Javascript allows one to nicely visualize what’s going on and to play around with the various hyperparameter settings, but I still regularly hear from people who ask for a more thorough treatment of the topic.

In my opinion, the best way to think of Neural Networks is as real-valued circuits, where real values (instead of boolean values {0,1}) “flow” along edges and interact in gates.

Javascript version of this would very simply look something like this: And in math form we can think of this gate as implementing the real-valued function: As with this example, all of our gates will take one or two inputs and produce a single output value.

Why don’t we tweak x and y randomly and keep track of the tweak that works best: When I run this, I get best_x = -1.9928, best_y = 2.9901, and best_out = -5.9588.

Not quite: This is a perfectly fine strategy for tiny problems with a few gates if you can afford the compute time, but it won’t do if we want to eventually consider huge circuits with millions of inputs.

On the other hand, we’d expect a negative force induced on y that pushes it to become lower (since a lower y, such as y = 2, down from the original y = 3 would make output higher: 2 x -2 = -4, again, larger than -6).

Also, if you’re not very familiar with calculus it is important to note that in the left-hand side of the equation above, the horizontal line does not indicate division.

The entire symbol $$\frac{\partial f(x,y)}{\partial x}$$ is a single thing: the derivative of the function $$f(x,y)$$ with respect to $$x$$.

Anyway, I hope it doesn’t look too scary because it isn’t: The circuit was giving some initial output $$f(x,y)$$, and then we changed one of the inputs by a tiny amount $$h$$ and read the new output $$f(x+h, y)$$.

We turned the knob from x to x + h and the circuit responded by giving a higher value (note again that yes, -5.9997 is higher than -6: -5.9997 &gt;

Technically, you want the value of h to be infinitesimal (the precise mathematical definition of the gradient is defined as the limit of the expression as h goes to zero), but in practice h=0.00001 or so works fine in most cases to get a good approximation.

we just add the derivative on top of every input), we can see that the value increases, as expected: As expected, we changed the inputs by the gradient and the circuit now gives a slightly higher value (-5.87 &gt;

Evaluating the gradient requires just three evaluations of the forward pass of our circuit instead of hundreds, and gives the best tug you can hope for (locally) if you are interested in increasing the value of the output.

For example, step_size = 1.0 gives output -1 (higer, better!), and indeed infinite step size would give infinitely good results.

The gradient guarantees that if you have a very small (indeed, infinitesimally small) step size, then you will definitely get a higher number when you follow its direction, and for that infinitesimally small step size there is no other direction that would have worked better.

But in practice we will have hundreds, thousands or (for neural networks) even tens to hundreds of millions of inputs, and the circuits aren’t just one multiply gate but huge expressions that can be expensive to compute.

You may have seen other people who teach Neural Networks derive the gradient in huge and, frankly, scary and confusing mathematical equations (if you’re not well-versed in maths).

That is because we will only ever derive the gradient for very small and simple expressions (think of it as the base case) and then I will show you how we can compose these very simply with chain rule to evaluate the full gradient (think inductive/recursive case).

We invoked powerful mathematics and can now transform our derivative calculation into the following code: To compute the gradient we went from forwarding the circuit hundreds of times (Strategy #1) to forwarding it only on order of number of times twice the number of inputs (Strategy #2), to forwarding it a single time!

And it gets EVEN better, since the more expensive strategies (#1 and #2) only give an approximation of the gradient, while #3 (the fastest one by far) gives you the exact gradient.

That’s because the numerical gradient is very easy to evaluate (but can be a bit expensive to compute), while the analytic gradient can contain bugs at times, but is usually extremely efficient to compute.

Lets structure the code as follows to make the gates explicit as functions: In the above, I am using a and b as the local variables in the gate functions so that we don’t get these confused with our circuit inputs x,y,z.

If we don’t worry about x and y but only about q and z, then we are back to having only a single gate, and as far as that single * gate is concerned, we know what the (analytic) derivates are from previous section.

“Pulling” upwards on this output value induced a force on both q and z: To increase the output value, the circuit “wants” z to increase, as can be seen by the positive value of the derivative(derivative_f_wrt_z = +3).

The multiplication by -4 seen in the chain rule achieves exactly this: instead of applying a positive force of +1 on both x and y (the local derivative), the full circuit’s gradient on both x and y becomes 1 x -4 = -4.

The only difference between the case of a single gate and multiple interacting gates that compute arbitrarily complex expressions is this additional multipy operation that now happens in each gate.

For example, the + gate always takes the gradient on top and simply passes it on to all of its inputs (notice the example with -4 simply passed on to both of the inputs of + gate).

Since the gradient of max(x,y) with respect to its input is +1 for whichever one of x, y is larger and 0 for the other, this gate is during backprop effectively just a gradient “switch”: it will take the gradient from above and “route” it to the input that had a higher value during the forward pass.

Its best thought of as a “squashing function”, because it takes the input and squashes it to be between zero and one: Very negative values are squashed towards zero and positive values get squashed towards one.

Sigmoid function is defined as: The gradient with respect to its single input, as you can check on Wikipedia or derive yourself if you know some calculus is given by this expression: For example, if the input to the sigmoid gate is x = 3, the gate will compute output f = 1.0 / (1.0 + Math.exp(-x)) = 0.95, and then the (local) gradient on its input will simply be dx = (0.95) * (1 - 0.95) = 0.0475.

Another thing to note is that technically, the sigmoid function is made up of an entire series of gates in a line that compute more atomic functions: an exponentiation gate, an addition gate and a division gate.

Treating it so would work perfectly fine but for this example I chose to collapse all of these gates into a single gate that just computes sigmoid in one shot, because the gradient expression turns out to be simple.

If you’re not a Javascript - familiar person, all that’s going on here is that I’m defining a class that has certain properties (accessed with use of this keyword), and some methods (which in Javascript are placed into the function’s prototype).

Then notice that in the backward function call we get the gradient from the output unit we produced during the forward pass (which will by now hopefully have its gradient filled in) and multiply it with the local gradient for this gate (chain rule!).

This gate computes multiplication (u0.value * u1.value) during forward pass, so recall that the gradient w.r.t u0 is u1.value and w.r.t u1 is u0.value.

This will allow us to possibly use the output of one gate multiple times (think of it as a wire branching out), since it turns out that the gradients from these different branches just add up when computing the final gradient with respect to the circuit output.

To fully specify everything lets finally write out the forward and backward flow for our 2-dimensional neuron with some example values: And now lets compute the gradient: Simply iterate in reverse order and call the backward function!

Finally, lets verify that we implemented the backpropagation correctly by checking the numerical gradient: Indeed, these all give the same values as the backpropagated gradients [-0.105, 0.315, 0.105, 0.105, 0.210].

hope it is clear that even though we only looked at an example of a single neuron, the code I gave above generalizes in a very straight-forward way to compute gradients of arbitrary expressions (including very deep expressions #foreshadowing).

All you have to do is write small gates that compute local, simple derivatives w.r.t their inputs, wire it up in a graph, do a forward pass to compute the output value and then a backward pass that chains the gradients all the way to the input.

Our first example was the * gate: In the code above, I’m assuming that the variable dx is given, coming from somewhere above us in the circuit while we’re doing backprop (or it is +1 by default otherwise).

If you remember the backward flow diagram, the + gate simply takes the gradient on top and routes it equally to all of its inputs (because its local gradient is always simply 1.0 for all its inputs, regardless of their actual values).

So we can do it much faster: Okay, how about combining gates?: If you don’t see how the above happened, introduce a temporary variable q = a * b and then compute x = q + c to convince yourself.

In other words nothing changes: In fact, if you know your power rule from calculus you would also know that if you have $$f(a) = a^2$$ then $$\frac{\partial f(a)}{\partial a} = 2a$$, which is exactly what we get if we think of it as wire splitting up and being two inputs to a gate.

Lets do another one: Okay now lets start to get more complex: When more complex cases like this come up in practice, I like to split the expression into manageable chunks which are almost always composed of simpler expressions and then I chain them together with chain rule: That wasn’t too difficult!

Here are a few more useful functions and their local gradients that are useful in practice: Here’s what division might look like in practice then: Hopefully you see that we are breaking down expressions, doing the forward pass, and then for every variable (such as a) we derive its gradient da as we go backwards, one by one, applying the simple local gradients and chaining them with gradients from above.

Everything we’ve done in this chapter comes down to this: We saw that we can feed some input through arbitrarily complex real-valued circuit, tug at the end of the circuit with some force, and backpropagation distributes that tug through the entire circuit all the way back to the inputs.

In the last chapter we were concerned with real-valued circuits that computed possibly complex expressions of their inputs (the forward pass), and also we could compute the gradients of these expressions on the original inputs (backward pass).

This is a silly toy example, but in practice a +1/-1 dataset could be very useful things indeed: For example spam/no spam emails, where the vectors somehow measure various features of the content of the email, such as the number of times certain enhancement drugs are mentioned.

We will eventually build up to entire neural networks and complex expressions, but lets start out simple and train a linear classifier very similar to the single neuron we saw at the end of Chapter 1.

The only difference is that we’ll get rid of the sigmoid because it makes things unnecessarily complicated (I only used it as an example in Chapter 1 because sigmoid neurons are historically popular but modern Neural Networks rarely, if ever, use sigmoid non-linearities).

For example, if a = 1, b = -2, c = -1, then the function will take the first datapoint ([1.2, 0.7]) and output 1 * 1.2 + (-2) * 0.7 + (-1) = -1.2.

Over time, the pulls on these parameters will tune these values in such a way that the function outputs high scores for positive examples and low scores for negative examples.

At this point, if you’ve seen an explanation of SVMs you’re probably expecting me to define the SVM loss function and plunge into an explanation of slack variables, geometrical intuitions of large margins, kernels, duality, etc.

Lets write the SVM code and take advantage of the circuit machinery we have from Chapter 1: That’s a circuit that simply computes a*x + b*y + c and can also compute the gradient.

Now lets train the SVM with Stochastic Gradient Descent: This code prints the following output: We see that initially our classifier only had 33% training accuracy, but by the end all training examples are correctly classifier as the parameters a,b,c adjusted their values according to the pulls we exerted.

a negative data point that gets a score +100, its influence will be relatively minor on our classifier because we will only pull with force of -1 regardless of how bad the mistake was.

Of interest is the fact that an SVM is just a particular type of a very simple circuit (circuit that computes score = a*x + b*y + c where a,b,c are weights and x,y are data points).

The forward pass will look like this: The specification above is a 2-layer Neural Network with 3 hidden neurons (n1, n2, n3) that uses Rectified Linear Unit (ReLU) non-linearity on each hidden neuron.

But for now, I hope your takeaway is that a 2-layer Neural Net is really not such a scary thing: we write a forward pass expression, interpret the value at the end as a score, and then we pull on that value in a positive or negative direction depending on what we want that value to be for our current particular example.

We are given a dataset of $$N$$ examples $$(x_{i0}, x_{i1})$$ and their corresponding labels $$y_{i}$$ which are allowed to be either $$+1/-1$$ for positive or negative example respectively.

Due to this term the cost will never actually become zero (because this would mean all parameters of the model except the bias are exactly zero), but the closer we get, the better our classifier will become.

The core problem studied in this section is as follows: We are given some function $$f(x)$$ where $$x$$ is a vector of inputs and we are interested in computing the gradient of $$f$$ at $$x$$ (i.e.

Recall that the primary reason we are interested in this problem is that in the specific case of Neural Networks, $$f$$ will correspond to the loss function ( $$L$$ ) and the inputs $$x$$ will consist of the training data and the neural network weights.

If you are coming to this class and you’re comfortable with deriving gradients with chain rule, we would still like to encourage you to at least skim this section, since it presents a rarely developed view of backpropagation as backward flow in real-valued circuits and any insights you’ll gain may help you throughout the class.

For example, if $$x = 4, y = -3$$ then $$f(x,y) = -12$$ and the derivative on $$x$$ $$\frac{\partial f}{\partial x} = -3$$.

Analogously, since $$\frac{\partial f}{\partial y} = 4$$, we expect that increasing the value of $$y$$ by some very small amount $$h$$ would also increase the output of the function (due to the positive sign), and by $$4h$$.

As mentioned, the gradient $$\nabla f$$ is the vector of partial derivatives, so we have that $$\nabla f = [\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}] = [y, x]$$.

Even though the gradient is technically a vector, we will often use terms such as “the gradient on x” instead of the technically correct phrase “the partial derivative on x” for simplicity.

This makes sense, since increasing either $$x,y$$ would increase the output of $$f$$, and the rate of that increase would be independent of what the actual values of $$x,y$$ are (unlike the case of multiplication above).

$$f$$ is just multiplication of $$q$$ and $$z$$, so $$\frac{\partial f}{\partial q} = z, \frac{\partial f}{\partial z} = q$$, and $$q$$ is addition of $$x$$ and $$y$$ so $$\frac{\partial q}{\partial x} = 1, \frac{\partial q}{\partial y} = 1$$.

This extra multiplication (for each input) due to the chain rule can turn a single and relatively useless gate into a cog in a complex circuit such as an entire neural network.

During the backward pass in which the chain rule is applied recursively backwards through the circuit, the add gate (which is an input to the multiply gate) learns that the gradient for its output was -4.

If we anthropomorphize the circuit as wanting to output a higher value (which can help with intuition), then we can think of the circuit as “wanting” the output of the add gate to be lower (due to negative sign), and with a force of 4.

To continue the recurrence and to chain the gradient, the add gate takes that gradient and multiplies it to all of the local gradients for its inputs (making the gradient on both x and y 1 * -4 = -4).

Notice that this has the desired effect: If x,y were to decrease (responding to their negative gradient) then the add gate’s output would decrease, which in turn makes the multiply gate’s output increase.

Lets look at another expression that illustrates this point: as we will see later in the class, this expression describes a 2-dimensional neuron (with inputs x and weights w) that uses the sigmoid activation function.

In addition to the ones described already above (add, mul, max), there are four more: Where the functions $$f_c, f_a$$ translate the input by a constant of $$c$$ and scale the input by a constant of $$a$$, respectively.

It turns out that the derivative of the sigmoid function with respect to its input simplifies if you perform the derivation (after a fun tricky part where we add and subtract a 1 in the numerator): As we see, the gradient turns out to simplify and becomes surprisingly simple.

The derivation above shows that the local gradient would simply be (1 - 0.73) * 0.73 ~= 0.2, as the circuit computed before (see the image above), except this way it would be done with a single, simple and efficient expression (and with less numerical issues).

Therefore, computing the backprop pass is easy: We’ll go backwards and for every variable along the way in the forward pass (sigy, num, sigx, xpy, xpysqr, den, invden) we will have the same variable, but one that begins with a d, which will hold the gradient of the output of the circuit with respect to that variable.

The forward expression involves the variables x,y multiple times, so when we perform backpropagation we must be careful to use += instead of = to accumulate the gradient on these variables (otherwise we would overwrite it).

Consider this example circuit: Looking at the diagram above as an example, we can see that: The add gate always takes the gradient on its output and distributes it equally to all of its inputs, regardless of what their values were during the forward pass.

This follows from the fact that the local gradient for the add operation is simply +1.0, so the gradients on all inputs will exactly equal the gradients on the output because it will be multiplied by x1.0 (and remain unchanged).

Unlike the add gate which distributed the gradient unchanged to all its inputs, the max gate distributes the gradient (unchanged) to exactly one of its inputs (the input that had the highest value during the forward pass).

In the example circuit above, the max operation routed the gradient of 2.00 to the z variable, which had a higher value than w, and the gradient on w remains zero.

Notice that if one of the inputs to the multiply gate is very small and the other is very big, then the multiply gate will do something slightly unintuitive: it will assign a relatively huge gradient to the small input and a tiny gradient to the large input.

For example, if you multiplied all input data examples $$x_i$$ by 1000 during preprocessing, then the gradient on the weights will be 1000 times larger, and you’d have to lower the learning rate by that factor to compensate.

## A friendly introduction to Backpropagation in Python

My aim here is to test my understanding of Andrej Karpathy’s great blog post “Hacker’s guide to Neural Networks” as well as of Python, to get a hang of which I recently perused through Derek Banas’ awesome commented code expositions.

As someone steeped in R and classical statistical learning methods for structured data, I’m very new to both Python as well as Neural nets, so it is best not to fall into the easy delusions of competence that stem from being able to follow things while reading about them.

Since the output is essentially a function of the inputs as shown before, the sensitivity of the output to each of the inputs is just the partial derivative with respect to that input, let us compute that: The gradient of the multiplicative gate with respect to x, when x=3 and y=4, equals 4, as expected, since del(xy)/del(x) = y, which is 4.

But what if the gate implements the function f(x) where : f(x) = 100 - (x-y)² If we provide 3 and 4 to such a gate, it should try to bring x and y closer together such that the output is maximized.

In the plot below the function is plotted vertically and x and y are the two horizontal axes: Now let’s define this gate and see whether our method of computing gradients and modifying inputs helps in increasing the output of this gate.

Let’s consider a gate that takes x, y, and z as inputs and outputs : f(x, y, z)=(x+y)∗z In fact, this is essentially a combination of two simple gates: an addition gate that takes in x and y, and a multiplication gate that takes in z and the output of the addition gate, say q.

The last two equations above are key: when calculating the gradient of the entire circuit with respect to x (or y) we merely calculate the gradient of the gate q with respect to x (or y) and magnify it by a factor equal to the gradient of the circuit with respect to the output of gate q.

Having defined the gates and units, let us run the forward pass to generate output values: Now let us run the backward pass to decipher the gradient df/dx: Which is the same as what we saw graphically and mathematically before: So now we’ve implemented backprop from scratch for a simple circuit, and seen how to obtain gradients of a larger circuit with respect to individual inputs using the chain rule backwards.

## Why are deep neural networks hard to train?

In practice, when solving circuit design problems (or most any kind of algorithmic problem), we usually start by figuring out how to solve sub-problems, and then gradually integrate the solutions.

See Johan Håstad's 2012 paper On the correlation of parity and small-depth circuits for an account of the early history and references.

On the other hand, if you use deeper circuits it's easy to compute the parity using a small circuit: you just compute the parity of pairs of bits, then use those results to compute the parity of pairs of pairs of bits, and so on, building up quickly to the overall parity.

These simple networks have been remarkably useful: in earlier chapters we used networks like this to classify handwritten digits with better than 98 percent accuracy!

For instance, if we're doing visual pattern recognition, then the neurons in the first layer might learn to recognize edges, the neurons in the second layer could learn to recognize more complex shapes, say triangle or rectangles, built up from edges.

These multiple layers of abstraction seem likely to give deep networks a compelling advantage in learning to solve complex pattern recognition problems.

Moreover, just as in the case of circuits, there are theoretical results suggesting that deep networks are intrinsically more powerful than shallow networks* *For certain problems and network architectures this is proved in On the number of response regions of deep feed forward networks with piece-wise linear activations, by Razvan Pascanu, Guido Montúfar, and Yoshua Bengio (2014).

In this chapter, we'll try training deep networks using our workhorse learning algorithm - stochastic gradient descent by backpropagation.

As per usual, we'll use the MNIST digit classification problem as our playground for learning and experimentation* *I introduced the MNIST problem and data here and here..

If you do wish to follow live, then you'll need Python 2.7, Numpy, and a copy of the code, which you can get by cloning the relevant repository from the command line: git clone https://github.com/mnielsen/neural-networks-and-deep-learning.git

We use 30 hidden neurons, as well as 10 output neurons, corresponding to the 10 possible classifications for the MNIST digits ('0', '1', '2', $\ldots$, '9').

Let's try training our network for 30 complete epochs, using mini-batches of 10 training examples at a time, a learning rate $\eta = 0.1$, and regularization parameter $\lambda = 5.0$.

As we train we'll monitor the classification accuracy on the validation_data* *Note that the networks is likely to take some minutes to train, depending on the speed of your machine.

So if you're running the code you may wish to continue reading and return later, not wait for the code to finish executing.: &gt;&gt;&gt;

We get a classification accuracy of 96.48 percent (or thereabouts - it'll vary a bit from run to run), comparable to our earlier results with a similar configuration.

Certainly, things shouldn't get worse, since the extra layers can, in the worst case, simply do nothing* *See this later problem to understand how to build a hidden layer that does nothing..

Below, I've plotted part of a $[784, 30, 30, 10]$ network, i.e., a network with two hidden layers, each containing $30$ hidden neurons.

A big bar means the neuron's weights and bias are changing rapidly, while a small bar means the weights and bias are changing slowly.

More precisely, the bars denote the gradient $\partial C / \partial b$ for each neuron, i.e., the rate of change of the cost with respect to the neuron's bias.

Back in Chapter 2 we saw that this gradient quantity controlled not just how rapidly the bias changes during learning, but also how rapidly the weights input to the neuron change, too.

Don't worry if you don't recall the details: the thing to keep in mind is simply that these bars show how quickly each neuron's weights and bias are changing as the network learns.

To do this, let's denote the gradient as $\delta^l_j = \partial C / \partial b^l_j$, i.e., the gradient for the $j$th neuron in the $l$th layer* *Back in Chapter 2 we referred to this as the error, but here we'll adopt the informal term 'gradient'.

I say 'informal' because of course this doesn't explicitly include the partial derivatives of the cost with respect to the weights, $\partial C / \partial w$..

We can think of the gradient $\delta^1$ as a vector whose entries determine how quickly the first hidden layer learns, and $\delta^2$ as a vector whose entries determine how quickly the second hidden layer learns.

If we have three hidden layers, in a $[784, 30, 30, 30, 10]$ network, then the respective speeds of learning turn out to be 0.012, 0.060, and 0.283.

This is a bit different than the way we usually train - I've used no mini-batches, and just 1,000 training images, rather than the full 50,000 image training set.

I'm not trying to do anything sneaky, or pull the wool over your eyes, but it turns out that using mini-batch stochastic gradient descent gives much noisier (albeit very similar, when you average away the noise) results.

The phenomenon is known as the vanishing gradient problem* *See Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, by Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, and J&uuml;rgen Schmidhuber (2001).

Unstable gradients in deep neural nets To get insight into why the vanishing gradient problem occurs, let's consider the simplest deep neural network: one with just a single neuron in each layer.

Here's a network with three hidden layers: Here, $w_1, w_2, \ldots$ are the weights, $b_1, b_2, \ldots$ are the biases, and $C$ is some cost function.

Just to remind you how this works, the output $a_j$ from the $j$th neuron is $\sigma(z_j)$, where $\sigma$ is the usual sigmoid activation function, and $z_j = w_{j} a_{j-1}+b_j$ is the weighted input to the neuron.

I've drawn the cost $C$ at the end to emphasize that the cost is a function of the network's output, $a_4$: if the actual output from the network is close to the desired output, then the cost will be low, while if it's far away, the cost will be high.

We have $a_1 = \sigma(z_1) = \sigma(w_1 a_0 + b_1)$, so \begin{eqnarray} \Delta a_1 & \approx & \frac{\partial \sigma(w_1 a_0+b_1)}{\partial b_1} \Delta b_1 \tag{115}\\ & = & \sigma'(z_1) \Delta b_1.

That change $\Delta a_1$ in turn causes a change in the weighted input $z_2 = w_2 a_1 + b_2$ to the second hidden neuron: \begin{eqnarray} \Delta z_2 & \approx & \frac{\partial z_2}{\partial a_1} \Delta a_1 \tag{117}\\ & = & w_2 \Delta a_1.

\tag{118}\end{eqnarray} Combining our expressions for $\Delta z_2$ and $\Delta a_1$, we see how the change in the bias $b_1$ propagates along the network to affect $z_2$: \begin{eqnarray} \Delta z_2 & \approx & \sigma'(z_1) w_2 \Delta b_1.

The end result is an expression relating the final change $\Delta C$ in cost to the initial change $\Delta b_1$ in the bias: \begin{eqnarray} \Delta C & \approx & \sigma'(z_1) w_2 \sigma'(z_2) \ldots \sigma'(z_4) \frac{\partial C}{\partial a_4} \Delta b_1.

\tag{120}\end{eqnarray} Dividing by $\Delta b_1$ we do indeed get the desired expression for the gradient: \begin{eqnarray} \frac{\partial C}{\partial b_1} = \sigma'(z_1) w_2 \sigma'(z_2) \ldots \sigma'(z_4) \frac{\partial C}{\partial a_4}.

Why the vanishing gradient problem occurs: To understand why the vanishing gradient problem occurs, let's explicitly write out the entire expression for the gradient: \begin{eqnarray} \frac{\partial C}{\partial b_1} = \sigma'(z_1) \, w_2 \sigma'(z_2) \, w_3 \sigma'(z_3) \, w_4 \sigma'(z_4) \, \frac{\partial C}{\partial a_4}.

To make this all a bit more explicit, let's compare the expression for $\partial C / \partial b_1$ to an expression for the gradient with respect to a later bias, say $\partial C / \partial b_3$.

Of course, we haven't explicitly worked out an expression for $\partial C / \partial b_3$, but it follows the same pattern described above for $\partial C / \partial b_1$.

That's actually pretty easy to do: all we need do is choose the biases to ensure that the weighted input to each neuron is $z_j = 0$ (and so $\sigma'(z_j) = 1/4$).

The only way to avoid this is if the input activation falls within a fairly narrow range of values (this qualitative explanation is made quantitative in the first problem below).

Show that the set of $a$ satisfying that constraint can range over an interval no greater in width than \begin{eqnarray} \frac{2}{|w|} \ln\left( \frac{|w|(1+\sqrt{1-4/|w|})}{2}-1\right).

\tag{123}\end{eqnarray} (3) Show numerically that the above expression bounding the width of the range is greatest at $|w| And so even given that everything lines up just perfectly, we still have a fairly narrow range of input activations which can avoid the vanishing gradient problem.Identity neuron: Consider a neuron with a single input,$x$, a corresponding weight,$w_1$, a bias$b$, and a weight$w_2$on the output. Show that by choosing the weights and bias appropriately, we can ensure$w_2 \sigma(w_1 x+b) \approx x$for$x \in [0, 1]$. Such a neuron can thus be used as a kind of identity neuron, that is, a neuron whose output is the same (up to rescaling by a weight factor) as its input. Hint: It helps to rewrite$x = 1/2+\Delta$, to assume$w_1$is small, and to use a Taylor series expansion in$w_1 \Delta$. In the earlier chapter on backpropagation we saw that the gradient in the$l$th layer of an$L$layer network is given by: \begin{eqnarray} \delta^l = \Sigma'(z^l) (w^{l+1})^T \Sigma'(z^{l+1}) (w^{l+2})^T \ldots \Sigma'(z^L) \nabla_a C \tag{124}\end{eqnarray} Here,$\Sigma'(z^l)$is a diagonal matrix whose entries are the$\sigma'(z)$values for the weighted inputs to the$l$th layer. I won't comprehensively summarize that work here, but just want to briefly mention a couple of papers, to give you the flavor of some of the questions people are asking. In particular, they found evidence that the use of sigmoids will cause the activations in the final hidden layer to saturate near$0\$ early in training, substantially slowing down learning.

As a second example, in 2013 Sutskever, Martens, Dahl and Hinton* *On the importance of initialization and momentum in deep learning, by Ilya Sutskever, James Martens, George Dahl and Geoffrey Hinton (2013).

The results in the last two paragraphs suggest that there is also a role played by the choice of activation function, the way weights are initialized, and even details of how learning by gradient descent is implemented.

## Deep Learning: Back Propagation

Now the problem that we have to solve is to update weight and biases such that our cost function can be minimised.

An obvious way of doing that is to use the approximation where ϵ&gt;0 is a small positive number, and eᵢ is the unit vector in the iᵗʰ direction.

That means that to compute the gradient we need to compute the cost function a million different times, requiring a million forward passes through the network (per training example).

What’s clever about backpropagation is that it enables us to simultaneously compute all the partial derivatives ∂C/∂wᵢ using just one forward pass through the network, followed by one backward pass through the network.

This expression is still simple enough to differentiate directly, but we’ll take a particular approach to it that will be helpful with understanding the intuition behind back propagation.

During the backward pass in which the chain rule is applied recursively backwards through the circuit, the add gate (which is an input to the multiply gate) learns that the gradient for its output was -4.

If we anthropomorphise the circuit as wanting to output a higher value (which can help with intuition), then we can think of the circuit as “wanting” the output of the add gate to be lower (due to negative sign), and with a force of 4.

To continue the recurrence and to chain the gradient, the add gate takes that gradient and multiplies it to all of the local gradients for its inputs (making the gradient on both x and y 1* -4 = -4).

Notice that this has the desired effect: If x, y were to decrease (responding to their negative gradient) then the add gate’s output would decrease, which in turn makes the multiply gate’s output increase.

Back propagation can thus be thought of as gates communicating to each other (through the gradient signal) whether they want their outputs to increase or decrease (and how strongly), so as to make the final output value higher.

Let’s explicitly write this out in the form of an algorithm: Some approaches to back-propagation take a computational graph and a set of numerical values for the inputs to the graph, then return a set of numerical values describing the gradient at those input values.

GATE 2014 ECE Expression for current i(t) for of given circuit

GATE 2017 Find the operating state of M1 and M2 transistors

002 Simple neural network logical AND table

Also SUBSCRIBE to my new Channel: Best deals on SmartPhone OnePlus 3T (Midnight ..

CB Transistor (Input Characteristics & Early Effect)

Analog Electronics: Common Base Transistor (Input Characteristics & Early Effect) Topics Covered: 1. Input characteristics of common base transistor. 2.

2d curl formula

Here we build up to the formula for computing the two-dimensional curl of a vector field, reasoning through what partial derivative information corresponds to ...

The Simple Essence of Automatic Differentiation

Automatic differentiation (AD) in reverse mode (RAD) is a central component of deep learning and other uses of large-scale optimization. Commonly used RAD ...

Representation of signals in terms of unit step function and ramp function

Representation of signals in terms of unit step function and ramp function. If you have any doubts, use the comments section.

Divergence formula, part 2

Here we finish the line of reasoning which leads to the formula for divergence in two dimensions.

Physics - Thermodynamics: Conduction: Heat Transfer (5 of 20) Double -Pane Window

Visit for more math and science lectures! In this video I will show you how to calculate the power dissipation of a double-pane window

XOR as Perceptron Network Quiz Solution - Georgia Tech - Machine Learning

Watch on Udacity: Check out the full Advanced Operating Systems ..