Hacker's guide to Neural Networks

Javascript allows one to nicely visualize what’s going on and to play around with the various hyperparameter settings, but I still regularly hear from people who ask for a more thorough treatment of the topic.

In my opinion, the best way to think of Neural Networks is as real-valued circuits, where real values (instead of boolean values {0,1}) “flow” along edges and interact in gates.

In math form, we can think of this gate as implementing the real-valued function \( f(x,y) = x y \). As with this example, all of our gates will take one or two inputs and produce a single output value. A Javascript version of this would very simply look something like this:
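
A minimal sketch (the helper name `forwardMultiplyGate` is illustrative):

```javascript
// a single multiply gate: takes two real-valued inputs, produces one output
var forwardMultiplyGate = function(x, y) {
  return x * y;
};
forwardMultiplyGate(-2, 3); // returns -6
```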

Why don’t we tweak x and y randomly and keep track of the tweak that works best? When I run a search like the one sketched below, I get best_x = -1.9928, best_y = 2.9901, and best_out = -5.9588.
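
A minimal sketch of this random local search, assuming the `forwardMultiplyGate` helper from above (the tweak amount and iteration count are illustrative):

```javascript
var x = -2, y = 3; // some initial input values
var tweak_amount = 0.01;
var best_out = -Infinity;
var best_x = x, best_y = y;
for (var k = 0; k < 100; k++) {
  // try a small random tweak of each input
  var x_try = x + tweak_amount * (Math.random() * 2 - 1);
  var y_try = y + tweak_amount * (Math.random() * 2 - 1);
  var out = forwardMultiplyGate(x_try, y_try);
  if (out > best_out) {
    // best improvement yet! keep track of this tweak
    best_out = out;
    best_x = x_try;
    best_y = y_try;
  }
}
```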

This is a perfectly fine strategy for tiny problems with a few gates if you can afford the compute time, but it won’t do if we want to eventually consider huge circuits with millions of inputs.

On the other hand, we’d expect a negative force induced on y that pushes it to become lower (since a lower y, such as y = 2, down from the original y = 3 would make output higher: 2 x -2 = -4, again, larger than -6).
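
This “tweak an input and measure the response” procedure is, in essence, computing the derivative. Concretely, the derivative of the circuit output with respect to x is given by:

\[ \frac{\partial f(x,y)}{\partial x} = \frac{f(x+h,y) - f(x,y)}{h} \]

where \( h \) is a small tweak amount.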

Also, if you’re not very familiar with calculus it is important to note that in the left-hand side of the equation above, the horizontal line does not indicate division.

The entire symbol \( \frac{\partial f(x,y)}{\partial x} \) is a single thing: the derivative of the function \( f(x,y) \) with respect to \( x \).

Anyway, I hope it doesn’t look too scary because it isn’t: The circuit was giving some initial output \( f(x,y) \), and then we changed one of the inputs by a tiny amount \(h \) and read the new output \( f(x+h, y) \).

We turned the knob from x to x + h and the circuit responded by giving a higher value (note again that yes, -5.9997 is higher than -6: -5.9997 > -6).
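
In code, the numerical gradient computation might look something like this sketch (again assuming the `forwardMultiplyGate` helper; variable names are illustrative):

```javascript
var x = -2, y = 3;
var out = forwardMultiplyGate(x, y); // -6
var h = 0.0001;

// derivative with respect to x
var xph = x + h; // -1.9999
var out2 = forwardMultiplyGate(xph, y); // -5.9997
var x_derivative = (out2 - out) / h; // 3.0

// derivative with respect to y
var yph = y + h; // 3.0001
var out3 = forwardMultiplyGate(x, yph); // -6.0002
var y_derivative = (out3 - out) / h; // -2.0
```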

Technically, you want the value of h to be infinitesimal (the precise mathematical definition of the gradient is defined as the limit of the expression as h goes to zero), but in practice h=0.00001 or so works fine in most cases to get a good approximation.

If we nudge the inputs in the direction of the gradient (i.e. we just add the derivative, scaled by a small step size, on top of every input), we can see that the value increases, as expected: we changed the inputs by the gradient and the circuit now gives a slightly higher value (-5.87 > -6).
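
A sketch of this update, continuing from the numerical gradient above (a step_size of 0.01 matches the quoted output):

```javascript
var step_size = 0.01;
var out = forwardMultiplyGate(x, y); // before: -6
x = x + step_size * x_derivative; // x becomes -1.97
y = y + step_size * y_derivative; // y becomes 2.98
var out_new = forwardMultiplyGate(x, y); // -5.87!
```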

Evaluating the gradient requires just three evaluations of the forward pass of our circuit instead of hundreds, and gives the best tug you can hope for (locally) if you are interested in increasing the value of the output.

For example, step_size = 1.0 gives output -1 (higher, better!), and indeed an infinite step size would give infinitely good results.

The gradient guarantees that if you have a very small (indeed, infinitesimally small) step size, then you will definitely get a higher number when you follow its direction, and for that infinitesimally small step size there is no other direction that would have worked better.

But in practice we will have hundreds, thousands or (for neural networks) even tens to hundreds of millions of inputs, and the circuits aren’t just one multiply gate but huge expressions that can be expensive to compute.

You may have seen other people who teach Neural Networks derive the gradient in huge and, frankly, scary and confusing mathematical equations (if you’re not well-versed in maths).

That is because we will only ever derive the gradient for very small and simple expressions (think of it as the base case) and then I will show you how we can compose these very simply with chain rule to evaluate the full gradient (think inductive/recursive case).

We invoked powerful mathematics and can now transform our derivative calculation into the code sketched below. To compute the gradient we went from forwarding the circuit hundreds of times (Strategy #1), to forwarding it on the order of twice the number of inputs (Strategy #2), to forwarding it a single time!
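
A sketch, using the analytic gradient of the multiply gate (for \( f(x,y) = x y \), the derivative with respect to x is y, and vice versa):

```javascript
var x = -2, y = 3;
var out = forwardMultiplyGate(x, y); // before: -6
var x_gradient = y; // by our analysis above: df/dx = y
var y_gradient = x; // df/dy = x

var step_size = 0.01;
x += step_size * x_gradient; // -1.97
y += step_size * y_gradient; // 2.98
var out_new = forwardMultiplyGate(x, y); // -5.87, higher!
```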

And it gets EVEN better, since the more expensive strategies (#1 and #2) only give an approximation of the gradient, while #3 (the fastest one by far) gives you the exact gradient.

That’s because the numerical gradient is very simple to implement (but can be a bit expensive to compute), while the analytic gradient can contain bugs at times, but is usually extremely efficient to compute.

Let’s structure the code as follows to make the gates explicit as functions. In the sketch below, I am using a and b as the local variables in the gate functions so that we don’t get these confused with our circuit inputs x, y, z.
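
A sketch of the circuit f(x,y,z) = (x + y) z, built from an add gate feeding a multiply gate (the example input values are chosen to match the numbers discussed below):

```javascript
var forwardMultiplyGate = function(a, b) {
  return a * b;
};
var forwardAddGate = function(a, b) {
  return a + b;
};
var forwardCircuit = function(x, y, z) {
  var q = forwardAddGate(x, y);      // the intermediate wire
  var f = forwardMultiplyGate(q, z); // the circuit output
  return f;
};

var x = -2, y = 5, z = -4;
var f = forwardCircuit(x, y, z); // output is -12
```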

If we don’t worry about x and y but only about q and z, then we are back to having only a single gate, and as far as that single * gate is concerned, we know what the (analytic) derivatives are from the previous section.
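
Putting this together with the chain rule, a sketch of the full gradient computation for this circuit (the variable names follow the derivative_f_wrt_* pattern referenced in the text below):

```javascript
// initial conditions
var x = -2, y = 5, z = -4;
var q = forwardAddGate(x, y);      // q is 3
var f = forwardMultiplyGate(q, z); // output is -12

// gradient of the MULTIPLY gate with respect to its inputs
var derivative_f_wrt_z = q; // 3
var derivative_f_wrt_q = z; // -4

// gradient of the ADD gate with respect to its inputs
var derivative_q_wrt_x = 1.0;
var derivative_q_wrt_y = 1.0;

// chain rule: multiply the local gradient by the gradient from above
var derivative_f_wrt_x = derivative_q_wrt_x * derivative_f_wrt_q; // -4
var derivative_f_wrt_y = derivative_q_wrt_y * derivative_f_wrt_q; // -4
```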

“Pulling” upwards on this output value induced a force on both q and z: to increase the output value, the circuit “wants” z to increase, as can be seen by the positive value of the derivative (derivative_f_wrt_z = +3).

The multiplication by -4 seen in the chain rule achieves exactly this: instead of applying a positive force of +1 on both x and y (the local derivative), the full circuit’s gradient on both x and y becomes 1 x -4 = -4.

The only difference between the case of a single gate and multiple interacting gates that compute arbitrarily complex expressions is this additional multiply operation that now happens in each gate.

For example, the + gate always takes the gradient on top and simply passes it on to all of its inputs (notice the example with -4 simply passed on to both of the inputs of + gate).

Since the gradient of max(x,y) with respect to its input is +1 for whichever one of x, y is larger and 0 for the other, this gate is during backprop effectively just a gradient “switch”: it will take the gradient from above and “route” it to the input that had a higher value during the forward pass.
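
A minimal sketch of this routing behavior (here `dout` stands for the gradient coming from above; the names and values are illustrative):

```javascript
var x = 2, y = -3, dout = 1.0; // example values
var out = Math.max(x, y); // forward pass: 2
// backward pass: route the gradient from above to the input that was larger
var dx = (x >= y) ? dout : 0.0; // 1.0
var dy = (y > x)  ? dout : 0.0; // 0.0
```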

It’s best thought of as a “squashing function”, because it takes the input and squashes it to be between zero and one: very negative values are squashed towards zero and positive values get squashed towards one.

The sigmoid function is defined as:

\[ \sigma(x) = \frac{1}{1 + e^{-x}} \]

The gradient with respect to its single input, as you can check on Wikipedia or derive yourself if you know some calculus, is given by this expression:

\[ \frac{\partial \sigma(x)}{\partial x} = \sigma(x) \left( 1 - \sigma(x) \right) \]

For example, if the input to the sigmoid gate is x = 3, the gate will compute output f = 1.0 / (1.0 + Math.exp(-x)) = 0.95, and then the (local) gradient on its input will simply be dx = (0.95) * (1 - 0.95) = 0.0475.

Another thing to note is that technically, the sigmoid function is made up of an entire series of gates in a line that compute more atomic functions: an exponentiation gate, an addition gate and a division gate.

Treating it so would work perfectly fine but for this example I chose to collapse all of these gates into a single gate that just computes sigmoid in one shot, because the gradient expression turns out to be simple.

If you’re not familiar with Javascript, all that’s going on in the sketch below is that I’m defining a class that has certain properties (accessed with use of the this keyword), and some methods (which in Javascript are placed into the function’s prototype).
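
A sketch of this structure, with a Unit object representing a wire (carrying a value and a gradient) and the multiply gate as a class; the u0, u1 field names match the references in the text below, while the utop name for the output unit is an assumption:

```javascript
// every Unit corresponds to a wire and carries two numbers: the value
// computed in the forward pass, and the gradient (the "pull") that
// flows back along the wire in the backward pass
var Unit = function(value, grad) {
  this.value = value;
  this.grad = grad;
};

var multiplyGate = function() { };
multiplyGate.prototype = {
  forward: function(u0, u1) {
    // store pointers to the input units and produce the output unit
    this.u0 = u0;
    this.u1 = u1;
    this.utop = new Unit(u0.value * u1.value, 0.0);
    return this.utop;
  },
  backward: function() {
    // take the gradient in the output unit and chain it with the local
    // gradients (+= accumulates gradients when a wire branches out)
    this.u0.grad += this.u1.value * this.utop.grad;
    this.u1.grad += this.u0.value * this.utop.grad;
  }
};
```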

Then notice that in the backward function call we get the gradient from the output unit we produced during the forward pass (which will by now hopefully have its gradient filled in) and multiply it with the local gradient for this gate (chain rule!).

This gate computes multiplication (u0.value * u1.value) during forward pass, so recall that the gradient w.r.t u0 is u1.value and w.r.t u1 is u0.value.

This will allow us to possibly use the output of one gate multiple times (think of it as a wire branching out), since it turns out that the gradients from these different branches just add up when computing the final gradient with respect to the circuit output.

To fully specify everything, let’s finally write out the forward and backward flow for our 2-dimensional neuron with some example values, as sketched below. To compute the gradient, we simply iterate in reverse order and call the backward function of each gate!
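
A sketch for the neuron f = sig(a*x + b*y + c), reusing the Unit and multiplyGate from above and adding an addGate and sigmoidGate written in the same style (these class names and the exact layout are assumptions following that pattern):

```javascript
var addGate = function() { };
addGate.prototype = {
  forward: function(u0, u1) {
    this.u0 = u0; this.u1 = u1;
    this.utop = new Unit(u0.value + u1.value, 0.0);
    return this.utop;
  },
  backward: function() {
    // the local gradient of addition is 1.0 for both inputs
    this.u0.grad += 1.0 * this.utop.grad;
    this.u1.grad += 1.0 * this.utop.grad;
  }
};

var sigmoidGate = function() { };
sigmoidGate.prototype = {
  sig: function(x) { return 1 / (1 + Math.exp(-x)); },
  forward: function(u0) {
    this.u0 = u0;
    this.utop = new Unit(this.sig(u0.value), 0.0);
    return this.utop;
  },
  backward: function() {
    // local gradient s * (1 - s), chained with the gradient from above
    var s = this.sig(this.u0.value);
    this.u0.grad += (s * (1 - s)) * this.utop.grad;
  }
};

// input units
var a = new Unit(1.0, 0.0);
var b = new Unit(2.0, 0.0);
var c = new Unit(-3.0, 0.0);
var x = new Unit(-1.0, 0.0);
var y = new Unit(3.0, 0.0);

// the gates
var mulg0 = new multiplyGate();
var mulg1 = new multiplyGate();
var addg0 = new addGate();
var addg1 = new addGate();
var sg0 = new sigmoidGate();

// forward pass
var ax = mulg0.forward(a, x);          // a*x = -1
var by = mulg1.forward(b, y);          // b*y = 6
var axpby = addg0.forward(ax, by);     // a*x + b*y = 5
var axpbypc = addg1.forward(axpby, c); // a*x + b*y + c = 2
var s = sg0.forward(axpbypc);          // sig(2) = 0.8808

// backward pass: set the pull on the output to +1 and iterate in reverse
s.grad = 1.0;
sg0.backward();   // writes gradient into axpbypc
addg1.backward(); // writes gradients into axpby and c
addg0.backward(); // writes gradients into ax and by
mulg1.backward(); // writes gradients into b and y
mulg0.backward(); // writes gradients into a and x
// a.grad = -0.105, b.grad = 0.315, c.grad = 0.105, x.grad = 0.105, y.grad = 0.210
```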

Finally, let’s verify that we implemented the backpropagation correctly by checking the numerical gradient, as sketched below. Indeed, these all give the same values as the backpropagated gradients [-0.105, 0.315, 0.105, 0.105, 0.210].
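
One way to check is to evaluate the circuit as a single fast function and tweak each input by a small h (the helper name forwardCircuitFast is illustrative):

```javascript
var forwardCircuitFast = function(a, b, c, x, y) {
  return 1 / (1 + Math.exp(-(a * x + b * y + c)));
};
var a = 1, b = 2, c = -3, x = -1, y = 3;
var h = 0.0001;
// numerical gradient: tweak each input in turn and measure the response
var a_grad = (forwardCircuitFast(a + h, b, c, x, y) - forwardCircuitFast(a, b, c, x, y)) / h;
var b_grad = (forwardCircuitFast(a, b + h, c, x, y) - forwardCircuitFast(a, b, c, x, y)) / h;
var c_grad = (forwardCircuitFast(a, b, c + h, x, y) - forwardCircuitFast(a, b, c, x, y)) / h;
var x_grad = (forwardCircuitFast(a, b, c, x + h, y) - forwardCircuitFast(a, b, c, x, y)) / h;
var y_grad = (forwardCircuitFast(a, b, c, x, y + h) - forwardCircuitFast(a, b, c, x, y)) / h;
// approximately [-0.105, 0.315, 0.105, 0.105, 0.210]
```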

I hope it is clear that even though we only looked at an example of a single neuron, the code I gave above generalizes in a very straightforward way to compute gradients of arbitrary expressions (including very deep expressions #foreshadowing).

All you have to do is write small gates that compute local, simple derivatives w.r.t their inputs, wire it up in a graph, do a forward pass to compute the output value and then a backward pass that chains the gradients all the way to the input.

Our first example was the * gate. In the sketch below, I’m assuming that the variable dx is given, coming from somewhere above us in the circuit while we’re doing backprop (or it is +1 by default otherwise).
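
A minimal sketch (example values added so the snippet runs):

```javascript
var a = 3, b = -4, dx = 1.0; // dx is the gradient from above (+1 by default)
var x = a * b; // forward pass
// backward pass: the * gate's local gradients, chained with dx
var da = b * dx;
var db = a * dx;
```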

If you remember the backward flow diagram, the + gate simply takes the gradient on top and routes it equally to all of its inputs (because its local gradient is always simply 1.0 for all its inputs, regardless of their actual values).
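
The same pattern for the + gate:

```javascript
var a = 3, b = -4, dx = 1.0;
var x = a + b; // forward pass
// backward pass: the local gradient is 1.0 for every input
var da = 1.0 * dx;
var db = 1.0 * dx;
```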

So we can do it much faster and simply write da = dx; db = dx. Okay, how about combining gates, e.g. x = a * b + c? The backprop is sketched below; if you don’t see how it follows, introduce a temporary variable q = a * b and then compute x = q + c to convince yourself.
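
A sketch:

```javascript
var a = 3, b = -4, c = 2, dx = 1.0;
var x = a * b + c; // forward pass
// backward pass: the * gate's local gradients chained with dx;
// the + with c just passes the gradient through
var da = b * dx;
var db = a * dx;
var dc = dx;
```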

In other words, nothing changes. In fact, if you know your power rule from calculus you would also know that if you have \( f(a) = a^2 \) then \( \frac{\partial f(a)}{\partial a} = 2a \), which is exactly what we get if we think of it as a wire splitting up and being two inputs to a gate, as sketched below.
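
A sketch of the squared case, where the gradient contributions from the two branches add up:

```javascript
var a = 3, dx = 1.0;
var x = a * a; // forward pass: a is used twice (the wire splits)
// backward pass: each branch contributes a*dx, and the contributions add
var da = a * dx + a * dx; // i.e. 2*a*dx, matching the power rule
```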

Let’s do another one. Okay, now let’s start to get more complex: when cases like this come up in practice, I like to split the expression into manageable chunks which are almost always composed of simpler expressions, and then chain them together with the chain rule:
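
For instance, here is a hypothetical compound expression split into chunks (the expression and values are chosen for illustration):

```javascript
var a = 2, b = -1, c = 3, d = 0.5, dx = 1.0;
// forward pass, in manageable chunks
var x1 = a * b + c; // 1
var x2 = x1 * d;    // 0.5
var x = x2 * x2;    // 0.25, i.e. ((a*b + c) * d)^2

// backward pass, chaining the local gradients in reverse
var dx2 = 2 * x2 * dx; // local gradient of squaring, chained with dx
var dx1 = d * dx2;
var dd = x1 * dx2;
var da = b * dx1;
var db = a * dx1;
var dc = dx1;
```

That wasn’t too difficult!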

Here are a few more functions and their local gradients that are useful in practice; for example, here’s what division might look like (sketched below). Hopefully you see that we are breaking down expressions, doing the forward pass, and then for every variable (such as a) we derive its gradient da as we go backwards, one by one, applying the simple local gradients and chaining them with gradients from above.
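
A division sketch:

```javascript
var a = 2, dx = 1.0;
var x = 1.0 / a; // forward pass: 0.5
// backward pass: the local gradient of 1/a is -1/a^2, chained with dx
var da = (-1.0 / (a * a)) * dx; // -0.25
```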

Everything we’ve done in this chapter comes down to this: we saw that we can feed some input through an arbitrarily complex real-valued circuit, tug at the end of the circuit with some force, and backpropagation distributes that tug through the entire circuit all the way back to the inputs.

In the last chapter we were concerned with real-valued circuits that computed possibly complex expressions of their inputs (the forward pass), and we could also compute the gradients of these expressions on the original inputs (backward pass).

This is a silly toy example, but in practice a +1/-1 dataset could be very useful indeed: for example, spam/no-spam emails, where the vectors somehow measure various features of the content of the email, such as the number of times certain enhancement drugs are mentioned.

We will eventually build up to entire neural networks and complex expressions, but let’s start out simple and train a linear classifier very similar to the single neuron we saw at the end of Chapter 1.

The only difference is that we’ll get rid of the sigmoid because it makes things unnecessarily complicated (I only used it as an example in Chapter 1 because sigmoid neurons are historically popular but modern Neural Networks rarely, if ever, use sigmoid non-linearities).

For example, if a = 1, b = -2, c = -1, then the function will take the first datapoint ([1.2, 0.7]) and output 1 * 1.2 + (-2) * 0.7 + (-1) = -1.2.
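
In code, the classifier function might look like this (parameter values from the example above):

```javascript
var a = 1, b = -2, c = -1; // parameters
var f = function(x, y) {
  return a * x + b * y + c; // the score
};
f(1.2, 0.7); // -1.2
```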

Over time, the pulls on these parameters will tune these values in such a way that the function outputs high scores for positive examples and low scores for negative examples.

At this point, if you’ve seen an explanation of SVMs you’re probably expecting me to define the SVM loss function and plunge into an explanation of slack variables, geometrical intuitions of large margins, kernels, duality, etc.

Let’s write the SVM code and take advantage of the circuit machinery we have from Chapter 1: a circuit that simply computes a*x + b*y + c and can also compute the gradient. A condensed sketch follows.
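
A condensed sketch, bypassing the Unit/gate classes for brevity: since the circuit is just score = a*x + b*y + c, its gradients with respect to a, b, c are x, y, and 1 respectively. The class layout, method names, and random initialization here are illustrative:

```javascript
var SVM = function() {
  // random initial parameter values
  this.a = Math.random() - 0.5;
  this.b = Math.random() - 0.5;
  this.c = Math.random() - 0.5;
};
SVM.prototype = {
  forward: function(x, y) {
    // the circuit: score = a*x + b*y + c
    this.x = x; this.y = y;
    this.score = this.a * x + this.b * y + this.c;
    return this.score;
  },
  backward: function(label) { // label is +1 or -1
    // compute the pull on the score based on what the label wants
    var pull = 0.0;
    if (label === 1 && this.score < 1) pull = 1;    // score too low: pull up
    if (label === -1 && this.score > -1) pull = -1; // score too high: pull down
    // chain the pull through the circuit (d(score)/da = x, etc.),
    // adding a regularization pull that drags a and b toward zero
    this.da = this.x * pull - this.a;
    this.db = this.y * pull - this.b;
    this.dc = 1 * pull;
  },
  parameterUpdate: function() {
    var step_size = 0.01;
    this.a += step_size * this.da;
    this.b += step_size * this.db;
    this.c += step_size * this.dc;
  },
  learnFrom: function(x, y, label) {
    this.forward(x, y);     // forward pass (sets this.score)
    this.backward(label);   // backward pass (sets this.da, db, dc)
    this.parameterUpdate(); // nudge parameters in the direction of the pulls
  }
};
```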

Now let’s train the SVM with Stochastic Gradient Descent, as sketched below. We see that initially our classifier only had 33% training accuracy, but by the end all training examples are correctly classified as the parameters a, b, c adjusted their values according to the pulls we exerted.
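
A sketch of the training loop; the six-point toy dataset (its first point [1.2, 0.7] matches the example above, the rest are illustrative) and the iteration counts are assumptions:

```javascript
// a toy dataset of 2-D points and their +1/-1 labels
var data = [], labels = [];
data.push([1.2, 0.7]);   labels.push(1);
data.push([-0.3, -0.5]); labels.push(-1);
data.push([3.0, 0.1]);   labels.push(1);
data.push([-0.1, -1.0]); labels.push(-1);
data.push([-1.0, 1.1]);  labels.push(-1);
data.push([2.1, -3.0]);  labels.push(1);

var svm = new SVM();

// fraction of training examples the current parameters classify correctly
var evalTrainingAccuracy = function() {
  var num_correct = 0;
  for (var i = 0; i < data.length; i++) {
    var predicted = svm.forward(data[i][0], data[i][1]) > 0 ? 1 : -1;
    if (predicted === labels[i]) num_correct++;
  }
  return num_correct / data.length;
};

// stochastic gradient descent: repeatedly pick a random example and learn from it
for (var iter = 0; iter < 400; iter++) {
  var i = Math.floor(Math.random() * data.length);
  svm.learnFrom(data[i][0], data[i][1], labels[i]);
  if (iter % 25 === 0) {
    console.log('training accuracy at iteration ' + iter + ': ' + evalTrainingAccuracy());
  }
}
```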

For a negative data point that gets a score of +100, its influence on our classifier will be relatively minor, because we will only pull with a force of -1 regardless of how bad the mistake was.

Of interest is the fact that an SVM is just a particular type of a very simple circuit (a circuit that computes score = a*x + b*y + c, where a, b, c are weights and x, y are data points).

The forward pass will look like the sketch below. The specification is a 2-layer Neural Network with 3 hidden neurons (n1, n2, n3) that uses a Rectified Linear Unit (ReLU) non-linearity on each hidden neuron.
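
A sketch (the parameter names a1 … d4 and the random initialization are one plausible layout; every hidden neuron has its own weights and bias, and a4, b4, c4, d4 belong to the output neuron):

```javascript
var rand = function() { return Math.random() - 0.5; }; // small random init
var a1 = rand(), b1 = rand(), c1 = rand();
var a2 = rand(), b2 = rand(), c2 = rand();
var a3 = rand(), b3 = rand(), c3 = rand();
var a4 = rand(), b4 = rand(), c4 = rand(), d4 = rand();
var x = 1.2, y = 0.7; // an example data point

var n1 = Math.max(0, a1 * x + b1 * y + c1); // activation of 1st hidden neuron (ReLU)
var n2 = Math.max(0, a2 * x + b2 * y + c2); // 2nd neuron
var n3 = Math.max(0, a3 * x + b3 * y + c3); // 3rd neuron
var score = a4 * n1 + b4 * n2 + c4 * n3 + d4; // the final score
```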

But for now, I hope your takeaway is that a 2-layer Neural Net is really not such a scary thing: we write a forward pass expression, interpret the value at the end as a score, and then we pull on that value in a positive or negative direction depending on what we want that value to be for our current particular example.

We are given a dataset of \( N \) examples \( (x_{i0}, x_{i1}) \) and their corresponding labels \( y_{i} \) which are allowed to be either \( +1/-1 \) for positive or negative example respectively.
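
One standard way to write the cost being described here is as a hinge loss over the data plus an L2 regularization term on the (non-bias) weights:

\[ L = \sum_{i=1}^{N} \max\left(0, -y_{i} \left( w_0 x_{i0} + w_1 x_{i1} + w_2 \right) + 1 \right) + \alpha \left[ w_0^2 + w_1^2 \right] \]

where \( w_0, w_1 \) are the weights, \( w_2 \) is the bias, and \( \alpha \) is the regularization strength.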

Due to the regularization term the cost will never actually become zero (because this would mean all parameters of the model except the bias are exactly zero), but the closer we get, the better our classifier will become.

The core problem studied in this section is as follows: we are given some function \(f(x)\) where \(x\) is a vector of inputs, and we are interested in computing the gradient of \(f\) at \(x\) (i.e. \( \nabla f(x) \)).

Recall that the primary reason we are interested in this problem is that in the specific case of Neural Networks, \(f\) will correspond to the loss function ( \(L\) ) and the inputs \(x\) will consist of the training data and the neural network weights.

If you are coming to this class and you’re comfortable with deriving gradients with chain rule, we would still like to encourage you to at least skim this section, since it presents a rarely developed view of backpropagation as backward flow in real-valued circuits and any insights you’ll gain may help you throughout the class.

For example, if \(x = 4, y = -3\) then \(f(x,y) = -12\), and the derivative on \(x\) is \(\frac{\partial f}{\partial x} = -3\).

Analogously, since \(\frac{\partial f}{\partial y} = 4\), we expect that increasing the value of \(y\) by some very small amount \(h\) would also increase the output of the function (due to the positive sign), and by \(4h\).

As mentioned, the gradient \(\nabla f\) is the vector of partial derivatives, so we have that \(\nabla f = [\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}] = [y, x]\).

Even though the gradient is technically a vector, we will often use terms such as “the gradient on x” instead of the technically correct phrase “the partial derivative on x” for simplicity.

This makes sense, since increasing either \(x,y\) would increase the output of \(f\), and the rate of that increase would be independent of what the actual values of \(x,y\) are (unlike the case of multiplication above).

\(f\) is just multiplication of \(q\) and \(z\), so \(\frac{\partial f}{\partial q} = z, \frac{\partial f}{\partial z} = q\), and \(q\) is addition of \(x\) and \(y\) so \( \frac{\partial q}{\partial x} = 1, \frac{\partial q}{\partial y} = 1 \).

This extra multiplication (for each input) due to the chain rule can turn a single and relatively useless gate into a cog in a complex circuit such as an entire neural network.

During the backward pass in which the chain rule is applied recursively backwards through the circuit, the add gate (which is an input to the multiply gate) learns that the gradient for its output was -4.

If we anthropomorphize the circuit as wanting to output a higher value (which can help with intuition), then we can think of the circuit as “wanting” the output of the add gate to be lower (due to negative sign), and with a force of 4.

To continue the recurrence and to chain the gradient, the add gate takes that gradient and multiplies it to all of the local gradients for its inputs (making the gradient on both x and y 1 * -4 = -4).

Notice that this has the desired effect: If x,y were to decrease (responding to their negative gradient) then the add gate’s output would decrease, which in turn makes the multiply gate’s output increase.

Let’s look at another expression that illustrates this point: as we will see later in the class, this expression describes a 2-dimensional neuron (with inputs \(x\) and weights \(w\)) that uses the sigmoid activation function:

\[ f(w,x) = \frac{1}{1 + e^{-\left( w_0 x_0 + w_1 x_1 + w_2 \right)}} \]

In addition to the ones described already above (add, mul, max), there are four more:

\[ f(x) = \frac{1}{x} \hspace{0.4in} \rightarrow \hspace{0.4in} \frac{df}{dx} = -\frac{1}{x^2} \]

\[ f_c(x) = c + x \hspace{0.4in} \rightarrow \hspace{0.4in} \frac{df}{dx} = 1 \]

\[ f(x) = e^x \hspace{0.4in} \rightarrow \hspace{0.4in} \frac{df}{dx} = e^x \]

\[ f_a(x) = ax \hspace{0.4in} \rightarrow \hspace{0.4in} \frac{df}{dx} = a \]

where the functions \(f_c, f_a\) translate the input by a constant \(c\) and scale the input by a constant \(a\), respectively.

It turns out that the derivative of the sigmoid function with respect to its input simplifies if you perform the derivation (after a fun tricky part where we add and subtract a 1 in the numerator):

\[ \frac{d\sigma(x)}{dx} = \frac{e^{-x}}{\left(1+e^{-x}\right)^2} = \left( \frac{1 + e^{-x} - 1}{1 + e^{-x}} \right) \left( \frac{1}{1+e^{-x}} \right) = \left( 1 - \sigma(x) \right) \sigma(x) \]

As we see, the gradient turns out to be surprisingly simple.

The derivation above shows that the local gradient would simply be (1 - 0.73) * 0.73 ~= 0.2, as the circuit computed before (see the image above), except this way it would be done with a single, simple and efficient expression (and with less numerical issues).
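
In code, with example weights and inputs (the values w = [2.0, -3.0, -3.0] and x = [-1.0, -2.0] are assumptions chosen to reproduce the 0.73 output mentioned above):

```javascript
var w = [2.0, -3.0, -3.0]; // example weights (w[2] is the bias)
var x = [-1.0, -2.0];      // example inputs

// forward pass
var dot = w[0] * x[0] + w[1] * x[1] + w[2]; // 1.0
var f = 1.0 / (1 + Math.exp(-dot));         // sigmoid: ~0.73

// backward pass, using the simplified sigmoid gradient
var ddot = (1 - f) * f;                          // ~0.2, gradient on the dot product
var dx = [w[0] * ddot, w[1] * ddot];             // gradients on the inputs
var dw = [x[0] * ddot, x[1] * ddot, 1.0 * ddot]; // gradients on the weights
```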

Therefore, computing the backprop pass is easy: we’ll go backwards and for every variable along the way in the forward pass (sigy, num, sigx, xpy, xpysqr, den, invden) we will have the same variable, but one that begins with a d, which will hold the gradient of the output of the circuit with respect to that variable (see the sketch below).
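
Judging from the variable names, the expression being computed here is \( f(x,y) = \frac{x + \sigma(y)}{\sigma(x) + (x+y)^2} \); a sketch of its staged forward pass and the corresponding backward pass:

```javascript
var sig = function(x) { return 1.0 / (1 + Math.exp(-x)); };
var x = 3, y = -4; // example inputs

// forward pass, in stages
var sigy = sig(y);       // sigma(y)
var num = x + sigy;      // numerator
var sigx = sig(x);       // sigma(x)
var xpy = x + y;
var xpysqr = xpy * xpy;
var den = sigx + xpysqr; // denominator
var invden = 1.0 / den;
var f = num * invden;    // done!

// backward pass, mirroring the forward stages in reverse
var df = 1.0;
var dnum = invden * df;                    // backprop f = num * invden
var dinvden = num * df;
var dden = (-1.0 / (den * den)) * dinvden; // backprop invden = 1/den
var dsigx = 1.0 * dden;                    // backprop den = sigx + xpysqr
var dxpysqr = 1.0 * dden;
var dxpy = (2 * xpy) * dxpysqr;            // backprop xpysqr = xpy^2
var dx = 1.0 * dxpy;                       // backprop xpy = x + y
var dy = 1.0 * dxpy;
dx += (sigx * (1 - sigx)) * dsigx;         // backprop sigx = sig(x); note +=
dx += 1.0 * dnum;                          // backprop num = x + sigy; note +=
var dsigy = 1.0 * dnum;
dy += (sigy * (1 - sigy)) * dsigy;         // note +=
```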

The forward expression involves the variables x,y multiple times, so when we perform backpropagation we must be careful to use += instead of = to accumulate the gradient on these variables (otherwise we would overwrite it).

Consider an example circuit that combines add, max and multiply gates. Looking at its backward flow, we can see that the add gate always takes the gradient on its output and distributes it equally to all of its inputs, regardless of what their values were during the forward pass.

This follows from the fact that the local gradient for the add operation is simply +1.0, so the gradients on all inputs will exactly equal the gradient on the output, because it is multiplied by 1.0 (and remains unchanged).

Unlike the add gate which distributed the gradient unchanged to all its inputs, the max gate distributes the gradient (unchanged) to exactly one of its inputs (the input that had the highest value during the forward pass).

In the example circuit above, the max operation routed the gradient of 2.00 to the z variable, which had a higher value than w, and the gradient on w remains zero.

Notice that if one of the inputs to the multiply gate is very small and the other is very big, then the multiply gate will do something slightly unintuitive: it will assign a relatively huge gradient to the small input and a tiny gradient to the large input.

For example, if you multiplied all input data examples \(x_i\) by 1000 during preprocessing, then the gradient on the weights will be 1000 times larger, and you’d have to lower the learning rate by that factor to compensate.
