# A Neural Network in 11 lines of Python (Part 1)


Summary: I learn best with toy code that I can play with.

This tutorial teaches backpropagation via a very simple toy example: a short Python implementation.

A neural network trained with backpropagation attempts to use input to predict output.
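Since the whole tutorial orbits one short listing, here is a sketch of such a two-layer network in numpy (a hedged reconstruction, not necessarily the exact listing being dissected; the names X, y, syn0, l0, l1, l1_error, l1_delta follow those used below):

```python
import numpy as np

def nonlin(x, deriv=False):
    # sigmoid activation; with deriv=True, returns the slope given a
    # value that has already been passed through the sigmoid
    if deriv:
        return x * (1 - x)
    return 1 / (1 + np.exp(-x))

# input: 4 training examples, 3 features each
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
# output: one label per training example
y = np.array([[0, 0, 1, 1]]).T

np.random.seed(1)                        # deterministic "random" weights
syn0 = 2 * np.random.random((3, 1)) - 1  # weights connecting l0 to l1

for _ in range(10000):
    l0 = X                                        # layer 0: the input data
    l1 = nonlin(np.dot(l0, syn0))                 # layer 1: the prediction
    l1_error = y - l1                             # how much did we miss?
    l1_delta = l1_error * nonlin(l1, deriv=True)  # error scaled by confidence
    syn0 += np.dot(l0.T, l1_delta)                # full-batch weight update
```

After enough iterations the predictions in l1 approach y, which is the behavior the rest of the tutorial unpacks line by line.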

Consider trying to predict the output column given the three input columns.

We could solve this problem by simply measuring statistics between the input values and the output values.

Before I describe processes, I recommend playing around with the code to get an intuitive feel for how it works.

Here are some good places to look in the code:

- Compare l1 after the first iteration and after the last iteration.

One of the desirable properties of a sigmoid function is that its output can be used to create its derivative.

If you're unfamiliar with derivatives, just think of the derivative as the slope of the sigmoid function at a given point (as you can see above, different points have different slopes).
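That property can be checked numerically (a small sketch; sigmoid_output_to_derivative is an illustrative helper name, not from the original listing):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_output_to_derivative(out):
    # the derivative sigma'(x) = sigma(x) * (1 - sigma(x)) can be
    # computed from the sigmoid's *output* alone
    return out * (1 - out)

x = 0.7
out = sigmoid(x)
slope = sigmoid_output_to_derivative(out)

# compare against a finite-difference estimate of the slope at x
h = 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
```

The two slope values agree to many decimal places, which is why the code can reuse the already-computed layer outputs during backpropagation.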

In this case, I generated the dataset horizontally (with a single row and 4 columns) for space.

Your numbers will still be randomly distributed, but they'll be randomly distributed in exactly the same way each time you train.

Another way of looking at it is that l0 is of size 3 and l1 is of size 1.

Thus, we want to connect every node in l0 to every node in l1, which requires a matrix of dimensionality (3,1).

This for loop 'iterates' multiple times over the training code to optimize our network to the dataset.

(4 x 3) dot (3 x 1) = (4 x 1). Matrix multiplication is ordered, such that the dimensions in the middle of the equation must be the same.

The final matrix generated is thus the number of rows of the first matrix and the number of columns of the second matrix. Since we loaded in 4 training examples, we ended up with 4 guesses for the correct answer, a (4 x 1) matrix.
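The shape arithmetic can be verified directly (a minimal sketch):

```python
import numpy as np

l0 = np.ones((4, 3))     # 4 training examples, 3 inputs each
syn0 = np.ones((3, 1))   # connects 3 input nodes to 1 output node
l1 = np.dot(l0, syn0)    # (4 x 3) dot (3 x 1) -> (4 x 1): one guess per example
```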

l1_error is just a vector of positive and negative numbers reflecting how much the network missed. If l1 represents these three dots, the code above generates the slopes of the lines below.

Notice that very high values such as x=2.0 (green dot) and very low values such as x=-1.0 (purple dot) have rather shallow slopes.

When we multiply the 'slopes' by the error, we are reducing the error of high confidence predictions.

If the slope was really shallow (close to 0), then the network either had a very high value, or a very low value.

We update these 'wishy-washy' predictions most heavily, and we tend to leave the confident ones alone by multiplying them by a number close to 0.

However, because we're using a 'full batch' configuration, we're doing the above step on all four training examples.

It computes the weight updates for each weight for each training example, sums them, and updates the weights, all in a simple line.

Takeaways: So, now that we've looked at how the network updates, let's look back at our training data and reflect.

Thus, in our four training examples below, the weight from the first input to the output would consistently increment or remain unchanged, whereas the other two weights would find themselves both increasing and decreasing across training examples (cancelling out progress).

Consider trying to predict the output column given the two input columns.

Each column has a 50% chance of predicting a 1 and a 50% chance of predicting a 0.

This is considered a 'nonlinear' pattern because there isn't a direct one-to-one relationship between the input and output.

If one had 100 identically sized images of pipes and bicycles, no individual pixel position would directly correlate with the presence of a bicycle or pipe.

However, certain combinations of pixels are not random, namely the combination that forms the image of a bicycle or a person.

In order to first combine pixels into something that can then have a one-to-one relationship with the output, we need to add another layer.

Our first layer will combine the inputs, and our second layer will then map them to the output using the output of the first layer as input.

If we randomly initialize our weights, we will get hidden state values for layer 1.

The second column (second hidden node), has a slight correlation with the output already!

What the training below is going to do is amplify that correlation. (Arguably, that's the only way that neural networks train.)

It's both going to update syn1 to map it to the output, and update syn0 to be better at producing it from the input!

Note: The field of adding more layers to model more combinations of relationships such as this is known as 'deep learning' because of the increasingly deep layers being modeled.

```python
# fragment of the 3-layer network's listing; intervening lines elided.
# The first row of X and the values of y are reconstructed from context.
X = np.array([[0,0,1],
              [0,1,1],
              [1,0,1],
              [1,1,1]])
y = np.array([[0],
              [1],
              [1],
              [0]])

l2_error = y - l2
print("Error: " + str(np.mean(np.abs(l2_error))))

# how much did each l1 value contribute to the l2 error (according to the weights)?
```

- X: Input dataset matrix where each row is a training example
- y: Output dataset matrix where each row is a training example
- l2: Final layer of the network, which is our hypothesis, and should approximate the correct answer as we train
- syn0: First layer of weights, Synapse 0, connecting l0 to l1
- syn1: Second layer of weights, Synapse 1, connecting l1 to l2

Weighting l2_delta by the weights in syn1, we can calculate the error in the middle/hidden layer.

This is the l1 error of the network scaled by the confidence.

Line 43: uses the 'confidence weighted error' from l2 to establish an error for l1.

This gives what you could call a 'contribution weighted error' because we learn how much each node value in l1 'contributed' to the error in l2.
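That 'contribution weighted error' step might be sketched as follows (the array contents are illustrative stand-ins, assuming 4 training examples and 4 hidden nodes):

```python
import numpy as np

np.random.seed(1)
l2_delta = np.random.random((4, 1))   # stand-in confidence-weighted error at l2
syn1 = np.random.random((4, 1))       # weights connecting l1 (4 nodes) to l2 (1 node)

# send the error back across syn1: how much did each l1 node
# contribute to the error in l2, according to the weights?
l1_error = l2_delta.dot(syn1.T)       # (4 x 1) dot (1 x 4) -> (4 x 4)
```

Each row of l1_error assigns a share of one training example's l2 error to each of the four hidden nodes.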

If you want to be able to create arbitrary architectures based on new academic papers or read and understand sample code for these different architectures, I think that it's a killer exercise.

I worked with neural networks for a couple years before performing this exercise, and it was the best investment of time I've made in the field (and it didn't take long).

Machine Learning expertise is one of the most valuable skills in the job market today, and there are many firms looking for practitioners.





Artificial neural networks are statistical learning models, inspired by biological neural networks (central nervous systems, such as the brain), that are used in machine learning.

Screenshot taken from this great introductory video, which trains a neural network to predict a test score based on hours spent studying and sleeping the night before.

Training a neural network basically means calibrating all of the “weights” by repeating two key steps, forward propagation and back propagation.

Since neural networks are great for regression, the best input data are numbers (as opposed to discrete values, like colors or movie genres, whose data is better for statistical classification models).

Next, we’ll walk through a simple example of training a neural network to function as an “Exclusive or” (“XOR”) operation to illustrate each step in the training process.

The purpose of the activation function is to transform the input signal into an output signal; activation functions are necessary for neural networks to model complex non-linear patterns that simpler models might miss.

Then, we sum the product of the hidden layer results with the second set of weights (also determined at random the first time around) to determine the output sum.

Calculating the incremental change to these weights happens in two steps:

1. We find the margin of error of the output result (what we get after applying the activation function) to back out the necessary change in the output sum (we call this delta output sum).
2. We extract the change in weights by multiplying delta output sum by the hidden layer results.

To calculate the necessary change in the output sum, or delta output sum, we take the derivative of the activation function and apply it to the output sum.

Since the output sum margin of error is the difference in the result, we can simply multiply that with the rate of change to give us the delta output sum:
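With a sigmoid activation, that calculation can be sketched as follows (the target and output-sum values here are illustrative, not the tutorial's exact numbers):

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def sigmoid_prime(x):
    # rate of change of the sigmoid at x
    s = sigmoid(x)
    return s * (1 - s)

target = 0.0                  # desired output
output_sum = 1.235            # example pre-activation output sum
output = sigmoid(output_sum)  # result after the activation function

margin_of_error = target - output
# delta output sum = derivative of the activation at the output sum,
# multiplied by the output margin of error
delta_output_sum = sigmoid_prime(output_sum) * margin_of_error
```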

Now that we have the proposed change in the output layer sum (-0.13), let’s use this in the derivative of the output sum function to determine the new change in weights.

Instead of deriving for output sum, let’s derive for hidden result as a function of output sum to ultimately find out delta hidden sum:

All of the pieces in the above equation can be calculated, so we can determine the delta hidden sum: Once we get the delta hidden sum, we calculate the change in weights between the input and hidden layer by dividing it with the input data, (1, 1).

The input data here is equivalent to the hidden results in the earlier back propagation process to determine the change in the hidden-to-output weights.

Here is the derivation of that relationship, similar to the one before. Doing the math gives the new weights, which can be compared side by side with the initial random starting weights. Once we arrive at the adjusted weights, we start again with forward propagation.

Check out this short video for a great explanation of identifying global minima in a cost function as a way to determine necessary weight changes.

In the section on linear classification we computed scores for different visual categories given the image using the formula $$s = W x$$, where $$W$$ was a matrix and $$x$$ was an input column vector containing all pixel data of the image.

In the case of CIFAR-10, $$x$$ is a [3072x1] column vector, and $$W$$ is a [10x3072] matrix, so that the output scores is a vector of 10 class scores.

There are several choices we could make for the non-linearity (which we’ll study below), but this one is a common choice and simply thresholds all activations that are below zero to zero.

Notice that the non-linearity is critical computationally - if we left it out, the two matrices could be collapsed to a single matrix, and therefore the predicted class scores would again be a linear function of the input.

A three-layer neural network could analogously look like $$s = W_3 \max(0, W_2 \max(0, W_1 x))$$, where all of $$W_3, W_2, W_1$$ are parameters to be learned.

The area of Neural Networks has originally been primarily inspired by the goal of modeling biological neural systems, but has since diverged and become a matter of engineering and achieving good results in Machine Learning tasks.

Approximately 86 billion neurons can be found in the human nervous system and they are connected with approximately 10^14 - 10^15 synapses.

The idea is that the synaptic strengths (the weights $$w$$) are learnable and control the strength of influence (and its direction: excitatory (positive weight) or inhibitory (negative weight)) of one neuron on another.

Based on this rate code interpretation, we model the firing rate of the neuron with an activation function $$f$$, which represents the frequency of the spikes along the axon.

Historically, a common choice of activation function is the sigmoid function $$\sigma$$, since it takes a real-valued input (the signal strength after the sum) and squashes it to range between 0 and 1.

In other words, each neuron performs a dot product with the input and its weights, adds the bias, and applies the non-linearity (or activation function), in this case the sigmoid $$\sigma(x) = 1/(1+e^{-x})$$.
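A sketch of that single-neuron forward propagation (the class and method names here are illustrative):

```python
import math
import numpy as np

class Neuron:
    def __init__(self, weights, bias):
        self.weights = np.asarray(weights)
        self.bias = bias

    def forward(self, inputs):
        # dot product of input and weights, plus bias ...
        cell_body_sum = np.sum(np.asarray(inputs) * self.weights) + self.bias
        # ... squashed by the sigmoid activation function
        firing_rate = 1.0 / (1.0 + math.exp(-cell_body_sum))
        return firing_rate

n = Neuron(weights=[0.5, -0.6], bias=0.1)
rate = n.forward([1.0, 2.0])   # a "firing rate" between 0 and 1
```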

As we saw with linear classifiers, a neuron has the capacity to “like” (activation near one) or “dislike” (activation near zero) certain linear regions of its input space.

With this interpretation, we can formulate the cross-entropy loss as we have seen in the Linear Classification section, and optimizing it would lead to a binary Softmax classifier (also known as logistic regression).

The regularization loss in both SVM/Softmax cases could in this biological view be interpreted as gradual forgetting, since it would have the effect of driving all synaptic weights $$w$$ towards zero after every parameter update.

The sigmoid non-linearity has the mathematical form $$\sigma(x) = 1 / (1 + e^{-x})$$ and is shown in the image above on the left.

The sigmoid function has seen frequent use historically since it has a nice interpretation as the firing rate of a neuron: from not firing at all (0) to fully-saturated firing at an assumed maximum frequency (1).

Also note that the tanh neuron is simply a scaled sigmoid neuron, in particular the following holds: $$\tanh(x) = 2 \sigma(2x) -1$$.
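That identity is easy to verify numerically:

```python
import numpy as np

def sigma(x):
    return 1 / (1 + np.exp(-x))

x = np.linspace(-3, 3, 101)
lhs = np.tanh(x)
rhs = 2 * sigma(2 * x) - 1     # the scaled-sigmoid form of tanh
gap = np.max(np.abs(lhs - rhs))  # should be at machine precision
```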

Other types of units have been proposed that do not have the functional form $$f(w^Tx + b)$$ where a non-linearity is applied on the dot product between the weights and the data.

TLDR: “What neuron type should I use?” Use the ReLU non-linearity, be careful with your learning rates and possibly monitor the fraction of “dead” units in a network.

For regular neural networks, the most common layer type is the fully-connected layer in which neurons between two adjacent layers are fully pairwise connected, but neurons within a single layer share no connections.

To give you some context, modern Convolutional Networks contain on the order of 100 million parameters and are usually made up of approximately 10-20 layers (hence deep learning).

The full forward pass of this 3-layer neural network is then simply three matrix multiplications, interwoven with the application of the activation function; W1, W2, W3, b1, b2, b3 are the learnable parameters of the network.
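A sketch of that forward pass (the layer sizes and parameter shapes are chosen for illustration; f is the activation function):

```python
import numpy as np

f = lambda x: 1.0 / (1.0 + np.exp(-x))   # activation function (sigmoid here)

np.random.seed(0)
x = np.random.randn(3, 1)                # input vector (3x1)
W1, b1 = np.random.randn(4, 3), np.random.randn(4, 1)
W2, b2 = np.random.randn(4, 4), np.random.randn(4, 1)
W3, b3 = np.random.randn(1, 4), np.random.randn(1, 1)

h1 = f(np.dot(W1, x) + b1)               # first hidden layer (4x1)
h2 = f(np.dot(W2, h1) + b2)              # second hidden layer (4x1)
out = np.dot(W3, h2) + b3                # output score (1x1)
```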

Notice also that instead of having a single input column vector, the variable x could hold an entire batch of training data (where each input example would be a column of x) and then all examples would be efficiently evaluated in parallel.

Neural Networks work well in practice because they compactly express nice, smooth functions that fit well with the statistical properties of data we encounter in practice, and are also easy to learn using our optimization algorithms (e.g. gradient descent).

Similarly, the fact that deeper networks (with multiple hidden layers) can work better than single-hidden-layer networks is an empirical observation, despite the fact that their representational power is equal.

As an aside, in practice it is often the case that 3-layer neural networks will outperform 2-layer nets, but going even deeper (4,5,6-layer) rarely helps much more.

We could train three separate neural networks, each with one hidden layer of some size, and compare the resulting classifiers: Neural Networks with more neurons can express more complicated functions.

For example, the model with 20 hidden neurons fits all the training data but at the cost of segmenting the space into many disjoint red and green decision regions.

The subtle reason behind this is that smaller networks are harder to train with local methods such as Gradient Descent: It's clear that their loss functions have relatively few local minima, but it turns out that many of these minima are easier to converge to, and that they are bad (i.e. with high loss).

Conversely, bigger neural networks contain significantly more local minima, but these minima turn out to be much better in terms of their actual loss.

In practice, what you find is that if you train a small network the final loss can display a good amount of variance - in some cases you get lucky and converge to a good place but in some cases you get trapped in one of the bad minima.

## A visual proof that neural nets can compute any function

One of the most striking facts about neural networks is that they can compute any function at all.

No matter what the function, there is guaranteed to be a neural network so that for every possible input, $x$, the value $f(x)$ (or some close approximation) is output from the network.

For instance, here's a network computing a function with $m = 3$ inputs and $n = 2$ outputs:

What's more, this universality theorem holds even if we restrict our networks to have just a single layer intermediate between the input and the output neurons - a so-called single hidden layer.

For instance, one of the original papers proving the result is "Approximation by superpositions of a sigmoidal function", by George Cybenko (1989). Another important early paper is "Multilayer feedforward networks are universal approximators", by Kurt Hornik, Maxwell Stinchcombe, and Halbert White (1989).

Again, that can be thought of as computing a function (actually, computing one of many functions, since there are often many acceptable translations of a given piece of text).

Or consider the problem of taking an mp4 movie file and generating a description of the plot of the movie, and a discussion of the quality of the acting.

Two caveats Before explaining why the universality theorem is true, I want to mention two caveats to the informal statement 'a neural network can compute any function'.

To make this statement more precise, suppose we're given a function $f(x)$ which we'd like to compute to within some desired accuracy $\epsilon > 0$.

The guarantee is that by using enough hidden neurons we can always find a neural network whose output $g(x)$ satisfies $|g(x) - f(x)| < \epsilon$ for all inputs $x$. If a function is discontinuous, i.e., makes sudden, sharp jumps, then it won't in general be possible to approximate using a neural net.

Summing up, a more precise statement of the universality theorem is that neural networks with a single hidden layer can be used to approximate any continuous function to any desired precision. In this chapter we'll actually prove a slightly weaker version of this result, using two hidden layers instead of one. In the problems I'll briefly outline how the explanation can, with a few tweaks, be adapted to give a proof which uses only a single hidden layer.

## Universality with one input and one output

To understand why the universality theorem is true, let's start by understanding how to construct a neural network which approximates a function with just one input and one output. To build insight into how to construct a network to compute $f$, let's start with a network containing just a single hidden layer, with two hidden neurons, and an output layer containing a single output neuron.

As we learnt earlier in the book, what's being computed by the hidden neuron is $\sigma(wx + b)$, where $\sigma(z) \equiv 1/(1+e^{-z})$ is the sigmoid function. But for the proof of universality we will obtain more insight by ignoring the algebra entirely, and instead manipulating and observing the shape shown in the graph. This won't just give us a better feel for what's going on, it will also give us a proof (strictly speaking, the visual approach taken here isn't what's traditionally thought of as a proof). Occasionally, there will be small gaps in the reasoning I present: places where I make a visual argument that is plausible, but not quite rigorous.

We can simplify our analysis quite a bit by increasing the weight so much that the output really is a step function, to a very good approximation.

It's easy to analyze the sum of a bunch of step functions, but rather more difficult to reason about what happens when you add up a bunch of sigmoid shaped curves.

With a little work you should be able to convince yourself that the position of the step is proportional to $b$, and inversely proportional to $w$.

It will greatly simplify our lives to describe hidden neurons using just a single parameter, $s$, which is the step position, $s = -b/w$.

As noted above, we've implicitly set the weight $w$ on the input to be some large value - big enough that the step function is a very good approximation.

We can easily convert a neuron parameterized in this way back into the conventional model, by choosing the bias $b = -w s$.
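As a quick sanity check (a sketch, not from the original text), here's how $\sigma(wx + b)$ behaves for a large weight, with the bias chosen as $b = -ws$ to put the step at $s$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Pick a step point s and a large weight w; the bias is then b = -w*s.
s, w = 0.4, 1000.0
b = -w * s

x = np.array([0.3, 0.39, 0.41, 0.5])
print(sigmoid(w * x + b))
# Well below s the output is ~0; well above s it is ~1 - a step function.
```

With $w = 1000$ the transition region around $s$ is only a few thousandths wide, which is why treating the neuron's output as an exact step is such a good approximation.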

In particular, we'll suppose the hidden neurons are computing step functions parameterized by step points $s_1$ (top neuron) and $s_2$ (bottom neuron).

Here, $a_1$ and $a_2$ are the outputs from the top and bottom hidden neurons, respectively* *Note, by the way, that the output from the whole network is $\sigma(w_1 a_1+w_2 a_2 + b)$, where $b$ is the bias on the output neuron.

We're going to focus on the weighted output from the hidden layer right now, and only later will we think about how that relates to the output from the whole network.

You'll see that the graph changes shape when this happens, since we have moved from a situation where the top hidden neuron is the first to be activated to a situation where the bottom hidden neuron is the first to be activated.

Similarly, try manipulating the step point $s_2$ of the bottom hidden neuron, and get a feel for how this changes the combined output from the hidden neurons.

You'll notice, by the way, that we're using our neurons in a way that can be thought of not just in graphical terms, but in more conventional programming terms, as a kind of if-then-else statement, e.g.:

In particular, we can divide the interval $[0, 1]$ up into a large number, $N$, of subintervals, and use $N$ pairs of hidden neurons to set up peaks of any desired height.
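That pairing can be sketched in code (the helper names here are mine, not the chapter's): a pair of step neurons with step points $s_1 < s_2$ and output weights $+h$ and $-h$ produces a "bump" of height $h$ on the subinterval $[s_1, s_2]$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bump(x, s1, s2, h, w=1000.0):
    """Pair of step neurons: a +h step at s1 and a -h step at s2
    combine into a bump of height h on [s1, s2]."""
    return h * sigmoid(w * (x - s1)) - h * sigmoid(w * (x - s2))

x = np.linspace(0, 1, 11)
print(bump(x, 0.3, 0.6, 0.8))
# ~0.8 for x between 0.3 and 0.6, ~0 elsewhere
```

Stacking $N$ such pairs, one per subinterval, gives a piecewise-constant weighted output whose heights we are free to choose.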

Apologies for the complexity of the diagram: I could hide the complexity by abstracting away further, but I think it's worth putting up with a little complexity, for the sake of getting a more concrete feel for how these networks work.

I didn't say it at the time, but what I plotted is actually the function \begin{eqnarray} f(x) = 0.2+0.4 x^2+0.3x \sin(15 x) + 0.05 \cos(50 x), \tag{113}\end{eqnarray} plotted over $x$ from $0$ to $1$, and with the $y$ axis taking values from $0$ to $1$.

The solution is to design a neural network whose hidden layer has a weighted output given by $\sigma^{-1} \circ f(x)$, where $\sigma^{-1}$ is just the inverse of the $\sigma$ function.

If we can do this, then the output from the network as a whole will be a good approximation to $f(x)$* *Note that I have set the bias on the output neuron to $0$.
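Concretely, $\sigma^{-1}$ is the logit function. A minimal sketch (my helper names, assuming $f$ takes values strictly inside $(0,1)$, as the example function above does):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigma_inv(y):
    # Inverse of the sigmoid (the logit), valid for 0 < y < 1.
    return np.log(y / (1.0 - y))

def f(x):
    # The example goal function from the text (equation 113).
    return 0.2 + 0.4*x**2 + 0.3*x*np.sin(15*x) + 0.05*np.cos(50*x)

x = 0.5
# If the hidden layer's weighted output equals sigma_inv(f(x)), the
# output neuron (with bias 0) maps it back to f(x):
assert abs(sigmoid(sigma_inv(f(x))) - f(x)) < 1e-12
```

The round trip $\sigma(\sigma^{-1}(y)) = y$ is exactly why targeting $\sigma^{-1} \circ f$ at the hidden layer yields $f$ at the output.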

How well you're doing is measured by the average deviation between the goal function and the function the network is actually computing.

It's only a coarse approximation, but we could easily do much better, merely by increasing the number of pairs of hidden neurons, allowing more bumps.

So, for instance, for the second hidden neuron $s = 0.2$ becomes $b = -1000 \times 0.2 = -200$.

So, for instance, the value chosen for the first $h$ means that the output weights from the top two hidden neurons are $h$ and $-h$, respectively.

Just as in our earlier discussion, as the input weight gets larger the output approaches a step function.

Here, we assume the weight on the $x$ input has some large value - I've used $w_1 = 1000$ - and the weight $w_2 = 0$.

Of course, it's also possible to get a step function in the $y$ direction, by making the weight on the $y$ input very large (say, $w_2 = 1000$), and the weight on the $x$ input equal to $0$, i.e., $w_1 = 0$:
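Both directional steps can be sketched in code (helper names are mine): a two-input neuron with a large weight on one input and zero weight on the other steps in that input's direction only:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def step_x(x, y, s, w=1000.0):
    """Step in the x direction: large weight on x, zero weight on y."""
    return sigmoid(w * x + 0.0 * y - w * s)

def step_y(x, y, s, w=1000.0):
    """Step in the y direction: zero weight on x, large weight on y."""
    return sigmoid(0.0 * x + w * y - w * s)

print(step_x(0.7, 0.1, 0.5))  # ~1: x is past the step point, y is ignored
print(step_y(0.7, 0.1, 0.5))  # ~0: y is below the step point, x is ignored
```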

The number on the neuron is again the step point, and in this case the little $y$ above the number reminds us that the step is in the $y$ direction.

But do keep in mind that the little $y$ marker implicitly tells us that the $y$ weight is large, and the $x$ weight is $0$.

That reminds us that they're producing $y$ step functions, not $x$ step functions, and so the weight is very large on the $y$ input, and zero on the $x$ input, not vice versa.

If we choose the threshold appropriately - say, a value of $3h/2$, which is sandwiched between the height of the plateau and the height of the central tower - we could squash the plateau down to zero, and leave just the tower standing.

This is a bit tricky, so if you think about this for a while and remain stuck, here are two hints: (1) To get the output neuron to show the right kind of if-then-else behaviour, we need the input weights (all $h$ or $-h$) to be large;

Even for this relatively modest value of $h$, we get a pretty good tower function.
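The construction can be sketched as follows (a toy illustration with parameters of my choosing, not the chapter's interactive values): add an $x$-direction bump and a $y$-direction bump of height $h$ each, then threshold the sum at $3h/2$ in the output neuron:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bump(t, s1, s2, h, w=1000.0):
    # Pair of step neurons producing a bump of height h on [s1, s2].
    return h * (sigmoid(w * (t - s1)) - sigmoid(w * (t - s2)))

def tower(x, y, h=10.0, w_out=10.0):
    """Rectangular tower on [0.4, 0.6] x [0.4, 0.6]. The x-bump plus
    y-bump gives 2h inside the rectangle, h on the side strips, and 0
    elsewhere; thresholding at 3h/2 squashes the plateau, leaving the
    tower."""
    combined = bump(x, 0.4, 0.6, h) + bump(y, 0.4, 0.6, h)
    return sigmoid(w_out * (combined - 1.5 * h))

print(tower(0.5, 0.5))  # ~1 inside the rectangle
print(tower(0.5, 0.1))  # ~0 on the plateau (only the x-bump is active)
```

The larger $h$ (and the output weight) is, the more sharply the threshold separates the $2h$ tower from the $h$ plateau.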

To make the respective roles of the two sub-networks clear I've put them in separate boxes, below: each box computes a tower function, using the technique described above.

In particular, by making the weighted output from the second hidden layer a good approximation to $\sigma^{-1} \circ f$, we ensure the output from our network will be a good approximation to any desired function, $f$.

The $s_1, t_1$ and so on are step points for neurons - that is, all the weights in the first layer are large, and the biases are set to give the step points $s_1, t_1, s_2, \ldots$.

Of course, such a function can be regarded as just $n$ separate real-valued functions, $f^1(x_1, \ldots, x_m), f^2(x_1, \ldots, x_m)$, and so on.

As a hint, try working in the case of just two input variables, and showing that: (a) it's possible to get step functions not just in the $x$ or $y$ directions, but in an arbitrary direction;

(b) by adding up many of the constructions from part (a) it's possible to approximate a tower function which is circular in shape, rather than rectangular;

To do part (c) it may help to use ideas from a bit later in this chapter.

Recall that in a sigmoid neuron the inputs $x_1, x_2, \ldots$ result in the output $\sigma(\sum_j w_j x_j + b)$, where $w_j$ are the weights, $b$ is the bias, and $\sigma$ is the sigmoid function:

That is, we'll assume that if our neuron has inputs $x_1, x_2, \ldots$, weights $w_1, w_2, \ldots$ and bias $b$, then the output is $s(\sum_j w_j x_j + b)$.

It should be pretty clear that if we add all these bump functions up we'll end up with a reasonable approximation to $\sigma^{-1} \circ f(x)$, except within the windows of failure.

Suppose that instead of using the approximation just described, we use a set of hidden neurons to compute an approximation to half our original goal function, i.e., to $\sigma^{-1} \circ f(x) / 2$.

And suppose we use another set of hidden neurons to compute an approximation to $\sigma^{-1} \circ f(x)/ 2$, but with the bases of the bumps shifted by half the width of a bump:

Although the result isn't directly useful in constructing networks, it's important because it takes off the table the question of whether any particular function is computable using a neural network.

As argued in Chapter 1, deep networks have a hierarchical structure which makes them particularly well adapted to learn the hierarchies of knowledge that seem to be useful in solving real-world problems.

Put more concretely, when attacking problems such as image recognition, it helps to use a system that understands not just individual pixels, but also increasingly more complex concepts: from edges to simple geometric shapes, all the way up through complex, multi-object scenes.

In later chapters, we'll see evidence suggesting that deep networks do a better job than shallow networks at learning such hierarchies of knowledge.

Universality tells us that neural networks can compute any function; and empirical evidence suggests that deep networks are the networks best adapted to learn the functions useful in solving many real-world problems.

What is a Neural Network - Ep. 2 (Deep Learning SIMPLIFIED)

With plenty of machine learning tools currently available, why would you ever choose an artificial neural network over all the rest? This clip and the next could open your eyes to their awesome...

Neural Network Calculation (Part 1): Feedforward Structure

In this series we will see how a neural network actually calculates its values. This first video takes a look at the structure of a feedforward neural network.

Neural Networks (1): Basics

The basic form of a feed-forward multi-layer perceptron / neural network; example activation functions.

But what *is* a Neural Network? | Chapter 1, deep learning


Backpropagation Neural Network - How it Works e.g. Counting

Here's a small backpropagation neural network that counts and an example and an explanation for how it works, how it learns. A neural network is a tool in artificial intelligence that learns...


What is backpropagation really doing? | Chapter 3, deep learning

What's actually happening to a neural network as it learns?

Build a Neural Net in 4 Minutes

How does a neural network work? It's the basis of deep learning and the reason why image recognition, chatbots, self-driving cars, and language translation work! In this video, I'll use python...

An Old Problem - Ep. 5 (Deep Learning SIMPLIFIED)

If deep neural networks are so powerful, why aren't they used more often? The reason is that they are very difficult to train due to an issue known as the vanishing gradient. Deep Learning...