# AI News, A Neural Network in 13 lines of Python (Part 2 - Gradient Descent)

- On Sunday, June 3, 2018

## A Neural Network in 13 lines of Python (Part 2 - Gradient Descent)

Summary: I learn best with toy code that I can play with.

This tutorial teaches gradient descent through a very simple toy example: a short Python implementation.

Followup Post: I intend to write a followup post to this one adding popular features leveraged by state-of-the-art approaches (likely Dropout, DropConnect, and Momentum).

    layer_1_delta = layer_2_delta.dot(synapse_1.T) * (layer_1 * (1-layer_1))

Backpropagation moves the error information from the end of the network to all the weights inside the network so that a different algorithm can optimize those weights to fit our data.

In this tutorial, we will walk through Gradient Descent, which is arguably the simplest and most widely used neural network optimization algorithm.

By learning about Gradient Descent, we will then be able to improve our toy neural network through parameterization and tuning, and ultimately make it a lot more powerful.

In our case, the ball is optimizing its position (from left to right) to find the lowest point in the bucket.

The only information it has is the slope of the side of the bucket at its current position, pictured below with the blue line.

As it gets closer and closer to the bottom, it takes smaller and smaller steps until the slope equals zero, at which point it stops.
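The ball's behavior can be sketched in a few lines of Python. This is a minimal illustration rather than code from the original post: the bucket shape `f(x) = x**2`, the starting position, and the step multiplier `alpha` are all assumptions chosen for the example.

```python
def slope(x):
    # Slope of an assumed bucket shape f(x) = x**2 at position x.
    return 2.0 * x

def roll_ball(x, alpha=0.1, steps=100):
    """Move against the slope; each step is proportional to the slope."""
    path = [x]
    for _ in range(steps):
        x = x - alpha * slope(x)
        path.append(x)
    return path

path = roll_ball(x=5.0)
step_sizes = [abs(b - a) for a, b in zip(path, path[1:])]
print("final position:", path[-1])
print("first step:", step_sizes[0], "last step:", step_sizes[-1])
```

Because each step is the slope times a constant, the steps shrink on their own as the ball nears the bottom; no separate stopping rule is needed in this toy setting.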

What makes this problem so destructive is that overshooting this far means we land at an EVEN STEEPER slope in the opposite direction.
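A small sketch of this runaway behavior, again assuming the bucket shape `f(x) = x**2` and a deliberately oversized step multiplier (both are illustrative choices, not values from the post):

```python
def overshoot(x, alpha, steps):
    """Gradient steps on f(x) = x**2 with a step size that is too large."""
    positions = [x]
    for _ in range(steps):
        x = x - alpha * 2.0 * x  # 2x is the slope at x
        positions.append(x)
    return positions

positions = overshoot(x=1.0, alpha=1.1, steps=5)
for p in positions:
    print("position %+.4f  slope %+.4f" % (p, 2.0 * p))
```

Each update lands on the opposite side of the bucket at a point where the slope is larger in magnitude than before, so the next jump is bigger still.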

Sometimes your bucket has a funny shape, and following the slope doesn't take you to the absolute lowest point.

There are a myriad of ways in which randomness is used to overcome getting stuck in a local minimum.

For example, if a ball randomly falls within the blue domain, it will converge to the blue minimum.

This is far better than pure random searching, which has to randomly try EVERY space (which could easily be millions of places on this black line depending on the granularity).

Parameterizing this size allows the neural network user to potentially try thousands (or tens of billions) of different local minima in a single neural network.

We can search the entire black line above with (in theory) only 5 balls and a handful of iterations.
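The idea of several balls can be sketched as follows. Everything specific here is an assumption made for illustration: the bumpy curve `error(x)`, the step size, and the five starting positions, which are evenly spaced stand-ins for random drops so that the run is reproducible.

```python
import math

def error(x):
    # An assumed bumpy curve with several local minima.
    return math.sin(3.0 * x) + 0.1 * x * x

def slope(x):
    # Derivative of the assumed curve above.
    return 3.0 * math.cos(3.0 * x) + 0.2 * x

def descend(x, alpha=0.01, steps=2000):
    """Plain gradient descent from a single starting position."""
    for _ in range(steps):
        x -= alpha * slope(x)
    return x

starts = [-4.0, -2.0, 0.0, 2.0, 4.0]   # five "balls"
finals = [descend(x) for x in starts]
best = min(finals, key=error)
for s, f in zip(starts, finals):
    print("start %+.1f -> minimum at %+.3f (error %.3f)" % (s, f, error(f)))
print("best minimum found at", round(best, 3))
```

Each ball only explores its own basin, but comparing the final errors across balls picks out the lowest basin that any of them reached.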

The current state-of-the-art approaches to avoiding hidden nodes coming up with the same answer (by searching the same space) are Dropout and DropConnect, which I intend to cover in a later post.

The ball just drops right into an instant local minimum and ignores the big picture.

So, if we computed the network's error for every possible value of a single weight, it would generate the curve you see above.

We would then pick the value of the single weight that has the lowest error (the lowest part of the curve).

Thus, the x dimension is the value of the weight and the y dimension is the neural network's error when the weight is at that position.
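A brute-force version of that curve can be sketched in plain Python. The tiny dataset, the fixed value of the second weight, and the 0.1 sweep granularity are assumptions for illustration only.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Assumed toy data: the output simply copies the first input.
X = [[0, 1], [0, 1], [1, 0], [1, 0]]
y = [0, 0, 1, 1]

def network_error(w0, w1):
    """Mean absolute error of a single sigmoid layer with weights (w0, w1)."""
    total = 0.0
    for (x0, x1), target in zip(X, y):
        total += abs(sigmoid(x0 * w0 + x1 * w1) - target)
    return total / len(X)

fixed_w1 = 0.0                                     # assumed value of the other weight
candidates = [i / 10.0 for i in range(-100, 101)]  # -10.0 ... 10.0
curve = [network_error(w0, fixed_w1) for w0 in candidates]
best_w0 = candidates[curve.index(min(curve))]
print("lowest error %.4f at w0 = %.1f" % (min(curve), best_w0))
```

On this particular data the curve keeps falling as the weight grows, so the lowest point sits at the edge of the swept range rather than in the middle of a bucket.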

Let's take a look at what this process looks like in a simple 2 layer neural network.

    import numpy as np

    # input dataset
    X = np.array([[0,1],
                  [0,1],
                  [1,0],
                  [1,0]])

    # output dataset
    y = np.array([[0,0,1,1]]).T

In this case, we have a single error at the output (single value), which is computed on line 35.

If we take that logic and plot the overall error (a single scalar representing the network error over the entire dataset) for every possible set of weights (from -10 to 10 for x and y), it looks something like this.

It really is as simple as computing every possible set of weights, and the error that the network generates at each set.
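Here is a minimal sketch of that exhaustive computation for a network with two weights. The toy inputs and targets, the use of mean absolute error, and the integer grid spacing are assumptions for the example, not details confirmed by the text above.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

X = [[0, 1], [0, 1], [1, 0], [1, 0]]   # assumed toy inputs
y = [0, 0, 1, 1]                       # assumed toy targets

def network_error(w0, w1):
    """Mean absolute error over the whole dataset for one pair of weights."""
    return sum(abs(sigmoid(x0 * w0 + x1 * w1) - t)
               for (x0, x1), t in zip(X, y)) / len(X)

grid = range(-10, 11)   # integer steps are an assumed granularity
surface = {(w0, w1): network_error(w0, w1) for w0 in grid for w1 in grid}
best = min(surface, key=surface.get)
print("grid points evaluated:", len(surface))
print("lowest error %.5f at weights %s" % (surface[best], best))
```

Even with only two weights and a coarse grid this takes hundreds of evaluations, and the count multiplies with every additional weight, which is why following the slope is preferred over enumerating the surface.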

Now that we have seen how our neural network leverages Gradient Descent, we can improve our network to overcome these weaknesses in the same way that we improved Gradient Descent in Part 3 (the 3 problems and solutions).

As described above, the alpha parameter reduces the size of each iteration's update in the simplest way possible.

At the very last minute, right before we update the weights, we multiply the weight update by alpha (usually between 0 and 1, thus reducing the size of the weight update).

We're going to jump back to our 3 layer neural network from the first post and add in an alpha parameter at the appropriate place.

Then, we're going to run a series of experiments to align all the intuition we developed around alpha with its behavior in live code.

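    import numpy as np

    # NOTE: the setup (alphas, first dataset row, y, seed, iteration count) is cut off
    # in this excerpt and is assumed here.
    alphas = [0.001, 0.01, 0.1, 1, 10, 100, 1000]

    # compute sigmoid nonlinearity
    def sigmoid(x):
        return 1 / (1 + np.exp(-x))

    # convert output of sigmoid function to its derivative
    def sigmoid_output_to_derivative(output):
        return output * (1 - output)

    X = np.array([[0,0,1],
                  [0,1,1],
                  [1,0,1],
                  [1,1,1]])

    y = np.array([[0], [1], [1], [0]])

    for alpha in alphas:
        print("\nTraining With Alpha:" + str(alpha))
        np.random.seed(1)

        # randomly initialize our weights with mean 0
        synapse_0 = 2 * np.random.random((3, 4)) - 1
        synapse_1 = 2 * np.random.random((4, 1)) - 1

        for j in range(60000):

            # Feed forward through layers 0, 1, and 2
            layer_0 = X
            layer_1 = sigmoid(np.dot(layer_0, synapse_0))
            layer_2 = sigmoid(np.dot(layer_1, synapse_1))

            # how much did we miss the target value?
            layer_2_error = layer_2 - y

            if (j % 10000) == 0:
                print("Error after " + str(j) + " iterations:" + str(np.mean(np.abs(layer_2_error))))

            # in what direction is the target value?
            layer_2_delta = layer_2_error * sigmoid_output_to_derivative(layer_2)

            # how much did each l1 value contribute to the l2 error (according to the weights)?
            layer_1_error = layer_2_delta.dot(synapse_1.T)

            # in what direction is the target l1?
            layer_1_delta = layer_1_error * sigmoid_output_to_derivative(layer_1)

            synapse_1 -= alpha * (layer_1.T.dot(layer_2_delta))
            synapse_0 -= alpha * (layer_0.T.dot(layer_1_delta))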

Alpha = 10: Perhaps you were surprised that an alpha greater than 1 achieved the best score after only 10,000 iterations!

This means that with the smaller alpha values (less than 10), the network's weights were generally headed in the right direction; they just needed to hurry up and get there!

With an extremely large alpha, we see a textbook example of divergence, with the error increasing instead of decreasing...

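    import numpy as np

    # NOTE: setup and the final prints are cut off in this excerpt and are assumed here.
    alphas = [0.001, 0.01, 0.1, 1, 10, 100, 1000]

    def sigmoid(x):
        return 1 / (1 + np.exp(-x))

    def sigmoid_output_to_derivative(output):
        return output * (1 - output)

    X = np.array([[0,0,1],
                  [0,1,1],
                  [1,0,1],
                  [1,1,1]])

    y = np.array([[0], [1], [1], [0]])

    for alpha in alphas:
        print("\nTraining With Alpha:" + str(alpha))
        np.random.seed(1)
        synapse_0 = 2 * np.random.random((3, 4)) - 1
        synapse_1 = 2 * np.random.random((4, 1)) - 1

        prev_synapse_0_weight_update = np.zeros_like(synapse_0)
        prev_synapse_1_weight_update = np.zeros_like(synapse_1)
        synapse_0_direction_count = np.zeros_like(synapse_0)
        synapse_1_direction_count = np.zeros_like(synapse_1)

        for j in range(60000):

            # Feed forward through layers 0, 1, and 2
            layer_0 = X
            layer_1 = sigmoid(np.dot(layer_0, synapse_0))
            layer_2 = sigmoid(np.dot(layer_1, synapse_1))

            # how much did we miss the target value?
            layer_2_error = y - layer_2

            if (j % 10000) == 0:
                print("Error:" + str(np.mean(np.abs(layer_2_error))))

            # in what direction is the target value?
            layer_2_delta = layer_2_error * sigmoid_output_to_derivative(layer_2)

            # how much did each l1 value contribute to the l2 error (according to the weights)?
            layer_1_error = layer_2_delta.dot(synapse_1.T)

            # in what direction is the target l1?
            layer_1_delta = layer_1_error * sigmoid_output_to_derivative(layer_1)

            synapse_1_weight_update = layer_1.T.dot(layer_2_delta)
            synapse_0_weight_update = layer_0.T.dot(layer_1_delta)

            if j > 0:
                synapse_0_direction_count += np.abs(((synapse_0_weight_update > 0) + 0) - ((prev_synapse_0_weight_update > 0) + 0))
                synapse_1_direction_count += np.abs(((synapse_1_weight_update > 0) + 0) - ((prev_synapse_1_weight_update > 0) + 0))

            synapse_1 += alpha * synapse_1_weight_update
            synapse_0 += alpha * synapse_0_weight_update

            prev_synapse_0_weight_update = synapse_0_weight_update
            prev_synapse_1_weight_update = synapse_1_weight_update

        print("Synapse 0 Update Direction Changes:\n" + str(synapse_0_direction_count))
        print("Synapse 1 Update Direction Changes:\n" + str(synapse_1_direction_count))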

If a slope (derivative) changes direction, it means that it passed OVER the local minimum and needs to go back.
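The same bookkeeping can be shown on a one-dimensional toy problem. This sketch assumes the curve `f(x) = x**2` and three hand-picked alpha values; it is an illustration of counting direction changes, not the network code itself.

```python
def count_direction_changes(alpha, x=1.0, steps=20):
    """Count how often the slope of f(x) = x**2 flips sign during descent."""
    changes = 0
    prev_slope = None
    for _ in range(steps):
        slope = 2.0 * x
        if prev_slope is not None and (slope > 0) != (prev_slope > 0):
            changes += 1          # the ball passed over the minimum
        x -= alpha * slope
        prev_slope = slope
    return changes

for alpha in (0.1, 0.6, 1.1):
    print("alpha", alpha, "direction changes:", count_direction_changes(alpha))
```

A small alpha never crosses the minimum, while the two larger values cross it on every step; a high count is therefore a sign that alpha is large relative to the curvature, whether or not training still converges.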

Being able to increase the size of the hidden layer increases the amount of search space that we converge to in each iteration.

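    import numpy as np

    # NOTE: setup is cut off in this excerpt and is assumed here (32 hidden nodes per the text below).
    alphas = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
    hiddenSize = 32

    def sigmoid(x):
        return 1 / (1 + np.exp(-x))

    def sigmoid_output_to_derivative(output):
        return output * (1 - output)

    X = np.array([[0,0,1],
                  [0,1,1],
                  [1,0,1],
                  [1,1,1]])

    y = np.array([[0], [1], [1], [0]])

    for alpha in alphas:
        print("\nTraining With Alpha:" + str(alpha))
        np.random.seed(1)
        synapse_0 = 2 * np.random.random((3, hiddenSize)) - 1
        synapse_1 = 2 * np.random.random((hiddenSize, 1)) - 1

        for j in range(60000):

            # Feed forward through layers 0, 1, and 2
            layer_0 = X
            layer_1 = sigmoid(np.dot(layer_0, synapse_0))
            layer_2 = sigmoid(np.dot(layer_1, synapse_1))

            # how much did we miss the target value?
            layer_2_error = layer_2 - y

            if (j % 10000) == 0:
                print("Error after " + str(j) + " iterations:" + str(np.mean(np.abs(layer_2_error))))

            # in what direction is the target value?
            layer_2_delta = layer_2_error * sigmoid_output_to_derivative(layer_2)

            # how much did each l1 value contribute to the l2 error (according to the weights)?
            layer_1_error = layer_2_delta.dot(synapse_1.T)

            # in what direction is the target l1?
            layer_1_delta = layer_1_error * sigmoid_output_to_derivative(layer_1)

            synapse_1 -= alpha * (layer_1.T.dot(layer_2_delta))
            synapse_0 -= alpha * (layer_0.T.dot(layer_1_delta))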

Notice that the best error with 32 hidden nodes is 0.0009, whereas the best error with 4 hidden nodes was 0.0013.

Even though this is very marginal in this toy problem, this effect plays a huge role when modeling very complex datasets.

If you want to be able to create arbitrary architectures based on new academic papers or read and understand sample code for these different architectures, I think that it's a killer exercise.

I worked with neural networks for a couple years before performing this exercise, and it was the best investment of time I've made in the field (and it didn't take long).


So, it needs to press the left and right buttons correctly to find the lowest spot. What information does the ball use to adjust its position to find the lowest point?


This is a more extreme version of Problem 3, where it overcorrects wildly and ends up very far away from any local minima.

