AI News

Let’s code a Neural Network in plain NumPy

Using high-level frameworks like Keras, TensorFlow or PyTorch allows us to build very complex models quickly.

Our goal is to create a program capable of creating a densely connected neural network with the specified architecture (number and size of layers and appropriate activation function).

Superscript [l] denotes the index of the current layer (counted from one) and the value n indicates the number of units in a given layer.

Each item in the list is a dictionary describing the basic parameters of a single network layer: input_dim - the size of the signal vector supplied as an input for the layer, output_dim - the size of the activation vector obtained at the output of the layer and activation - the activation function to be used inside the layer.
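Concretely, such a list might look like this (a sketch; the field names follow the description above, while the specific layer sizes are illustrative assumptions):

```python
# A possible architecture definition (layer sizes are assumptions).
nn_architecture = [
    {"input_dim": 2, "output_dim": 4, "activation": "relu"},
    {"input_dim": 4, "output_dim": 6, "activation": "relu"},
    {"input_dim": 6, "output_dim": 1, "activation": "sigmoid"},
]
```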

If you’re familiar with this subject, you’ve probably already heard a voice in your head speaking in an anxious tone: “Hey, hey!

However, I deliberately decided to use the following notation to keep the objects consistent across all layers and make the code easier to understand for those who encounter these topics for the first time.

Those who have already looked at the code in Snippet 2 and have some experience with NumPy will have noticed that the matrix W and vector b have been filled with small, random numbers.

Looking at the graph of the sigmoid function, shown in Figure 4, we can see that it becomes almost flat for large values, which has a significant effect on the speed of learning of our NN.

All in all, parameter initialization using small random numbers is a simple approach, but it guarantees a good enough starting point for our algorithm.
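A minimal sketch of such an initialization, assuming the nn_architecture list above; the W1/b1-style keys anticipate the params_values dictionary mentioned later:

```python
import numpy as np

def init_layers(nn_architecture, seed=99):
    np.random.seed(seed)
    params_values = {}
    for idx, layer in enumerate(nn_architecture, start=1):
        # Small random values keep activations out of the sigmoid's flat regions.
        params_values["W" + str(idx)] = np.random.randn(
            layer["output_dim"], layer["input_dim"]) * 0.1
        params_values["b" + str(idx)] = np.random.randn(
            layer["output_dim"], 1) * 0.1
    return params_values
```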

“Without them, our neural network would become a combination of linear functions, so it would be just a linear function itself.” There are many activation functions, but in this project I decided to provide the possibility of using two of them — sigmoid and ReLU.
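Minimal sketches of both activations, together with the derivatives that backpropagation will need later:

```python
import numpy as np

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

def relu(Z):
    return np.maximum(0, Z)

def sigmoid_backward(dA, Z):
    # Chain rule through the sigmoid: sigma'(Z) = sigma(Z) * (1 - sigma(Z)).
    sig = sigmoid(Z)
    return dA * sig * (1 - sig)

def relu_backward(dA, Z):
    # The gradient passes through unchanged where Z > 0 and is zero elsewhere.
    dZ = np.array(dA, copy=True)
    dZ[Z <= 0] = 0
    return dZ
```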

The information flows in one direction — it is delivered in the form of an X matrix, and then travels through hidden units, resulting in the vector of predictions Y_hat.

To make it easier to read, I split forward propagation into two separate functions — step forward for a single layer and step forward for the entire NN.
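A sketch of both functions under the conventions used so far (A for activations, Z for pre-activations); it reuses sigmoid and relu from the previous snippet, and the memory cache is stored for the backward pass:

```python
import numpy as np

def single_layer_forward(A_prev, W_curr, b_curr, activation="relu"):
    # Affine transformation followed by the chosen non-linearity.
    Z_curr = np.dot(W_curr, A_prev) + b_curr
    if activation == "relu":
        return relu(Z_curr), Z_curr
    if activation == "sigmoid":
        return sigmoid(Z_curr), Z_curr
    raise Exception("Non-supported activation function")

def full_forward_propagation(X, params_values, nn_architecture):
    memory = {}   # cache of A and Z values, needed later for backpropagation
    A_curr = X
    for idx, layer in enumerate(nn_architecture, start=1):
        A_prev = A_curr
        A_curr, Z_curr = single_layer_forward(
            A_prev, params_values["W" + str(idx)],
            params_values["b" + str(idx)], layer["activation"])
        memory["A" + str(idx - 1)] = A_prev
        memory["Z" + str(idx)] = Z_curr
    return A_curr, memory
```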

In order to monitor our progress and make sure that we are moving in the right direction, we should routinely calculate the value of the loss function.

“Generally speaking, the loss function is designed to show how far we are from the ‘ideal’ solution.” It is selected according to the problem we plan to solve, and frameworks such as Keras have many options to choose from.
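For the binary classifier built here, a natural choice is binary cross-entropy; a hedged sketch (the text itself doesn't pin down the formula):

```python
import numpy as np

def get_cost_value(Y_hat, Y):
    # Mean binary cross-entropy over the m examples in the batch.
    m = Y_hat.shape[1]
    cost = -1 / m * (np.dot(Y, np.log(Y_hat).T)
                     + np.dot(1 - Y, np.log(1 - Y_hat).T))
    return np.squeeze(cost)
```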

In NNs, we calculate the gradient of the cost function (discussed earlier) with respect to the parameters, but backpropagation can be used to calculate derivatives of any function.

The essence of this algorithm is the recursive use of the chain rule known from differential calculus — calculating the derivative of a function created by composing other functions whose derivatives we already know.

The second one, representing full backward propagation, deals primarily with key juggling to read and update values in three dictionaries.

Then iterate through the layers of the network starting from the end and calculate the derivatives with respect to all parameters according to the diagram shown in Figure 6.

To accomplish this task, we will use two dictionaries provided as function arguments: params_values, which stores the current values of parameters, and grads_values, which stores cost function derivatives calculated with respect to these parameters.
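The per-layer step that this full pass orchestrates might look as follows (a sketch; it reuses the activation-derivative helpers from the earlier snippet, and the exact signature is an assumption):

```python
import numpy as np

def single_layer_backward(dA_curr, W_curr, Z_curr, A_prev, activation="relu"):
    m = A_prev.shape[1]
    backward_fn = relu_backward if activation == "relu" else sigmoid_backward
    dZ_curr = backward_fn(dA_curr, Z_curr)                 # through the activation
    dW_curr = np.dot(dZ_curr, A_prev.T) / m                # gradient w.r.t. weights
    db_curr = np.sum(dZ_curr, axis=1, keepdims=True) / m   # gradient w.r.t. biases
    dA_prev = np.dot(W_curr.T, dZ_curr)                    # passed to previous layer
    return dA_prev, dW_curr, db_curr
```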

This is a very simple optimization algorithm, but I decided to use it because it is a great starting point for more advanced optimizers, which will probably be the subject of one of my next articles.
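A sketch of that update step, assuming the params_values and grads_values dictionaries described above:

```python
def update(params_values, grads_values, nn_architecture, learning_rate):
    # Plain gradient descent: step each parameter against its gradient.
    for idx in range(1, len(nn_architecture) + 1):
        params_values["W" + str(idx)] -= learning_rate * grads_values["dW" + str(idx)]
        params_values["b" + str(idx)] -= learning_rate * grads_values["db" + str(idx)]
    return params_values
```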

In order to make a prediction, you only need to run a full forward propagation using the received weight matrix and a set of test data.
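Using the earlier sketches, prediction might look like this (X_test and the 0.5 decision threshold are illustrative assumptions):

```python
# Hypothetical usage: columns of X_test are examples; 0.5 is an assumed threshold.
Y_hat, _ = full_forward_propagation(X_test, params_values, nn_architecture)
predictions = (Y_hat > 0.5).astype(int)
```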

I hope that my article has broadened your horizons and increased your understanding of the operations taking place inside the neural network.

In the section on linear classification we computed scores for different visual categories given the image using the formula \( s = W x \), where \(W\) was a matrix and \(x\) was an input column vector containing all pixel data of the image.

In the case of CIFAR-10, \(x\) is a [3072x1] column vector, and \(W\) is a [10x3072] matrix, so that the output is a vector of 10 class scores.

There are several choices we could make for the non-linearity (which we’ll study below), but this one is a common choice and simply thresholds all activations that are below zero to zero.

Notice that the non-linearity is critical computationally - if we left it out, the two matrices could be collapsed to a single matrix, and therefore the predicted class scores would again be a linear function of the input.

A three-layer neural network could analogously look like \( s = W_3 \max(0, W_2 \max(0, W_1 x)) \), where all of \(W_3, W_2, W_1\) are parameters to be learned.
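A quick NumPy sketch of this score function (the hidden sizes of 100 are illustrative assumptions; in practice the weight matrices would be learned, not random):

```python
import numpy as np

x = np.random.randn(3072, 1)            # CIFAR-10-sized input column vector
W1 = np.random.randn(100, 3072) * 0.01  # first-layer weights (assumed size)
W2 = np.random.randn(100, 100) * 0.01   # second-layer weights (assumed size)
W3 = np.random.randn(10, 100) * 0.01    # maps to 10 class scores
s = W3 @ np.maximum(0, W2 @ np.maximum(0, W1 @ x))  # vector of 10 class scores
```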

The area of Neural Networks was originally inspired primarily by the goal of modeling biological neural systems, but has since diverged and become a matter of engineering and achieving good results in Machine Learning tasks.

Approximately 86 billion neurons can be found in the human nervous system and they are connected with approximately 10^14 - 10^15 synapses.

The idea is that the synaptic strengths (the weights \(w\)) are learnable and control the strength of influence (and its direction: excitatory (positive weight) or inhibitory (negative weight)) of one neuron on another.

Based on this rate code interpretation, we model the firing rate of the neuron with an activation function \(f\), which represents the frequency of the spikes along the axon.

Historically, a common choice of activation function is the sigmoid function \(\sigma\), since it takes a real-valued input (the signal strength after the sum) and squashes it to range between 0 and 1.

Each neuron performs a dot product with the input and its weights, adds the bias and applies the non-linearity (or activation function), in this case the sigmoid \(\sigma(x) = 1/(1+e^{-x})\). A sketch of the code for forward-propagating a single neuron might look as follows:
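```python
import numpy as np

class Neuron(object):
    # A sketch: self.weights (a 1-D numpy array) and self.bias (a number)
    # are assumed to have been set elsewhere.
    def forward(self, inputs):
        """Forward-propagate one neuron: dot product, add bias, apply sigmoid."""
        cell_body_sum = np.sum(inputs * self.weights) + self.bias
        firing_rate = 1.0 / (1.0 + np.exp(-cell_body_sum))  # sigmoid activation
        return firing_rate
```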

As we saw with linear classifiers, a neuron has the capacity to “like” (activation near one) or “dislike” (activation near zero) certain linear regions of its input space.

With this interpretation, we can formulate the cross-entropy loss as we have seen in the Linear Classification section, and optimizing it would lead to a binary Softmax classifier (also known as logistic regression).

The regularization loss in both SVM/Softmax cases could in this biological view be interpreted as gradual forgetting, since it would have the effect of driving all synaptic weights \(w\) towards zero after every parameter update.

The sigmoid non-linearity has the mathematical form \(\sigma(x) = 1 / (1 + e^{-x})\) and is shown in the image above on the left.

The sigmoid function has seen frequent use historically since it has a nice interpretation as the firing rate of a neuron: from not firing at all (0) to fully-saturated firing at an assumed maximum frequency (1).

Also note that the tanh neuron is simply a scaled sigmoid neuron, in particular the following holds: \( \tanh(x) = 2 \sigma(2x) -1 \).

Other types of units have been proposed that do not have the functional form \(f(w^Tx + b)\) where a non-linearity is applied on the dot product between the weights and the data.

TLDR: “What neuron type should I use?” Use the ReLU non-linearity, be careful with your learning rates and possibly monitor the fraction of “dead” units in a network.

For regular neural networks, the most common layer type is the fully-connected layer in which neurons between two adjacent layers are fully pairwise connected, but neurons within a single layer share no connections.

Working with the two example networks in the above picture: To give you some context, modern Convolutional Networks contain on the order of 100 million parameters and are usually made up of approximately 10-20 layers (hence deep learning).

The full forward pass of this 3-layer neural network is then simply three matrix multiplications, interwoven with the application of the activation function; in the code below, W1, W2, W3, b1, b2, b3 are the learnable parameters of the network (a sketch with assumed layer sizes):
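```python
import numpy as np

# Assumed parameter shapes for a tiny 3-4-4-1 network.
W1, b1 = np.random.randn(4, 3), np.random.randn(4, 1)
W2, b2 = np.random.randn(4, 4), np.random.randn(4, 1)
W3, b3 = np.random.randn(1, 4), np.random.randn(1, 1)

f = lambda x: 1.0 / (1.0 + np.exp(-x))  # activation function (sigmoid)
x = np.random.randn(3, 1)               # random input vector of three numbers (3x1)
h1 = f(np.dot(W1, x) + b1)              # first hidden layer activations (4x1)
h2 = f(np.dot(W2, h1) + b2)             # second hidden layer activations (4x1)
out = np.dot(W3, h2) + b3               # output neuron (1x1)
```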

Notice also that instead of having a single input column vector, the variable x could hold an entire batch of training data (where each input example would be a column of x) and then all examples would be efficiently evaluated in parallel.

Neural Networks work well in practice because they compactly express nice, smooth functions that fit well with the statistical properties of data we encounter in practice, and are also easy to learn using our optimization algorithms (e.g. gradient descent).

Similarly, the fact that deeper networks (with multiple hidden layers) can work better than single-hidden-layer networks is an empirical observation, despite the fact that their representational power is equal.

As an aside, in practice it is often the case that 3-layer neural networks will outperform 2-layer nets, but going even deeper (4,5,6-layer) rarely helps much more.

We could train three separate neural networks, each with one hidden layer of some size and obtain the following classifiers: In the diagram above, we can see that Neural Networks with more neurons can express more complicated functions.

For example, the model with 20 hidden neurons fits all the training data but at the cost of segmenting the space into many disjoint red and green decision regions.

The subtle reason behind this is that smaller networks are harder to train with local methods such as Gradient Descent: It’s clear that their loss functions have relatively few local minima, but it turns out that many of these minima are easier to converge to, and that they are bad (i.e. have high loss).

Conversely, bigger neural networks contain significantly more local minima, but these minima turn out to be much better in terms of their actual loss.

In practice, what you find is that if you train a small network the final loss can display a good amount of variance - in some cases you get lucky and converge to a good place but in some cases you get trapped in one of the bad minima.


The output layer uses two functions to compute the loss and the derivatives: forwardLoss computes the loss according to the specified loss function, and backwardLoss computes the derivatives of the loss. The input T corresponds to the training targets.

For classification problems, the dimensions of T depend on the type of problem. For a problem with K classes, to ensure that the network outputs the correct size before the output layer, you can include a fully connected layer of size K followed by a softmax layer.

For regression problems, the dimensions of T also depend on the type of problem. For example, if the network defines an image regression network with one response and has mini-batches of size 50, then T is a 4-D array of size 1-by-1-by-1-by-50. To ensure that the network outputs the correct size before the output layer, you can include a fully connected layer of size R, where R is the number of responses.

If you want to include a user-defined output layer after a built-in layer, then the layer functions must accept inputs of the size produced by that built-in layer.

GPU Compatibility. For GPU compatibility, the layer functions must support inputs and return outputs of type gpuArray. Many MATLAB built-in functions support gpuArray input arguments. If you call any of these functions with at least one gpuArray input, then the function executes on the GPU and returns a gpuArray output. To use a GPU for deep learning, you must also have a CUDA enabled NVIDIA GPU with compute capability 3.0 or higher.

Coding Neural Network — Forward Propagation and Backpropagation

On each hidden layer, the neural network learns a new feature space by first computing an affine (linear) transformation of the given inputs and then applying a non-linear function, which in turn becomes the input to the next layer.

For a 3-layer neural network, the learned function would be f(x) = f_3(f_2(f_1(x))), where each f_l is one layer's transformation. Therefore, on each layer we learn a different representation that gets more complicated with later hidden layers (when counting layers, we don't count the input layer). For example, computers can't understand images directly and don't know what to do with pixel data.

Since the truth is never linear and representation is critical to the performance of a machine learning algorithm, a neural network can help us build very complex models and leave it to the algorithm to learn such representations, without worrying about the feature engineering that takes practitioners a very long time and much effort to curate.

This post will be the first in a series of posts covering the implementation of a neural network in numpy, including gradient checking, parameter initialization, L2 regularization, and dropout. The post has two parts.

Let's first introduce some notation that will be used throughout the post. Next, we'll write down the dimensions of a multi-layer neural network in general form to help us with the matrix multiplication, because one of the major challenges in implementing a neural network is getting the dimensions right.

It’s important to note that we shouldn’t initialize all the parameters to zero, because doing so leads all gradients to be equal, so on each iteration the output would be the same and the learning algorithm wouldn’t learn anything.

It’s also recommended to multiply the random values by a small scalar such as 0.01, to keep the activation units active and in the regions where the activation functions’ derivatives are not close to zero.

Therefore, loop over the nodes starting at the final node in reverse topological order to compute the derivative of the final node output with respect to each edge’s node tail.

Next, we’ll train two versions of the neural network, where each one will use a different activation function on the hidden layers: one will use the rectified linear unit (ReLU) and the second will use the hyperbolic tangent function (tanh).

Finally we’ll use the parameters we get from both neural networks to classify training examples and compute the training accuracy rates for each version to see which activation function works best on this problem.

Neural networks: training with backpropagation.

In my first post on neural networks, I discussed a model representation for neural networks and how we can feed in inputs and calculate an output.

When I was first learning backpropagation, many people tried to abstract away the underlying math (derivative chains) and I was never really able to grasp what the heck was going on until watching Professor Winston's lecture at MIT.

I hope that this post gives a better understanding of backpropagation than simply "this is the step where we send the error backwards to update the weights".

To figure out how to use gradient descent in training a neural network, let's start with the simplest neural network: one input neuron, one hidden layer neuron, and one output neuron.

To show a more complete picture of what's going on, I've expanded each neuron to show 1) the linear combination of inputs and weights and 2) the activation of this linear combination.

$$ J\left( \theta \right) = \frac{1}{2}\left( y - a^{(3)} \right)^2 $$ There's a myriad of cost functions we could use, but for this neural network squared error will work just fine.

In order to minimize the difference between our neural network's output and the target output, we need to know how the model performance changes with respect to each parameter in our model.

$$ \frac{\partial J\left( \theta \right)}{\partial \theta_1} = ? \qquad \frac{\partial J\left( \theta \right)}{\partial \theta_2} = ? $$ Let's look at $\frac{\partial J\left( \theta \right)}{\partial \theta_2}$ first.

$$ \frac{\partial}{\partial z} p\left( q\left( z \right) \right) = \frac{\partial p}{\partial q}\frac{\partial q}{\partial z} $$ Let's apply the chain rule to solve for $\frac{\partial J\left( \theta \right)}{\partial \theta_2}$.

$$ \frac{\partial J\left( \theta \right)}{\partial \theta_2} = \left( \frac{\partial J\left( \theta \right)}{\partial a^{(3)}} \right)\left( \frac{\partial a^{(3)}}{\partial z^{(3)}} \right)\left( \frac{\partial z^{(3)}}{\partial \theta_2} \right) $$ By similar logic, we can find $\frac{\partial J\left( \theta \right)}{\partial \theta_1}$.

$$ \frac{\partial J\left( \theta \right)}{\partial \theta_1} = \left( \frac{\partial J\left( \theta \right)}{\partial a^{(3)}} \right)\left( \frac{\partial a^{(3)}}{\partial z^{(3)}} \right)\left( \frac{\partial z^{(3)}}{\partial a^{(2)}} \right)\left( \frac{\partial a^{(2)}}{\partial z^{(2)}} \right)\left( \frac{\partial z^{(2)}}{\partial \theta_1} \right) $$ For the sake of clarity, I've updated our neural network diagram to visualize these chains.
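With the squared-error cost defined earlier, each factor in the $\theta_2$ chain has a simple closed form; as a worked sketch (writing $f$ for the activation function):

$$ \frac{\partial J\left( \theta \right)}{\partial a^{(3)}} = -\left( y - a^{(3)} \right), \qquad \frac{\partial a^{(3)}}{\partial z^{(3)}} = f'\left( z^{(3)} \right), \qquad \frac{\partial z^{(3)}}{\partial \theta_2} = a^{(2)} $$

so that $\frac{\partial J\left( \theta \right)}{\partial \theta_2} = -\left( y - a^{(3)} \right) f'\left( z^{(3)} \right) a^{(2)}$.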

For the parameter values $\theta$, I like to read them like a mailing label - the first value denotes what neuron the input is being sent to in the next layer, and the second value denotes from what neuron the information is being sent.

$$ J\left( \theta \right) = \frac{1}{2m}\sum_i \left( y_i - a_i^{(2)} \right)^2 $$ Note: If you're training on multiple observations (which you always will in practice), we'll also need to perform a summation of the cost function over all training examples.

$$ \left( \frac{\partial J\left( \theta \right)}{\partial a_1^{(3)}} \right)\left( \frac{\partial a_1^{(3)}}{\partial z_1^{(3)}} \right)\left( \frac{\partial z_1^{(3)}}{\partial a_1^{(2)}} \right)\left( \frac{\partial a_1^{(2)}}{\partial z_1^{(2)}} \right)\left( \frac{\partial z_1^{(2)}}{\partial \theta_{11}^{(1)}} \right) $$

The derivative chain for the orange path is:

$$ \left( \frac{\partial J\left( \theta \right)}{\partial a_2^{(3)}} \right)\left( \frac{\partial a_2^{(3)}}{\partial z_2^{(3)}} \right)\left( \frac{\partial z_2^{(3)}}{\partial a_1^{(2)}} \right)\left( \frac{\partial a_1^{(2)}}{\partial z_1^{(2)}} \right)\left( \frac{\partial z_1^{(2)}}{\partial \theta_{11}^{(1)}} \right) $$

Combining these, we get the total expression for $\frac{\partial J\left( \theta \right)}{\partial \theta_{11}^{(1)}}$.

We just went from a neural network with 2 parameters that needed 8 partial derivative terms in the previous example to a neural network with 8 parameters that needed 52 partial derivative terms.

A homogeneous activation function within a layer is necessary in order to be able to leverage matrix operations in calculating the neural network output.

$$ \delta_i^{(3)} = \frac{1}{m}\left( y_i - a_i^{(3)} \right) f'\left( a^{(3)} \right) $$ The third column represents how the parameter of interest changes with respect to the weighted inputs for the current layer;

I'd also like to note that the first two partial derivative terms seem concerned with the first output neuron (neuron 1 in layer 3) while the last two partial derivative terms seem concerned with the second output neuron (neuron 2 in layer 3).

$$ \frac{\partial J\left( \theta \right)}{\partial \theta_{jk}^{(2)}} = \delta_j^{(3)} a_k^{(2)} $$ Let's see if we can figure out a matrix operation to compute all of the partial derivatives in one expression.

The first two columns (of each summed term) correspond with a $\delta_j^{(3)}$ calculated in the next layer forward (remember, we started at the end of the network and are working our way back).

$$ \delta_j^{(l)} = f'\left( a^{(l)} \right)\sum_{i = 1}^n \delta_i^{(l + 1)}\theta_{ij}^{(l)} $$ We could write this more succinctly using matrix expressions.

$$ \delta^{(l)} = \delta^{(l + 1)}\Theta^{(l)} f'\left( a^{(l)} \right) $$ Note: It's standard notation to denote vectors with lowercase letters and matrices as uppercase letters.

$$ \frac{\partial J\left( \theta \right)}{\partial \theta_{11}^{(1)}} = \delta_1^{(3)}\left( \frac{\partial z_1^{(3)}}{\partial a_1^{(2)}} \right)\left( \frac{\partial a_1^{(2)}}{\partial z_1^{(2)}} \right)\left( \frac{\partial z_1^{(2)}}{\partial \theta_{11}^{(1)}} \right) + \delta_2^{(3)}\left( \frac{\partial z_2^{(3)}}{\partial a_1^{(2)}} \right)\left( \frac{\partial a_1^{(2)}}{\partial z_1^{(2)}} \right)\left( \frac{\partial z_1^{(2)}}{\partial \theta_{11}^{(1)}} \right) $$ Further, we established that the third column of each derivative chain was a parameter that acted to weight each of the respective $\delta$ terms.

$$ \frac{\partial J\left( \theta \right)}{\partial \theta_{11}^{(1)}} = \delta_1^{(3)}\theta_{11}^{(2)} f'\left( a^{(2)} \right)\left( \frac{\partial z_1^{(2)}}{\partial \theta_{11}^{(1)}} \right) + \delta_2^{(3)}\theta_{21}^{(2)} f'\left( a^{(2)} \right)\left( \frac{\partial z_1^{(2)}}{\partial \theta_{11}^{(1)}} \right) $$ Factoring out $\left( \frac{\partial z_1^{(2)}}{\partial \theta_{11}^{(1)}} \right)$, $$ \frac{\partial J\left( \theta \right)}{\partial \theta_{11}^{(1)}} = \left( \frac{\partial z_1^{(2)}}{\partial \theta_{11}^{(1)}} \right)\left( \delta_1^{(3)}\theta_{11}^{(2)} f'\left( a^{(2)} \right) + \delta_2^{(3)}\theta_{21}^{(2)} f'\left( a^{(2)} \right) \right) $$ we're left with our new definition of $\delta_j^{(l)}$.

$$ \frac{\partial J\left( \theta \right)}{\partial \theta_{11}^{(1)}} = \left( \frac{\partial z_1^{(2)}}{\partial \theta_{11}^{(1)}} \right)\left( \delta_1^{(2)} \right) $$ Lastly, we established that the fifth column ($\frac{\partial z_1^{(2)}}{\partial \theta_{11}^{(1)}}$) corresponded with an input from the previous layer.

$$ \frac{\partial J\left( \theta \right)}{\partial \theta_{11}^{(1)}} = \delta_1^{(2)} x_1 $$ Again, let's figure out a matrix operation to compute all of the partial derivatives in one expression.

We can calculate $\delta ^{(2)}$ as the weighted combination of errors from layer 3, multiplied by the derivative of the activation function used in layer 2.

In the last section, we developed a way to calculate all of the partial derivatives necessary for gradient descent (partial derivative of the cost function with respect to all model parameters) using matrix expressions.

We also developed a new term, $\delta$, which essentially serves to represent all of the partial derivative terms that we need to reuse as we progress, layer by layer, backwards through the network.

Generally, we established that you can calculate the partial derivatives for layer $l$ by combining $\delta$ terms of the next layer forward with the activations of the current layer.

$$ \frac{\partial J\left( \theta \right)}{\partial \theta_{ij}^{(l)}} = \left( \delta^{(l + 1)} \right)^T a^{(l)} $$ Before defining the formal method for backpropagation, I'd like to provide a visualization of the process.
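To make these matrix expressions concrete, here is a small numpy sketch for a 2-3-2 network on a single training example; the names, shapes, sign convention, and omission of the $\frac{1}{m}$ scaling are all illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Forward pass for a tiny 2-3-2 network (shapes are assumptions).
x = np.random.randn(1, 2)             # one training example as a row vector
Theta1 = np.random.randn(3, 2) * 0.1  # layer 1 weights
Theta2 = np.random.randn(2, 3) * 0.1  # layer 2 weights
a2 = sigmoid(x @ Theta1.T)            # hidden activations, shape (1, 3)
a3 = sigmoid(a2 @ Theta2.T)           # outputs, shape (1, 2)
y = np.array([[0.0, 1.0]])            # assumed target

# Backward pass using the delta recurrence, with squared error and
# f'(z) = a * (1 - a) for the sigmoid:
delta3 = (a3 - y) * a3 * (1 - a3)           # output-layer delta
delta2 = (delta3 @ Theta2) * a2 * (1 - a2)  # delta^(2) = delta^(3) Theta^(2) f'(a^(2))

grad_Theta2 = delta3.T @ a2           # dJ/dTheta2 = (delta^(3))^T a^(2)
grad_Theta1 = delta2.T @ x            # dJ/dTheta1 = (delta^(2))^T a^(1)
```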

Typically we'll assign each parameter to a random value in $\left[ { - \varepsilon ,\varepsilon } \right]$ where $\varepsilon$ is some value close to zero.

$$ \Delta \theta_i = -\eta \frac{\partial J\left( \theta \right)}{\partial \theta_i} $$ We'll use this formula to update each of the weights, recompute forward propagation with the new weights, backpropagate the error, and calculate the next weight update.

Blog Archives

When constructing Artificial Neural Network (ANN) models, one of the primary considerations is choosing activation functions for hidden and output layers that are differentiable.

The simplest activation function, one that is commonly used for the output layer activation function in regression problems, is the identity/linear activation function, $g(a) = a$.

For example, a multi-layer network that has nonlinear activation functions amongst the hidden units and an output layer that uses the identity activation function implements a powerful form of nonlinear regression.

Specifically, the network can predict continuous target values using a linear combination of signals that arise from one or more layers of nonlinear transformations of the input.

The logistic sigmoid (Figure 1, blue curves) outputs values that range between 0 and 1. It is motivated somewhat by biological neurons and can be interpreted as the probability of an artificial neuron “firing”.

(It turns out that the logistic sigmoid can also be derived as the maximum likelihood solution for logistic regression in statistics.) Calculating the derivative of the logistic sigmoid function makes use of the quotient rule and a clever trick that both adds and subtracts a one from the numerator:
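In outline, the derivation runs as follows (a reconstruction consistent with that description):

$$ \frac{d}{dx}\sigma(x) = \frac{e^{-x}}{\left(1+e^{-x}\right)^2} = \frac{\left(1 + e^{-x}\right) - 1}{\left(1+e^{-x}\right)^2} = \frac{1}{1+e^{-x}} - \frac{1}{\left(1+e^{-x}\right)^2} = \sigma(x)\left(1 - \sigma(x)\right) $$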

This turns out to be a convenient form for efficiently calculating gradients used in neural networks: if one keeps in memory the feed-forward activations of the logistic function for a given layer, the gradients for that layer can be evaluated using simple multiplication and subtraction rather than re-evaluating the sigmoid function, which would require extra exponentiation.
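For example (a minimal sketch; names are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.random.randn(4)
a = sigmoid(z)      # feed-forward activations cached during the forward pass
grad = a * (1 - a)  # sigma'(z) via multiplication and subtraction only
```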

These activation functions are motivated by biology and/or provide some handy implementation tricks like calculating derivatives using cached feed-forward activation values.

Neural Network Calculation (Part 1): Feedforward Structure

In this series we will see how a neural network actually calculates its values. This first video takes a look at the structure of ...

Lecture 3 | Loss Functions and Optimization

Lecture 3 continues our discussion of linear classifiers. We introduce the idea of a loss function to quantify our unhappiness with a model's predictions, and ...

Mod-01 Lec-28 RBF Neural Network (Contd.)

Pattern Recognition and Application by Prof. P.K. Biswas, Department of Electronics & Communication Engineering, IIT Kharagpur. For more details on NPTEL ...

Lecture 4: Word Window Classification and Neural Networks

Lecture 4 introduces single and multilayer neural networks, and how they can be used for classification purposes. Key phrases: Neural networks. Forward ...

Mod-08 Lec-26 Multilayer Feedforward Neural networks with Sigmoidal activation functions;

Pattern Recognition by Prof. P.S. Sastry, Department of Electronics & Communication Engineering, IISc Bangalore. For more details on NPTEL visit ...

Lecture 11 | Detection and Segmentation

In Lecture 11 we move beyond image classification, and show how convolutional networks can be applied to other core computer vision tasks. We show how ...

Lecture 5: Backpropagation and Project Advice

Lecture 5 discusses how neural networks can be trained using a distributed gradient descent technique known as back propagation. Key phrases: Neural ...

Neural Networks Demystified [Part 2: Forward Propagation]

Neural Networks Demystified @stephencwelch. Supporting Code: ... In this short series, we will ...

Lec-23 Multi-Class Classification Using Multi-layered Perceptrons

Lecture Series on Neural Networks and Applications by Prof. S. Sengupta, Department of Electronics and Electrical Communication Engineering, IIT Kharagpur.

But what *is* a Neural Network? | Deep learning, chapter 1
