How to build your own Neural Network from scratch in Python

Motivation: As part of my personal journey to gain a better understanding of Deep Learning, I’ve decided to build a Neural Network from scratch without a deep learning library like TensorFlow.

Without delving into brain analogies, I find it easier to simply describe Neural Networks as a mathematical function that maps a given input to a desired output.

Neural Networks consist of the following components: an input layer x; an arbitrary number of hidden layers; an output layer ŷ; a set of weights and biases between each layer, W and b; and a choice of activation function for each hidden layer, σ. The diagram below shows the architecture of a 2-layer Neural Network (note that the input layer is typically excluded when counting the number of layers in a Neural Network). Creating a Neural Network class in Python is easy, as sketched below.
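A minimal sketch of such a class, assuming a single hidden layer of width 4 and omitting the biases to keep the sketch short (both are simplifications of this sketch, not requirements):

```python
import numpy as np

class NeuralNetwork:
    def __init__(self, x, y):
        self.input    = x                                       # training inputs
        self.weights1 = np.random.rand(self.input.shape[1], 4)  # input -> hidden
        self.weights2 = np.random.rand(4, 1)                    # hidden -> output
        self.y        = y                                       # training targets
        self.output   = np.zeros(self.y.shape)                  # predictions ŷ
```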

Training the Neural Network: The output ŷ of a simple 2-layer Neural Network is \(\hat{y} = \sigma(W_2\,\sigma(W_1 x + b_1) + b_2)\). You might notice that in the equation above, the weights W and the biases b are the only variables that affect the output ŷ.

As we’ve seen in the sequential graph above, feedforward is just a sequence of simple calculations, and for a basic 2-layer neural network the output is \(\hat{y} = \sigma(W_2\,\sigma(W_1 x + b_1) + b_2)\). Let’s add a feedforward function to our Python code to do exactly that.
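A sketch of that pass, continuing the class above but written as standalone helpers, assuming the sigmoid as activation function and a row-major data layout (so the products appear as x·W rather than W·x):

```python
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(x, W1, W2):
    # ŷ = σ(σ(x·W1)·W2); biases omitted as in the class sketch above
    layer1 = sigmoid(np.dot(x, W1))
    output = sigmoid(np.dot(layer1, W2))
    return layer1, output
```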

In order to know the appropriate amount to adjust the weights and biases by, we need to know the derivative of the loss function with respect to the weights and biases.

However, we can’t directly calculate the derivative of the loss function with respect to the weights and biases because the equation of the loss function does not contain them explicitly. Therefore, we need the chain rule to help us calculate it.
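A sketch of the resulting back-propagation step, assuming a sum-of-squares loss \(L = \sum (y - \hat{y})^2\) and the sigmoid helpers above; the factor \(2(y - \hat{y})\,\sigma'\) is exactly the chain rule applied to that loss:

```python
def sigmoid_derivative(a):
    # derivative of the sigmoid, written in terms of its output a = sigmoid(z)
    return a * (1.0 - a)

def backprop(x, y, W1, W2, layer1, output, lr=1.0):
    # chain rule: dL/dW2 and dL/dW1 for the sum-of-squares loss
    d_output = 2 * (y - output) * sigmoid_derivative(output)  # negative of dL/dz2
    d_W2 = np.dot(layer1.T, d_output)
    d_W1 = np.dot(x.T, np.dot(d_output, W2.T) * sigmoid_derivative(layer1))
    # adding the (negative-gradient) deltas descends the loss
    return W1 + lr * d_W1, W2 + lr * d_W2
```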

Although Deep Learning libraries such as TensorFlow and Keras make it easy to build deep nets without fully understanding the inner workings of a Neural Network, I find that it’s beneficial for aspiring data scientists to gain a deeper understanding of Neural Networks.

Mean subtraction is the most common form of preprocessing. It involves subtracting the mean across every individual feature in the data, and has the geometric interpretation of centering the cloud of data around the origin along every dimension.

Normalization refers to scaling the data dimensions so that they are of approximately the same scale. It only makes sense to apply this preprocessing if you have a reason to believe that different input features have different scales (or units), but they should be of approximately equal importance to the learning algorithm. In the case of images, the relative scales of pixels are already approximately equal (and in the range from 0 to 255), so it is not strictly necessary to perform this additional preprocessing step.
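In numpy both steps are one-liners; a sketch, with X standing in for a hypothetical [N x D] data matrix:

```python
import numpy as np

X = np.random.rand(50, 3072)   # hypothetical [N x D] data matrix
X -= np.mean(X, axis=0)        # mean subtraction: center every feature at zero
X /= np.std(X, axis=0)         # normalization: bring every feature to unit scale
```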

Then, we can compute the covariance matrix that tells us about the correlation structure in the data: the (i,j) element of the data covariance matrix contains the covariance between the i-th and j-th dimensions of the data (see the sketch after the next paragraph).

To decorrelate the data, we project the original (but zero-centered) data into the eigenbasis, as sketched below. Notice that the columns of U are a set of orthonormal vectors (norm of 1, and orthogonal to each other), so they can be regarded as basis vectors.
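Continuing the snippet above (X is assumed to be zero-centered already), the standard numpy recipe looks like:

```python
cov = np.dot(X.T, X) / X.shape[0]  # [D x D] data covariance matrix
U, S, V = np.linalg.svd(cov)       # columns of U are the (orthonormal) eigenbasis
Xrot = np.dot(X, U)                # decorrelate the data
```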

This is also sometimes referred to as Principal Component Analysis (PCA) dimensionality reduction: after this operation, we would have reduced the original dataset of size [N x D] to one of size [N x 100], keeping the 100 dimensions of the data that contain the most variance.
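Since the columns of U are sorted by decreasing eigenvalue, keeping the top 100 dimensions is a single projection:

```python
Xrot_reduced = np.dot(X, U[:, :100])  # [N x 100]: dimensions with most variance
```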

The geometric interpretation of this transformation is that if the input data is a multivariate gaussian, then the whitened data will be a gaussian with zero mean and identity covariance matrix.
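Whitening divides every decorrelated dimension by its scale; a sketch, continuing from the snippet above (the small constant guards against division by zero):

```python
Xwhite = Xrot / np.sqrt(S + 1e-5)  # per-dimension scaling by sqrt(eigenvalue)
```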

One weakness of this transformation is that it can greatly exaggerate the noise in the data, since it stretches all dimensions (including the irrelevant dimensions of tiny variance that are mostly noise) to be of equal size in the input. In practice this can be mitigated by stronger smoothing, i.e. increasing the small constant added to the eigenvalues.

Note that we do not know what the final value of every weight should be in the trained network, but with proper data normalization it is reasonable to assume that approximately half of the weights will be positive and half of them will be negative.

The idea is that the neurons are all random and unique in the beginning, so they will compute distinct updates and integrate themselves as diverse parts of the full network.

The implementation for one weight matrix might look like W = 0.01 * np.random.randn(D, H), where randn samples from a zero-mean, unit-standard-deviation gaussian.

With this formulation, every neuron’s weight vector is initialized as a random vector sampled from a multi-dimensional gaussian, so the neurons point in a random direction in the input space.

That is, the recommended heuristic is to initialize each neuron’s weight vector as: w = np.random.randn(n) / sqrt(n), where n is the number of its inputs.

The sketch of the derivation is as follows: Consider the inner product \(s = \sum_i^n w_i x_i\) between the weights \(w\) and input \(x\), which gives the raw activation of a neuron before the non-linearity.
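The omitted intermediate steps, assuming the weights \(w_i\) and inputs \(x_i\) are independent, identically distributed, and zero-mean, are:

\[
\text{Var}(s) = \text{Var}\Big(\sum_i^n w_i x_i\Big) = \sum_i^n \text{Var}(w_i x_i) = \sum_i^n \text{Var}(x_i)\,\text{Var}(w_i) = \big(n\,\text{Var}(w)\big)\,\text{Var}(x),
\]

where the middle step uses \(\text{Var}(w_i x_i) = [E(w_i)]^2\text{Var}(x_i) + [E(x_i)]^2\text{Var}(w_i) + \text{Var}(x_i)\text{Var}(w_i)\), whose first two terms vanish because both means are zero. Keeping \(\text{Var}(s) = \text{Var}(x)\) therefore requires \(\text{Var}(w) = 1/n\).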

And since \(\text{Var}(aX) = a^2\text{Var}(X)\) for a random variable \(X\) and a scalar \(a\), this implies that we should draw from a unit gaussian and then scale it by \(a = \sqrt{1/n}\), to make its variance \(1/n\).

In the paper Understanding the difficulty of training deep feedforward neural networks by Glorot and Bengio, the authors end up recommending an initialization of the form \( \text{Var}(w) = 2/(n_{in} + n_{out}) \) where \(n_{in}, n_{out}\) are the number of units in the previous layer and the next layer.

A more recent paper on this topic, Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification by He et al., derives an initialization specifically for ReLU neurons, reaching the conclusion that the variance of neurons in the network should be \(2.0/n\).

This gives the initialization w = np.random.randn(n) * sqrt(2.0/n), and is the current recommendation for use in practice in the specific case of neural networks with ReLU neurons.

Another way to address the uncalibrated variances problem is to set all weight matrices to zero, but to break symmetry every neuron is randomly connected (with weights sampled from a small gaussian as above) to a fixed number of neurons below it.

For ReLU non-linearities, some people like to use a small constant value such as 0.01 for all biases because this ensures that all ReLU units fire in the beginning and therefore obtain and propagate some gradient.

However, it is not clear if this provides a consistent improvement (in fact some results seem to indicate that this performs worse) and it is more common to simply use 0 bias initialization.

A recently developed technique by Ioffe and Szegedy called Batch Normalization alleviates a lot of headaches with properly initializing neural networks by explicitly forcing the activations throughout a network to take on a unit gaussian distribution at the beginning of the training.

In the implementation, applying this technique usually amounts to inserting the BatchNorm layer immediately after fully connected layers (or convolutional layers, as we’ll soon see), and before non-linearities.
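A minimal sketch of the train-time forward pass of such a layer; gamma and beta are the layer’s learnable scale and shift (names assumed here), and the running-average bookkeeping needed at test time is omitted:

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x: [N, D] activations arriving from a fully connected layer
    mu = x.mean(axis=0)                     # per-feature batch mean
    var = x.var(axis=0)                     # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # force approximately unit gaussian
    return gamma * x_hat + beta             # learnable scale and shift
```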

It is common to see the factor of \(\frac{1}{2}\) in front (i.e. writing the penalty as \(\frac{1}{2} \lambda w^2\) for every weight \(w\)) because then the gradient of this term with respect to the parameter \(w\) is simply \(\lambda w\) instead of \(2 \lambda w\).

Lastly, notice that during gradient descent parameter update, using the L2 regularization ultimately means that every weight is decayed linearly: W += -lambda * W towards zero.

L1 regularization is another relatively common form of regularization, where for each weight \(w\) we add the term \(\lambda \mid w \mid\) to the objective.

Another form of regularization is to enforce an absolute upper bound on the magnitude of the weight vector for every neuron and use projected gradient descent to enforce the constraint.

In practice, this corresponds to performing the parameter update as normal, and then enforcing the constraint by clamping the weight vector \(\vec{w}\) of every neuron to satisfy \(\Vert \vec{w} \Vert_2 < c\). Typical values of \(c\) are on the order of 3 or 4.

Vanilla dropout in an example 3-layer Neural Network would be implemented as in the sketch below. Inside the train_step function we perform dropout twice: on the first hidden layer and on the second hidden layer.
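A sketch reconstructing the snippet the text describes; W1, b1, W2, b2, W3, b3 are assumed to be the network’s parameters, defined elsewhere:

```python
p = 0.5  # probability of keeping a unit active. higher = less dropout

def train_step(X):
    # forward pass for an example 3-layer neural network
    H1 = np.maximum(0, np.dot(W1, X) + b1)
    U1 = np.random.rand(*H1.shape) < p  # first dropout mask
    H1 *= U1                            # drop!
    H2 = np.maximum(0, np.dot(W2, H1) + b2)
    U2 = np.random.rand(*H2.shape) < p  # second dropout mask
    H2 *= U2                            # drop!
    out = np.dot(W3, H2) + b3
    # backward pass: compute gradients... (not shown)
    # perform parameter update... (not shown)

def predict(X):
    # ensembled forward pass: scale the activations by p at test time
    H1 = np.maximum(0, np.dot(W1, X) + b1) * p
    H2 = np.maximum(0, np.dot(W2, H1) + b2) * p
    out = np.dot(W3, H2) + b3
    return out
```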

It can also be shown that performing this attenuation at test time can be related to the process of iterating over all the possible binary masks (and therefore all the exponentially many sub-networks) and computing their ensemble prediction.

Since test-time performance is so critical, it is always preferable to use inverted dropout, which performs the scaling at train time, leaving the forward pass at test time untouched.

Inverted dropout looks as in the sketch below. There has been a large amount of research since the first introduction of dropout that tries to understand the source of its power in practice, and its relation to other regularization techniques.
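A sketch reconstructing the inverted-dropout variant, with the same assumed parameters as above; the only change is dividing the masks by p at train time:

```python
p = 0.5  # probability of keeping a unit active

def train_step(X):
    H1 = np.maximum(0, np.dot(W1, X) + b1)
    U1 = (np.random.rand(*H1.shape) < p) / p  # first dropout mask. Notice /p!
    H1 *= U1                                  # drop!
    H2 = np.maximum(0, np.dot(W2, H1) + b2)
    U2 = (np.random.rand(*H2.shape) < p) / p  # second dropout mask. Notice /p!
    H2 *= U2                                  # drop!
    out = np.dot(W3, H2) + b3

def predict(X):
    # test-time forward pass is left completely untouched
    H1 = np.maximum(0, np.dot(W1, X) + b1)
    H2 = np.maximum(0, np.dot(W2, H1) + b2)
    out = np.dot(W3, H2) + b3
    return out
```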

As we already mentioned in the Linear Classification section, it is not common to regularize the bias parameters because they do not interact with the data through multiplicative interactions, and therefore do not have the interpretation of controlling the influence of a data dimension on the final objective.

For example, a binary classifier for each category independently would take the form \(L_i = \sum_j \max(0, 1 - y_{ij} f_j)\), where the sum is over all categories \(j\), and \(y_{ij}\) is either +1 or -1 depending on whether the i-th example is labeled with the j-th attribute; the score \(f_j\) will be positive when the class is predicted to be present and negative otherwise.

A binary logistic regression classifier has only two classes (0,1), and calculates the probability of class 1 as \(P(y = 1 \mid x; w) = \frac{1}{1 + e^{-w^Tx}} = \sigma(w^Tx)\). Since the probabilities of class 1 and 0 sum to one, the probability for class 0 is \(P(y = 0 \mid x; w) = 1 - P(y = 1 \mid x; w)\).

The expression above can look scary but the gradient on \(f\) is in fact extremely simple and intuitive: \(\partial{L_i} / \partial{f_j} = y_{ij} - \sigma(f_j)\) (as you can double check yourself by taking the derivatives).

The L2 norm squared would compute the loss for a single example of the form \(L_i = \Vert f - y_i \Vert_2^2\). The reason the L2 norm is squared in the objective is that the gradient becomes much simpler, without changing the optimal parameters since squaring is a monotonic operation.

For example, if you are predicting star rating for a product, it might work much better to use 5 independent classifiers for ratings of 1-5 stars instead of a regression loss.

If you’re certain that classification is not appropriate, use the L2 but be careful: For example, the L2 is more fragile and applying dropout in the network (especially in the layer right before the L2 loss) is not a great idea.

Deep Learning Best Practices (1) — Weight Initialization

One of the starting points to take care of while building your network is to initialize your weight matrix correctly.

Let us consider two scenarios that can cause issues while training the model. The first is initializing all the weights to zero. Let’s just put it out there: this makes your model equivalent to a linear model. When you set every weight to 0, the derivative of the loss with respect to every weight is the same, so all weights take the same values in the subsequent iteration and the network never breaks symmetry.

It is important to note that setting the biases to 0 will not create any trouble, as non-zero weights take care of breaking the symmetry; even if the bias is 0, the values in every neuron will still be different.

The second scenario is initializing the weights randomly, following the standard normal distribution (np.random.randn(size_l, size_l-1) in Python); while working with a (deep) network this can potentially lead to two issues: vanishing gradients or exploding gradients.

More specifically, in the case of sigmoid(z) and tanh(z), if your weights are large the activations saturate, so the local gradient of the non-linearity becomes vanishingly small, effectively preventing the weights from changing their value.

With ReLU(z), vanishing gradients are generally not a problem, as the gradient is 0 for negative (and zero) inputs and 1 for positive inputs.

Another impact of exploding gradients is that huge gradient values may cause numeric overflow, resulting in incorrect computations or the introduction of NaNs.

Here, instead of drawing from the standard normal distribution, we draw W from a normal distribution with variance k/n, where k depends on the activation function: k = 2 for ReLU (the He initialization discussed above) and k = 1 for tanh (Xavier initialization).

In TensorFlow this can be written as W = tf.get_variable('W', [dims], initializer) where initializer = tf.contrib.layers.xavier_initializer(). Another commonly used heuristic is to draw from a normal distribution with variance 2/(n_in + n_out), averaging the fan-in and fan-out. These serve as good starting points for initialization and mitigate the chances of exploding or vanishing gradients.
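A sketch of these heuristics in plain numpy; the layer sizes here are hypothetical:

```python
import numpy as np

n_in, n_out = 512, 256  # hypothetical fan-in / fan-out of one layer

W_he     = np.random.randn(n_out, n_in) * np.sqrt(2.0 / n_in)            # k = 2, ReLU
W_xavier = np.random.randn(n_out, n_in) * np.sqrt(1.0 / n_in)            # k = 1, tanh
W_glorot = np.random.randn(n_out, n_in) * np.sqrt(2.0 / (n_in + n_out))  # fan average
b = np.zeros(n_out)  # biases can safely start at zero
```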

The approach will have two major components: a score function that maps the raw data to class scores, and a loss function that quantifies the agreement between the predicted scores and the ground truth labels.

For example, in CIFAR-10 we have a training set of N = 50,000 images, each with D = 32 x 32 x 3 = 3072 pixels, and K = 10, since there are 10 distinct classes (dog, cat, car, etc).

In this module we will start out with arguably the simplest possible function, a linear mapping: \(f(x_i, W, b) = W x_i + b\). In the above equation, we are assuming that the image \(x_i\) has all of its pixels flattened out to a single column vector of shape [D x 1].

In CIFAR-10, \(x_i\) contains all pixels in the i-th image flattened into a single [3072 x 1] column, W is [10 x 3072] and b is [10 x 1], so 3072 numbers come into the function (the raw pixel values) and 10 numbers come out (the class scores).
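A quick sketch of these shapes in numpy; the random values are placeholders, not trained weights:

```python
import numpy as np

D, K = 3072, 10                     # flattened pixels in, class scores out
x_i = np.random.rand(D)             # a hypothetical flattened CIFAR-10 image
W = 0.0001 * np.random.randn(K, D)  # [10 x 3072] weights (placeholder values)
b = np.zeros(K)                     # [10 x 1] biases

scores = W.dot(x_i) + b             # 10 numbers: one score per class
```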

You might expect that the “ship” classifier would then have a lot of positive weights across its blue channel weights (presence of blue increases score of ship), and negative weights in the red/green channels (presence of red/green decreases the score of ship).

We cannot visualize 3072-dimensional spaces, but if we imagine squashing all those dimensions into only two dimensions, then we can try to visualize what the classifier might be doing: As we saw above, every row of \(W\) is a classifier for one of the classes.

In particular, note that without the bias terms, plugging in \( x_i = 0 \) would always give score of zero regardless of the weights, so all lines would be forced to cross the origin.

Another way to think of it is that we are still effectively doing Nearest Neighbor, but instead of having thousands of training images we are only using a single image per class (although we will learn it, and it does not necessarily have to be one of the images in the training set), and we use the (negative) inner product as the distance instead of the L1 or L2 distance.

With a neural network, hidden neurons could instead detect specific car types (e.g. green car facing left, blue car facing front, etc.), and neurons on the next layer could combine these into a more accurate car score through a weighted sum of the individual car detectors.

A commonly used trick is to combine the two sets of parameters into a single matrix that holds both of them by extending the vector \(x_i\) with one additional dimension that always holds the constant \(1\) - a default bias dimension.

With the extra dimension, the new score function will simplify to a single matrix multiply: \(f(x_i, W) = W x_i\). With our CIFAR-10 example, \(x_i\) is now [3073 x 1] instead of [3072 x 1] (with the extra dimension holding the constant 1), and \(W\) is now [10 x 3073] instead of [10 x 3072].
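Continuing the shape sketch above, the trick is a one-line check in numpy:

```python
# bias trick: fold b into W via a constant-1 input dimension
x_ext = np.concatenate([x_i, [1.0]])  # [3073 x 1] input with the bias dimension
W_ext = np.hstack([W, b[:, None]])    # [10 x 3073] combined parameter matrix
assert np.allclose(W_ext.dot(x_ext), W.dot(x_i) + b)  # same scores, one multiply
```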

In the case of images, this corresponds to computing a mean image across the training images and subtracting it from every image, to get images where the pixels range approximately from -127 to 127.

Moreover, we saw that we don’t have control over the data \( (x_i,y_i) \) (it is fixed and given), but we do have control over these weights and we want to set them so that the predicted class scores are consistent with the ground truth labels in the training data.

For example, going back to the example image of a cat and its scores for the classes “cat”, “dog” and “ship”, we saw that the particular set of weights in that example was not very good at all: We fed in the pixels that depict a cat but the cat score came out very low (-96.8) compared to the other classes (dog score 437.9 and ship score 61.95).

Notice that it’s sometimes helpful to anthropomorphise the loss functions as we did above: The SVM “wants” a certain outcome in the sense that the outcome would yield a lower loss (which is good).

Recall that for the i-th example we are given the pixels of image \( x_i \) and the label \( y_i \) that specifies the index of the correct class.

Recall that the Multiclass SVM loss for the i-th example is \(L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + \Delta)\). Suppose the scores are \(s = [13, -7, 11]\), the first class is the true one, and \(\Delta = 10\). The expression sums over all incorrect classes (\(j \neq y_i\)), so we get two terms: \(L_i = \max(0, -7 - 13 + 10) + \max(0, 11 - 13 + 10)\). You can see that the first term gives zero since \(-7 - 13 + 10\) gives a negative number, which is then thresholded to zero with the \(\max(0,\cdot)\) function.

You’ll sometimes hear about people instead using the squared hinge loss SVM (or L2-SVM), which uses the form \(max(0,-)^2\) that penalizes violated margins more strongly (quadratically instead of linearly).

One easy way to see this is that if some parameters W correctly classify all examples (so the loss is zero for each example), then any multiple of these parameters \( \lambda W \) where \( \lambda > 1 \) also gives zero loss, because the transformation uniformly stretches all score magnitudes and hence also their absolute differences.

The most common regularization penalty is the L2 norm, which discourages large weights through an elementwise quadratic penalty over all parameters: \(R(W) = \sum_k \sum_l W_{k,l}^2\). In the expression above, we are summing up all the squared elements of \(W\).

For example, suppose that we have some input vector \(x = [1,1,1,1] \) and two weight vectors \(w_1 = [1,0,0,0]\), \(w_2 = [0.25,0.25,0.25,0.25] \). Then \(w_1^Tx = w_2^Tx = 1\), so both weight vectors lead to the same dot product, yet the L2 penalty of \(w_1\) is 1.0 while the L2 penalty of \(w_2\) is only 0.25.

Since the L2 penalty prefers smaller and more diffuse weight vectors, the final classifier is encouraged to take into account all input dimensions to small amounts rather than a few input dimensions and very strongly.

Here is the loss function (without regularization) implemented in Python, in both unvectorized and half-vectorized form (see the sketch below). The takeaway from this section is that the SVM loss takes one particular approach to measuring how consistent the predictions on training data are with the ground truth labels.
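A sketch reconstructing that snippet; x is a column of image data (with the bias trick applied), y is the integer label, W is the weight matrix, and the margin delta is set to 1:

```python
import numpy as np

def L_i(x, y, W):
    # unvectorized version: multiclass SVM loss for a single example (x, y)
    delta = 1.0
    scores = W.dot(x)                  # class scores, e.g. 10 numbers
    correct_class_score = scores[y]
    D = W.shape[0]                     # number of classes
    loss_i = 0.0
    for j in range(D):
        if j == y:
            continue                   # skip the correct class
        loss_i += max(0, scores[j] - correct_class_score + delta)
    return loss_i

def L_i_vectorized(x, y, W):
    # half-vectorized version: one max over a whole vector of margins
    delta = 1.0
    scores = W.dot(x)
    margins = np.maximum(0, scores - scores[y] + delta)
    margins[y] = 0                     # the correct class contributes no loss
    return np.sum(margins)
```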

The key to understanding this is that the magnitude of the weights \(W\) has direct effect on the scores (and hence also their differences): As we shrink all values inside \(W\) the score differences will become lower, and as we scale up the weights the score differences will all become higher.

Unlike the SVM which treats the outputs \(f(x_i,W)\) as (uncalibrated and possibly difficult to interpret) scores for each class, the Softmax classifier gives a slightly more intuitive output (normalized class probabilities) and also has a probabilistic interpretation that we will describe shortly.

In the Softmax classifier, the function mapping \(f(x_i; W) = W x_i\) stays unchanged, but we now interpret these scores as the unnormalized log probabilities for each class and replace the hinge loss with a cross-entropy loss that has the form \(L_i = -\log\left(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right)\), where we are using the notation \(f_j\) to mean the j-th element of the vector of class scores \(f\).

The function \(f_j(z) = \frac{e^{z_j}}{\sum_k e^{z_k}} \) is called the softmax function: It takes a vector of arbitrary real-valued scores (in \(z\)) and squashes it to a vector of values between zero and one that sum to one.
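When implementing the softmax function, exponentiating large scores can overflow, so a common normalization trick is to first shift the scores so the highest value is zero; a sketch:

```python
import numpy as np

f = np.array([123, 456, 789])      # example: three classes with large scores
f = f - np.max(f)                  # shift so the highest value is 0
p = np.exp(f) / np.sum(np.exp(f))  # softmax: numerically safe, sums to one
```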

The cross-entropy between a “true” distribution \(p\) and an estimated distribution \(q\) is defined as \(H(p,q) = -\sum_x p(x) \log q(x)\). The Softmax classifier is hence minimizing the cross-entropy between the estimated class probabilities ( \(q = e^{f_{y_i}} / \sum_j e^{f_j} \) as seen above) and the “true” distribution, which in this interpretation is the distribution where all probability mass is on the correct class (i.e. \(p = [0, \ldots, 1, \ldots, 0]\), with a single 1 at the \(y_i\)-th position).

Moreover, since the cross-entropy can be written in terms of entropy and the Kullback-Leibler divergence as \(H(p,q) = H(p) + D_{KL}(p||q)\), and the entropy of the delta function \(p\) is zero, this is also equivalent to minimizing the KL divergence between the two distributions (a measure of distance).

A nice feature of this view is that we can now also interpret the regularization term \(R(W)\) in the full loss function as coming from a Gaussian prior over the weight matrix \(W\), where instead of MLE we are performing the Maximum a posteriori (MAP) estimation.

The Softmax classifier gets its name from the softmax function, which is used to squash the raw class scores into normalized positive values that sum to one, so that the cross-entropy loss can be applied.

The reason we put the word “probabilities” in quotes, however, is that how peaky or diffuse these probabilities are depends directly on the regularization strength \(\lambda\) - which you are in charge of as input to the system.

The SVM does not care about the details of the individual scores: if they were instead [10, -100, -100] or [10, 9, 9] the SVM would be indifferent since the margin of 1 is satisfied and hence the loss is zero.

In other words, the Softmax classifier is never fully happy with the scores it produces: the correct class could always have a higher probability and the incorrect classes always a lower probability and the loss would always get better.

This can intuitively be thought of as a feature: For example, a car classifier which is likely spending most of its “effort” on the difficult problem of separating cars from trucks should not be influenced by the frog examples, which it already assigns very low scores to, and which likely cluster around a completely different side of the data cloud.

In summary, we saw one way to take a dataset of images and map each one to class scores based on a set of parameters, and we saw two examples of loss functions that we can use to measure the quality of the predictions.

Everything you need to know about Neural Networks

Forward Propagation: forward propagation is the process of feeding input values to the neural network and getting an output, which we call the predicted value.

The second layer takes the values from the first layer, applies the multiplication, addition and activation operations, and passes the resulting value to the next layer.
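A small sketch of those two layers; the sizes, the sigmoid activation, and the random values are assumptions of this sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.random.rand(3)                        # input values
W1, b1 = np.random.randn(4, 3), np.zeros(4)  # first-layer parameters
W2, b2 = np.random.randn(1, 4), np.zeros(1)  # second-layer parameters

a1 = sigmoid(W1.dot(x) + b1)       # first layer: multiply, add, activate
y_hat = sigmoid(W2.dot(a1) + b2)   # second layer: the predicted value
```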

In the chain rule, we first calculate the derivatives of the error value with respect to the weight values of the last layer.

At each iteration we use back-propagation to calculate the derivative of the loss function with respect to each weight, and subtract it, scaled by the learning rate, from that weight.

The learning rate should be high enough that convergence doesn’t take ages, and low enough that the optimizer actually settles into a (local) minimum instead of overshooting it.
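Put together, one iteration of the update described above might look like this sketch; the parameter and gradient arrays are hypothetical stand-ins:

```python
import numpy as np

learning_rate = 0.01  # too high overshoots the minimum; too low converges slowly

# hypothetical parameters and their gradients from back-propagation
weights   = [np.random.randn(4, 3), np.random.randn(1, 4)]
gradients = [np.random.randn(4, 3), np.random.randn(1, 4)]

for w, dw in zip(weights, gradients):
    w -= learning_rate * dw  # step each weight against its gradient
```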

Normalized Inputs and Initial Weights

This video is part of the Udacity course "Deep Learning".

But what *is* a Neural Network? | Chapter 1, deep learning

Backpropagation in 5 Minutes (tutorial)

Let's discuss the math behind back-propagation. We'll go over the 3 terms from Calculus you need to understand it: derivatives, partial derivatives, and the chain rule.

Adam Optimization Algorithm (C2W2L08)

Lecture 3 | Loss Functions and Optimization

Lecture 3 continues our discussion of linear classifiers. We introduce the idea of a loss function to quantify our unhappiness with a model's predictions.

Gradient descent, how neural networks learn | Chapter 2, deep learning

Linear regression (2): Gradient descent

Gradient and stochastic gradient descent; gradient computation for MSE.

Optimizing Weights

Cross Entropy

This video is part of the Udacity course "Deep Learning".

Lecture 6 | Training Neural Networks I

In Lecture 6 we discuss many practical issues for training modern neural networks. We discuss different activation functions and the importance of data preprocessing.