# AI News, Implementing a Neural Network from Scratch in Python &#8211; An Introduction

## Implementing a Neural Network from Scratch in Python &#8211; An Introduction

Get the code: To follow along, all the code is also available as an iPython notebook on Github.

You can think of the blue dots as male patients and the red dots as female patients, with the x- and y- axis being medical measurements.

Our goal is to train a Machine Learning classifier that predicts the correct class (male of female) given the x- and y- coordinates.

This means that linear classifiers, such as Logistic Regression, won&#8217;t be able to fit the data unless you hand-engineer non-linear features (such as polynomials) that work well for the given dataset.

(Because we only have 2 classes we could actually get away with only one output node predicting 0 or 1, but having 2 makes it easier to extend the network to more classes later on).

The input to the network will be x- and y- coordinates and its output will be two probabilities, one for class 0 (&#8220;female&#8221;) and one for class 1 (&#8220;male&#8221;).

Because we want our network to output probabilities the activation function for the output layer will be the softmax, which is simply a way to convert raw scores to probabilities.

Our network makes predictions using forward propagation, which is just a bunch of matrix multiplications and the application of the activation function(s) we defined above.

is the input of layer and is the output of layer after applying the activation function.

If we have training examples and classes then the loss for our prediction with respect to the true labels is given by:

The formula looks complicated, but all it really does is sum over our training examples and add to the loss if we predicted the incorrect class.

The further away the two probability distributions (the correct labels) and (our predictions) are, the greater our loss will be.

We can use gradient descent to find the minimum and I will implement the most vanilla version of gradient descent, also called batch gradient descent with a fixed learning rate.

As an input, gradient descent needs the gradients (vector of derivatives) of the loss function with respect to our parameters: , , , .

To calculate these gradients we use the famous backpropagation algorithm, which is a way to efficiently calculate the gradients starting from the output.

We start by defining some useful variables and parameters for gradient descent: First let&#8217;s implement the loss function we defined above.

If we were to evaluate our model on a separate test set (and you should!) the model with a smaller hidden layer size would likely perform better due to better generalization.

Here are some things you can try to become more familiar with the code: All of the code is available as an iPython notebook on Github. Please leave questions or feedback in the comments!

In the section on linear classification we computed scores for different visual categories given the image using the formula $$s = W x$$, where $$W$$ was a matrix and $$x$$ was an input column vector containing all pixel data of the image.

In the case of CIFAR-10, $$x$$ is a [3072x1] column vector, and $$W$$ is a [10x3072] matrix, so that the output scores is a vector of 10 class scores.

There are several choices we could make for the non-linearity (which we’ll study below), but this one is a common choice and simply thresholds all activations that are below zero to zero.

Notice that the non-linearity is critical computationally - if we left it out, the two matrices could be collapsed to a single matrix, and therefore the predicted class scores would again be a linear function of the input.

three-layer neural network could analogously look like $$s = W_3 \max(0, W_2 \max(0, W_1 x))$$, where all of $$W_3, W_2, W_1$$ are parameters to be learned.

The area of Neural Networks has originally been primarily inspired by the goal of modeling biological neural systems, but has since diverged and become a matter of engineering and achieving good results in Machine Learning tasks.

Approximately 86 billion neurons can be found in the human nervous system and they are connected with approximately 10^14 - 10^15 synapses.

The idea is that the synaptic strengths (the weights $$w$$) are learnable and control the strength of influence (and its direction: excitory (positive weight) or inhibitory (negative weight)) of one neuron on another.

Based on this rate code interpretation, we model the firing rate of the neuron with an activation function $$f$$, which represents the frequency of the spikes along the axon.

Historically, a common choice of activation function is the sigmoid function $$\sigma$$, since it takes a real-valued input (the signal strength after the sum) and squashes it to range between 0 and 1.

An example code for forward-propagating a single neuron might look as follows: In other words, each neuron performs a dot product with the input and its weights, adds the bias and applies the non-linearity (or activation function), in this case the sigmoid $$\sigma(x) = 1/(1+e^{-x})$$.

As we saw with linear classifiers, a neuron has the capacity to “like” (activation near one) or “dislike” (activation near zero) certain linear regions of its input space.

With this interpretation, we can formulate the cross-entropy loss as we have seen in the Linear Classification section, and optimizing it would lead to a binary Softmax classifier (also known as logistic regression).

The regularization loss in both SVM/Softmax cases could in this biological view be interpreted as gradual forgetting, since it would have the effect of driving all synaptic weights $$w$$ towards zero after every parameter update.

The sigmoid non-linearity has the mathematical form $$\sigma(x) = 1 / (1 + e^{-x})$$ and is shown in the image above on the left.

The sigmoid function has seen frequent use historically since it has a nice interpretation as the firing rate of a neuron: from not firing at all (0) to fully-saturated firing at an assumed maximum frequency (1).

Also note that the tanh neuron is simply a scaled sigmoid neuron, in particular the following holds: $$\tanh(x) = 2 \sigma(2x) -1$$.

Other types of units have been proposed that do not have the functional form $$f(w^Tx + b)$$ where a non-linearity is applied on the dot product between the weights and the data.

TLDR: “What neuron type should I use?” Use the ReLU non-linearity, be careful with your learning rates and possibly monitor the fraction of “dead” units in a network.

For regular neural networks, the most common layer type is the fully-connected layer in which neurons between two adjacent layers are fully pairwise connected, but neurons within a single layer share no connections.

Working with the two example networks in the above picture: To give you some context, modern Convolutional Networks contain on orders of 100 million parameters and are usually made up of approximately 10-20 layers (hence deep learning).

The full forward pass of this 3-layer neural network is then simply three matrix multiplications, interwoven with the application of the activation function: In the above code, W1,W2,W3,b1,b2,b3 are the learnable parameters of the network.

Notice also that instead of having a single input column vector, the variable x could hold an entire batch of training data (where each input example would be a column of x) and then all examples would be efficiently evaluated in parallel.

Neural Networks work well in practice because they compactly express nice, smooth functions that fit well with the statistical properties of data we encounter in practice, and are also easy to learn using our optimization algorithms (e.g.

Similarly, the fact that deeper networks (with multiple hidden layers) can work better than a single-hidden-layer networks is an empirical observation, despite the fact that their representational power is equal.

As an aside, in practice it is often the case that 3-layer neural networks will outperform 2-layer nets, but going even deeper (4,5,6-layer) rarely helps much more.

We could train three separate neural networks, each with one hidden layer of some size and obtain the following classifiers: In the diagram above, we can see that Neural Networks with more neurons can express more complicated functions.

For example, the model with 20 hidden neurons fits all the training data but at the cost of segmenting the space into many disjoint red and green decision regions.

The subtle reason behind this is that smaller networks are harder to train with local methods such as Gradient Descent: It’s clear that their loss functions have relatively few local minima, but it turns out that many of these minima are easier to converge to, and that they are bad (i.e.

Conversely, bigger neural networks contain significantly more local minima, but these minima turn out to be much better in terms of their actual loss.

In practice, what you find is that if you train a small network the final loss can display a good amount of variance - in some cases you get lucky and converge to a good place but in some cases you get trapped in one of the bad minima.

## Multilayer perceptron

A multilayer perceptron (MLP) is a class of feedforward artificial neural network.

MLP utilizes a supervised learning technique called backpropagation for training.[1][2] Its multiple layers and non-linear activation distinguish MLP from a linear perceptron.

It can distinguish data that is not linearly separable.[3] Multilayer perceptrons are sometimes colloquially referred to as 'vanilla' neural networks, especially when they have a single hidden layer.[4]

If a multilayer perceptron has a linear activation function in all neurons, that is, a linear function that maps the weighted inputs to the output of each neuron, then linear algebra shows that any number of layers can be reduced to a two-layer input-output model.

In MLPs some neurons use a nonlinear activation function that was developed to model the frequency of action potentials, or firing, of biological neurons.

The two common activation functions are both sigmoids, and are described by The first is a hyperbolic tangent that ranges from -1 to 1, while the other is the logistic function, which is similar in shape but ranges from 0 to 1.

More specialized activation functions include radial basis functions (used in radial basis networks, another class of supervised neural network models).

The MLP consists of three or more layers (an input and an output layer with one or more hidden layers) of nonlinearly-activating nodes making it a deep neural network.

Learning occurs in the perceptron by changing connection weights after each piece of data is processed, based on the amount of error in the output compared to the expected result.

This is an example of supervised learning, and is carried out through backpropagation, a generalization of the least mean squares algorithm in the linear perceptron.

The node weights are adjusted based on corrections that minimize the error in the entire output, given by Using gradient descent, the change in each weight is where

is the learning rate, which is selected to ensure that the weights quickly converge to a response, without oscillations.

The analysis is more difficult for the change in weights to a hidden node, but it can be shown that the relevant derivative is This depends on the change in weights of the

So to change the hidden layer weights, the output layer weights change according to the derivative of the activation function, and so this algorithm represents a backpropagation of the activation function.[5]

True perceptrons are formally a special case of artificial neurons that use a threshold activation function such as the Heaviside step function.

A true perceptron performs binary classification (either this or that), an MLP neuron is free to either perform classification or regression, depending upon its activation function.

The term 'multilayer perceptron' later was applied without respect to nature of the nodes/layers, which can be composed of arbitrarily defined artificial neurons, and not perceptrons specifically.

This interpretation avoids the loosening of the definition of 'perceptron' to mean an artificial neuron in general.

MLPs are useful in research for their ability to solve problems stochastically, which often allows approximate solutions for extremely complex problems like fitness approximation.

MLPs are universal function approximators as showed by Cybenko's theorem,[3] so they can be used to create mathematical models by regression analysis.

As classification is a particular case of regression when the response variable is categorical, MLPs make good classifier algorithms.

MLPs were a popular machine learning solution in the 1980s, finding applications in diverse fields such as speech recognition, image recognition, and machine translation software,[6] but thereafter faced strong competition from much simpler (and related[7]) support vector machines.

What is a Neural Network - Ep. 2 (Deep Learning SIMPLIFIED)

With plenty of machine learning tools currently available, why would you ever choose an artificial neural network over all the rest? This clip and the next could open your eyes to their awesome...

Neural Networks (1): Basics

The basic form of a feed-forward multi-layer perceptron / neural network; example activation functions.

Deep Learning with Tensorflow - Activation Functions

Enroll in the course for free at: Deep Learning with TensorFlow Introduction The majority of data in the world is unlabeled..

Activation Functions in Neural Networks (Sigmoid, ReLU, tanh, softmax)

Activation Functions in Neural Networks are used to contain the output between fixed values and also add a non linearity to the output. Activation Functions play an important role in Machine...

25. Computing a Neural Network's Output

Overview of a neural network with a hidden layer, 9/2/2015

Radial Basis Function Artificial Neural Networks

My web page:

Neural networks [2.4] : Training neural networks - hidden layer gradient

XOR as Perceptron Network Quiz Solution - Georgia Tech - Machine Learning