AI News: Single-Layer Neural Networks and Gradient Descent


This article offers a brief glimpse of the history and basic concepts of machine learning.

We will take a look at the first algorithmically described neural network and the gradient descent algorithm in the context of adaptive linear neurons, which will not only introduce the principles of machine learning but also serve as the basis for modern multilayer neural networks in future articles.

Thanks to machine learning, we enjoy robust email spam filters, convenient text and voice recognition, reliable web search engines, challenging chess players, and, hopefully soon, safe and efficient self-driving cars.

The perceptron is not only the first algorithmically described learning algorithm [1], but it is also very intuitive, easy to implement, and a good entry point to the (re-discovered) modern state-of-the-art machine learning algorithms: Artificial neural networks (or “deep learning” if you like).

To put the perceptron algorithm into the broader context of machine learning: the perceptron belongs to the category of supervised learning algorithms, more specifically single-layer binary linear classifiers.

Next, we define an activation function g(z) that takes a linear combination of the input values \mathbf{x} and weights \mathbf{w} as input (z = w_1x_{1} + \dots + w_mx_{m}), and if g(z) is greater than a defined threshold \theta we predict 1, and -1 otherwise;

in this case, this activation function g is an alternative form of a simple “unit step function,” which is sometimes also called the “Heaviside step function.” (Please note that the unit step function is classically defined as being equal to 0 if z < 0 and 1 if z \geq 0; here, we use the class labels -1 and 1 instead.)
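This step activation can be sketched in a few lines of NumPy (a minimal sketch; the default threshold \theta = 0 and the absence of a bias term are my simplifications, not code from the article):

```python
import numpy as np

def net_input(x, w):
    # z = w_1*x_1 + ... + w_m*x_m, the linear combination from the text
    return np.dot(x, w)

def predict(x, w, theta=0.0):
    # Unit step activation: predict 1 if z >= theta, else -1
    return np.where(net_input(x, w) >= theta, 1, -1)
```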

To summarize the main points from the previous section: a perceptron receives multiple input signals, and if the sum of the input signals exceeds a certain threshold, it returns a signal and remains “silent” otherwise.

What made this a “machine learning” algorithm was Frank Rosenblatt’s idea of the perceptron learning rule: the perceptron algorithm is about learning the weights for the input signals in order to draw a linear decision boundary that allows us to discriminate between the two linearly separable classes +1 and -1.

Rosenblatt’s initial perceptron rule is fairly simple and can be summarized by the following steps: initialize the weights to 0 or small random numbers; then, for each training sample, calculate the output value and update the weights. The output value is the class label predicted by the unit step function that we defined earlier (output = g(z)), and the weight update can be written more formally as w_j := w_j + \Delta w_j.

The value for updating the weights at each increment is calculated by the learning rule \Delta w_j = \eta(\text{target}^{(i)} - \text{output}^{(i)})\;x^{(i)}_{j}, where \eta is the learning rate (a constant between 0.0 and 1.0), “target” is the true class label, and “output” is the predicted class label.

It is important to note that the weights remain unchanged if the prediction is correct, e.g., \Delta w_j = \eta(1^{(i)} - 1^{(i)})\;x^{(i)}_{j} = 0. However, in case of a wrong prediction, the weights are being “pushed” towards the direction of the positive or negative target class, respectively: \Delta w_j = \eta(1^{(i)} - -1^{(i)})\;x^{(i)}_{j} = \eta(2)\;x^{(i)}_{j} and \Delta w_j = \eta(-1^{(i)} - 1^{(i)})\;x^{(i)}_{j} = \eta(-2)\;x^{(i)}_{j}.

If the two classes can’t be separated by a linear decision boundary, we can set a maximum number of passes over the training dataset (“epochs”) and/or a threshold for the number of tolerated misclassifications.
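Putting the rule and the epoch cap together, a minimal training-loop sketch might look as follows (the function name, zero-initialized weights, and zero threshold are my choices for illustration, not the article's code):

```python
import numpy as np

def perceptron_fit(X, y, eta=0.1, epochs=10):
    """Rosenblatt's perceptron rule: w_j := w_j + eta * (target - output) * x_j."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):          # cap the number of passes ("epochs")
        errors = 0
        for xi, target in zip(X, y):
            output = 1 if np.dot(xi, w) >= 0.0 else -1   # unit step prediction
            update = eta * (target - output)             # 0 if prediction is correct
            w += update * xi
            errors += int(update != 0.0)
        if errors == 0:              # converged on linearly separable data
            break
    return w
```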

Our intuition tells us that a decision boundary with a large margin between the classes (as indicated by the dashed line in the figure below) likely has a better generalization error than the decision boundary of the perceptron.

In contrast to the perceptron rule, the delta rule of the adaline (also known as the “Widrow-Hoff rule” or Adaline rule) updates the weights based on a linear activation function rather than a unit step function;

We define the sum of squared errors (SSE) cost function J(\mathbf{w}) = \frac{1}{2} \sum_i \big(t^{(i)} - o^{(i)}\big)^2. (The fraction \frac{1}{2} is just used for convenience to derive the gradient, as we will see in the next paragraphs.) In order to minimize the SSE cost function, we will use gradient descent, a simple yet useful optimization algorithm that is often used in machine learning to find a local minimum of a cost function.

As mentioned above, the weights are updated by taking a step in the opposite direction of the gradient, \Delta \mathbf{w} = - \eta \nabla J(\mathbf{w}); thus, we have to compute the partial derivative of the cost function for each weight in the weight vector: \Delta w_j = - \eta \frac{\partial J}{\partial w_j}.

The partial derivative of the SSE cost function for a particular weight w_j can be calculated as follows (t = target, o = output): \frac{\partial J}{\partial w_j} = \frac{\partial}{\partial w_j} \frac{1}{2} \sum_i (t^{(i)} - o^{(i)})^2 = \sum_i (t^{(i)} - o^{(i)})(- x^{(i)}_{j}). And if we plug the result back into the learning rule, we get \Delta w_j = - \eta \frac{\partial J}{\partial w_j} = - \eta \sum_i (t^{(i)} - o^{(i)})(- x^{(i)}_{j}) = \eta \sum_i (t^{(i)} - o^{(i)})x^{(i)}_{j}. Eventually, we can apply a simultaneous weight update similar to the perceptron rule: \mathbf{w} := \mathbf{w} + \Delta \mathbf{w}.
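The batch gradient descent update derived above can be sketched as follows (a minimal sketch; the learning rate, epoch count, and function name are illustrative assumptions, not the article's code):

```python
import numpy as np

def adaline_fit(X, y, eta=0.01, epochs=50):
    """Batch gradient descent on the SSE cost J(w) = 0.5 * sum((t - o)^2),
    with the linear activation o = X @ w."""
    w = np.zeros(X.shape[1])
    costs = []
    for _ in range(epochs):
        output = X @ w              # linear activation for ALL samples
        errors = y - output         # (t - o) for every sample
        w += eta * X.T @ errors     # Δw = η Σ_i (t - o) x, all weights at once
        costs.append(0.5 * (errors ** 2).sum())
    return w, costs
```

Note that, unlike the perceptron rule, every update sums the errors over the whole training set before the weights move.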

Another advantage of online learning is that the classifier can be immediately updated as new training data arrives, e.g., in web applications, and old training data can be discarded if storage is an issue.
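A sketch of how such an online (stochastic) update might look, processing one new sample at a time instead of the whole batch (the helper name is mine, for illustration only):

```python
import numpy as np

def adaline_sgd_step(w, xi, target, eta=0.01):
    # One stochastic gradient descent step: compute the error for a single
    # incoming sample and update the weights immediately, Δw = η (t - o) x.
    error = target - np.dot(xi, w)
    return w + eta * error * xi
```

Because each sample is used once and then no longer needed for the update, old training data can indeed be discarded as the text describes.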

In later articles, we will take a look at different approaches to dynamically adjust the learning rate, the concepts of “One-vs-All” and “One-vs-One” for multi-class classification, regularization to overcome overfitting by introducing additional information, dealing with nonlinear problems and multilayer neural networks, different activation functions for artificial neurons, and related concepts such as logistic regression and support vector machines.
