AI News, Deep Learning: Regularization Notes

Deep Learning: Regularization Notes

In a previous article (long ago, but now I am back!) I talked about overfitting and the problems it causes.

Many regularization approaches are based on limiting the capacity of models, such as neural networks, linear regression, or logistic regression, by adding a parameter norm penalty Ω(θ) to the objective function J.

\(\tilde{J}(\theta; X, y) = J(\theta; X, y) + \alpha \Omega(\theta)\) — {1}, where \(\alpha \in [0, \infty)\) is a hyperparameter that weights the relative contribution of the norm penalty term \(\Omega\) relative to the standard objective function \(J\).

We note that for neural networks, we typically choose to use a parameter norm penalty Ω that penalizes only the weights of the affine transformation at each layer and leaves the biases unregularized.

We therefore use the vector w to indicate all of the weights that should be affected by a norm penalty, while the vector θ denotes all of the parameters, including both w and the unregularized parameters.


Regularization (mathematics)

In mathematics, statistics, and computer science, particularly in the fields of machine learning and inverse problems, regularization is a process of introducing additional information in order to solve an ill-posed problem or to prevent overfitting.[1]

Empirical learning of classifiers (learning from a finite data set) is always an underdetermined problem, because in general we are trying to infer a function of any \(x\) given only examples \(x_1, x_2, \ldots, x_n\).

Concrete notions of complexity used include restrictions for smoothness and bounds on the vector space norm.[2]

From a Bayesian point of view, many regularization techniques correspond to imposing certain prior distributions on model parameters.

Regularization can be used to learn simpler models, induce models to be sparse, introduce group structure into the learning problem, and more.

A simple form of regularization applied to integral equations, generally termed Tikhonov regularization after Andrey Nikolayevich Tikhonov, is essentially a trade-off between fitting the data and reducing a norm of the solution.

More recently, non-linear regularization methods, including total variation regularization, have become popular.

The goal of this learning problem is to find a function \(f\) that fits or predicts the outcome (label) and minimizes the expected error over all possible inputs and labels. In practice, only \(n\) samples are available, so the best surrogate is the empirical error over those samples. Without bounds on the complexity of the function space (formally, the reproducing kernel Hilbert space) available, a model will be learned that incurs zero loss on the surrogate empirical error; if the measurements of the \(x_i\) were made with noise, such a model may overfit and display poor expected error.

Regularization introduces a penalty for exploring certain regions of the function space used to build the model, which can improve generalization.

The most common of these penalties is L2 (Tikhonov) regularization, also known as ridge regression: the squared L2 norm of the weight vector is added to the loss, as in \(\min_w \sum_{i=1}^{n} (y_i - w^\top x_i)^2 + \lambda \|w\|_2^2\). In the case of a general function, we take the norm of the function in its reproducing kernel Hilbert space. As the L2 norm is differentiable, learning problems using L2 regularization can be solved by gradient descent.

Intuitively, a training procedure like gradient descent will tend to learn more and more complex functions as the number of iterations increases.

In practice, early stopping is implemented by training on a training set and measuring accuracy on a statistically independent validation set.

The exact solution to the unregularized least squares learning problem will minimize the empirical error, but may fail to generalize and minimize the expected error.

This early-stopping procedure is equivalent to restricting the number of gradient descent iterations used to minimize the empirical risk.


An example is developing a simple predictive test for a disease in order to minimize the cost of performing medical tests while maximizing predictive power.

Sparsity can be encouraged with the L0 "norm" \(\|w\|_0\), which counts the number of non-zero entries of \(w\). Because this penalty is non-convex and hard to optimize directly, the L1 norm is commonly used as its convex relaxation, and it can be shown that the L1 norm induces sparsity.

A simple example arises when the space of possible solutions lies on a 45-degree line.

The L1 norm can be combined with L2 regularization in elastic net regularization, which adds a penalty of the form \(\lambda (\alpha \|w\|_1 + (1 - \alpha) \|w\|_2^2)\) with \(\alpha \in [0, 1]\) to the loss. Elastic net regularization tends to have a grouping effect, where correlated input features are assigned equal weights.

Although the L1 norm is convex, it is not differentiable, so problems of the form \(\min_{w \in H} F(w) + R(w)\), with \(F\) convex and differentiable and \(R\) convex (such as an L1 penalty), are instead solved by proximal methods: first define the proximal operator of \(R\), and then iterate. The proximal method iteratively performs gradient descent on \(F\) and then projects the result back into the space permitted by \(R\).

Groups of features can be regularized by a sparsity constraint, which can be useful for expressing certain prior knowledge into an optimization problem.

In the case of a linear model with non-overlapping known groups, a regularizer can be defined as the sum of the L2 norms of the groups' weight sub-vectors; this can be viewed as inducing a regularizer that applies the L2 norm over members of each group, followed by the L1 norm over groups.

This can be solved by the proximal method, where the proximal operator is a block-wise soft-thresholding function: The algorithm described for group sparsity without overlaps can be applied to the case where groups do overlap, in certain situations.

In the overlapping case, the regularizer is written in terms of auxiliary vectors \(\bar{w}_g\) that agree with \(w\) on group \(g\) and are zero elsewhere.

The proximal operator cannot be computed in closed form, but can be effectively solved iteratively, inducing an inner iteration within the proximal method iteration.

Regularizers have been designed to guide learning algorithms to learn models that respect the structure of unsupervised training samples.


In multitask learning, a mean-constrained regularizer can be used that constrains the functions learned for each task to be similar to the overall average of the functions across all tasks.

An example is predicting blood iron levels measured at different times of the day, where each task represents a different person.

Well-known model selection techniques include the Akaike information criterion (AIC), minimum description length (MDL), and the Bayesian information criterion (BIC).

Mean subtraction involves subtracting the mean across every individual feature in the data, and has the geometric interpretation of centering the cloud of data around the origin along every dimension.

Normalization refers to scaling the data dimensions so that they are of approximately the same scale. It only makes sense to apply this preprocessing if you have a reason to believe that different input features have different scales (or units), but they should be of approximately equal importance to the learning algorithm. In the case of images, the relative scales of pixels are already approximately equal (and in range from 0 to 255), so it is not strictly necessary to perform this additional preprocessing step.

Then, we can compute the covariance matrix that tells us about the correlation structure in the data: The (i,j) element of the data covariance matrix contains the covariance between i-th and j-th dimension of the data.

To decorrelate the data, we project the original (but zero-centered) data into the eigenbasis: Notice that the columns of U are a set of orthonormal vectors (norm of 1, and orthogonal to each other), so they can be regarded as basis vectors.

This is also sometimes referred to as Principal Component Analysis (PCA) dimensionality reduction: after this operation, we would have reduced the original dataset of size [N x D] to one of size [N x 100], keeping the 100 dimensions of the data that contain the most variance.

The geometric interpretation of this transformation is that if the input data is a multivariable gaussian, then the whitened data will be a gaussian with zero mean and identity covariance matrix.

One weakness of this transformation is that it can greatly exaggerate the noise in the data, since it stretches all dimensions (including the irrelevant dimensions of tiny variance that are mostly noise) to be of equal size in the input.

Note that we do not know what the final value of every weight should be in the trained network, but with proper data normalization it is reasonable to assume that approximately half of the weights will be positive and half of them will be negative.

The idea is that the neurons are all random and unique in the beginning, so they will compute distinct updates and integrate themselves as diverse parts of the full network.

The implementation for one weight matrix might look like W = 0.01* np.random.randn(D,H), where randn samples from a zero mean, unit standard deviation gaussian.

With this formulation, every neuron’s weight vector is initialized as a random vector sampled from a multi-dimensional gaussian, so the neurons point in random directions in the input space.

That is, the recommended heuristic is to initialize each neuron’s weight vector as: w = np.random.randn(n) / sqrt(n), where n is the number of its inputs.

The sketch of the derivation is as follows: Consider the inner product \(s = \sum_i^n w_i x_i\) between the weights \(w\) and input \(x\), which gives the raw activation of a neuron before the non-linearity.

And since \(\text{Var}(aX) = a^2\text{Var}(X)\) for a random variable \(X\) and a scalar \(a\), this implies that we should draw from unit gaussian and then scale it by \(a = \sqrt{1/n}\), to make its variance \(1/n\).

In the paper Understanding the difficulty of training deep feedforward neural networks, Glorot et al. end up recommending an initialization of the form \( \text{Var}(w) = 2/(n_{in} + n_{out}) \) where \(n_{in}, n_{out}\) are the number of units in the previous layer and the next layer.

A more recent paper on this topic, Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification by He et al., derives an initialization specifically for ReLU neurons, reaching the conclusion that the variance of neurons in the network should be \(2.0/n\).

This gives the initialization w = np.random.randn(n) * sqrt(2.0/n), and is the current recommendation for use in practice in the specific case of neural networks with ReLU neurons.

Another way to address the uncalibrated variances problem is to set all weight matrices to zero, but to break symmetry every neuron is randomly connected (with weights sampled from a small gaussian as above) to a fixed number of neurons below it.

For ReLU non-linearities, some people like to use small constant value such as 0.01 for all biases because this ensures that all ReLU units fire in the beginning and therefore obtain and propagate some gradient.

However, it is not clear if this provides a consistent improvement (in fact some results seem to indicate that this performs worse) and it is more common to simply use 0 bias initialization.

A recently developed technique by Ioffe and Szegedy called Batch Normalization alleviates a lot of headaches with properly initializing neural networks by explicitly forcing the activations throughout a network to take on a unit gaussian distribution at the beginning of the training.

In the implementation, applying this technique usually amounts to inserting the BatchNorm layer immediately after fully connected layers (or convolutional layers), and before non-linearities.

It is common to see the factor of \(\frac{1}{2}\) in front because then the gradient of this term with respect to the parameter \(w\) is simply \(\lambda w\) instead of \(2 \lambda w\).

Lastly, notice that during gradient descent parameter update, using the L2 regularization ultimately means that every weight is decayed linearly: W += -lambda * W towards zero.

L1 regularization is another relatively common form of regularization, where for each weight \(w\) we add the term \(\lambda \mid w \mid\) to the objective.

Another form of regularization is to enforce an absolute upper bound on the magnitude of the weight vector for every neuron and use projected gradient descent to enforce the constraint.

In practice, this corresponds to performing the parameter update as normal, and then enforcing the constraint by clamping the weight vector \(\vec{w}\) of every neuron to satisfy \(\Vert \vec{w} \Vert_2 < c\). Typical values of \(c\) are on the order of 3 or 4.

Vanilla dropout in an example 3-layer Neural Network would be implemented as shown below; inside the train_step function we perform dropout twice: on the first hidden layer and on the second hidden layer.

It can also be shown that performing this attenuation at test time can be related to the process of iterating over all the possible binary masks (and therefore all the exponentially many sub-networks) and computing their ensemble prediction.

Since test-time performance is so critical, it is always preferable to use inverted dropout, which performs the scaling at train time, leaving the forward pass at test time untouched.

Inverted dropout looks as shown below.

As we already mentioned in the Linear Classification section, it is not common to regularize the bias parameters because they do not interact with the data through multiplicative interactions, and therefore do not have the interpretation of controlling the influence of a data dimension on the final objective.

For example, a binary classifier for each category independently would take the form \(L_i = \sum_j \max(0, 1 - y_{ij} f_j)\), where the sum is over all categories \(j\), \(y_{ij}\) is either +1 or -1 depending on whether the i-th example is labeled with the j-th attribute, and the score \(f_j\) will be positive when the class is predicted to be present and negative otherwise.

A binary logistic regression classifier has only two classes (0, 1), and calculates the probability of class 1 as \(P(y = 1 \mid x; w) = \sigma(w^\top x)\), where \(\sigma\) is the sigmoid function. Since the probabilities of class 1 and 0 sum to one, the probability for class 0 is \(P(y = 0 \mid x; w) = 1 - P(y = 1 \mid x; w)\).

The expression above can look scary but the gradient on \(f\) is in fact extremely simple and intuitive: \(\partial{L_i} / \partial{f_j} = y_{ij} - \sigma(f_j)\) (as you can double check yourself by taking the derivatives).

The L2 norm squared would compute the loss for a single example of the form \(L_i = \Vert f - y_i \Vert_2^2\). The reason the L2 norm is squared in the objective is that the gradient becomes much simpler, without changing the optimal parameters since squaring is a monotonic operation.

For example, if you are predicting star rating for a product, it might work much better to use 5 independent classifiers for ratings of 1-5 stars instead of a regression loss.

If you’re certain that classification is not appropriate, use the L2 but be careful: For example, the L2 is more fragile and applying dropout in the network (especially in the layer right before the L2 loss) is not a great idea.

The approach will have two major components: a score function that maps the raw data to class scores, and a loss function that quantifies the agreement between the predicted scores and the ground truth labels.

For example, in CIFAR-10 we have a training set of N = 50,000 images, each with D = 32 x 32 x 3 = 3072 pixels, and K = 10, since there are 10 distinct classes (dog, cat, car, etc).

In this module we will start out with arguably the simplest possible function, a linear mapping: \(f(x_i, W, b) = W x_i + b\). In the above equation, we are assuming that the image \(x_i\) has all of its pixels flattened out to a single column vector of shape [D x 1].

In CIFAR-10, \(x_i\) contains all pixels in the i-th image flattened into a single [3072 x 1] column, W is [10 x 3072] and b is [10 x 1], so 3072 numbers come into the function (the raw pixel values) and 10 numbers come out (the class scores).

You might expect that the “ship” classifier would then have a lot of positive weights across its blue channel weights (presence of blue increases score of ship), and negative weights in the red/green channels (presence of red/green decreases the score of ship).

We cannot visualize 3072-dimensional spaces, but if we imagine squashing all those dimensions into only two dimensions, then we can try to visualize what the classifier might be doing: As we saw above, every row of \(W\) is a classifier for one of the classes.

In particular, note that without the bias terms, plugging in \( x_i = 0 \) would always give score of zero regardless of the weights, so all lines would be forced to cross the origin.

Another way to think of it is that we are still effectively doing Nearest Neighbor, but instead of having thousands of training images we are only using a single image per class (although we will learn it, and it does not necessarily have to be one of the images in the training set), and we use the (negative) inner product as the distance instead of the L1 or L2 distance.

A neural network, by contrast, could develop intermediate neurons that detect specific car types (e.g. green car facing left, blue car facing front, etc.), and neurons on the next layer could combine these into a more accurate car score through a weighted sum of the individual car detectors.

A commonly used trick is to combine the two sets of parameters into a single matrix that holds both of them by extending the vector \(x_i\) with one additional dimension that always holds the constant \(1\) - a default bias dimension.

With the extra dimension, the new score function will simplify to a single matrix multiply: With our CIFAR-10 example, \(x_i\) is now [3073 x 1] instead of [3072 x 1] - (with the extra dimension holding the constant 1), and \(W\) is now [10 x 3073] instead of [10 x 3072].

In the case of images, this corresponds to computing a mean image across the training images and subtracting it from every image to get images where the pixels range from approximately [-127 … 127].

Moreover, we saw that we don’t have control over the data \( (x_i,y_i) \) (it is fixed and given), but we do have control over these weights and we want to set them so that the predicted class scores are consistent with the ground truth labels in the training data.

For example, going back to the example image of a cat and its scores for the classes “cat”, “dog” and “ship”, we saw that the particular set of weights in that example was not very good at all: We fed in the pixels that depict a cat but the cat score came out very low (-96.8) compared to the other classes (dog score 437.9 and ship score 61.95).

Notice that it’s sometimes helpful to anthropomorphise the loss functions as we did above: The SVM “wants” a certain outcome in the sense that the outcome would yield a lower loss (which is good).

Recall that for the i-th example we are given the pixels of image \( x_i \) and the label \( y_i \) that specifies the index of the correct class.

Suppose the scores are \(s = [13, -7, 11]\), the first class is the true class, and the margin is \(\Delta = 10\). The multiclass SVM loss \(L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + \Delta)\) sums over all incorrect classes (\(j \neq y_i\)), so we get two terms: \(\max(0, -7 - 13 + 10) + \max(0, 11 - 13 + 10)\). You can see that the first term gives zero since \(-7 - 13 + 10\) gives a negative number, which is then thresholded to zero with the \(\max(0,-)\) function.

You’ll sometimes hear about people instead using the squared hinge loss SVM (or L2-SVM), which uses the form \(max(0,-)^2\) that penalizes violated margins more strongly (quadratically instead of linearly).

One easy way to see this is that if some parameters W correctly classify all examples (so loss is zero for each example), then any multiple of these parameters \( \lambda W \) where \( \lambda > 1 \) will also give zero loss, because the transformation uniformly stretches all score magnitudes and hence also their absolute differences.

The most common regularization penalty is the L2 norm, which discourages large weights through an elementwise quadratic penalty over all parameters: \(R(W) = \sum_k \sum_l W_{k,l}^2\). In the expression above, we are summing up all the squared elements of \(W\).

For example, suppose that we have some input vector \(x = [1,1,1,1] \) and two weight vectors \(w_1 = [1,0,0,0]\), \(w_2 = [0.25,0.25,0.25,0.25] \). Then \(w_1^\top x = w_2^\top x = 1\), so both weight vectors lead to the same dot product, but the L2 penalty of \(w_1\) is 1.0 while the L2 penalty of \(w_2\) is only 0.25.

Since the L2 penalty prefers smaller and more diffuse weight vectors, the final classifier is encouraged to take into account all input dimensions to small amounts rather than a few input dimensions and very strongly.

Here is the loss function (without regularization) implemented in Python, in both unvectorized and half-vectorized form.

The key to understanding this is that the magnitude of the weights \(W\) has direct effect on the scores (and hence also their differences): As we shrink all values inside \(W\) the score differences will become lower, and as we scale up the weights the score differences will all become higher.

Unlike the SVM which treats the outputs \(f(x_i,W)\) as (uncalibrated and possibly difficult to interpret) scores for each class, the Softmax classifier gives a slightly more intuitive output (normalized class probabilities) and also has a probabilistic interpretation that we will describe shortly.

In the Softmax classifier, the function mapping \(f(x_i; W) = W x_i\) stays unchanged, but we now interpret these scores as the unnormalized log probabilities for each class and replace the hinge loss with a cross-entropy loss that has the form \(L_i = -\log\left(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right)\), where we are using the notation \(f_j\) to mean the j-th element of the vector of class scores \(f\).

The function \(f_j(z) = \frac{e^{z_j}}{\sum_k e^{z_k}} \) is called the softmax function: It takes a vector of arbitrary real-valued scores (in \(z\)) and squashes it to a vector of values between zero and one that sum to one.

The cross-entropy between a “true” distribution \(p\) and an estimated distribution \(q\) is defined as \(H(p, q) = -\sum_x p(x) \log q(x)\). The Softmax classifier is hence minimizing the cross-entropy between the estimated class probabilities ( \(q = e^{f_{y_i}} / \sum_j e^{f_j} \) as seen above) and the “true” distribution, which in this interpretation is the distribution where all probability mass is on the correct class (i.e. \(p = [0, \ldots, 1, \ldots, 0]\), with a single 1 at the \(y_i\)-th position).

Moreover, since the cross-entropy can be written in terms of entropy and the Kullback-Leibler divergence as \(H(p,q) = H(p) + D_{KL}(p||q)\), and the entropy of the delta function \(p\) is zero, this is also equivalent to minimizing the KL divergence between the two distributions (a measure of distance).

A nice feature of this view is that we can now also interpret the regularization term \(R(W)\) in the full loss function as coming from a Gaussian prior over the weight matrix \(W\), where instead of MLE we are performing the Maximum a posteriori (MAP) estimation.

The Softmax classifier gets its name from the softmax function, which is used to squash the raw class scores into normalized positive values that sum to one, so that the cross-entropy loss can be applied.

The reason we put the word “probabilities” in quotes, however, is that how peaky or diffuse these probabilities are depends directly on the regularization strength \(\lambda\) - which you are in charge of as input to the system.

The SVM does not care about the details of the individual scores: if they were instead [10, -100, -100] or [10, 9, 9] the SVM would be indifferent since the margin of 1 is satisfied and hence the loss is zero.

In other words, the Softmax classifier is never fully happy with the scores it produces: the correct class could always have a higher probability and the incorrect classes always a lower probability and the loss would always get better.

This can intuitively be thought of as a feature: For example, a car classifier which is likely spending most of its “effort” on the difficult problem of separating cars from trucks should not be influenced by the frog examples, which it already assigns very low scores to, and which likely cluster around a completely different side of the data cloud.

In summary, we saw one way to take a dataset of images and map each one to class scores based on a set of parameters, and we saw two examples of loss functions that we can use to measure the quality of the predictions.

Lecture 3 | Loss Functions and Optimization

Lecture 3 continues our discussion of linear classifiers. We introduce the idea of a loss function to quantify our unhappiness with a model's predictions, and ...

Lecture 18 - Epilogue

Epilogue - The map of machine learning. Brief views of Bayesian learning and aggregation methods. Lecture 18 of 18 of Caltech's Machine Learning Course ...

Mod-09 Lec-33 Kernel Functions for nonlinear SVMs; Mercer and positive definite Kernels

Pattern Recognition by Prof. P.S. Sastry, Department of Electronics & Communication Engineering, IISc Bangalore. For more details on NPTEL visit ...

Tutorial 3.3: Lorenzo Rosasco - Machine Learning Part 3

MIT RES.9-003 Brains, Minds and Machines Summer Course, Summer 2015. View the complete course: Instructor: Lorenzo Rosasco ...


Ethics of AI

Detecting people, optimising logistics, providing translations, composing art: artificial intelligence (AI) systems are not only changing what and how we are doing ...

Support vector machine

In machine learning, support vector machines (SVMs, also support vector networks) are supervised learning models with associated learning algorithms that ...

Least squares

The method of least squares is a standard approach to the approximate solution of overdetermined systems, i.e., sets of equations in which there are more ...

