Deep Learning: Regularization Notes

In a previous article (long ago, but now I am back!) I talked about overfitting and the problems it causes.

Many regularization approaches are based on limiting the capacity of models, such as neural networks, linear regression, or logistic regression, by adding a parameter norm penalty Ω(θ) to the objective function J.

The regularized objective function is \(\tilde{J}(\theta; X, y) = J(\theta; X, y) + \alpha\Omega(\theta)\) — {1}, where α ∈ [0, ∞) is a hyperparameter that weights the relative contribution of the norm penalty term Ω relative to the standard objective function J.

We note that for neural networks, we typically choose to use a parameter norm penalty Ω that penalizes only the weights of the affine transformation at each layer and leaves the biases unregularized.

We therefore use the vector w to indicate all of the weights that should be affected by a norm penalty, while the vector θ denotes all of the parameters, including both w and the unregularized parameters.

Mean subtraction is the most common form of data preprocessing: it involves subtracting the mean across every individual feature in the data, and has the geometric interpretation of centering the cloud of data around the origin along every dimension.

Normalization refers to scaling the data dimensions so that they are of approximately the same scale. It only makes sense to apply this preprocessing if you have a reason to believe that different input features have different scales (or units) but should be of approximately equal importance to the learning algorithm. In the case of images, the relative scales of pixels are already approximately equal (and in the range from 0 to 255), so it is not strictly necessary to perform this additional preprocessing step.
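As a rough sketch (assuming numpy and that the data is stored in an [N x D] matrix X, with one example per row):

X -= np.mean(X, axis=0)  # zero-center: subtract the per-feature mean
X /= np.std(X, axis=0)   # normalize: scale each feature to unit standard deviation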

Then, we can compute the covariance matrix that tells us about the correlation structure in the data: The (i,j) element of the data covariance matrix contains the covariance between i-th and j-th dimension of the data.
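Concretely, a sketch assuming the zero-centered [N x D] matrix X from above:

cov = np.dot(X.T, X) / X.shape[0]  # [D x D] covariance matrix; entry (i, j) is the covariance of dimensions i and j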

To decorrelate the data, we project the original (but zero-centered) data into the eigenbasis: Notice that the columns of U are a set of orthonormal vectors (norm of 1, and orthogonal to each other), so they can be regarded as basis vectors.
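A sketch of that projection (np.linalg.svd of the symmetric covariance matrix returns its eigenvectors, sorted by eigenvalue, in the columns of U):

U, S, V = np.linalg.svd(cov)  # columns of U are orthonormal eigenvectors; S holds the eigenvalues
Xrot = np.dot(X, U)           # decorrelate the data by rotating it into the eigenbasis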

This is also sometimes referred to as Principal Component Analysis (PCA) dimensionality reduction: After this operation, we would have reduced the original dataset of size [N x D] to one of size [N x 100], keeping the 100 dimensions of the data that contain the most variance.
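The reduction described above then amounts to keeping only the first 100 eigenvectors (a sketch):

Xrot_reduced = np.dot(X, U[:, :100])  # [N x 100]: the 100 directions of largest variance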

Whitening takes the data in the eigenbasis and divides every dimension by the square root of the corresponding eigenvalue to normalize the scale. The geometric interpretation of this transformation is that if the input data is a multivariate gaussian, then the whitened data will be a gaussian with zero mean and identity covariance matrix.

One weakness of this transformation is that it can greatly exaggerate the noise in the data, since it stretches all dimensions (including the irrelevant dimensions of tiny variance that are mostly noise) to be of equal size in the input.
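A sketch of the whitening step; the small constant added in the denominator (here 1e-5, an arbitrary choice) is the usual way to damp exactly this amplification of near-zero-variance directions:

Xwhite = Xrot / np.sqrt(S + 1e-5)  # scale every eigendirection to approximately unit variance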

Note that we do not know what the final value of every weight should be in the trained network, but with proper data normalization it is reasonable to assume that approximately half of the weights will be positive and half of them will be negative.

The idea is that the neurons are all random and unique in the beginning, so they will compute distinct updates and integrate themselves as diverse parts of the full network.

The implementation for one weight matrix might look like W = 0.01* np.random.randn(D,H), where randn samples from a zero mean, unit standard deviation gaussian.

With this formulation, every neuron’s weight vector is initialized as a random vector sampled from a multi-dimensional gaussian, so the neurons point in random directions in the input space.

That is, the recommended heuristic is to initialize each neuron’s weight vector as: w = np.random.randn(n) / sqrt(n), where n is the number of its inputs.

The sketch of the derivation is as follows: Consider the inner product \(s = \sum_i^n w_i x_i\) between the weights \(w\) and input \(x\), which gives the raw activation of a neuron before the non-linearity.
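Writing out the variance of \(s\) (assuming the \(w_i\) and \(x_i\) are independent, identically distributed, and zero-mean) supplies the intermediate step:

\[ \text{Var}(s) = \text{Var}\Big(\sum_i^n w_i x_i\Big) = \sum_i^n \text{Var}(w_i x_i) = \sum_i^n \text{Var}(w_i)\,\text{Var}(x_i) = n\,\text{Var}(w)\,\text{Var}(x), \]

so keeping the variance of \(s\) equal to that of the inputs requires \(\text{Var}(w) = 1/n\).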

And since \(\text{Var}(aX) = a^2\text{Var}(X)\) for a random variable \(X\) and a scalar \(a\), this implies that we should draw from unit gaussian and then scale it by \(a = \sqrt{1/n}\), to make its variance \(1/n\).

In the paper Understanding the difficulty of training deep feedforward neural networks by Glorot and Bengio, the authors end up recommending an initialization of the form \( \text{Var}(w) = 2/(n_{in} + n_{out}) \), where \(n_{in}, n_{out}\) are the number of units in the previous layer and the next layer.

A more recent paper on this topic, Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification by He et al., derives an initialization specifically for ReLU neurons, reaching the conclusion that the variance of neurons in the network should be \(2.0/n\).

This gives the initialization w = np.random.randn(n) * sqrt(2.0/n), and is the current recommendation for use in practice in the specific case of neural networks with ReLU neurons.

Another way to address the uncalibrated variances problem is to set all weight matrices to zero, but to break symmetry every neuron is randomly connected (with weights sampled from a small gaussian as above) to a fixed number of neurons below it.

For ReLU non-linearities, some people like to use small constant value such as 0.01 for all biases because this ensures that all ReLU units fire in the beginning and therefore obtain and propagate some gradient.

However, it is not clear if this provides a consistent improvement (in fact some results seem to indicate that this performs worse) and it is more common to simply use 0 bias initialization.

A recently developed technique by Ioffe and Szegedy called Batch Normalization alleviates a lot of headaches with properly initializing neural networks by explicitly forcing the activations throughout a network to take on a unit gaussian distribution at the beginning of the training.

In the implementation, applying this technique usually amounts to inserting the BatchNorm layer immediately after fully connected layers (or convolutional layers, as we’ll soon see), and before non-linearities.
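A minimal sketch of the training-time computation inside such a layer (assuming x is an [N x D] batch of pre-activations, and that gamma and beta are learned per-feature scale and shift parameters; the running statistics used at test time are omitted):

mu = x.mean(axis=0)                     # per-feature mean over the batch
var = x.var(axis=0)                     # per-feature variance over the batch
x_hat = (x - mu) / np.sqrt(var + 1e-5)  # approximately unit-gaussian activations
out = gamma * x_hat + beta              # learned rescaling keeps the layer expressive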

It is common to see the factor of \(\frac{1}{2}\) in front because then the gradient of this term with respect to the parameter \(w\) is simply \(\lambda w\) instead of \(2 \lambda w\).

Lastly, notice that during gradient descent parameter update, using the L2 regularization ultimately means that every weight is decayed linearly: W += -lambda * W towards zero.
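A sketch of how this looks in a vanilla update (assuming dW already holds the gradient of the data loss with respect to W, and reg plays the role of \(\lambda\)):

loss = data_loss + 0.5 * reg * np.sum(W * W)  # the 1/2 factor makes the gradient simply reg * W
dW += reg * W                                 # gradient contribution of the penalty
W += -learning_rate * dW                      # net effect: every weight decays linearly toward zero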

L1 regularization is another relatively common form of regularization, where for each weight \(w\) we add the term \(\lambda \mid w \mid\) to the objective.
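The corresponding sketch for L1 (using sign(W) as the subgradient of |W|):

loss = data_loss + reg * np.sum(np.abs(W))  # L1 penalty
dW += reg * np.sign(W)                      # a constant-magnitude push toward exactly zero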

Another form of regularization is to enforce an absolute upper bound on the magnitude of the weight vector for every neuron and use projected gradient descent to enforce the constraint.

In practice, this corresponds to performing the parameter update as normal, and then enforcing the constraint by clamping the weight vector \(\vec{w}\) of every neuron to satisfy \(\Vert \vec{w} \Vert_2 < c\), where \(c\) is the chosen upper bound.
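One way that clamping might look (a sketch, assuming each column of W holds one neuron's incoming weight vector and c is the chosen upper bound):

norms = np.linalg.norm(W, axis=0, keepdims=True)  # per-neuron weight norms
W *= np.minimum(1.0, c / (norms + 1e-8))          # rescale only the vectors whose norm exceeds c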

Vanilla dropout in an example 3-layer Neural Network would be implemented roughly as in the sketch below. Inside the train_step function we perform dropout twice: on the first hidden layer and on the second hidden layer.
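A minimal sketch (assuming numpy, a keep probability p, and already-initialized parameters W1, b1, W2, b2, W3, b3; the backward pass and parameter update are omitted):

p = 0.5  # probability of keeping a unit active; higher = less dropout

def train_step(X):
  # forward pass for an example 3-layer network
  H1 = np.maximum(0, np.dot(W1, X) + b1)
  U1 = np.random.rand(*H1.shape) < p   # first dropout mask
  H1 *= U1                             # drop units in the first hidden layer
  H2 = np.maximum(0, np.dot(W2, H1) + b2)
  U2 = np.random.rand(*H2.shape) < p   # second dropout mask
  H2 *= U2                             # drop units in the second hidden layer
  out = np.dot(W3, H2) + b3
  # backward pass and parameter update not shown

def predict(X):
  # test-time forward pass: scale the activations by p
  H1 = np.maximum(0, np.dot(W1, X) + b1) * p
  H2 = np.maximum(0, np.dot(W2, H1) + b2) * p
  out = np.dot(W3, H2) + b3
  return out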

It can also be shown that performing this attenuation at test time can be related to the process of iterating over all the possible binary masks (and therefore all the exponentially many sub-networks) and computing their ensemble prediction.

Since test-time performance is so critical, it is always preferable to use inverted dropout, which performs the scaling at train time, leaving the forward pass at test time untouched.

There has been a large amount of research since the first introduction of dropout that tries to understand the source of its power in practice and its relation to other regularization techniques. Inverted dropout looks as follows:
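A minimal sketch, under the same assumptions as the vanilla dropout sketch above (the only change is dividing the dropout masks by p at train time):

p = 0.5  # probability of keeping a unit active

def train_step(X):
  H1 = np.maximum(0, np.dot(W1, X) + b1)
  U1 = (np.random.rand(*H1.shape) < p) / p   # first dropout mask; note the /p
  H1 *= U1
  H2 = np.maximum(0, np.dot(W2, H1) + b2)
  U2 = (np.random.rand(*H2.shape) < p) / p   # second dropout mask; note the /p
  H2 *= U2
  out = np.dot(W3, H2) + b3
  # backward pass and parameter update not shown

def predict(X):
  # no scaling necessary at test time
  H1 = np.maximum(0, np.dot(W1, X) + b1)
  H2 = np.maximum(0, np.dot(W2, H1) + b2)
  out = np.dot(W3, H2) + b3
  return out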

As we already mentioned in the Linear Classification section, it is not common to regularize the bias parameters because they do not interact with the data through multiplicative interactions, and therefore do not have the interpretation of controlling the influence of a data dimension on the final objective.

For example, a binary classifier for each category independently would take the form \(L_i = \sum_j \max(0, 1 - y_{ij} f_j)\), where the sum is over all categories \(j\), \(y_{ij}\) is either +1 or -1 depending on whether the i-th example is labeled with the j-th attribute, and the score \(f_j\) will be positive when the class is predicted to be present and negative otherwise.

A binary logistic regression classifier has only two classes (0, 1), and calculates the probability of class 1 as \(P(y = 1 \mid x; w, b) = \sigma(w^T x + b)\), where \(\sigma\) is the sigmoid function. Since the probabilities of class 1 and 0 sum to one, the probability for class 0 is \(P(y = 0 \mid x; w, b) = 1 - P(y = 1 \mid x; w, b)\).

The expression above can look scary but the gradient on \(f\) is in fact extremely simple and intuitive: \(\partial{L_i} / \partial{f_j} = y_{ij} - \sigma(f_j)\) (as you can double check yourself by taking the derivatives).

The L2 norm squared would compute the loss for a single example of the form \(L_i = \Vert f - y_i \Vert_2^2\). The reason the L2 norm is squared in the objective is that the gradient becomes much simpler, without changing the optimal parameters since squaring is a monotonic operation.

For example, if you are predicting star rating for a product, it might work much better to use 5 independent classifiers for ratings of 1-5 stars instead of a regression loss.

If you’re certain that classification is not appropriate, use the L2 but be careful: For example, the L2 is more fragile and applying dropout in the network (especially in the layer right before the L2 loss) is not a great idea.

Regularization (mathematics)

In mathematics, statistics, and computer science, particularly in the fields of machine learning and inverse problems, regularization is a process of introducing additional information in order to solve an ill-posed problem or to prevent overfitting.[1]

Empirical learning of classifiers (learning from a finite data set) is always an underdetermined problem, because in general we are trying to infer a function of any \(x\) given only examples \(x_1, x_2, \ldots, x_n\).

Concrete notions of complexity used include restrictions for smoothness and bounds on the vector space norm.[2]

From a Bayesian point of view, many regularization techniques correspond to imposing certain prior distributions on model parameters.

Regularization can be used to learn simpler models, induce models to be sparse, introduce group structure into the learning problem, and more.

A simple form of regularization applied to integral equations, generally termed Tikhonov regularization after Andrey Nikolayevich Tikhonov, is essentially a trade-off between fitting the data and reducing a norm of the solution.

More recently, non-linear regularization methods, including total variation regularization, have become popular.

The goal of this learning problem is to find a function that fits or predicts the outcome (label) that minimizes the expected error over all possible inputs and labels.

Without bounds on the complexity of the function space (formally, the reproducing kernel Hilbert space) available, a model will be learned that incurs zero loss on the surrogate empirical error.

Regularization introduces a penalty for exploring certain regions of the function space used to build the model, which can improve generalization.

Intuitively, a training procedure like gradient descent will tend to learn more and more complex functions as the number of iterations increases.

In practice, early stopping is implemented by training on a training set and measuring accuracy on a statistically independent validation set.

The exact solution to the unregularized least squares learning problem will minimize the empirical error, but may fail to generalize and minimize the expected error.

This early-stopping procedure is equivalent to restricting the number of gradient descent iterations used to minimize the empirical risk.

Enforcing a sparsity constraint on the weights can lead to simpler and more interpretable models. An example is developing a simple predictive test for a disease in order to minimize the cost of performing medical tests while maximizing predictive power.

A simple example arises when the space of possible solutions lies on a 45 degree line.

Elastic net regularization tends to have a grouping effect, where correlated input features are assigned equal weights.

The proximal method iteratively performs gradient descent and then projects the result back into the space permitted by the regularization term.

Groups of features can be regularized by a sparsity constraint, which can be useful for expressing certain prior knowledge into an optimization problem.

The algorithm described for group sparsity without overlaps can be applied to the case where groups do overlap, in certain situations.

The proximal operator cannot be computed in closed form, but can be effectively solved iteratively, inducing an inner iteration within the proximal method iteration.

Regularizers have been designed to guide learning algorithms to learn models that respect the structure of unsupervised training samples.

In multi-task learning, a mean-constrained regularizer constrains the functions learned for each task to be similar to the overall average of the functions across all tasks.

An example is predicting blood iron levels measured at different times of the day, where each task represents a different person.

Well-known model selection techniques include the Akaike information criterion (AIC), minimum description length (MDL), and the Bayesian information criterion (BIC).
