# AI News, NVIDIA/DIGITS

## NVIDIA/DIGITS

Weight initialization can critically affect the speed at which a neural network is able to learn.

In this series of examples we will see how a flavor of the canonical LeNet neural network is able to learn from the MNIST dataset, under various weight initialization schemes.

For example, the initial weights can be set to a constant 0.2. The initialization of the bias terms can also be specified through a bias_filler field, although this tends to have a lesser impact on learning, so we will focus on the initialization of weights in the following examples.

Most layers make it straightforward to initialize weights randomly from a uniform distribution over [-std, std] by calling their reset(std) method. However, if you wish to use non-default weight initialization, you can re-initialize an existing model: assuming model points to your existing model, the new model will be initialized according to the 'Xavier' method.

NOTE: at the time of writing, the torch-toolbox project does not handle all layers (most notably cudnn.SpatialConvolution) so you might need to tailor the script to your specific needs.

This performs reasonably well, considering the fact that this is a fully automatic way of setting the initial weights which does not require hand picking the range of the uniform distribution.

Mean subtraction involves subtracting the mean across every individual feature in the data, and has the geometric interpretation of centering the cloud of data around the origin along every dimension.

It only makes sense to apply normalization if you have a reason to believe that different input features have different scales (or units), but they should be of approximately equal importance to the learning algorithm. In the case of images, the relative scales of pixels are already approximately equal (and in the range from 0 to 255), so it is not strictly necessary to perform this additional preprocessing step.
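The two preprocessing steps described above can be sketched in NumPy as follows. This is a minimal illustration, not code from the original notes: the helper names `zero_center` and `normalize`, the small `eps` guard against division by zero, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def zero_center(X):
    """Subtract the per-feature mean so every column has zero mean."""
    return X - np.mean(X, axis=0)

def normalize(X, eps=1e-8):
    """Zero-center, then divide every feature by its standard deviation.

    eps is an assumed guard against division by zero for constant features.
    """
    Xc = zero_center(X)
    return Xc / (np.std(Xc, axis=0) + eps)

rng = np.random.default_rng(0)
# Hypothetical data: 500 examples, 3 features with very different scales.
X = rng.normal(loc=[5.0, -2.0, 100.0], scale=[1.0, 10.0, 0.1], size=(500, 3))

X_centered = zero_center(X)      # cloud of data centered on the origin
X_normalized = normalize(X)      # every feature now has roughly unit scale
```

After normalization every column has approximately unit standard deviation, which is the property that makes features with different units comparable.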

Then, we can compute the covariance matrix, which tells us about the correlation structure in the data. The (i,j) element of the data covariance matrix contains the covariance between the i-th and j-th dimensions of the data.

To decorrelate the data, we project the original (but zero-centered) data into the eigenbasis. Notice that the columns of U (the eigenvectors of the covariance matrix) are a set of orthonormal vectors (norm of 1, and orthogonal to each other), so they can be regarded as basis vectors.

Keeping only the top few eigenvectors, and discarding the dimensions along which the data has little variance, is also sometimes referred to as Principal Component Analysis (PCA) dimensionality reduction. For example, after keeping the top 100 eigenvectors we would have reduced the original dataset of size [N x D] to one of size [N x 100], keeping the 100 dimensions of the data that contain the most variance.
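The covariance, decorrelation, and reduction steps described above can be sketched as follows. This is a hedged illustration on synthetic data; the sizes `N`, `D`, `K` and the variable names are assumptions chosen for the example, and the eigenbasis is obtained with `np.linalg.svd` on the covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical sizes: N examples, D features, keep the top K dimensions.
N, D, K = 1000, 20, 5
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))  # correlated features

X = X - np.mean(X, axis=0)       # zero-center the data
cov = X.T @ X / X.shape[0]       # [D x D] data covariance matrix

# For a symmetric positive semi-definite matrix, the columns of U are the
# eigenvectors and S holds the eigenvalues, sorted in decreasing order.
U, S, _ = np.linalg.svd(cov)

Xrot = X @ U                     # decorrelated data, still [N x D]
Xrot_reduced = X @ U[:, :K]      # PCA reduction to [N x K]

cov_rot = Xrot.T @ Xrot / N      # diagonal: the features are decorrelated
```

The covariance of `Xrot` is diagonal with the eigenvalues on the diagonal, which is what "decorrelated" means here.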

The geometric interpretation of this transformation is that if the input data is a multivariable gaussian, then the whitened data will be a gaussian with zero mean and identity covariance matrix.

One weakness of this transformation is that it can greatly exaggerate the noise in the data, since it stretches all dimensions (including the irrelevant dimensions of tiny variance that are mostly noise) to be of equal size in the input.
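A minimal whitening sketch, under stated assumptions, is shown below. The function name `whiten`, the smoothing constant `eps`, and the synthetic data are assumptions for illustration. Increasing `eps` limits how much low-variance (mostly noise) dimensions get stretched, which is one way to soften the weakness mentioned above.

```python
import numpy as np

def whiten(X, eps=1e-5):
    """Zero-center, rotate into the eigenbasis, and rescale each dimension.

    eps is an assumed smoothing constant that avoids division by zero and
    limits how much tiny-variance dimensions are amplified.
    """
    X = X - np.mean(X, axis=0)
    cov = X.T @ X / X.shape[0]
    U, S, _ = np.linalg.svd(cov)
    Xrot = X @ U                      # decorrelate
    return Xrot / np.sqrt(S + eps)    # divide by the per-dimension std

rng = np.random.default_rng(0)
# Hypothetical correlated data with features on different scales.
A = np.array([[2.0, 0.5, 0.0, 0.0],
              [0.0, 1.0, 0.3, 0.0],
              [0.0, 0.0, 3.0, 1.0],
              [0.0, 0.0, 0.0, 0.5]])
X = rng.normal(size=(2000, 4)) @ A

Xwhite = whiten(X)
cov_white = Xwhite.T @ Xwhite / Xwhite.shape[0]   # approximately identity
```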

Note that we do not know what the final value of every weight should be in the trained network, but with proper data normalization it is reasonable to assume that approximately half of the weights will be positive and half of them will be negative.

The idea is that the neurons are all random and unique in the beginning, so they will compute distinct updates and integrate themselves as diverse parts of the full network.

The implementation for one weight matrix might look like W = 0.01 * np.random.randn(D,H), where randn samples from a zero mean, unit standard deviation gaussian.

With this formulation, every neuron’s weight vector is initialized as a random vector sampled from a multi-dimensional gaussian, so the neurons point in random directions in the input space.

That is, the recommended heuristic is to initialize each neuron’s weight vector as: w = np.random.randn(n) / sqrt(n), where n is the number of its inputs.

The sketch of the derivation is as follows: Consider the inner product $$s = \sum_i^n w_i x_i$$ between the weights $$w$$ and input $$x$$, which gives the raw activation of a neuron before the non-linearity.

And since $$\text{Var}(aX) = a^2\text{Var}(X)$$ for a random variable $$X$$ and a scalar $$a$$, this implies that we should draw from unit gaussian and then scale it by $$a = \sqrt{1/n}$$, to make its variance $$1/n$$.
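The effect of the $$1/\sqrt{n}$$ scaling can be checked empirically. The sketch below assumes zero-mean, unit-variance inputs and draws fresh weights for every trial; the values of `n` and `trials` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400          # number of inputs to the neuron (hypothetical)
trials = 4000    # number of independent (w, x) draws

x = rng.standard_normal((trials, n))                       # unit-variance inputs
w_naive = rng.standard_normal((trials, n))                 # Var(w) = 1
w_scaled = rng.standard_normal((trials, n)) / np.sqrt(n)   # Var(w) = 1/n

# Raw activation s = sum_i w_i x_i, one value per trial.
s_naive = np.sum(w_naive * x, axis=1)
s_scaled = np.sum(w_scaled * x, axis=1)

var_naive = s_naive.var()     # grows with n (close to n)
var_scaled = s_scaled.var()   # stays close to 1 regardless of n
```

Without the scaling, the variance of the raw activation grows linearly with the number of inputs. With it, the output variance stays near that of the inputs.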

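A similar analysis is carried out in Understanding the difficulty of training deep feedforward neural networks by Glorot et al. In this paper, the authors end up recommending an initialization of the form $$\text{Var}(w) = 2/(n_{in} + n_{out})$$ where $$n_{in}, n_{out}$$ are the number of units in the previous layer and the next layer.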

A more recent paper on this topic, Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification by He et al., derives an initialization specifically for ReLU neurons, reaching the conclusion that the variance of neurons in the network should be $$2.0/n$$.

This gives the initialization w = np.random.randn(n) * sqrt(2.0/n), and is the current recommendation for use in practice in the specific case of neural networks with ReLU neurons.
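The three variance rules discussed above can be put side by side in a short sketch. The function names and the layer sizes are assumptions for illustration; each function simply scales a unit gaussian so that the weights have the stated variance.

```python
import numpy as np

def init_calibrated(n_in, n_out, rng):
    """Var(w) = 1/n_in: the 1/sqrt(n) heuristic."""
    return rng.standard_normal((n_in, n_out)) / np.sqrt(n_in)

def init_glorot(n_in, n_out, rng):
    """Var(w) = 2/(n_in + n_out), as recommended by Glorot et al."""
    return rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / (n_in + n_out))

def init_he(n_in, n_out, rng):
    """Var(w) = 2/n_in, as derived by He et al. for ReLU neurons."""
    return rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in)

rng = np.random.default_rng(0)
n_in, n_out = 256, 128   # hypothetical layer sizes

W_cal = init_calibrated(n_in, n_out, rng)
W_glorot = init_glorot(n_in, n_out, rng)
W_he = init_he(n_in, n_out, rng)
```

Note that the He variance is exactly twice the calibrated variance for the same fan-in. The extra factor compensates for ReLU zeroing out roughly half of the activations.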

Another way to address the uncalibrated variances problem is to set all weight matrices to zero, but to break symmetry every neuron is randomly connected (with weights sampled from a small gaussian as above) to a fixed number of neurons below it.
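A sparse initialization along these lines might be sketched as follows. The function name `sparse_init`, the number of connections `k`, and the gaussian `scale` are hypothetical choices for illustration and are not taken from the source.

```python
import numpy as np

def sparse_init(n_in, n_out, k, rng, scale=0.01):
    """Start from an all-zero matrix and give every output neuron exactly
    k randomly chosen incoming connections with small gaussian weights."""
    W = np.zeros((n_in, n_out))
    for j in range(n_out):
        idx = rng.choice(n_in, size=k, replace=False)   # k distinct inputs
        W[idx, j] = scale * rng.standard_normal(k)
    return W

rng = np.random.default_rng(0)
W = sparse_init(n_in=100, n_out=30, k=10, rng=rng)
connections_per_neuron = np.count_nonzero(W, axis=0)    # k for every column
```

Because each neuron receives a different random subset of inputs, symmetry is broken even though most weights are exactly zero.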

For ReLU non-linearities, some people like to use a small constant value such as 0.01 for all biases because this ensures that all ReLU units fire in the beginning and therefore obtain and propagate some gradient.

However, it is not clear if this provides a consistent improvement (in fact some results seem to indicate that this performs worse) and it is more common to simply use 0 bias initialization.

A recently developed technique by Ioffe and Szegedy called Batch Normalization alleviates a lot of headaches with properly initializing neural networks by explicitly forcing the activations throughout a network to take on a unit gaussian distribution at the beginning of the training.

In the implementation, applying this technique usually amounts to inserting the BatchNorm layer immediately after fully connected layers (or convolutional layers, as we’ll soon see), and before non-linearities.
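The placement described above (fully connected layer, then BatchNorm, then non-linearity) can be sketched for the training-time forward pass as follows. This is a simplified illustration: `batchnorm_forward`, the learnable `gamma` and `beta`, and `eps` are assumed names, and the running statistics that a full implementation would use at test time are omitted.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Training-time batch normalization over the batch dimension (axis 0).

    gamma and beta are the learnable scale and shift; eps is an assumed
    constant for numerical stability.
    """
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)   # zero mean, unit variance per feature
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 100))           # hypothetical batch of 64 examples
W = 10.0 * rng.standard_normal((100, 50))    # deliberately badly scaled weights
b = np.zeros(50)
gamma, beta = np.ones(50), np.zeros(50)

pre = x @ W + b                              # fully connected layer
normed = batchnorm_forward(pre, gamma, beta) # BatchNorm before the non-linearity
h = np.maximum(0, normed)                    # ReLU
```

Even though `W` is scaled far too large, the values entering the ReLU have zero mean and unit variance per feature. This is why the technique makes networks more robust to bad initialization.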

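With L2 regularization, for every weight $$w$$ in the network we add the term $$\frac{1}{2} \lambda w^2$$ to the objective. It is common to see the factor of $$\frac{1}{2}$$ in front because then the gradient of this term with respect to the parameter $$w$$ is simply $$\lambda w$$ instead of $$2 \lambda w$$.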

Lastly, notice that during the gradient descent parameter update, using L2 regularization ultimately means that every weight is decayed linearly towards zero: W += -lambda * W.

L1 regularization is another relatively common form of regularization, where for each weight $$w$$ we add the term $$\lambda \mid w \mid$$ to the objective.
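The two penalties and their effect on a gradient descent step can be compared in a small sketch. The learning rate `lr` and the example weights are assumptions for illustration, and only the regularization part of the gradient is applied here.

```python
import numpy as np

lam = 0.1   # regularization strength (lambda)
lr = 0.5    # hypothetical learning rate
W = np.array([[1.0, -2.0],
              [0.5, 4.0]])

# L2: penalty 0.5 * lambda * w^2, gradient lambda * w.
l2_penalty = 0.5 * lam * np.sum(W ** 2)
l2_grad = lam * W

# L1: penalty lambda * |w|, (sub)gradient lambda * sign(w).
l1_penalty = lam * np.sum(np.abs(W))
l1_grad = lam * np.sign(W)

W_l2 = W - lr * l2_grad   # every weight shrinks by the same factor (1 - lr * lam)
W_l1 = W - lr * l1_grad   # every weight moves a constant lr * lam towards zero
```

The L2 update shrinks large weights more than small ones. The L1 update subtracts the same amount from every weight, which is what drives small weights to exactly zero over time.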

Another form of regularization is to enforce an absolute upper bound on the magnitude of the weight vector for every neuron and use projected gradient descent to enforce the constraint.

In practice, this corresponds to performing the parameter update as normal, and then enforcing the constraint by clamping the weight vector $$\vec{w}$$ of every neuron to satisfy $$\Vert \vec{w} \Vert_2 < c$$, where $$c$$ is the chosen upper bound.

Vanilla dropout in an example 3-layer Neural Network performs dropout twice inside the train_step function: on the first hidden layer and on the second hidden layer. At test time no units are dropped, and the hidden layer outputs are instead scaled by the probability $$p$$ of keeping a unit, so that their expected values match those seen during training. It can also be shown that performing this attenuation at test time can be related to the process of iterating over all the possible binary masks (and therefore all the exponentially many sub-networks) and computing their ensemble prediction.

Since test-time performance is so critical, it is always preferable to use inverted dropout, which performs the scaling at train time, leaving the forward pass at test time untouched.

There has been a large amount of research after the first introduction of dropout that tries to understand the source of its power in practice, and its relation to the other regularization techniques.

As we already mentioned in the Linear Classification section, it is not common to regularize the bias parameters because they do not interact with the data through multiplicative interactions, and therefore do not have the interpretation of controlling the influence of a data dimension on the final objective.

For example, a binary classifier for each category independently would take the form $$L_i = \sum_j \max(0, 1 - y_{ij} f_j)$$, where the sum is over all categories $$j$$, and $$y_{ij}$$ is either +1 or -1 depending on whether the i-th example is labeled with the j-th attribute, and the score $$f_j$$ will be positive when the class is predicted to be present and negative otherwise.
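Returning to the inverted dropout described above, a forward-pass sketch for a 3-layer network might look as follows. This is an illustration under assumptions: examples are stored as rows, the keep probability `p` is passed explicitly, the layer sizes are made up, and the backward pass and parameter update are omitted.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def train_step(X, params, p, rng):
    """Training-time forward pass with inverted dropout on both hidden layers.

    p is the probability of keeping a unit. Dividing the mask by p at train
    time keeps the expected activation unchanged.
    """
    W1, b1, W2, b2, W3, b3 = params
    H1 = relu(X @ W1 + b1)
    U1 = (rng.random(H1.shape) < p) / p   # first dropout mask, already scaled
    H1 = H1 * U1
    H2 = relu(H1 @ W2 + b2)
    U2 = (rng.random(H2.shape) < p) / p   # second dropout mask, already scaled
    H2 = H2 * U2
    return H2 @ W3 + b3

def predict(X, params):
    """Test-time forward pass: no masks and no extra scaling."""
    W1, b1, W2, b2, W3, b3 = params
    H1 = relu(X @ W1 + b1)
    H2 = relu(H1 @ W2 + b2)
    return H2 @ W3 + b3

rng = np.random.default_rng(0)
sizes = [20, 32, 16, 5]   # hypothetical layer sizes
params = []
for n_in, n_out in zip(sizes[:-1], sizes[1:]):
    params += [rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in),
               np.zeros(n_out)]

X = rng.standard_normal((8, 20))
scores_train = train_step(X, params, p=0.5, rng=rng)
scores_test = predict(X, params)
```

With `p = 1.0` no unit is ever dropped and `train_step` reduces to `predict`. This is a quick sanity check that the scaling is applied in the right place.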

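A binary logistic regression classifier has only two classes (0,1), and calculates the probability of class 1 as $$P(y = 1 \mid x; w, b) = \frac{1}{1 + e^{-(w^T x + b)}} = \sigma(w^T x + b)$$. Since the probabilities of class 1 and 0 sum to one, the probability for class 0 is $$P(y = 0 \mid x; w, b) = 1 - P(y = 1 \mid x; w, b)$$. With labels $$y_{ij}$$ that are either 1 (positive) or 0 (negative), the log likelihood to be maximized is $$L_i = \sum_j y_{ij} \log(\sigma(f_j)) + (1 - y_{ij}) \log(1 - \sigma(f_j))$$. This expression can look scary but the gradient on $$f$$ is in fact extremely simple and intuitive: $$\partial{L_i} / \partial{f_j} = y_{ij} - \sigma(f_j)$$ (as you can double check yourself by taking the derivatives).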
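The gradient stated above can be double checked numerically. The sketch below assumes labels in {0, 1} and treats $$L_i$$ as the log likelihood $$\sum_j y_{ij} \log \sigma(f_j) + (1 - y_{ij}) \log(1 - \sigma(f_j))$$, which is the form whose gradient is $$y_{ij} - \sigma(f_j)$$. The scores and labels are made-up values.

```python
import numpy as np

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

def log_likelihood(f, y):
    """Sum over attributes of the per-attribute Bernoulli log likelihood."""
    s = sigmoid(f)
    return np.sum(y * np.log(s) + (1 - y) * np.log(1 - s))

def analytic_grad(f, y):
    """Closed-form gradient of the log likelihood with respect to the scores."""
    return y - sigmoid(f)

def numerical_grad(f, y, h=1e-5):
    """Central finite differences, one score at a time."""
    g = np.zeros_like(f)
    for j in range(f.size):
        step = np.zeros_like(f)
        step[j] = h
        g[j] = (log_likelihood(f + step, y) - log_likelihood(f - step, y)) / (2 * h)
    return g

f = np.array([2.0, -1.0, 0.3])   # hypothetical scores for three attributes
y = np.array([1.0, 0.0, 1.0])    # labels: 1 = present, 0 = absent

grad_a = analytic_grad(f, y)
grad_n = numerical_grad(f, y)
```

If the negative log likelihood is minimized instead, the sign of the gradient flips to $$\sigma(f_j) - y_{ij}$$.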

The L2 norm squared would compute the loss for a single example of the form $$L_i = \Vert f - y_i \Vert_2^2$$. The reason the L2 norm is squared in the objective is that the gradient becomes much simpler, without changing the optimal parameters since squaring is a monotonic operation.

For example, if you are predicting the star rating for a product, it might work much better to use 5 independent classifiers for ratings of 1-5 stars instead of a regression loss.

If you’re certain that classification is not appropriate, use the L2 loss but be careful: the L2 is more fragile, and applying dropout in the network (especially in the layer right before the L2 loss) is not a great idea.

## Kaixhin/nninit

Parameter initialisation schemes for Torch7 neural network modules.

nninit adds an init method to nn.Module. The accessor argument is used to extract the tensor to be initialised from the module. The initialiser argument is a function that takes the module, tensor, and further options; it adjusts the tensor and returns the module, allowing init calls to be chained.

When the accessor is given as a table, the tensor is first accessed as a property of the module from the first element, and a subtensor is then extracted using Torch's indexing operator applied to the second element.

The initialisation scheme typically includes the gain for ReLU units, which has to be manually specified in nninit.kaiming with the option {gain = 'relu'}.

A sparse initialisation sets a fraction (1 - sparsity) of the tensor to 0, where sparsity is between 0 and 1.

The initialisation scheme described in the paper includes the gain for ReLU units, which has to be manually specified with the option {gain = 'relu'}.

If the gain must be calculated from additional parameters, gain must be passed as a table with the string as the first element, followed by the named parameters.

To develop nninit or use it to test new initialisation schemes, git clone or download this repo and use luarocks make rocks/nninit-scm-1.rockspec to install nninit locally.
