# AI News, BrainJS/brain.js

## BrainJS/brain.js

You check out this fantastic screencast, which explains how to train a simple neural network using a real world dataset: How to create a neural network in the browser using Brain.js.

If you have node, you can install brain.js with npm: Or if you prefer yarn: Alternatively, you can install brain.js with bower: At present, the npm version of brain.js is approximately 1.0.0, featuring only Feed forward NN.

(Note that input objects do not need to be similar.) Each training pattern can either: Example using an array of numbers: Example using an array of arrays of numbers: Each training pattern can either: CAUTION: When using an array of values, you can use ANY value, however, the values are represented in the neural network by a single input.

Also, when deviating from strings, this gets into beta Example using direct strings: Example using strings with inputs and outputs: train() takes a hash of options as its second argument: The network will stop training whenever one of the two criteria is met: the training error has gone below the threshold (default 0.005), or the max number of iterations (default 20000) has been reached.

If the learning rate is closer to 1, it will train faster, but training results may be constrained to a local minimum and perform badly on new data.(Overfitting) The default learning rate is 0.3.

Serialize or load in the state of a trained network with JSON: You can also get a custom standalone function from a trained network that acts just like run(): NeuralNetwork() takes a hash of options: This parameter lets you specify which activation function your neural network should use.

There are currently four supported activation functions, sigmoid being the default: Here's a table (Thanks, Wikipedia!) summarizing a plethora of activation functions — Activation Function You can use this to specify the number of hidden layers in the network and the size of each layer.

For example, if you want two hidden layers - the first with 3 nodes and the second with 4 nodes, you'd give: By default brain.js uses one hidden layer with size proportionate to the size of the input array.

Table of Contents: In the previous sections we’ve discussed the static parts of a Neural Networks: how we can set up the network connectivity, the data, and the loss function.

The formula you may have seen for the finite difference approximation when evaluating the numerical gradient looks as follows: where $$h$$ is a very small number, in practice approximately 1e-5 or so.

In practice, it turns out that it is much better to use the centered difference formula of the form: This requires you to evaluate the loss function twice to check every single dimension of the gradient (so it is about 2 times as expensive), but the gradient approximation turns out to be much more precise.

You might be temped to keep track of the difference $$\mid f’_a - f’_n \mid$$ or its square and define the gradient check as failed if that difference is above a threshold.

Notice that normally the relative error formula only includes one of the two terms (either one), but I prefer to max (or add) both to make it symmetric and to prevent dividing by zero in the case where one of the two is zero (which can often happen, especially with ReLUs).

However, if your gradients per datapoint are very small, then additionally dividing them by the number of data points is starting to give very small numbers, which in turn will lead to more numerical issues.

If they are you may want to temporarily scale your loss function up by a constant to bring them to a “nicer” range where floats are more dense - ideally on the order of 1.0, where your float exponent is 0.

Additionally, a random initialization might not be the most “characteristic” point in the space of parameters and may in fact introduce pathological situations where the gradient seems to be correctly implemented but isn’t.

For instance, an SVM with very small weight initialization will assign almost exactly zero scores to all datapoints and the gradients will exhibit a particular pattern across all datapoints.

Therefore, to be safe it is best to use a short burn-in time during which the network is allowed to learn and perform the gradient check after the loss starts to go down.

One danger to be aware of is that the regularization loss may overwhelm the data loss, in which case the gradients will be primarily coming from the regularization term (which usually has a much simpler gradient expression).

In these cases, for example, the biases could only take up a tiny number of parameters from the whole vector, so it is important to not sample at random but to take this into account and check that all parameters receive the correct gradients.

Below is a cartoon diagram showing the loss over time, and especially what the shape might tell you about the learning rate: The amount of “wiggle” in the loss is related to the batch size.

When the batch size is the full dataset, the wiggle will be minimal because every gradient update should be improving the loss function monotonically (unless the learning rate is set too high).

Since learning progress generally takes an exponential form shape, the plot appears as a slightly more interpretable straight line, rather than a hockey stick.

with tanh neurons we would like to see a distribution of neuron activations between the full range of [-1,1], instead of seeing all neurons outputting zero, or all neurons being completely saturated at either -1 or 1.

In this section we highlight some established and common techniques you may see in practice, briefly describe their intuition, but leave a detailed analysis outside of the scope of the class.

The simplest form of update is to change the parameters along the negative gradient direction (since the gradient indicates the direction of increase, but we usually wish to minimize a loss function).

In particular, the loss can be interpreted as the height of a hilly terrain (and therefore also to the potential energy since $$U = mgh$$ and therefore $$U \propto h$$ ).

Instead, the physics view suggests an update in which the gradient only directly influences the velocity, which in turn has an effect on the position: Here we see an introduction of a v variable that is initialized at zero, and an additional hyperparameter (mu).

As an unfortunate misnomer, this variable is in optimization referred to as momentum (its typical value is about 0.9), but its physical meaning is more consistent with the coefficient of friction.

Similar to annealing schedules for learning rates (discussed later, below), optimization can sometimes benefit a little from momentum schedules, where the momentum is increased in later stages of learning.

The core idea behind Nesterov momentum is that when the current parameter vector is at some position x, then looking at the momentum update above, we know that the momentum term alone (i.e.

Therefore, if we are about to compute the gradient, we can treat the future approximate position x + mu * v as a “lookahead” - this is a point in the vicinity of where we are soon going to end up.

This is possible to achieve by manipulating the update above with a variable transform x_ahead = x + mu * v, and then expressing the update in terms of x_ahead instead of x.

Good intuition to have in mind is that with a high learning rate, the system contains too much kinetic energy and the parameter vector bounces around chaotically, unable to settle down into deeper, but narrower parts of the loss function.

There are three common types of implementing the learning rate decay: In practice, we find that the step decay is slightly preferable because the hyperparameters it involves (the fraction of decay and the step timings in units of epochs) are more interpretable than the hyperparameter $$k$$.

second, popular group of methods for optimization in context of deep learning is based on Newton’s method, which iterates the following update: Here, $$H f(x)$$ is the Hessian matrix, which is a square matrix of second-order partial derivatives of the function.

In particular, multiplying by the inverse Hessian leads the optimization to take more aggressive steps in directions of shallow curvature and shorter steps in directions of steep curvature.

However, even after we eliminate the memory concerns, a large downside of a naive application of L-BFGS is that it must be computed over the entire training set, which could contain millions of examples.

In this section we highlight some common adaptive methods you may encounter in practice: Adagrad is an adaptive learning rate method originally proposed by Duchi et al..

Notice that the weights that receive high gradients will have their effective learning rate reduced, while weights that receive small or infrequent updates will have their effective learning rate increased.

In particular, it uses a moving average of squared gradients instead, giving: Here, decay_rate is a hyperparameter and typical values are [0.9, 0.99, 0.999].

The (simplified) update looks as follows: Notice that the update looks exactly as RMSProp update, except the “smooth” version of the gradient m is used instead of the raw (and perhaps noisy) gradient vector dx.

The full Adam update also includes a bias correction mechanism, which compensates for the fact that in the first few time steps the vectors m,v are both initialized and therefore biased at zero, before they fully “warm up”.

The most common hyperparameters in context of Neural Networks include: But as we saw, there are many more relatively less sensitive hyperparameters, for example in per-parameter adaptive learning methods, the setting of momentum and its schedule, etc.

During the training, the worker will keep track of the validation performance after every epoch, and writes a model checkpoint (together with miscellaneous training statistics such as the loss over time) to a file, preferably on a shared file system.

Then there is a second program which we will call a master, which launches or kills workers across a computing cluster, and may additionally inspect the checkpoints written by workers and plot their training statistics, etc.

For example, a fixed change of adding 0.01 to a learning rate has huge effects on the dynamics if the learning rate is 0.001, but nearly no effect if the learning rate when it is 10.

Once we receive the results, it is important to double check that the final learning rate is not at the edge of this interval, or otherwise you may be missing more optimal hyperparameter setting beyond the interval.

Also, it can be helpful to perform the initial coarse search while only training for 1 epoch or even less, because many hyperparameter settings can lead the model to not learn at all, or immediately explode with infinite cost.

An interested reader may find the recent work from Geoff Hinton on “Dark Knowledge” inspiring, where the idea is to “distill” a good ensemble back to a single model by incorporating the ensemble log likelihoods into a modified objective.

It involves subtracting the mean across every individual feature in the data, and has the geometric interpretation of centering the cloud of data around the origin along every dimension.

It only makes sense to apply this preprocessing if you have a reason to believe that different input features have different scales (or units), but they should be of approximately equal importance to the learning algorithm. In

case of images, the relative scales of pixels are already approximately equal (and in range from 0 to 255), so it is not strictly necessary to perform this additional preprocessing step.

Then, we can compute the covariance matrix that tells us about the correlation structure in the data: The (i,j) element of the data covariance matrix contains the covariance between i-th and j-th dimension of the data.

To decorrelate the data, we project the original (but zero-centered) data into the eigenbasis: Notice that the columns of U are a set of orthonormal vectors (norm of 1, and orthogonal to each other), so they can be regarded as basis vectors.

This is also sometimes refereed to as Principal Component Analysis (PCA) dimensionality reduction: After this operation, we would have reduced the original dataset of size [N x D] to one of size [N x 100], keeping the 100 dimensions of the data that contain the most variance.

The geometric interpretation of this transformation is that if the input data is a multivariable gaussian, then the whitened data will be a gaussian with zero mean and identity covariance matrix.

One weakness of this transformation is that it can greatly exaggerate the noise in the data, since it stretches all dimensions (including the irrelevant dimensions of tiny variance that are mostly noise) to be of equal size in the input.

Note that we do not know what the final value of every weight should be in the trained network, but with proper data normalization it is reasonable to assume that approximately half of the weights will be positive and half of them will be negative.

The idea is that the neurons are all random and unique in the beginning, so they will compute distinct updates and integrate themselves as diverse parts of the full network.

The implementation for one weight matrix might look like W = 0.01* np.random.randn(D,H), where randn samples from a zero mean, unit standard deviation gaussian.

With this formulation, every neuron’s weight vector is initialized as a random vector sampled from a multi-dimensional gaussian, so the neurons point in random direction in the input space.

That is, the recommended heuristic is to initialize each neuron’s weight vector as: w = np.random.randn(n) / sqrt(n), where n is the number of its inputs.

The sketch of the derivation is as follows: Consider the inner product $$s = \sum_i^n w_i x_i$$ between the weights $$w$$ and input $$x$$, which gives the raw activation of a neuron before the non-linearity.

And since $$\text{Var}(aX) = a^2\text{Var}(X)$$ for a random variable $$X$$ and a scalar $$a$$, this implies that we should draw from unit gaussian and then scale it by $$a = \sqrt{1/n}$$, to make its variance $$1/n$$.

In this paper, the authors end up recommending an initialization of the form $$\text{Var}(w) = 2/(n_{in} + n_{out})$$ where $$n_{in}, n_{out}$$ are the number of units in the previous layer and the next layer.

A more recent paper on this topic, Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification by He et al., derives an initialization specifically for ReLU neurons, reaching the conclusion that the variance of neurons in the network should be $$2.0/n$$.

This gives the initialization w = np.random.randn(n) * sqrt(2.0/n), and is the current recommendation for use in practice in the specific case of neural networks with ReLU neurons.

Another way to address the uncalibrated variances problem is to set all weight matrices to zero, but to break symmetry every neuron is randomly connected (with weights sampled from a small gaussian as above) to a fixed number of neurons below it.

For ReLU non-linearities, some people like to use small constant value such as 0.01 for all biases because this ensures that all ReLU units fire in the beginning and therefore obtain and propagate some gradient.

However, it is not clear if this provides a consistent improvement (in fact some results seem to indicate that this performs worse) and it is more common to simply use 0 bias initialization.

A recently developed technique by Ioffe and Szegedy called Batch Normalization alleviates a lot of headaches with properly initializing neural networks by explicitly forcing the activations throughout a network to take on a unit gaussian distribution at the beginning of the training.

In the implementation, applying this technique usually amounts to insert the BatchNorm layer immediately after fully connected layers (or convolutional layers, as we’ll soon see), and before non-linearities.

It is common to see the factor of $$\frac{1}{2}$$ in front because then the gradient of this term with respect to the parameter $$w$$ is simply $$\lambda w$$ instead of $$2 \lambda w$$.

Lastly, notice that during gradient descent parameter update, using the L2 regularization ultimately means that every weight is decayed linearly: W += -lambda * W towards zero.

L1 regularization is another relatively common form of regularization, where for each weight $$w$$ we add the term $$\lambda \mid w \mid$$ to the objective.

Another form of regularization is to enforce an absolute upper bound on the magnitude of the weight vector for every neuron and use projected gradient descent to enforce the constraint.

In practice, this corresponds to performing the parameter update as normal, and then enforcing the constraint by clamping the weight vector $$\vec{w}$$ of every neuron to satisfy $$\Vert \vec{w} \Vert_2 &lt; Vanilla dropout in an example 3-layer Neural Network would be implemented as follows: In the code above, inside the train_step function we have performed dropout twice: on the first hidden layer and on the second hidden layer. It can also be shown that performing this attenuation at test time can be related to the process of iterating over all the possible binary masks (and therefore all the exponentially many sub-networks) and computing their ensemble prediction. Since test-time performance is so critical, it is always preferable to use inverted dropout, which performs the scaling at train time, leaving the forward pass at test time untouched. Inverted dropout looks as follows: There has a been a large amount of research after the first introduction of dropout that tries to understand the source of its power in practice, and its relation to the other regularization techniques. As we already mentioned in the Linear Classification section, it is not common to regularize the bias parameters because they do not interact with the data through multiplicative interactions, and therefore do not have the interpretation of controlling the influence of a data dimension on the final objective. For example, a binary classifier for each category independently would take the form: where the sum is over all categories \(j$$, and $$y_{ij}$$ is either +1 or -1 depending on whether the i-th example is labeled with the j-th attribute, and the score vector $$f_j$$ will be positive when the class is predicted to be present and negative otherwise.

A binary logistic regression classifier has only two classes (0,1), and calculates the probability of class 1 as: Since the probabilities of class 1 and 0 sum to one, the probability for class 0 is $$P(y = 0 \mid x; The expression above can look scary but the gradient on \(f$$ is in fact extremely simple and intuitive: $$\partial{L_i} / \partial{f_j} = y_{ij} - \sigma(f_j)$$ (as you can double check yourself by taking the derivatives).

The L2 norm squared would compute the loss for a single example of the form: The reason the L2 norm is squared in the objective is that the gradient becomes much simpler, without changing the optimal parameters since squaring is a monotonic operation.

For example, if you are predicting star rating for a product, it might work much better to use 5 independent classifiers for ratings of 1-5 stars instead of a regression loss.

If you’re certain that classification is not appropriate, use the L2 but be careful: For example, the L2 is more fragile and applying dropout in the network (especially in the layer right before the L2 loss) is not a great idea.

ROC Curves and Area Under the Curve (AUC) Explained

Transcript and screenshots: Visualization: Research paper: .

Lecture 7 | Training Neural Networks II

Lecture 7 continues our discussion of practical issues for training neural networks. We discuss different update rules commonly used to optimize neural networks ...

Regression forecasting and predicting - Practical Machine Learning Tutorial with Python p.5

In this video, make sure you define the X's like so. I flipped the last two lines by mistake: X = np.array(df.drop(['label'],1)) X = preprocessing.scale(X) X_lately ...

Distributed TensorFlow (TensorFlow Dev Summit 2017)

TensorFlow gives you the flexibility to scale up to hundreds of GPUs, train models with a huge number of parameters, and customize every last detail of the ...

MIT 6.S094: Deep Reinforcement Learning for Motion Planning

This is lecture 2 of course 6.S094: Deep Learning for Self-Driving Cars taught in Winter 2017. This lecture introduces types of machine learning, the neuron as a ...

SPACY'S ENTITY RECOGNITION MODEL: incremental parsing with Bloom embeddings & residual CNNs

spaCy v2.0's Named Entity Recognition system features a sophisticated word embedding strategy using subword features and "Bloom" embeddings, a deep ...

MMIS 643 Data Mining Assignment 3 Solutions

A. A neural network typically starts out with random coefficients (weights); hence, it produce essentially random predications when presented with its first case.

Lecture 13: Convolutional Neural Networks

Lecture 13 provides a mini tutorial on Azure and GPUs followed by research highlight "Character-Aware Neural Language Models." Also covered are CNN ...

Neural Net in C++ Tutorial on Vimeo

Lecture 5: Backpropagation and Project Advice

Lecture 5 discusses how neural networks can be trained using a distributed gradient descent technique known as back propagation. Key phrases: Neural ...