AI News, NIPS Proceedingsβ

Part of: Advances in Neural Information Processing Systems 27 (NIPS 2014)

Stochastic gradient descent algorithms for training linear and kernel predictors are gaining more and more importance thanks to their scalability.

In theoretical work, however, assumptions are usually made, for example on prior knowledge of the norm of the optimal solution, while in practice validation methods remain the only viable approach.

Stochastic gradient descent

Stochastic gradient descent (often shortened to SGD), also known as incremental gradient descent, is an iterative method for optimizing a differentiable objective function; it is a stochastic approximation of gradient descent optimization.

Both statistical estimation and machine learning consider the problem of minimizing an objective function that has the form of a sum:

Q(w) = (1/n) Σ_{i=1}^n Q_i(w),

where the parameter w that minimizes Q(w) is to be estimated. Each summand function Q_i is typically associated with the i-th observation in the data set (used for training).

However, in statistics, it has long been recognized that requiring even local minimization is too restrictive for some problems of maximum-likelihood estimation.[1] Therefore, contemporary statistical theorists often consider stationary points of the likelihood function (or zeros of its derivative, the score function, and other estimating equations).

When used to minimize the above function, a standard (or 'batch') gradient descent method would perform the following iterations:

w := w − η ∇Q(w) = w − (η/n) Σ_{i=1}^n ∇Q_i(w),

where η is a step size (sometimes called the learning rate in machine learning).

When the training set is enormous and no simple formulas exist, evaluating the sum of gradients becomes very expensive, because evaluating the gradient of the objective requires evaluating the gradients of all the summand functions.

To economize on the computational cost at every iteration, stochastic gradient descent samples a subset of summand functions at every step.

This is very effective in the case of large-scale machine learning problems.[2]

In stochastic (or 'on-line') gradient descent, the true gradient of Q(w) is approximated by the gradient at a single example:

w := w − η ∇Q_i(w).

As the algorithm sweeps through the training set, it performs the above update for each training example. In pseudocode, the method can be presented as follows:

    Choose an initial vector of parameters w and learning rate η.
    Repeat until an approximate minimum is obtained:
        Randomly shuffle examples in the training set.
        For i = 1, 2, …, n, do:
            w := w − η ∇Q_i(w).
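As a rough illustration, here is a minimal Python sketch of the pseudocode above. The per-example gradient function grad_i and the list of examples are assumptions supplied by the caller, not part of the original text.

```python
import random

def sgd(w, grad_i, examples, eta=0.01, epochs=10):
    """Plain SGD: one parameter update per training example.

    grad_i(w, example) is a hypothetical user-supplied function returning
    the gradient of the per-example loss Q_i at w.
    """
    for _ in range(epochs):
        random.shuffle(examples)            # sweep the set in random order
        for example in examples:
            g = grad_i(w, example)          # gradient at a single example
            w = [wj - eta * gj for wj, gj in zip(w, g)]
    return w
```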

A compromise between computing the true gradient and the gradient at a single example is to compute the gradient against more than one training example (called a 'mini-batch') at each step.

This can perform significantly better than 'true' stochastic gradient descent as described above, because the code can make use of vectorization libraries rather than computing each step separately.
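A sketch of the mini-batch variant under the same assumptions; NumPy array indexing stands in for the vectorization the text mentions, and grad (a hypothetical caller-supplied function) is assumed to return the gradient averaged over a batch.

```python
import numpy as np

def minibatch_sgd(w, grad, X, y, eta=0.01, batch_size=32, epochs=10):
    """Mini-batch SGD: one update per random batch of examples."""
    n = len(X)
    for _ in range(epochs):
        order = np.random.permutation(n)    # fresh random batches each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            w = w - eta * grad(w, X[idx], y[idx])   # vectorized batch gradient
    return w
```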

When the learning rate decreases at an appropriate rate, and subject to relatively mild assumptions, stochastic gradient descent converges almost surely to a global minimum when the objective function is convex or pseudoconvex, and otherwise converges almost surely to a local minimum.[3][4] This is in fact a consequence of the Robbins–Siegmund theorem.[5]

Let's suppose we want to fit a straight line ŷ = w_1 + w_2 x to a training set with observations (x_1, x_2, …, x_n) and corresponding estimated responses (ŷ_1, ŷ_2, …, ŷ_n) using least squares. The objective function to be minimized is:

Q(w) = Σ_{i=1}^n Q_i(w) = Σ_{i=1}^n (w_1 + w_2 x_i − y_i)².

The last line in the above pseudocode for this specific problem will become:

[w_1, w_2] := [w_1, w_2] − η [2(w_1 + w_2 x_i − y_i), 2 x_i (w_1 + w_2 x_i − y_i)].

The key difference compared to standard (batch) gradient descent is that only one piece of data from the dataset is used to calculate the step, and that piece of data is picked randomly at each step.
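To make the example concrete, here is a small self-contained Python script that fits the straight line with exactly this update; the synthetic data and all constants are illustrative choices, not from the original text.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 200)
y = 3.0 + 2.0 * x + rng.normal(0.0, 0.1, 200)  # synthetic data, true w = (3, 2)

w1, w2, eta = 0.0, 0.0, 0.05
for _ in range(50):                     # epochs
    for i in rng.permutation(len(x)):   # pick examples in random order
        err = w1 + w2 * x[i] - y[i]     # residual at example i
        w1 -= eta * 2.0 * err           # ∂Q_i/∂w1 = 2(w1 + w2·x_i − y_i)
        w2 -= eta * 2.0 * err * x[i]    # ∂Q_i/∂w2 = 2x_i(w1 + w2·x_i − y_i)

print(w1, w2)                           # should be close to (3.0, 2.0)
```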

Stochastic gradient descent is a popular algorithm for training a wide range of models in machine learning, including (linear) support vector machines, logistic regression (see, e.g., Vowpal Wabbit) and graphical models.[6] When combined with the backpropagation algorithm, it is the de facto standard algorithm for training artificial neural networks.[7] Its use has also been reported in the geophysics community, specifically in applications of Full Waveform Inversion (FWI).[8] Stochastic gradient descent competes with the L-BFGS algorithm, which is also widely used.

Stochastic gradient descent has been used since at least 1960 for training linear regression models, originally under the name ADALINE.[9] Another popular stochastic gradient descent algorithm is the least mean squares (LMS) adaptive filter.

A conceptually simple extension of stochastic gradient descent makes the learning rate a decreasing function ηt of the iteration number t, giving a learning rate schedule, so that the first iterations cause large changes in the parameters, while the later ones do only fine-tuning.
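As a minimal sketch, one common decreasing schedule looks like this; the 1/(1 + decay·t) form and the constants are illustrative assumptions, not the only option.

```python
def eta_schedule(eta0, t, decay=1e-3):
    # Large steps for small t, progressively finer adjustments later.
    return eta0 / (1.0 + decay * t)
```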

Such schedules have been known since the work of MacQueen on k-means clustering.[10]

Further proposals include the momentum method, which appeared in Rumelhart, Hinton and Williams' seminal paper on backpropagation learning.[11] Stochastic gradient descent with momentum remembers the update Δw at each iteration, and determines the next update as a linear combination of the gradient and the previous update:[12][13]

Δw := α Δw − η ∇Q_i(w)
w := w + Δw,

which leads to:

w := w − η ∇Q_i(w) + α Δw,

where the parameter w that minimizes Q(w) is to be estimated, η is a step size (learning rate), and α is an exponential decay factor between 0 and 1 that determines the relative contribution of the current gradient and earlier gradients to the weight change.
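A minimal sketch of one momentum update, assuming g holds the current gradient ∇Q_i(w); the names are illustrative.

```python
import numpy as np

def momentum_step(w, delta_w, g, eta=0.01, alpha=0.9):
    """One SGD-with-momentum update: Δw := αΔw − η∇Q_i(w); w := w + Δw."""
    delta_w = alpha * delta_w - eta * g   # remembered update Δw
    w = w + delta_w
    return w, delta_w
```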

Momentum has been used successfully by computer scientists in the training of artificial neural networks for several decades.[14] Averaged stochastic gradient descent, invented independently by Ruppert and Polyak in the late 1980s, is ordinary stochastic gradient descent that records an average of its parameter vector over time.

That is, the update is the same as for ordinary stochastic gradient descent, but the algorithm also keeps track of the average[15]

w̄ = (1/t) Σ_{i=0}^{t−1} w_i.

When optimization is done, this averaged parameter vector takes the place of w.
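A sketch of the averaged variant, again assuming a hypothetical caller-supplied grad_i; the running mean of the iterates is maintained incrementally rather than stored.

```python
import numpy as np

def averaged_sgd(w, grad_i, examples, eta=0.01):
    """Ordinary SGD that also tracks the average of its iterates."""
    w = np.asarray(w, dtype=float)
    w_bar = w.copy()                        # running average of the iterates
    for t, example in enumerate(examples, start=1):
        w = w - eta * grad_i(w, example)    # ordinary SGD step
        w_bar = w_bar + (w - w_bar) / t     # incremental mean of the iterates
    return w_bar                            # the average replaces w at the end
```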

AdaGrad (for adaptive gradient algorithm) is a modified stochastic gradient descent with per-parameter learning rate, first published in 2011.[16][17] Informally, this increases the learning rate for more sparse parameters and decreases the learning rate for less sparse ones.

This strategy often improves convergence performance over standard stochastic gradient descent in settings where data is sparse and sparse parameters are more informative.

Examples of such applications include natural language processing and image recognition.[16] It still has a base learning rate η, but this is multiplied with the elements of a vector {G_{j,j}}, which is the diagonal of the outer product matrix

G = Σ_{τ=1}^t g_τ g_τᵀ,

where g_τ = ∇Q_i(w) is the gradient at iteration τ. The formula for an update is now:

w := w − η diag(G)^{−1/2} ∘ g,

or, written as per-parameter updates:

w_j := w_j − (η / √(G_{j,j})) g_j.

Each G_{j,j} gives rise to a scaling factor for the learning rate that applies to a single parameter w_j.

Since the denominator in this factor, √(G_j) = √(Σ_{τ=1}^t g_τ²), is the ℓ2 norm of previous derivatives, extreme parameter updates get dampened, while parameters that get few or small updates receive higher learning rates.[14]

While designed for convex problems, AdaGrad has been successfully applied to non-convex optimization.[18]
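A minimal per-parameter sketch of the AdaGrad update; the small eps constant is a standard practical guard against division by zero, not part of the formula above.

```python
import numpy as np

def adagrad_step(w, G, g, eta=0.1, eps=1e-8):
    """One AdaGrad update with per-parameter scaling."""
    G = G + g * g                           # accumulate squared gradients G_{j,j}
    w = w - eta * g / (np.sqrt(G) + eps)    # dampen frequently-updated parameters
    return w, G
```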

RMSProp (for Root Mean Square Propagation) is also a method in which the learning rate is adapted for each of the parameters. The idea is to divide the learning rate for a weight by a running average of the magnitudes of recent gradients for that weight.[19] The running average is first calculated in terms of the mean square:

v(w, t) := γ v(w, t−1) + (1 − γ) (∇Q_i(w))²,

where γ is the forgetting factor. The parameters are then updated as:

w := w − (η / √(v(w, t))) ∇Q_i(w).
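A corresponding sketch of one RMSProp update; as above, eps is a stabilizing constant added for the sketch rather than part of the formula.

```python
import numpy as np

def rmsprop_step(w, v, g, eta=0.001, gamma=0.9, eps=1e-8):
    """One RMSProp update: v := γv + (1 − γ)g², then scale the step by √v."""
    v = gamma * v + (1.0 - gamma) * g * g   # running mean of squared gradients
    w = w - eta * g / (np.sqrt(v) + eps)
    return w, v
```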

Kalman-based Stochastic Gradient Descent (kSGD)[22] is an online and offline algorithm for learning parameters in statistical problems arising from quasi-likelihood models, which include linear models, non-linear models, generalized linear models, and neural networks with squared error loss as special cases.

For online learning problems, kSGD is a special case of the Kalman Filter for linear regression problems, a special case of the Extended Kalman Filter for non-linear regression problems, and can be viewed as an incremental Gauss-Newton method.

The benefits of kSGD, in comparison to other methods, are that (1) it is not sensitive to the condition number of the problem,[b] (2) it has a robust choice of hyperparameters, and (3) it has a stopping condition.

As noted by Patel,[22] for all problems besides linear regression, restarts are required to ensure convergence of the algorithm, but no theoretical or implementation details were given.

In a closely related, off-line, mini-batch method for non-linear regression analyzed by Bertsekas,[25] a forgetting factor was used in the covariance matrix update to prove convergence.
