Stochastic gradient descent
Stochastic gradient descent (often abbreviated SGD) is an iterative method for optimizing an objective function with suitable smoothness properties (e.g. differentiable or subdifferentiable). It can be regarded as a stochastic approximation of gradient descent optimization, since it replaces the actual gradient (calculated from the entire data set) by an estimate thereof (calculated from a randomly selected subset of the data). Especially in high-dimensional optimization problems this reduces the computational burden, achieving faster iterations in exchange for a lower convergence rate.
While the basic idea behind stochastic approximation can be traced back to the Robbins–Monro algorithm of the 1950s, stochastic gradient descent has become an important optimization method in machine learning.
Background
Both statistical estimation and machine learning consider the problem of minimizing an objective function that has the form of a sum:
$$Q(w) = \frac{1}{n} \sum_{i=1}^{n} Q_i(w),$$
where the parameter $w$ that minimizes $Q(w)$ is to be estimated. Each summand function $Q_i$ is typically associated with the $i$-th observation in the data set.
In classical statistics, sum-minimization problems arise in least squares and in maximum-likelihood estimation. Estimators that arise as minimizers of sums are called M-estimators. However, in statistics, it has long been recognized that requiring even local minimization is too restrictive for some problems of maximum-likelihood estimation. Therefore, contemporary statistical theorists often consider stationary points of the likelihood function.
The sum-minimization problem also arises for empirical risk minimization. In this case, $Q_i(w)$ is the value of the loss function at the $i$-th example, and $Q(w)$ is the empirical risk.
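When used to minimize the above function, a standard (or "batch") gradient descent method would perform the following iterations:
$$w := w - \eta \nabla Q(w) = w - \frac{\eta}{n} \sum_{i=1}^{n} \nabla Q_i(w),$$
where $\eta$ is a step size (sometimes called the learning rate in machine learning).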
In many cases, the summand functions have a simple form that enables inexpensive evaluations of the sum-function and the sum gradient. For example, in statistics, one-parameter exponential families allow economical function-evaluations and gradient-evaluations.
However, in other cases, evaluating the sum-gradient requires evaluating the gradients of all summand functions. When the training set is enormous and no simple formulas exist, this becomes very expensive. To economize on the computational cost at every iteration, stochastic gradient descent samples a subset of summand functions at every step. This is very effective in large-scale machine learning problems.
Iterative method
In stochastic gradient descent, the true gradient of $Q(w)$ is approximated by a gradient at a single example:
$$w := w - \eta \nabla Q_i(w).$$
As the algorithm sweeps through the training set, it performs the above update for each training example. Several passes can be made over the training set until the algorithm converges. If this is done, the data can be shuffled for each pass to prevent cycles. Typical implementations may use an adaptive learning rate so that the algorithm converges.
In pseudocode, stochastic gradient descent can be presented as follows:
- Choose an initial vector of parameters $w$ and learning rate $\eta$.
- Repeat until an approximate minimum is obtained:
- * Randomly shuffle examples in the training set.
- * For $i = 1, 2, \ldots, n$, do:
- ** $w := w - \eta \nabla Q_i(w)$.
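The pseudocode above can be sketched in plain Python as follows. This is a minimal illustration rather than a reference implementation: the function name `sgd`, the callback `grad_i(w, i)` (assumed to return the gradient of the $i$-th summand), the fixed number of passes used in place of a convergence test, and the toy objective are all assumptions made for the example.

```python
import random

def sgd(grad_i, w, n, eta=0.01, epochs=200, seed=0):
    """Run SGD on (1/n) * sum_i Q_i(w); grad_i(w, i) returns the gradient of Q_i at w."""
    rng = random.Random(seed)
    order = list(range(n))
    for _ in range(epochs):          # a fixed number of passes stands in for a convergence test
        rng.shuffle(order)           # reshuffle the examples on every pass
        for i in order:
            g = grad_i(w, i)
            # w := w - eta * grad Q_i(w)
            w = [wj - eta * gj for wj, gj in zip(w, g)]
    return w

# Hypothetical toy problem: Q_i(w) = (w - c_i)^2, whose average is minimized at mean(c) = 3.
c = [1.0, 2.0, 3.0, 6.0]
w_hat = sgd(lambda w, i: [2.0 * (w[0] - c[i])], [0.0], len(c))
print(w_hat)  # close to 3, with a little jitter because eta is held constant
```

With a constant learning rate the iterate keeps fluctuating around the minimizer instead of settling on it exactly, which is one motivation for decreasing learning-rate schedules.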
A compromise between computing the true gradient and the gradient at a single example is to compute the gradient against more than one training example (called a "mini-batch") at each step. This can perform significantly better than the "true" stochastic gradient descent described above, because the code can make use of vectorization libraries rather than computing each step separately. It may also result in smoother convergence, as the gradient computed at each step is averaged over more training examples.
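A mini-batch variant might look like the sketch below. It averages the per-example gradients over each batch before taking a step. The names `minibatch_sgd`, `grad_i` and `batch_size`, and the toy data, are assumptions for illustration; the sketch uses plain Python lists for self-containment, whereas a practical implementation would typically compute the batch gradient with a vectorized array library.

```python
import random

def minibatch_sgd(grad_i, w, n, batch_size=4, eta=0.05, epochs=100, seed=0):
    """SGD where each step uses the average gradient over a mini-batch of examples."""
    rng = random.Random(seed)
    order = list(range(n))
    for _ in range(epochs):
        rng.shuffle(order)
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            grads = [grad_i(w, i) for i in batch]
            # average the per-example gradients, coordinate by coordinate
            avg = [sum(col) / len(batch) for col in zip(*grads)]
            w = [wj - eta * gj for wj, gj in zip(w, avg)]
    return w

# Hypothetical toy problem: Q_i(w) = (w - c_i)^2 with mean(c) = 3.
c = [1.0, 2.0, 3.0, 6.0, 0.0, 4.0, 5.0, 3.0]
w_mb = minibatch_sgd(lambda w, i: [2.0 * (w[0] - c[i])], [0.0], len(c))
```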
The convergence of stochastic gradient descent has been analyzed using the theories of convex minimization and of stochastic approximation. Briefly, when the learning rates $\eta$ decrease with an appropriate rate, and subject to relatively mild assumptions, stochastic gradient descent converges almost surely to a global minimum when the objective function is convex or pseudoconvex, and otherwise converges almost surely to a local minimum. This is in fact a consequence of the Robbins–Siegmund theorem.
Example
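Suppose we want to fit a straight line $\hat{y} = w_1 + w_2 x$ to a training set with observations $(x_1, y_1), \ldots, (x_n, y_n)$ and corresponding estimated responses $\hat{y}_1, \ldots, \hat{y}_n$ using least squares. The objective function to be minimized is:
$$Q(w) = \frac{1}{n} \sum_{i=1}^{n} Q_i(w) = \frac{1}{n} \sum_{i=1}^{n} \left(\hat{y}_i - y_i\right)^2 = \frac{1}{n} \sum_{i=1}^{n} \left(w_1 + w_2 x_i - y_i\right)^2.$$
The last line in the above pseudocode for this specific problem will become:
$$\begin{bmatrix} w_1 \\ w_2 \end{bmatrix} := \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} - \eta \begin{bmatrix} \frac{\partial}{\partial w_1} (w_1 + w_2 x_i - y_i)^2 \\ \frac{\partial}{\partial w_2} (w_1 + w_2 x_i - y_i)^2 \end{bmatrix} = \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} - \eta \begin{bmatrix} 2 (w_1 + w_2 x_i - y_i) \\ 2 x_i (w_1 + w_2 x_i - y_i) \end{bmatrix}.$$
Note that in each iteration, only the gradient at a single point $x_i$ is evaluated, rather than the gradient over the set of all samples.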
The key difference compared to standard (batch) gradient descent is that only one piece of data from the dataset is used to calculate the step, and that piece of data is picked randomly at each step.
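The straight-line example can be tried out with a short script such as the one below. It is a sketch under stated assumptions: the function name `fit_line_sgd`, the step size, the number of passes and the noise-free data on the line $y = 1 + 2x$ are all chosen for illustration only.

```python
import random

def fit_line_sgd(xs, ys, eta=0.01, epochs=500, seed=0):
    """Fit y ~ w1 + w2 * x by SGD on the squared error of one example at a time."""
    rng = random.Random(seed)
    w1, w2 = 0.0, 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            r = w1 + w2 * xs[i] - ys[i]           # residual at example i
            # gradient of (w1 + w2 * x_i - y_i)^2 is (2 * r, 2 * x_i * r)
            w1, w2 = w1 - eta * 2.0 * r, w2 - eta * 2.0 * xs[i] * r
    return w1, w2

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0 + 2.0 * x for x in xs]    # assumed noise-free data on the line y = 1 + 2x
w1, w2 = fit_line_sgd(xs, ys)
```

Because the assumed data lie exactly on a line, every summand is minimized by the same parameters and the iterates approach the intercept 1 and slope 2; with noisy data and a constant step size they would instead hover around the least-squares solution.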
Notable applications
Stochastic gradient descent is a popular algorithm for training a wide range of models in machine learning, including support vector machines, logistic regression and graphical models. When combined with the backpropagation algorithm, it is the de facto standard algorithm for training artificial neural networks. Its use has also been reported in the geophysics community, specifically in applications of full waveform inversion.
Stochastic gradient descent competes with the L-BFGS algorithm, which is also widely used. Stochastic gradient descent has been used since at least 1960 for training linear regression models, originally under the name ADALINE.
Another stochastic gradient descent algorithm is the least mean squares adaptive filter.
Extensions and variants
Many improvements on the basic stochastic gradient descent algorithm have been proposed and used. In particular, in machine learning, the need to set a learning rate has been recognized as problematic. Setting this parameter too high can cause the algorithm to diverge; setting it too low makes it slow to converge. A conceptually simple extension of stochastic gradient descent makes the learning rate a decreasing function of the iteration number, giving a learning rate schedule, so that the first iterations cause large changes in the parameters, while the later ones do only fine-tuning. Such schedules have been known since the work of MacQueen on $k$-means clustering. Practical guidance on choosing the step size in several variants of SGD is given by Spall.
Implicit updates (ISGD)
As mentioned earlier, classical stochastic gradient descent is generally sensitive to the learning rate $\eta$. Fast convergence requires large learning rates, but this may induce numerical instability. The problem can be largely solved by considering implicit updates whereby the stochastic gradient is evaluated at the next iterate rather than the current one:
$$w^{\text{new}} := w^{\text{old}} - \eta \nabla Q_i(w^{\text{new}}).$$
This equation is implicit since $w^{\text{new}}$ appears on both sides of the equation. It is a stochastic form of the proximal gradient method since the update can also be written as:
$$w^{\text{new}} := \arg\min_{w} \left\{ Q_i(w) + \frac{1}{2\eta} \left\| w - w^{\text{old}} \right\|^2 \right\}.$$
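As an example, consider least squares with features $x_1, \ldots, x_n \in \mathbb{R}^p$ and observations $y_1, \ldots, y_n \in \mathbb{R}$. We wish to solve:
$$\min_{w} \sum_{j=1}^{n} \left(y_j - x_j' w\right)^2,$$
where $x_j' w = x_{j1} w_1 + x_{j2} w_2 + \cdots + x_{jp} w_p$ indicates the inner product.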
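Note that $x$ could have "1" as the first element to include an intercept. Classical stochastic gradient descent proceeds as follows (absorbing the constant factor 2 from the gradient into $\eta$):
$$w^{\text{new}} = w^{\text{old}} + \eta \left(y_i - x_i' w^{\text{old}}\right) x_i,$$
where $i$ is uniformly sampled between 1 and $n$. Although theoretical convergence of this procedure happens under relatively mild assumptions, in practice the procedure can be quite unstable. In particular, when $\eta$ is misspecified so that $I - \eta x_i x_i'$ has large absolute eigenvalues with high probability, the procedure may diverge numerically within a few iterations. In contrast, implicit stochastic gradient descent can be solved in closed form as:
$$w^{\text{new}} = w^{\text{old}} + \frac{\eta}{1 + \eta \left\| x_i \right\|^2} \left(y_i - x_i' w^{\text{old}}\right) x_i.$$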
This procedure will remain numerically stable for virtually all $\eta$, as the learning rate is now normalized. This comparison between classical and implicit stochastic gradient descent in the least squares problem is very similar to the comparison between the least mean squares (LMS) filter and the normalized least mean squares (NLMS) filter.
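The difference in stability between the classical and the implicit least-squares updates can be seen in a small experiment. The sketch below is illustrative only: the helper names `sgd_step` and `isgd_step`, the single observation and the deliberately oversized learning rate are assumptions, and the constant factor 2 of the gradient is taken to be absorbed into the learning rate.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def sgd_step(w, x, y, eta):
    """Classical update: w + eta * (y - x'w) * x."""
    r = y - dot(x, w)
    return [wj + eta * r * xj for wj, xj in zip(w, x)]

def isgd_step(w, x, y, eta):
    """Implicit update in closed form: the step is shrunk by 1 / (1 + eta * ||x||^2)."""
    r = y - dot(x, w)
    scale = eta / (1.0 + eta * dot(x, x))
    return [wj + scale * r * xj for wj, xj in zip(w, x)]

x, y = [1.0, 3.0], 2.0      # one hypothetical observation with ||x||^2 = 10
eta = 1.0                   # deliberately far too large for the classical update
w_exp = [0.0, 0.0]
w_imp = [0.0, 0.0]
for _ in range(20):
    w_exp = sgd_step(w_exp, x, y, eta)
    w_imp = isgd_step(w_imp, x, y, eta)

res_exp = abs(y - dot(x, w_exp))   # residual grows by a factor |1 - eta * ||x||^2| = 9 per step
res_imp = abs(y - dot(x, w_imp))   # residual shrinks by a factor 1 / (1 + eta * ||x||^2) = 1/11 per step
```

With these assumed values the classical residual is multiplied by 9 in magnitude at every step while the implicit residual is divided by 11, so the same learning rate diverges in one scheme and converges in the other.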
Even though a closed-form solution for ISGD is only possible in least squares, the procedure can be efficiently implemented in a wide range of models. Specifically, suppose that $Q_i(w)$ depends on $w$ only through a linear combination with features $x_i$, so that we can write $\nabla_w Q_i(w) = -q(x_i' w)\, x_i$, where $q(\cdot) \in \mathbb{R}$ may depend on $x_i, y_i$ as well but not on $w$ except through $x_i' w$. Least squares obeys this rule, and so do logistic regression and most generalized linear models. For instance, in least squares, $q(x_i' w) = y_i - x_i' w$, and in logistic regression, $q(x_i' w) = y_i - S(x_i' w)$, where $S(u) = e^u / (1 + e^u)$ is the logistic function. In Poisson regression, $q(x_i' w) = y_i - e^{x_i' w}$, and so on.
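In such settings, ISGD is simply implemented as follows. Let $f(\xi) = \eta\, q(x_i' w^{\text{old}} + \xi \left\| x_i \right\|^2)$, where $\xi$ is scalar. Then, ISGD is equivalent to:
$$w^{\text{new}} = w^{\text{old}} + \xi^{*} x_i, \quad \text{where } \xi^{*} = f(\xi^{*}).$$
The scaling factor $\xi^{*} \in \mathbb{R}$ can be found through the bisection method since in most regular models, such as the aforementioned generalized linear models, the function $q(\cdot)$ is decreasing, and thus the search bounds for $\xi^{*}$ are $[\min(0, f(0)), \max(0, f(0))]$.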
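The bisection search for the scaling factor can be sketched as below for logistic regression. This is a hedged illustration: the function name `isgd_step`, the callback `q` (taking the linear predictor as its argument), the fixed iteration count and the single-observation example are all assumptions, not part of any particular library.

```python
import math

def logistic(u):
    return 1.0 / (1.0 + math.exp(-u))

def isgd_step(w, x, eta, q, iters=100):
    """One implicit update w_new = w + xi * x, where xi solves xi = eta * q(x'w + xi * ||x||^2)."""
    xw = sum(wj * xj for wj, xj in zip(w, x))
    xx = sum(xj * xj for xj in x)
    f = lambda xi: eta * q(xw + xi * xx)
    lo, hi = min(0.0, f(0.0)), max(0.0, f(0.0))   # search bounds [min(0, f(0)), max(0, f(0))]
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        # g(xi) = f(xi) - xi is decreasing whenever q is decreasing
        if f(mid) - mid > 0.0:
            lo = mid          # the fixed point lies to the right of mid
        else:
            hi = mid
    xi = 0.5 * (lo + hi)
    return [wj + xi * xj for wj, xj in zip(w, x)], xi

# Hypothetical logistic-regression example with a single observation (x, y).
x, y, eta = [1.0, 2.0], 1.0, 0.5
w_new, xi = isgd_step([0.0, 0.0], x, eta, lambda u: y - logistic(u))
```

Only a scalar equation is solved per update, so the cost of an implicit step stays close to that of a classical step regardless of the number of parameters.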
Momentum
Further proposals include the momentum method, which appeared in Rumelhart, Hinton and Williams' paper on backpropagation learning. Stochastic gradient descent with momentum remembers the update $\Delta w$ at each iteration, and determines the next update as a linear combination of the gradient and the previous update:
$$\Delta w := \alpha \Delta w - \eta \nabla Q_i(w)$$
$$w := w + \Delta w$$
which leads to:
$$w := w - \eta \nabla Q_i(w) + \alpha \Delta w$$
where the parameter $w$ which minimizes $Q(w)$ is to be estimated, $\eta$ is a step size and $\alpha$ is an exponential decay factor between 0 and 1 that determines the relative contribution of the current gradient and earlier gradients to the weight change.
The name momentum stems from an analogy to momentum in physics: the weight vector, thought of as a particle traveling through parameter space, incurs acceleration from the gradient of the loss. Unlike in classical stochastic gradient descent, it tends to keep traveling in the same direction, preventing oscillations. Momentum has been used successfully by computer scientists in the training of artificial neural networks for several decades.
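A single momentum update, and a short loop that uses it, might be written as follows. The function name `momentum_step`, the default values of the step size `eta` and decay factor `alpha`, and the toy summands (which all share the minimizer 3) are assumptions made for this sketch.

```python
import random

def momentum_step(w, delta, grad, eta=0.01, alpha=0.9):
    """Return (w_new, delta_new) after one SGD-with-momentum update."""
    # Delta w := alpha * Delta w - eta * gradient
    delta_new = [alpha * d - eta * g for d, g in zip(delta, grad)]
    # w := w + Delta w
    w_new = [wj + d for wj, d in zip(w, delta_new)]
    return w_new, delta_new

# Hypothetical toy summands Q_i(w) = a_i * (w - 3)^2, all minimized at w = 3.
a = [0.8, 1.0, 1.2]
rng = random.Random(0)
w, delta = [0.0], [0.0]
for _ in range(600):
    i = rng.randrange(len(a))                 # pick one summand at random
    grad = [2.0 * a[i] * (w[0] - 3.0)]
    w, delta = momentum_step(w, delta, grad)
```

The previous update `delta` is the only extra state the method carries, which is why momentum adds little cost to plain stochastic gradient descent.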
Averaging
Averaged stochastic gradient descent, invented independently by Ruppert and Polyak in the late 1980s, is ordinary stochastic gradient descent that records an average of its parameter vector over time. That is, the update is the same as for ordinary stochastic gradient descent, but the algorithm also keeps track of
$$\bar{w} = \frac{1}{t} \sum_{i=0}^{t-1} w_i.$$
When optimization is done, this averaged parameter vector takes the place of $w$.
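The running average does not require storing past iterates, since it can be updated incrementally. The sketch below assumes a hypothetical callback `grad_i(w, i)`, a constant learning rate and a toy objective; `averaged_sgd` is an illustrative name.

```python
import random

def averaged_sgd(grad_i, w, n, eta=0.05, steps=2000, seed=0):
    """Plain SGD that also returns the average of the iterates w_0, ..., w_{steps-1}."""
    rng = random.Random(seed)
    w_bar = [0.0] * len(w)
    for t in range(1, steps + 1):
        # incremental mean: after this line w_bar is the average of the first t iterates
        w_bar = [bj + (wj - bj) / t for bj, wj in zip(w_bar, w)]
        i = rng.randrange(n)
        g = grad_i(w, i)
        w = [wj - eta * gj for wj, gj in zip(w, g)]   # ordinary SGD update
    return w, w_bar

# Hypothetical toy problem: Q_i(w) = (w - c_i)^2 with mean(c) = 3.
c = [1.0, 2.0, 3.0, 6.0]
w_last, w_avg = averaged_sgd(lambda w, i: [2.0 * (w[0] - c[i])], [0.0], len(c))
```

In this toy setting the last iterate keeps jittering around 3 because the learning rate is constant, while the averaged vector smooths much of that jitter out.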
AdaGrad
AdaGrad (for adaptive gradient algorithm) is a modified stochastic gradient descent algorithm with a per-parameter learning rate, first published in 2011. Informally, this increases the learning rate for sparser parameters and decreases the learning rate for ones that are less sparse. This strategy often improves convergence performance over standard stochastic gradient descent in settings where data is sparse and sparse parameters are more informative. Examples of such applications include natural language processing and image recognition. It still has a base learning rate $\eta$, but this is scaled using the elements of a vector $\{G_{j,j}\}$ which is the diagonal of the outer product matrix
$$G = \sum_{\tau=1}^{t} g_\tau g_\tau^{\mathsf{T}},$$
where $g_\tau = \nabla Q_i(w)$ is the gradient at iteration $\tau$. The diagonal is given by
$$G_{j,j} = \sum_{\tau=1}^{t} g_{\tau,j}^2.$$
This vector is updated after every iteration. The formula for an update is now
$$w := w - \eta\, \mathrm{diag}(G)^{-\frac{1}{2}} \odot g$$
or, written as per-parameter updates,
$$w_j := w_j - \frac{\eta}{\sqrt{G_{j,j}}}\, g_j.$$
Each $G_{j,j}$ gives rise to a scaling factor for the learning rate that applies to a single parameter $w_j$. Since the denominator in this factor, $\sqrt{G_{j,j}} = \sqrt{\sum_{\tau=1}^{t} g_{\tau,j}^2}$, is the ℓ2 norm of previous derivatives, extreme parameter updates get dampened, while parameters that get few or small updates receive higher learning rates.
While designed for convex problems, AdaGrad has been successfully applied to non-convex optimization.
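The per-parameter AdaGrad update can be sketched as a single step function. The name `adagrad_step` and the gradient values in the usage lines are hypothetical; the small constant `eps` is an assumption added here only to avoid division by zero for parameters whose accumulated squared gradient is still zero.

```python
import math

def adagrad_step(w, G, grad, eta=0.1, eps=1e-8):
    """One AdaGrad update; G holds the running sum of squared gradients per parameter."""
    G_new = [Gj + gj * gj for Gj, gj in zip(G, grad)]
    # each parameter gets its own effective rate eta / sqrt(G_jj); eps is an added safeguard
    w_new = [wj - eta / (math.sqrt(Gj) + eps) * gj
             for wj, Gj, gj in zip(w, G_new, grad)]
    return w_new, G_new

# Two hypothetical gradients: the first parameter is updated often, the second rarely.
w, G = [1.0, 1.0], [0.0, 0.0]
w, G = adagrad_step(w, G, [3.0, 0.0])
w, G = adagrad_step(w, G, [4.0, 1.0])
```

After the second step the first parameter has accumulated squared gradients of 25 and moves with an effective rate of about 0.1 / 5, whereas the rarely updated second parameter still moves with an effective rate of about 0.1.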
RMSProp
RMSProp (for Root Mean Square Propagation) is also a method in which the learning rate is adapted for each of the parameters. The idea is to divide the learning rate for a weight by a running average of the magnitudes of recent gradients for that weight. First, the running average of the squared gradient is calculated:
$$v(w, t) := \gamma\, v(w, t-1) + (1 - \gamma) \left(\nabla Q_i(w)\right)^2,$$
where $\gamma$ is the forgetting factor. The parameters are then updated as
$$w := w - \frac{\eta}{\sqrt{v(w, t)}} \nabla Q_i(w).$$
RMSProp has shown good adaptation of the learning rate in different applications. RMSProp can be seen as a generalization of Rprop and is capable of working with mini-batches as well, as opposed to only full batches.
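A single RMSProp update can be sketched as follows. The function name `rmsprop_step` and the default values are illustrative, and the constant `eps` in the denominator is an assumption added to guard against division by zero; it is not part of the update described above.

```python
import math

def rmsprop_step(w, v, grad, eta=0.01, gamma=0.9, eps=1e-8):
    """One RMSProp update; v is the running average of squared gradients per parameter."""
    # v := gamma * v + (1 - gamma) * grad^2, elementwise
    v_new = [gamma * vj + (1.0 - gamma) * gj * gj for vj, gj in zip(v, grad)]
    # w := w - eta / sqrt(v) * grad, with eps as an added safeguard
    w_new = [wj - eta / (math.sqrt(vj) + eps) * gj
             for wj, vj, gj in zip(w, v_new, grad)]
    return w_new, v_new

# Hypothetical usage: one parameter, one gradient evaluation.
w, v = rmsprop_step([1.0], [0.0], [2.0])
```

Unlike a running sum of squared gradients, the exponentially weighted average forgets old gradients, so the effective learning rate does not shrink monotonically over the run.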
Adam
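Adam (short for Adaptive Moment Estimation) is an update to the RMSProp optimizer. In this optimization algorithm, running averages of both the gradients and the second moments of the gradients are used. Given parameters $w^{(t)}$ and a loss function $L^{(t)}$, where $t$ indexes the current training iteration (starting at 0), Adam's parameter update is given by:
$$m_w^{(t+1)} \leftarrow \beta_1 m_w^{(t)} + (1 - \beta_1) \nabla_w L^{(t)}$$
$$v_w^{(t+1)} \leftarrow \beta_2 v_w^{(t)} + (1 - \beta_2) \left(\nabla_w L^{(t)}\right)^2$$
$$\hat{m}_w = \frac{m_w^{(t+1)}}{1 - \beta_1^{t+1}}, \qquad \hat{v}_w = \frac{v_w^{(t+1)}}{1 - \beta_2^{t+1}}$$
$$w^{(t+1)} \leftarrow w^{(t)} - \eta \frac{\hat{m}_w}{\sqrt{\hat{v}_w} + \epsilon}$$
where $\epsilon$ is a small scalar used to prevent division by 0, and $\beta_1$ and $\beta_2$ are the forgetting factors for gradients and second moments of gradients, respectively. Squaring and square-rooting are done elementwise.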
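The Adam update can be sketched as one step function operating on plain lists. The name `adam_step`, the default hyperparameter values and the convention that `t` is a 0-based iteration index are assumptions of this sketch rather than a description of any particular library.

```python
import math

def adam_step(w, m, v, grad, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at 0-based iteration t; returns (w_new, m_new, v_new)."""
    # running averages of the gradient and of its elementwise square
    m_new = [beta1 * mj + (1.0 - beta1) * gj for mj, gj in zip(m, grad)]
    v_new = [beta2 * vj + (1.0 - beta2) * gj * gj for vj, gj in zip(v, grad)]
    # bias correction compensates for initializing m and v at zero
    m_hat = [mj / (1.0 - beta1 ** (t + 1)) for mj in m_new]
    v_hat = [vj / (1.0 - beta2 ** (t + 1)) for vj in v_new]
    w_new = [wj - eta * mh / (math.sqrt(vh) + eps)
             for wj, mh, vh in zip(w, m_hat, v_hat)]
    return w_new, m_new, v_new

# Hypothetical first step from zero-initialized moment estimates.
w, m, v = adam_step([1.0], [0.0], [0.0], [2.0], t=0)
```

On the very first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly the learning rate whatever the scale of the gradient.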