Stochastic Gradient Descent

  • The main problem with Batch Gradient Descent is that it uses the whole training set to compute the gradients at every step, which makes it very slow when the training set is large.
  • At the opposite extreme, Stochastic Gradient Descent picks a random instance in the training set at every step and computes the gradients based only on that single instance (a minimal sketch follows this list).
  • Obviously, working on a single instance at a time makes the algorithm much faster because it has very little data to manipulate at every iteration.
  • It also makes it possible to train on huge training sets, since only one instance needs to be in memory at each iteration (Stochastic GD can be implemented as an out-of-core algorithm).
  • On the other hand, due to its stochastic (i.e., random) nature, this algorithm is much less regular than Batch Gradient Descent: instead of gently decreasing until it reaches the minimum, the cost function will bounce up and down, decreasing only on average.
  • Over time it will end up very close to the minimum, but once it gets there it will continue to bounce around, never settling down.
  • So once the algorithm stops, the final parameter values are good, but not optimal.
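
Below is a minimal sketch of Stochastic Gradient Descent for a Linear Regression model trained on the MSE cost, written in NumPy. The generated data, the learning rate `eta`, and the epoch count are illustrative assumptions, not part of these notes.

```python
import numpy as np

# Illustrative data: y = 4 + 3x + Gaussian noise (values are assumptions)
np.random.seed(42)
m = 100
X = 2 * np.random.rand(m, 1)
y = 4 + 3 * X + np.random.randn(m, 1)
X_b = np.c_[np.ones((m, 1)), X]          # add x0 = 1 to each instance

n_epochs = 50
eta = 0.01                               # fixed learning rate (an assumption)
theta = np.random.randn(2, 1)            # random initialization

for epoch in range(n_epochs):
    for i in range(m):
        random_index = np.random.randint(m)        # pick one instance at random
        xi = X_b[random_index:random_index + 1]
        yi = y[random_index:random_index + 1]
        gradients = 2 * xi.T @ (xi @ theta - yi)   # gradient on that single instance only
        theta = theta - eta * gradients            # take one small step

print(theta)   # ends up close to [[4.], [3.]], but keeps bouncing around the minimum
```

Each update touches a single instance, so a step costs almost nothing even on a huge training set; the price is the bouncy, only-on-average descent described above.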

  • With Stochastic Gradient Descent, each training step is much faster but also much more stochastic than when using Batch Gradient Descent.

  • When the cost function is very irregular, this can actually help the algorithm jump out of local minima, so Stochastic Gradient Descent has a better chance of finding the global minimum than Batch Gradient Descent does.

  • One advantage is that the frequent updates give us a fairly detailed picture of the rate of improvement. However, these frequent updates are more computationally expensive than the approach of Batch Gradient Descent.

  • The frequency of those updates can also result in noisy gradients, which may cause the error rate to jump around, instead of slowly decreasing.
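
To make the last two points concrete, this sketch (same assumed linear-regression setup as above) records the full-training-set MSE after every single-instance update; the resulting curve jumps up and down between consecutive updates but is far lower at the end than at the start.

```python
import numpy as np

np.random.seed(0)
m = 100
X = 2 * np.random.rand(m, 1)
y = 4 + 3 * X + np.random.randn(m, 1)
X_b = np.c_[np.ones((m, 1)), X]

eta = 0.01
theta = np.random.randn(2, 1)
mse_history = []                                   # one entry per update: detailed but noisy

for epoch in range(10):
    for i in range(m):
        idx = np.random.randint(m)
        xi, yi = X_b[idx:idx + 1], y[idx:idx + 1]
        theta -= eta * 2 * xi.T @ (xi @ theta - yi)
        # Evaluating the full-set MSE after every update is done here only to
        # visualize progress; on a large set you would log it far less often.
        mse_history.append(float(np.mean((X_b @ theta - y) ** 2)))

print(mse_history[:5])                  # consecutive values jump around...
print(mse_history[0], mse_history[-1])  # ...yet the error decreases on average
```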