
Decouple the computational batch size and minibatch size by accumulating gradients #1977

Merged · 8 commits · May 30, 2015

Commits on May 27, 2015

  1. zero-init param diffs and accumulate gradients (41cf06c)

    With layers whose backward pass accumulates gradients, this effectively
    decouples the computational batch from the SGD minibatch. Each iteration
    accumulates gradients over iter_size batches, then the parameters are
    updated. (See the accumulation sketch after this commit list.)

    longjon authored and shelhamer committed May 27, 2015
  2. 539f879
  3. 3262e46
  4. accumulate gradients in (de)conv layers (8cc9af0)

    longjon authored and shelhamer committed May 27, 2015
  5. 67b1ff3
  6. adjust local learning rate and decay according to gradient accumulation (55585f5)

    Divide the local rate by `iter_size` to normalize the gradient according
    to the full minibatch size rather than only the computational batch size.

    Multiply the local decay by `iter_size` to counter the division of the
    local learning rate, since the decay is multiplied by the rate in the
    update equation. (See the rate/decay sketch after this commit list.)

    shelhamer committed May 27, 2015
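
The accumulation loop introduced in 41cf06c can be pictured roughly as below. This is a simplified sketch and not Caffe's actual solver code; `Param`, `ForwardBackward`, and `StepOnce` are hypothetical stand-ins for the real solver and net machinery.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Param {
  std::vector<float> data;  // weights
  std::vector<float> diff;  // accumulated gradient
};

// Assumed helper: runs one computational batch and *adds* its gradient into
// each param's diff (i.e. layers whose backward accumulates gradients).
void ForwardBackward(std::vector<Param>& params);

void StepOnce(std::vector<Param>& params, int iter_size, float rate) {
  // Zero-init the parameter diffs at the start of each SGD iteration.
  for (auto& p : params)
    std::fill(p.diff.begin(), p.diff.end(), 0.0f);

  // Accumulate gradients over iter_size computational batches ...
  for (int i = 0; i < iter_size; ++i)
    ForwardBackward(params);

  // ... then apply a single parameter update for the full minibatch.
  for (auto& p : params)
    for (std::size_t j = 0; j < p.data.size(); ++j)
      p.data[j] -= rate * p.diff[j];
}
```

With a data-layer batch size of 32 and `iter_size: 2` in the solver, the effective SGD minibatch is 64 while only 32 examples need to fit in memory at a time.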
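
The rate and decay adjustment in 55585f5 is just bookkeeping around that loop. A hypothetical illustration, not the solver source: since the accumulated diff is a sum over `iter_size` batches, the local rate is scaled down to normalize it, and the local decay is scaled up so that the weight-decay term of the update, which multiplies decay by rate, is left unchanged.

```cpp
// Scale the rate by 1 / iter_size to normalize the summed gradient to the
// full minibatch rather than the computational batch.
float EffectiveRate(float local_rate, int iter_size) {
  return local_rate / iter_size;
}

// Scale the decay by iter_size so that
//   EffectiveRate(rate, n) * EffectiveDecay(decay, n) == rate * decay,
// keeping the rate * decay * weight term of the update equation unchanged.
float EffectiveDecay(float local_decay, int iter_size) {
  return local_decay * iter_size;
}
```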

Commits on May 28, 2015

  1. test equivalence of solving with accumulating gradients (92ab737)

    Compare the parameters after solving with a given batch size against
    those from the equivalent run with half the batch size and two
    iterations of gradient accumulation.
    (See the equivalence sketch after this commit list.)

    Note: the test net's dummy data layer now produces constant data and
    random Gaussian targets. This ensures the standard and gradient
    accumulation cases see the same data; otherwise the difference in
    batch sizes causes different orders of random number draws.

    shelhamer committed May 28, 2015
  2. directly normalize accumulated gradients (0e7a078)

    `SGDSolver::Normalize()` normalizes accumulated gradients by scaling
    them by `1 / iter_size`, the inverse of the accumulation count.

    This fixes accumulation for AdaGrad and is more straightforward than
    adjusting the rates and decays as in 55585f5.
    (See the normalization sketch after this commit list.)

    shelhamer committed May 28, 2015
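
The equivalence checked in 92ab737 can be outlined as below. This is not the actual test code; `Solve` is a hypothetical helper that trains for a fixed number of updates with the given data-layer batch size and `iter_size`, then returns the flattened parameters. The real test additionally fixes the data (constant inputs, Gaussian targets) so that both runs see identical batches.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Assumed helper: train with the given batch size and iter_size, then
// return the learned parameters flattened into one vector.
std::vector<float> Solve(int batch_size, int iter_size);

void CheckAccumulationEquivalence(int batch_size) {
  // One update over the full batch ...
  const std::vector<float> full = Solve(batch_size, /*iter_size=*/1);
  // ... should match one update accumulated over two half-size batches.
  const std::vector<float> accum = Solve(batch_size / 2, /*iter_size=*/2);

  assert(full.size() == accum.size());
  for (std::size_t i = 0; i < full.size(); ++i)
    assert(std::fabs(full[i] - accum[i]) < 1e-5f);
}
```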
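
The normalization added in 0e7a078 is conceptually a single scaling of the accumulated diffs before the solver-specific update. A minimal sketch, not the actual `SGDSolver::Normalize()` implementation:

```cpp
#include <vector>

// Scale an accumulated gradient by 1 / iter_size so the update sees the
// average over the full minibatch; with iter_size == 1 this is a no-op,
// leaving the non-accumulating path untouched.
void Normalize(std::vector<float>& diff, int iter_size) {
  if (iter_size == 1) return;
  const float scale = 1.0f / static_cast<float>(iter_size);
  for (float& d : diff) d *= scale;
}
```

Scaling the gradients themselves, rather than the learning rate, is what makes accumulation behave correctly for solvers such as AdaGrad that use the gradient nonlinearly.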