Decouple the computational batch size and minibatch size by accumulating gradients #1977
Merged
Commits on May 27, 2015
zero-init param diffs and accumulate gradients
With layers whose backward accumulates gradients, this effectively decouples the computational batch from the SGD minibatch. Each iteration accumulates gradients over `iter_size` batches, then the parameters are updated.
Commit 41cf06c
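As a rough illustration of that scheme (a generic C++ sketch, not Caffe's solver code; the `Param` struct, `ForwardBackward`, and `SgdIteration` names are made up for this example): the diffs are zero-initialized once per SGD iteration, each of the `iter_size` computational batches adds its gradient into the same diff buffer, and the weights are updated only after the last batch.

```cpp
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

// Hypothetical stand-in for a net's parameters and their gradient buffer.
struct Param {
  std::vector<float> data;
  std::vector<float> diff;
};

// Toy forward/backward for one computational batch: the "loss" here is
// 0.5 * ||w - target||^2, so the gradient is (w - target). The key point is
// that backward ADDS into diff instead of overwriting it.
void ForwardBackward(Param& p, const std::vector<float>& target) {
  for (std::size_t j = 0; j < p.data.size(); ++j) {
    p.diff[j] += p.data[j] - target[j];  // accumulate, do not overwrite
  }
}

// One SGD iteration over a minibatch made of iter_size computational batches.
void SgdIteration(Param& p, const std::vector<std::vector<float>>& batches,
                  float rate) {
  const std::size_t iter_size = batches.size();
  // Zero-init the diffs once, before accumulation starts.
  std::fill(p.diff.begin(), p.diff.end(), 0.0f);
  // Accumulate gradients; parameters stay fixed during accumulation.
  for (const auto& target : batches) {
    ForwardBackward(p, target);
  }
  // Update only once per minibatch, normalizing the accumulated sum.
  for (std::size_t j = 0; j < p.data.size(); ++j) {
    p.data[j] -= rate * p.diff[j] / iter_size;
  }
}

int main() {
  Param p{{0.0f, 0.0f}, {0.0f, 0.0f}};
  // Two computational batches per SGD minibatch (iter_size == 2).
  SgdIteration(p, {{1.0f, 2.0f}, {3.0f, 4.0f}}, 0.1f);
  std::cout << p.data[0] << " " << p.data[1] << "\n";  // prints 0.2 0.3
  return 0;
}
```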
Commit 539f879
Commit 3262e46
Commit 8cc9af0
Commit 67b1ff3
adjust local learning rate and decay according to gradient accumulation
Divide the local rate by `iter_size` to normalize the gradient according to the full minibatch size rather than only the computational batch size. Multiply the local decay by `iter_size` to counter the division of the local learning rate, since the decay is multiplied by the rate in the update equation.
Commit 55585f5
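A back-of-the-envelope check of why the two adjustments cancel in the decay term (an illustrative sketch assuming a plain SGD update with L2 weight decay, `rate * (grad + decay * w)`; `UpdateStep` is a hypothetical name, not solver code):

```cpp
// With an accumulated gradient (a sum over iter_size computational batches),
// dividing the local rate by iter_size turns that sum back into a
// per-minibatch average, while multiplying the local decay by iter_size
// keeps the weight-decay contribution unchanged:
//   (rate / iter_size) * (grad_sum + (decay * iter_size) * w)
//     == rate * grad_sum / iter_size + rate * decay * w
float UpdateStep(float rate, float decay, float grad_sum, float w,
                 int iter_size) {
  const float local_rate = rate / iter_size;    // normalize accumulated gradient
  const float local_decay = decay * iter_size;  // counter the rate division
  return local_rate * (grad_sum + local_decay * w);
}
```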
Commits on May 28, 2015
test equivalence of solving with accumulating gradients
Compare the parameters after solving with a given batch size against solving with half the batch size plus two iterations of gradient accumulation; the two should be equivalent. Note: the test net's dummy data layer now makes constant data and random Gaussian targets. This ensures the standard and gradient-accumulation cases check the same data; otherwise, the difference in batch sizes causes different orders of random number draws.
Commit 92ab737
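The intent of the test can be mimicked with a toy model (an illustrative sketch, not the actual Caffe test; the scalar linear model, constant inputs, and `BatchGradient` helper are assumptions). On fixed data, one full-batch step should match one step built from two half-batches whose gradients are accumulated and then normalized:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <iostream>
#include <vector>

// Gradient of 0.5 * (w * x - y)^2 w.r.t. w, averaged over a batch.
float BatchGradient(float w, const std::vector<float>& x,
                    const std::vector<float>& y) {
  float g = 0.0f;
  for (std::size_t i = 0; i < x.size(); ++i) {
    g += (w * x[i] - y[i]) * x[i];
  }
  return g / x.size();
}

int main() {
  const float rate = 0.1f;
  // Constant data and fixed targets, so both runs see the same examples.
  const std::vector<float> x = {1.0f, 1.0f, 1.0f, 1.0f};
  const std::vector<float> y = {0.3f, -0.7f, 1.1f, 0.5f};

  // Standard case: one step with the full batch of 4.
  float w_full = 0.0f;
  w_full -= rate * BatchGradient(w_full, x, y);

  // Accumulation case: halved batch size, iter_size == 2. Both half-batch
  // gradients are computed at the SAME parameter value, summed into one
  // diff, then normalized by iter_size before the single update.
  float w_acc = 0.0f;
  float diff = 0.0f;
  diff += BatchGradient(w_acc, {x[0], x[1]}, {y[0], y[1]});
  diff += BatchGradient(w_acc, {x[2], x[3]}, {y[2], y[3]});
  diff /= 2.0f;  // normalize by iter_size
  w_acc -= rate * diff;

  assert(std::fabs(w_full - w_acc) < 1e-6f);
  std::cout << w_full << " == " << w_acc << "\n";
  return 0;
}
```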
directly normalize accumulated gradients
`SGDSolver::Normalize()` normalizes accumulated gradients by scaling them by `1 / iter_size`, the inverse of the accumulation. This fixes accumulation for AdaGrad and is more obvious than fooling with rates and decays as in 55585f5.
Commit 0e7a078
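A minimal sketch of what such a normalization step amounts to (illustrative only, not the actual `SGDSolver::Normalize()` implementation; `NormalizeDiff` is a hypothetical name):

```cpp
#include <vector>

// Scale an accumulated gradient buffer by 1 / iter_size so the update sees a
// per-minibatch average instead of a sum over iter_size computational batches.
// Doing this directly on the diffs is what keeps adaptive solvers such as
// AdaGrad correct, since their history is built from the normalized gradient.
void NormalizeDiff(std::vector<float>& diff, int iter_size) {
  if (iter_size <= 1) return;  // nothing was accumulated
  const float scale = 1.0f / iter_size;
  for (float& d : diff) {
    d *= scale;
  }
}
```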