
Decouple the computational batch size and minibatch size by accumulating gradients #1977

Merged · 8 commits · May 30, 2015

Commits on May 27, 2015

  1. zero-init param diffs and accumulate gradients (41cf06c)

    With layers whose backward pass accumulates gradients, this effectively
    decouples the computational batch from the SGD minibatch. Each iteration
    accumulates gradients over iter_size batches, then the parameters are
    updated. (See the accumulation sketch after this commit list.)

    longjon authored and shelhamer committed May 27, 2015
  2. 539f879
  3. 3262e46
  4. accumulate gradients in (de)conv layers (8cc9af0)

    longjon authored and shelhamer committed May 27, 2015
  5. 67b1ff3
  6. adjust local learning rate and decay according to gradient accumulation (55585f5)

    Divide the local rate by `iter_size` to normalize the gradient according
    to the full minibatch size rather than only the computational batch size.

    Multiply the local decay by `iter_size` to counter the division of the
    local learning rate, since the decay is multiplied by the rate in the
    update equation. (See the rate/decay sketch after this commit list.)

    shelhamer committed May 27, 2015
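
The accumulation loop introduced in 41cf06c can be pictured roughly as below. This is a simplified sketch and not Caffe's actual solver code; `Param`, `ForwardBackward`, and `StepOnce` are hypothetical stand-ins for the real solver and net machinery.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Param {
  std::vector<float> data;  // weights
  std::vector<float> diff;  // accumulated gradient
};

// Assumed helper: runs one computational batch and *adds* its gradient into
// each param's diff (i.e. layers whose backward accumulates gradients).
void ForwardBackward(std::vector<Param>& params);

void StepOnce(std::vector<Param>& params, int iter_size, float rate) {
  // Zero-init the parameter diffs at the start of each SGD iteration.
  for (auto& p : params)
    std::fill(p.diff.begin(), p.diff.end(), 0.0f);

  // Accumulate gradients over iter_size computational batches ...
  for (int i = 0; i < iter_size; ++i)
    ForwardBackward(params);

  // ... then apply a single parameter update for the full minibatch.
  for (auto& p : params)
    for (std::size_t j = 0; j < p.data.size(); ++j)
      p.data[j] -= rate * p.diff[j];
}
```

With a data-layer batch size of 32 and `iter_size: 2` in the solver, the effective SGD minibatch is 64 while only 32 examples need to fit in memory at a time.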
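
The rate and decay adjustment in 55585f5 is just bookkeeping around that loop. A hypothetical illustration, not the solver source: since the accumulated diff is a sum over `iter_size` batches, the local rate is scaled down to normalize it, and the local decay is scaled up so that the weight-decay term of the update, which multiplies decay by rate, is left unchanged.

```cpp
// Scale the rate by 1 / iter_size to normalize the summed gradient to the
// full minibatch rather than the computational batch.
float EffectiveRate(float local_rate, int iter_size) {
  return local_rate / iter_size;
}

// Scale the decay by iter_size so that
//   EffectiveRate(rate, n) * EffectiveDecay(decay, n) == rate * decay,
// keeping the rate * decay * weight term of the update equation unchanged.
float EffectiveDecay(float local_decay, int iter_size) {
  return local_decay * iter_size;
}
```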

Commits on May 28, 2015

  1. test equivalence of solving with accumulating gradients (92ab737)

    Compare the parameters after solving with a given batch size against
    those from the equivalent run with half the batch size and two
    iterations of gradient accumulation.
    (See the equivalence sketch after this commit list.)

    Note: the test net's dummy data layer now produces constant data and
    random Gaussian targets. This ensures the standard and gradient
    accumulation cases see the same data; otherwise the difference in
    batch sizes causes different orders of random number draws.

    shelhamer committed May 28, 2015
  2. directly normalize accumulated gradients (0e7a078)

    `SGDSolver::Normalize()` normalizes accumulated gradients by scaling
    them by `1 / iter_size`, the inverse of the accumulation count.

    This fixes accumulation for AdaGrad and is more straightforward than
    adjusting the rates and decays as in 55585f5.
    (See the normalization sketch after this commit list.)

    shelhamer committed May 28, 2015
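
The equivalence checked in 92ab737 can be outlined as below. This is not the actual test code; `Solve` is a hypothetical helper that trains for a fixed number of updates with the given data-layer batch size and `iter_size`, then returns the flattened parameters. The real test additionally fixes the data (constant inputs, Gaussian targets) so that both runs see identical batches.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Assumed helper: train with the given batch size and iter_size, then
// return the learned parameters flattened into one vector.
std::vector<float> Solve(int batch_size, int iter_size);

void CheckAccumulationEquivalence(int batch_size) {
  // One update over the full batch ...
  const std::vector<float> full = Solve(batch_size, /*iter_size=*/1);
  // ... should match one update accumulated over two half-size batches.
  const std::vector<float> accum = Solve(batch_size / 2, /*iter_size=*/2);

  assert(full.size() == accum.size());
  for (std::size_t i = 0; i < full.size(); ++i)
    assert(std::fabs(full[i] - accum[i]) < 1e-5f);
}
```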
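
The normalization added in 0e7a078 is conceptually a single scaling of the accumulated diffs before the solver-specific update. A minimal sketch, not the actual `SGDSolver::Normalize()` implementation:

```cpp
#include <vector>

// Scale an accumulated gradient by 1 / iter_size so the update sees the
// average over the full minibatch; with iter_size == 1 this is a no-op,
// leaving the non-accumulating path untouched.
void Normalize(std::vector<float>& diff, int iter_size) {
  if (iter_size == 1) return;
  const float scale = 1.0f / static_cast<float>(iter_size);
  for (float& d : diff) d *= scale;
}
```

Scaling the gradients themselves, rather than the learning rate, is what makes accumulation behave correctly for solvers such as AdaGrad that use the gradient nonlinearly.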