Inconsistent weight decay logics in multiple optimizers #9881
Comments
Could you clarify which optimizers multiply wd with lr and which ones don't?
Just curious, is
@sxjscience supposedly it's used in all optimizers
Unless explicitly specified, in most optimizers as implemented in packages such as TF and Torch, wd is merged into the gradient before gradient clipping. When a proximal operator is used, the wd term is not merged.
@szhengac thanks for your inputs. My concern is that if we change the wd behavior now, the change is incompatible with previous versions of MXNet, and users will have to retune their hyperparameters. I don't think this provides a good user experience. @szha @piiswrong @szhengac What about explicitly documenting the update rules for the existing optimizers, as in https://mxnet.incubator.apache.org/versions/master/api/python/optimization/optimization.html#mxnet.optimizer.SGD? For new optimizers, committers could check during code review whether the implementation matches the PyTorch/TF ones.
Looks good. We can write math formulas instead.
@sxjscience I could do that, but math formulas are not very convenient when clipping is involved.
Thanks Haibin for raising these issues. Besides, weight decay should only be applied to weights (not biases) [1][2]. Thus, users usually do
by assuming that weight decay only applies to weights. However, our current implementation applies weight decay to all model parameters, including biases. References: [1] Franklin, J. (2005). The elements of statistical learning: data mining, inference and prediction. The Mathematical Intelligencer, 27(2), 83-85. [2] Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. Cambridge, MA: MIT Press.
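For illustration, a minimal Gluon sketch of the usual workaround: `wd_mult` is the standard MXNet per-parameter decay multiplier, while the network and the name filter here are just illustrative.

```python
from mxnet import gluon

# Hypothetical single-layer network, purely for illustration.
net = gluon.nn.Dense(10)
net.initialize()

# Zero the wd multiplier of every bias parameter so that the global
# `wd` passed to the Trainer only affects the weight matrices.
for name, param in net.collect_params().items():
    if name.endswith('bias'):
        param.wd_mult = 0.0

trainer = gluon.Trainer(net.collect_params(), 'sgd',
                        {'learning_rate': 0.1, 'wd': 1e-4})
```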
There is also a paper arguing that weight decay should be decoupled from the gradient-based optimizer update: https://arxiv.org/abs/1711.05101. This was recently implemented in TF: tensorflow/tensorflow#17438
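For context, a minimal sketch of that decoupling, shown with SGD-with-momentum for brevity (the paper's AdamW applies the same idea on top of Adam; all names here are illustrative). For plain SGD without momentum the two variants coincide; they differ once the optimizer keeps state.

```python
def step_coupled(w, v, grad, lr=0.1, wd=1e-4, mom=0.9):
    # Classic L2 regularization: wd is folded into the gradient,
    # so it also enters the momentum buffer and is rescaled by it.
    v = mom * v + (grad + wd * w)
    return w - lr * v, v

def step_decoupled(w, v, grad, lr=0.1, wd=1e-4, mom=0.9):
    # Decoupled weight decay (Loshchilov & Hutter, 2017): the momentum
    # buffer sees only the loss gradient; the decay term shrinks the
    # weight directly, bypassing the gradient path entirely.
    v = mom * v + grad
    return w - lr * v - lr * wd * w, v
```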
Issue
The default weight decay behavior of many optimizers is inconsistent and suboptimal for optimization. The implementation proposed below should help convergence.
Gradient Clipping
In TensorFlow/PyTorch, weight decay is usually applied before gradient clipping. Not clipping the weight decay term can lead to a much larger update from the wd regularization than from the gradient of the loss (see the sketch below). For the following optimizers, the weight decay term is applied after gradient clipping, which should be corrected:
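A minimal sketch of the two orderings, assuming global-norm clipping and a plain SGD step (all names are illustrative):

```python
import numpy as np

def clip_by_norm(g, max_norm):
    norm = np.linalg.norm(g)
    return g * (max_norm / norm) if norm > max_norm else g

def update_wd_before_clip(w, grad, lr=0.1, wd=1e-4, max_norm=1.0):
    # TF/PyTorch-style ordering: the wd term is part of the gradient
    # that gets clipped, so the total step stays bounded.
    g = clip_by_norm(grad + wd * w, max_norm)
    return w - lr * g

def update_wd_after_clip(w, grad, lr=0.1, wd=1e-4, max_norm=1.0):
    # Problematic ordering: the wd term is added after clipping and
    # can dominate the clipped loss gradient when w is large.
    g = clip_by_norm(grad, max_norm) + wd * w
    return w - lr * g
```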
Weight Decay Not Used to Update Optimizer State
The following optimizers apply wd to the weight directly, without folding it into the gradient, so the optimizer state is never updated with the wd term. This can make training slow when a small learning rate is used, and cause divergence when a large one is used. A sketch of the difference follows the list:
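A minimal Adam sketch of the difference (all names are illustrative; `t` is the 1-based step count used for bias correction):

```python
import numpy as np

def adam_wd_in_grad(w, m, v, grad, t, lr=1e-3, wd=1e-4,
                    b1=0.9, b2=0.999, eps=1e-8):
    # wd enters the gradient, so both Adam moments (the optimizer
    # state) see it and the decay is adaptively rescaled.
    g = grad + wd * w
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def adam_wd_on_weight(w, m, v, grad, t, lr=1e-3, wd=1e-4,
                      b1=0.9, b2=0.999, eps=1e-8):
    # wd bypasses the state: the moments see only the loss gradient,
    # and the decay term hits the weight without any adaptation.
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad * grad
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * wd * w, m, v
```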
Other Optimizers
FTRL is a proximal optimizer: it handles regularization through its proximal operator, so weight decay is used neither in gradient clipping nor in updating the optimizer state. This is fine.
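For reference, a minimal sketch of a proximal gradient step for the L2 term only (the closed-form shrinkage; FTRL's actual proximal update is more involved and also handles L1):

```python
def prox_l2_step(w, grad, lr=0.1, wd=1e-4):
    # Proximal gradient method: take a plain gradient step on the loss,
    # then apply the exact proximal operator of (wd/2) * ||w||^2, which
    # is a closed-form shrinkage. No wd term appears in the gradient,
    # so there is nothing wd-related to clip or to put in the state.
    w_half = w - lr * grad
    return w_half / (1.0 + lr * wd)
```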
The following optimizers apply wd before clipping the gradient, which is also fine: