Inconsistent weight decay logics in multiple optimizers #9881
Comments
Could you clarify which optimizers multiply wd with lr and which ones don't?
Just curious, is
@sxjscience supposedly it's used in all optimizers
Unless explicitly specified, in most optimizers as implemented in packages such as TF and Torch, wd is merged into the gradient before gradient clipping. When a proximal operator is used, the wd term is not merged.
@szhengac thanks for your inputs. My concern is that if we change the wd behavior now, the change is incompatible with previous versions of MXNet, and users will have to retune their hyperparameters. I don't think this provides a good user experience. @szha @piiswrong @szhengac What about explicitly documenting the update rules for the existing optimizers, as in https://mxnet.incubator.apache.org/versions/master/api/python/optimization/optimization.html#mxnet.optimizer.SGD? For new optimizers, committers could check during code review whether the implementation matches the PyTorch/TF ones.
Looks good. We can write math formulas instead.
@sxjscience I could do that, but math formulas are not very convenient when clipping is involved.
Thanks Haibin for raising these issues. Besides, weight decay should only be applied to weights (not biases) [1][2]. Thus, users usually do
by assuming that weight decay only applies to weights. However, our current implementation applies weight decay to all model parameters, including biases. References: [1] Franklin, J. (2005). The elements of statistical learning: data mining, inference and prediction. The Mathematical Intelligencer, 27(2), 83-85. [2] Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. Cambridge, MA: MIT Press.
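For illustration, a minimal Gluon sketch of the usual workaround: `wd_mult` is the standard MXNet per-parameter decay multiplier, while the network and the name filter here are just illustrative.

```python
from mxnet import gluon

# Hypothetical single-layer network, purely for illustration.
net = gluon.nn.Dense(10)
net.initialize()

# Zero the wd multiplier of every bias parameter so that the global
# `wd` passed to the Trainer only affects the weight matrices.
for name, param in net.collect_params().items():
    if name.endswith('bias'):
        param.wd_mult = 0.0

trainer = gluon.Trainer(net.collect_params(), 'sgd',
                        {'learning_rate': 0.1, 'wd': 1e-4})
```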
There is also a paper arguing that weight decay should be decoupled from the gradient-based optimizer update: https://arxiv.org/abs/1711.05101. This was recently implemented in TF: tensorflow/tensorflow#17438
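For context, a minimal sketch of that decoupling, shown with SGD-with-momentum for brevity (the paper's AdamW applies the same idea on top of Adam; all names here are illustrative). For plain SGD without momentum the two variants coincide; they differ once the optimizer keeps state.

```python
def step_coupled(w, v, grad, lr=0.1, wd=1e-4, mom=0.9):
    # Classic L2 regularization: wd is folded into the gradient,
    # so it also enters the momentum buffer and is rescaled by it.
    v = mom * v + (grad + wd * w)
    return w - lr * v, v

def step_decoupled(w, v, grad, lr=0.1, wd=1e-4, mom=0.9):
    # Decoupled weight decay (Loshchilov & Hutter, 2017): the momentum
    # buffer sees only the loss gradient; the decay term shrinks the
    # weight directly, bypassing the gradient path entirely.
    v = mom * v + grad
    return w - lr * v - lr * wd * w, v
```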
Issue
The default weight decay behavior of many optimizers is inconsistent and suboptimal for optimization. The implementation proposed below should help convergence.
Gradient Clipping
In TensorFlow/PyTorch, weight decay is usually applied before gradient clipping. Not clipping the weight decay term can lead to a much larger update from the wd regularization than from the gradient of the loss (see the sketch below). For the following optimizers, the weight decay term is applied after gradient clipping, which should be corrected:
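A minimal sketch of the two orderings, assuming global-norm clipping and a plain SGD step (all names are illustrative):

```python
import numpy as np

def clip_by_norm(g, max_norm):
    norm = np.linalg.norm(g)
    return g * (max_norm / norm) if norm > max_norm else g

def update_wd_before_clip(w, grad, lr=0.1, wd=1e-4, max_norm=1.0):
    # TF/PyTorch-style ordering: the wd term is part of the gradient
    # that gets clipped, so the total step stays bounded.
    g = clip_by_norm(grad + wd * w, max_norm)
    return w - lr * g

def update_wd_after_clip(w, grad, lr=0.1, wd=1e-4, max_norm=1.0):
    # Problematic ordering: the wd term is added after clipping and
    # can dominate the clipped loss gradient when w is large.
    g = clip_by_norm(grad, max_norm) + wd * w
    return w - lr * g
```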
Weight Decay Not Used to Update Optimizer State
The following optimizers apply wd to the weight directly, without folding it into the gradient, so the optimizer state is never updated with the wd term. This can make training slow when a small learning rate is used, and cause divergence when a large one is used. A sketch of the difference follows the list:
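A minimal Adam sketch of the difference (all names are illustrative; `t` is the 1-based step count used for bias correction):

```python
import numpy as np

def adam_wd_in_grad(w, m, v, grad, t, lr=1e-3, wd=1e-4,
                    b1=0.9, b2=0.999, eps=1e-8):
    # wd enters the gradient, so both Adam moments (the optimizer
    # state) see it and the decay is adaptively rescaled.
    g = grad + wd * w
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def adam_wd_on_weight(w, m, v, grad, t, lr=1e-3, wd=1e-4,
                      b1=0.9, b2=0.999, eps=1e-8):
    # wd bypasses the state: the moments see only the loss gradient,
    # and the decay term hits the weight without any adaptation.
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad * grad
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * wd * w, m, v
```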
Other Optimizers
FTRL is a proximal optimizer: it handles regularization through its proximal operator, so weight decay is used neither in gradient clipping nor in updating the optimizer state. This is fine.
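For reference, a minimal sketch of a proximal gradient step for the L2 term only (the closed-form shrinkage; FTRL's actual proximal update is more involved and also handles L1):

```python
def prox_l2_step(w, grad, lr=0.1, wd=1e-4):
    # Proximal gradient method: take a plain gradient step on the loss,
    # then apply the exact proximal operator of (wd/2) * ||w||^2, which
    # is a closed-form shrinkage. No wd term appears in the gradient,
    # so there is nothing wd-related to clip or to put in the state.
    w_half = w - lr * grad
    return w_half / (1.0 + lr * wd)
```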
The following optimizers apply wd before clipping the gradient, which is also fine: