Two bugs in AdamW #14539
Comments
Thank you for submitting this bug report and the investigation, @manuelciosici
You must have meant line 359 in the sentence above. Your investigation looks correct on both counts, @manuelciosici. I was able to follow your helpful notes. I suspect the denominator buglet was an optimization, since epsilon is tiny and it's there only to avoid a division by zero; the missing part of the denominator is the division by `math.sqrt(bias_correction2)` before `eps` is added. The decay being applied to `t` instead of `t-1` does appear to be significant.

Since I wasn't the one involved in writing this code (I only made a small adjustment), I will let @thomwolf and perhaps @LysandreJik and @sgugger confirm.

p.s. I did see references where the choice of epsilon was important.
I was not the one who made the adjustments, which may have been made on purpose for some reason. I don't think the current behavior should be changed (even if it differs from the original paper), as it might break all reported results in all our examples, and this implementation of AdamW has worked quite well on all our tasks. Furthermore, PyTorch now has an implementation of AdamW, so people should use that one for a "bug-free" version.
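For readers who want to make that switch, here is a minimal sketch of using PyTorch's built-in optimizer (the model and hyper-parameter values below are purely illustrative):

```python
import torch
from torch import nn
from torch.optim import AdamW  # PyTorch's implementation of decoupled weight decay

# Illustrative model and hyper-parameters; substitute your own.
model = nn.Linear(768, 2)
optimizer = AdamW(
    model.parameters(),
    lr=5e-5,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=0.01,  # decoupled weight decay, as in Loshchilov & Hutter
)

loss = model(torch.randn(4, 768)).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```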
@manuelciosici, if you could indulge my curiosity: what was the impetus for checking the AdamW implementation? I'm just trying to understand the actual impact of this different implementation on training stability, convergence, etc. Thank you.
@stas00 I was reading it as a reference implementation while trying to understand

One thing to note is that the magnitude of both bugs is a function of AdamW's hyper-parameters (i.e., it is influenced by the learning rate, epsilon, and weight decay). For example, for prompt tuning, where learning rates can be as high as

@sgugger I understand the concerns that fixing the optimizer will lead to misalignment with existing examples and documentation. However, ignoring the bugs is not good either. Since opening the issue, I found that I was not the first one to discover the weight decay issue. I expect that, if the code stays as is, the two bugs will be rediscovered periodically. An alternative to ignoring the bugs would be for the current `AdamW` to be deprecated (with a warning) and eventually removed.
Yes, I agree with your last suggestion @manuelciosici and I think this is the right way to go. Deprecation with removal in v5.0.0 sounds about right.

Are you interested in making a PR for this @manuelciosici?
In addition to @sgugger's notes: the updated AdamW API should include a new arg like

The question is whether we switch the HF Trainer to use torch's implementation by default or not. Also, if we are rewriting the optimizer API, perhaps we can add a generic

So we can have:
We don't need to worry about BNB here; I was just suggesting to add a generic

Adding BNB to transformers is a bit intricate since it calls for an embedding layernorm, which we currently don't have. I will open an issue where we can discuss the details. That additional layernorm proved to be essential for the stability of the gpt-104B training we are working on at BigScience.
The plan is not to add any new optimizer to the Transformers library. It is a library of models, not optimizers, and no one on the team has the bandwidth to support other optimizers. We are deprecating the few we have. Adding support for optimizers implemented in other libraries is completely fine, however. Adding an
Given that it is breaking, the Trainer should stay with the current optimizer for now, and we can either switch in v5
Apologies for not being clear. I was not proposing to add a new optimizer, but to add an integration for a new optimizer; i.e., we will not need to support it. It's just that it's not just a matter of importing it; it requires some tweaks on our side. I will make a separate issue about it.
OK, so the default remains the current version. Here is the updated spec then.

With the HF Trainer:

With the AdamW class itself:

Sylvain, please confirm that this is the correct spec before @manuelciosici starts on it. Thank you.
Thanks for the summary @stas00, this looks great to me!
@stas00 Thank you. I will work on this over the weekend.
The NVIDIA engineers have been profiling a few things, and torch's AdamW is faster than ours (apparently apex's is even faster), so I will add this to the performance docs once I'm able to benchmark it when your PR is ready, @manuelciosici
This implementation of AdamW, although slower, seems to give me better performance than the PyTorch one in terms of accuracy and F1. I'm not sure if I'm the only one with this result, but if this is the case for multiple people, deprecating it would be a shame.
The key thing to understand is that it's not implementing AdamW, but a slightly different algorithm. Users expect an exact implementation of the algorithm out of the box, and if it's not exact, it should be named differently. Perhaps
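To make "slightly different algorithm" concrete, here is a rough side-by-side of the update rules as described in this issue (notation follows Algorithm 2 of the AdamW paper; the exact form of the epsilon shift is an approximation derived from the report below, not something stated verbatim in it):

```latex
% AdamW, Algorithm 2 (lines 9-12) of Loshchilov & Hutter:
\theta_t = \theta_{t-1}
         - \eta\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
         - \eta\lambda\,\theta_{t-1}

% What the current implementation effectively computes (per the report below):
\theta'_t = \theta_{t-1}
          - \eta\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon/\sqrt{1-\beta_2^{\,t}}}
\qquad
\theta_t  = \theta'_t - \eta\lambda\,\theta'_t
```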
Environment info

- `transformers` version: 4.13.0.dev0

Who can help

@thomwolf and @stas00 should be able to help based on `git blame`.

Information
There are two bugs in the implementation of AdamW.
Here's the current code: https://github.com/manuelciosici/transformers/blob/04683c0659aacf31a1e1df8aa2e6cf7b447a6f12/src/transformers/optimization.py#L324-L371
Weight decay bug
Look at lines 369-370. The weight decay is multiplied with `p.data`, which no longer corresponds to `theta_{t-1}` since `p.data` was modified in line 369. Below is a picture of Algorithm 2 from the original AdamW paper, which shows on line 12 that the weight decay should be multiplied with the previous step's parameters (i.e., `theta_{t-1}`).

From what I can tell, this is a regression, since the original AdamW implementation in `transformers` applied weight decay properly. Here's the commit that introduces the bug: ec07cf5#diff-40c6163602943c11431f1ec360299a7646bb436c691a646b9f54b2284f556ce0

For confirmation that weight decay is currently buggy, see the original AdamW implementation, where, on line 74, the weight decay is multiplied with the old parameters, as opposed to the new parameters that are calculated on line 71.
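A minimal sketch of the ordering problem described above (illustrative helper functions, not the actual `optimization.py` code):

```python
import torch

def buggy_weight_decay_step(p, adam_update, lr, weight_decay):
    # The Adam update is applied first...
    p.data.add_(adam_update, alpha=-lr)
    # ...so this decay multiplies the already-updated parameters (theta_t)
    # instead of theta_{t-1}, unlike line 12 of Algorithm 2.
    p.data.add_(p.data, alpha=-lr * weight_decay)

def fixed_weight_decay_step(p, adam_update, lr, weight_decay):
    # Decoupled weight decay against theta_{t-1}, then the Adam update.
    p.data.mul_(1.0 - lr * weight_decay)
    p.data.add_(adam_update, alpha=-lr)
```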
Denominator computation bug
The second bug appears in the computation of the denominator corresponding to line 10 in Algorithm 2 above. In the current code (see the link in the Information section), on line 351, the denominator excludes the division by `math.sqrt(bias_correction2)`. On line 357, the division by `math.sqrt(bias_correction2)` appears, but, by this time, `eps` has already been added to `denom`, making the division not equivalent to line 10 in Algorithm 2.

From what I can tell, this bug was also introduced as part of commit ec07cf5#diff-40c6163602943c11431f1ec360299a7646bb436c691a646b9f54b2284f556ce0. The previous line, `update = next_m / (next_v.sqrt() + group['e'])`, was correct.

For confirmation that the denominator is not properly calculated, see the original AdamW implementation, where, on line 64, the denominator is computed.
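A sketch of the denominator difference in isolation (function names are illustrative; the current code folds `math.sqrt(bias_correction2)` into the step size, which is what makes the two forms below differ):

```python
import math
import torch

def effective_denominator_current(exp_avg_sq, eps, bias_correction2):
    # eps is added to the raw sqrt of the second moment, and the
    # bias correction is applied afterwards, so eps gets rescaled too.
    return exp_avg_sq.sqrt().add_(eps) / math.sqrt(bias_correction2)

def denominator_algorithm_2(exp_avg_sq, eps, bias_correction2):
    # Line 10 of Algorithm 2: bias-correct the second moment first, then add eps.
    return (exp_avg_sq.sqrt() / math.sqrt(bias_correction2)).add_(eps)
```

Since `bias_correction2` approaches 1 as training progresses, the two only differ by a small rescaling of `eps`, which is consistent with the "buglet" framing in the comments above.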
To reproduce

Steps to reproduce the behavior:

Run the unit tests in `tests/test_optimization.py` from the branch linked under "Proposed fix" below. The tests `test_compare_adamw_no_weight_decay` and `test_compare_adamw_with_weight_decay` should fail (see the attached failed_tests.txt).

Expected behavior
The two implementations of AdamW should match their parameter updates.
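For illustration, a comparison along these lines could look like the sketch below (the real tests are in the branch under "Proposed fix"; the names, tolerances, and hyper-parameters here are made up for the example):

```python
import torch
from torch.optim import AdamW as TorchAdamW
from transformers.optimization import AdamW as HFAdamW

def adamw_updates_match(weight_decay=0.1, steps=100, lr=1e-3, atol=1e-6):
    torch.manual_seed(0)
    w_ref = torch.nn.Parameter(torch.randn(10))
    w_hf = torch.nn.Parameter(w_ref.detach().clone())

    opt_ref = TorchAdamW([w_ref], lr=lr, eps=1e-8, weight_decay=weight_decay)
    opt_hf = HFAdamW([w_hf], lr=lr, eps=1e-8, weight_decay=weight_decay, correct_bias=True)

    for _ in range(steps):
        grad = torch.randn(10)
        w_ref.grad = grad.clone()
        w_hf.grad = grad.clone()
        opt_ref.step()
        opt_hf.step()

    # With the two bugs present, the trajectories drift apart over time.
    return torch.allclose(w_ref, w_hf, atol=atol)

print(adamw_updates_match())
```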
Proposed fix
Check out the branch at https://github.com/manuelciosici/transformers/tree/fix_adamw. It contains both the unit tests above and a fix for both bugs mentioned above.
I can make a PR once we agree on the two bugs and the fix.