Deprecates AdamW and adds --optim
#14744
Conversation
Thanks for implementing this. I left some nits and there is one line missing which shows we need a test of this new feature :-)
I'm also wondering if we shouldn't create an Enum for all optimizer values that are supported?
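For illustration, such an enum could look roughly like this; the member names and values below are a sketch, not the final API:

from enum import Enum


class OptimizerNames(Enum):
    # Illustrative identifiers a user could pass via --optim
    ADAMW_HF = "adamw_hf"
    ADAMW_TORCH = "adamw_torch"
    ADAMW_APEX_FUSED = "adamw_apex_fused"
    ADAFACTOR = "adafactor"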
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
@sgugger Thank you for the suggestions. I think an enum makes sense. I will add that and add the missing test.
Thank you for working on it, @manuelciosici
I made a few proposals about consistency.
Please let me know if it's not clear.
(I also edited/removed my earlier comments where I thought the actual Adafactor optimizer was being deprecated; that was me misreading the code.)
src/transformers/trainer.py
Outdated
@@ -818,17 +818,43 @@ def create_optimizer(self):
                 "weight_decay": 0.0,
             },
         ]
-        optimizer_cls = Adafactor if self.args.adafactor else AdamW
-        if self.args.adafactor:
+        if self.args.adafactor and self.args.optim not in {"adamw_hf", "adafactor"}:
This logic is unnecessarily complicated, IMHO:
- deprecate --adafactor in favor of --optim adafactor
- if --adafactor is passed, set --optim adafactor

Now you no longer need to ever consider self.args.adafactor other than to deprecate it (something like the sketch below).
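A rough sketch of that simplification; it assumes the deprecation happens in TrainingArguments.__post_init__, and the exact warning text is only a placeholder:

import warnings

# Inside TrainingArguments.__post_init__ (sketch, not the final implementation):
if self.adafactor:
    warnings.warn(
        "`--adafactor` is deprecated and will be removed in version 5 of Transformers."
        " Use `--optim adafactor` instead.",
        FutureWarning,
    )
    self.optim = "adafactor"

# After this point, create_optimizer() only needs to look at self.args.optim.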
@manuelciosici, have we lost you or are you just on an extended vacation? Let's finish this up, so we can test out various optimizers. I will sync the recent changes in doc formatting.
update: see updated benchmarks here:

I'm working on a neat HF Trainer benchmarking tool, so here it is applied to the changes introduced by this PR:

So torch's AdamW appears to be even slower than ours; clearly apex's AdamW is the way to go speed-wise. Note that the absolute and relative results will be different on a different GPU and with a different finetuning setup, but most likely the current fastest optimizer will remain the fastest, etc. Reproducibility and other info:
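As a rough illustration of the kind of comparison being made, here is a minimal standalone micro-benchmark sketch. It is not the HF Trainer benchmarking tool mentioned above; it only times forward/backward/step on a toy model, so absolute numbers will not match a real finetuning run.

import time

import torch
from torch import nn

from transformers.optimization import AdamW as AdamWHF  # HF's own AdamW implementation


def bench(optimizer_cls, steps=200):
    # Time `steps` full forward/backward/step iterations on a toy regression problem.
    torch.manual_seed(0)
    model = nn.Linear(1024, 1024)
    x, y = torch.randn(64, 1024), torch.randn(64, 1024)
    opt = optimizer_cls(model.parameters(), lr=1e-3)
    start = time.perf_counter()
    for _ in range(steps):
        opt.zero_grad()
        nn.functional.mse_loss(model(x), y).backward()
        opt.step()
    return time.perf_counter() - start


print(f"HF AdamW:    {bench(AdamWHF):.2f}s")
print(f"torch AdamW: {bench(torch.optim.AdamW):.2f}s")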
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
force-pushed from 4d59fd5 to d599a38
@stas00 You have not lost me. I've been on vacation and, before that, I accidentally messed up my GitHub notification settings, so I didn't know you had also reviewed the code. I applied all the suggestions from you. I also changed to an enum. Let me know if I should change anything else.
@stas00 Let me know if I should squash all the commits into a single commit so they don't pollute the history.
@stas00 Thank you for the code cleanup, figuring out the mocking issue, and for your patience. This PR has been an educational experience.
Since you're clearly interested in easier testing, you may find some useful tidbits in this extensive doc: https://huggingface.co/docs/transformers/testing, e.g. for different parametrization situations: https://huggingface.co/docs/transformers/testing#parametrization

Optimizer-wise, I think the next interesting but challenging thing to add is the BNB 8-bit optimizer.
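For example, here is a hedged sketch of how the different --optim values could be exercised with the parametrization helpers from that doc; the test class and assertions below are made up, not the tests added in this PR:

import unittest

from parameterized import parameterized
from transformers import TrainingArguments


class OptimArgTest(unittest.TestCase):
    @parameterized.expand([("adamw_hf",), ("adamw_torch",), ("adafactor",)])
    def test_optim_accepts_string(self, optim_name):
        # Constructing TrainingArguments with the string form should not raise;
        # the value is converted to the corresponding enum member internally.
        args = TrainingArguments(output_dir="test_output", optim=optim_name)
        self.assertIsNotNone(args.optim)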
Thanks for waiting for me! This looks great (nice new tests!) but I have a few last comments.
src/transformers/trainer.py
Outdated
if args.optim == OptimizerNames.ADAFACTOR.value:
    optimizer_cls = Adafactor
    optimizer_kwargs.update({"scale_parameter": False, "relative_step": False})
elif args.optim == OptimizerNames.ADAM_HF.value:
    from .optimization import AdamW

    optimizer_cls = AdamW
    optimizer_kwargs.update(adam_kwargs)
elif args.optim == OptimizerNames.ADAM_TORCH.value:
    from torch.optim import AdamW

    optimizer_cls = AdamW
    optimizer_kwargs.update(adam_kwargs)
elif args.optim == OptimizerNames.ADAM_APEX_FUSED.value:
This is a very odd use of an enum. The args.optim attribute should be converted to the enum type (you just need to set it with a type OptimizerNames instead of str, and it will accept both str and OptimizerNames values as well as doing the conversion in the post init), and those tests should have all the .value accesses removed.
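In other words, something along these lines (a self-contained sketch of the pattern, not the actual TrainingArguments code):

from dataclasses import dataclass
from enum import Enum


class OptimizerNames(str, Enum):
    # str mixin so a member still compares equal to its plain string value
    ADAMW_HF = "adamw_hf"
    ADAMW_TORCH = "adamw_torch"
    ADAFACTOR = "adafactor"


@dataclass
class Args:
    # Declared with the enum type; __post_init__ accepts either a plain string
    # or an OptimizerNames member and normalizes it to the enum.
    optim: OptimizerNames = OptimizerNames.ADAMW_HF

    def __post_init__(self):
        self.optim = OptimizerNames(self.optim)


args = Args(optim="adamw_torch")
assert args.optim == OptimizerNames.ADAMW_TORCH  # no .value comparisons needed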
Thank you. I've updated everything.
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
I think the only other remaining item that I didn't hear you weigh in on, @sgugger, is whether we should call these optimizers adam_* or adamw_*.
Oh sorry, I didn't catch that. It should be adamw everywhere IMO.
Thank you for validating that, @sgugger. @manuelciosici, one more tiny change please. Thank you!
I removed the adam_* names. Let me know if I should change anything else.
but the rest needs to be updated to match, e.g. currently many torch tests fail with:
@stas00 I just saw that. I'm trying to figure out what I misunderstood.
@stas00 I'm surprised that fixes it. On my end, I had just fixed it in a different way.
It's the same as:
I'm not sure if you can see the failing tests from CI, so I thought it'd be faster to push in the fix as I saw where it was failing. Are you still planning to push a change? You mentioned removing a unit test. Let us know when you're done.
@stas00 Commit e73249c makes the unit tests pass, but it doesn't work when calling a script on the command line, i.e. when the optimizer name is passed as a string. I added another test for specifying the optimizer name as a string. I also changed the default value from the enum member to the corresponding plain string: while both are equivalent, the first one leaks the internal object name, while the second indicates the string we want users to pass.
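A tiny sketch of the distinction being made about the default, outside of TrainingArguments; the enum here is illustrative:

from enum import Enum


class OptimizerNames(str, Enum):
    ADAMW_HF = "adamw_hf"


# With the enum member as the default, generated help/repr text surfaces the
# internal name rather than the value users are expected to type:
print(str(OptimizerNames.ADAMW_HF))    # OptimizerNames.ADAMW_HF
# With the plain string as the default, users see exactly what to pass to --optim:
print(OptimizerNames.ADAMW_HF.value)   # adamw_hf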
All looking great on my side! @stas00 I'll let you merge if you're happy as well.
Thanks again for all your work on this @manuelciosici !
I second that - thank you, @manuelciosici!
What does this PR do?
Fixes #14539
Who can review?
@stas00
Misc
@stas00
- I added FusedAdam based on your comment in Two bugs in AdamW #14539.
- Since --optim adafactor and --adafactor serve the same purpose, I marked --adafactor as deprecated. I copy-pasted a deprecation warning that mentions transformers version 5. Let me know if the deprecation warning should say something else.