schedulefree optimizers #30079

winglian · 2024-04-06T03:39:21Z

What does this PR do?

integrates meta's https://github.com/facebookresearch/schedule_free for adamw & sgd

https://twitter.com/aaron_defazio/status/1776320004465582331

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline,
Pull Request section?
Was this discussed/approved via a Github issue or the forum? Please add a link
to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the
documentation guidelines, and
here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

@muellerzr @younesbelkada @pacman100

muellerzr · 2024-04-06T11:36:20Z

FYI this will need huggingface/accelerate#2631 as we need to upstream accelerate's ability to call train/eval on a wrapped optimizer

danielhanchen · 2024-04-07T02:09:23Z

Some thoughts:

I was trying to ask Aaron et al on Twitter if they did any transformer experiments, but to no avail. They said a paper will come in 1 or 2 months.
Aaron et al's past work on D-Adaptation won a best ICML paper, with their follow up work being Prodigy - but both on transformers did similar or worse than AdamW. https://twitter.com/danielhanchen/status/1775547139248341125
Superconvergence + LR range finder + Fast AI's Ranger21 optimizer was the goto optimizer for CNNs, and worked fabulously well, but on transformers, the learning rate range finder sadi 1e-3 was the best, whilst 1e-5 was better. However, the 1 cycle learning rate stuck. Learning rate finder for the trainer #16013
A huge issue is this needs tuning??! But how about a well tuned AdamW? Eg see https://twitter.com/kellerjordan0/status/1776716388037529843 which outperformed it using a tuned SGD.

I'm just a little bit reserved for now since the author themselves aren't providing any transformer benchmarks, nor have they compared their CNN baselines to superconvergence, which is the goto standard for fast training for CNNs. Likewise https://parameterfree.com/2023/08/30/yet-another-icml-award-fiasco/ wasn't pleasant.

PhilipMay · 2024-04-07T06:35:06Z

Should be very easy to test this on Phi-2 or TinyLlama when the implementation works?

younesbelkada

Great work @winglian ! 🤩 I left one minor comment, wdyt?

younesbelkada · 2024-04-08T09:31:50Z

src/transformers/trainer.py

@@ -3117,6 +3145,9 @@ def training_step(self, model: nn.Module, inputs: Dict[str, Union[torch.Tensor,
            `torch.Tensor`: The tensor with training loss on this batch.
        """
        model.train()
+        if "ScheduleFree" in self.optimizer.__class__.__name__:


maybe instead of checking the class name here we could inject an attribute _hf_schedule_free_optim to make sure we can support that in the future for other shcedule free optimizers, what do you think?

that would be on the Trainer class, right?

so the place that makes the most sense to set that would be in get_optimizer_cls_and_kwargs but that is a @staticmethod so has no access to the trainer object. We could do something along the lines of

setattr(self.optimizer, "_hf_schedule_free_optim", True)

after we instantiate the optimizer_cls but we would still have to do some sort of class name detection.

Alternatively we could pass another value in the return tuple specific to schedule_free optimizers (but that feels worse)

ahh good point yeah, in that case this is probably already fine I would say, thanks for investigating @winglian !

Rather than have it as a stateful attribute, could we instead move this logic out to a module-level function e.g.:

def _is_schedule_free_optimizer(optimizer): return "ScheduleFree" in optimizer__class__.__name__

?

This way:

The check is a bit more explicit within the code logic

we can easily adapt the checking in one place, rather than throughout the code, if we end up introducing e.g. a _is_schedule_free attribute or there's schedule free optimizers with slightly different names

PhilipMay · 2024-04-08T10:15:15Z

This PR should maybe also add a few lines to the README about "how to use this".

muellerzr · 2024-04-08T15:28:54Z

We've merged the accelerate portion in, so if anyone is trying this out in distributed fashions, you can do pip install git+https://github.com/huggingface/accelerate :)

src/transformers/trainer.py

bratao · 2024-04-14T16:54:46Z

There is any chance of this making into the main branch? I and other confirmed that the results are real. Thank you @winglian

pacman100

Super useful addition of scheduler free optimizers @winglian! It would be great to document the usage along with a minimal example.

HuggingFaceDocBuilderDev · 2024-04-29T12:44:25Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

CoffeeVampir3 · 2024-05-09T05:15:54Z

Is their any remaining work I could contribute towards getting this PR merged?

Cheers

winglian · 2024-05-31T18:55:23Z

@pacman100 @muellerzr @younesbelkada Can we get a new review to get this merged? Since the last check, I rebased, added some fixes and docs.

muellerzr

Thanks! Overall LG2M, let's pin schedulefree as a >= however.

Can you also run the quality checks? Afterwords at least from my end looks good to merge.

setup.py

winglian · 2024-06-01T01:11:06Z

@muellerzr ran the make quality/lint and also added a smoke test to the test suite for schedule free adam

muellerzr

Thanks a bunch! cc @LysandreJik for final review

younesbelkada

Thanks a lot !

amyeroberts

Thanks for adding!

Main comment is about the getattr logic in get_optimizer_cls_and_kwargs

amyeroberts · 2024-06-03T14:28:42Z

src/transformers/trainer.py

@@ -3117,6 +3145,9 @@ def training_step(self, model: nn.Module, inputs: Dict[str, Union[torch.Tensor,
            `torch.Tensor`: The tensor with training loss on this batch.
        """
        model.train()
+        if "ScheduleFree" in self.optimizer.__class__.__name__:


Rather than have it as a stateful attribute, could we instead move this logic out to a module-level function e.g.:

def _is_schedule_free_optimizer(optimizer): return "ScheduleFree" in optimizer__class__.__name__

?

This way:

The check is a bit more explicit within the code logic

we can easily adapt the checking in one place, rather than throughout the code, if we end up introducing e.g. a _is_schedule_free attribute or there's schedule free optimizers with slightly different names

amyeroberts · 2024-06-03T16:03:12Z

src/transformers/trainer.py

+            additional_optim_kwargs["warmup_steps"] = args.warmup_steps
+            additional_optim_kwargs.update(
+                {
+                    "weight_lr_power": float(getattr(torch, optim_args.get("weight_lr_power", 2.0))),


This doesn't seem right:

If we get "weight_lr_power" from optim_args I'm presuming it's a float as string e.g. "2.0"? I don't think torch.2.0 exists?

If optim_args doesn't have "weight_lr_power", then the second argument to getattr is a float, which isn't compatible

Suggested change

"weight_lr_power": float(getattr(torch, optim_args.get("weight_lr_power", 2.0))),

"weight_lr_power": float(optim_args.get("weight_lr_power", 2.0)),

amyeroberts · 2024-06-03T16:03:18Z

src/transformers/trainer.py

+            additional_optim_kwargs.update(
+                {
+                    "weight_lr_power": float(getattr(torch, optim_args.get("weight_lr_power", 2.0))),
+                    "r": float(getattr(torch, optim_args.get("r", 0.0))),


Suggested change

"r": float(getattr(torch, optim_args.get("r", 0.0))),

"r": float(optim_args.get("r", 0.0)),

github-actions · 2024-06-28T08:04:57Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

winglian · 2024-06-28T11:21:09Z

Will get back to this soon. Not stale 😅

github-actions · 2024-07-23T08:06:15Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

bratao · 2024-07-31T11:52:08Z

@winglian please don´t let it die

tmm1 · 2024-08-20T23:16:18Z

@amyeroberts I addressed your comments. LMK what else is required to push this through!

amyeroberts · 2024-08-22T14:25:35Z

src/transformers/trainer.py

+        if _is_schedule_free_optimizer(self.optimizer):
+            self.optimizer.train()


We shouldn't need to have optimizer specific logic in the main logic loops: this makes our training logic hard to handle and will become quickly too cluttered if many optimizers have to have specific logic.

cc @muellerzr who's been working on refactoring a lot of similar logic to this. Ideally all optimizers would use the same API and would support calling .train or .eval on the class, if this is required, but this is a large piece of work. Is this customization as-is acceptable?

Would it be preferable to check if the optimizer has the eval/train methods and call them, instead of checking the class name?

cc @adefazio

Yes, the way I do it in my code is by checking for the eval/train methods, i.e. duck typing.

is this approach acceptable?

Suggested change

if _is_schedule_free_optimizer(self.optimizer):

self.optimizer.train()

if hasattr(self.optimizer, 'train') and callable(getattr(self.optimizer, 'train')):

self.optimizer.train()

Yes, I think this would be preferable as it's more extensible to other optimizers, and then puts the responsibility for management on implementation / checking on the optimizer implementation, rather than us within trainer

amyeroberts

Thanks for iterating - looks good to me!

cc @muellerzr to confirm the conditional check on the eval and train methods are OK

amyeroberts · 2024-08-27T16:14:59Z

@winglian It looks like we'll need a bit more iteration on the eval and train method checks as it's causing failing tests atm. For the code quality tests, could you run make fixup and push the changes?

tmm1 · 2024-08-27T16:55:11Z

we'll need a bit more iteration on the eval and train method checks as it's causing failing tests atm

looks like we need to revisit huggingface/accelerate#2631 which i've done in huggingface/accelerate#3055

could you run make fixup and push the changes

👍

fizzAI · 2024-08-30T12:51:03Z

lgtm!
for what it's worth, there was some asking when this PR was originally opened about whether schedulefree's optims were effective on transformers; i did some testing a while ago, and SF adamw had (albiet marginally) a better loss/acc landscape than normal adamw on at least sequence classification finetuning, even after tuning hyperparams.

(purple line is ScheduleFreeAdamW, red line is just AdamW)

tmm1 · 2024-09-06T19:48:59Z

@amyeroberts can we kick off the tests again? does the build pick up the latest accelerate release automatically?

muellerzr

Thanks! Just a request for one more additional check for users who don't have the right minimum accelerate version. cc @LysandreJik for final 🤗

src/transformers/trainer.py

Co-authored-by: Aman Gupta Karmani <aman@tmm1.net>

LysandreJik

Thanks for the PR @winglian!

LGTM

winglian · 2024-09-11T14:43:45Z

Thanks for the PR @winglian!

all the credit goes to @tmm1 for getting this over the line!

llllvvuu · 2024-09-17T13:40:44Z

If I use this, can I ignore the lr that gets printed out by trainer? (looks like the default linear decay) Or do I need to change the trainer's lr scheduler to constant?

* schedulefree optimizers * fix train instead of eval for optimizer * fixes and update docs * chore: lint * add tests and drop overly-verbose _32bit suffix * chore: lint * fix for docs * fix code review issues * use duck-typing to avoid per-optimizer patches * fixup style * fixup style * warn if incorrect accelerate version with schedule free Co-authored-by: Aman Gupta Karmani <aman@tmm1.net> --------- Co-authored-by: Aman Karmani <aman@tmm1.net>

winglian mentioned this pull request Apr 6, 2024

add support for adamw schedulefree axolotl-ai-cloud/axolotl#1486

Closed

vasqu mentioned this pull request Apr 7, 2024

Schedule Free Optimizer (SGD, AdamW) #30087

Closed

younesbelkada reviewed Apr 8, 2024

View reviewed changes

hiyouga reviewed Apr 9, 2024

View reviewed changes

src/transformers/trainer.py Outdated Show resolved Hide resolved

vasqu mentioned this pull request Apr 20, 2024

Schedule Free Optimizers (pytorch) and Sophia optimizer #30359

Open

pacman100 reviewed Apr 29, 2024

View reviewed changes

winglian force-pushed the schedule-free-optimizer branch from 87b9651 to 9e654bb Compare May 31, 2024 18:44

muellerzr approved these changes May 31, 2024

View reviewed changes

setup.py Outdated Show resolved Hide resolved

muellerzr approved these changes Jun 1, 2024

View reviewed changes

muellerzr requested a review from LysandreJik June 1, 2024 02:39

younesbelkada approved these changes Jun 3, 2024

View reviewed changes

younesbelkada requested a review from amyeroberts June 3, 2024 08:04

amyeroberts reviewed Jun 3, 2024

View reviewed changes

github-actions bot closed this Jul 31, 2024

amyeroberts reopened this Jul 31, 2024

winglian force-pushed the schedule-free-optimizer branch from 7f5516e to 2361182 Compare August 20, 2024 23:14

amyeroberts reviewed Aug 22, 2024

View reviewed changes

winglian force-pushed the schedule-free-optimizer branch from 2361182 to 5964f5d Compare August 27, 2024 15:37

amyeroberts approved these changes Aug 27, 2024

View reviewed changes

tmm1 mentioned this pull request Aug 27, 2024

use duck-typing to ensure underlying optimizer supports schedulefree hooks huggingface/accelerate#3055

Merged

winglian and others added 11 commits September 6, 2024 15:37

schedulefree optimizers

f515776

fix train instead of eval for optimizer

1b95027

fixes and update docs

40afb0d

chore: lint

2245f57

add tests and drop overly-verbose _32bit suffix

75c68e6

chore: lint

83ca40b

fix for docs

e5f5bce

fix code review issues

1145ede

use duck-typing to avoid per-optimizer patches

367ac00

fixup style

2e857a8

fixup style

179232c

winglian force-pushed the schedule-free-optimizer branch from b2b7725 to 179232c Compare September 6, 2024 19:38

muellerzr approved these changes Sep 6, 2024

View reviewed changes

src/transformers/trainer.py Show resolved Hide resolved

warn if incorrect accelerate version with schedule free

975f045

Co-authored-by: Aman Gupta Karmani <aman@tmm1.net>

LysandreJik approved these changes Sep 9, 2024

View reviewed changes

LysandreJik merged commit 62aecd8 into huggingface:main Sep 9, 2024
24 checks passed

winglian deleted the schedule-free-optimizer branch September 11, 2024 14:43

SangbumChoi mentioned this pull request Oct 9, 2024

AttributeError: 'SGD' object has no attribute 'train' #34034

Closed

4 tasks

	"weight_lr_power": float(getattr(torch, optim_args.get("weight_lr_power", 2.0))),
	"weight_lr_power": float(optim_args.get("weight_lr_power", 2.0)),

	"r": float(getattr(torch, optim_args.get("r", 0.0))),
	"r": float(optim_args.get("r", 0.0)),

		if _is_schedule_free_optimizer(self.optimizer):
		self.optimizer.train()

schedulefree optimizers #30079

schedulefree optimizers #30079

Conversation

winglian commented Apr 6, 2024

What does this PR do?

Before submitting

Who can review?

muellerzr commented Apr 6, 2024 • edited Loading

danielhanchen commented Apr 7, 2024

PhilipMay commented Apr 7, 2024

younesbelkada left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

PhilipMay commented Apr 8, 2024

muellerzr commented Apr 8, 2024

bratao commented Apr 14, 2024

pacman100 left a comment • edited Loading

Choose a reason for hiding this comment

HuggingFaceDocBuilderDev commented Apr 29, 2024

CoffeeVampir3 commented May 9, 2024

winglian commented May 31, 2024

muellerzr left a comment

Choose a reason for hiding this comment

winglian commented Jun 1, 2024

muellerzr left a comment

Choose a reason for hiding this comment

younesbelkada left a comment

Choose a reason for hiding this comment

amyeroberts left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

github-actions bot commented Jun 28, 2024

winglian commented Jun 28, 2024

github-actions bot commented Jul 23, 2024

bratao commented Jul 31, 2024

tmm1 commented Aug 20, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

amyeroberts left a comment

Choose a reason for hiding this comment

amyeroberts commented Aug 27, 2024

tmm1 commented Aug 27, 2024

fizzAI commented Aug 30, 2024 • edited Loading

tmm1 commented Sep 6, 2024

muellerzr left a comment

Choose a reason for hiding this comment

LysandreJik left a comment

Choose a reason for hiding this comment

winglian commented Sep 11, 2024

llllvvuu commented Sep 17, 2024

muellerzr commented Apr 6, 2024 •

edited

Loading

pacman100 left a comment •

edited

Loading

fizzAI commented Aug 30, 2024 •

edited

Loading