
After resuming training, scheduler.step() will not update the optimizer's learning rate #12812

Closed
lanslotttTT opened this issue Apr 20, 2022 · 20 comments

@lanslotttTT

lanslotttTT commented Apr 20, 2022

I found a bug: when I resume training from a checkpoint, the learning rate always equals the init_lr I set. After debugging, I found that scheduler.step() does not change the optimizer's learning rate. So I set it manually to work around the bug.

    def on_epoch_start(self) -> None:
        # Manually copy the scheduler's current learning rate into the
        # optimizer's param_groups at the start of every epoch.
        self.optimizers().param_groups[0]['lr'] = self.lr_schedulers().get_lr()[0]

cc @Borda

@lanslotttTT added the "needs triage" (Waiting to be triaged by maintainers) label Apr 20, 2022
@rohitgr7
Contributor

did you check the actual learning rate here?

self.optimizers().param_groups[0]['lr']

since when resuming, the optimizer's state is also restored, which includes the learning rate.
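
For reference, a quick way to compare the two from inside the LightningModule is a sketch like the one below (assuming a single optimizer and scheduler; get_last_lr() is the standard PyTorch scheduler API):

    def on_train_start(self) -> None:
        # Compare the LR seen through the LightningOptimizer wrapper, the LR of
        # the underlying torch optimizer, and the LR of the resumed scheduler.
        wrapped_lr = self.optimizers().param_groups[0]['lr']
        raw_lr = self.optimizers(use_pl_optimizer=False).param_groups[0]['lr']
        sched_lr = self.lr_schedulers().get_last_lr()[0]
        print(f"wrapped={wrapped_lr} raw={raw_lr} scheduler={sched_lr}")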

@rohitgr7 added the "question" (Further information is requested) label and removed the "needs triage" (Waiting to be triaged by maintainers) label Apr 20, 2022
@lanslotttTT
Author

lanslotttTT commented Apr 20, 2022 via email

@rohitgr7
Contributor

@stale

stale bot commented Jun 6, 2022

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

@stale bot added the "won't fix" (This will not be worked on) label Jun 6, 2022
@NiccoloCavagnero

Very same issue here.

@stale bot removed the "won't fix" (This will not be worked on) label Jun 14, 2022
@stale

stale bot commented Jul 22, 2022

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

@stale bot added the "won't fix" (This will not be worked on) label Jul 22, 2022
@lamnguyenvux

It took me 2 days to train my model, and then I realized that the learning rate wasn't updated after resuming from the latest checkpoint...

@stale bot removed the "won't fix" (This will not be worked on) label Jul 28, 2022
@ga1i13o

ga1i13o commented Oct 12, 2022

Same issue for me too. The problem seems to be that, while the scheduler is correctly resumed, the optimizer object it is linked to is different from the optimizer that is actually being used.

@FrankZijlstra

FrankZijlstra commented Oct 17, 2022

I had the same issue and looked into it a little bit. It turns out that by default self.optimizers() returns the optimizer from trainer.strategy._lightning_optimizers, and LightningOptimizer maintains a copy of the param_groups field. The parameters are all stored as references to the actual parameters, but the learning rate is not. This behaviour traces back to load_state_dict of the PyTorch optimizer, which overwrites the param_groups list with the list from the state dict, plugging only the 'params' values back in. At that point the copy of param_groups maintained by LightningOptimizer is no longer kept up to date.

I think a simple solution would be to have the strategy create/update its _lightning_optimizers after a restore from checkpoint. As a user, you can call self.optimizers(use_pl_optimizer=False).param_groups[0]['lr'] instead to work around the issue for now, though I don't know whether bypassing the LightningOptimizer wrapper has side effects with the various training strategies.

A small example: after a fit() which restored from a checkpoint, it looks like this (with an LR of 1e-4 and a scheduler starting at a factor of 1e-3):

trainer.optimizers[0].param_groups[0]['lr']
Out[36]: 0.00010000000000000009

trainer.strategy._lightning_optimizers[0].param_groups[0]['lr']
Out[37]: 1.0000000000000001e-07
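
The underlying PyTorch behaviour can be reproduced without Lightning. A minimal sketch (wrapper_copy merely stands in for the cached copy that LightningOptimizer keeps; it is not the real wrapper):

    import torch

    params = [torch.nn.Parameter(torch.zeros(1))]
    opt = torch.optim.SGD(params, lr=1e-4)
    wrapper_copy = opt.param_groups  # a wrapper caching param_groups at setup time

    state = opt.state_dict()
    state['param_groups'][0]['lr'] = 1e-7  # e.g. the LR the scheduler had reached at checkpoint time
    opt.load_state_dict(state)             # replaces opt.param_groups with fresh dicts

    print(opt.param_groups[0]['lr'])  # 1e-07 -> the real optimizer sees the restored LR
    print(wrapper_copy[0]['lr'])      # 0.0001 -> the cached copy is now stale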

@lminer

lminer commented Oct 25, 2022

Same issue here. The ability to manually adjust the learning rate seems pretty key for me, especially for long running jobs.

@willi-menapace

It seems that in pytorch_lightning.core.optimizer the strategy is passed _optimizer with the correctly loaded learning rate, so training should not be affected by the resume as long as all changes to the learning rate happen through the scheduler and not manually. Still, it would be nice to have a fix for this.

pytorch_lightning.core.optimizer, line 169:
step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)

@Jgoldfeder

same issue for me as well

@jngiam

jngiam commented Jan 28, 2023

I ran into this issue as well, and adding this seems to have fixed it for me:

def on_train_start(self):
    # Point the LightningOptimizer wrapper's param_groups at the underlying
    # optimizer's param_groups so the two stay in sync.
    self.optimizers().param_groups = self.optimizers()._optimizer.param_groups

Making the two param_groups point to the same list seemed to resolve the issue. I would appreciate it if someone could comment on whether there are any pitfalls with this.

@jropen

jropen commented Feb 1, 2023

same issue for me as well

@stale

stale bot commented Mar 19, 2023

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!

@stale bot added the "won't fix" (This will not be worked on) label Mar 19, 2023
@stale bot removed the "won't fix" (This will not be worked on) label Mar 29, 2023
@stephenllh

any updates for this issue?

@zxydi1992

zxydi1992 commented May 1, 2023

Found the same issue here; this should be flagged as a bug instead of a question.

@Borda added the "bug" (Something isn't working) and "help wanted" (Open to be worked on) labels and removed the "question" (Further information is requested) label May 3, 2023
@ipoletaev

The reason is that Lightning wraps the original optimizer in an additional layer, which leads to the wrong references being used relative to the original optimizer's code. In other words, client code calling optimizer.param_groups does not get the same object that the original optimizer's class implementation gets when it accesses .param_groups inside its step() function.

If you're using Lightning's Fabric: you can load the optimizer's state from the checkpoint before wrapping it with setup_optimizers. This makes sure .param_groups is properly referenced through the wrapper.
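
A rough sketch of that ordering with Fabric (the checkpoint path and the 'optimizer' key are hypothetical, and the model is just a stand-in):

    import torch
    from lightning.fabric import Fabric

    fabric = Fabric()
    model = torch.nn.Linear(4, 2)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    # Restore the optimizer state BEFORE wrapping it, so the Fabric wrapper is
    # built around the already-restored param_groups.
    ckpt = torch.load('last.ckpt', map_location='cpu')
    optimizer.load_state_dict(ckpt['optimizer'])

    model = fabric.setup(model)
    optimizer = fabric.setup_optimizers(optimizer)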

@awaelchli
Contributor

This was fixed in #18280
See my full reply here on another issue: #17296 (comment)

@awaelchli
Contributor

Thanks everyone who helped here and sorry I didn't see it earlier!
