
After resuming training, scheduler.step() will not update the optimizer's learning rate #12812

Closed
lanslotttTT opened this issue Apr 20, 2022 · 20 comments

@lanslotttTT

lanslotttTT commented Apr 20, 2022

I found a bug: when I resume training from a checkpoint, the learning rate always equals the init_lr I set. After debugging, I found that scheduler.step() does not change the optimizer's learning rate. So I set it manually to work around the bug.

    def on_epoch_start(self) -> None:
        # Manually copy the scheduler's current learning rate into the
        # optimizer's param_groups at the start of every epoch.
        self.optimizers().param_groups[0]['lr'] = self.lr_schedulers().get_lr()[0]

cc @Borda

@lanslotttTT added the "needs triage" (Waiting to be triaged by maintainers) label Apr 20, 2022
@rohitgr7
Contributor

did you check the actual learning rate here?

self.optimizers().param_groups[0]['lr']

since when resuming, the optimizer's state is also restored, which includes the learning rate.
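
For reference, a quick way to compare the two from inside the LightningModule is a sketch like the one below (assuming a single optimizer and scheduler; get_last_lr() is the standard PyTorch scheduler API):

    def on_train_start(self) -> None:
        # Compare the LR seen through the LightningOptimizer wrapper, the LR of
        # the underlying torch optimizer, and the LR of the resumed scheduler.
        wrapped_lr = self.optimizers().param_groups[0]['lr']
        raw_lr = self.optimizers(use_pl_optimizer=False).param_groups[0]['lr']
        sched_lr = self.lr_schedulers().get_last_lr()[0]
        print(f"wrapped={wrapped_lr} raw={raw_lr} scheduler={sched_lr}")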

@rohitgr7 added the "question" (Further information is requested) label and removed the "needs triage" (Waiting to be triaged by maintainers) label Apr 20, 2022
@lanslotttTT
Author

lanslotttTT commented Apr 20, 2022 via email

@rohitgr7
Contributor

@stale

stale bot commented Jun 6, 2022

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

@stale bot added the "won't fix" (This will not be worked on) label Jun 6, 2022
@NiccoloCavagnero

Very same issue here.

@stale bot removed the "won't fix" (This will not be worked on) label Jun 14, 2022
@stale

stale bot commented Jul 22, 2022

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

@stale bot added the "won't fix" (This will not be worked on) label Jul 22, 2022
@lamnguyenvux

It took me 2 days to train my model, and then I realized that the learning rate wasn't updated after resuming from the latest checkpoint...

@stale bot removed the "won't fix" (This will not be worked on) label Jul 28, 2022
@ga1i13o

ga1i13o commented Oct 12, 2022

Same issue for me too. The problem seems to be that, while the scheduler is correctly resumed, the optimizer object it is linked to is different from the optimizer that is actually being used.

@FrankZijlstra

FrankZijlstra commented Oct 17, 2022

I had the same issue and looked into it a little bit. It turns out that by default self.optimizers() returns the optimizer from trainer.strategy._lightning_optimizers, and LightningOptimizer maintains a copy of the param_groups field. The parameters are all stored as references to the actual parameters, but the learning rate is not. This behaviour traces back to load_state_dict of the PyTorch optimizer, which overwrites the param_groups list with the list from the state dict, plugging only the 'params' values back in. At that point the copy of param_groups maintained by LightningOptimizer is no longer kept up to date.

I think a simple solution would be to have the strategy create/update its _lightning_optimizers after a restore from checkpoint. As a user, you can call self.optimizers(use_pl_optimizer=False).param_groups[0]['lr'] instead to work around the issue for now, though I don't know whether bypassing the LightningOptimizer wrapper has side effects with the various training strategies.

A small example: after a fit() which restored from a checkpoint, it looks like this (with an LR of 1e-4 and a scheduler starting at a factor of 1e-3):

trainer.optimizers[0].param_groups[0]['lr']
Out[36]: 0.00010000000000000009

trainer.strategy._lightning_optimizers[0].param_groups[0]['lr']
Out[37]: 1.0000000000000001e-07
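
The underlying PyTorch behaviour can be reproduced without Lightning. A minimal sketch (wrapper_copy merely stands in for the cached copy that LightningOptimizer keeps; it is not the real wrapper):

    import torch

    params = [torch.nn.Parameter(torch.zeros(1))]
    opt = torch.optim.SGD(params, lr=1e-4)
    wrapper_copy = opt.param_groups  # a wrapper caching param_groups at setup time

    state = opt.state_dict()
    state['param_groups'][0]['lr'] = 1e-7  # e.g. the LR the scheduler had reached at checkpoint time
    opt.load_state_dict(state)             # replaces opt.param_groups with fresh dicts

    print(opt.param_groups[0]['lr'])  # 1e-07 -> the real optimizer sees the restored LR
    print(wrapper_copy[0]['lr'])      # 0.0001 -> the cached copy is now stale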

@lminer

lminer commented Oct 25, 2022

Same issue here. The ability to manually adjust the learning rate seems pretty key for me, especially for long running jobs.

@willi-menapace

It seems that in pytorch_lightning.core.optimizer the strategy is passed _optimizer with the correctly loaded learning rate, so training should not be affected by the resume as long as all changes to the learning rate happen through the scheduler and not manually. Still, it would be nice to have a fix for this.

pytorch_lightning.core.optimizer, line 169:
step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)

@Jgoldfeder

same issue for me as well

@jngiam

jngiam commented Jan 28, 2023

I ran into this issue as well, and adding this seems to have fixed it for me:

def on_train_start(self):
    # Point the LightningOptimizer wrapper's param_groups at the underlying
    # optimizer's param_groups so the two stay in sync.
    self.optimizers().param_groups = self.optimizers()._optimizer.param_groups

Making the two param_groups point to the same list seemed to resolve the issue. I would appreciate it if someone could comment on whether there are any pitfalls with this.

@jropen

jropen commented Feb 1, 2023

same issue for me as well

@stale

stale bot commented Mar 19, 2023

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!

@stale bot added the "won't fix" (This will not be worked on) label Mar 19, 2023
@stale bot removed the "won't fix" (This will not be worked on) label Mar 29, 2023
@stephenllh

any updates for this issue?

@zxydi1992

zxydi1992 commented May 1, 2023

Found the same issue here; this should be flagged as a bug instead of a question.

@Borda added the "bug" (Something isn't working) and "help wanted" (Open to be worked on) labels and removed the "question" (Further information is requested) label May 3, 2023
@ipoletaev

The reason is that Lightning wraps the original optimizer in an additional layer, which leads to the wrong references being used relative to the original optimizer's code. In other words, client code calling optimizer.param_groups does not get the same object that the original optimizer's class implementation gets when it accesses .param_groups inside its step() function.

If you're using Lightning's Fabric: you can load the optimizer's state from the checkpoint before wrapping it with setup_optimizers. This makes sure .param_groups is properly referenced through the wrapper.
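
A rough sketch of that ordering with Fabric (the checkpoint path and the 'optimizer' key are hypothetical, and the model is just a stand-in):

    import torch
    from lightning.fabric import Fabric

    fabric = Fabric()
    model = torch.nn.Linear(4, 2)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    # Restore the optimizer state BEFORE wrapping it, so the Fabric wrapper is
    # built around the already-restored param_groups.
    ckpt = torch.load('last.ckpt', map_location='cpu')
    optimizer.load_state_dict(ckpt['optimizer'])

    model = fabric.setup(model)
    optimizer = fabric.setup_optimizers(optimizer)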

@awaelchli
Contributor

This was fixed in #18280
See my full reply here on another issue: #17296 (comment)

@awaelchli
Contributor

Thanks everyone who helped here and sorry I didn't see it earlier!
