After resuming training, scheduler.step() will not update the optimizer's learning rate #12812
Comments
Did you check the actual learning rate here: self.optimizers().param_groups[0]['lr']? When resuming, the optimizer's state is restored as well, which includes the learning rate. |
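For reference, a minimal sketch of how such a check might look inside a LightningModule hook, assuming a single optimizer and a single scheduler (the class and hook names are illustrative, not from the issue):

```python
import pytorch_lightning as pl

class LitModel(pl.LightningModule):
    def on_train_epoch_start(self) -> None:
        # Learning rate as reported by the LightningOptimizer wrapper.
        wrapper_lr = self.optimizers().param_groups[0]["lr"]
        # Learning rate as reported by the (restored) scheduler.
        scheduler_lr = self.lr_schedulers().get_last_lr()[0]
        self.log_dict({"wrapper_lr": wrapper_lr, "scheduler_lr": scheduler_lr})
```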
I have checked that the scheduler and the optimizer have different learning rates. The scheduler's learning rate is correct, but the optimizer's learning rate is not being updated by the scheduler. |
can you share a reproducible example using |
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team! |
Very same issue here. |
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team! |
It took me 2 days to train my model, and then I realized that the learning rate wasn't updated after resuming from the latest checkpoint... |
Same issue for me too. The problem seems to be that, while the scheduler is correctly resumed, the optimizer object to which it is linked is a different one than the optimizer that is actually being used. |
I had the same issue and looked into it a little. It turns out that by default self.optimizers() returns from trainer.strategy._lightning_optimizers, and LightningOptimizer maintains a copy of the param_groups field. The parameters are all stored as references to the actual parameters, but the learning rate is not. This behaviour traces back to load_state_dict of the PyTorch optimizer, which overwrites the param_groups list with the list from the state dict and only plugs the 'params' values back in. From that point on, the copy of param_groups maintained by LightningOptimizer is no longer kept up to date.

I think a simple solution would be to have the strategy create/update its _lightning_optimizers after a restore from checkpoint. As a user, you can call … A little example: after a fit() which restored from a checkpoint (with an LR of 1e-4 and a scheduler starting at factor 1e-3), the wrapper and the underlying optimizer report different learning rates (a minimal sketch of this mechanism is shown below).
|
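A minimal plain-PyTorch sketch of the mechanism described above: torch.optim.Optimizer.load_state_dict rebuilds param_groups (re-inserting only the 'params' references), so any previously held copy of that list goes stale:

```python
import torch

model = torch.nn.Linear(2, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

# Keep a handle on the current param_groups, similar to the copy that the
# LightningOptimizer wrapper keeps (as described in the comment above).
cached_param_groups = optimizer.param_groups

# Simulate resuming from a checkpoint whose learning rate has already decayed.
state = optimizer.state_dict()
state["param_groups"][0]["lr"] = 1e-7
optimizer.load_state_dict(state)

print(optimizer.param_groups[0]["lr"])                 # 1e-07 -> the real optimizer was updated
print(cached_param_groups[0]["lr"])                    # 0.0001 -> the cached view is stale
print(optimizer.param_groups is cached_param_groups)   # False: the list was replaced
```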
Same issue here. The ability to manually adjust the learning rate seems pretty key for me, especially for long running jobs. |
It seems that in pytorch_lightning.core.optimizer the strategy is passed _optimizer with the correctly loaded learning rate, so training should not be affected by the resume as long as all changes to the learning rate happen through the scheduler and not manually. Still, it would be nice to have a fix for this. #169 pytorch_lightning.core.optimizer |
same issue for me as well |
I ran into this issue as well, and adding this seems to have fixed it for me:
Making the two param_groups point to the same object seemed to resolve the issue. I would appreciate it if someone could comment on whether there are any pitfalls with this (a sketch of the idea follows below). |
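The snippet referenced above is not preserved in this thread; the following is a hypothetical reconstruction of that idea (class and hook names are illustrative), assuming a Lightning version in which the LightningOptimizer wrapper keeps its own param_groups attribute:

```python
import pytorch_lightning as pl

class LitModel(pl.LightningModule):
    def on_train_start(self) -> None:
        optimizers = self.optimizers()
        if not isinstance(optimizers, list):
            optimizers = [optimizers]
        for wrapped in optimizers:
            # Re-point the wrapper's stale copy at the restored optimizer's
            # param_groups so both views stay in sync after resuming.
            wrapped.param_groups = wrapped.optimizer.param_groups
```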
same issue for me as well |
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team! |
any updates for this issue? |
Found the same issue here; this one should be flagged as a bug instead. |
The reason is that the Lightning code wraps the original optimizer in an additional layer, which causes the original optimizer's code to work with the wrong references: after restoring, the wrapper and the underlying optimizer no longer share the same param_groups. If you're using Lightning's Fabric, you can work around this by loading the optimizer's state from the checkpoint before wrapping it with fabric.setup(). |
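A minimal sketch of that ordering with Fabric; the checkpoint path and the "model"/"optimizer" keys are assumptions about how the checkpoint was saved, not something stated in this thread:

```python
import torch
from lightning.fabric import Fabric

fabric = Fabric(accelerator="auto", devices=1)
fabric.launch()

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Restore the raw objects first (path and keys are assumptions).
checkpoint = torch.load("last.ckpt", map_location="cpu")
model.load_state_dict(checkpoint["model"])
optimizer.load_state_dict(checkpoint["optimizer"])

# Only wrap with Fabric once the restored state is already in place.
model, optimizer = fabric.setup(model, optimizer)
```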
This was fixed in #18280 |
Thanks everyone who helped here and sorry I didn't see it earlier! |
I found a bug: when I resume training from a checkpoint, the learning rate always equals the init_lr I set. After debugging, I found that scheduler.step() does not change the optimizer's learning rate, so I set it manually to avoid this bug.
cc @Borda
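The issue author's actual workaround is not shown above; the following is a rough, hypothetical illustration of "setting it manually", assuming a single optimizer and scheduler (class and hook names are illustrative):

```python
import pytorch_lightning as pl

class LitModel(pl.LightningModule):
    def on_train_epoch_start(self) -> None:
        scheduler = self.lr_schedulers()
        # The underlying torch optimizer actually used for the update step.
        optimizer = self.optimizers().optimizer
        # Copy the scheduler's current learning rate(s) into the param groups.
        for group, lr in zip(optimizer.param_groups, scheduler.get_last_lr()):
            group["lr"] = lr
```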