Fix "Warmup scheduler is loaded from the state dict, even when initializing fresh" #4196
Conversation
Hi @DrMatters! Thank you for your pull request and welcome to our community. Action Required: In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you. Process: In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged accordingly. If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!
Hello! Is the CI "cleaninstall_37" working properly on the current main?
Yeah, you can ignore clean install. Can we talk more about testing steps? What have you tried? (Can you successfully resume both during and after warmup, as well as starting fresh?)
@stephenroller I've tested the code on the cases you asked about. This helped me catch some bugs, but now everything looks fine.
Except for the tests themselves:
I also suggest refactoring the "lr_scheduler.py" base class to use PyTorch's ChainedScheduler or SequentialLR instead of reimplementing the existing functionality.
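For illustration, a minimal standalone sketch of what that could look like with SequentialLR (available in recent PyTorch releases); the scheduler choices and numbers are placeholders, not the project's actual configuration:

```python
import torch
from torch.optim.lr_scheduler import LambdaLR, SequentialLR, StepLR

# Placeholder values purely for illustration.
warmup_updates = 100

param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.SGD([param], lr=1.0)

# Linear warmup from ~0 up to the base LR over `warmup_updates` steps.
warmup = LambdaLR(optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_updates))
# Whatever the "main" schedule is; StepLR is just a stand-in here.
main = StepLR(optimizer, step_size=1000, gamma=0.5)

# SequentialLR switches from `warmup` to `main` at the milestone, so the
# hand-rolled "am I still warming up?" bookkeeping would not need to be
# reimplemented in lr_scheduler.py.
scheduler = SequentialLR(optimizer, schedulers=[warmup, main], milestones=[warmup_updates])

for _ in range(5):
    optimizer.step()
    scheduler.step()
```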
Can you fix the typo, and also try running things without your changes but with the typo fixed?
I think fixed tests need to be part of the PR. This has always been a tricky part of the code.
Let me clarify things as I see them. There is another issue (#4242) I discovered while working on this PR. This issue is somewhat related to this PR, but fixing it is a whole different topic. It consists of 2 parts:
Do you think it's possible to merge this PR as it is and create separate issues (like #4242) for the other problems?
This PR doesn't seem to affect the tests at all: fixing the typo in the test results in 8 failed tests in
I'm not sure, to be honest. I haven't had time to look at this deeply. I respect that our broken tests aren't your problem, but I'm nervous about changing the behavior of any of this without strong verification of correctness. At best we are keeping a second bug and fixing one; at worst we are replacing one buggy behavior with a DIFFERENT buggy behavior, which I disfavor for stability reasons. The refactor may be an option with newer PyTorch versions. Historically it wasn't, because LR schedulers load their LR from the state_dict (since torch 1.7, I believe), which prevented us from using the clean abstractions; changing the LR mid-flight is a pretty common thing to do. So without tests or graphs, I'm left without confidence in either implementation, and feel inclined to just hold until we can figure out exactly how the tests are broken. I've asked @meganung to see if she can identify the root cause this week. Of course, if you want to help, any analysis (or a fix!) you provide would speed things up. Thanks for your patience!
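To make the state_dict concern concrete, here is a small standalone demonstration (not the project's code; values are arbitrary) of how a scheduler's restored base LR can override a learning rate that was changed mid-flight:

```python
import torch

param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.SGD([param], lr=1.0)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda step: 0.5)

scheduler.step()
state = scheduler.state_dict()       # remembers base_lrs == [1.0]

# Try to change the learning rate "mid flight" before resuming.
optimizer.param_groups[0]["lr"] = 0.1

resumed = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda step: 0.5)
resumed.load_state_dict(state)       # base_lrs comes back from the checkpoint
resumed.step()
print(optimizer.param_groups[0]["lr"])  # 0.5 again; the manual 0.1 is gone
```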
@meganung With the changes presented in #4242, this PR (#4196) also passes the tests.
Loading is removed from "def _init_warmup_scheduler", as it's already handled by "def load_state".
Initializing a warmup_scheduler is a computationally cheap operation. If the warmup scheduler should not be used in a given situation, that is handled by "_is_lr_warming_up". Removing this condition allows for more maintainable code; see the sketch below.
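A rough sketch of the pattern being described, with hypothetical method bodies; this is not the project's actual implementation, only an illustration of always constructing the warmup scheduler and gating its use with `_is_lr_warming_up`:

```python
import torch

class SchedulerSketch:
    def __init__(self, optimizer, warmup_updates):
        self.warmup_updates = warmup_updates
        # Always build the warmup scheduler: construction is cheap, and
        # whether it is actually used is decided by _is_lr_warming_up().
        self.warmup_scheduler = self._init_warmup_scheduler(optimizer)

    def _init_warmup_scheduler(self, optimizer):
        return torch.optim.lr_scheduler.LambdaLR(
            optimizer,
            lr_lambda=lambda step: min(1.0, (step + 1) / max(1, self.warmup_updates)),
        )

    def _is_lr_warming_up(self, steps):
        return self.warmup_scheduler is not None and steps < self.warmup_updates

    def step(self, steps):
        if self._is_lr_warming_up(steps):
            self.warmup_scheduler.step()
        # otherwise the main scheduler would be stepped here

# Tiny usage example.
param = torch.nn.Parameter(torch.zeros(1))
sched = SchedulerSketch(torch.optim.SGD([param], lr=1.0), warmup_updates=100)
sched.step(steps=0)
```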
…used in "def _init_warmup_scheduler"
LambdaLR calls the provided function with step=last_epoch (0 by default), but when loading an already warmed-up checkpoint, it updates the optimizer's learning rate as if it were the first step of warmup. Providing the 'last_epoch' argument solves this problem.
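For illustration, a minimal standalone sketch of that fix (the warmup length and checkpoint step count are made up, and `initial_lr` is set by hand only because no optimizer state is actually loaded in this toy example):

```python
import torch

warmup_updates = 100
updates_from_checkpoint = 40   # pretend this was read from a checkpoint

param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.SGD([param], lr=1.0)

def warmup_lambda(step):
    # linear warmup from ~0 to the full learning rate
    return min(1.0, (step + 1) / warmup_updates)

# PyTorch requires `initial_lr` on each param group when last_epoch != -1;
# normally it is restored together with the optimizer state.
optimizer.param_groups[0]["initial_lr"] = 1.0

# With last_epoch, the schedule resumes mid-warmup (LR ~0.4 here) instead of
# being reset to warmup_lambda(0) == 0.01 as if warmup were starting over.
warmup = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=warmup_lambda, last_epoch=updates_from_checkpoint
)
print(optimizer.param_groups[0]["lr"])
```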
Thanks!
Patch description
This patch fixes the bug described in issue #4195.
Testing steps
I don't know much about testing, but there should probably be a test that checks for the bug described in the issue above.
Other information
lr_scheduler.py needs a huge refactoring