Fix "Warmup scheduler is loaded from the state dict, even when initializing fresh" #4196
Conversation
Hi @DrMatters! Thank you for your pull request and welcome to our community. Action Required: In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you. Process: In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged accordingly. If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!
Hello! Is the CI "cleaninstall_37" working properly on the current main?
Yeah, you can ignore clean install. Can we talk more about testing steps? What have you tried? (Can you successfully resume both during and after warmup, as well as starting fresh?)
@stephenroller I've tested the code on the cases you asked about. This helped me catch some bugs, but now everything looks fine.
Except for the tests themselves:
I also suggest refactoring the "lr_scheduler.py" base class to use PyTorch's ChainedScheduler or SequentialLR instead of reimplementing the existing functionality.
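For illustration, a minimal standalone sketch of what that could look like with SequentialLR (available in recent PyTorch releases); the scheduler choices and numbers are placeholders, not the project's actual configuration:

```python
import torch
from torch.optim.lr_scheduler import LambdaLR, SequentialLR, StepLR

# Placeholder values purely for illustration.
warmup_updates = 100

param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.SGD([param], lr=1.0)

# Linear warmup from ~0 up to the base LR over `warmup_updates` steps.
warmup = LambdaLR(optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_updates))
# Whatever the "main" schedule is; StepLR is just a stand-in here.
main = StepLR(optimizer, step_size=1000, gamma=0.5)

# SequentialLR switches from `warmup` to `main` at the milestone, so the
# hand-rolled "am I still warming up?" bookkeeping would not need to be
# reimplemented in lr_scheduler.py.
scheduler = SequentialLR(optimizer, schedulers=[warmup, main], milestones=[warmup_updates])

for _ in range(5):
    optimizer.step()
    scheduler.step()
```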
Can you fix the typo, and also try running things without your changes but with the typo fixed?
I think fixed tests need to be part of the PR. This has always been a tricky part of the code.
Let me clarify things as I see them. There is another issue (#4242) I discovered while working on this PR. This issue is somewhat related to this PR, but fixing it is a whole different topic. It consists of 2 parts:
Do you think it's possible to merge this PR as it is and create separate issues (like #4242) for the other problems?
This PR doesn't seem to affect the tests at all: fixing the typo in the test results in 8 failed tests in
I'm not sure, to be honest. I haven't had time to look at this deeply. I respect that our broken tests aren't your problem, but I'm nervous about changing the behavior of any of this without strong verification of correctness. At best we are keeping a second bug and fixing one; at worst we are replacing one buggy behavior with a DIFFERENT buggy behavior, which I disfavor for stability reasons. The refactor may be an option with newer PyTorch versions. Historically it wasn't, because LR schedulers load their LR from the state_dict (since torch 1.7, I believe), which prevented us from using the clean abstractions; changing the LR mid-flight is a pretty common thing to do. So without tests or graphs, I'm left without confidence in either implementation, and feel inclined to just hold until we can figure out exactly how the tests are broken. I've asked @meganung to see if she can identify the root cause this week. Of course, if you want to help, any analysis (or a fix!) you provide would speed things up. Thanks for your patience!
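To make the state_dict concern concrete, here is a small standalone demonstration (not the project's code; values are arbitrary) of how a scheduler's restored base LR can override a learning rate that was changed mid-flight:

```python
import torch

param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.SGD([param], lr=1.0)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda step: 0.5)

scheduler.step()
state = scheduler.state_dict()       # remembers base_lrs == [1.0]

# Try to change the learning rate "mid flight" before resuming.
optimizer.param_groups[0]["lr"] = 0.1

resumed = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda step: 0.5)
resumed.load_state_dict(state)       # base_lrs comes back from the checkpoint
resumed.step()
print(optimizer.param_groups[0]["lr"])  # 0.5 again; the manual 0.1 is gone
```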
@meganung With the changes presented in #4242, this PR (#4196) also passes the tests.
Loading is removed from "def _init_warmup_scheduler", as it's already handled by "def load_state".
Initializing a warmup_scheduler is a computationally cheap operation. If the warmup scheduler should not be used in a given situation, that is handled by "_is_lr_warming_up". Removing this condition allows for more maintainable code; see the sketch below.
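A rough sketch of the pattern being described, with hypothetical method bodies; this is not the project's actual implementation, only an illustration of always constructing the warmup scheduler and gating its use with `_is_lr_warming_up`:

```python
import torch

class SchedulerSketch:
    def __init__(self, optimizer, warmup_updates):
        self.warmup_updates = warmup_updates
        # Always build the warmup scheduler: construction is cheap, and
        # whether it is actually used is decided by _is_lr_warming_up().
        self.warmup_scheduler = self._init_warmup_scheduler(optimizer)

    def _init_warmup_scheduler(self, optimizer):
        return torch.optim.lr_scheduler.LambdaLR(
            optimizer,
            lr_lambda=lambda step: min(1.0, (step + 1) / max(1, self.warmup_updates)),
        )

    def _is_lr_warming_up(self, steps):
        return self.warmup_scheduler is not None and steps < self.warmup_updates

    def step(self, steps):
        if self._is_lr_warming_up(steps):
            self.warmup_scheduler.step()
        # otherwise the main scheduler would be stepped here

# Tiny usage example.
param = torch.nn.Parameter(torch.zeros(1))
sched = SchedulerSketch(torch.optim.SGD([param], lr=1.0), warmup_updates=100)
sched.step(steps=0)
```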
…used in "def _init_warmup_scheduler"
LambdaLR calls the provided function with step=last_epoch (0 by default), but when loading an already warmed-up checkpoint, it updates the optimizer's learning rate as if it were the first step of warmup. Providing the 'last_epoch' argument solves this problem.
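For illustration, a minimal standalone sketch of that fix (the warmup length and checkpoint step count are made up, and `initial_lr` is set by hand only because no optimizer state is actually loaded in this toy example):

```python
import torch

warmup_updates = 100
updates_from_checkpoint = 40   # pretend this was read from a checkpoint

param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.SGD([param], lr=1.0)

def warmup_lambda(step):
    # linear warmup from ~0 to the full learning rate
    return min(1.0, (step + 1) / warmup_updates)

# PyTorch requires `initial_lr` on each param group when last_epoch != -1;
# normally it is restored together with the optimizer state.
optimizer.param_groups[0]["initial_lr"] = 1.0

# With last_epoch, the schedule resumes mid-warmup (LR ~0.4 here) instead of
# being reset to warmup_lambda(0) == 0.01 as if warmup were starting over.
warmup = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=warmup_lambda, last_epoch=updates_from_checkpoint
)
print(optimizer.param_groups[0]["lr"])
```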
Thanks!
Patch description
This patch fixes the bug described in issue #4195.
Testing steps
I don't know much about testing, but there should probably be a test that checks for the bug described in the issue above.
Other information
lr_scheduler.py needs a huge refactoring