
Fix "Warmup scheduler is loaded from the state dict, even when initializing fresh" #4196

Merged: 13 commits, Dec 13, 2021

Conversation

DrMatters
Contributor

Patch description
This patch fixes the bug described in issue #4195.

Testing steps
I don't know much about the test suite, but we should probably add a test that checks for the bug described in the issue above.
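
For illustration, a minimal sketch of the property such a test could check, written against plain PyTorch rather than ParlAI's actual test harness (the helper `build_warmup` and its arguments are hypothetical):

```python
# A hedged sketch, not ParlAI's test API: a freshly initialized scheduler
# must start warmup from step 0, regardless of any previously saved state.
import torch
from torch.optim.lr_scheduler import LambdaLR


def build_warmup(optimizer, warmup_updates):
    # Hypothetical helper: linear warmup factor ramping from 1/N up to 1.
    return LambdaLR(
        optimizer, lambda step: min(1.0, (step + 1) / warmup_updates)
    )


def test_fresh_init_starts_warmup_at_step_zero():
    opt = torch.optim.SGD(torch.nn.Linear(2, 2).parameters(), lr=1.0)
    build_warmup(opt, warmup_updates=10)
    # On fresh initialization the LR must be the very first warmup value,
    # not a value carried over from an old checkpoint.
    assert abs(opt.param_groups[0]['lr'] - 0.1) < 1e-12
```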

Other information
lr_scheduler.py needs a huge refactoring

@facebook-github-bot

Hi @DrMatters!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!

@facebook-github-bot

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!

@DrMatters
Contributor Author

Hello! Is the "cleaninstall_37" CI check working properly on current main?
It fails with "error: importlib-metadata 4.2.0 is installed but importlib-metadata>=4.4 is required by {'markdown'}", which doesn't seem to be related to my changes.


@stephenroller
Contributor

Yeah, you can ignore cleaninstall.

Can we talk more about testing steps? What have you tried? (Can you successfully resume both during and after warmup, as well as starting fresh?)

@DrMatters
Contributor Author

@stephenroller I've tested the code on the cases you asked about. This helped me catch some bugs, but now everything looks fine.

@DrMatters
Contributor Author

Except for the tests themselves:
https://github.com/facebookresearch/ParlAI/blob/main/tests/test_lr_schedulers.py#L205
if 'warump_updates' in kwargs:
There's a typo in the word 'warmup', which renders all the warmup tests inactive.
Correcting the typo makes a whole bunch of tests in test_lr_schedulers.py fail. It's up to you.
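
For illustration, a self-contained sketch of the typo and its effect (the `kwargs` dict here is a stand-in for the test's actual arguments):

```python
kwargs = {'warmup_updates': 50}  # stand-in for the test's kwargs

if 'warump_updates' in kwargs:  # typo: never True, branch silently skipped
    print('warmup assertions (never reached)')

if 'warmup_updates' in kwargs:  # corrected spelling: branch now exercised
    print('warmup assertions (now run)')
```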

@DrMatters
Contributor Author

I also suggest refactoring the lr_scheduler.py base class to use PyTorch's ChainedScheduler or SequentialLR instead of reimplementing the existing functionality; a sketch of what that could look like is below.
It may help make warmup schedulers more comprehensible and easier to handle (storing/loading). What do you think of this?
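
For concreteness, a minimal sketch of the suggested refactor, assuming PyTorch >= 1.10 (where SequentialLR and LinearLR were introduced); the `warmup_updates` name and the choice of StepLR as the main scheduler are purely illustrative:

```python
import torch
from torch.optim.lr_scheduler import LinearLR, SequentialLR, StepLR

model = torch.nn.Linear(8, 8)
optimizer = torch.optim.SGD(model.parameters(), lr=1.0)

warmup_updates = 100  # illustrative warmup length

# Linear warmup from a small factor up to the base LR, then a step decay
# as the "main" scheduler; SequentialLR handles the handoff, including
# state_dict save/load, instead of custom warmup logic.
warmup = LinearLR(optimizer, start_factor=1e-3, total_iters=warmup_updates)
main = StepLR(optimizer, step_size=1000, gamma=0.5)
scheduler = SequentialLR(
    optimizer, schedulers=[warmup, main], milestones=[warmup_updates]
)

for _ in range(200):
    optimizer.step()
    scheduler.step()
```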

@stephenroller
Contributor

Can you fix the typo, and also try running things with the typo fixed but without your changes?

@DrMatters
Contributor Author

@stephenroller

Can you fix the typo, and also try running things with the typo fixed but without your changes?

I've created a PR and an issue for this: #4242 and #4243.
Can you review this PR independently of the test issue I found?

@stephenroller
Contributor

I think the fixed tests need to be part of the PR. This has always been a tricky part of the code.

@DrMatters
Contributor Author

DrMatters commented Dec 10, 2021

Let me clarify things as I see them.
I've made a fix for issue #4195, "Warmup scheduler is loaded from the state dict, even when initializing fresh".
All tests (except cleaninstall_37, which you told me to ignore) are passing.
My current PR is dedicated to fixing that issue.

There is another issue (#4242) I discovered while working on this PR. It is somewhat related to this PR, but fixing it is a whole different topic. It consists of 2 parts:

  1. Fixing the typo in the LR scheduler tests. I did that here: Fix lr scheduler tests & lr schedulers #4242.
  2. Fixing a bunch of bugs that were exposed by the fixed tests. This may require a major refactoring, including rewriting all the schedulers to use SequentialLR or ChainedScheduler instead of custom implementations. That is far beyond the scope of this PR.

Do you think it's possible to merge this PR as it is and create separate issues (like #4242) for the other problems?

@meganung
Contributor

This PR doesn't seem to affect the tests at all: fixing the typo in the tests results in 8 failing tests in test_lr_schedulers.py, both with and without the changes in this PR.

@stephenroller
Contributor

I'm not sure, to be honest. I haven't had time to look at this deeply.

I respect that our broken tests aren't your problem, but I'm nervous about changing the behavior of any of this without strong verification of correctness. At best we are keeping a second bug and fixing one; at worst we are replacing one buggy behavior with a DIFFERENT buggy behavior, which I disfavor for stability reasons.

The refactor may be an option because of newer PyTorch versions. Historically it wasn't, because LR schedulers load their LR from the state_dict (since torch 1.7, I believe), which prevented us from using the clean abstractions, as changing the LR mid-flight is a pretty common thing to do.

So without tests or graphs, I'm left without confidence in either implementation, and feel inclined to just hold until we can figure out exactly how the tests are broken.

I've asked @meganung to see if she can identify the root cause this week. Of course, if you want to help, any analysis (or a fix!) you provide would speed things up. Thanks for your patience!

@DrMatters
Contributor Author

DrMatters commented Dec 12, 2021

@meganung,
@stephenroller,
I've fixed the LR schedulers by updating _is_lr_warming_up (so they pass the tests) in my PR #4242, but it is a kludge meant to help identify the root cause of the incorrect behavior. Using the kludge, you can see and verify that the problem is tied to the moment of switching from the warmup scheduler to the main scheduler.

With the changes presented there (#4242), this PR (#4196) also passes the tests.

To summarize the changes in this PR:

  1. Loading is removed from _init_warmup_scheduler, as it's already handled by load_state.
  2. Initializing a warmup_scheduler is a computationally cheap operation. The case where the warmup scheduler should not be in effect is already handled by _is_lr_warming_up, so removing the initialization condition allows for more maintainable code.
  3. LambdaLR calls the provided function with step=last_epoch (0 by default), so when loading an already warmed-up checkpoint it updates the optimizer's learning rate as if it were the first warmup step. Providing the last_epoch argument solves this problem; see the sketch below.
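
For illustration, a minimal sketch of the last_epoch fix described above, using plain PyTorch (the names `warmup_updates` and `_warmup_lr` are stand-ins, not necessarily ParlAI's exact ones):

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(4, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=1.0)

warmup_updates = 100  # illustrative warmup length


def _warmup_lr(step):
    # Linear warmup factor in (0, 1].
    return min(1.0, (step + 1) / warmup_updates)


# Resuming from a checkpoint that has already taken 50 warmup steps.
steps_taken = 50

# PyTorch requires 'initial_lr' in each param group when a scheduler is
# constructed with last_epoch != -1 (it is normally recorded on the very
# first construction of the scheduler).
for group in optimizer.param_groups:
    group.setdefault('initial_lr', group['lr'])

# Without last_epoch, LambdaLR applies _warmup_lr(0) on construction,
# resetting the LR as if warmup were just starting; passing last_epoch
# resumes the schedule from the correct step instead.
warmup_scheduler = LambdaLR(optimizer, _warmup_lr, last_epoch=steps_taken)
```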
@stephenroller
Contributor

stephenroller left a comment


Thanks!

@stephenroller merged commit 78fb36d into facebookresearch:main on Dec 13, 2021