[bug-fix] DDP and automatic_optimization=False #4485
Conversation
…tps://github.com/PyTorchLightning/pytorch-lightning into bugfix/4444_ddp_and_automatic_optimization=False
Codecov Report
@@            Coverage Diff            @@
##           master   #4485    +/-   ##
========================================
+ Coverage      90%     93%      +3%
========================================
  Files         116     116
  Lines        8858    8883      +25
========================================
+ Hits         8012    8275     +263
+ Misses        846     608     -238
pytorch_lightning/core/lightning.py (Outdated)
@@ -1094,7 +1095,9 @@ def training_step(...):
        self._verify_is_manual_optimization('manual_backward')

        # backward
        self._running_manual_optim = True
Would it be easier to add an optional arg to `backward`? Or do we need the `_running_manual_optim` state somewhere else as well?
Great question!
I thought about it, but it would mean broadcasting a new parameter into training_loop, accelerator, etc.
I thought the simpler, the better. And now it defines a scope around manual optimization if we need it. Maybe an actual context manager would be cleaner there.
@SeanNaren @Borda Any thoughts?
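For illustration, a minimal, hypothetical sketch of the context-manager idea (the mixin and method names below are made up for this sketch and are not code from the PR):

from contextlib import contextmanager


class ManualOptimScopeMixin:
    """Hypothetical sketch only: scope the manual-optimization flag with a
    context manager instead of setting and resetting it by hand."""

    _running_manual_optim = False

    @contextmanager
    def manual_optim_scope(self):
        self._running_manual_optim = True
        try:
            yield
        finally:
            # the flag is cleared even if backward() raises
            self._running_manual_optim = False


# usage inside a (hypothetical) manual_backward implementation:
#     with self.manual_optim_scope():
#         self.trainer.train_loop.backward(loss, optimizer, -1, *args, **kwargs)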
I could reproduce the bug with ddp but not locally. Need more investigation.
Yes, I am confused why this bug only appears in ddp. Could you confirm you don't see the behaviour with ddp_cpu?
Need more investigation :)
Hey @s-rog,
I tried to pass it as a parameter, but it started to get ugly, so I preferred to revert to my first idea.
I hope you understand 👍
Best regards,
Thomas Chaton
…tps://github.com/PyTorchLightning/pytorch-lightning into bugfix/4444_ddp_and_automatic_optimization=False
Hello @tchaton! Thanks for updating this PR.
Comment last updated at 2020-11-10 19:19:10 UTC
…tps://github.com/PyTorchLightning/pytorch-lightning into bugfix/4444_ddp_and_automatic_optimization=False
…tps://github.com/PyTorchLightning/pytorch-lightning into bugfix/4444_ddp_and_automatic_optimization=False
@@ -33,6 +33,7 @@ timit_data/
 .Python
 ide_layouts/
 build/
+_build/
How is this one created? Well, we can keep it, just curious what process did it :]
I added this one so that when moving the docs from build to _build and serving them with sphinx-serve, they don't get added.
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Thanks @tchaton, this was a real refactor/fix!
        self.trainer.train_loop.backward(loss, optimizer, -1, *args, **kwargs)
        self._running_manual_backward = False

    def manual_optimizer_step(self, optimizer: Optimizer, force_optimizer_step: bool = False) -> None:
Can we make sure to forward arguments when we do stuff like this? I.e. right now the user can't use the args of `.step`... so:
def manual_optimizer_step(self, *args, optimizer: Optimizer, force_optimizer_step: bool = False, **kwargs):
    ...  # eventually forwards to:
    optimizer.step(*args, **kwargs)
Thanks @williamFalcon. I will add a PR for it tomorrow.
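For illustration only, a self-contained sketch of the forwarding idea (the function name is hypothetical and this is not the follow-up PR's code). Extra arguments, such as the closure that LBFGS requires, would pass straight through to `optimizer.step`:

import torch


def manual_optimizer_step_sketch(optimizer: torch.optim.Optimizer, *args,
                                 force_optimizer_step: bool = False, **kwargs) -> None:
    # Lightning-specific flags would be consumed here; everything else is
    # forwarded untouched to the underlying optimizer.
    optimizer.step(*args, **kwargs)


param = torch.nn.Parameter(torch.randn(3))
opt = torch.optim.LBFGS([param], lr=0.1)


def closure():
    opt.zero_grad()
    loss = (param ** 2).sum()
    loss.backward()
    return loss


# the closure reaches LBFGS.step() through the forwarded positional args
manual_optimizer_step_sketch(opt, closure)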
* resolve bug
* add self._running_manual_optim
* update
* update tests
* update lightning module
* resolve bug
* update tests
* update
* resolve pep8
* update
* replace by `ddp_spawn`
* temporary fix
* update
* update
* move update to training_loop
* make both ddp_spawn
* introduce `manual_optimizer_step`
* update changelog
* added changelog wrong place
* add force_optimizer_step
* update docstring for tests
* update optimizer_step
* update zero_grad
* resolve flake8
* move update into manual_optimizer_step
* add zero_grad
* remove zero_grad tests
* remove manual_backward in AMP, it doesn't help
* update
* loosen tests
* update
* update doc
* add TODO
* Removed unnecessary get model from native amp
* Remove try except with pytest raise
* Add seed, clean up imports, remove try catch to reproduce error
* update code
* update test
* revert back
* formatting
* Update pytorch_lightning/core/lightning.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
(cherry picked from commit 7e08b0d)
This reverts commit 10488dc
What does this PR do?
In `automatic_optimization=False`, if the user returned a detached `tensor`, it created a RuntimeError. We still don't know the reason behind it, but it is now caught and a `MisconfigurationError` is raised instead.
Also, this PR adds `_running_manual_optim` to make sure that if the user returns the loss at the end of `training_step`, backward won't be called again and add gradients, which is verified by
assert torch.sum(self.layer.weight.grad) == 0
Closes #4444
Closes #4485
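For context, a minimal sketch of how manual optimization looks from the user side with this change. This is not code from the PR; the toy model, data, and the Trainer-level `automatic_optimization=False` flag reflect the PL 1.0.x API as I understand it from this thread:

import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class ManualOptimModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        opt = self.trainer.optimizers[0]  # assumption: a single optimizer
        loss = self.layer(batch[0]).sum()
        # use the hooks discussed above instead of loss.backward() / opt.step()
        self.manual_backward(loss, opt)
        self.manual_optimizer_step(opt)
        # nothing needs to be returned here; if the loss is returned, this PR
        # ensures the training loop does not run a second backward on it

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


x = torch.randn(64, 32)
train = DataLoader(TensorDataset(x, torch.zeros(64)), batch_size=8)
trainer = pl.Trainer(automatic_optimization=False, max_epochs=1, limit_train_batches=4)
trainer.fit(ManualOptimModel(), train)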
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.
Did you have fun?
Make sure you had fun coding 🙃