
[hotfix] ddp + manual_optimisation #4976

Merged: 31 commits merged into master from bug/fixfix_ddp_manual on Dec 7, 2020

Conversation

@tchaton (Contributor) commented Dec 4, 2020

What does this PR do?

Fixes #4953

SeanNaren EDIT:

There are two parts to this fix:

  • We assumed that when accumulating gradients, we do not need to sync gradients across distributed processes within the training step. This is fine for automatic optimization, but when the user has complete control over the training flow in training_step we have to allow the sync. We had also hard-coded this behaviour to DDP; we now have custom parallel wrappers :) (A minimal sketch of this sync logic follows the list.)
  • In the custom DDP override, we've moved the training step/val step/test step into the internal forward function. The issue is that the reducer's backward hooks are only registered after we exit these calls, which breaks manual optimization. The PR proposes letting the accelerator add these hooks manually.
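
A minimal sketch of that sync logic, with illustrative names rather than the actual Lightning hooks, assuming a torch DistributedDataParallel model: the gradient all-reduce is skipped only while accumulating under automatic optimization; under manual optimization the no_sync() shortcut is never taken.

import contextlib

def ddp_sync_context(ddp_model, accumulating, automatic_optimization):
    # Illustrative helper, not the merged Lightning code: choose the context
    # for one training step. DistributedDataParallel.no_sync() skips the
    # gradient all-reduce, which is only safe while accumulating gradients
    # under automatic optimization. With manual optimization the user drives
    # backward(), so gradient sync must stay enabled.
    if automatic_optimization and accumulating:
        return ddp_model.no_sync()
    return contextlib.nullcontext()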

Before submitting

  • Was this discussed/approved via a GitHub issue? (not needed for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together? Otherwise, we ask you to create a separate PR for every change.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified; bugfixes should be included in bug-fix release milestones (m.f.X) and features should be included in (m.X.b) releases.

Did you have fun?

Make sure you had fun coding 🙃

@edenlightning edenlightning added the bug Something isn't working label Dec 4, 2020
@edenlightning edenlightning added this to the 1.1 milestone Dec 4, 2020
@pep8speaks commented Dec 5, 2020

Hello @tchaton! Thanks for updating this PR.

Line 911:121: E501 line too long (121 > 120 characters)
Line 974:34: E203 whitespace before ':'

Comment last updated at 2020-12-07 18:24:01 UTC

@codecov (bot) commented Dec 5, 2020

Codecov Report

Merging #4976 (678f642) into master (68ba493) will increase coverage by 0%.
The diff coverage is 88%.

@@          Coverage Diff           @@
##           master   #4976   +/-   ##
======================================
  Coverage      93%     93%           
======================================
  Files         130     130           
  Lines        9527    9547   +20     
======================================
+ Hits         8843    8871   +28     
+ Misses        684     676    -8     

@tchaton tchaton added the distributed Generic distributed-related topic label Dec 5, 2020
@tchaton tchaton self-assigned this Dec 5, 2020
@tchaton tchaton added the priority: 0 High priority task label Dec 5, 2020
@tchaton tchaton changed the title from "Bug/fixfix ddp manual" to "[hotfix] ddp + manual_optimisation" Dec 5, 2020
@tchaton tchaton marked this pull request as ready for review December 6, 2020 10:18
@Borda (Member) left a comment

can we add return types to the newly added methods?

Resolved review threads: pytorch_lightning/trainer/training_loop.py, tests/trainer/optimization/test_manual_optimization.py
SeanNaren and others added 2 commits December 6, 2020 22:44
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
@SeanNaren (Contributor) left a comment

Thanks for getting this over the line @tchaton

@awaelchli (Contributor) left a comment

The tests pass, but the loss is NaN; please check the gradient flow in the model again.

Resolved review threads: tests/special_tests.sh, pytorch_lightning/trainer/training_loop.py
make_manual_backward(loss_ones_gen, opt_dis)

# this will accumulate gradients for 2 batches and then call opt_gen.step()
opt_gen.step(closure=gen_closure, make_optimizer_step=batch_idx % 2 == 0, optim='sgd')
A Contributor left a review comment on the snippet above:
Suggested change
opt_gen.step(closure=gen_closure, make_optimizer_step=batch_idx % 2 == 0, optim='sgd')
opt_gen.step(closure=gen_closure, make_optimizer_step=(batch_idx % 2 == 0), optim='sgd')

what is this extra argument optim="sgd"? This doesn't look right.

@tchaton (Contributor, Author) replied:

It is an arbitrary argument, just to make sure it is properly passed through to optimizer.step(*args, **kwargs).
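
Illustration only, a generic wrapper rather than Lightning's own optimizer wrapper: any extra positional or keyword arguments given to step() are simply forwarded to the underlying torch optimizer's step(), which is what the arbitrary optim='sgd' argument in the test exercises.

class OptimizerWrapper:
    # Hypothetical sketch: forwards extra *args/**kwargs to the wrapped
    # optimizer's step(), and only runs the closure (to accumulate gradients)
    # when the actual optimizer step is skipped.
    def __init__(self, optimizer):
        self._optimizer = optimizer

    def step(self, *args, closure=None, make_optimizer_step=True, **kwargs):
        if make_optimizer_step:
            self._optimizer.step(*args, closure=closure, **kwargs)
        elif closure is not None:
            closure()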

The reviewer replied:

ok, understood. Could we name it something else, like **extra_kwargs or similar, so it does not look like a required arg?

Resolved review threads: pytorch_lightning/trainer/training_loop.py, pytorch_lightning/overrides/data_parallel.py
@SeanNaren SeanNaren self-requested a review December 6, 2020 23:05
@SeanNaren (Contributor) commented:

Thanks for the review @awaelchli, I think we can reduce the priority and focus on getting this right! Seems like we need to test a lot more.

@justusschock (Member) left a comment

Besides the already mentioned issues, I really like it. So no other requests from my side.

@SeanNaren (Contributor) commented:

Just to put my thoughts for discussion:

  • The current PR adds too many edge cases for my liking. We should just do:
if manual_optimizations:
    model.add_backward_hooks()

We may not even need to worry about modifying the code too much other than defining the additional reduce hook function; for a first pass it would be good to check whether calling prepare_for_backwards twice even changes any behaviour: https://github.com/pytorch/pytorch/blob/v1.4.0/torch/csrc/distributed/c10d/reducer.cpp#L496
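
For reference, a hedged sketch of what such a hook could look like, based on the DDP reducer internals linked above (PyTorch 1.x APIs); the function name and exact call site are assumptions, not necessarily what was merged in this PR.

import torch
from torch.nn.parallel.distributed import _find_tensors

def add_backward_hooks(ddp_model, output):
    # Re-arm the reducer after training_step has run inside the DDP forward,
    # so that a backward() issued later by manual optimization still triggers
    # the gradient all-reduce. This mirrors what DistributedDataParallel's own
    # forward() does at the end of a normal forward pass.
    if torch.is_grad_enabled() and ddp_model.require_backward_grad_sync:
        if ddp_model.find_unused_parameters:
            ddp_model.reducer.prepare_for_backward(list(_find_tensors(output)))
        else:
            ddp_model.reducer.prepare_for_backward([])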

@tchaton (Author) commented Dec 7, 2020

> The tests pass, but the loss is NaN; please check the gradient flow in the model again.

In manual optimization we don't return the loss, so we get loss=nan in the progress bar. We need to take care of this. When calling self.log("train_loss", loss) it is fine :)
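
To illustrate that last point, a minimal manual-optimization training_step around this release might look like the following (compute_loss is a hypothetical helper); logging the loss explicitly is what makes the progress bar show a real value.

def training_step(self, batch, batch_idx):
    # manual optimization: nothing is returned, so log the loss explicitly
    opt = self.optimizers()
    loss = self.compute_loss(batch)  # hypothetical helper
    self.manual_backward(loss, opt)
    opt.step()
    opt.zero_grad()
    self.log("train_loss", loss, prog_bar=True)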

@SeanNaren (Contributor) commented Dec 7, 2020

@awaelchli After offline discussion I think we should merge this PR as it stands, and create a follow-up issue to refactor the fix to be less intrusive whilst trying to sync upstream with the native DDP implementation. I think that is better in the interim, since currently DDP is broken with manual optim. Thoughts?

@awaelchli (Contributor) left a comment

ok, I will remove the request for changes

@tchaton tchaton merged commit 2393474 into master Dec 7, 2020
@tchaton tchaton deleted the bug/fixfix_ddp_manual branch December 7, 2020 19:32
@azraelkuan commented:
Hi, when can we set find_unused_parameters=False when using DDP with manual_optimization?
In plain PyTorch, we do not need to set this, right?

@awaelchli (Contributor) replied:

Docs

> In plain PyTorch, we do not need to set this, right?

If your model/optimization needs it in pytorch, it will need it in Lightning and vice versa.
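
For completeness, a hedged sketch of how find_unused_parameters could be switched off around the time of this PR, assuming the DDP plugin forwards its keyword arguments to DistributedDataParallel; the exact plugin API and import path may differ by version, so check the docs mentioned above.

from pytorch_lightning import Trainer
from pytorch_lightning.plugins import DDPPlugin  # import path is an assumption

trainer = Trainer(
    gpus=2,
    accelerator="ddp",
    plugins=[DDPPlugin(find_unused_parameters=False)],
)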

Labels: bug (Something isn't working), distributed (Generic distributed-related topic), priority: 0 (High priority task)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

manual_optimization does not work with ddp
9 participants