
Fix: gather_all_tensors cross GPUs in DDP #3319

Merged: 2 commits merged into Lightning-AI:master from bugfix/3253_gather-all on Sep 3, 2020

Conversation

@ShomyLiu (Contributor) commented Sep 2, 2020

What does this PR do?

This PR fixes gather_all_tensors so that the gathered result buffers no longer share the same underlying storage across all GPUs.

Fixes #3253
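For context, here is a minimal sketch of the aliasing issue behind #3253 (the variable names are illustrative, not taken from the Lightning code): multiplying a Python list replicates references to a single tensor, so every gather buffer shares one storage, whereas a list comprehension allocates independent tensors.

```python
import torch

world_size = 4
result = torch.zeros(3)

# Old pattern: one tensor, world_size references to it (shared storage).
shared = world_size * [torch.zeros_like(result)]
# Fixed pattern: world_size independent tensors.
independent = [torch.zeros_like(result) for _ in range(world_size)]

shared[0].add_(1.0)
independent[0].add_(1.0)

print(shared[1])       # tensor([1., 1., 1.])  mutated through the alias
print(independent[1])  # tensor([0., 0., 0.])  unaffected
```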

Before submitting

  • Was this discussed/approved via a GitHub issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together? Otherwise, we ask you to create a separate PR for every change.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

codecov bot commented Sep 2, 2020

Codecov Report

Merging #3319 into master will not change coverage.
The diff coverage is 0%.

@@          Coverage Diff           @@
##           master   #3319   +/-   ##
======================================
  Coverage      90%     90%           
======================================
  Files          90      90           
  Lines        8158    8158           
======================================
  Hits         7362    7362           
  Misses        796     796           

@s-rog (Contributor) left a comment
LGTM

```diff
@@ -301,7 +301,7 @@ def gather_all_tensors_if_available(result: Union[torch.Tensor],

         world_size = torch.distributed.get_world_size(group)

-        gathered_result = world_size * [torch.zeros_like(result)]
+        gathered_result = [torch.zeros_like(result) for _ in range(world_size)]
```
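For context on the hunk above: this buffer list is what torch.distributed.all_gather fills with every rank's tensor, so aliased buffers would all point at the same storage. A minimal sketch of that call pattern, assuming a process group has already been initialized (the helper name here is hypothetical, not the actual Lightning function):

```python
import torch
import torch.distributed as dist


def gather_across_ranks(result: torch.Tensor, group=None):
    """Gather `result` from every rank into a list of independent tensors."""
    if group is None:
        group = dist.group.WORLD
    world_size = dist.get_world_size(group)
    # One independent zero buffer per rank; `world_size * [tensor]` would
    # alias a single buffer and the per-rank results would collide.
    gathered = [torch.zeros_like(result) for _ in range(world_size)]
    dist.all_gather(gathered, result, group)
    return gathered
```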
A Member left a review comment on the changed line:
Could you add a test to tests/metrics/test_converters.py that actually test that this function does what it is expected to do?

@ShomyLiu (Contributor, Author) replied:

Yeah, thanks for the advice, I will add a test for this PR.
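For illustration, a rough sketch of what such a test could look like on CPU with the gloo backend (this is not the test that was actually added in this PR; the port number and function names are placeholders, and it exercises the fixed gather pattern directly rather than importing the Lightning helper):

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def _worker(rank: int, world_size: int):
    # Minimal single-node process-group setup for the test.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Each rank contributes a distinct value so any aliasing would be visible.
    tensor = torch.full((3,), float(rank))
    gathered = [torch.zeros_like(tensor) for _ in range(world_size)]
    dist.all_gather(gathered, tensor)

    # Buffers must be independent and hold each rank's contribution.
    for r, t in enumerate(gathered):
        assert torch.allclose(t, torch.full((3,), float(r)))

    dist.destroy_process_group()


def test_gather_keeps_per_rank_values():
    mp.spawn(_worker, args=(2,), nprocs=2)
```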

mergify bot requested a review from a team on September 2, 2020 07:33
mergify bot requested a review from a team on September 2, 2020 09:35
@Borda requested a review from SkafteNicki on September 2, 2020 12:22
@Borda (Member) left a comment

waiting for a test to be added :]

mergify bot requested a review from a team on September 2, 2020 12:24
@awaelchli added the bug (Something isn't working), distributed (Generic distributed-related topic), and Metrics labels on Sep 2, 2020
pep8speaks commented Sep 3, 2020

Hello @ShomyLiu! Thanks for updating this PR.

Line 179:44: E203 whitespace before ','

Comment last updated at 2020-09-03 09:34:04 UTC

@ShomyLiu (Contributor, Author) commented Sep 3, 2020

@SkafteNicki Hi, I found that the commit history got disordered while I was preparing to add a test case to my PR.
Do I need to close this PR and open a new one to get a clean commit history?

@SkafteNicki (Member) commented

@Borda, can you help fix the commit history (you seem to be good at git :])?

@SkafteNicki mentioned this pull request on Sep 3, 2020
@justusschock (Member) commented

@ShomyLiu, can you enable push access for maintainers? I tried to push to your branch and the push was rejected.

@ShomyLiu (Contributor, Author) commented Sep 3, 2020

@justusschock Hi, I have invited you as a collaborator on my fork, so you should be able to push to the branch yourself now.
Thanks!

@justusschock force-pushed the bugfix/3253_gather-all branch from 7e0596d to ba8afb6 on September 3, 2020 09:33
@justusschock (Member) commented
I rebased onto master, so this should be fine now, I think :)

@ShomyLiu (Contributor, Author) commented Sep 3, 2020

@justusschock Yeah, thanks for your effort. This is my first PR to Lightning, and I wasn't familiar with the whole merge process before.

@SkafteNicki (Member) commented

Just waiting for the tests to pass, and then we can get this merged :]

@SkafteNicki (Member) left a comment
👍

@SkafteNicki added the ready (PRs ready to be merged) label on Sep 3, 2020
mergify bot requested a review from a team on September 3, 2020 10:20
@Borda (Member) left a comment
LGTM =)

@SkafteNicki merged commit d521c1b into Lightning-AI:master on Sep 3, 2020
@justusschock deleted the bugfix/3253_gather-all branch on September 3, 2020 12:49
Labels: bug (Something isn't working), distributed (Generic distributed-related topic), ready (PRs ready to be merged)

Projects: None yet

Development

Successfully merging this pull request may close these issues:

gather_all_tensors_if_available share the same underlying storage for all GPUs

7 participants