Fixing logging and errors blocking multi GPU training of Torch models #1509
Conversation
I am afraid I made some formatting error according to Black. I don't really know what it is. Can anyone please advise?
Hi, you can find the instructions here; you need to install the …
Ok, hopefully this time it works. :-)
Codecov Report
Base: 94.06% // Head: 94.02% // Decreases project coverage by -0.05% ⚠️
Additional details and impacted files
@@ Coverage Diff @@
## master #1509 +/- ##
==========================================
- Coverage 94.06% 94.02% -0.05%
==========================================
Files 125 125
Lines 11095 11081 -14
==========================================
- Hits 10437 10419 -18
- Misses 658 662 +4
LGTM, @dennisbader do you want to have a look?
loss,
batch_size=train_batch[0].shape[0],
prog_bar=True,
sync_dist=True,
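For context, a minimal sketch of where these keyword arguments typically sit, namely a `self.log(...)` call inside a LightningModule's `training_step`; the class, layer sizes, and metric name below are illustrative assumptions, not the actual Darts code:

```python
import torch
import pytorch_lightning as pl


class TinyForecastingModule(pl.LightningModule):
    """Illustrative sketch only -- not the Darts source code."""

    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(8, 1)
        self.criterion = torch.nn.MSELoss()

    def training_step(self, train_batch, batch_idx):
        x, y = train_batch
        loss = self.criterion(self.net(x), y)
        # sync_dist=True asks Lightning to reduce (average) the metric across
        # all processes before logging, so every rank reports the same value.
        self.log(
            "train_loss",
            loss,
            batch_size=train_batch[0].shape[0],
            prog_bar=True,
            sync_dist=True,
        )
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```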
The PTL doc says: "Use with care as this may lead to a significant communication overhead."
Do we have any idea if/when this could cause issues?
So far, in practical testing on 8 GPUs, I have noticed no adverse effects. That said, it also depends on the distribution strategy. I used the default ddp_spawn, as mentioned.
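For reference, a hedged sketch of what selecting the strategy explicitly looks like when constructing a Lightning Trainer; the device and epoch counts are placeholders, not values taken from this PR:

```python
import pytorch_lightning as pl

# Illustrative only: "ddp_spawn" is the default strategy mentioned above;
# "ddp" is a common alternative with lower process-spawning overhead.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=8,            # number of GPUs to train on
    strategy="ddp_spawn",
    max_epochs=10,
)
```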
Maybe this advice should go somewhere in the documentation: #1287 (comment)
That sounds like a good idea, yes @solalatus.
…unit8co#1509)
* added fix for multi GPU as per https://pytorch-lightning.readthedocs.io/en/stable/extensions/logging.html#automatic-logging
* trying to add complete logging in case of distributed training to avoid deadlock
* fixing the logging on epoch end for multi-GPU training
* Black fixes for formatting errors
* Added description of multi GPU setup to User Guide.

Co-authored-by: Julien Herzen <julien@unit8.co>
Co-authored-by: madtoinou <32447896+madtoinou@users.noreply.github.com>
Fixes #1287, fixes #1385
Summary
A very small fix that enables multi-GPU training to run without any problems.
Other Information
As the original Lightning documentation states here, in the case of multiple GPUs one has to choose how the logging is synchronized between them. A first attempt with rank_zero_only=True is not a good solution, since it can lead to silent exceptions and break every logging facility, including the progress bar and TensorBoard. Hence the sync_dist=True option was chosen, which waits for, collects, and averages the metrics to be logged from all GPUs. It works stably and has no observable effect on single-GPU training.
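As a usage illustration (a hedged assumption about the user-facing API rather than something spelled out in this PR), multi-GPU training for a Darts torch model would then be enabled by forwarding the usual Trainer options through pl_trainer_kwargs:

```python
from darts.datasets import AirPassengersDataset
from darts.models import NBEATSModel

series = AirPassengersDataset().load()

# Illustrative assumption: pl_trainer_kwargs is forwarded to the underlying
# PyTorch Lightning Trainer, so the usual accelerator/devices options apply.
model = NBEATSModel(
    input_chunk_length=24,
    output_chunk_length=12,
    n_epochs=10,
    pl_trainer_kwargs={
        "accelerator": "gpu",
        "devices": -1,  # use all visible GPUs
    },
)
model.fit(series)
```

With the change in this PR, the metrics logged during such a multi-GPU run are synchronized across the devices instead of silently breaking the progress bar and loggers.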