
Fixing logging and errors blocking multi GPU training of Torch models #1509

Merged: 11 commits merged into master on Feb 21, 2023
Conversation

@solalatus (Contributor) commented on Jan 23, 2023

Fixes #1287, fixes #1385

Summary

A very small fix that enables multi-GPU training to run without any problems.

Other Information

As the original Lightning documentation states here, in the case of multiple GPUs one has to choose how logging is synchronized between them. As a first attempt, rank_zero_only=True is not a good solution, since it can lead to silent exceptions and break every logging facility, including the progress bar and TensorBoard. Hence the sync_dist=True option was chosen, which waits for, collects, and averages the metrics to be logged from all GPUs. It works stably and has no observable effect on single-GPU training.
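For illustration only, here is a minimal sketch (not the actual Darts code changed in this PR) of the logging pattern described above: a self-contained LightningModule whose training_step logs the loss with sync_dist=True, so that PyTorch Lightning reduces (averages) the logged metric across all GPUs/processes. The module name, layer sizes, and loss function are purely illustrative assumptions.

```python
import torch
import pytorch_lightning as pl


class TinyRegressor(pl.LightningModule):
    """Toy module illustrating the sync_dist=True logging pattern."""

    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(8, 1)

    def forward(self, x):
        return self.net(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self(x), y)
        self.log(
            "train_loss",
            loss,
            batch_size=x.shape[0],
            prog_bar=True,
            sync_dist=True,  # wait for, collect, and average the metric across devices
        )
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```

Because sync_dist only adds a cross-device reduction of the logged value, single-GPU runs are unaffected.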

@solalatus (Contributor, Author)

I am afraid I made some formatting error according to Black, but I don't really know what it is. Can anyone please advise?

@madtoinou (Collaborator)

Hi,

You can find the instructions here; you need to install the dev-all requirements, run pre-commit install, and call it on your branch.

@solalatus (Contributor, Author)

Ok, hopefully this time it works. :-)

@codecov-commenter commented on Jan 25, 2023

Codecov Report

Base: 94.06% // Head: 94.02% // Decreases project coverage by 0.05% ⚠️

Coverage data is based on head (f2fbf8b) compared to base (690b6f4).
Patch coverage: 100.00% of modified lines in pull request are covered.


Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1509      +/-   ##
==========================================
- Coverage   94.06%   94.02%   -0.05%     
==========================================
  Files         125      125              
  Lines       11095    11081      -14     
==========================================
- Hits        10437    10419      -18     
- Misses        658      662       +4     
Impacted Files Coverage Δ
darts/models/forecasting/pl_forecasting_module.py 93.79% <100.00%> (ø)
darts/utils/data/tabularization.py 99.27% <0.00%> (-0.73%) ⬇️
darts/timeseries.py 92.14% <0.00%> (-0.23%) ⬇️
darts/ad/anomaly_model/filtering_am.py 91.93% <0.00%> (-0.13%) ⬇️
...arts/models/forecasting/torch_forecasting_model.py 89.52% <0.00%> (-0.05%) ⬇️
darts/models/forecasting/block_rnn_model.py 98.24% <0.00%> (-0.04%) ⬇️
darts/models/forecasting/nhits.py 99.27% <0.00%> (-0.01%) ⬇️
darts/datasets/__init__.py 100.00% <0.00%> (ø)


@hrzn (Contributor) left a comment

LGTM, @dennisbader do you want to have a look?

loss,
batch_size=train_batch[0].shape[0],
prog_bar=True,
sync_dist=True,
Review comment (Contributor):

The PTL doc says: "Use with care as this may lead to a significant communication overhead."
Do we have any idea if/when this could cause issues?

Reply (Contributor, Author):

So far, in practical testing on 8 GPUs, I have noticed no adverse effects. That said, it also depends on the distribution strategy; I used the default ddp_spawn, as mentioned.
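For context, a hedged sketch of how multi-GPU training can be requested from a Darts Torch model: pl_trainer_kwargs is forwarded to the PyTorch Lightning Trainer, so accelerator, devices, and (optionally) strategy can be set there. The model choice and hyperparameters below are illustrative assumptions, not part of this PR.

```python
from darts.datasets import AirPassengersDataset
from darts.models import NBEATSModel

series = AirPassengersDataset().load()

model = NBEATSModel(
    input_chunk_length=24,
    output_chunk_length=12,
    n_epochs=5,
    pl_trainer_kwargs={
        "accelerator": "gpu",
        "devices": -1,              # use all available GPUs
        # "strategy": "ddp_spawn",  # distribution strategy; ddp_spawn was used in the tests above
    },
)

model.fit(series)
```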

@solalatus (Contributor, Author)

Maybe this advice should go somewhere in the documentation: #1287 (comment)

@hrzn (Contributor) commented on Feb 16, 2023

Maybe this advice should go somewhere in the documentation: #1287 (comment)

That sounds like a good idea, yes @solalatus.
Would you agree to add a short new subsection about multi-GPU usage to the GPU/TPU page of the userguide?
Thanks!

@solalatus (Contributor, Author) commented on Feb 18, 2023

@hrzn I have added a description section to the userguide here, please have a look!

@hrzn hrzn merged commit 955e2b5 into unit8co:master Feb 21, 2023
alexcolpitts96 pushed a commit to alexcolpitts96/darts that referenced this pull request May 31, 2023
…unit8co#1509)

* added fix for multi GPU as per https://pytorch-lightning.readthedocs.io/en/stable/extensions/logging.html#automatic-logging

* trying to add complete logging in case of distributed to avoid deadlock

* fixing the logging on epoch end for multigpu training

* Black fixes for formatting errors

* Added description of multi GPU setup to User Guide.

---------

Co-authored-by: Julien Herzen <julien@unit8.co>
Co-authored-by: madtoinou <32447896+madtoinou@users.noreply.github.com>
Development

Successfully merging this pull request may close these issues:

[BUG] Enabling multiple gpu causes AssertionError MULTI_GPU DATA_PARALLEL