Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sync module states during non-fit #17370

Merged
merged 8 commits into from
Apr 15, 2023
Merged

Sync module states during non-fit #17370

merged 8 commits into from
Apr 15, 2023

Conversation

carmocca
Copy link
Contributor

@carmocca carmocca commented Apr 14, 2023

What does this PR do?

Using DDP, we only wrap with DistributedDataParallel with trainer.fit.
Since the wrapper normally takes care of synchronizing the parameters and buffers, this was missing during evaluation.

This is not a big bug because most use cases will load a checkpoint during evaluation.

FSDP doesn't have this issue because we always use its wrapper.
Fabric doesn't have this issue because there's no logic around fitting vs non-fitting

cc @Borda @justusschock @awaelchli

@carmocca carmocca added bug Something isn't working strategy: ddp DistributedDataParallel labels Apr 14, 2023
@carmocca carmocca added this to the 2.0.x milestone Apr 14, 2023
@carmocca carmocca self-assigned this Apr 14, 2023
@carmocca carmocca added the pl Generic label for PyTorch Lightning package label Apr 14, 2023
@carmocca carmocca marked this pull request as ready for review April 14, 2023 01:59
@github-actions
Copy link
Contributor

github-actions bot commented Apr 14, 2023

⚡ Required checks status: All passing 🟢

Groups summary

🟢 pytorch_lightning: Tests workflow
Check ID Status
pl-cpu (macOS-11, lightning, 3.8, 1.11) success
pl-cpu (macOS-11, lightning, 3.9, 1.12) success
pl-cpu (macOS-11, lightning, 3.10, 1.13) success
pl-cpu (macOS-11, lightning, 3.10, 2.0) success
pl-cpu (macOS-11, lightning, 3.8, 1.11, oldest) success
pl-cpu (ubuntu-20.04, lightning, 3.8, 1.11) success
pl-cpu (ubuntu-20.04, lightning, 3.9, 1.12) success
pl-cpu (ubuntu-20.04, lightning, 3.10, 1.13) success
pl-cpu (ubuntu-20.04, lightning, 3.10, 2.0) success
pl-cpu (ubuntu-20.04, lightning, 3.8, 1.11, oldest) success
pl-cpu (windows-2022, lightning, 3.8, 1.11) success
pl-cpu (windows-2022, lightning, 3.9, 1.12) success
pl-cpu (windows-2022, lightning, 3.10, 1.13) success
pl-cpu (windows-2022, lightning, 3.10, 2.0) success
pl-cpu (windows-2022, lightning, 3.8, 1.11, oldest) success
pl-cpu (macOS-11, pytorch, 3.8, 1.13) success
pl-cpu (ubuntu-20.04, pytorch, 3.8, 1.13) success
pl-cpu (windows-2022, pytorch, 3.8, 1.13) success

These checks are required after the changes to src/lightning/pytorch/overrides/distributed.py, src/lightning/pytorch/strategies/ddp.py, src/lightning/pytorch/utilities/distributed.py, tests/tests_pytorch/overrides/test_distributed.py, tests/tests_pytorch/utilities/test_distributed.py.

🟢 pytorch_lightning: Azure GPU
Check ID Status
pytorch-lightning (GPUs) success

These checks are required after the changes to src/lightning/pytorch/overrides/distributed.py, src/lightning/pytorch/strategies/ddp.py, src/lightning/pytorch/utilities/distributed.py, tests/tests_pytorch/overrides/test_distributed.py, tests/tests_pytorch/utilities/test_distributed.py.

🟢 pytorch_lightning: Docs
Check ID Status
make-doctest (pytorch) success
make-html (pytorch) success

These checks are required after the changes to src/lightning/pytorch/overrides/distributed.py, src/lightning/pytorch/strategies/ddp.py, src/lightning/pytorch/utilities/distributed.py, docs/source-pytorch/api_references.rst.

🟢 mypy
Check ID Status
mypy success

These checks are required after the changes to src/lightning/pytorch/overrides/distributed.py, src/lightning/pytorch/strategies/ddp.py, src/lightning/pytorch/utilities/distributed.py.

🟢 install
Check ID Status
install-pkg (ubuntu-22.04, app, 3.8) success
install-pkg (ubuntu-22.04, app, 3.10) success
install-pkg (ubuntu-22.04, fabric, 3.8) success
install-pkg (ubuntu-22.04, fabric, 3.10) success
install-pkg (ubuntu-22.04, pytorch, 3.8) success
install-pkg (ubuntu-22.04, pytorch, 3.10) success
install-pkg (ubuntu-22.04, lightning, 3.8) success
install-pkg (ubuntu-22.04, lightning, 3.10) success
install-pkg (ubuntu-22.04, notset, 3.8) success
install-pkg (ubuntu-22.04, notset, 3.10) success
install-pkg (macOS-12, app, 3.8) success
install-pkg (macOS-12, app, 3.10) success
install-pkg (macOS-12, fabric, 3.8) success
install-pkg (macOS-12, fabric, 3.10) success
install-pkg (macOS-12, pytorch, 3.8) success
install-pkg (macOS-12, pytorch, 3.10) success
install-pkg (macOS-12, lightning, 3.8) success
install-pkg (macOS-12, lightning, 3.10) success
install-pkg (macOS-12, notset, 3.8) success
install-pkg (macOS-12, notset, 3.10) success
install-pkg (windows-2022, app, 3.8) success
install-pkg (windows-2022, app, 3.10) success
install-pkg (windows-2022, fabric, 3.8) success
install-pkg (windows-2022, fabric, 3.10) success
install-pkg (windows-2022, pytorch, 3.8) success
install-pkg (windows-2022, pytorch, 3.10) success
install-pkg (windows-2022, lightning, 3.8) success
install-pkg (windows-2022, lightning, 3.10) success
install-pkg (windows-2022, notset, 3.8) success
install-pkg (windows-2022, notset, 3.10) success

These checks are required after the changes to src/lightning/pytorch/overrides/distributed.py, src/lightning/pytorch/strategies/ddp.py, src/lightning/pytorch/utilities/distributed.py.


Thank you for your contribution! 💜

Note
This comment is automatically generated and updates for 60 minutes every 180 seconds. If you have any other questions, contact carmocca for help.

src/lightning/pytorch/overrides/distributed.py Outdated Show resolved Hide resolved
src/lightning/pytorch/overrides/distributed.py Outdated Show resolved Hide resolved
src/lightning/pytorch/overrides/distributed.py Outdated Show resolved Hide resolved
src/lightning/pytorch/overrides/distributed.py Outdated Show resolved Hide resolved
@mergify mergify bot added the ready PRs ready to be merged label Apr 14, 2023
@carmocca carmocca enabled auto-merge (squash) April 15, 2023 02:11
@carmocca carmocca merged commit 97a6186 into master Apr 15, 2023
@carmocca carmocca deleted the carmocca/module-states branch April 15, 2023 02:35
Borda pushed a commit that referenced this pull request Apr 24, 2023
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
(cherry picked from commit 97a6186)
Borda pushed a commit that referenced this pull request Apr 24, 2023
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
(cherry picked from commit 97a6186)
lantiga pushed a commit that referenced this pull request Apr 24, 2023
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
(cherry picked from commit 97a6186)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working pl Generic label for PyTorch Lightning package ready PRs ready to be merged strategy: ddp DistributedDataParallel
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants