Sync module states during non-fit #17370
Conversation
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com> (cherry picked from commit 97a6186)
What does this PR do?
Using DDP, we only wrap the model with `DistributedDataParallel` in `trainer.fit`. Since the wrapper normally takes care of synchronizing the parameters and buffers across ranks, this synchronization was missing during evaluation (`validate`, `test`, `predict`).
This is not a big bug because most use cases will load a checkpoint during evaluation.
FSDP doesn't have this issue because we always use its wrapper.
Fabric doesn't have this issue because there's no logic around fitting vs. non-fitting.
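
For context, a minimal sketch of what syncing module states across ranks can look like, assuming a `torch.distributed` process group is already initialized. The helper name `sync_module_states` and the broadcast-from-rank-0 approach are illustrative only, not the PR's exact implementation:

```python
import itertools

import torch
import torch.distributed as dist


def sync_module_states(module: torch.nn.Module, src: int = 0) -> None:
    """Broadcast parameters and buffers from rank ``src`` to all other ranks.

    Illustrative helper only; assumes the default process group is initialized.
    """
    if not dist.is_available() or not dist.is_initialized():
        return
    for tensor in itertools.chain(module.parameters(), module.buffers()):
        # broadcast overwrites the tensor in place on every non-source rank
        dist.broadcast(tensor.data, src=src)
```

This mirrors the effect that wrapping with `DistributedDataParallel` has at construction time: after the broadcast, all ranks hold the same weights even though the model was never wrapped for the non-fit entry points.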
cc @Borda @justusschock @awaelchli