Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove deadlock detection / process reconciliation logic #16204

Merged
merged 2 commits into from
Jan 18, 2023

Conversation

awaelchli
Copy link
Contributor

@awaelchli awaelchli commented Dec 26, 2022

What does this PR do?

Deadlock detection / process reconciliation is a feature of the DDP strategy family. It is not documented, but it is turned on by default when ddp strategies are used, unless PL_RECONCILIATE_PROCESS=0 is set.

  • It has reliability issues, such as that it relies on the current working directory be shared among all nodes, which is often the case but not always.
  • TorchElastic already has mechanisms to detect failing processes and can prevent a hang, and can restart the job automatically. TorchElastic has been upstreamed to PyTorch recently.
  • The feature is implemented and tested only for the DDP strategy. Others who inherit from DDP are not tested. Other strategies that inherit from ParallelStrategy do not have this implemented.
  • With process reconciliation turned off, an exception that occurs only on one rank but not others will lead to a hang, but the error message surfaces regardless.
  • The main benefit of the feature is that processes get terminated immediately and don't hang. This could save you some money for example when running in cloud instances.

Part of #16410
Closes #12797

Does your PR introduce any breaking changes? If yes, please list them.

Yes, for everyone using the ddp strategy, and anyone else that enables PL_RECONCILIATE_PROCESS=1, the feature will no longer work. However, errors

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

I made sure I had fun coding 🙃

cc @Borda @justusschock @awaelchli

@github-actions github-actions bot added the pl Generic label for PyTorch Lightning package label Dec 26, 2022
@lantiga lantiga deleted the branch lite/debug January 5, 2023 07:07
@lantiga lantiga closed this Jan 5, 2023
@awaelchli awaelchli reopened this Jan 9, 2023
@carmocca carmocca force-pushed the lite/debug branch 2 times, most recently from cafbc0c to f7a4b28 Compare January 17, 2023 13:06
@carmocca carmocca force-pushed the lite/debug-remove-process-reconciliation branch from a37c43d to a043ba8 Compare January 17, 2023 13:08
@carmocca
Copy link
Contributor

I rebased this

@awaelchli awaelchli added refactor strategy: ddp DistributedDataParallel labels Jan 17, 2023
@awaelchli awaelchli marked this pull request as ready for review January 17, 2023 13:23
@awaelchli awaelchli marked this pull request as draft January 17, 2023 13:24
@github-actions
Copy link
Contributor

github-actions bot commented Jan 17, 2023

⚡ Required checks status: All passing 🟢

Groups summary

🟢 pytorch_lightning: Tests workflow
Check ID Status
pl-cpu (macOS-11, pytorch, 3.8, 1.11) success
pl-cpu (macOS-11, pytorch, 3.9, 1.12) success
pl-cpu (macOS-11, pytorch, 3.10, 1.13) success
pl-cpu (macOS-11, pytorch, 3.8, 1.10, oldest) success
pl-cpu (ubuntu-20.04, pytorch, 3.8, 1.10) success
pl-cpu (ubuntu-20.04, pytorch, 3.9, 1.11) success
pl-cpu (ubuntu-20.04, pytorch, 3.10, 1.12) success
pl-cpu (ubuntu-20.04, pytorch, 3.10, 1.13) success
pl-cpu (ubuntu-20.04, pytorch, 3.7, 1.10, oldest) success
pl-cpu (windows-2022, pytorch, 3.9, 1.11) success
pl-cpu (windows-2022, pytorch, 3.10, 1.12) success
pl-cpu (windows-2022, pytorch, 3.10, 1.13) success
pl-cpu (windows-2022, pytorch, 3.7, 1.10, oldest) success
pl-cpu (slow, macOS-11, pytorch, 3.7, 1.11) success
pl-cpu (slow, ubuntu-20.04, pytorch, 3.7, 1.11) success
pl-cpu (slow, windows-2022, pytorch, 3.7, 1.11) success
pl-cpu (macOS-11, lightning, 3.8, 1.13) success
pl-cpu (ubuntu-20.04, lightning, 3.8, 1.13) success
pl-cpu (windows-2022, lightning, 3.8, 1.13) success

These checks are required after the changes to src/pytorch_lightning/strategies/bagua.py, src/pytorch_lightning/strategies/ddp.py, src/pytorch_lightning/strategies/fully_sharded_native.py, src/pytorch_lightning/strategies/parallel.py, src/pytorch_lightning/strategies/sharded.py, src/pytorch_lightning/trainer/call.py, src/pytorch_lightning/utilities/exceptions.py, tests/tests_pytorch/trainer/test_trainer.py.

🟢 pytorch_lightning: Azure GPU
Check ID Status
pytorch-lightning (GPUs) success

These checks are required after the changes to src/pytorch_lightning/strategies/bagua.py, src/pytorch_lightning/strategies/ddp.py, src/pytorch_lightning/strategies/fully_sharded_native.py, src/pytorch_lightning/strategies/parallel.py, src/pytorch_lightning/strategies/sharded.py, src/pytorch_lightning/trainer/call.py, src/pytorch_lightning/utilities/exceptions.py, tests/tests_pytorch/trainer/test_trainer.py.

🟢 pytorch_lightning: Azure HPU
Check ID Status
pytorch-lightning (HPUs) success

These checks are required after the changes to src/pytorch_lightning/strategies/bagua.py, src/pytorch_lightning/strategies/ddp.py, src/pytorch_lightning/strategies/fully_sharded_native.py, src/pytorch_lightning/strategies/parallel.py, src/pytorch_lightning/strategies/sharded.py, src/pytorch_lightning/trainer/call.py, src/pytorch_lightning/utilities/exceptions.py, tests/tests_pytorch/trainer/test_trainer.py.

🟢 pytorch_lightning: Azure IPU
Check ID Status
pytorch-lightning (IPUs) success

These checks are required after the changes to src/pytorch_lightning/strategies/bagua.py, src/pytorch_lightning/strategies/ddp.py, src/pytorch_lightning/strategies/fully_sharded_native.py, src/pytorch_lightning/strategies/parallel.py, src/pytorch_lightning/strategies/sharded.py, src/pytorch_lightning/trainer/call.py, src/pytorch_lightning/utilities/exceptions.py, tests/tests_pytorch/trainer/test_trainer.py.

🟢 pytorch_lightning: Docs
Check ID Status
make-doctest (pytorch) success
make-html (pytorch) success

These checks are required after the changes to src/pytorch_lightning/strategies/bagua.py, src/pytorch_lightning/strategies/ddp.py, src/pytorch_lightning/strategies/fully_sharded_native.py, src/pytorch_lightning/strategies/parallel.py, src/pytorch_lightning/strategies/sharded.py, src/pytorch_lightning/trainer/call.py, src/pytorch_lightning/utilities/exceptions.py.

🟢 mypy
Check ID Status
mypy success

These checks are required after the changes to src/pytorch_lightning/strategies/bagua.py, src/pytorch_lightning/strategies/ddp.py, src/pytorch_lightning/strategies/fully_sharded_native.py, src/pytorch_lightning/strategies/parallel.py, src/pytorch_lightning/strategies/sharded.py, src/pytorch_lightning/trainer/call.py, src/pytorch_lightning/utilities/exceptions.py.

🟢 install
Check ID Status
install-pkg (ubuntu-22.04, app, 3.7) success
install-pkg (ubuntu-22.04, app, 3.10) success
install-pkg (ubuntu-22.04, fabric, 3.7) success
install-pkg (ubuntu-22.04, fabric, 3.10) success
install-pkg (ubuntu-22.04, pytorch, 3.7) success
install-pkg (ubuntu-22.04, pytorch, 3.10) success
install-pkg (ubuntu-22.04, lightning, 3.7) success
install-pkg (ubuntu-22.04, lightning, 3.10) success
install-pkg (ubuntu-22.04, notset, 3.7) success
install-pkg (ubuntu-22.04, notset, 3.10) success
install-pkg (macOS-12, app, 3.7) success
install-pkg (macOS-12, app, 3.10) success
install-pkg (macOS-12, fabric, 3.7) success
install-pkg (macOS-12, fabric, 3.10) success
install-pkg (macOS-12, pytorch, 3.7) success
install-pkg (macOS-12, pytorch, 3.10) success
install-pkg (macOS-12, lightning, 3.7) success
install-pkg (macOS-12, lightning, 3.10) success
install-pkg (macOS-12, notset, 3.7) success
install-pkg (macOS-12, notset, 3.10) success
install-pkg (windows-2022, app, 3.7) success
install-pkg (windows-2022, app, 3.10) success
install-pkg (windows-2022, fabric, 3.7) success
install-pkg (windows-2022, fabric, 3.10) success
install-pkg (windows-2022, pytorch, 3.7) success
install-pkg (windows-2022, pytorch, 3.10) success
install-pkg (windows-2022, lightning, 3.7) success
install-pkg (windows-2022, lightning, 3.10) success
install-pkg (windows-2022, notset, 3.7) success
install-pkg (windows-2022, notset, 3.10) success

These checks are required after the changes to src/pytorch_lightning/strategies/bagua.py, src/pytorch_lightning/strategies/ddp.py, src/pytorch_lightning/strategies/fully_sharded_native.py, src/pytorch_lightning/strategies/parallel.py, src/pytorch_lightning/strategies/sharded.py, src/pytorch_lightning/trainer/call.py, src/pytorch_lightning/utilities/exceptions.py.


Thank you for your contribution! 💜

Note
This comment is automatically generated and updates for 60 minutes every 180 seconds. If you have any other questions, contact carmocca for help.

@awaelchli awaelchli marked this pull request as ready for review January 17, 2023 13:25
Copy link
Contributor

@carmocca carmocca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pushed a commit with some leftover removals to the fsdp strategy

src/pytorch_lightning/strategies/ddp.py Show resolved Hide resolved
@mergify mergify bot added the ready PRs ready to be merged label Jan 18, 2023
@carmocca carmocca merged commit 4165870 into lite/debug Jan 18, 2023
@carmocca carmocca deleted the lite/debug-remove-process-reconciliation branch January 18, 2023 12:37
@carmocca carmocca added this to the 2.0 milestone Jan 19, 2023
carmocca added a commit that referenced this pull request Jan 19, 2023
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
carmocca added a commit that referenced this pull request Jan 19, 2023
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
carmocca added a commit that referenced this pull request Jan 19, 2023
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
lantiga pushed a commit that referenced this pull request Jan 19, 2023
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pl Generic label for PyTorch Lightning package ready PRs ready to be merged refactor strategy: ddp DistributedDataParallel
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants