Fix multithreading checkpoint loading #17678

Merged
awaelchli merged 14 commits into Lightning-AI:master from the multithreading-chkpt branch on May 31, 2023

Conversation


@Quasar-Kim Quasar-Kim commented May 22, 2023

What does this PR do?

Fixes #17665

Currently, loading a checkpoint does not work when using a multithreading-based strategy (for example, XLA + PJRT). The context manager pl_legacy_patch makes the legacy module lightning.pytorch.utilities.argparse_utils temporarily available within its context by adding it to sys.modules on enter and removing it on exit. Because sys.modules is shared between threads, when several threads use the context manager at the same time, one of them tries to delete an entry that another thread has already deleted. This raises an exception (silently on XLA).

This PR fixes the issue with the following changes:

  • Fixes the broken test test_legacy_ckpt_threading, which always passed even though exceptions were being raised in the threads.
  • Re-introduces locking in the pl_legacy_patch context manager, which was reverted in commit 277b0b8
Before submitting
  • Was this discussed/agreed via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:

Reviewer checklist
  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

@github-actions github-actions bot added the "pl" label (Generic label for PyTorch Lightning package) May 22, 2023
@awaelchli awaelchli changed the title from "Fix multithreading checkpoint loading" to "[WIP] Fix multithreading checkpoint loading" May 31, 2023
@awaelchli awaelchli marked this pull request as ready for review May 31, 2023 12:34
@awaelchli

FYI, I'm marking it as ready for review so that the full CI runs on the changes.

@awaelchli awaelchli added the "bug" (Something isn't working) and "checkpointing" (Related to checkpointing) labels May 31, 2023
@awaelchli awaelchli added this to the 2.0.x milestone May 31, 2023

@awaelchli awaelchli left a comment


Looks great. I checked that the modified test case reproduces the problem (see past commit).

@awaelchli awaelchli changed the title from "[WIP] Fix multithreading checkpoint loading" to "Fix multithreading checkpoint loading" May 31, 2023

@carmocca carmocca left a comment


@awaelchli Do you remember why you removed the lock in #15237? I couldn't find any comment about it.

@mergify mergify bot added the "ready" label (PRs ready to be merged) May 31, 2023
@awaelchli

Me neither. I don't remember. RIP :(

@awaelchli awaelchli merged commit 1307b60 into Lightning-AI:master May 31, 2023
@Quasar-Kim Quasar-Kim deleted the multithreading-chkpt branch May 31, 2023 22:59
Borda pushed a commit that referenced this pull request Jun 2, 2023
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
(cherry picked from commit 1307b60)
lantiga pushed a commit that referenced this pull request Jun 2, 2023
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
(cherry picked from commit 1307b60)
Labels
bug (Something isn't working), checkpointing (Related to checkpointing), pl (Generic label for PyTorch Lightning package), ready (PRs ready to be merged)
Development

Successfully merging this pull request may close these issues.

Cannot load checkpoint when using multithreaded distributed training
3 participants