-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix multithreading checkpoint loading #17678
Fix multithreading checkpoint loading #17678
Conversation
FYI I'm marking it as ready for review to have the full CI running the changes. |
for more information, see https://pre-commit.ci
This reverts commit c38fcb1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great. I checked that the modified test case reproduces the problem (see past commit).
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@awaelchli Do you remember why you removed the lock in #15237? I couldn't find any comment about it.
Me neither. I don't remember. RIP :( |
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: awaelchli <aedu.waelchli@gmail.com> (cherry picked from commit 1307b60)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: awaelchli <aedu.waelchli@gmail.com> (cherry picked from commit 1307b60)
What does this PR do?
Fixes #17665
Currently, loading checkpoint does not work when using a multithreading-based strategy (for example, XLA + PJRT). The context manager
pl_legacy_patch
makes legacy modulelightning.pytorch.utilities.argparse_utils
temporarily available in its context by adding it to thesys.modules
on enter and removing it on exit. But when used with threading, becausesys.modules
is shared between threads, one of them tries to delete already deleted entry, causing exception (but silently on XLA).This PR fixes this issue by introducing following changes:
test_legacy_ckpt_threading
which always passes even though exceptions are raising from threads.pl_legacy_patch
context manager which was reverted in commit 277b0b8Before submitting
PR review
Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:
Reviewer checklist