[core] Use function_actor_manager.lock when deserializing #16278
@rkooo567 Wasn't sure who should look this over, but saw that you were keeping track of this on the issue thread. Feel free to unassign yourself if someone else is more appropriate.
cc @AmeerHajAli Might fix the issue that modin needs the private API for
CC @ijrsvt
@ckw017 Is there an easy way to test this? Or do you only see this arise in pretty complex scenarios?
I can't reproduce it at all locally, but it seems to come up pretty often in GitHub Actions. I'm guessing you might be able to force it to happen if you run a workload that A) has multiple imports to different parts of a large library to trigger the deadlock avoidance and B) spawns workers often (once pandas is imported successfully the first time, the process should be fine for the rest of its lifetime), which seems to be the case with some of the modin unit tests. fwiw I'm fairly certain this does address the bug. The testing above was done with nightly, where the failures happen in roughly 25% of runs. I tried adding a similar fix using 1.3 and had the following consecutive runs:
Thanks for investigating @ckw017 and thanks for tagging me @AmeerHajAli!
I'll be happy to remove the vile hack from Modin's codebase.
This looks good!
This is nice! @devin-petersohn this will solve the issue of pandas not being properly imported in Modin, right?
Hey @suquark can you do the final check before we merge it?
@rkooo567 Yes, this solves the import issue that was difficult to reproduce.
@rkooo567, can you please merge this?
Hey Ameer. Let me have the last sanity check from Ryan. I will ping him in person. Is it urgent to merge?
Yes, we are trying to merge this soon.
@ckw017 Can you merge the latest master? The test_reference_count_2 failure seems spooky. I will also ping Ryan in person so we can move faster.
LGTM. In the future we might have to hack the deserializer a bit to lock only on importing instead of during the whole deserialization process. It could be done by reassigning __builtins__.__import__ or by hooking into pickle's global variable lookup mechanics.
ae066c7 to 9ba507c
Rebased with master and added a TODO linking to the new issue. Reviewers can merge at their discretion if the build passes.
Ah rip, I guess I just learned the fun way that rebasing makes it a lot harder to find the original failing CI run, but I'm pretty sure it was identical to the failure here (timeout). Looks like reference_counting_2 is flaky on Windows at the moment, but if it happens again we can investigate.
* use function_actor_manager.lock when deserializing
* add comment and todo
* better comment
* fix comment
Why are these changes needed?
Addresses #7879 to some extent, and similar to #4718. Most of this is just speculation from what I've scraped together from various error messages, but it looks like there's a chance that worker.deserialize_objects is called at the same time as some other cloudpickle.loads call on a different thread. This can cause importlib's deadlock avoidance to kick in, which prevents certain attributes from being imported correctly.
This PR just acquires the function_actor_manager lock before calling deserialize_objects (which eventually calls cloudpickle.loads). This is pretty much the same way #4718 handled the same issue but with the import_thread.
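As an illustration of the pattern only (not the actual diff), here is a minimal sketch in which two threads that both unpickle take the same lock first. The names deserialization_lock, deserialize_objects and import_thread_load are hypothetical stand-ins, and the standard-library pickle is used in place of cloudpickle so the snippet has no dependencies.

```python
import pickle
import threading
from collections import OrderedDict

# Hypothetical stand-in for function_actor_manager.lock.
deserialization_lock = threading.RLock()


def deserialize_objects(payloads):
    """Unpickle a batch of payloads while holding the shared lock."""
    with deserialization_lock:
        return [pickle.loads(p) for p in payloads]


def import_thread_load(payload, out):
    """Stand-in for a second thread that also unpickles (e.g. an import thread)."""
    with deserialization_lock:
        out.append(pickle.loads(payload))


payloads = [pickle.dumps(OrderedDict(x=i)) for i in range(3)]
other_results = []

# Both threads may trigger imports while unpickling; the shared lock
# ensures that only one of them is inside pickle.loads at a time.
t = threading.Thread(target=import_thread_load, args=(payloads[0], other_results))
t.start()
main_results = deserialize_objects(payloads)
t.join()
```

The trade-off is that the lock is held for the whole deserialization, not just for the imports it triggers, so concurrent deserialization is fully serialized.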
Testing
Motivated by some issues with trying to use Ray Client with Modin. It looks like we introduced a semi-hacky workaround sometime last year on the Modin repo to get around this problem using a private API, but for whatever reason it doesn't seem to work with client. I ran batches of some of the Modin unit tests on GitHub Actions with the following results:
Failures above were due to the pandas.core attribute not found error or some variant. Not actually sure if this is the exact same case as the original issue though, which would require testing against the original reproducer mentioned, but the errors are identical.
Further discussion
There's sort of a question here of whether we should introduce some kind of global import lock (can do some cursed stuff by reassigning __builtins__.__import__) or a lock on pickle.loads calls, since it looks like this problem can happen any time we have multiple threads importing stuff. An even spookier edge case is that if a task that a user submits involves unpickling something as part of a function they wrote, it might also trigger this issue, although the case is pretty unlikely.
Checks
I've run scripts/format.sh to lint the changes in this PR.