Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core] Release function actor lock while waiting for actor class to be loaded by import thread #18175

Merged
merged 11 commits into from
Sep 1, 2021

Conversation

ckw017
Copy link
Member

@ckw017 ckw017 commented Aug 27, 2021

Why are these changes needed?

Actor manager now drops the lock introduced in #16278 which can cause deadlock described in #17980 while it's waiting for actor classes to be loaded.

Related issue number

Closes #16278

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@ckw017 ckw017 changed the title Don't use function_actor_manager lock for worker deserialize [core] Don't use function_actor_manager lock for worker deserialize Aug 27, 2021
@ckw017 ckw017 changed the title [core] Don't use function_actor_manager lock for worker deserialize [core] Don't use function_actor_manager lock during worker deserialization Aug 27, 2021
@ckw017 ckw017 requested a review from ericl August 27, 2021 23:39
@ericl ericl self-assigned this Aug 28, 2021
@ericl
Copy link
Contributor

ericl commented Aug 28, 2021

This seems like a classic case where you should use a condition variable: https://docs.python.org/3/library/threading.html

Would the following fix the issue?

  1. Change the time.sleep in function_manager.py:load_actor_class_from_gcs to cv.wait() for some condition variable of the function manager.
  2. After each new class is registered in function_manager, call cv.notify().

This release the lock during load_actor_class_from_gcs().

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Aug 28, 2021
@ericl
Copy link
Contributor

ericl commented Aug 28, 2021

Another workaround that I can see, if the issue is that you want to be holding the lock across the entire deserialization--- is to raise an exception if the class is missing instead of sleeping.

Then, we can catch that exception, sleep, and retry the entire deserialization at a higher level.

@ericl
Copy link
Contributor

ericl commented Aug 31, 2021

Seems that strategy is failing all tests, did you test locally? Otherwise probably should try the high level exception/retry.

@ckw017
Copy link
Member Author

ckw017 commented Aug 31, 2021

Sorry, lots of silly mistakes in that run. Retrying right now with fixes

@ckw017 ckw017 changed the title [core] Don't use function_actor_manager lock during worker deserialization [core] Release function actor lock while waiting for actor class to be loaded by import thread Sep 1, 2021
@ckw017 ckw017 removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Sep 1, 2021
@ckw017
Copy link
Member Author

ckw017 commented Sep 1, 2021

@ekl Windows failure looks unrelated, but lmk if you want me to rerun/investigate

@ericl ericl merged commit 1a10108 into ray-project:master Sep 1, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants