Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core] Use function_actor_manager.lock when deserializing #16278

Merged
merged 4 commits into from
Jun 8, 2021

Conversation

ckw017
Copy link
Member

@ckw017 ckw017 commented Jun 6, 2021

Why are these changes needed?

Addresses #7879 to some extent, and similar to #4718. Most of this is just speculation from what I've scraped together from various error messages, but it looks like there's a chance that worker.deserialize_objects is called at the same time as some other cloudpickle.loads call on a different thread. This can cause importlib's deadlock avoidance to kick in, which prevents certain attributes from being imported correctly.

This PR just acquires the function_actor_manager lock before calling deserialize_objects (which eventually calls cloudpickle.loads). This is pretty much the same way #4718 handled the same issue but with the import_thread.

Testing

Motivated by some issues with trying to use ray client with Modin. It looks like we introduced a semi-hacky workaround sometime last year on the Modin repo to get around this problem using a private API, but for whatever reason it doesn't seem to work with client. I ran batches of some of the Modin unit tests on Github Actions with the following results:

  • Run 1: Without lock: 2/8 failures, with lock 0/8 failures
  • Run 2: Without lock: 3/8 failures, with lock 0/8 failures

Failures above were due to the pandas.core attribute not found error or some variant.

Not actually sure if this is the exact case as the original issue though which would require testing against the original reproducer mentioned, but the errors are identical.

Further discussion

There's sort of a question here of whether we should introduce some kind of global import lock (can do some cursed stuff by reassigning __builtins__.__import__) or a lock on pickle.loads calls, since it looks like this problem can happen an ytime we have multiple threads importing stuff. An even spookier edge case is that if a task that a user submits involves unpickling something as part of a function they wrote it might also trigger this issue, although the case is pretty unlikely.

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@ckw017
Copy link
Member Author

ckw017 commented Jun 6, 2021

@rkooo567 Wasn't sure who should look this over but saw that you were keeping track of this on the issue thread, feel free to unassign yourself if someone else is more appropriate

@ckw017
Copy link
Member Author

ckw017 commented Jun 6, 2021

cc @AmeerHajAli Might fix the issue that modin needs the private api for

@AmeerHajAli
Copy link
Contributor

CC @ijrsvt

Copy link
Contributor

@ijrsvt ijrsvt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ckw017 Is there an easy way to test this? Or do you only see this arise in pretty complex scenarios?

@ckw017
Copy link
Member Author

ckw017 commented Jun 6, 2021

I can't reproduce it at all locally, but it seems to come up pretty often in Github Actions. I'm guessing you might be able to force it to happen if you run a workload that A) has multiple imports to different parts of a large library to trigger the deadlock avoidance and B) spawns workers often (once pandas is imported successfully the first time the process should be fine for the rest of it's lifetime), which seems to be the case with some of the modin unit tests.

fwiw I'm fairly certain this does address the bug. The testing above was done with nightly where the failure's happen roughly ~25% of runs. I tried adding a similar fix using 1.3 and had the following consecutive runs:
Run 1: Without the lock, 10 out of 12 runs fail because of the import issue
Run 2: With the lock, all 12 runs pass

Copy link
Member

@devin-petersohn devin-petersohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for investigating @ckw017 and thanks for tagging me @AmeerHajAli!

I'll be happy to remove the vile hack from Modin's codebase.

Copy link
Contributor

@ijrsvt ijrsvt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good!

@rkooo567
Copy link
Contributor

rkooo567 commented Jun 7, 2021

This is nice! @devin-petersohn this will solve the pandas not properly imported to Modin right ?

@rkooo567
Copy link
Contributor

rkooo567 commented Jun 7, 2021

Hey @suquark can you do the final check before we merge it?

@devin-petersohn
Copy link
Member

@rkooo567 Yes, this solves the import issue that was difficult to reproduce.

@AmeerHajAli
Copy link
Contributor

@rkooo567 , can you please merge this?

@rkooo567
Copy link
Contributor

rkooo567 commented Jun 8, 2021

Hey Ameer. Let me have the last sanity check from ryan. I will ping him in person. Is it urgent to merge?

@AmeerHajAli
Copy link
Contributor

Yes, we are trying to merge this soon.

@rkooo567
Copy link
Contributor

rkooo567 commented Jun 8, 2021

@ckw017 Can you merge the latest master? test_reference_count_2 failure seems spooky. I will also ping Ryan in person so we can move faster

Copy link
Member

@suquark suquark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. In the future we might have to hack the deserializer a bit to only lock on importing instead of the whole deserialization process. It could be done by reassigning __builtins__.__import__ or inject pickle global variable lookup mechanics.

@rkooo567
Copy link
Contributor

rkooo567 commented Jun 8, 2021

@suquark maybe we can create an issue? Or @ckw017 Can you add TODO in the code?

I will merge the PR once

  • Merge the latest master
  • Add Ryan's feedback to TODO.

@suquark
Copy link
Member

suquark commented Jun 8, 2021

@rkooo567 I create an issue about this: #16304. I think we only need to include the link to the issue in the TODO.

@ckw017
Copy link
Member Author

ckw017 commented Jun 8, 2021

Rebased with master and added a TODO linking to the new issue. Reviewers can merge at their discretion if build passes

@ckw017
Copy link
Member Author

ckw017 commented Jun 8, 2021

Ah rip I guess I just learned the fun way that rebasing makes it a lot harder to find the original failing CI run, but pretty sure it was identical to the failure here (timeout). Looks like reference_counting_2 is flaky on windows at the moment, but if it happens again we can investigate

@rkooo567 rkooo567 merged commit c8e3ed9 into ray-project:master Jun 8, 2021
@rkooo567 rkooo567 added the 1.4.1 label Jun 14, 2021
DmitriGekhtman pushed a commit that referenced this pull request Jun 20, 2021
* use function_actor_manager.lock when deserializing

* add comment and todo

* better comment

* fix comment
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants