[Object manager] don't abort entire pull request on race condition from concurrent chunk receive - #2 #19216

Merged — 3 commits merged into master from test_wheels/mwtian-pull on Oct 8, 2021

Conversation

@mwtian (Member) commented on Oct 8, 2021

Why are these changes needed?

This PR re-applies d12e35c and fixes the issue discovered after the original commit was reverted.
#18955 contains the background information for the original commit.

The original commit can leave threads stuck under the following sequence of events:

  1. Push request 1 creates CondVar A and starts to create the buffer for object_id.
  2. Push request 2 for object_id waits on CondVar A.
  3. Push request 3 for object_id waits on CondVar A.
  4. Push request 1 fails to create the buffer, signals CondVar A, and returns.
  5. Push request 2 wakes up, creates CondVar B, and starts to create the buffer.
  6. Push request 3 wakes up from the signal by request 1, sees (object_id, CondVar B) in create_buffer_ops_, and goes back to waiting on CondVar A. No thread will ever signal CondVar A again.

Eventually an object transfer fails to complete, likely because more threads end up stuck in a limbo state like request 3, and hence the test stalls.
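
As a rough illustration, the following C++ sketch (illustrative names only, not the actual ObjectBufferPool code) shows the shape of the wait loop that can strand request 3: the waiter captures the condvar once and keeps waiting on it even after the map entry has been replaced by a new condvar.

```cpp
// Minimal sketch of the hazardous wait pattern, assuming a map from object id
// to the condvar of the in-flight create-buffer operation. Names are
// illustrative; this is not the actual Ray ObjectBufferPool code.
#include <condition_variable>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>

std::mutex pool_mutex_;
std::unordered_map<std::string, std::shared_ptr<std::condition_variable>>
    create_buffer_ops_;

void WaitForInflightCreate(const std::string &object_id) {
  std::unique_lock<std::mutex> lock(pool_mutex_);
  auto it = create_buffer_ops_.find(object_id);
  if (it == create_buffer_ops_.end()) {
    return;  // No create-buffer operation in flight; caller creates the buffer.
  }
  // Hazard: the condvar is captured once (CondVar A). If the creator fails,
  // erases its entry, and a woken waiter installs a new condvar (CondVar B)
  // under the same object_id, this loop still sees an entry in the map and
  // goes back to sleep on CondVar A, which no thread will ever signal again.
  auto cond_var = it->second;
  while (create_buffer_ops_.count(object_id) > 0) {
    cond_var->wait(lock);
  }
}
```

One way out (not necessarily the exact fix applied in this PR) is to re-read the condvar registered for object_id after every wakeup, so a waiter never keeps blocking on a condvar that has already been abandoned.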

The original change and its fix in this PR passed 3 consecutive dask_on_ray_large_scale_test_no_spilling runs. For now we will rely on this nightly test to catch similar issues in the future. If we can inject failures into create buffer, this issue might be reproducible in unit tests too.

Related issue number

#18062

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@rkooo567 (Contributor) commented on Oct 8, 2021

Is it possible to reproduce it in unit tests (can we try)? I think it is important to have good tests there to avoid regressions in the future. My guess is we can intentionally create a highly concurrent workload that fails frequently (by increasing the number of object manager threads and reducing the object store memory size, with a smaller chunk size and many objects).

@mwtian (Member, Author) commented on Oct 8, 2021

I have tried with many chunks being pushed, but it did not reproduce the issue, probably because the issue only arises when buffer creation fails. I can add that test. An additional step is to add a Ray internal config option to fail 50% of create buffer requests. Does that sound ok?
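
As a rough sketch of what such an injection hook could look like (the option name and helper below are invented for illustration and are not existing Ray configuration; only the RAY_CONFIG macro and the RayConfig accessor pattern are assumed from the codebase):

```cpp
// Hypothetical failure-injection knob for unit tests; the option name and the
// helper are illustrative only, not existing Ray configuration.

// In ray_config_def.h (declares the option, default off):
RAY_CONFIG(double, testing_create_buffer_failure_ratio, 0.0)

// In the create-buffer path (e.g. the object buffer pool .cc file):
#include <random>

// Report an injected failure for a configurable fraction of create-buffer
// requests, so the concurrent-chunk-receive race can be exercised in a unit
// test without relying on rare real failures.
bool ShouldInjectCreateBufferFailure() {
  static thread_local std::mt19937 rng(std::random_device{}());
  std::uniform_real_distribution<double> dist(0.0, 1.0);
  return dist(rng) <
         RayConfig::instance().testing_create_buffer_failure_ratio();
}
```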

@rkooo567 (Contributor) commented on Oct 8, 2021

@mwtian when you say failure, is it a creation failure because the buffer already exists (or a lack of memory)?

Isn't it reproducible just by increasing config_.rpc_service_threads_number and decreasing RAY_CONFIG(uint64_t, object_manager_default_chunk_size, 5 * 1024 * 1024)?

@rkooo567 (Contributor) commented on Oct 8, 2021

(If that doesn't work, I think adding an artificial failure is not a bad idea.)

@mwtian (Member, Author) commented on Oct 8, 2021

With the original change, there wouldn't be any create buffer failures due to the object already existing. I'm not 100% sure what failures are encountered during buffer creation in the nightly tests; they seem to happen rarely. Will add a unit test.

@mwtian added the @author-action-required label (The PR author is responsible for the next step. Remove tag to send back to the reviewer.) on Oct 8, 2021
@ericl (Contributor) left a comment

Nice! Seems like quite a subtle bug.

@ericl merged commit b066627 into master on Oct 8, 2021
@ericl deleted the test_wheels/mwtian-pull branch on October 8, 2021 at 19:58
@ericl (Contributor) commented on Oct 8, 2021

> Will add a unit test.

We can do this as a follow-up; merging to get more data from nightly tests.
