[Object manager] don't abort entire pull request on race condition from concurrent chunk receive - #2 #19216

mwtian · 2021-10-08T04:50:10Z

Why are these changes needed?

This PR re-applies d12e35c, and fixes the issue discovered after the original reverted commit.
#18955 contains the background information of the original commit.

The origin commit can cause threads stuck under the following condition:

Push request 1 creates CondVar A, and starts to create buffer for object_id.
Push request 2 for object_id waits on CondVar A.
Push request 3 for object_id waits on CondVar A.
Push request 1 failed to create buffer, signals to CondVar A, and returns.
Push request 2 creates CondVar B, and starts to create buffer.
Push request 3 wakes up from the signal by request 1, sees (object_id, CondVar B) in create_buffer_ops_, and keeps waiting on CondVar A. No thread will signal CondVar A again.

Eventually an object transfer would not complete, likely related to more threads stuck in limbo state like request 3. Hence the test stalled.

The original change and its fix in this PR passed 3 consecutive dask_on_ray_large_scale_test_no_spilling runs. For now we will rely on this nightly test to catch similar issues in future. If we can inject failures to create buffer, this issue might be reproducible in unit tests too.

Related issue number

#18062

Checks

I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

This reverts commit 83cb992.

rkooo567 · 2021-10-08T05:05:36Z

Is it possible to reproduce it in unit tests (can we try)? I think it is important to have good tests there to avoid regressions in the future. My guess is we can intentionally make highly concurrent workload that fails frequently (by increasing object manager threads and reducing the object store memory size with smaller chunk size of many objects)

mwtian · 2021-10-08T06:07:38Z

I have tried with many chunks being pushed, but it did not reproduce the issue, probably because the issue arises only with failures. I can add that test. An additional step is to add a Ray internal config option to fail 50% of create buffer requests. Does it sound ok?

rkooo567 · 2021-10-08T06:15:30Z

@mwtian when you say failure, it is creation failure because it is already created (or lack of memory)?

isn't it reproducible just by increasing

ray/src/ray/object_manager/object_manager.cc

Line 92 in c4bc05b

config_.rpc_service_threads_number),

and decreasing

ray/src/ray/common/ray_config_def.h

Line 205 in c4bc05b

RAY_CONFIG(uint64_t, object_manager_default_chunk_size, 5 * 1024 * 1024)

?

rkooo567 · 2021-10-08T06:15:44Z

(If it doesn't work, I think adding artificial failure is not a bad idea)

mwtian · 2021-10-08T06:23:22Z

With the original change, there wouldn't be any create buffer failure due to object already exists. I'm not 100% sure what failures are encountered during buffer creation in nightly tests, which seemed to happen rarely. Will add a unit test.

ericl

Nice! Seems like quite a subtle bug.

ericl · 2021-10-08T19:58:42Z

Will add a unit test.

We can do this as a follow-up; merging to get more data from nightly tests.

Mingwei Tian added 3 commits October 7, 2021 13:18

Revert "Revert pull retry (#19068)"

5866f5a

This reverts commit 83cb992.

retry waiting on condvar correctly. fix clang-tidy.

9142eb4

fix test

c8af6d6

mwtian assigned ericl and rkooo567 Oct 8, 2021

mwtian added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 8, 2021

ericl approved these changes Oct 8, 2021

View reviewed changes

ericl merged commit b066627 into master Oct 8, 2021

ericl deleted the test_wheels/mwtian-pull branch October 8, 2021 19:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Object manager] don't abort entire pull request on race condition from concurrent chunk receive - #2 #19216

[Object manager] don't abort entire pull request on race condition from concurrent chunk receive - #2 #19216

mwtian commented Oct 8, 2021

rkooo567 commented Oct 8, 2021 •

edited

Loading

mwtian commented Oct 8, 2021 •

edited

Loading

rkooo567 commented Oct 8, 2021

rkooo567 commented Oct 8, 2021

mwtian commented Oct 8, 2021 •

edited

Loading

ericl left a comment

ericl commented Oct 8, 2021

[Object manager] don't abort entire pull request on race condition from concurrent chunk receive - #2 #19216

[Object manager] don't abort entire pull request on race condition from concurrent chunk receive - #2 #19216

Conversation

mwtian commented Oct 8, 2021

Why are these changes needed?

Related issue number

Checks

rkooo567 commented Oct 8, 2021 • edited Loading

mwtian commented Oct 8, 2021 • edited Loading

rkooo567 commented Oct 8, 2021

rkooo567 commented Oct 8, 2021

mwtian commented Oct 8, 2021 • edited Loading

ericl left a comment

Choose a reason for hiding this comment

ericl commented Oct 8, 2021

rkooo567 commented Oct 8, 2021 •

edited

Loading

mwtian commented Oct 8, 2021 •

edited

Loading

mwtian commented Oct 8, 2021 •

edited

Loading