[Object manager] don't abort entire pull request on race condition from concurrent chunk receive - #2 #19216
Is it possible to reproduce it in unit tests (can we try)? I think it is important to have good tests there to avoid regressions in the future. My guess is that we can intentionally create a highly concurrent workload that fails frequently (by increasing the number of object manager threads and reducing the object store memory size, with a smaller chunk size and many objects).
I have tried pushing many chunks, but that did not reproduce the issue, probably because the issue arises only when there are failures. I can add that test. An additional step would be to add a Ray internal config option that fails 50% of create buffer requests. Does that sound OK?
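As a rough illustration of the failure-injection idea, here is a minimal Python sketch rather than Ray's C++ object manager code. `FlakyBufferPool`, `push_chunks`, and `run_stress` are hypothetical names invented for this example, and retrying a failed chunk is only an assumption about one reasonable reaction to a failed create, not a description of this PR's fix.

```python
import random
import threading


class FlakyBufferPool:
    """Hypothetical stand-in for a chunk buffer pool; not Ray's actual API."""

    def __init__(self, failure_rate, seed=0):
        self._failure_rate = failure_rate  # fraction of create requests to fail
        self._rng = random.Random(seed)    # seeded RNG for the injected failures
        self._lock = threading.Lock()      # guards the RNG and the chunk map
        self.chunks = {}                   # (object_id, chunk_index) -> bytes

    def create_and_write(self, object_id, chunk_index, data):
        """Store one chunk; return False when the injected failure fires."""
        with self._lock:
            if self._rng.random() < self._failure_rate:
                return False
            self.chunks[(object_id, chunk_index)] = data
            return True


def push_chunks(pool, object_id, num_chunks, max_attempts=50):
    """Push every chunk, retrying a failed chunk instead of giving up on the object."""
    for index in range(num_chunks):
        for _ in range(max_attempts):
            if pool.create_and_write(object_id, index, b"x"):
                break
        else:
            raise RuntimeError(f"chunk {index} of {object_id} never succeeded")


def run_stress(num_objects=8, num_chunks=16, failure_rate=0.5):
    """Push several objects concurrently against a pool that fails some creates."""
    pool = FlakyBufferPool(failure_rate)
    threads = [
        threading.Thread(target=push_chunks, args=(pool, f"obj{i}", num_chunks))
        for i in range(num_objects)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return pool
```

A test built this way would assert that every chunk of every object eventually lands in the pool even though roughly half of the create requests are rejected.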
@mwtian when you say failure, do you mean a creation failure because the object is already created (or because of a lack of memory)? Isn't it reproducible just by increasing the setting at ray/src/ray/common/ray_config_def.h, line 205 in c4bc05b?
(If that doesn't work, I think adding an artificial failure is not a bad idea.)
With the original change, there wouldn't be any create buffer failure due to the object already existing. I'm not 100% sure which failures are encountered during buffer creation in the nightly tests; they seem to happen rarely. Will add a unit test.
Nice! Seems like quite a subtle bug.
We can do this as a follow-up; merging to get more data from nightly tests.
Why are these changes needed?
This PR re-applies d12e35c, and fixes the issue discovered after the original reverted commit.
#18955 contains the background information of the original commit.
The original commit can cause threads to get stuck under the following condition:
Eventually an object transfer would not complete, likely because more threads got stuck in a limbo state like request 3. Hence the test stalled.
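A unit test aiming to catch this kind of stall would need to notice when a worker never finishes instead of hanging itself. The following Python sketch shows one way to do that with a shared deadline on `Thread.join`; `stalled_workers` is a hypothetical helper for illustration and is not part of Ray.

```python
import threading
import time


def stalled_workers(workers, timeout_s):
    """Run each callable in a daemon thread; return names still running at the deadline.

    `workers` maps a name to a zero-argument callable. Daemon threads are used
    so that a stuck worker cannot keep the test process alive.
    """
    threads = [
        threading.Thread(target=fn, name=name, daemon=True)
        for name, fn in workers.items()
    ]
    for thread in threads:
        thread.start()
    deadline = time.monotonic() + timeout_s
    for thread in threads:
        # Wait only for the time left until the shared deadline.
        thread.join(max(0.0, deadline - time.monotonic()))
    return sorted(thread.name for thread in threads if thread.is_alive())
```

For example, passing one callable that returns immediately and one that waits on a `threading.Event` that is never set would report only the second as stalled, so the test can fail with a useful message rather than time out.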
The original change and its fix in this PR passed 3 consecutive dask_on_ray_large_scale_test_no_spilling runs. For now we will rely on this nightly test to catch similar issues in the future. If we can inject failures into create buffer, this issue might be reproducible in unit tests too.
Related issue number
#18062
Checks
I've run scripts/format.sh to lint the changes in this PR.