Revert "[core] Fix RAY_CHECK failure during shutdown due to plasma store race condition (#55367)" #56061
Conversation
Force-pushed from 43034a7 to 7e774f8 with the commit message:

Revert "[core] Fix RAY_CHECK failure during shutdown due to plasma store race condition (ray-project#55367)"

This reverts commit dded833. Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Code Review
This pull request correctly reverts commit dded833... to address a regression causing unhandled errors. The changes undo the error handling logic around plasma store operations, which was the source of the issue. While this fixes the immediate problem, it's worth noting that the original commit was intended to solve a race condition during shutdown. By reverting, that race condition may be re-introduced. I've added a comment in src/ray/core_worker/task_manager.cc highlighting a specific behavioral change in error handling that could impact local waiters. A follow-up to properly address the shutdown race condition without introducing new regressions would be beneficial.
The diff in src/ray/core_worker/task_manager.cc (removed lines are the #55367 logic being reverted; added lines restore the original behavior):

```diff
 for (int i = 0; i < num_returns; i++) {
   const auto object_id = ObjectID::FromIndex(task_id, /*index=*/i + 1);
-  // Always place an error marker in local memory to unblock waiters quickly.
-  in_memory_store_.Put(error, object_id);
-  // Best-effort plasma put if the object was meant to be in plasma.
   if (store_in_plasma_ids.contains(object_id)) {
-    Status s = put_in_local_plasma_callback_(error, object_id);
-    if (!s.ok()) {
-      RAY_LOG(WARNING).WithField(object_id)
-          << "Failed to put error object in plasma: " << s;
-    }
+    put_in_local_plasma_callback_(error, object_id);
+  } else {
+    in_memory_store_.Put(error, object_id);
   }
 }
 if (spec.ReturnsDynamic()) {
   for (const auto &dynamic_return_id : spec.DynamicReturnIds()) {
-    in_memory_store_.Put(error, dynamic_return_id);
     if (store_in_plasma_ids.contains(dynamic_return_id)) {
-      Status s = put_in_local_plasma_callback_(error, dynamic_return_id);
-      if (!s.ok()) {
-        RAY_LOG(WARNING).WithField(dynamic_return_id)
-            << "Failed to put error object in plasma: " << s;
-      }
+      put_in_local_plasma_callback_(error, dynamic_return_id);
+    } else {
+      in_memory_store_.Put(error, dynamic_return_id);
     }
   }
 }
```
This revert changes how error objects are stored. The previous logic (now reverted) always placed an error marker in the local memory store to unblock waiters quickly, then made a best-effort attempt to put it in plasma as well. This version puts the error in either plasma or the in-memory store, but never both.
This means that if put_in_local_plasma_callback_ fails (which now results in a crash due to RAY_CHECK_OK), local waiters might not be unblocked. While this revert fixes the immediate regression, we should be mindful that it may re-introduce the issue the original commit aimed to solve. A follow-up PR that handles plasma store failures without crashing, while still ensuring local waiters are unblocked, would be ideal.
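For illustration only, such a follow-up could keep the either/or structure (avoiding the double put that the reverted change introduced) while replacing the crash with a logged fallback. This is a sketch using the identifiers visible in the diff above (in_memory_store_, put_in_local_plasma_callback_, store_in_plasma_ids), not the fix that actually landed in #56070:

```cpp
// Sketch only, not the actual fix: on plasma failure, warn and fall back
// to the in-memory store so local waiters are still unblocked, instead of
// crashing the worker via RAY_CHECK_OK.
for (int i = 0; i < num_returns; i++) {
  const auto object_id = ObjectID::FromIndex(task_id, /*index=*/i + 1);
  if (store_in_plasma_ids.contains(object_id)) {
    Status s = put_in_local_plasma_callback_(error, object_id);
    if (!s.ok()) {
      // E.g. the plasma store connection is already torn down at shutdown.
      RAY_LOG(WARNING).WithField(object_id)
          << "Failed to put error object in plasma, falling back to the "
             "in-memory store: " << s;
      in_memory_store_.Put(error, object_id);
    }
  } else {
    in_memory_store_.Put(error, object_id);
  }
}
```

Whether an in-memory fallback is acceptable for an object that callers expect to find in plasma is exactly the question such a follow-up would have to settle.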
Close in favor of #56070
This reverts commit dded833.
Reverting due to unhandled errors, see: #55367 (review)