Skip to content

Conversation

@edoakes
Copy link
Collaborator

@edoakes edoakes commented Aug 28, 2025

This reverts commit dded833.

Reverting due to unhandled errors, see: #55367 (review)

@edoakes edoakes requested a review from a team as a code owner August 28, 2025 23:50
@edoakes edoakes added the go add ONLY when ready to merge, run all tests label Aug 28, 2025
…ore race condition (ray-project#55367)"

This reverts commit dded833.

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
@edoakes edoakes force-pushed the eoakes/revert-err branch from 43034a7 to 7e774f8 Compare August 28, 2025 23:51
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request correctly reverts commit dded833... to address a regression causing unhandled errors. The changes undo the error handling logic around plasma store operations, which was the source of the issue. While this fixes the immediate problem, it's worth noting that the original commit was intended to solve a race condition during shutdown. By reverting, that race condition may be re-introduced. I've added a comment in src/ray/core_worker/task_manager.cc highlighting a specific behavioral change in error handling that could impact local waiters. A follow-up to properly address the shutdown race condition without introducing new regressions would be beneficial.

Comment on lines 1456 to 1472
for (int i = 0; i < num_returns; i++) {
const auto object_id = ObjectID::FromIndex(task_id, /*index=*/i + 1);
// Always place an error marker in local memory to unblock waiters quickly.
in_memory_store_.Put(error, object_id);
// Best-effort plasma put if the object was meant to be in plasma.
if (store_in_plasma_ids.contains(object_id)) {
Status s = put_in_local_plasma_callback_(error, object_id);
if (!s.ok()) {
RAY_LOG(WARNING).WithField(object_id)
<< "Failed to put error object in plasma: " << s;
}
put_in_local_plasma_callback_(error, object_id);
} else {
in_memory_store_.Put(error, object_id);
}
}
if (spec.ReturnsDynamic()) {
for (const auto &dynamic_return_id : spec.DynamicReturnIds()) {
in_memory_store_.Put(error, dynamic_return_id);
if (store_in_plasma_ids.contains(dynamic_return_id)) {
Status s = put_in_local_plasma_callback_(error, dynamic_return_id);
if (!s.ok()) {
RAY_LOG(WARNING).WithField(dynamic_return_id)
<< "Failed to put error object in plasma: " << s;
}
put_in_local_plasma_callback_(error, dynamic_return_id);
} else {
in_memory_store_.Put(error, dynamic_return_id);
}
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

This revert changes the behavior of how error objects are stored. The previous logic (now reverted) always placed an error marker in the local memory store to unblock waiters quickly, and then made a best-effort attempt to put it in plasma. This version now puts the error in either plasma or the in-memory store.

This means if put_in_local_plasma_callback_ fails (which now results in a crash due to RAY_CHECK_OK), local waiters might not be unblocked. While this revert fixes the immediate regression, we should be mindful that we might be re-introducing a potential issue that the original commit aimed to solve. A follow-up PR to correctly handle plasma store failures without crashing, while ensuring local waiters are unblocked, would be ideal.

@ray-gardener ray-gardener bot added the core Issues that should be addressed in Ray Core label Aug 29, 2025
@edoakes
Copy link
Collaborator Author

edoakes commented Sep 9, 2025

Close in favor of #56070

@edoakes edoakes closed this Sep 9, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Issues that should be addressed in Ray Core go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants