Revert "[core] Fix RAY_CHECK failure during shutdown due to plasma store race condition (#55367)" #56061
Conversation
Force-pushed from 43034a7 to 7e774f8 with the commit message:

Revert "[core] Fix RAY_CHECK failure during shutdown due to plasma store race condition (ray-project#55367)"

This reverts commit dded833. Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Code Review
This pull request correctly reverts commit dded833... to address a regression causing unhandled errors. The changes undo the error handling logic around plasma store operations, which was the source of the issue. While this fixes the immediate problem, it's worth noting that the original commit was intended to solve a race condition during shutdown. By reverting, that race condition may be re-introduced. I've added a comment in src/ray/core_worker/task_manager.cc highlighting a specific behavioral change in error handling that could impact local waiters. A follow-up to properly address the shutdown race condition without introducing new regressions would be beneficial.
The diff in src/ray/core_worker/task_manager.cc (removed lines are the #55367 logic being reverted; added lines restore the original behavior):

```diff
 for (int i = 0; i < num_returns; i++) {
   const auto object_id = ObjectID::FromIndex(task_id, /*index=*/i + 1);
-  // Always place an error marker in local memory to unblock waiters quickly.
-  in_memory_store_.Put(error, object_id);
-  // Best-effort plasma put if the object was meant to be in plasma.
   if (store_in_plasma_ids.contains(object_id)) {
-    Status s = put_in_local_plasma_callback_(error, object_id);
-    if (!s.ok()) {
-      RAY_LOG(WARNING).WithField(object_id)
-          << "Failed to put error object in plasma: " << s;
-    }
+    put_in_local_plasma_callback_(error, object_id);
+  } else {
+    in_memory_store_.Put(error, object_id);
   }
 }
 if (spec.ReturnsDynamic()) {
   for (const auto &dynamic_return_id : spec.DynamicReturnIds()) {
-    in_memory_store_.Put(error, dynamic_return_id);
     if (store_in_plasma_ids.contains(dynamic_return_id)) {
-      Status s = put_in_local_plasma_callback_(error, dynamic_return_id);
-      if (!s.ok()) {
-        RAY_LOG(WARNING).WithField(dynamic_return_id)
-            << "Failed to put error object in plasma: " << s;
-      }
+      put_in_local_plasma_callback_(error, dynamic_return_id);
+    } else {
+      in_memory_store_.Put(error, dynamic_return_id);
     }
   }
 }
```
This revert changes how error objects are stored. The previous logic (now reverted) always placed an error marker in the local memory store to unblock waiters quickly, then made a best-effort attempt to put it in plasma as well. This version puts the error in either plasma or the in-memory store, but never both.
This means that if put_in_local_plasma_callback_ fails (which now results in a crash due to RAY_CHECK_OK), local waiters might not be unblocked. While this revert fixes the immediate regression, we should be mindful that it may re-introduce the issue the original commit aimed to solve. A follow-up PR that handles plasma store failures without crashing, while still ensuring local waiters are unblocked, would be ideal.
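For illustration only, such a follow-up could keep the either/or structure (avoiding the double put that the reverted change introduced) while replacing the crash with a logged fallback. This is a sketch using the identifiers visible in the diff above (in_memory_store_, put_in_local_plasma_callback_, store_in_plasma_ids), not the fix that actually landed in #56070:

```cpp
// Sketch only, not the actual fix: on plasma failure, warn and fall back
// to the in-memory store so local waiters are still unblocked, instead of
// crashing the worker via RAY_CHECK_OK.
for (int i = 0; i < num_returns; i++) {
  const auto object_id = ObjectID::FromIndex(task_id, /*index=*/i + 1);
  if (store_in_plasma_ids.contains(object_id)) {
    Status s = put_in_local_plasma_callback_(error, object_id);
    if (!s.ok()) {
      // E.g. the plasma store connection is already torn down at shutdown.
      RAY_LOG(WARNING).WithField(object_id)
          << "Failed to put error object in plasma, falling back to the "
             "in-memory store: " << s;
      in_memory_store_.Put(error, object_id);
    }
  } else {
    in_memory_store_.Put(error, object_id);
  }
}
```

Whether an in-memory fallback is acceptable for an object that callers expect to find in plasma is exactly the question such a follow-up would have to settle.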
Close in favor of #56070
This reverts commit dded833.
Reverting due to unhandled errors, see: #55367 (review)