
Conversation

@codope (Contributor) commented Oct 8, 2025:

Make FreeObjects non-fatal. Sometimes the RAY_CHECK fails due to IOError: No buffer space available, which is transient in most cases.

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
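For orientation, a condensed sketch of the shape of the change (hedged; the exact diff and final log level are settled later in this thread):

// Before: any failure from Delete crashes the raylet via the fatal check.
RAY_CHECK_OK(store_client_->Delete(object_ids));

// After (sketch): record the failure and continue.
Status s = store_client_->Delete(object_ids);
if (!s.ok()) {
  RAY_LOG(WARNING) << "Failed to delete objects from plasma store: " << s;
}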
@codope codope requested a review from a team as a code owner October 8, 2025 07:23

@gemini-code-assist (bot) left a comment:

Code Review

Thank you for this contribution. The change to make FreeObjects non-fatal and retryable is a great improvement to robustness. The implementation looks solid. I have a few suggestions to improve the retry logic's effectiveness and to enhance test coverage so that all paths of the new logic are validated.

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Comment on lines 342 to 345
if (!s.IsIOError()) {
  RAY_LOG(WARNING) << "Plasma delete failed (non-IOError), not retrying: " << s;
  return;
}
@codope (author):

Should we make this fatal and only keep IOError non-fatal and retryable?

@dayshah:

are other statuses possible?

@codope (author):

Potentially from the PlasmaClient::Delete code path: Status::Invalid(...) if delete request construction fails.

Comment on lines 336 to 337
absl::MutexLock lock(&pool_mutex_);
s = store_client_->Delete(object_ids);
@codope (author):

The lock was added in 3df1e1c but is it really required here? Per my understanding, FreeObjects doesn't read/modify ObjectBufferPool state (it only calls store_client_->Delete), and the plasma client is internally synchronized with its own mutex.

If we keep the lock, I would prefer NOT to lock across the entire retry loop; it'll serialize pool operations during the sleeps. Reacquiring pool_mutex_ only around each Delete keeps other pool operations (create/abort) unblocked during backoff. Wdyt @edoakes @dayshah?
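A condensed sketch of the narrower scoping proposed above (assuming the retry loop the PR had at this stage; max_attempts and backoff are placeholder names, and the snippet leans on the surrounding ObjectBufferPool context):

Status s;
for (int attempt = 0; attempt < max_attempts; ++attempt) {
  {
    // Hold pool_mutex_ only for the Delete call itself...
    absl::MutexLock lock(&pool_mutex_);
    s = store_client_->Delete(object_ids);
  }  // ...releasing it before any backoff sleep, so create/abort on the
     // pool stay unblocked while we wait.
  if (s.ok() || !s.IsIOError()) {
    break;
  }
  std::this_thread::sleep_for(backoff);  // placeholder backoff value
}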

@dayshah:

Ya looks right to me, it doesn't seem like this lock needs to be here.

@codope (author):

ok, I'll remove that in a separate PR


@codope codope requested review from dayshah and edoakes October 8, 2025 08:47
@ray-gardener ray-gardener bot added the core Issues that should be addressed in Ray Core label Oct 8, 2025
  }
  attempt++;
}
RAY_LOG(WARNING) << "Plasma delete failed after retries (non-fatal).";
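For context, the fragment above is the tail of the retry loop as it stood at this point; a condensed reconstruction pieced together from the snippets quoted in this thread (the attempt cap and backoff schedule are assumptions) might look like:

int attempt = 0;
Status s;
while (attempt < kMaxDeleteAttempts) {  // kMaxDeleteAttempts: assumed name
  s = store_client_->Delete(object_ids);
  if (s.ok()) {
    return;
  }
  if (!s.IsIOError()) {
    // Non-IOError statuses (e.g. Status::Invalid) won't go away on retry.
    RAY_LOG(WARNING) << "Plasma delete failed (non-IOError), not retrying: " << s;
    return;
  }
  // Transient IOError (e.g. ENOBUFS): back off and try again.
  std::this_thread::sleep_for(std::chrono::milliseconds(10 << attempt));
  attempt++;
}
RAY_LOG(WARNING) << "Plasma delete failed after retries (non-fatal).";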
Contributor:

So the original behavior was fatal, and I get that that's the intention of this PR (make it non-fatal and a warning). Do we know why it was important enough to be fatal previously? Are there any correctness implications to truly being unable to delete from plasma? Or was the original RAY_CHECK just muscle memory?

@codope (author) commented Oct 9, 2025:

I think it's mostly the latter. Earlier patterns used RAY_CHECK_OK around plasma ops broadly. The fatal check likely persisted as a generic "must succeed" guard rather than a correctness requirement for delete itself.

Delete is best-effort and idempotent. If it fails, objects just linger in the plasma store. Potential memory pressure and delayed reclamation, but no data corruption or API-level inconsistency. There are additional retries at higher layers:

  • Local: LocalObjectManager::FlushFreeObjects runs periodically and will reattempt frees.
  • Remote: ObjectManager::RetryFreeObjects handles broadcast failures.

@codope codope added the go add ONLY when ready to merge, run all tests label Oct 10, 2025
@dayshah left a comment:

I think there's a higher-level question here. When is an IOError on an IPC possible? AFAIK it should only happen on shutdown, and this seems more like a case where plasma shuts down before we stop trying to send the free IPC. If that's the case, maybe this free should just give up on an IOError; no need for the retries or anything.


@edoakes (Collaborator) commented Oct 10, 2025:

Sometimes the RAY_CHECK fails due to IOError: No buffer space available. This is transient in most cases.

What specifically causes this IOError to happen and why do we think it's transient in most cases?

@codope (author) commented Oct 16, 2025:

When is an IOError on an IPC possible? Should we just give up instead of retry?

What specifically causes this IOError to happen and why do we think it's transient in most cases?

@dayshah @edoakes My understanding is that IOError can happen during transport backpressure (aggressive shutdown could be one trigger). When I checked the logs, the raylet event showed a fatal error: errno: 55 (No buffer space available), i.e. ENOBUFS. The state dump shows many outstanding remote FreeObjects calls with high latency:

[state-dump]   ObjectManagerService.grpc_client.FreeObjects - 66881 total (0 active), Execution time: mean = 34.706 ms, total = 2321.194 s

That led me to believe it's transient socket buffer exhaustion. We also see a lot of node churn:

[2025-09-17 21:40:46,219] Node failure. node_id=7aadeaab...
[2025-09-17 21:40:46,220] Node failure. node_id=0bf3ef14...

and a lot of FreeObjects RPC timeouts around the node-churn window:

[2025-09-17 21:40:45,514] Send free objects request failed due to RPC Error message: recvmsg:Operation timed out

Delete is idempotent, and I think retrying with a short backoff allows buffers to drain, so it would be helpful in this case.

@dayshah commented Oct 16, 2025:

@codope so I'm pretty sure the actual reason for this is one level higher. We send way too many FreeObjects RPCs everywhere, and those in turn make the FreeObjects IPC. Per object freed, we'll send an RPC to every node in the cluster telling it to call free objects. Because of the excessive number of requests, the IPC socket buffer is probably going to fill up quite fast. @Sparks0219 is going to fix this.

Right now for this, imo we should just make it non-fatal and log an error; the retry logic isn't that necessary since we'll evict secondary copies under memory pressure anyway.
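To make the fan-out concrete, a toy calculation (all numbers invented for illustration; batching object IDs per node is one plausible shape of the fix mentioned above, not a description of the actual patch):

#include <cstdint>
#include <iostream>

int main() {
  // Hypothetical cluster; real numbers depend on the workload.
  const std::uint64_t num_nodes = 500;
  const std::uint64_t objects_freed = 10000;
  // One FreeObjects RPC per freed object per node, as described above.
  std::cout << "per-object fan-out: " << num_nodes * objects_freed << " RPCs\n";
  // One RPC per node carrying a batch of object IDs instead.
  std::cout << "batched per node:   " << num_nodes << " RPCs\n";
  return 0;
}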

@codope (author) commented Oct 17, 2025:

Right now for this, imo we should just make it non-fatal and log an error; the retry logic isn't that necessary since we'll evict secondary copies under memory pressure anyway.

Sounds good, I'll remove the retry logic then.

@codope codope changed the title [core] Make FreeObjects non-fatal and retryable [core] Make FreeObjects non-fatal Oct 17, 2025
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
codope and others added 2 commits October 17, 2025 13:23
Co-authored-by: Dhyey Shah <dhyey2019@gmail.com>
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
-  RAY_CHECK_OK(store_client_->Delete(object_ids));
+  Status s = store_client_->Delete(object_ids);
+  if (!s.ok()) {
+    RAY_LOG(ERROR) << "Failed to delete objects from plasma store: " << s;
Collaborator:

For the sake of posterity, let's extend the warning message and/or add a comment indicating why it's ok that we ignore this error and proceed (because secondary copies will be evicted as needed).

This will save someone from seeing this, thinking it's a mistake, and spending an hour reverse engineering the behavior.

Collaborator:

If it's ok to ignore this error, then shouldn't it be WARNING?

Collaborator:

Good point. Otherwise this will also log to the driver terminal.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
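Putting the two review suggestions together (extended message, WARNING rather than ERROR), the merged shape is plausibly the following sketch; the actual commit is authoritative:

// Deleting from plasma is best-effort: if it fails, the objects just linger
// and secondary copies are evicted under memory pressure, so it is safe to
// proceed. Logged at WARNING so it doesn't surface in the driver terminal.
Status s = store_client_->Delete(object_ids);
if (!s.ok()) {
  RAY_LOG(WARNING) << "Failed to delete objects from plasma store: " << s
                   << ". This is non-fatal; unneeded copies will be evicted"
                   << " under memory pressure.";
}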
@jjyao jjyao enabled auto-merge (squash) October 21, 2025 18:04
@jjyao jjyao merged commit 53908c8 into ray-project:master Oct 21, 2025
6 checks passed
elliot-barn pushed a commit that referenced this pull request Oct 23, 2025
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Co-authored-by: Dhyey Shah <dhyey2019@gmail.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Co-authored-by: Dhyey Shah <dhyey2019@gmail.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Co-authored-by: Dhyey Shah <dhyey2019@gmail.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Aydin Abiar <aydin@anyscale.com>