
Conversation

@codope (Contributor) commented Oct 8, 2025:

Make FreeObjects non-fatal. Sometimes the RAY_CHECK fails due to IOError: No buffer space available, which is transient in most cases.

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
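For orientation, a condensed sketch of the shape of the change (hedged; the exact diff and final log level are settled later in this thread):

// Before: any failure from Delete crashes the raylet via the fatal check.
RAY_CHECK_OK(store_client_->Delete(object_ids));

// After (sketch): record the failure and continue.
Status s = store_client_->Delete(object_ids);
if (!s.ok()) {
  RAY_LOG(WARNING) << "Failed to delete objects from plasma store: " << s;
}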
@codope codope requested a review from a team as a code owner October 8, 2025 07:23

@gemini-code-assist (bot) left a comment:

Code Review

Thank you for this contribution. The change to make FreeObjects non-fatal and retryable is a great improvement to robustness. The implementation looks solid. I have a few suggestions to improve the retry logic's effectiveness and to enhance test coverage so that all paths of the new logic are validated.

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Comment on lines 342 to 345
if (!s.IsIOError()) {
  RAY_LOG(WARNING) << "Plasma delete failed (non-IOError), not retrying: " << s;
  return;
}
@codope (author):

Should we make this fatal and only keep IOError non-fatal and retryable?

@dayshah:

are other statuses possible?

@codope (author):

Potentially from the PlasmaClient::Delete code path: Status::Invalid(...) if delete request construction fails.

Comment on lines 336 to 337
absl::MutexLock lock(&pool_mutex_);
s = store_client_->Delete(object_ids);
@codope (author):

The lock was added in 3df1e1c but is it really required here? Per my understanding, FreeObjects doesn't read/modify ObjectBufferPool state (it only calls store_client_->Delete), and the plasma client is internally synchronized with its own mutex.

If we keep the lock, I would prefer NOT to lock across the entire retry loop; it'll serialize pool operations during the sleeps. Reacquiring pool_mutex_ only around each Delete keeps other pool operations (create/abort) unblocked during backoff. Wdyt @edoakes @dayshah?
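A condensed sketch of the narrower scoping proposed above (assuming the retry loop the PR had at this stage; max_attempts and backoff are placeholder names, and the snippet leans on the surrounding ObjectBufferPool context):

Status s;
for (int attempt = 0; attempt < max_attempts; ++attempt) {
  {
    // Hold pool_mutex_ only for the Delete call itself...
    absl::MutexLock lock(&pool_mutex_);
    s = store_client_->Delete(object_ids);
  }  // ...releasing it before any backoff sleep, so create/abort on the
     // pool stay unblocked while we wait.
  if (s.ok() || !s.IsIOError()) {
    break;
  }
  std::this_thread::sleep_for(backoff);  // placeholder backoff value
}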

@dayshah:

Ya looks right to me, it doesn't seem like this lock needs to be here.

@codope (author):

ok, I'll remove that in a separate PR


@codope codope requested review from dayshah and edoakes October 8, 2025 08:47
@ray-gardener ray-gardener bot added the core Issues that should be addressed in Ray Core label Oct 8, 2025
  }
  attempt++;
}
RAY_LOG(WARNING) << "Plasma delete failed after retries (non-fatal).";
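For context, the fragment above is the tail of the retry loop as it stood at this point; a condensed reconstruction pieced together from the snippets quoted in this thread (the attempt cap and backoff schedule are assumptions) might look like:

int attempt = 0;
Status s;
while (attempt < kMaxDeleteAttempts) {  // kMaxDeleteAttempts: assumed name
  s = store_client_->Delete(object_ids);
  if (s.ok()) {
    return;
  }
  if (!s.IsIOError()) {
    // Non-IOError statuses (e.g. Status::Invalid) won't go away on retry.
    RAY_LOG(WARNING) << "Plasma delete failed (non-IOError), not retrying: " << s;
    return;
  }
  // Transient IOError (e.g. ENOBUFS): back off and try again.
  std::this_thread::sleep_for(std::chrono::milliseconds(10 << attempt));
  attempt++;
}
RAY_LOG(WARNING) << "Plasma delete failed after retries (non-fatal).";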
Contributor:

So the original behavior was fatal, and I get that that's the intention of this PR (make it non-fatal and a warning). Do we know why it was important enough to be fatal previously? Are there any correctness implications to truly being unable to delete from plasma? Or was the original RAY_CHECK just muscle memory?

@codope (author) commented Oct 9, 2025:

I think it's mostly the latter. Earlier patterns used RAY_CHECK_OK around plasma ops broadly. The fatal check likely persisted as a generic "must succeed" guard rather than a correctness requirement for delete itself.

Delete is best-effort and idempotent. If it fails, objects just linger in the plasma store. Potential memory pressure and delayed reclamation, but no data corruption or API-level inconsistency. There are additional retries at higher layers:

  • Local: LocalObjectManager::FlushFreeObjects runs periodically and will reattempt frees.
  • Remote: ObjectManager::RetryFreeObjects handles broadcast failures.

@codope codope added the go add ONLY when ready to merge, run all tests label Oct 10, 2025
@dayshah left a comment:

I think there's a higher-level question here. When is an IOError on an IPC possible? AFAIK it should only happen on shutdown, and this seems more like a case where plasma shuts down before we stop trying to send the free IPC. If that's the case, maybe this free should just give up on an IOError; no need for the retries or anything.


@edoakes (Collaborator) commented Oct 10, 2025:

Sometimes the RAY_CHECK fails due to IOError: No buffer space available. This is transient in most cases.

What specifically causes this IOError to happen and why do we think it's transient in most cases?

@codope (author) commented Oct 16, 2025:

When is an IOError on an IPC possible? Should we just give up instead of retry?

What specifically causes this IOError to happen and why do we think it's transient in most cases?

@dayshah @edoakes My understanding is that IOError can happen during transport backpressure (aggressive shutdown could be one trigger). When I checked the logs, the raylet event showed a fatal error: errno: 55 (No buffer space available), i.e. ENOBUFS. The state dump shows many outstanding remote FreeObjects calls with high latency:

[state-dump]   ObjectManagerService.grpc_client.FreeObjects - 66881 total (0 active), Execution time: mean = 34.706 ms, total = 2321.194 s

That led me to believe it's transient socket buffer exhaustion. We also see a lot of node churn:

[2025-09-17 21:40:46,219] Node failure. node_id=7aadeaab...
[2025-09-17 21:40:46,220] Node failure. node_id=0bf3ef14...

and a lot of FreeObjects RPC timeouts around the node-churn window:

[2025-09-17 21:40:45,514] Send free objects request failed due to RPC Error message: recvmsg:Operation timed out

Delete is idempotent, and I think retrying with a short backoff allows buffers to drain, so it would be helpful in this case.

@dayshah commented Oct 16, 2025:

@codope so I'm pretty sure the actual reason for this is one level higher. We send way too many FreeObjects RPCs everywhere, and those in turn make the FreeObjects IPC. Per object freed, we'll send an RPC to every node in the cluster telling it to call free objects. Because of the excessive number of requests, the IPC socket buffer is probably going to fill up quite fast. @Sparks0219 is going to fix this.

Right now for this, imo we should just make it non-fatal and log an error; the retry logic isn't that necessary since we'll evict secondary copies under memory pressure anyway.
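To make the fan-out concrete, a toy calculation (all numbers invented for illustration; batching object IDs per node is one plausible shape of the fix mentioned above, not a description of the actual patch):

#include <cstdint>
#include <iostream>

int main() {
  // Hypothetical cluster; real numbers depend on the workload.
  const std::uint64_t num_nodes = 500;
  const std::uint64_t objects_freed = 10000;
  // One FreeObjects RPC per freed object per node, as described above.
  std::cout << "per-object fan-out: " << num_nodes * objects_freed << " RPCs\n";
  // One RPC per node carrying a batch of object IDs instead.
  std::cout << "batched per node:   " << num_nodes << " RPCs\n";
  return 0;
}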

@codope (author) commented Oct 17, 2025:

Right now for this, imo we should just make it non-fatal and log an error; the retry logic isn't that necessary since we'll evict secondary copies under memory pressure anyway.

Sounds good, I'll remove the retry logic then.

@codope codope changed the title [core] Make FreeObjects non-fatal and retryable [core] Make FreeObjects non-fatal Oct 17, 2025
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
codope and others added 2 commits October 17, 2025 13:23
Co-authored-by: Dhyey Shah <dhyey2019@gmail.com>
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
-  RAY_CHECK_OK(store_client_->Delete(object_ids));
+  Status s = store_client_->Delete(object_ids);
+  if (!s.ok()) {
+    RAY_LOG(ERROR) << "Failed to delete objects from plasma store: " << s;
Collaborator:

For the sake of posterity, let's extend the warning message and/or add a comment indicating why it's ok that we ignore this error and proceed (because secondary copies will be evicted as needed).

This will save someone from seeing this, thinking it's a mistake, and spending an hour reverse engineering the behavior.

Collaborator:

If it's ok to ignore this error, then shouldn't it be WARNING?

Collaborator:

Good point. Otherwise this will also log to the driver terminal.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
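Putting the two review suggestions together (extended message, WARNING rather than ERROR), the merged shape is plausibly the following sketch; the actual commit is authoritative:

// Deleting from plasma is best-effort: if it fails, the objects just linger
// and secondary copies are evicted under memory pressure, so it is safe to
// proceed. Logged at WARNING so it doesn't surface in the driver terminal.
Status s = store_client_->Delete(object_ids);
if (!s.ok()) {
  RAY_LOG(WARNING) << "Failed to delete objects from plasma store: " << s
                   << ". This is non-fatal; unneeded copies will be evicted"
                   << " under memory pressure.";
}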
@jjyao jjyao enabled auto-merge (squash) October 21, 2025 18:04
@jjyao jjyao merged commit 53908c8 into ray-project:master Oct 21, 2025
6 checks passed
elliot-barn pushed a commit that referenced this pull request Oct 23, 2025
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Co-authored-by: Dhyey Shah <dhyey2019@gmail.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Co-authored-by: Dhyey Shah <dhyey2019@gmail.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Co-authored-by: Dhyey Shah <dhyey2019@gmail.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Aydin Abiar <aydin@anyscale.com>