[core] Make FreeObjects non-fatal #57550
Conversation
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Code Review
Thank you for this contribution. The change to make FreeObjects non-fatal and retryable is a great improvement for robustness. The implementation looks solid. I have a few suggestions to improve the retry logic's effectiveness and to enhance test coverage to ensure all paths of the new logic are validated.
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
```cpp
if (!s.IsIOError()) {
  RAY_LOG(WARNING) << "Plasma delete failed (non-IOError), not retrying: " << s;
  return;
}
```
Should we make this fatal and only keep IOError non-fatal and retryable?
are other statuses possible?
Potentially. In the PlasmaClient::Delete code path, Status::Invalid(...) can be returned if constructing the delete request fails.
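For context, here is a rough sketch of how the retry path under discussion could fit together. It is illustrative only, not the PR's exact code; the constants kMaxRetries and kRetryDelayMs are hypothetical placeholders, and it assumes the surrounding FreeObjects context (store_client_, object_ids, RAY_LOG).

```cpp
// Illustrative sketch, not the PR's exact code. Needs <chrono> and <thread>.
constexpr int kMaxRetries = 3;      // hypothetical placeholder
constexpr int kRetryDelayMs = 100;  // hypothetical placeholder

Status s;
int attempt = 0;
while (attempt <= kMaxRetries) {
  s = store_client_->Delete(object_ids);
  if (s.ok()) {
    return;
  }
  // Anything other than a transient IOError (e.g. Status::Invalid from
  // request construction) is not worth retrying.
  if (!s.IsIOError()) {
    RAY_LOG(WARNING) << "Plasma delete failed (non-IOError), not retrying: " << s;
    return;
  }
  std::this_thread::sleep_for(std::chrono::milliseconds(kRetryDelayMs));
  attempt++;
}
RAY_LOG(WARNING) << "Plasma delete failed after retries (non-fatal).";
```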
```cpp
absl::MutexLock lock(&pool_mutex_);
s = store_client_->Delete(object_ids);
```
The lock was added in 3df1e1c but is it really required here? Per my understanding, FreeObjects doesn't read/modify ObjectBufferPool state (it only calls store_client_->Delete), and the plasma client is internally synchronized with its own mutex.
If we keep the lock, I would prefer to NOT lock across the entire retry loop. It'll serialize pool operations during the sleeps. Reacquiring pool_mutex_ only around each Delete keeps other pool operations (create/abort) unblocked during backoff. Wdyt @edoakes @dayshah ?
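For illustration, the narrower-scope variant would look roughly like the sketch below (hypothetical, reusing the placeholder constants from the earlier sketch; not the PR's code): the mutex is held only for each Delete call, and the backoff sleep happens outside it.

```cpp
// Hypothetical sketch: lock only around each Delete attempt.
Status s;
for (int attempt = 0; attempt <= kMaxRetries; attempt++) {
  {
    absl::MutexLock lock(&pool_mutex_);
    s = store_client_->Delete(object_ids);
  }
  if (s.ok() || !s.IsIOError()) {
    break;
  }
  // The sleep happens without holding pool_mutex_, so other pool
  // operations (create/abort) are not serialized behind the backoff.
  std::this_thread::sleep_for(std::chrono::milliseconds(kRetryDelayMs));
}
```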
Ya looks right to me, it doesn't seem like this lock needs to be here.
ok, i'll remove that in a separate PR
```cpp
  }
  attempt++;
}
RAY_LOG(WARNING) << "Plasma delete failed after retries (non-fatal).";
```
So the original behavior was fatal, and I get that the intention of this PR is to make it non-fatal and a warning. Do we know why it was important enough to be fatal previously? Are there any correctness implications to truly being unable to delete from plasma? Or was the original RAY_CHECK just muscle memory?
I think it's mostly the latter. Earlier patterns used RAY_CHECK_OK around plasma ops broadly. The fatal check likely persisted as a generic "must succeed" guard rather than a correctness requirement for delete itself.
Delete is best-effort and idempotent. If it fails, objects just linger in the plasma store. Potential memory pressure and delayed reclamation, but no data corruption or API-level inconsistency. There are additional retries at higher layers:
- Local: LocalObjectManager::FlushFreeObjects runs periodically and will reattempt frees.
- Remote: ObjectManager::RetryFreeObjects handles broadcast failures.
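As a rough illustration of why a single failed Delete is recoverable, here is a simplified, hypothetical version of that periodic-flush pattern (not the actual LocalObjectManager code; the class name, method names, and PlasmaClient usage are only indicative): failed frees simply stay buffered and are retried on the next tick.

```cpp
// Simplified, hypothetical flusher; not the real LocalObjectManager code.
class FreeObjectBuffer {
 public:
  void BufferFree(const std::vector<ObjectID> &ids) {
    pending_.insert(pending_.end(), ids.begin(), ids.end());
  }

  // Invoked periodically by a timer. A failed delete is non-fatal:
  // the ids stay buffered and the next tick reattempts the free.
  void Flush(plasma::PlasmaClient &client) {
    if (pending_.empty()) {
      return;
    }
    Status s = client.Delete(pending_);
    if (s.ok()) {
      pending_.clear();
    } else {
      RAY_LOG(WARNING) << "Free failed, keeping ids for the next flush: " << s;
    }
  }

 private:
  std::vector<ObjectID> pending_;
};
```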
I think there's a higher-level question here. When is an IOError on an IPC possible? AFAIK, it should only happen on shutdown, and this seems more like a case where plasma could shut down before we stop trying to send the free IPC. If that's the case, maybe this free should just give up on an IOError; no need for the retries or anything.
What specifically causes this IOError?
@dayshah @edoakes My understanding is that IOError can happen during transport backpressure (aggressive shutdown could be one trigger). When I checked the logs, the raylet event showed the fatal error, which led me to believe it's transient socket buffer exhaustion. We also see a lot of node churn, and a lot of FreeObjects RPC timeouts around the node-churn time. Delete is idempotent, and I think a retry with a short backoff allows the buffers to drain, so it would be helpful in this case.
@codope I'm pretty sure the actual reason for this is one level higher. We send way too many FreeObjects RPCs everywhere, which then issue the free-objects IPC. For every object freed we send an RPC to every node in the cluster telling it to call FreeObjects, so with that excessive number of requests the IPC socket buffer is probably going to fill up quite fast. @Sparks0219 is going to fix this. For now, IMO we should just make it non-fatal and log an error; the retry logic isn't really necessary since we'll evict secondary copies under memory pressure anyway.
Sounds good, I'll remove the retry logic then.
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Co-authored-by: Dhyey Shah <dhyey2019@gmail.com> Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
```diff
- RAY_CHECK_OK(store_client_->Delete(object_ids));
+ Status s = store_client_->Delete(object_ids);
+ if (!s.ok()) {
+   RAY_LOG(ERROR) << "Failed to delete objects from plasma store: " << s;
```
For the sake of posterity, let's extend the warning message and/or add a comment indicating why it's ok that we ignore this error and proceed (because secondary copies will be evicted as needed).
This will save someone from seeing this, thinking it's a mistake, and spending an hour reverse engineering the behavior.
if it's ok to ignore this error, then it should be WARNING?
Good point. Otherwise this will also log to the driver terminal.
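Putting both suggestions together, the call site would presumably end up something like the sketch below (the exact message wording is a guess, not the merged code):

```cpp
Status s = store_client_->Delete(object_ids);
if (!s.ok()) {
  // Non-fatal by design: freeing is best-effort, and secondary copies will
  // be evicted under memory pressure anyway. Use WARNING (not ERROR) so the
  // message does not also show up in the driver terminal.
  RAY_LOG(WARNING) << "Failed to delete objects from plasma store, ignoring "
                      "the error since objects will be evicted under memory "
                      "pressure if needed: " << s;
}
```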
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com> Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> Co-authored-by: Dhyey Shah <dhyey2019@gmail.com> Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com> Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com> Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> Co-authored-by: Dhyey Shah <dhyey2019@gmail.com> Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com> Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> Co-authored-by: Dhyey Shah <dhyey2019@gmail.com> Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com> Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Make FreeObjects non-fatal. Sometimes the RAY_CHECK fails with "IOError: No buffer space available", which is transient in most cases.