[core] Fix deadlock when cancelling stale requests on in-order actors #57746
Signed-off-by: dayshah <dhyey2019@gmail.com>
Code Review
This pull request effectively resolves a deadlock that occurred when canceling stale requests for in-order actors. The approach of replacing the mutex-guarded boolean stopping_ with a std::atomic<bool> is a clean and correct solution to the re-entrant lock problem. The changes in task_receiver.cc and task_receiver.h are well-implemented. Additionally, moving and extending the Python tests to cover the in-order execution case is a valuable addition that ensures the fix is properly verified. Overall, this is a solid improvement to the codebase's stability. I have one minor suggestion for code simplification.
…n-order actors (ray-project#57746) Signed-off-by: dayshah <dhyey2019@gmail.com>
…ray-project#57746) Signed-off-by: dayshah <dhyey2019@gmail.com>
…ray-project#57746) Signed-off-by: dayshah <dhyey2019@gmail.com> Signed-off-by: xgui <xgui@anyscale.com>
…#57746) Signed-off-by: dayshah <dhyey2019@gmail.com> Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
…ray-project#57746) Signed-off-by: dayshah <dhyey2019@gmail.com> Signed-off-by: Aydin Abiar <aydin@anyscale.com>
…ray-project#57746) Signed-off-by: dayshah <dhyey2019@gmail.com> Signed-off-by: Future-Outlier <eric901201@gmail.com>
Problem
Today we hold stop_mu while calling into ActorSchedulingQueue::Add. This calls into ActorSchedulingQueue::ScheduleRequests, which can potentially cancel a request and therefore call the cancel callback, which also tries to acquire stop_mu in the same call stack.
Cancel inside ActorSchedulingQueue::ScheduleRequests
ray/src/ray/core_worker/task_execution/actor_scheduling_queue.cc
Line 174 in c6b8c9f
cancel_callback grabbing stop_mu
ray/src/ray/core_worker/task_execution/task_receiver.cc
Line 176 in c6b8c9f
Grabbing stop_mu in TaskReceiver::HandleTask, which eventually leads into ActorSchedulingQueue::ScheduleRequests
ray/src/ray/core_worker/task_execution/task_receiver.cc
Line 195 in c6b8c9f
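For illustration, the call chain above can be reduced to a minimal single-threaded sketch. This is not the Ray code: ReentrantReceiver, CheckedMutex, AddToQueue and OnCancel are hypothetical stand-ins, and CheckedMutex throws on re-entrant acquisition instead of blocking forever (locking a std::mutex that the calling thread already owns is undefined behavior), so the problem is observable rather than a hang.

```cpp
#include <mutex>
#include <stdexcept>

// Hypothetical stand-in for a non-reentrant mutex. Single-threaded use only:
// it throws where a real std::mutex would self-deadlock.
class CheckedMutex {
 public:
  void lock() {
    if (held_) throw std::logic_error("re-entrant lock: would deadlock");
    held_ = true;
  }
  void unlock() { held_ = false; }

 private:
  bool held_ = false;
};

// Hypothetical reduction of the receiver/queue interaction described above.
struct ReentrantReceiver {
  CheckedMutex stop_mu;
  bool stopping = false;

  // Plays the role of the cancel callback: it takes stop_mu to read stopping.
  void OnCancel() {
    std::lock_guard<CheckedMutex> guard(stop_mu);
    (void)stopping;
  }

  // Plays the role of Add -> ScheduleRequests: a stale request is cancelled
  // synchronously, in the caller's stack.
  void AddToQueue(bool stale) {
    if (stale) OnCancel();
  }

  // Plays the role of HandleTask: holds stop_mu across the call into the queue.
  void HandleTask(bool stale) {
    std::lock_guard<CheckedMutex> guard(stop_mu);
    if (stopping) return;
    AddToQueue(stale);
  }
};

// Returns true if handling the request re-enters stop_mu on the same stack.
bool WouldDeadlock(bool stale) {
  ReentrantReceiver receiver;
  try {
    receiver.HandleTask(stale);
  } catch (const std::logic_error&) {
    return true;
  }
  return false;
}
```

In this sketch only the stale-request path re-enters the lock, which matches the description: the deadlock needs a cancellation to happen inside the scheduling call.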
Solution
The solution here is just to turn stopping_ into an atomic bool. The mutex only exists to protect this, so it can be removed and the cancel callback no longer needs a lock.
Extra
Renaming test_transient_error_retry to test_push_actor_task_failure, and moving it and test_update_object_location_batch_failure to test_core_worker_fault_tolerance.
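Returning to the fix described under Solution, the shape of the change can be sketched as follows. This is a simplified illustration, not the actual task_receiver.cc code: AtomicReceiver, AddToQueue, OnCancel and cancel_count are hypothetical names, and only the stopping_ flag corresponds to something named in this PR.

```cpp
#include <atomic>

// Hypothetical reduction of the receiver after the change: the flag is a
// std::atomic<bool>, so neither HandleTask nor the cancel callback takes a lock.
class AtomicReceiver {
 public:
  void Stop() { stopping_.store(true); }

  // Returns false if the receiver is stopping and the task was rejected.
  bool HandleTask(bool stale) {
    if (stopping_.load()) return false;
    AddToQueue(stale);  // may invoke OnCancel on this same stack
    return true;
  }

  int cancel_count() const { return cancel_count_; }

 private:
  // Plays the role of the cancel callback: reads the flag without locking,
  // so being called from inside HandleTask is safe.
  void OnCancel() {
    (void)stopping_.load();
    ++cancel_count_;
  }

  // Plays the role of Add -> ScheduleRequests cancelling a stale request.
  void AddToQueue(bool stale) {
    if (stale) OnCancel();
  }

  std::atomic<bool> stopping_{false};
  int cancel_count_ = 0;  // illustration only; not thread-safe
};
```

One design note on the sketch: an atomic flag makes each read and write safe, but it does not make "check the flag, then enqueue" a single atomic step the way holding a mutex across both would. Whether that difference matters depends on the real code paths, which this sketch does not model.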