Skip to content

Conversation

@dayshah
Copy link
Contributor

@dayshah dayshah commented Oct 15, 2025

Problem

Today we hold stop_mu while calling into ActorSchedulingQueue::Add. This calls into ActorSchedulingQueue::ScheduleRequests which can potentially cancel and therefore call the cancel callback which also tries to acquire stop_mu in the same call stack.

Cancel inside ActorSchedulingQueue::ScheduleRequests

cancel_callback grabbing stop_mu

absl::MutexLock lock(&stop_mu_);

Grabbing stop_mu in TaskReceiver::HandleTask which eventually leads into ActorSchedulingQueue::ScheduleRequests

absl::MutexLock lock(&stop_mu_);

Solution

The solution here is just to turn stopping_ into an atomic bool. The mutex only exists to protect this.

Extra

  • Renaming test_transient_error_retry to test_push_actor_task_failure and moving it and test_update_object_location_batch_failure to test_core_worker_fault_tolerance
  • out_of_order=False case for the push task failure test

Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
@dayshah dayshah requested a review from a team as a code owner October 15, 2025 20:07
@dayshah dayshah added the go add ONLY when ready to merge, run all tests label Oct 15, 2025
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request effectively resolves a deadlock that occurred when canceling stale requests for in-order actors. The approach of replacing the mutex-guarded boolean stopping_ with a std::atomic<bool> is a clean and correct solution to the re-entrant lock problem. The changes in task_receiver.cc and task_receiver.h are well-implemented. Additionally, moving and extending the Python tests to cover the in-order execution case is a valuable addition that ensures the fix is properly verified. Overall, this is a solid improvement to the codebase's stability. I have one minor suggestion for code simplification.

@dayshah dayshah enabled auto-merge (squash) October 15, 2025 21:37
@dayshah dayshah merged commit b64aa65 into ray-project:master Oct 15, 2025
6 checks passed
@dayshah dayshah deleted the fix-deadlock branch October 15, 2025 21:53
dayshah added a commit to dayshah/ray that referenced this pull request Oct 15, 2025
…n-order actors (ray-project#57746)

Signed-off-by: dayshah <dhyey2019@gmail.com>
aslonnie pushed a commit that referenced this pull request Oct 16, 2025
…n-order actors (#57746) (#57768)

## Description
Cherry picking #57746

Signed-off-by: dayshah <dhyey2019@gmail.com>
justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025
xinyuangui2 pushed a commit to xinyuangui2/ray that referenced this pull request Oct 22, 2025
…ray-project#57746)

Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: xgui <xgui@anyscale.com>
elliot-barn pushed a commit that referenced this pull request Oct 23, 2025
…#57746)

Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
…ray-project#57746)

Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
…ray-project#57746)

Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: Future-Outlier <eric901201@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants