[Core] Fix the issue where multiple multithreaded calls to ray.get may cause hanging. #54495

Catch-Bull · 2025-07-10T02:56:17Z

Why are these changes needed?

In the current implementation, each worker process has only one get request ID. When any thread in the worker completes its get operation, it clears the progress of all get requests in the entire worker. In a multi-threaded environment this easily wastes traffic, and it can even cause jobs with the following pattern to hang:

Thread A: Calls ray.get on a very large object, which usually takes more than 10 seconds.
Thread B: Periodically calls ray.get on a very small object, which typically completes within 1 second. When this get operation completes, it clears the progress of Thread A’s ray.get request, effectively causing Thread A to hang.
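
The failure pattern above can be illustrated with a small toy model. This is only a sketch under stated assumptions: `PerWorkerGetState` and `simulate` are hypothetical names invented for illustration, time is reduced to abstract ticks, and nothing here calls or mirrors Ray's actual code.

```python
# Toy model only: hypothetical names, not Ray's implementation.


class PerWorkerGetState:
    """One shared get-request slot per worker, as described above."""

    def __init__(self):
        # object_id -> units pulled so far, shared by all threads in the worker
        self.progress = {}

    def pull(self, object_id, units):
        self.progress[object_id] = self.progress.get(object_id, 0) + units

    def finish_get(self):
        # Completing any get clears the progress of every get in the worker.
        self.progress.clear()


def simulate(large_size=100, chunk=10, small_get_every=3, max_ticks=1000):
    """Return True if the large get finishes, False if it keeps being reset."""
    state = PerWorkerGetState()
    for tick in range(1, max_ticks + 1):
        state.pull("large", chunk)  # Thread A makes some progress
        if state.progress.get("large", 0) >= large_size:
            return True
        if tick % small_get_every == 0:
            state.pull("small", 1)  # Thread B's small get ...
            state.finish_get()      # ... completes and wipes Thread A's progress
    return False
```

With the default arguments the large get never accumulates enough progress between resets, so `simulate()` returns `False`; if the small get is rare enough (for example `small_get_every=1000`), the large get completes.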

Related issue number

Closes #54007

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

gemini-code-assist

Summary of Changes

Hello @Catch-Bull, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request delivers a crucial fix for ray.get operations in multi-threaded Ray applications. By introducing unique request identifiers for each ray.get call and updating the underlying communication and dependency management mechanisms, it ensures that concurrent object fetching requests within a single worker are properly isolated. This enhancement significantly improves the stability and reliability of Ray when dealing with multi-threaded workloads, preventing previously observed hanging issues.

Highlights

Problem Addressed: This pull request fixes a critical issue where ray.get operations in multi-threaded environments could lead to hangs. Previously, a worker's ray.get state was globally managed, causing one ray.get call to inadvertently cancel the progress of another concurrent ray.get operation within the same worker.
Core Solution: Per-Request IDs: The fundamental change introduces a unique request_id for each ray.get operation. This allows the Raylet's dependency manager to track and manage multiple concurrent ray.get requests from a single worker independently, preventing interference.
Communication Protocol Enhancements: The Raylet communication protocol (node_manager.fbs) has been updated. Messages like FetchOrReconstruct, NotifyUnblocked, and NotifyDirectCallTaskUnblocked now include this new request_id, enabling fine-grained control over object fetching and cancellation at the Raylet level.
Dependency Management Refinement: The DependencyManager in the Raylet has been significantly refactored. It now maintains a map of active get_requests per worker, keyed by their request_id, and tracks object dependencies with a reference count per worker, ensuring correct management of concurrent ray.get calls.
New Test Case: A new comprehensive test, test_ray_actor_get_in_multiple_threads, has been added. This test specifically validates the fix in a multi-threaded actor environment, simulating the problematic scenario of a long-running ray.get being interrupted by a short one.
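
As a rough illustration of the per-request bookkeeping summarized in the highlights above, here is a minimal sketch. `ToyDependencyManager` and its methods are hypothetical and are not the Raylet's `DependencyManager`; the sketch only assumes what the summary states, namely a map of get requests per worker keyed by request ID and a per-worker reference count for each object.

```python
from collections import defaultdict


class ToyDependencyManager:
    """Hypothetical sketch of per-request get tracking; not Ray's code."""

    def __init__(self):
        self._next_id = 1
        # worker_id -> {request_id -> set of object IDs}
        self.get_requests = defaultdict(dict)
        # worker_id -> {object_id -> number of get requests that need it}
        self.ref_counts = defaultdict(lambda: defaultdict(int))

    def start_get_request(self, worker_id, object_ids):
        request_id = self._next_id
        self._next_id += 1
        objects = set(object_ids)
        self.get_requests[worker_id][request_id] = objects
        for oid in objects:
            self.ref_counts[worker_id][oid] += 1
        return request_id

    def cancel_get_request(self, worker_id, request_id):
        """Cancel one request; return the objects the worker no longer needs."""
        released = set()
        for oid in self.get_requests[worker_id].pop(request_id, set()):
            self.ref_counts[worker_id][oid] -= 1
            if self.ref_counts[worker_id][oid] == 0:
                del self.ref_counts[worker_id][oid]
                released.add(oid)
        return released
```

In this model, cancelling the short request only releases objects that no other request from the same worker still needs, so a long-running request keeps its dependencies.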

gemini-code-assist

Code Review

This pull request introduces a solution to a ray.get hanging issue in multi-threaded environments by associating each get request with a unique ID. The changes are comprehensive, spanning from the core worker and raylet to the underlying RPC protocol, and are accompanied by a targeted test case that effectively reproduces the original problem. A potential null pointer dereference has been identified and needs to be addressed.

gemini-code-assist · 2025-07-10T02:58:44Z

src/ray/raylet/node_manager.cc

    const TaskID &current_task_id,
-    bool ray_get) {
+    bool ray_get,
+    uint64_t *get_request_id) {


The get_request_id parameter can be a nullptr if AsyncResolveObjects is called with its default value. However, dependency_manager_.StartOrUpdateGetRequest dereferences this pointer without a null check, which will lead to a crash if ray_get is true. While the current call sites seem to avoid this condition (e.g., ray_get is false when get_request_id is null), this is fragile and should be made more robust. Please add a check to ensure get_request_id is not null when ray_get is true.

Suggested change

    uint64_t *get_request_id) {
  if (ray_get) {
    RAY_CHECK(get_request_id) << "get_request_id must be provided for ray.get()";
    dependency_manager_.StartOrUpdateGetRequest(
        worker->WorkerId(), required_object_refs, get_request_id);
  } else {
    dependency_manager_.StartOrUpdateWaitRequest(worker->WorkerId(),
                                                 required_object_refs);

jjyao · 2025-07-11T15:42:20Z

@dayshah @israbbani could you review this one?

jjyao

Thanks for working on the fix

jjyao · 2025-07-18T03:50:38Z

src/ray/raylet/dependency_manager.cc

    }
-    get_request.second = new_request_id;
+
+    if (*request_id != new_request_id) {


when can they be the same?

Every time a GET request is first sent, the request_id passed in will be 0 (the type of this request ID is uint64_t, which cannot be null, so 0 is used to represent null), and the ID of the first pull request will also be 0.

I suddenly realized that there is an issue here. If a worker sends two GET requests simultaneously, the operations on the request with ID 0 would be unsafe. Therefore, it should be required that the pull request ID starts from 1.

My mistake. I thought the request id started from 0.
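
The invariant discussed in this thread (0 means "no request", real IDs start from 1) can be sketched as follows. `NO_REQUEST_ID` and `RequestIdAllocator` are hypothetical names used only for illustration; they are not part of Ray.

```python
import itertools
import threading

# Hypothetical sentinel: an unsigned ID cannot be null, so 0 stands for "no request".
NO_REQUEST_ID = 0


class RequestIdAllocator:
    """Hands out request IDs starting from 1, so 0 is never a real ID."""

    def __init__(self):
        self._counter = itertools.count(1)
        self._lock = threading.Lock()

    def next_id(self):
        # The lock keeps allocation safe when several threads request IDs.
        with self._lock:
            return next(self._counter)
```

Because the sentinel is never handed out, a check such as `request_id != NO_REQUEST_ID` is enough to tell a fresh get from a follow-up to an existing one.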

jjyao · 2025-07-18T03:53:51Z

src/ray/raylet/dependency_manager.h

  /// should be used to cancel the pull request in the object manager once the
  /// worker cancels the `ray.get` request.
-  absl::flat_hash_map<WorkerID, std::pair<absl::flat_hash_set<ObjectID>, uint64_t>>
+  absl::flat_hash_map<WorkerID,


Nice. Needs to update the comment above to reflect the latest

jjyao · 2025-07-18T03:54:46Z

src/ray/raylet/dependency_manager.h

  /// \param required_objects The objects required by the worker.
+  /// \param request_id The request ID.
  /// \return Void.
  void StartOrUpdateGetRequest(const WorkerID &worker_id,


When will we update an existing get request after this PR, shouldn't we also start a new pull with a new id?

In our current implementation, a single GET request is divided into k batches, and k FetchRequest messages are sent to the local raylet. Therefore, in terms of semantics, this update is actually a merge.

edoakes · 2025-07-23T16:18:37Z

@Catch-Bull can you help me understand the exact semantics of the request_id, both before and after your change?

Previously, this concept was not exposed to the worker at all. We are adding a lot of complexity and possibility for new bugs by adding it. This might be warranted, but I would like to see if there's an alternative solution that avoids this dependency if possible.

Catch-Bull · 2025-07-24T07:27:31Z

> @Catch-Bull can you help me understand the exact semantics of the request_id, both before and after your change?
>
> Previously, this concept was not exposed to the worker at all. We are adding a lot of complexity and possibility for new bugs by adding it. This might be warranted, but I would like to see if there's an alternative solution that avoids this dependency if possible.

@edoakes I don't think my PR changes the semantics of the request ID (I assume what we are discussing here is the get request ID). In fact, both the wait request ID and the get request ID are pull request IDs returned by the pull manager. The only difference is whether the pull request is triggered by ray.get or ray.wait.

Before this PR, a core worker had only one active get request ID. After this PR, each unfinished ray.get has its own active get request ID, so ray.get calls in different threads no longer interfere with each other.

The worker must hold the request ID because of the current implementation of CoreWorkerPlasmaStoreProvider::Get. It does not synchronously wait for the completion of all get requests. Instead, it divides the requests into several batches and waits for a shorter period of time (batch_timeout) each time. This means it sends multiple get requests, each of which tries to update the same get request, and it must hold the request ID in order to locate that get request.
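
The batching behavior described above can be sketched as a toy model. `ToyRaylet`, `batched_get`, and `NO_REQUEST_ID` are hypothetical names chosen for illustration and do not correspond to Ray's real classes or RPCs; the sketch only assumes that each batch is sent with the ID returned for the first batch, and that the raylet side merges the objects into that existing request.

```python
# Hypothetical sentinel meaning "no get request has been started yet".
NO_REQUEST_ID = 0


class ToyRaylet:
    """Hypothetical stand-in for the raylet side; not Ray's code."""

    def __init__(self):
        self._next_id = 1
        self.requests = {}  # request_id -> set of object IDs

    def fetch(self, object_ids, request_id=NO_REQUEST_ID):
        """Start a new get request, or merge objects into an existing one."""
        if request_id == NO_REQUEST_ID:
            request_id = self._next_id
            self._next_id += 1
            self.requests[request_id] = set()
        self.requests[request_id].update(object_ids)
        return request_id

    def cancel(self, request_id):
        self.requests.pop(request_id, None)


def batched_get(raylet, object_ids, batch_size):
    """Send one fetch per batch while reusing a single request ID."""
    request_id = NO_REQUEST_ID
    for start in range(0, len(object_ids), batch_size):
        batch = object_ids[start:start + batch_size]
        request_id = raylet.fetch(batch, request_id)
        # A real worker would wait up to a batch timeout for the objects here.
    return request_id
```

Two concurrent calls to `batched_get` end up with different request IDs in this model, so cancelling one leaves the other's entry untouched.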

edoakes · 2025-07-25T21:23:15Z

Thanks for the explanation, it's really helpful. Let me do a little bit of digging through the code to build up my context and see if there are any alternative ways to simplify the request ID logic in general.

edoakes · 2025-07-30T00:36:25Z

Haven't forgotten about this, just buried in the queue

edoakes

The PR basically LGTM, stylistic comments

edoakes · 2025-07-30T18:33:55Z

src/ray/core_worker/store_provider/memory_store/memory_store.cc

+    /// No get request needs to be canceled.
+    RAY_RETURN_NOT_OK(raylet_client_->NotifyDirectCallTaskUnblocked(
+        /*get_request_id=*/std::numeric_limits<uint64_t>::max()));


let's give the get_request_id argument a default value that corresponds to not canceling any get request

edoakes · 2025-07-30T18:34:48Z

src/ray/core_worker/store_provider/memory_store/memory_store.cc

  if (should_notify_raylet) {
-    RAY_CHECK_OK(raylet_client_->NotifyDirectCallTaskUnblocked());
+    /// No get request needs to be canceled.
+    RAY_RETURN_NOT_OK(raylet_client_->NotifyDirectCallTaskUnblocked(


IMO we should do some cleanup of naming around these RPC methods. This should be called something like AcquireResourcesAndCancelGetRequest.

I'll take a stab at cleaning it up after this PR

edoakes · 2025-07-30T18:40:10Z

src/ray/raylet/node_manager.cc

+    absl::flat_hash_set<uint64_t> get_request_ids;
+    if (message->get_request_id() < std::numeric_limits<uint64_t>::max()) {
+      get_request_ids.insert(message->get_request_id());
+    }


why is this a set given the size always seems to be 0 or 1?

edoakes · 2025-07-30T18:41:00Z

src/ray/raylet/node_manager.cc

    auto message = flatbuffers::GetRoot<protocol::NotifyUnblocked>(message_data);
-    AsyncResolveObjectsFinish(client, from_flatbuf<TaskID>(*message->task_id()));
+    absl::flat_hash_set<uint64_t> get_request_ids;
+    if (message->get_request_id() < std::numeric_limits<uint64_t>::max()) {


IMO it'd be more natural to use 0 as the indicator for no get request and ensure that 0 is never generated as a get request ID

edoakes · 2025-07-30T20:10:10Z

python/ray/tests/test_threaded_actor.py

+    ],
+    indirect=True,
+)
+def test_ray_actor_get_in_multiple_threads(ray_start_cluster_head):


this test doesn't belong in test_threaded_actor.py. it's not actually even using a threaded actor at the moment, it's starting its own thread instead

maybe we can introduce a new file called test_api_thread_safety.py so we can centralize similar fixes in the future there

also, I believe the bug is not specific to usage from an actor. this can also happen from a driver. should we test from both?

edoakes · 2025-07-30T22:44:38Z

python/ray/tests/test_threaded_actor.py

+            # put into worker plasma. Ensure that each get returns immediately.
+            # Before this fix, this get would repeatedly clear the pull progress.


I think there is also a more deterministic way to write this test, roughly:

1. Have 3 nodes (head, worker 1, worker 2).
2. Run a task on worker 1 to create object 1, run a task on worker 2 to create object 2.
3. ray.wait(fetch_local=False) for both objects to be created
4. Remove node 1, which will lose the primary copy of object 1
5. Call ray.get(object_1) in thread 1
6. Call ray.get(object_2) in thread 2, let it complete
7. Create a new version of node 1, which will allow object 1 to be reconstructed. This would hang without your change and should work with it.

edoakes · 2025-07-31T22:21:46Z

@Catch-Bull I am going to work on cleaning up some of the naming around this area, first PR here: #55081

Will cause merge conflicts here, lmk if you want me to wait (depending on how long you think it'll take until you can address review)

Catch-Bull · 2025-08-01T09:30:55Z

> @Catch-Bull I am going to work on cleaning up some of the naming around this area, first PR here: #55081
>
> Will cause merge conflicts here, lmk if you want me to wait (depending on how long you think it'll take until you can address review)

@edoakes I'm out of office at the moment, so perhaps you could merge your PR first. I'll revise this one according to your comments when I'm back at work next week, then rebase onto master.

edoakes · 2025-08-03T18:57:13Z

> @edoakes I'm out of office at the moment, so perhaps you could merge your PR first. I'll revise this one according to your comments when I'm back at work next week, then rebase onto master.

Sounds good. If it gets gnarly I can help fix the conflicts too.

The `NotifyUnblocked` naming is legacy from before Ray 1.0. Also removed `task_id` from various places that we didn't need it. Note there is an ongoing bugfix to cancel only the specific get request instead of all requests for the worker: #54495 --------- Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>

github-actions · 2025-08-18T00:42:50Z

This pull request has been automatically marked as stale because it has not had any activity for 14 days. It will be closed in another 14 days if no further activity occurs. Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

edoakes · 2025-08-19T16:12:39Z

@Catch-Bull just checking in, are you still planning to pick this back up?

Catch-Bull · 2025-08-22T11:06:45Z

> @Catch-Bull just checking in, are you still planning to pick this back up?

@edoakes Sorry, I've been a bit busy lately, so this PR got delayed. I've already rebased it. Could you take another look?

edoakes · 2025-08-26T20:37:05Z

@israbbani is going to take over reviewing (there are a few other related issues he's working on too)

danielgafni · 2025-09-11T10:34:54Z

nevermind, it was my bad

israbbani · 2025-09-15T23:24:03Z

@Catch-Bull I'll pick this up this week. I'll let you know once I've read through the description and the PR.

israbbani · 2025-09-18T14:39:41Z

@Catch-Bull I'm ready to review this. It looks like there are merge conflicts. Can you merge master into the branch and we can kick off CI?

edoakes · 2025-11-06T16:45:54Z

replaced by: #57911

Catch-Bull force-pushed the fix_ray_get branch from d780027 to 340a6bd July 10, 2025 02:56

gemini-code-assist bot reviewed Jul 10, 2025

Catch-Bull self-assigned this Jul 10, 2025

Catch-Bull requested a review from jjyao July 10, 2025 02:57

Catch-Bull added the go label (add ONLY when ready to merge, run all tests) Jul 10, 2025

gemini-code-assist bot reviewed Jul 10, 2025

cszhu added the community-contribution (Contributed by the community) and core (Issues that should be addressed in Ray Core) labels Jul 10, 2025

Catch-Bull changed the title from "[Core] Fixed the issue of hanging when executing ray.get in a multi-threaded environment." to "[Core] Fix the issue where multiple multithreaded calls to ray.get may cause hanging." Jul 11, 2025

jjyao assigned israbbani and dayshah Jul 11, 2025

jjyao reviewed Jul 18, 2025

Catch-Bull requested a review from a team as a code owner July 18, 2025 09:56

jjyao assigned edoakes and unassigned dayshah Jul 21, 2025

edoakes reviewed Jul 23, 2025: src/ray/core_worker/store_provider/plasma_store_provider.cc (outdated, resolved)

edoakes reviewed Jul 30, 2025

edoakes mentioned this pull request Aug 4, 2025: [core] Rename NotifyUnblocked to CancelGetRequest #55081 (Merged)

github-actions bot added the stale label (The issue is stale. It will be closed within 7 days unless there are further conversation) Aug 18, 2025

edoakes added the unstale label (A PR that has been marked unstale. It will not get marked stale again if this label is on it.) and removed the stale label Aug 18, 2025

hejialing.hjl added 3 commits August 22, 2025 17:02: 040a8cf, b324a8a, eff3bf5 (each titled "save"; Signed-off-by: hejialing.hjl <hejialing.hjl@bytedance.com>)

Catch-Bull force-pushed the fix_ray_get branch from 33a9c81 to 5284b83 August 22, 2025 11:05

Commit 2c206a0, titled "save" (Signed-off-by: hejialing.hjl <hejialing.hjl@bytedance.com>)

Catch-Bull force-pushed the fix_ray_get branch from 5284b83 to 2c206a0 August 25, 2025 06:49

danielgafni mentioned this pull request Sep 11, 2025: [Core] ray.get hangs inside actors with distributed Ray #56452 (Closed)

edoakes closed this Nov 6, 2025
