
Conversation

@Catch-Bull (Contributor)

Why are these changes needed?

In the current implementation, a worker process corresponds to only one get request ID. When any thread in the worker completes its get operation, it clears the progress of all get requests in the entire worker. This can easily waste network traffic in a multi-threaded environment and may even cause jobs with the following pattern to hang (a minimal reproduction sketch follows the list):

  • Thread A: Calls ray.get on a very large object, which usually takes more than 10 seconds.
  • Thread B: Periodically calls ray.get on a very small object, which typically completes within 1 second. When this get operation completes, it clears the progress of Thread A’s ray.get request, effectively causing Thread A to hang.
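
A minimal reproduction sketch of this pattern (sizes, timings, and the single-node setup are illustrative; in the original report the large object is fetched over the network):

    import threading
    import time

    import numpy as np
    import ray

    ray.init()

    @ray.remote
    def make_large():
        return np.zeros(10**9, dtype=np.uint8)  # ~1 GB, slow to fetch

    @ray.remote
    def make_small():
        return 1

    large_ref = make_large.remote()

    def thread_a():
        # Long-running get; before the fix, its fetch progress could be
        # cleared whenever any other get in the same worker completed.
        ray.get(large_ref)

    def thread_b():
        # Short, frequent gets; before the fix, each completion reset the
        # worker-wide get state and effectively hung thread A.
        for _ in range(30):
            ray.get(make_small.remote())
            time.sleep(0.5)

    a = threading.Thread(target=thread_a)
    b = threading.Thread(target=thread_b)
    a.start()
    b.start()
    a.join()  # hung before the fix
    b.join()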

Related issue number

Closes #54007

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@gemini-code-assist (bot) left a comment

Summary of Changes

Hello @Catch-Bull, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request delivers a crucial fix for ray.get operations in multi-threaded Ray applications. By introducing unique request identifiers for each ray.get call and updating the underlying communication and dependency management mechanisms, it ensures that concurrent object fetching requests within a single worker are properly isolated. This enhancement significantly improves the stability and reliability of Ray when dealing with multi-threaded workloads, preventing previously observed hanging issues.

Highlights

  • Problem Addressed: This pull request fixes a critical issue where ray.get operations in multi-threaded environments could lead to hangs. Previously, a worker's ray.get state was globally managed, causing one ray.get call to inadvertently cancel the progress of another concurrent ray.get operation within the same worker.
  • Core Solution: Per-Request IDs: The fundamental change introduces a unique request_id for each ray.get operation. This allows the Raylet's dependency manager to track and manage multiple concurrent ray.get requests from a single worker independently, preventing interference.
  • Communication Protocol Enhancements: The Raylet communication protocol (node_manager.fbs) has been updated. Messages like FetchOrReconstruct, NotifyUnblocked, and NotifyDirectCallTaskUnblocked now include this new request_id, enabling fine-grained control over object fetching and cancellation at the Raylet level.
  • Dependency Management Refinement: The DependencyManager in the Raylet has been significantly refactored. It now maintains a map of active get requests per worker, keyed by request_id, and tracks object dependencies with a per-worker reference count, ensuring correct management of concurrent ray.get calls (modeled in the sketch after this list).
  • New Test Case: A new comprehensive test, test_ray_actor_get_in_multiple_threads, has been added. This test specifically validates the fix in a multi-threaded actor environment, simulating the problematic scenario of a long-running ray.get being interrupted by a short one.
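
A hedged Python model of the per-request bookkeeping described above (names are illustrative; the real implementation is C++ in the Raylet's DependencyManager):

    from collections import defaultdict

    class DependencyManagerModel:
        def __init__(self):
            # worker_id -> {get_request_id: set of object IDs}
            self.get_requests = defaultdict(dict)
            # worker_id -> {object_id: reference count}
            self.object_refcounts = defaultdict(lambda: defaultdict(int))

        def start_or_update_get_request(self, worker_id, request_id, object_ids):
            objs = self.get_requests[worker_id].setdefault(request_id, set())
            for oid in object_ids:
                if oid not in objs:
                    objs.add(oid)
                    self.object_refcounts[worker_id][oid] += 1

        def cancel_get_request(self, worker_id, request_id):
            # Only this request's objects are decremented; fetches still
            # needed by other concurrent ray.get calls keep a nonzero count.
            for oid in self.get_requests[worker_id].pop(request_id, set()):
                self.object_refcounts[worker_id][oid] -= 1
                if self.object_refcounts[worker_id][oid] == 0:
                    del self.object_refcounts[worker_id][oid]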

@Catch-Bull self-assigned this Jul 10, 2025
@Catch-Bull requested a review from jjyao July 10, 2025 02:57
@Catch-Bull added the go label (add ONLY when ready to merge, run all tests) Jul 10, 2025
@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a solution to a ray.get hanging issue in multi-threaded environments by associating each get request with a unique ID. The changes are comprehensive, spanning from the core worker and raylet to the underlying RPC protocol, and are accompanied by a targeted test case that effectively reproduces the original problem. A potential null pointer dereference has been identified and needs to be addressed.

      const TaskID &current_task_id,
-     bool ray_get) {
+     bool ray_get,
+     uint64_t *get_request_id) {
@gemini-code-assist (bot) commented (critical):

The get_request_id parameter can be a nullptr if AsyncResolveObjects is called with its default value. However, dependency_manager_.StartOrUpdateGetRequest dereferences this pointer without a null check, which will lead to a crash if ray_get is true. While the current call sites seem to avoid this condition (e.g., ray_get is false when get_request_id is null), this is fragile and should be made more robust. Please add a check to ensure get_request_id is not null when ray_get is true.

Suggested change:

    uint64_t *get_request_id) {
      if (ray_get) {
        RAY_CHECK(get_request_id) << "get_request_id must be provided for ray.get()";
        dependency_manager_.StartOrUpdateGetRequest(
            worker->WorkerId(), required_object_refs, get_request_id);
      } else {
        dependency_manager_.StartOrUpdateWaitRequest(worker->WorkerId(),
                                                     required_object_refs);
@cszhu added the community-contribution (Contributed by the community) and core (Issues that should be addressed in Ray Core) labels Jul 10, 2025
@Catch-Bull changed the title from "[Core] Fixed the issue of hanging when executing ray.get in a multi-threaded environment." to "[Core] Fix the issue where multiple multithreaded calls to ray.get may cause hanging." Jul 11, 2025
@jjyao (Collaborator) commented Jul 11, 2025

@dayshah @israbbani could you review this one?

@jjyao (Collaborator) left a comment

Thanks for working on the fix

}
get_request.second = new_request_id;

if (*request_id != new_request_id) {
@jjyao (Collaborator):

when can they be the same?

@Catch-Bull (Contributor, Author):

Every time a GET request is sent, the request_id passed in will be 0 (since the type of this request ID is uint64_t, which cannot be null, 0 is used to represent "no request"), and the ID of the first pull request will also be 0.

I suddenly realized that there is an issue here: if a worker sends two GET requests simultaneously, the operations on the request with ID 0 would be unsafe. Therefore, pull request IDs should be required to start from 1 (sketched below).
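
A minimal sketch of that convention (names are hypothetical, not from the PR):

    import itertools

    # 0 is the "no request yet" sentinel, so real get-request IDs start at 1.
    NO_GET_REQUEST_ID = 0
    _next_get_request_id = itertools.count(start=1)

    def new_get_request_id() -> int:
        return next(_next_get_request_id)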

@Catch-Bull (Contributor, Author):

My mistake. I thought the request id started from 0.

  /// should be used to cancel the pull request in the object manager once the
  /// worker cancels the `ray.get` request.
- absl::flat_hash_map<WorkerID, std::pair<absl::flat_hash_set<ObjectID>, uint64_t>>
+ absl::flat_hash_map<WorkerID,
@jjyao (Collaborator):

Nice. The comment above needs to be updated to reflect the latest changes.

@Catch-Bull (Contributor, Author):

done.

/// \param required_objects The objects required by the worker.
/// \param request_id The request ID.
/// \return Void.
void StartOrUpdateGetRequest(const WorkerID &worker_id,
@jjyao (Collaborator):

When will we update an existing get request after this PR? Shouldn't we also start a new pull with a new ID?

@Catch-Bull (Contributor, Author):

In our current implementation, a single GET request is divided into k batches, sending k FetchRequests to the local raylet. Therefore, this update is semantically a merge (sketched below).
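
A hedged sketch of that batching (names and call shapes are assumptions):

    # One logical ray.get splits its objects into k batches; every fetch
    # carrying the same get_request_id merges into ("updates") the existing
    # get request on the raylet side instead of starting a new one.
    def fetch_in_batches(raylet, object_ids, get_request_id, batch_size=100):
        for i in range(0, len(object_ids), batch_size):
            batch = object_ids[i : i + batch_size]
            raylet.fetch_or_reconstruct(batch, get_request_id)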

@Catch-Bull requested a review from a team as a code owner July 18, 2025 09:56
@jjyao assigned edoakes and unassigned dayshah Jul 21, 2025
@edoakes (Collaborator) commented Jul 23, 2025

@Catch-Bull can you help me understand the exact semantics of the request_id, both before and after your change?

Previously, this concept was not exposed to the worker at all. We are adding a lot of complexity and possibility for new bugs by adding it. This might be warranted, but I would like to see if there's an alternative solution that avoids this dependency if possible.

@Catch-Bull (Contributor, Author) commented Jul 24, 2025

@Catch-Bull can you help me understand the exact semantics of the request_id, both before and after your change?

Previously, this concept was not exposed to the worker at all. We are adding a lot of complexity and possibility for new bugs by adding it. This might be warranted, but I would like to see if there's an alternative solution that avoids this dependency if possible.

@edoakes I don't think my PR changes the semantics of the request ID (what we are discussing here is the get request ID). In fact, both the wait request ID and the get request ID are pull request IDs returned by the pull manager; the only difference is whether the pull request is triggered by ray.get or ray.wait.

Before this PR, a core worker would only have one active get request ID. After the PR, each unfinished ray.get will have its own active get request ID. In this way, ray.get calls in different threads will no longer interfere with each other.

The reason the worker must hold the request ID is the current implementation of CoreWorkerPlasmaStoreProvider::Get: it does not synchronously wait for all get requests to complete. Instead, it divides the requests into several batches, each time waiting for a shorter period (batch_timeout). This means it sends multiple fetch requests that modify the same underlying get request, and to locate that request it must hold the request ID (a sketch follows).
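
An illustrative Python model of that loop (the real code is C++ in CoreWorkerPlasmaStoreProvider::Get; call shapes are assumptions):

    def get_with_batches(raylet, plasma, object_ids, batch_timeout_s=1.0):
        request_id = 0  # 0 = no request yet; the raylet assigns the real ID
        remaining = set(object_ids)
        while remaining:
            # Re-sending with the previously returned ID updates the existing
            # get request rather than creating a new one, which is why the
            # worker must hold the request ID across iterations.
            request_id = raylet.fetch_or_reconstruct(list(remaining), request_id)
            ready = plasma.wait(remaining, timeout=batch_timeout_s)
            remaining -= set(ready)
        raylet.cancel_get_request(request_id)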

@edoakes (Collaborator) commented Jul 25, 2025

Thanks for the explanation, it's really helpful. Let me do a little bit of digging through the code to build up my context and see if there are any alternative ways to simplify the request ID logic in general.

@edoakes (Collaborator) commented Jul 30, 2025

Haven't forgotten about this, just buried in the queue

@edoakes (Collaborator) left a comment

The PR basically LGTM, stylistic comments

Comment on lines 388 to 384
/// No get request needs to be canceled.
RAY_RETURN_NOT_OK(raylet_client_->NotifyDirectCallTaskUnblocked(
/*get_request_id=*/std::numeric_limits<uint64_t>::max()));
@edoakes (Collaborator):

let's give the get_request_id argument a default value that corresponds to not canceling any get request

  if (should_notify_raylet) {
-   RAY_CHECK_OK(raylet_client_->NotifyDirectCallTaskUnblocked());
+   /// No get request needs to be canceled.
+   RAY_RETURN_NOT_OK(raylet_client_->NotifyDirectCallTaskUnblocked(
@edoakes (Collaborator):

IMO we should do some cleanup of naming around these RPC methods. This should be called something like AcquireResourcesAndCancelGetRequest.

I'll take a stab at cleaning it up after this PR

Comment on lines +1049 to +1053
  absl::flat_hash_set<uint64_t> get_request_ids;
  if (message->get_request_id() < std::numeric_limits<uint64_t>::max()) {
    get_request_ids.insert(message->get_request_id());
  }
@edoakes (Collaborator):

why is this a set given the size always seems to be 0 or 1?

auto message = flatbuffers::GetRoot<protocol::NotifyUnblocked>(message_data);
AsyncResolveObjectsFinish(client, from_flatbuf<TaskID>(*message->task_id()));
absl::flat_hash_set<uint64_t> get_request_ids;
if (message->get_request_id() < std::numeric_limits<uint64_t>::max()) {
@edoakes (Collaborator):

IMO it'd be more natural to use 0 as the indicator for no get request and ensure that 0 is never generated as a get request ID

],
indirect=True,
)
def test_ray_actor_get_in_multiple_threads(ray_start_cluster_head):
@edoakes (Collaborator):

this test doesn't belong in test_threaded_actor.py. it's not actually even using a threaded actor at the moment, it's starting its own thread instead

@edoakes (Collaborator):

maybe we can introduce a new file called test_api_thread_safety.py so we can centralize similar fixes in the future there

@edoakes (Collaborator):

also, I believe the bug is not specific to usage from an actor. this can also happen from a driver. should we test from both?

Comment on lines 322 to 323
# put into worker plasma. Ensure that each get returns immediately.
# Before this fix, this get would repeatedly clear the pull progress.
@edoakes (Collaborator):

I think there is also a more deterministic way to write this test, roughly (a sketch follows the list):

  • Have 3 nodes (head, worker 1, worker 2).
  • Run a task on worker 1 to create object 1, run a task on worker 2 to create object 2.
  • ray.wait(fetch_local=False) for both objects to be created
  • Remove node 1, which will lose the primary copy of object 1
  • Call ray.get(object_1) in thread 1
  • Call ray.get(object_2) in thread 2, let it complete
  • Create a new version of node 1, which will allow object 1 to be reconstructed. This would hang without your change and should work with it.
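
A hedged sketch of that recipe using Ray's ray_start_cluster test fixture (helper names, sizes, and timeouts are assumptions, not the PR's actual test):

    import threading

    import ray

    def test_concurrent_get_with_reconstruction(ray_start_cluster):
        cluster = ray_start_cluster
        cluster.add_node(num_cpus=0)  # head node
        ray.init(address=cluster.address)
        node1 = cluster.add_node(resources={"node1": 1})
        cluster.add_node(resources={"node2": 1})

        @ray.remote(max_retries=-1)
        def create():
            return b"x" * (1024 * 1024)  # large enough to live in plasma

        obj1 = create.options(resources={"node1": 0.1}).remote()
        obj2 = create.options(resources={"node2": 0.1}).remote()
        ray.wait([obj1, obj2], num_returns=2, fetch_local=False)

        # Lose the primary copy of object 1; ray.get(obj1) must now wait
        # for lineage reconstruction on a replacement node.
        cluster.remove_node(node1, allow_graceful=False)

        t1 = threading.Thread(target=ray.get, args=(obj1,))
        t1.start()
        ray.get(obj2)  # completes quickly; must not cancel thread 1's fetch

        # Add a replacement node so object 1 can be reconstructed. Without
        # the fix this join times out; with it, the get completes.
        cluster.add_node(resources={"node1": 1})
        t1.join(timeout=60)
        assert not t1.is_alive()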

@edoakes (Collaborator) commented Jul 31, 2025

@Catch-Bull I am going to work on cleaning up some of the naming around this area, first PR here: #55081

Will cause merge conflicts here, lmk if you want me to wait (depending on how long you think it'll take until you can address review)

@Catch-Bull (Contributor, Author) commented Aug 1, 2025

@Catch-Bull I am going to work on cleaning up some of the naming around this area, first PR here: #55081

Will cause merge conflicts here, lmk if you want me to wait (depending on how long you think it'll take until you can address review)
@edoakes I'm OOO at the moment; perhaps you could merge your PR first. I'll revise this one according to your comments when I'm back at work next week, then rebase onto master.

@edoakes (Collaborator) commented Aug 3, 2025

@edoakes I'm OOO at the moment; perhaps you could merge your PR first. I'll revise this one according to your comments when I'm back at work next week, then rebase onto master.

Sounds good. If it gets gnarly I can help fix the conflicts too.

edoakes added a commit that referenced this pull request Aug 4, 2025
The `NotifyUnblocked` naming is legacy from before Ray 1.0.

Also removed `task_id` from various places that we didn't need it.

Note there is an ongoing bugfix to cancel only the specific get request
instead of all requests for the worker:
#54495

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
kamil-kaczmarek pushed a commit that referenced this pull request Aug 4, 2025 (same commit message as above).
mjacar pushed a commit to mjacar/ray that referenced this pull request Aug 5, 2025 (same commit message as above).
elliot-barn pushed a commit that referenced this pull request Aug 5, 2025 (same commit message as above).
sampan-s-nayak pushed a commit that referenced this pull request Aug 12, 2025 (same commit message as above).
@github-actions (bot)

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions (bot) added the stale label (The issue is stale. It will be closed within 7 days unless there is further conversation.) Aug 18, 2025
@edoakes added the unstale label (A PR that has been marked unstale. It will not get marked stale again if this label is on it.) and removed the stale label Aug 18, 2025
@edoakes (Collaborator) commented Aug 19, 2025

@Catch-Bull just checking in, are you still planning to pick this back up?

hejialing.hjl added 3 commits (each titled "save") August 22, 2025 17:02
@Catch-Bull (Contributor, Author):

@Catch-Bull just checking in, are you still planning to pick this back up?

@edoakes Sorry, I've been a bit busy lately, so this PR got delayed. I've already rebased it. Could you take another look?

@edoakes (Collaborator) commented Aug 26, 2025

@israbbani is going to take over reviewing (there are a few other related issues he's working on too)

@danielgafni commented Sep 11, 2025

nevermind, it was my bad

jugalshah291 pushed a commit to jugalshah291/ray_fork that referenced this pull request Sep 11, 2025 (same commit message as above).
@israbbani (Contributor):

@Catch-Bull I'll pick this up this week. I'll let you know once I've read through the description and the PR.

@israbbani (Contributor):

@Catch-Bull I'm ready to review this. It looks like there are merge conflicts. Can you merge master into the branch and we can kick off CI?

dstrodtman pushed a commit to dstrodtman/ray that referenced this pull request Oct 6, 2025 (same commit message as above).
@edoakes (Collaborator) commented Nov 6, 2025

replaced by: #57911

@edoakes closed this Nov 6, 2025

Labels

community-contribution (Contributed by the community), core (Issues that should be addressed in Ray Core), go (add ONLY when ready to merge, run all tests), unstale (A PR that has been marked unstale. It will not get marked stale again if this label is on it.)


Development

Successfully merging this pull request may close these issues.

[Core] Multi-threaded ray.get can hang in certain situations.
