@edoakes edoakes commented Oct 30, 2025

The `metrics_agent_client_` depends on `client_call_manager_`, but it previously obtained a reference to it from the core worker, which is not guaranteed to outlive the agent client.

This PR keeps `client_call_manager_` as a field of the `core_worker_process` instead.
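
The ownership change described above can be sketched roughly as follows. This is a minimal illustration, not the actual Ray code: `CallManager`, `AgentClient`, `Worker`, `Process`, and `CallsAfterWorkerShutdown` are hypothetical stand-ins, and the real classes carry far more state.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for the RPC call manager.
struct CallManager {
  int calls_issued = 0;
  void Issue() { ++calls_issued; }
};

// The client only holds a reference, so it never owns the manager.
class AgentClient {
 public:
  explicit AgentClient(CallManager &manager) : manager_(manager) {}
  void Report() { manager_.Issue(); }

 private:
  CallManager &manager_;
};

// Stand-in for the worker: in this sketch it only borrows the manager.
class Worker {
 public:
  explicit Worker(CallManager &manager) : manager_(manager) {}
  void DoWork() { manager_.Issue(); }

 private:
  CallManager &manager_;
};

// Stand-in for the process: it owns the manager and hands out references.
class Process {
 public:
  Process()
      : call_manager_(std::make_unique<CallManager>()),
        agent_client_(std::make_unique<AgentClient>(*call_manager_)),
        worker_(std::make_unique<Worker>(*call_manager_)) {}

  void RunWorker() {
    if (worker_) worker_->DoWork();
  }
  // Tearing down the worker does not invalidate the client's reference,
  // because the manager is owned by the process, not by the worker.
  void ShutdownWorker() { worker_.reset(); }
  void ReportMetrics() { agent_client_->Report(); }
  int CallsIssued() const { return call_manager_->calls_issued; }

 private:
  std::unique_ptr<CallManager> call_manager_;  // declared first, destroyed last
  std::unique_ptr<AgentClient> agent_client_;
  std::unique_ptr<Worker> worker_;
};

// Issues one call from the worker, destroys the worker, then issues one
// call from the client, and returns the total seen by the manager.
int CallsAfterWorkerShutdown() {
  Process process;
  process.RunWorker();
  process.ShutdownWorker();
  process.ReportMetrics();
  return process.CallsIssued();
}
```

In this sketch the client can still report after the worker is gone. Had the manager been a member of `Worker`, the same call sequence would have used a dangling reference.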

I think we may also need to drain any ongoing RPCs from the metrics_agent_client_ on shutdown. Leaving that for a future PR.

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
@edoakes edoakes requested a review from a team as a code owner October 30, 2025 13:01
@edoakes edoakes added the go add ONLY when ready to merge, run all tests label Oct 30, 2025

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request refactors the ownership of client_call_manager_ by moving it from CoreWorker to CoreWorkerProcessImpl. This is a good change that correctly ensures client_call_manager_ outlives metrics_agent_client_, fixing a potential use-after-free bug. The changes are mostly correct, but I found a critical issue where std::unique_ptrs are passed to a function expecting references, which will likely cause a compilation error. I also found a placeholder comment that should be improved for clarity.
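
For context on the kind of mismatch the review describes, here is a small generic sketch of passing an owned object to a function that takes a reference. It is not the code under review: `Manager`, `Describe`, and `DescribeOwned` are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical type standing in for an object held by unique_ptr.
struct Manager {
  std::string name;
};

// Takes a reference: the callee borrows, the caller keeps ownership.
std::string Describe(const Manager &manager) {
  return "manager:" + manager.name;
}

std::string DescribeOwned() {
  auto owned = std::make_unique<Manager>(Manager{"rpc"});
  // Describe(owned);  // would not compile: a std::unique_ptr<Manager>
  //                   // does not convert to const Manager&
  return Describe(*owned);  // dereference to obtain the reference
}
```

Dereferencing leaves ownership with the `std::unique_ptr`, so the referenced object must still outlive every holder of the reference.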

@ray-gardener ray-gardener bot added the core Issues that should be addressed in Ray Core label Oct 30, 2025
cursor[bot]

This comment was marked as outdated.

Comment on lines +183 to +184
/// The core worker instance of this worker process.
MutexProtected<std::shared_ptr<CoreWorker>> core_worker_;

moved so that CoreWorker is destroyed before ClientCallManager
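
The reordering relies on the C++ rule that non-static data members are destroyed in reverse order of declaration. A minimal sketch of that rule, using hypothetical `...Sketch` types rather than the real Ray classes:

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Records destructor calls so the order can be inspected.
std::vector<std::string> &DestructionLog() {
  static std::vector<std::string> log;
  return log;
}

struct ClientCallManagerSketch {
  ~ClientCallManagerSketch() { DestructionLog().push_back("ClientCallManager"); }
};

struct CoreWorkerSketch {
  explicit CoreWorkerSketch(ClientCallManagerSketch &m) : manager(m) {}
  ~CoreWorkerSketch() { DestructionLog().push_back("CoreWorker"); }
  ClientCallManagerSketch &manager;  // borrowed, must outlive this object
};

struct ProcessSketch {
  ProcessSketch()
      : client_call_manager(std::make_unique<ClientCallManagerSketch>()),
        core_worker(std::make_shared<CoreWorkerSketch>(*client_call_manager)) {}

  // Members are destroyed in reverse declaration order, so the worker
  // (declared last) is released before the manager (declared first).
  std::unique_ptr<ClientCallManagerSketch> client_call_manager;
  std::shared_ptr<CoreWorkerSketch> core_worker;
};

// Builds and destroys a process, then returns the observed destructor order.
std::vector<std::string> DestroyProcessAndGetOrder() {
  DestructionLog().clear();
  { ProcessSketch process; }
  return DestructionLog();
}
```

One caveat for this sketch: a `std::shared_ptr` member only destroys the pointee if it is the last owner. If another `shared_ptr` to the worker is still alive elsewhere, declaration order alone does not guarantee that the worker is destroyed first.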

@jjyao jjyao left a comment

What tool or test could catch this, besides the Rust compiler?

edoakes commented Oct 30, 2025

> What tool or test could catch this, besides the Rust compiler?

I'm not actually sure why ASAN doesn't catch it. I'm not deeply familiar with ASAN/TSAN.

@edoakes edoakes enabled auto-merge (squash) October 30, 2025 16:08
@edoakes edoakes merged commit 091dc49 into ray-project:master Oct 30, 2025
7 checks passed
YoussefEssDS pushed a commit to YoussefEssDS/ray that referenced this pull request Nov 8, 2025
… in core worker (ray-project#58315)

The `metrics_agent_client_` depends on `client_call_manager_`, but
previously it was pulling out a reference to it from the core worker,
which is not guaranteed to outlive the agent client.

Modifying it to keep the `client_call_manager_` as a field of the
`core_worker_process` instead.

I think we may also need to drain any ongoing RPCs from the
`metrics_agent_client_` on shutdown. Leaving that for a future PR.

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
… in core worker (ray-project#58315)

The `metrics_agent_client_` depends on `client_call_manager_`, but
previously it was pulling out a reference to it from the core worker,
which is not guaranteed to outlive the agent client.

Modifying it to keep the `client_call_manager_` as a field of the
`core_worker_process` instead.

I think we may also need to drain any ongoing RPCs from the
`metrics_agent_client_` on shutdown. Leaving that for a future PR.

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
… in core worker (ray-project#58315)

The `metrics_agent_client_` depends on `client_call_manager_`, but
previously it was pulling out a reference to it from the core worker,
which is not guaranteed to outlive the agent client.

Modifying it to keep the `client_call_manager_` as a field of the
`core_worker_process` instead.

I think we may also need to drain any ongoing RPCs from the
`metrics_agent_client_` on shutdown. Leaving that for a future PR.

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Aydin Abiar <aydin@anyscale.com>