Conversation

@tianyi-ge
Contributor

@tianyi-ge tianyi-ge commented Sep 29, 2025

Why are these changes needed?

  1. Currently, the reporter agent is spawned by the raylet process. It assumes that all core workers are direct children of the raylet, but that is not the case with new features (uv, image_url). The reporter agent needs another way to find all core workers.
    for proc in raylet_proc.children()
  2. The driver is not spawned by the raylet, so it is never monitored.

Implementation:

  1. Add a gRPC endpoint to the raylet process (node manager) and allow the reporter agent to connect to it.
  2. The reporter agent fetches the worker list, including the driver, via the gRPC reply. It creates a raylet client with a dedicated thread (see the sketch below).
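
A minimal sketch of the reporter-agent side of that flow, assuming the new Cython RayletClient binding is importable from ray._raylet and that get_worker_pids takes a timeout in seconds (both assumptions; the review below later makes this call async):

import psutil

from ray._raylet import RayletClient  # import path assumed; exposed via includes/raylet_client.pxi

RAYLET_RPC_TIMEOUT_SECONDS = 1  # mirrors the constant added to dashboard/consts.py


def get_worker_processes(node_ip, node_manager_port):
    """Ask the local raylet for worker/driver PIDs and wrap them in psutil handles."""
    client = RayletClient(node_ip, node_manager_port)
    pids = client.get_worker_pids(timeout=RAYLET_RPC_TIMEOUT_SECONDS)
    procs = []
    for pid in pids:
        try:
            procs.append(psutil.Process(pid))
        except psutil.NoSuchProcess:
            # The worker may have exited between the RPC reply and this lookup.
            continue
    return procs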

Related issue number

Closes #56739

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Note

The reporter agent now fetches worker/driver PIDs via a new Raylet GetWorkerPIDs RPC using a new RayletClient binding, replacing psutil child-process scanning.

  • Backend (Raylet RPC):
    • Add GetWorkerPIDs RPC in node_manager.proto and wire it into NodeManagerService.
    • Implement NodeManager::HandleGetWorkerPIDs to return PIDs of all alive workers and drivers.
    • Extend RayletClient (C++) with GetWorkerPIDs(timeout_ms) and an alternate ctor (ip, port); expose to Python via Cython (includes/raylet_client.pxi, includes/common.pxd).
  • Python/Cython plumbing:
    • Include includes/raylet_client.pxi in _raylet.pyx to expose RayletClient to Python.
  • Dashboard Reporter:
    • Update reporter_agent.py to use RayletClient(ip, node_manager_port).get_worker_pids(timeout) to discover workers; build psutil.Process objects from returned PIDs.
    • Add RAYLET_RPC_TIMEOUT_SECONDS = 1 in dashboard/consts.py and use it for RPC timeout.
  • Server registration:
    • Register new handler in node_manager_server.h macro list.

Written by Cursor Bugbot for commit f76f633.

Signed-off-by: tianyi-ge <tianyig@outlook.com>
@tianyi-ge tianyi-ge requested a review from a team as a code owner September 29, 2025 15:33
Signed-off-by: tianyi-ge <tianyig@outlook.com>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request adds a new gRPC endpoint to the node manager for fetching worker and driver PIDs, which is a solid approach for discovering all worker processes. The changes to the protobuf definition and the C++ implementation are mostly correct. However, I've found a critical issue in the Python client code due to a typo that would cause the RPC call to fail. I've also included a few suggestions for improving error handling and code efficiency.

@tianyi-ge tianyi-ge changed the title [core] add get pid rpc to node manager [core] allow reporter agent to get pid via rpc to raylet Sep 29, 2025

Signed-off-by: tianyi-ge <tianyig@outlook.com>

@ray-gardener ray-gardener bot added the core (Issues that should be addressed in Ray Core), observability (Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling), and community-contribution (Contributed by the community) labels Sep 29, 2025
Comment on lines 523 to 524
// Get the worker managed by local raylet.
// Failure: Sends to local raylet, so should never fail.
Collaborator

we should still add error handling & retries just in case (there could be a logical bug in the raylet)
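
For illustration, a bounded retry with logging at the Python caller could look like the sketch below; the retry count and the empty-list fallback are assumptions, not necessarily what the PR lands:

import logging

logger = logging.getLogger(__name__)


def get_worker_pids_with_retry(raylet_client, timeout, max_attempts=3):
    """Retry the GetWorkerPIDs RPC a bounded number of times, logging each failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return raylet_client.get_worker_pids(timeout=timeout)
        except Exception as e:  # narrowed to specific RPC errors in the discussion below
            logger.warning(
                "GetWorkerPIDs attempt %d/%d failed: %s", attempt, max_attempts, e
            )
    # Degrade gracefully so one bad poll does not kill the reporter loop.
    logger.error("Giving up on GetWorkerPIDs after %d attempts", max_attempts)
    return []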

@tianyi-ge
Contributor Author

@edoakes thank you for the prompt comments; I'll fix it soon. Also, after discussing with @can-anyscale, I'll replace the Python grpcio lib with a Cython wrapper, "RayletClient"

Signed-off-by: tianyi-ge <tianyig@outlook.com>

@can-anyscale can-anyscale self-assigned this Oct 1, 2025
Contributor

@can-anyscale can-anyscale left a comment

Let's figure out a way to test that the solution works

)
try:
    return raylet_client.get_worker_pids(timeout=timeout)
except Exception as e:
Contributor

let's not catch all exceptions; be explicit about which exceptions can acceptably be thrown from get_worker_pids and which exceptions Ray should just fail out loud on

Contributor

yes, this should be RPC exceptions or something; try not to catch all exceptions if possible
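
A sketch of the narrower handling being asked for, assuming the binding surfaces transient RPC failures as ray.exceptions.RpcError (whether get_worker_pids raises exactly that type is an assumption):

import logging

from ray.exceptions import RpcError

logger = logging.getLogger(__name__)


def get_worker_pids_or_empty(raylet_client, timeout):
    try:
        return raylet_client.get_worker_pids(timeout=timeout)
    except RpcError as e:
        # Expected transient failure (raylet busy or briefly unreachable): degrade gracefully.
        logger.warning("GetWorkerPIDs RPC failed: %s", e)
        return []
    # Anything else (TypeError, AttributeError, ...) is a programming error and
    # should propagate so it fails loudly instead of being silently swallowed.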

rpc IsLocalWorkerDead(IsLocalWorkerDeadRequest) returns (IsLocalWorkerDeadReply);
// Get the PIDs of all workers currently alive that are managed by the local Raylet.
// This includes connected driver processes.
// Failure: Will retry on failure with logging
Contributor

nit: what does "with logging" mean? More useful information would be how many times it retries, and what the reply looks like on failure (partial results, an empty list, etc.).

Signed-off-by: tianyi-ge <tianyig@outlook.com>


ThreadedRayletClient::ThreadedRayletClient(const std::string &ip_address, int port)
    : RayletClient() {
  io_service_ = std::make_unique<instrumented_io_context>();
Contributor

maybe you can just use this https://github.com/ray-project/ray/blob/master/src/ray/common/asio/asio_util.h#L53 so you don't need to maintain the thread yourself

Contributor

there are also patterns here to make sure the io_context is reused across raylet clients within one process

Contributor Author

thanks for your suggestions

Contributor Author

I guess in the future, if the raylet client is used in multiple places, reusing the io_context is important.
But to use IOContextProvider, I have to create a "default io context" anyway. It's also manually maintained, right?

Contributor

oh dang, sorry, I forgot to include the link to the pattern; you can create a static InstrumentedIOContextWithThread and reuse it across constructions of ThreadedRayletClient: https://github.com/ray-project/ray/blob/master/src/ray/gcs_rpc_client/gcs_client.cc#L219

Contributor Author

I ran a test creating 5 actors. The RPC reply has 12 processes: 5 actors, 5 idle workers, the driver (python in the following screenshot), and a dashboard server head, which aligns with the dashboard.

2025-10-08 11:29:33,355	INFO reporter_agent.py:913 -- Worker PIDs from raylet: [41692, 41694, 41685, 41689, 41693, 41688, 41690, 41691, 41687, 41686, 41676, 41618]
(screenshot: dashboard process list)

should the dashboard server head be here?

Collaborator

ah -- I don't think the dashboard server head should be there... the reason it's showing up is because it is connecting to ray with ray.init. We'll need some way of filtering it. I believe it is started in a namespace prefixed with _ray_internal. We do other such filtering here:

# This includes the _ray_internal_dashboard job that gets automatically

If the namespace is available in the raylet, we can add the filtering there and exclude any workers that are associated with a _ray_internal* namespace
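
A Python-level sketch of that filtering idea; the prefix constant matches the one added later in this PR, but the worker objects and their job_namespace attribute are purely illustrative (the real check ends up behind a filter_system_drivers option):

RAY_INTERNAL_NAMESPACE_PREFIX = "_ray_internal_"


def filter_system_drivers(workers):
    """Drop drivers whose job namespace is Ray-internal (e.g. the dashboard head)."""
    return [
        w
        for w in workers
        if not w.job_namespace.startswith(RAY_INTERNAL_NAMESPACE_PREFIX)
    ]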

Collaborator

Agreed, we should hide system driver processes.

Contributor Author

thanks @edoakes. I added a new option, filter_system_drivers. It finds the corresponding namespace and checks its prefix. Now the dashboard server head is gone
(screenshot: dashboard process list without the dashboard head)

…ylet client

Signed-off-by: tianyi-ge <tianyig@outlook.com>

Signed-off-by: tianyi-ge <tianyig@outlook.com>

Signed-off-by: tianyi-ge <tianyig@outlook.com>

Signed-off-by: tianyi-ge <tianyig@outlook.com>
Signed-off-by: tianyi-ge <tianyig@outlook.com>
@tianyi-ge
Contributor Author

@jjyao Is the runtime env agent a special driver or a core worker? Is it possible to assert on it in my unit test?

    return Status::TimedOut("Timed out getting worker PIDs from raylet");
  }
  return future.get();
}

Bug: Race Condition in Dual Timeout Handling

The GetWorkerPIDs method has a race condition due to dual timeout handling. Both the RPC call and future.wait_for use the same timeout_ms, which can cause future.wait_for to incorrectly report a timeout even if the RPC successfully completed.



def test_report_stats():
@patch("ray.dashboard.modules.reporter.reporter_agent.RayletClient")
def test_report_stats(mock_raylet_client):
Collaborator

the mock_raylet_client is not used?

Contributor Author

it will be used in ReporterAgent constructor to avoid creating a real grpc client

Collaborator

let's dependency inject the client instead of using patch to hijack the import path... makes it explicit and less likely to break and confuse people down the line.
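
A sketch of what that dependency injection could look like, based on the optional raylet_client constructor argument that appears further down in this thread; the fake client and the make_dashboard_agent fixture are illustrative only:

from ray.dashboard.modules.reporter.reporter_agent import ReporterAgent


class FakeRayletClient:
    """Test double that returns a fixed PID list instead of making an RPC."""

    def __init__(self, pids):
        self._pids = pids

    def get_worker_pids(self, timeout=None):
        return self._pids


def test_report_stats_with_injected_client(make_dashboard_agent):
    # make_dashboard_agent is a hypothetical fixture standing in for however the
    # existing tests construct the DashboardAgent dependency.
    fake = FakeRayletClient(pids=[101, 102])
    agent = ReporterAgent(make_dashboard_agent(), raylet_client=fake)
    # The agent now talks to the fake instead of opening a real gRPC connection.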

assert resp_data["rayInitCluster"] == meta["ray_init_cluster"]


def test_reporter_raylet_agent(ray_start_with_dashboard):
Collaborator

I think this test depends on the fact that the total CPU resource of the node is 1, so we don't create extra idle workers. Could you make it explicit by doing

@pytest.mark.parametrize(
    "ray_start_with_dashboard",
    [
        {
            "num_cpus": 1,
        }
    ],
    indirect=True,
)

/// PID of GCS process to record metrics.
constexpr char kGcsPidKey[] = "gcs_pid";

// Please keep this in sync with the definition in ray_constants.py.
Collaborator

We can enforce the sync by exposing the C++ constant to Python via Cython. We have examples in common.pxi and common.pxd:

RAY_NODE_TPU_POD_TYPE_KEY = kLabelKeyTpuPodType.decode()

// worker clients. The unavailable callback will eventually be retried so if this fails.
rpc IsLocalWorkerDead(IsLocalWorkerDeadRequest) returns (IsLocalWorkerDeadReply);
// Get the PIDs of all workers currently alive that are managed by the local Raylet.
// This includes connected driver processes.
Collaborator

We should mention system drivers are excluded

Comment on lines 474 to 475
std::weak_ptr<std::promise<Status>> weak_promise = promise;
std::weak_ptr<std::vector<int32_t>> weak_worker_pids = worker_pids;
Collaborator

why do we need weak_ptr and promise here?

def _get_worker_pids_from_raylet(self) -> List[int]:
    try:
        # Get worker pids from raylet via gRPC.
        return self._raylet_client.get_worker_pids()
Collaborator

this is an RPC so we should make it async and change get_worker_pids_from_raylet to async.

Contributor

ah yes, @tianyi-ge, there is a pattern to turn this async grpc call into a await/future method in python, example here https://github.com/ray-project/ray/blob/master/python/ray/includes/gcs_client.pxi#L177-L191
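
That gcs_client.pxi pattern, restated as a plain-Python sketch of turning a callback-style RPC into an awaitable; the callback-based get_worker_pids_async method here is hypothetical, used only to show the shape of the bridge:

import asyncio


async def async_get_worker_pids(raylet_client):
    loop = asyncio.get_running_loop()
    fut = loop.create_future()

    def _on_reply(ok, pids):
        # The reply arrives on the client's dedicated I/O thread, so hop back
        # onto the event-loop thread before touching the asyncio future.
        if ok:
            loop.call_soon_threadsafe(fut.set_result, pids)
        else:
            loop.call_soon_threadsafe(
                fut.set_exception, RuntimeError("GetWorkerPIDs RPC failed")
            )

    raylet_client.get_worker_pids_async(_on_reply)  # hypothetical callback API
    return await fut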

Signed-off-by: tianyi-ge <tianyig@outlook.com>
raylet_proc = self._get_raylet_proc()
if raylet_proc is None:
pids = asyncio.run(self._get_worker_pids_from_raylet())
logger.debug(f"Worker PIDs from raylet: {pids}")

Bug: Asyncio Loop Conflict in Worker Process Retrieval

The _get_worker_processes method uses asyncio.run() to execute _get_worker_pids_from_raylet(). Since the ReporterAgent runs within an existing asyncio event loop, calling asyncio.run() from it raises a RuntimeError and crashes the application.

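A minimal repro of the failure mode flagged here: asyncio.run() refuses to start a nested loop from inside a running one, so inside an async context the coroutine has to be awaited instead (the PR ultimately resolves this differently, by running the sync path on an executor thread, as discussed further down):

import asyncio


async def fetch_pids():
    return [1, 2, 3]


async def broken_caller():
    # Raises: RuntimeError: asyncio.run() cannot be called from a running event loop
    return asyncio.run(fetch_pids())


async def working_caller():
    # Inside an async context, just await the coroutine.
    return await fetch_pids()


if __name__ == "__main__":
    print(asyncio.run(working_caller()))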

Signed-off-by: tianyi-ge <tianyig@outlook.com>
Collaborator

@edoakes edoakes left a comment

Code changes LGTM, just one minor comment.

Also -- is it possible to add an e2e integration test? We can run an application using uv runtime_env and check that metrics are exported for the expected PIDs (both driver and the uv worker)



@tianyi-ge
Contributor Author

tianyi-ge commented Oct 20, 2025

Hi @edoakes, how about dependency injecting by passing an optional raylet_client=None to ReporterAgent.__init__?

also, could you give me some references for how to check PIDs from the exported metrics in the dashboard?

@edoakes
Collaborator

edoakes commented Oct 20, 2025

Hi @edoakes how about dependency inject by passing an optional raylet_client=None to ReporterAgent.__init__?

Sounds great 👍

also, could you give me some references about how to check pids from exported metrics in dashboard?

Looks like the metrics are written to the GCS here:

await self._gcs_client.async_publish_node_resource_usage(

Then they are read here:

key, data = await subscriber.poll()

And finally they're used to serve the GET /nodes/{node_id} endpoint:

So let's write a test that runs a driver & uv actor, then queries the GET /nodes/{node_id} endpoint and verifies that stats for their PIDs are populated. You can get the uv actor PID by adding a method to it that returns os.getpid().

There are similar tests in test_node.py. You can add this there and follow a similar pattern.
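
A rough sketch of that test, reusing the data.detail.workers[*].pid response shape that appears later in this thread; the uv package, URL helper, and exact fixture wiring are assumptions to adapt to the existing test_node.py patterns:

import os

import requests

import ray
from ray._private.test_utils import format_web_url, wait_for_condition


def test_worker_and_driver_pids_reported(ray_start_with_dashboard):
    webui_url = format_web_url(ray_start_with_dashboard["webui_url"])
    node_id = ray.get_runtime_context().get_node_id()

    @ray.remote(runtime_env={"uv": ["emoji"]})  # any small package works here
    class UvActor:
        def get_pid(self):
            return os.getpid()

    actor_pid = ray.get(UvActor.remote().get_pid.remote())
    driver_pid = os.getpid()

    def _pids_reported():
        resp = requests.get(f"{webui_url}/nodes/{node_id}")
        resp.raise_for_status()
        detail = resp.json()["data"]["detail"]
        reported = {worker["pid"] for worker in detail["workers"]}
        return {actor_pid, driver_pid} <= reported

    wait_for_condition(_pids_reported, timeout=30)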

Signed-off-by: tianyi-ge <tianyig@outlook.com>
@tianyi-ge
Contributor Author

updated!

stats = self._collect_stats()
return asyncio.run(
    self._async_compose_stats_payload(cluster_autoscaling_stats_json)
)

Bug: Async Loop Conflict in Reporting Method

The _compose_stats_payload method calls asyncio.run() from within the async reporting loop. This raises a RuntimeError because asyncio.run() cannot be invoked when an event loop is already running.


# and should not be considered user activity.
# Please keep this in sync with the definition kRayInternalNamespacePrefix
# in /src/ray/gcs/gcs_server/gcs_job_manager.h.
RAY_INTERNAL_NAMESPACE_PREFIX = "_ray_internal_"
Contributor

what is this refactoring for?

Contributor Author

#57004 (comment)
@jjyao suggested exposing this C++ constant to Python, avoiding a duplicate definition and potential inconsistency

assert dump_info["result"] is True
detail = dump_info["data"]["detail"]
pids = [worker["pid"] for worker in detail["workers"]]
if len(pids) < 2: # might include idle worker
Contributor

you can remove this if condition, wait_for_condition already handles retry interval
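
For reference, the retry loop already lives inside wait_for_condition, so the predicate can simply return False until enough workers show up; fetch_node_detail below is a hypothetical stand-in for the test's HTTP call:

from ray._private.test_utils import wait_for_condition


def wait_for_worker_pids(fetch_node_detail, expected=2):
    """Poll until the node detail reports at least `expected` worker PIDs."""

    def _enough_worker_pids():
        detail = fetch_node_detail()["data"]["detail"]
        pids = [worker["pid"] for worker in detail["workers"]]
        return len(pids) >= expected  # driver plus workers; idle workers may add more

    # wait_for_condition handles the retry interval and timeout itself.
    wait_for_condition(_enough_worker_pids, timeout=20, retry_interval_ms=500)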

"""

def __init__(self, dashboard_agent):
def __init__(self, dashboard_agent, raylet_client=None):
Contributor

Is raylet_client=None only used inside test cases?

Contributor Author

yes


await asyncio.sleep(reporter_consts.REPORTER_UPDATE_INTERVAL_MS / 1000)

def _compose_stats_payload(
Contributor

Is there a reason you didn’t make this async as well and only call asyncio.run inside loop.run_in_executor? That approach seems less error-prone. Calling asyncio.run here makes _compose_stats_payload unsafe to reuse elsewhere.

Contributor Author

yeah, I tested only calling asyncio.run inside loop.run_in_executor but it does not work. I guess in my approach, the creation of the coroutine (the call to asyncio.run) is delayed until the executor actually runs it. For the reuse concern, how about moving _compose_stats_payload to be a nested function?
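
A sketch of why the shape that landed is legal: loop.run_in_executor runs the synchronous _compose_stats_payload on a worker thread that has no running event loop, so the asyncio.run() inside it can create a fresh loop there (names simplified; this illustrates the pattern rather than the exact reporter code):

import asyncio


async def _async_compose_stats_payload():
    # Placeholder for the part that awaits the GetWorkerPIDs RPC.
    await asyncio.sleep(0)
    return {"worker_pids": [1, 2, 3]}


def _compose_stats_payload():
    # Runs on an executor thread: no event loop is running there, so
    # asyncio.run() can safely create and drive a fresh one.
    return asyncio.run(_async_compose_stats_payload())


async def report_once():
    loop = asyncio.get_running_loop()
    # Offload the blocking composition (psutil calls plus the nested asyncio.run)
    # so the agent's main event loop is not blocked.
    return await loop.run_in_executor(None, _compose_stats_payload)


if __name__ == "__main__":
    print(asyncio.run(report_once()))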

Signed-off-by: tianyi-ge <tianyig@outlook.com>
return asyncio.run(
    self._async_compose_stats_payload(cluster_autoscaling_stats_json)
)


Bug: Async Method Causes Event Loop Conflicts

The _compose_stats_payload method uses asyncio.run() to execute an async function. This causes a RuntimeError if called from an existing event loop, which is likely for the ReporterAgent within the dashboard. This can lead to application crashes or failures in stats collection.


Contributor

@can-anyscale can-anyscale left a comment

LGTM


await asyncio.sleep(reporter_consts.REPORTER_UPDATE_INTERVAL_MS / 1000)

def _compose_stats_payload(
Contributor

nit: let's rename this to _run_in_executor (to mark it as the entry function and use a similar run naming convention to the rest of this class); the rest looks good

@edoakes
Collaborator

edoakes commented Oct 21, 2025

Thanks for the contribution and sticking through all of the feedback @tianyi-ge!

@edoakes edoakes merged commit 47f6b87 into ray-project:master Oct 21, 2025
6 checks passed
Collaborator

@jjyao jjyao left a comment

Great work! You can fix these comments in your next PR.

Comment on lines +525 to +526
// Failure: Will retry with the default timeout 1000ms. If fails, reply return an empty
// list.
Collaborator

Do we actually retry? I think it's a local RPC so it should never fail unless the raylet crashes?

}

void RayletClient::GetWorkerPIDs(
    const gcs::OptionalItemCallback<std::vector<int32_t>> &callback, int64_t timeout_ms) {
Collaborator

I think we should move this out of gcs namespace since it's a common util now.

    int port) {
  // Connect to the raylet on a singleton io service with a dedicated thread.
  // This is to avoid creating multiple threads for multiple clients in python.
  static InstrumentedIOContextWithThread io_context("raylet_client_io_service");
Collaborator

I think this should be a member field of RayClientWithIoContext

xinyuangui2 pushed a commit to xinyuangui2/ray that referenced this pull request Oct 22, 2025
elliot-barn pushed a commit that referenced this pull request Oct 23, 2025
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025

Labels

community-contribution (Contributed by the community), core (Issues that should be addressed in Ray Core), go (add ONLY when ready to merge, run all tests), observability (Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling)
