[dashboard][train] Add dynolog for on-demand GPU profiling for Torch training #53191

justinvyu · 2025-05-21T01:58:09Z

Summary

Add an on-demand GPU profiling endpoint to the dashboard reporter API at `/worker/gpu_profile`.

The endpoint accepts the following query parameters:

/worker/gpu_profile?node_ip=xxx.x.x.x&pid=xxxxx&num_iterations=x
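
As a rough illustration, a client could assemble the request URL like this. The dashboard address (`http://127.0.0.1:8265`) and the helper name `gpu_profile_url` are assumptions made for the example and are not part of this PR.

```python
from urllib.parse import urlencode

# Hypothetical dashboard address; substitute your cluster's dashboard URL.
DASHBOARD_URL = "http://127.0.0.1:8265"


def gpu_profile_url(
    node_ip: str,
    pid: int,
    num_iterations: int,
    base_url: str = DASHBOARD_URL,
) -> str:
    """Build the /worker/gpu_profile URL with its three query parameters."""
    query = urlencode(
        {"node_ip": node_ip, "pid": pid, "num_iterations": num_iterations}
    )
    return f"{base_url}/worker/gpu_profile?{query}"


if __name__ == "__main__":
    # Example values only; use the node IP and PID of a real train worker.
    print(gpu_profile_url("10.0.0.1", 12345, 4))
```

Opening the resulting URL in a browser should trigger the profile and, per the design below, end in a trace download.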

Design

- Functionality depends on the dynolog binaries (`dynolog`, `dyno`) already being installed on the image.
- Launch a singleton dynolog daemon monitoring process on every node during dashboard agent startup, listening on port 65406.
- The `/worker/gpu_profile` request gets propagated to the ReporterAgent on the correct node, which calls `dyno gputrace --pids=<train_worker_pid> ...` and then waits for the trace file to be dumped.
  - The request then redirects to the streaming log download API to download the trace on the client (browser).
  - See the diagram below for a visualization of the request handling.
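
The "wait for the trace file to be dumped" step can be sketched as a polling loop. This is a minimal sketch and not the code in this PR: the function name `wait_for_trace`, its parameters, and the timeout are all assumptions for illustration. The early return when the process exits mirrors the behavior described under Edge Case Coverage.

```python
import os
import time
from typing import Callable, Tuple


def wait_for_trace(
    trace_path: str,
    process_is_alive: Callable[[], bool],
    timeout_s: float = 60.0,
    poll_interval_s: float = 0.1,
) -> Tuple[bool, str]:
    """Poll until trace_path exists, the process exits, or the timeout hits.

    Returns (success, trace path or failure message).
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if os.path.exists(trace_path):
            return True, trace_path
        # Fail early instead of waiting for a trace that will never arrive.
        if not process_is_alive():
            return False, "process exited before the trace was written"
        time.sleep(poll_interval_s)
    return False, f"timed out after {timeout_s}s waiting for {trace_path}"
```

The timeout is an addition in this sketch; it is one possible way to bound the wait when no trace file ever appears.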

Example E2E Usage

dynolog_prototype_e2e.mov

Edge Case Coverage

| Test Case | Result |
| --- | --- |
| No Kineto variables set in the train run | ❌ Fast-fails with an error message. |
| Non TorchTrainer workload | ❌ Fast-fails with an error message. |
| On dead workers | ❌ Fast-fails with an error message. |
| CPU workers on GPU node | ⚠️ Works, but logs error: `gpuGetDeviceCount failed with code 100` |
| CPU workers on CPU nodes (no GPUs) | ❌ Fast-fails with an error message. |
| Profiling before training steps have begun, but after workers are spawned. | ❌❌ Infinite hang on the request; no `.tmp` file created. |
| Really long profile | ✅ Works with 100 steps (a sufficiently large test). |
| Run ends before profiling finished (e.g. num_iterations > training duration) | ✅ Returns early with a failure message if the process exits while waiting for profiling to finish. |
| Train V1 | ✅ Need to set the environment variables at the `ray.init(runtime_env)` level. |
| Multiple train runs profiled at once | ✅ |
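
For the Train V1 case, the environment variables have to be part of the `runtime_env` passed to `ray.init`. A minimal sketch of building that dictionary follows. The helper name `with_kineto_env` is hypothetical, and the variable `KINETO_USE_DAEMON=1` is recalled from the dynolog documentation rather than stated in this PR, so treat it as an assumption and check which variables your setup needs.

```python
from typing import Optional

# Assumed Kineto daemon setting (from the dynolog docs, not from this PR).
KINETO_ENV_VARS = {"KINETO_USE_DAEMON": "1"}


def with_kineto_env(runtime_env: Optional[dict] = None) -> dict:
    """Return a copy of runtime_env with the Kineto variables added."""
    merged = dict(runtime_env or {})
    env_vars = dict(merged.get("env_vars", {}))
    # Keep any value the user already set explicitly.
    for key, value in KINETO_ENV_VARS.items():
        env_vars.setdefault(key, value)
    merged["env_vars"] = env_vars
    return merged


# Usage (requires Ray to be installed):
#   import ray
#   ray.init(runtime_env=with_kineto_env())
```

The helper copies its input rather than mutating it, so an existing `runtime_env` can be reused elsewhere unchanged.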

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

justinvyu · 2025-05-21T01:59:19Z

python/ray/dashboard/modules/reporter/tests/test_gpu_profiler_manager.py

@@ -0,0 +1,333 @@
+"""Unit tests for the GPU profiler manager.


Will add an e2e test with real GPU + dynolog in a follow-up. For now, I've tested e2e manually.

alanwguo · 2025-05-21T16:02:31Z

python/ray/dashboard/modules/reporter/reporter_agent.py

        )
        return reporter_pb2.CpuProfilingReply(output=output, success=success)

+    async def GpuProfiling(self, request, context):


Is this automatically hooked up to the GpuProfiling proto rpc?
Don't see any grpc code for handling it.

The proto is updated here: https://github.com/ray-project/ray/pull/53191/files#diff-1df7450350b7ffdc1923f9321c2df30c8217b6060106700e4920ba447eb7eb84

yeah, i see the proto change, but im surprised that you don't have to "hook up" the class to that proto in some way

oh yeah, I think it just autogenerates the reporter stub with the new GpuProfiling method

python/ray/dashboard/modules/reporter/gpu_profile_manager.py

justinvyu added 2 commits May 20, 2025 18:56

squash commit

6c02e84

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

remove temp file

f99fae4

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

justinvyu commented May 21, 2025

alanwguo reviewed May 21, 2025

justinvyu added 2 commits May 21, 2025 09:46

reduce sleep interval

912894d

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

Merge branch 'master' of https://github.com/ray-project/ray into dynolog_on_demand_profiling

08f15cd

alanwguo approved these changes May 21, 2025

justinvyu enabled auto-merge (squash) May 21, 2025 18:18

github-actions bot added the `go` label (add ONLY when ready to merge, run all tests) May 21, 2025

justinvyu added 2 commits May 21, 2025 12:23

fix pydoclint

1dddb05

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

Merge branch 'master' of https://github.com/ray-project/ray into dynolog_on_demand_profiling

15bfe49

github-actions bot disabled auto-merge May 21, 2025 19:23

justinvyu enabled auto-merge (squash) May 21, 2025 20:12

justinvyu merged commit efd333a into ray-project:master May 21, 2025
6 checks passed

justinvyu deleted the dynolog_on_demand_profiling branch May 21, 2025 21:46

hainesmichaelc added the community-backlog label May 22, 2025

ongkong mentioned this pull request Nov 7, 2025

[Dashboard] Expose the ability to take kineto traces #58446

Open
