@justinvyu
Contributor

Summary

Add on-demand GPU profiling endpoint to the dashboard reporter API at /worker/gpu_profile.

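The endpoint accepts the following query parameters (the target node's IP, the PID of the train worker to profile, and the number of iterations to profile):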

/worker/gpu_profile?node_ip=xxx.x.x.x&pid=xxxxx&num_iterations=x
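As a rough illustration of how a client might assemble this request, here is a minimal sketch. The helper name `build_gpu_profile_url`, the dashboard address `http://127.0.0.1:8265`, and the example parameter values are assumptions for illustration only; only the path and the three query parameter names come from this PR.

```python
from urllib.parse import urlencode


def build_gpu_profile_url(dashboard_url, node_ip, pid, num_iterations):
    """Build the on-demand GPU profiling URL (hypothetical helper)."""
    # Parameter names match the endpoint described above.
    query = urlencode(
        {"node_ip": node_ip, "pid": pid, "num_iterations": num_iterations}
    )
    return f"{dashboard_url.rstrip('/')}/worker/gpu_profile?{query}"


# The dashboard address and values below are assumed, not taken from the PR.
url = build_gpu_profile_url("http://127.0.0.1:8265", "10.0.0.5", 12345, 4)
print(url)
```

Opening the resulting URL in a browser should trigger the profile and, per the design below, end in a trace download.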

Design

  • Functionality depends on dynolog binaries (dynolog, dyno) being installed on the image already.
  • Launch a singleton dynolog daemon monitoring process on every node during dashboard agent startup, listening at port 65406.
  • The /worker/gpu_profile request gets propagated to the ReporterAgent on the correct node, which makes a call to dyno gputrace --pids=<train_worker_pid> ... and then waits for the trace file to be dumped.
    • The request then redirects to the streaming log download API to download the trace on the client (browser).
    • See the diagram below for a visualization of the request handling.

[Diagram: request handling flow for /worker/gpu_profile (Screenshot 2025-05-13)]
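The "wait for the trace file to be dumped" step could look roughly like the sketch below. This is not the code in this PR: the function name `wait_for_trace`, its parameters, and in particular the timeout are assumptions. The early return when the process dies mirrors the behavior described in the edge case table; the timeout is only one possible way to bound the wait.

```python
import os
import time


def wait_for_trace(trace_path, is_process_alive, timeout_s=60.0, poll_interval_s=0.5):
    """Poll until the trace file exists. Returns (success, message).

    Hypothetical helper: `is_process_alive` is a zero-argument callable that
    reports whether the profiled worker is still running.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if os.path.exists(trace_path):
            return True, trace_path
        if not is_process_alive():
            # Fail early instead of waiting for a trace that will never arrive.
            return False, "process exited before the trace was written"
        time.sleep(poll_interval_s)
    # Assumed safeguard; the PR does not state that a timeout exists.
    return False, f"timed out after {timeout_s}s waiting for {trace_path}"
```

Passing the liveness check in as a callable keeps the polling loop easy to test without a real worker process or dynolog installation.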

Example E2E Usage

[Video: dynolog_prototype_e2e.mov]

Edge Case Coverage

| Test Case | Result |
| --- | --- |
| No Kineto variables set in the train run | ❌ Fast-fails with an error message. |
| Non TorchTrainer workload | ❌ Fast-fails with an error message. |
| On dead workers | ❌ Fast-fails with an error message. |
| CPU workers on GPU node | ⚠️ Works, but logs error: gpuGetDeviceCount failed with code 100 |
| CPU workers on CPU nodes (no GPUs) | ❌ Fast-fails with an error message. |
| Profiling before training steps have begun, but after workers are spawned | ❌❌ Infinite hang on the request; no .tmp file created. |
| Really long profile | ✅ Works with 100 steps (sufficiently large test). |
| Run ends before profiling finished (e.g. num_iterations > training duration) | ✅ Returns early with a failure message if the process exits while waiting for profiling to finish. |
| Train V1 | ✅ Need to set the environment variables at the ray.init(runtime_env) level. |
| Multiple train runs profiled at once | |
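For the Train V1 case, setting the variables at the `ray.init(runtime_env)` level might look like the sketch below. The PR does not name the Kineto variables; `KINETO_USE_DAEMON` and `KINETO_DAEMON_INIT_DELAY_S` are assumed here based on dynolog's PyTorch profiling instructions, and the values are illustrative.

```python
# Assumed Kineto variables; the PR text does not list them explicitly.
kineto_env_vars = {
    "KINETO_USE_DAEMON": "1",
    "KINETO_DAEMON_INIT_DELAY_S": "3",
}

# runtime_env applies the variables to every worker process in the job.
runtime_env = {"env_vars": kineto_env_vars}

# With Ray installed, the dict would be passed at startup:
#   import ray
#   ray.init(runtime_env=runtime_env)
print(runtime_env)
```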

justinvyu added 2 commits May 20, 2025 18:56
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@@ -0,0 +1,333 @@
"""Unit tests for the GPU profiler manager.
Contributor Author

Will add an e2e test with real GPU + dynolog in a follow-up. For now, I've tested e2e manually.

)
return reporter_pb2.CpuProfilingReply(output=output, success=success)

async def GpuProfiling(self, request, context):
Contributor

Is this automatically hooked up to the GpuProfiling proto rpc?
Don't see any grpc code for handling it.

Contributor Author

Contributor

yeah, I see the proto change, but I'm surprised that you don't have to "hook up" the class to that proto in some way

Contributor Author

oh yeah I think it just autogenerates the reporter stub with the new GpuProfiling method
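For readers wondering about the same thing: generated gRPC Python code typically includes a registration function that maps each rpc name from the proto to the servicer method of the same name, so adding the rpc to the proto and defining a matching method is usually enough. The toy sketch below imitates that name-based wiring without using gRPC; the class and function names are stand-ins, not Ray's or gRPC's actual code.

```python
import asyncio


class ReporterServicer:
    """Stand-in servicer; method names match the rpc names in the proto."""

    async def CpuProfiling(self, request, context):
        return f"cpu:{request}"

    async def GpuProfiling(self, request, context):
        return f"gpu:{request}"


def build_method_handlers(servicer):
    # Stand-in for the generated registration function: one entry per rpc,
    # regenerated automatically whenever the proto changes.
    return {
        "CpuProfiling": servicer.CpuProfiling,
        "GpuProfiling": servicer.GpuProfiling,
    }


handlers = build_method_handlers(ReporterServicer())
# The server looks up the handler by rpc name when a request arrives.
result = asyncio.run(handlers["GpuProfiling"]("pid=123", None))
print(result)
```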

justinvyu added 2 commits May 21, 2025 09:46
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@justinvyu justinvyu enabled auto-merge (squash) May 21, 2025 18:18
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label May 21, 2025
justinvyu added 2 commits May 21, 2025 12:23
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@github-actions github-actions bot disabled auto-merge May 21, 2025 19:23
@justinvyu justinvyu enabled auto-merge (squash) May 21, 2025 20:12
@justinvyu justinvyu merged commit efd333a into ray-project:master May 21, 2025
6 checks passed
@justinvyu justinvyu deleted the dynolog_on_demand_profiling branch May 21, 2025 21:46