[dashboard][train] Add dynolog for on-demand GPU profiling for Torch training #53191
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@@ -0,0 +1,333 @@
"""Unit tests for the GPU profiler manager.
Will add an e2e test with real GPU + dynolog in a follow-up. For now, I've tested e2e manually.
        )
        return reporter_pb2.CpuProfilingReply(output=output, success=success)

    async def GpuProfiling(self, request, context):
Is this automatically hooked up to the GpuProfiling proto rpc?
Don't see any grpc code for handling it.
yeah, i see the proto change, but im surprised that you don't have to "hook up" the class to that proto in some way
oh yeah I think it just autogenerates the reporter stub with the new GpuProfiling method
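To illustrate the point in this thread: gRPC's generated Python code registers handlers by RPC name, so a servicer subclass that defines a method with the same name as the proto rpc is picked up without extra wiring. The sketch below mimics that name-based lookup in plain Python. It does not use the real generated module; `ToyReporterServicerBase`, `ToyReporterAgent`, `build_method_handlers`, and the dict-shaped request/reply are hypothetical stand-ins, not the code in this PR.

```python
import asyncio


class ToyReporterServicerBase:
    """Hypothetical stand-in for a generated servicer base class.

    Generated base classes define one method per rpc in the proto; each
    default implementation signals that the rpc is unimplemented.
    """

    async def CpuProfiling(self, request, context):
        raise NotImplementedError("CpuProfiling is not implemented")

    async def GpuProfiling(self, request, context):
        raise NotImplementedError("GpuProfiling is not implemented")


def build_method_handlers(servicer):
    """Mimic a generated add_*_to_server helper (hypothetical simplification).

    Each rpc name declared in the proto is mapped to the bound method of
    the same name on the servicer instance.
    """
    rpc_names = ("CpuProfiling", "GpuProfiling")
    return {name: getattr(servicer, name) for name in rpc_names}


class ToyReporterAgent(ToyReporterServicerBase):
    # Overriding the method is the only "hook up" step: the lookup above
    # finds it by name.
    async def GpuProfiling(self, request, context):
        return {"success": True, "output": f"profiled pid {request['pid']}"}


handlers = build_method_handlers(ToyReporterAgent())
reply = asyncio.run(handlers["GpuProfiling"]({"pid": 1234}, None))
print(reply)
```

In this toy version, calling `handlers["CpuProfiling"]` still reaches the base-class default, which is why adding the rpc to the proto and a same-named method to the class is sufficient.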
Summary
Add on-demand GPU profiling endpoint to the dashboard reporter API at /worker/gpu_profile.
Here's the list of parameters that are available:
Design
… dynolog, dyno) being installed on the image already.
… 65406.
The /worker/gpu_profile request gets propagated to the ReporterAgent on the correct node, which makes a call to dyno gputrace --pids=<train_worker_pid> ... and then waits for the trace file to be dumped.
Example E2E Usage
dynolog_prototype_e2e.mov
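The last step of the Design flow (call `dyno gputrace`, then wait for the trace file) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation in this PR: only `dyno gputrace --pids=<train_worker_pid>` comes from the description; `build_gputrace_command`, `wait_for_trace_file`, the `extra_args` placeholder for the elided flags, and the timeout and polling values are all hypothetical.

```python
import os
import time


def build_gputrace_command(pid, extra_args=()):
    """Build the dyno CLI invocation as an argument list.

    Only `dyno gputrace --pids=...` is taken from the PR description.
    `extra_args` is a hypothetical placeholder for the flags elided
    there ("..."); no specific flags are assumed.
    """
    return ["dyno", "gputrace", f"--pids={pid}", *extra_args]


def wait_for_trace_file(path, timeout_s=30.0, poll_interval_s=0.5):
    """Poll until a non-empty file exists at `path` or the timeout expires.

    Returns True if the file showed up in time, False otherwise.
    The timeout and polling interval are illustrative defaults.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if os.path.exists(path) and os.path.getsize(path) > 0:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)


# Example: inspect the command without running it (dyno may not be installed).
print(build_gputrace_command(1234))
```

Polling for a non-empty file is a simplification; a trace that is still being written would also pass this check, so a real implementation would likely need a stronger completion signal.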
Edge Case Coverage
… gpuGetDeviceCount failed with code 100.
… tmpfile created.
… ray.init(runtime_env) level.