-
Notifications
You must be signed in to change notification settings - Fork 7.1k
Fix gpu metrics #56006
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix gpu metrics #56006
Conversation
Signed-off-by: Alan Guo <aguo@anyscale.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request primarily addresses a bug in GPU metrics reporting by correcting how process IDs are accessed. It also includes a significant refactoring of ray.train.v2 to simplify dataset handling by removing the DatasetManager actor, which improves maintainability. Additionally, it updates the project version to 2.49.0 across various files, including Python, Java, and C++ components, and adds protobuf compatibility helpers for MessageToJson. The tests have been updated to reflect these changes, including a refactoring of aggregator agent tests for better structure. My feedback is minor, pointing out a small typo in a comment.
Signed-off-by: Alan Guo <aguo@anyscale.com>
Signed-off-by: Alan Guo <aguo@anyscale.com>
| if processes: | ||
| for proc in processes.values(): | ||
| gpu_pid_mapping[proc.pid].append(proc) | ||
| gpu_pid_mapping[proc["pid"]].append(proc) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dang this type of bug should have been caught with a type checker (if we can typify this and other classes)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think TypedDict actually breaks the typechecker.
Autocomplete suggests you can use .field but in reality you cannot.
Signed-off-by: Alan Guo <aguo@anyscale.com>
cherrypick #56009
Why are these changes needed?
Bugs introduced in #52102
Two bugs:
proc["pid"]instead ofproc.pid.processes_pidis backwards-incompatible change that ends up changing the dashboard APIs that power the ray dashboard. Maintain backwards-compatibilityVerified fix:

Metrics work again:
Ray Dashboard works again:

Related issue number
Checks
git commit -s) in this PR.scripts/format.shto lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/under thecorresponding
.rstfile.