[Core] Ray auto detect nvidia Gpu with pynvml #41020
This seems like the right direction, and will close #35581 and a number of issues around GPU detection.
Will this be in a follow-up PR?
The auto detection will be in this PR as well, since the change is not too big and is part of using the NVML library. Will update once I'm done with the changes.
Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
As per #41020 (comment)
@wookayin @XuehaiPan it'd be great if you can also review our usage of pynvml since you are experts here :)
Why are these changes needed?
We want to vendor nvidia-ml-py into Ray to serve several purposes:
- Fix nvidia_gpu auto detection. The current implementation doesn't work when Ray doesn't have root access (see "Ray cannot access GPUs under a non-root user (failed access of ray.init() to root-owned /proc/driver/nvidia/gpus)", #28064). All of the alternatives mentioned in that issue require libnvidia-ml.so, the NVIDIA driver's C API. So we use pynvml, a Python wrapper around libnvidia-ml.so with no other dependencies. Additionally, we can remove GPUtil from auto detection as well.
- Remove the gpustat dependency from Ray[default] by replacing it with pynvml, which gpustat depends on as well (later PR).
Related issue number
#28064
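For reviewers unfamiliar with NVML, the detection approach described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: it assumes the standalone pynvml module from nvidia-ml-py rather than a vendored copy, and the function name detect_nvidia_gpu_names is hypothetical. It returns an empty list when the module or the driver library is unavailable.

```python
from typing import List


def detect_nvidia_gpu_names() -> List[str]:
    """Return names of visible NVIDIA GPUs, or [] if NVML is unavailable.

    Hypothetical helper for illustration only; not Ray's implementation.
    """
    try:
        import pynvml  # provided by the nvidia-ml-py package
    except ImportError:
        return []

    try:
        # Loads libnvidia-ml.so; no access to /proc/driver/nvidia is needed.
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        # Driver or shared library missing: treat as "no GPUs".
        return []

    try:
        names = []
        for index in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(index)
            name = pynvml.nvmlDeviceGetName(handle)
            # Older nvidia-ml-py releases return bytes, newer ones return str.
            names.append(name.decode() if isinstance(name, bytes) else name)
        return names
    except pynvml.NVMLError:
        return []
    finally:
        pynvml.nvmlShutdown()


if __name__ == "__main__":
    gpus = detect_nvidia_gpu_names()
    print(f"Detected {len(gpus)} NVIDIA GPU(s): {gpus}")
```

Falling back to an empty list keeps the sketch usable on CPU-only machines; real detection code may prefer to log the NVML error instead of swallowing it.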
Checks
- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.