unify gpu checking around gpustat #35581
Conversation
Some of the errors do not seem to be related to my PR. Is there a "known good commit" I can test against?
I'm interested in the progress of this PR as I'm having issues with permissions, i.e. #28064
I rebased this to clear merge conflicts; I think it is ready for review.
Sorry for missing this one. Currently the team is busy with Ray Summit, we will review after that.
Rebased, in the hope someone will review.
I will take a look at it soon
+1 to this, as by default containers can't run as root in OpenShift environments, so a non-root-required mechanism for the detection is important.
@mattip sorry for the late review. This looks great to me. Could you rebase with master, since I moved the auto-detection code to nvidia_gpu.py
I will try, but it is frustrating to have to do this work twice. I hope this time it gets a review. |
Signed-off-by: mattip <matti.picus@gmail.com>
At least one of the failures is connected to this PR. I think it is due to the mocking?
-        cmdargs = ["WMIC", "PATH", "Win32_VideoController", "GET", props]
-        lines = subprocess.check_output(cmdargs).splitlines()[1:]
-        num_gpus = len([x.rstrip() for x in lines if x.startswith(b"NVIDIA")])
+        num_gpus = gpustat.gpu_count()
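A minimal sketch of how detection code built on this diff might guard the gpustat.gpu_count() call (the call the PR switches to, per the hunk above); the wrapper function name and the broad fallback to 0 are illustrative assumptions, not part of the PR:

```python
def detect_num_gpus():
    """Return the number of NVIDIA GPUs via gpustat, or 0 if unavailable."""
    try:
        import gpustat
        # gpu_count() queries NVML under the hood; it can fail on
        # machines without the NVIDIA driver installed.
        return gpustat.gpu_count()
    except Exception:
        # ImportError (gpustat not installed) or NVML errors.
        return 0
```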
Currently gpustat is only installed for ray[default], so I think we still need the old code that checks "/proc/driver/nvidia/gpus" for a minimal Ray install?
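For context, the legacy fallback mentioned here counts the per-device directories the NVIDIA kernel driver exposes under /proc. A hedged sketch (the path is parameterized here purely for testability; the real check reads /proc/driver/nvidia/gpus, which in some container environments requires root permissions, as noted in the PR description):

```python
import os

def count_nvidia_gpus(proc_path="/proc/driver/nvidia/gpus"):
    """Count GPU entries under the driver's proc directory.

    Returns 0 if the directory is missing or unreadable (no driver,
    or insufficient permissions).
    """
    try:
        return len(os.listdir(proc_path))
    except OSError:
        return 0
```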
Would it be acceptable to make gpustat an unconditional dependency for working with GPUs and Ray? That code is very fragile.
It might be hard since it has external dependencies as well:
install_requires = [
    'nvidia-ml-py>=12.535.108',  # see #107, #143, #161
    'psutil>=5.6.0',             # GH-1447
    'blessed>=1.17.1',           # GH-126
    'typing_extensions',
]
Should we copy the auto-detect code that PyTorch has in torch.cuda.device_count()? I think it doesn't depend on GPUtil or gpustat.
That goes to this code, which eventually uses ctypes and libnvidia-ml.so.1. How does this work on Windows and macOS?
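A rough sketch of the approach being discussed: load NVML directly via ctypes and query the device count. nvmlInit_v2, nvmlDeviceGetCount_v2, and nvmlShutdown are real NVML entry points; the graceful-fallback structure and function name are assumptions for illustration. On Windows the library would be nvml.dll, and NVML does not exist on macOS, which is exactly the portability concern raised here:

```python
import ctypes

def nvml_device_count():
    """Return the NVIDIA device count via NVML, or 0 if unavailable."""
    try:
        nvml = ctypes.CDLL("libnvidia-ml.so.1")
    except OSError:
        return 0  # library not present (no driver, or non-Linux platform)
    if nvml.nvmlInit_v2() != 0:
        return 0  # NVML init failed, e.g. driver not loaded
    count = ctypes.c_uint()
    rc = nvml.nvmlDeviceGetCount_v2(ctypes.byref(count))
    nvml.nvmlShutdown()
    return count.value if rc == 0 else 0
```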
Completely understand the frustration. Sorry again for the late review.
#41020 takes a different (and probably better) approach of using pynvml directly.
Replaced by #41020
Why are these changes needed?
Consistently use the required gpustat to detect GPU availability. There are fall-back paths that check /proc/driver/nvidia/gpus on Linux, which requires root permissions. Also update gpustat to 1.1 to get a fix for wookayin/gpustat#142.
Related issue number
Toward #28064 (the fix for the dashboard server checking / disk usage, which requires root permissions, is not part of this PR and should probably be a separate issue).
Checks
- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I've added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.