Add check for unsupported NVML metrics #5343
Conversation
distributed/diagnostics/nvml.py
Outdated
try:
    util = pynvml.nvmlDeviceGetUtilizationRates(h).gpu
except pynvml.NVMLError_NotSupported:
    util = None
try:
    mem = pynvml.nvmlDeviceGetMemoryInfo(h).used
except pynvml.NVMLError_NotSupported:
    mem = None
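(For context on the pattern above: each metric fetch is guarded individually, so one unsupported query doesn't prevent the others from being reported. The same idea could be factored into a small helper; the sketch below is purely illustrative, with a made-up name, and is not the code proposed in this PR.)

import pynvml

def _metric_or_none(getter, handle):
    # Hypothetical helper: return the metric, or None when the
    # device/driver does not support it (e.g. GPU utilization on WSL).
    try:
        return getter(handle)
    except pynvml.NVMLError_NotSupported:
        return None

# Mirrors the guarded calls in the diff above:
# util = _metric_or_none(lambda h: pynvml.nvmlDeviceGetUtilizationRates(h).gpu, handle)
# mem = _metric_or_none(lambda h: pynvml.nvmlDeviceGetMemoryInfo(h).used, handle)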
Could you elaborate on where you have seen this happening? I'm wondering whether this is a bug, or a lack of capability for the device, in which case we should maybe consider alerting diagnostics somehow to make users aware.
Oh, as per #5342 it seems this is WSL. I'm not sure whether this is indeed unsupported or if PyNVML is perhaps missing something. I think it would be best to investigate this first; it may be possible to support that correctly.
Could you confirm the driver version you're running? It seems that NVML is supposed to be supported since 510.06.
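(Side note for anyone reproducing this: the driver version can also be read through NVML itself rather than nvidia-smi. A minimal sketch; nvmlSystemGetDriverVersion is a real NVML call, but whether it returns bytes or str depends on the pynvml version, so the decode is defensive.)

import pynvml

pynvml.nvmlInit()
version = pynvml.nvmlSystemGetDriverVersion()
# Older pynvml releases return bytes, newer ones return str.
if isinstance(version, bytes):
    version = version.decode()
print(version)  # e.g. "510.10"
pynvml.nvmlShutdown()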
Wait, 510.06 is supposed to support NVML within docker containers in WSL, but NVML was initially supported in WSL in 465.42.
Checking WSL's nvidia-smi, it looks like I'm on 510.10:
Thu Sep 23 13:24:43 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.00 Driver Version: 510.10 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Quadro RTX 8000 On | 00000000:15:00.0 Off | Off |
| 34% 35C P8 18W / 260W | 444MiB / 49152MiB | N/A Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Quadro RTX 8000 On | 00000000:2D:00.0 On | Off |
| 35% 61C P0 70W / 260W | 1715MiB / 49152MiB | N/A Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
microsoft/WSL#7162 seems to be tracking a variation of this issue. Definitely agree that we should let some time pass to see how this issue pans out; this was mostly just experimenting to see if we could get Distributed working on WSL with NVML support.
Yeah, looks similar. Although that still wouldn't explain why nvidia-smi shows used/total memory and (Py)NVML doesn't. Perhaps it's worth doing as suggested in gpuopenanalytics/pynvml#26 (comment) and checking that all WSL2 requirements are met.
Sorry if I didn't make this clear, but PyNVML is consistent with nvidia-smi in that all Distributed-relevant metrics except utilization can be accessed:
In [1]: import pynvml
In [2]: pynvml.nvmlInit()
In [3]: h = pynvml.nvmlDeviceGetHandleByIndex(0)
In [4]: pynvml.nvmlDeviceGetMemoryInfo(h).used
Out[4]: 465567744
In [5]: pynvml.nvmlDeviceGetMemoryInfo(h).total
Out[5]: 51539607552
In [6]: pynvml.nvmlDeviceGetName(h).decode()
Out[6]: 'Quadro RTX 8000'
In [7]: pynvml.nvmlDeviceGetUtilizationRates(h).gpu
---------------------------------------------------------------------------
NVMLError_NotSupported Traceback (most recent call last)
<ipython-input-7-fcad4c1b0a84> in <module>
----> 1 pynvml.nvmlDeviceGetUtilizationRates(h).gpu
~/compose/etc/conda/cuda_11.2/envs/rapids/lib/python3.8/site-packages/pynvml/nvml.py in nvmlDeviceGetUtilizationRates(handle)
2056 fn = _nvmlGetFunctionPointer("nvmlDeviceGetUtilizationRates")
2057 ret = fn(handle, byref(c_util))
-> 2058 _nvmlCheckReturn(ret)
2059 return c_util
2060
~/compose/etc/conda/cuda_11.2/envs/rapids/lib/python3.8/site-packages/pynvml/nvml.py in _nvmlCheckReturn(ret)
741 def _nvmlCheckReturn(ret):
742 if (ret != NVML_SUCCESS):
--> 743 raise NVMLError(ret)
744 return ret
745
NVMLError_NotSupported: Not Supported
Regardless, I'll double check the WSL2 requirements for GPU support and make sure I'm not missing anything.
@charlesbluca and I discussed this offline, he checked the WSL2 requirements and they were met, so that isn't the issue. He will also file a bug report internally to the NVML team.
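(In the meantime, one way to make users aware of unsupported metrics, along the lines of the "alerting diagnostics" idea earlier in the thread, could look roughly like the sketch below; the metric names and warning text are illustrative only, not part of this PR.)

import warnings
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# Probe the metrics Distributed cares about and warn on any that fail.
probes = {
    "utilization": lambda h: pynvml.nvmlDeviceGetUtilizationRates(h).gpu,
    "memory-used": lambda h: pynvml.nvmlDeviceGetMemoryInfo(h).used,
}
for name, probe in probes.items():
    try:
        probe(handle)
    except pynvml.NVMLError_NotSupported:
        warnings.warn(f"NVML metric {name!r} is not supported on this device/driver")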
Continuing here -- we are getting Unknown instead of NotSupported for this call (WSL2, RTX 3070), which is tripping up downstreams: #5628
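(If that behavior persists, the guard added here would likely need to catch more than NVMLError_NotSupported. One possibility, a sketch rather than anything decided in this thread, is to also treat NVMLError_Unknown as "metric unavailable":)

import pynvml

pynvml.nvmlInit()
h = pynvml.nvmlDeviceGetHandleByIndex(0)
try:
    util = pynvml.nvmlDeviceGetUtilizationRates(h).gpu
except (pynvml.NVMLError_NotSupported, pynvml.NVMLError_Unknown):
    # Some WSL2 setups reportedly raise NVMLError_Unknown here (see #5628).
    util = None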
LGTM now, thanks @charlesbluca !
Thanks @charlesbluca for the work and @pentschev for the review
Adds some checks to nvml.one_time and nvml.real_time to handle the case where GPUs are available for monitoring, but some metrics are unsupported for whatever reason.
pre-commit run --all-files