
Add check for unsupported NVML metrics #5343

Merged
merged 3 commits into dask:main on Sep 28, 2021

Conversation

charlesbluca
Member

Adds some checks to nvml.one_time and nvml.real_time to handle the case where GPUs are available for monitoring, but some metrics are unsupported for whatever reason.

Comment on lines 86 to 93
try:
util = pynvml.nvmlDeviceGetUtilizationRates(h).gpu
except pynvml.NVMLError_NotSupported:
util = None
try:
mem = pynvml.nvmlDeviceGetMemoryInfo(h).used
except pynvml.NVMLError_NotSupported:
mem = None
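For illustration, a minimal sketch (not part of this PR, and not the actual distributed.diagnostics.nvml code) of how a consumer of these metrics might handle the None values the checks above produce; the function and key names are hypothetical:

# Hypothetical consumer of NVML metrics that may be None when the
# underlying query is unsupported on the current platform.
def format_gpu_metrics(util, mem):
    # util: GPU utilization percentage or None; mem: used memory in bytes or None
    return {
        "gpu-utilization": util if util is not None else "N/A",
        "gpu-memory-used": mem if mem is not None else "N/A",
    }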
Member

Could you elaborate on where you have seen this happening? I'm wondering whether this is a bug or a lack of capability on the device, in which case we should maybe surface it in diagnostics somehow to make users aware.

Member

Oh, as per #5342 it seems this is WSL. I'm not sure whether this is indeed unsupported or whether PyNVML is perhaps missing something. I think it would be best if we investigate this first; it may be possible to support it correctly.
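As a starting point for that investigation, a small probe along these lines can show which NVML queries actually succeed on each visible device. This is only a sketch using standard PyNVML calls; it is not code from the PR:

import pynvml

# Report which NVML metrics each visible GPU supports on this platform.
pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        for name, query in [
            ("memory", lambda h: pynvml.nvmlDeviceGetMemoryInfo(h).used),
            ("utilization", lambda h: pynvml.nvmlDeviceGetUtilizationRates(h).gpu),
        ]:
            try:
                print(f"GPU {i} {name}: {query(handle)}")
            except pynvml.NVMLError as e:
                # Covers NVMLError_NotSupported, NVMLError_Unknown, etc.
                print(f"GPU {i} {name}: unavailable ({e})")
finally:
    pynvml.nvmlShutdown()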

Member
@pentschev Sep 23, 2021

Could you confirm the driver version you're running? It seems that NVML is supposed to be supported since 510.06.
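For reference, the driver version can also be queried through PyNVML itself; a quick check (not part of the PR) might look like:

import pynvml

pynvml.nvmlInit()
try:
    version = pynvml.nvmlSystemGetDriverVersion()
    # Older PyNVML releases return bytes here; newer ones return str.
    if isinstance(version, bytes):
        version = version.decode()
    print(version)
finally:
    pynvml.nvmlShutdown()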

Member

Wait, 510.06 is supposed to support NVML within Docker containers in WSL, but NVML was initially supported in WSL in 465.42.

Member Author

Checking WSL's nvidia-smi, it looks like I'm on 510.10:

Thu Sep 23 13:24:43 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.00       Driver Version: 510.10       CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 8000     On   | 00000000:15:00.0 Off |                  Off |
| 34%   35C    P8    18W / 260W |    444MiB / 49152MiB |     N/A      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 8000     On   | 00000000:2D:00.0  On |                  Off |
| 35%   61C    P0    70W / 260W |   1715MiB / 49152MiB |     N/A      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

microsoft/WSL#7162 seems to be tracking a variation of this issue. Definitely agree that we should let some time pass to see how this issue pans out; this was mostly just an experiment to see if we could get Distributed working on WSL with NVML support.

Member

Yeah, looks similar. Although that still wouldn't explain why nvidia-smi shows used/total memory and (Py)NVML doesn't. Perhaps it's worth doing as suggested in gpuopenanalytics/pynvml#26 (comment) and checking that all WSL2 requirements are met.

Member Author
@charlesbluca Sep 24, 2021

Sorry if I didn't make this clear, but PyNVML is consistent with nvidia-smi in that all Distributed-relevant metrics except utilization can be accessed:

In [1]: import pynvml

In [2]: pynvml.nvmlInit()

In [3]: h = pynvml.nvmlDeviceGetHandleByIndex(0)

In [4]: pynvml.nvmlDeviceGetMemoryInfo(h).used
Out[4]: 465567744

In [5]: pynvml.nvmlDeviceGetMemoryInfo(h).total
Out[5]: 51539607552

In [6]: pynvml.nvmlDeviceGetName(h).decode()
Out[6]: 'Quadro RTX 8000'

In [7]: pynvml.nvmlDeviceGetUtilizationRates(h).gpu
---------------------------------------------------------------------------
NVMLError_NotSupported                    Traceback (most recent call last)
<ipython-input-7-fcad4c1b0a84> in <module>
----> 1 pynvml.nvmlDeviceGetUtilizationRates(h).gpu

~/compose/etc/conda/cuda_11.2/envs/rapids/lib/python3.8/site-packages/pynvml/nvml.py in nvmlDeviceGetUtilizationRates(handle)
   2056     fn = _nvmlGetFunctionPointer("nvmlDeviceGetUtilizationRates")
   2057     ret = fn(handle, byref(c_util))
-> 2058     _nvmlCheckReturn(ret)
   2059     return c_util
   2060 

~/compose/etc/conda/cuda_11.2/envs/rapids/lib/python3.8/site-packages/pynvml/nvml.py in _nvmlCheckReturn(ret)
    741 def _nvmlCheckReturn(ret):
    742     if (ret != NVML_SUCCESS):
--> 743         raise NVMLError(ret)
    744     return ret
    745 

NVMLError_NotSupported: Not Supported

Regardless, I'll double check the WSL2 requirements for GPU support and make sure I'm not missing anything.

Member

@charlesbluca and I discussed this offline; he checked the WSL2 requirements and they were met, so that isn't the issue. He will also file a bug report internally with the NVML team.


Continuing here: we are getting Unknown instead of NotSupported for this call (WSL2, RTX 3070), which is tripping up downstream projects: #5628
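A hedged sketch of the broader guard that situation suggests, catching the base NVMLError rather than only NVMLError_NotSupported; this is illustrative and not the code merged in this PR:

import pynvml

def safe_gpu_utilization(handle):
    # Treat any NVML failure (NotSupported on some WSL2 setups, Unknown on
    # others) as "metric unavailable" rather than letting it propagate.
    try:
        return pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
    except pynvml.NVMLError:
        return None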

Member
@pentschev left a comment

LGTM now, thanks @charlesbluca !

@quasiben
Member

Thanks @charlesbluca for the work and @pentschev for the review

@quasiben merged commit 70158c8 into dask:main on Sep 28, 2021
@charlesbluca deleted the check-nvml-unsupported branch on July 20, 2022
Successfully merging this pull request may close these issues.

NVML monitoring fails on Windows Subsystem for Linux w/ GPU support