Add check for unsupported NVML metrics #5343
Conversation
distributed/diagnostics/nvml.py
Outdated
try:
    util = pynvml.nvmlDeviceGetUtilizationRates(h).gpu
except pynvml.NVMLError_NotSupported:
    util = None
try:
    mem = pynvml.nvmlDeviceGetMemoryInfo(h).used
except pynvml.NVMLError_NotSupported:
    mem = None
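(For context on the pattern above: each metric fetch is guarded individually, so one unsupported query doesn't prevent the others from being reported. The same idea could be factored into a small helper; the sketch below is purely illustrative, with a made-up name, and is not the code proposed in this PR.)

import pynvml

def _metric_or_none(getter, handle):
    # Hypothetical helper: return the metric, or None when the
    # device/driver does not support it (e.g. GPU utilization on WSL).
    try:
        return getter(handle)
    except pynvml.NVMLError_NotSupported:
        return None

# Mirrors the guarded calls in the diff above:
# util = _metric_or_none(lambda h: pynvml.nvmlDeviceGetUtilizationRates(h).gpu, handle)
# mem = _metric_or_none(lambda h: pynvml.nvmlDeviceGetMemoryInfo(h).used, handle)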
Could you elaborate on where you have seen this happening? I'm wondering whether this is a bug, or a lack of capability for the device, in which case we should maybe consider alerting diagnostics somehow to make users aware.
Oh, as per #5342 it seems this is WSL. I'm not sure whether this is indeed unsupported or if PyNVML is perhaps missing something. I think it would be best to investigate this first; it may be possible to support that correctly.
Could you confirm the driver version you're running? It seems that NVML is supposed to be supported since 510.06.
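(Side note for anyone reproducing this: the driver version can also be read through NVML itself rather than nvidia-smi. A minimal sketch; nvmlSystemGetDriverVersion is a real NVML call, but whether it returns bytes or str depends on the pynvml version, so the decode is defensive.)

import pynvml

pynvml.nvmlInit()
version = pynvml.nvmlSystemGetDriverVersion()
# Older pynvml releases return bytes, newer ones return str.
if isinstance(version, bytes):
    version = version.decode()
print(version)  # e.g. "510.10"
pynvml.nvmlShutdown()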
Wait, 510.06 is supposed to support NVML within docker containers in WSL, but NVML was initially supported in WSL in 465.42.
Checking WSL's nvidia-smi, it looks like I'm on 510.10:
Thu Sep 23 13:24:43 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.00 Driver Version: 510.10 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Quadro RTX 8000 On | 00000000:15:00.0 Off | Off |
| 34% 35C P8 18W / 260W | 444MiB / 49152MiB | N/A Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Quadro RTX 8000 On | 00000000:2D:00.0 On | Off |
| 35% 61C P0 70W / 260W | 1715MiB / 49152MiB | N/A Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
microsoft/WSL#7162 seems to be tracking a variation of this issue. Definitely agree that we should let some time pass to see how this issue pans out; this was mostly just experimenting to see if we could get Distributed working on WSL with NVML support.
Yeah, looks similar. Although that still wouldn't explain why nvidia-smi shows used/total memory and (Py)NVML doesn't. Perhaps it's worth doing as suggested in gpuopenanalytics/pynvml#26 (comment) and checking that all WSL2 requirements are met.
Sorry if I didn't make this clear, but PyNVML is consistent with nvidia-smi in that all Distributed-relevant metrics except utilization can be accessed:
In [1]: import pynvml
In [2]: pynvml.nvmlInit()
In [3]: h = pynvml.nvmlDeviceGetHandleByIndex(0)
In [4]: pynvml.nvmlDeviceGetMemoryInfo(h).used
Out[4]: 465567744
In [5]: pynvml.nvmlDeviceGetMemoryInfo(h).total
Out[5]: 51539607552
In [6]: pynvml.nvmlDeviceGetName(h).decode()
Out[6]: 'Quadro RTX 8000'
In [7]: pynvml.nvmlDeviceGetUtilizationRates(h).gpu
---------------------------------------------------------------------------
NVMLError_NotSupported Traceback (most recent call last)
<ipython-input-7-fcad4c1b0a84> in <module>
----> 1 pynvml.nvmlDeviceGetUtilizationRates(h).gpu
~/compose/etc/conda/cuda_11.2/envs/rapids/lib/python3.8/site-packages/pynvml/nvml.py in nvmlDeviceGetUtilizationRates(handle)
2056 fn = _nvmlGetFunctionPointer("nvmlDeviceGetUtilizationRates")
2057 ret = fn(handle, byref(c_util))
-> 2058 _nvmlCheckReturn(ret)
2059 return c_util
2060
~/compose/etc/conda/cuda_11.2/envs/rapids/lib/python3.8/site-packages/pynvml/nvml.py in _nvmlCheckReturn(ret)
741 def _nvmlCheckReturn(ret):
742 if (ret != NVML_SUCCESS):
--> 743 raise NVMLError(ret)
744 return ret
745
NVMLError_NotSupported: Not Supported
Regardless, I'll double check the WSL2 requirements for GPU support and make sure I'm not missing anything.
@charlesbluca and I discussed this offline, he checked the WSL2 requirements and they were met, so that isn't the issue. He will also file a bug report internally to the NVML team.
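(In the meantime, one way to make users aware of unsupported metrics, along the lines of the "alerting diagnostics" idea earlier in the thread, could look roughly like the sketch below; the metric names and warning text are illustrative only, not part of this PR.)

import warnings
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# Probe the metrics Distributed cares about and warn on any that fail.
probes = {
    "utilization": lambda h: pynvml.nvmlDeviceGetUtilizationRates(h).gpu,
    "memory-used": lambda h: pynvml.nvmlDeviceGetMemoryInfo(h).used,
}
for name, probe in probes.items():
    try:
        probe(handle)
    except pynvml.NVMLError_NotSupported:
        warnings.warn(f"NVML metric {name!r} is not supported on this device/driver")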
Continuing here -- we are getting Unknown instead of NotSupported for this call (WSL2, RTX 3070), which is tripping up downstreams: #5628
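(If that behavior persists, the guard added here would likely need to catch more than NVMLError_NotSupported. One possibility, a sketch rather than anything decided in this thread, is to also treat NVMLError_Unknown as "metric unavailable":)

import pynvml

pynvml.nvmlInit()
h = pynvml.nvmlDeviceGetHandleByIndex(0)
try:
    util = pynvml.nvmlDeviceGetUtilizationRates(h).gpu
except (pynvml.NVMLError_NotSupported, pynvml.NVMLError_Unknown):
    # Some WSL2 setups reportedly raise NVMLError_Unknown here (see #5628).
    util = None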
LGTM now, thanks @charlesbluca !
Thanks @charlesbluca for the work and @pentschev for the review
Adds some checks to nvml.one_time and nvml.real_time to handle the case where GPUs are available for monitoring, but some metrics are unsupported for whatever reason.
pre-commit run --all-files