has_cuda_context fails in WSL #5567

Closed
charlesbluca opened this issue Dec 7, 2021 · 4 comments · Fixed by #5568

Comments

@charlesbluca
Member

What happened:
When attempting to check if a CUDA context has already been created using has_cuda_context in WSL, we see failures at the NVML level:

NVMLError_Unknown                         Traceback (most recent call last)
<ipython-input-2-086bc6eeaba8> in <module>
----> 1 has_cuda_context()

~/dev/rapids-dask/distributed/distributed/diagnostics/nvml.py in has_cuda_context()
     76         handle = pynvml.nvmlDeviceGetHandleByIndex(index)
     77         if hasattr(pynvml, "nvmlDeviceGetComputeRunningProcesses_v2"):
---> 78             running_processes = pynvml.nvmlDeviceGetComputeRunningProcesses_v2(handle)
     79         else:
     80             running_processes = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)

~/miniconda3/envs/rapids-dask/lib/python3.8/site-packages/pynvml/nvml.py in nvmlDeviceGetComputeRunningProcesses_v2(handle)
   2139     else:
   2140         # error case
-> 2141         raise NVMLError(ret)
   2142 
   2143 def nvmlDeviceGetComputeRunningProcesses(handle):

This is actually expected, as querying active processes is not yet supported in WSL.

What you expected to happen:
We would expect the function to work as intended, i.e. return False, or the index of the device for which there is a CUDA context.

Minimal Complete Verifiable Example:

from distributed.diagnostics.nvml import has_cuda_context

has_cuda_context()

The error also occurs when running the bare PyNVML code:

from pynvml import *

nvmlInit()

h = nvmlDeviceGetHandleByIndex(0)
nvmlDeviceGetComputeRunningProcesses_v2(h)

Anything else we need to know?:
For context, this causes issues when attempting to create a CUDA context in dask-cuda, which is currently being tracked in rapidsai/dask-cuda#816.

I think a potential solution to this issue, in the same vein as #5343, would be to special-case WSL in has_cuda_context. This could be done either implicitly, with a try/except block to catch the NVML error, or explicitly, by checking whether the OS is WSL and handling that case differently:

# https://www.scivision.dev/python-detect-wsl/
from platform import uname

def in_wsl() -> bool:
    return 'microsoft-standard' in uname().release
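
For illustration, here is a rough sketch of how that explicit check might slot into has_cuda_context. The loop below is paraphrased from the traceback above rather than copied from the actual source, so treat it as an approximation only:

# Hypothetical sketch: guard has_cuda_context with the explicit WSL check.
# The loop body is paraphrased from the traceback above, not the real source.
import os
from platform import uname

import pynvml


def in_wsl() -> bool:
    # Repeated here so the sketch is self-contained.
    return "microsoft-standard" in uname().release


def has_cuda_context():
    """Return the index of the first device this process has a context on, else False."""
    if in_wsl():
        # Querying running processes is unsupported under WSL, so bail out early.
        return False
    pynvml.nvmlInit()
    for index in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        if hasattr(pynvml, "nvmlDeviceGetComputeRunningProcesses_v2"):
            running_processes = pynvml.nvmlDeviceGetComputeRunningProcesses_v2(handle)
        else:
            running_processes = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
        if any(proc.pid == os.getpid() for proc in running_processes):
            return index
    return False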

Environment:

  • Dask version: latest main
  • Python version: 3.8
  • Operating System: ubuntu 20.04 (WSL2)
  • Install method (conda, pip, source): source

cc @pentschev

@pentschev
Member

I think a potential solution to this issue, in the same vein as #5343, would be to special-case WSL in has_cuda_context. This could be done either implicitly, with a try/except block to catch the NVML error, or explicitly, by checking whether the OS is WSL and handling that case differently:

I would prefer an explicit check any day. Implicitly disabling NVML based on an error makes any future non-WSL-related issues very hard to debug.

# https://www.scivision.dev/python-detect-wsl/
from platform import uname

def in_wsl() -> bool:
    return 'microsoft-standard' in uname().release

This looks like a good enough check to me, +1 to just verifying this way and returning False at all times. A better alternative, mixing both implicit and explicit checks, would be to try/except implicitly and return False in the except block when the system is WSL.
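
Something along these lines, only a sketch with a hypothetical _compute_running_processes helper, not necessarily what the final PR would look like:

# Hypothetical sketch of the mixed approach: attempt the NVML query first and
# only swallow the error when we can tell we are running under WSL.
from platform import uname

import pynvml


def in_wsl() -> bool:
    return "microsoft-standard" in uname().release


def _compute_running_processes(handle):
    try:
        if hasattr(pynvml, "nvmlDeviceGetComputeRunningProcesses_v2"):
            return pynvml.nvmlDeviceGetComputeRunningProcesses_v2(handle)
        return pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
    except pynvml.NVMLError:
        if in_wsl():
            # Known NVML limitation under WSL: treat it as "no processes found".
            return []
        raise  # anything else is a real error and should surface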

@charlesbluca
Member Author

A better alternative, mixing both implicit and explicit checks, would be to try/except implicitly and return False in the except block when the system is WSL.

Good idea! I'm playing around with the fix in a branch now; with the failures in has_cuda_context out of the way, we run into issues with the real-time NVML monitoring:

tornado.application - ERROR - Exception in callback <bound method SystemMonitor.update of <SystemMonitor: cpu: 6 memory: 338 MB fds: 30>>
Traceback (most recent call last):
  File "/home/charlesb/miniconda3/envs/rapids-dask/lib/python3.8/site-packages/tornado/ioloop.py", line 905, in _run
    return self.callback()
  File "/home/charlesb/dev/rapids-dask/distributed/distributed/system_monitor.py", line 132, in update
    gpu_metrics = nvml.real_time()
  File "/home/charlesb/dev/rapids-dask/distributed/distributed/diagnostics/nvml.py", line 129, in real_time
    h = _pynvml_handles()
  File "/home/charlesb/dev/rapids-dask/distributed/distributed/diagnostics/nvml.py", line 64, in _pynvml_handles
    return pynvml.nvmlDeviceGetHandleByIndex(gpu_idx)
  File "/home/charlesb/miniconda3/envs/rapids-dask/lib/python3.8/site-packages/pynvml/nvml.py", line 1576, in nvmlDeviceGetHandleByIndex
    _nvmlCheckReturn(ret)
  File "/home/charlesb/miniconda3/envs/rapids-dask/lib/python3.8/site-packages/pynvml/nvml.py", line 743, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.nvml.NVMLError_Unknown: Unknown Error

Is this expected given the workaround we're doing here?

@pentschev
Member

I'm guessing this is now the more general issue with NVML in WSL. You'll probably need to forcibly disable it with DASK_DISTRIBUTED__DIAGNOSTICS__NVML=False. Could you try that? If it works, then I believe the CUDA context issue can be considered resolved (given the current NVML limitations with WSL).
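
For reference, assuming the standard Dask env-var-to-config mapping (DASK_DISTRIBUTED__DIAGNOSTICS__NVML corresponds to the distributed.diagnostics.nvml key), the setting could be applied either from the shell or programmatically before starting the cluster:

# Sketch only; assumes the standard Dask env-var-to-config mapping.
# From the shell:
#   export DASK_DISTRIBUTED__DIAGNOSTICS__NVML=False
# Or programmatically, before creating the cluster/client:
import dask

dask.config.set({"distributed.diagnostics.nvml": False})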

@charlesbluca
Member Author

Yeah, I tried that out and the cluster seems to work fine; I was just making sure nothing else was obviously wrong 🙂 Opened #5568 with the changes.
