has_cuda_context fails in WSL #5567

Closed
charlesbluca opened this issue Dec 7, 2021 · 4 comments · Fixed by #5568

Comments

@charlesbluca
Member

What happened:
When attempting to check if a CUDA context has already been created using has_cuda_context in WSL, we see failures at the NVML level:

NVMLError_Unknown                         Traceback (most recent call last)
<ipython-input-2-086bc6eeaba8> in <module>
----> 1 has_cuda_context()

~/dev/rapids-dask/distributed/distributed/diagnostics/nvml.py in has_cuda_context()
     76         handle = pynvml.nvmlDeviceGetHandleByIndex(index)
     77         if hasattr(pynvml, "nvmlDeviceGetComputeRunningProcesses_v2"):
---> 78             running_processes = pynvml.nvmlDeviceGetComputeRunningProcesses_v2(handle)
     79         else:
     80             running_processes = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)

~/miniconda3/envs/rapids-dask/lib/python3.8/site-packages/pynvml/nvml.py in nvmlDeviceGetComputeRunningProcesses_v2(handle)
   2139     else:
   2140         # error case
-> 2141         raise NVMLError(ret)
   2142 
   2143 def nvmlDeviceGetComputeRunningProcesses(handle):

This is actually expected, as querying active processes is not yet supported in WSL.

What you expected to happen:
We would expect the function to work as intended, i.e. return False, or the index of the device for which there is a CUDA context.

Minimal Complete Verifiable Example:

from distributed.diagnostics.nvml import has_cuda_context

has_cuda_context()

The error also occurs when running the bare PyNVML code:

from pynvml import *

nvmlInit()

h = nvmlDeviceGetHandleByIndex(0)
nvmlDeviceGetComputeRunningProcesses_v2(h)

Anything else we need to know?:
For context, this causes issues when attempting to create a CUDA context in dask-cuda, which is currently being tracked in rapidsai/dask-cuda#816.

I think a potential solution to this issue, in the same vein as #5343, would be to special-case WSL in has_cuda_context. This could be done either implicitly, with a try/except block to catch the NVML error, or explicitly, by checking whether the OS is WSL and handling that case differently:

# https://www.scivision.dev/python-detect-wsl/
from platform import uname

def in_wsl() -> bool:
    return 'microsoft-standard' in uname().release
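
For illustration, here is a rough sketch of how that explicit check might slot into has_cuda_context. The loop below is paraphrased from the traceback above rather than copied from the actual source, so treat it as an approximation only:

# Hypothetical sketch: guard has_cuda_context with the explicit WSL check.
# The loop body is paraphrased from the traceback above, not the real source.
import os
from platform import uname

import pynvml


def in_wsl() -> bool:
    # Repeated here so the sketch is self-contained.
    return "microsoft-standard" in uname().release


def has_cuda_context():
    """Return the index of the first device this process has a context on, else False."""
    if in_wsl():
        # Querying running processes is unsupported under WSL, so bail out early.
        return False
    pynvml.nvmlInit()
    for index in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        if hasattr(pynvml, "nvmlDeviceGetComputeRunningProcesses_v2"):
            running_processes = pynvml.nvmlDeviceGetComputeRunningProcesses_v2(handle)
        else:
            running_processes = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
        if any(proc.pid == os.getpid() for proc in running_processes):
            return index
    return False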

Environment:

  • Dask version: latest main
  • Python version: 3.8
  • Operating System: ubuntu 20.04 (WSL2)
  • Install method (conda, pip, source): source

cc @pentschev

@pentschev
Member

I think a potential solution to this issue, in the same vein as #5343, would be to special-case WSL in has_cuda_context. This could be done either implicitly, with a try/except block to catch the NVML error, or explicitly, by checking whether the OS is WSL and handling that case differently:

I would prefer an explicit check any day. Implicitly disabling NVML based on an error makes any future non-WSL-related issues very hard to debug.

# https://www.scivision.dev/python-detect-wsl/
from platform import uname

def in_wsl() -> bool:
    return 'microsoft-standard' in uname().release

This looks like a good enough check to me, +1 to just verifying this way and returning False at all times. A better alternative, mixing both implicit and explicit checks, would be to try/except implicitly and return False in the except block when the system is WSL.
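
Something along these lines, only a sketch with a hypothetical _compute_running_processes helper, not necessarily what the final PR would look like:

# Hypothetical sketch of the mixed approach: attempt the NVML query first and
# only swallow the error when we can tell we are running under WSL.
from platform import uname

import pynvml


def in_wsl() -> bool:
    return "microsoft-standard" in uname().release


def _compute_running_processes(handle):
    try:
        if hasattr(pynvml, "nvmlDeviceGetComputeRunningProcesses_v2"):
            return pynvml.nvmlDeviceGetComputeRunningProcesses_v2(handle)
        return pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
    except pynvml.NVMLError:
        if in_wsl():
            # Known NVML limitation under WSL: treat it as "no processes found".
            return []
        raise  # anything else is a real error and should surface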

@charlesbluca
Member Author

A better alternative, mixing both implicit and explicit checks, would be to try/except implicitly and return False in the except block when the system is WSL.

Good idea! I'm playing around with the fix in a branch now; with the failures in has_cuda_context out of the way, we run into issues with the real-time NVML monitoring:

tornado.application - ERROR - Exception in callback <bound method SystemMonitor.update of <SystemMonitor: cpu: 6 memory: 338 MB fds: 30>>
Traceback (most recent call last):
  File "/home/charlesb/miniconda3/envs/rapids-dask/lib/python3.8/site-packages/tornado/ioloop.py", line 905, in _run
    return self.callback()
  File "/home/charlesb/dev/rapids-dask/distributed/distributed/system_monitor.py", line 132, in update
    gpu_metrics = nvml.real_time()
  File "/home/charlesb/dev/rapids-dask/distributed/distributed/diagnostics/nvml.py", line 129, in real_time
    h = _pynvml_handles()
  File "/home/charlesb/dev/rapids-dask/distributed/distributed/diagnostics/nvml.py", line 64, in _pynvml_handles
    return pynvml.nvmlDeviceGetHandleByIndex(gpu_idx)
  File "/home/charlesb/miniconda3/envs/rapids-dask/lib/python3.8/site-packages/pynvml/nvml.py", line 1576, in nvmlDeviceGetHandleByIndex
    _nvmlCheckReturn(ret)
  File "/home/charlesb/miniconda3/envs/rapids-dask/lib/python3.8/site-packages/pynvml/nvml.py", line 743, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.nvml.NVMLError_Unknown: Unknown Error

Is this expected given the workaround we're doing here?

@pentschev
Member

I'm guessing this is now the more general issue with NVML in WSL. You'll probably need to forcibly disable it with DASK_DISTRIBUTED__DIAGNOSTICS__NVML=False. Could you try that? If it works, then I believe the CUDA context issue can be considered resolved (given the current NVML limitations with WSL).
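
For reference, assuming the standard Dask env-var-to-config mapping (DASK_DISTRIBUTED__DIAGNOSTICS__NVML corresponds to the distributed.diagnostics.nvml key), the setting could be applied either from the shell or programmatically before starting the cluster:

# Sketch only; assumes the standard Dask env-var-to-config mapping.
# From the shell:
#   export DASK_DISTRIBUTED__DIAGNOSTICS__NVML=False
# Or programmatically, before creating the cluster/client:
import dask

dask.config.set({"distributed.diagnostics.nvml": False})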

@charlesbluca
Member Author

Yeah, I tried that out and the cluster seems to work fine; I was just making sure nothing else was obviously wrong 🙂 Opened #5568 with the changes.
