has_cuda_context fails in WSL #5567
Comments
I would any day prefer an explicit check. Implicitly disabling NVML based on an error makes any future non-WSL-related issues very hard to debug.
This looks like a good enough check for me, +1 to just verifying this way and returning.
Good idea! Playing around with the fix in a branch now, and now that we are past the earlier failures we see:

tornado.application - ERROR - Exception in callback <bound method SystemMonitor.update of <SystemMonitor: cpu: 6 memory: 338 MB fds: 30>>
Traceback (most recent call last):
File "/home/charlesb/miniconda3/envs/rapids-dask/lib/python3.8/site-packages/tornado/ioloop.py", line 905, in _run
return self.callback()
File "/home/charlesb/dev/rapids-dask/distributed/distributed/system_monitor.py", line 132, in update
gpu_metrics = nvml.real_time()
File "/home/charlesb/dev/rapids-dask/distributed/distributed/diagnostics/nvml.py", line 129, in real_time
h = _pynvml_handles()
File "/home/charlesb/dev/rapids-dask/distributed/distributed/diagnostics/nvml.py", line 64, in _pynvml_handles
return pynvml.nvmlDeviceGetHandleByIndex(gpu_idx)
File "/home/charlesb/miniconda3/envs/rapids-dask/lib/python3.8/site-packages/pynvml/nvml.py", line 1576, in nvmlDeviceGetHandleByIndex
_nvmlCheckReturn(ret)
File "/home/charlesb/miniconda3/envs/rapids-dask/lib/python3.8/site-packages/pynvml/nvml.py", line 743, in _nvmlCheckReturn
raise NVMLError(ret)
pynvml.nvml.NVMLError_Unknown: Unknown Error

Is this expected given the workaround we're doing here?
I'm guessing this is now the general issue with NVML in WSL. You'll probably need to forcefully disable it.
Yeah, tried that out and the cluster seems to work fine; I was just making sure nothing else seemed obviously wrong 🙂 Opened #5568 with the changes.
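(For reference, a minimal sketch of forcefully disabling NVML diagnostics through Dask's configuration; the distributed.diagnostics.nvml key is an assumption, since the inline snippet in the comment above was not captured.)

```python
# Hedged sketch: turn off distributed's NVML-based diagnostics via Dask config.
# The key name below is an assumption; check distributed's configuration
# reference for the authoritative option.
import dask

dask.config.set({"distributed.diagnostics.nvml": False})
```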
What happened:
When attempting to check whether a CUDA context has already been created using has_cuda_context in WSL, we see failures at the NVML level. This is actually expected, as querying active processes is not yet supported in WSL.
What you expected to happen:
We would want the function to work as intended, i.e., returning False or the index of the device for which there is a CUDA context.
Minimal Complete Verifiable Example:
The error also occurs when running the bare PyNVML code:
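A minimal sketch of such a reproduction, assuming the failure comes from NVML's process query (the original snippet was not captured here, so the exact call may differ):

```python
# Hedged sketch: reproduce the NVML query that has_cuda_context relies on.
# On WSL this raises pynvml.NVMLError_Unknown, since querying running
# processes is not yet supported there.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
print(procs)
```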
Anything else we need to know?:
For context, this causes issues when attempting to create a CUDA context in dask-cuda, which is currently being tracked in rapidsai/dask-cuda#816.
I think a potential solution to this issue, in the same vein as #5343, would be to make a special case for WSL in has_cuda_context. This could be done either implicitly, with a try/except block to catch the NVML error, or explicitly, by checking whether the OS is WSL and doing something different in that case (see the sketch below):
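A minimal sketch of the explicit variant, assuming WSL is detected from the kernel release string; this is an illustrative heuristic, not distributed's actual code, and the NVML-based check itself is elided:

```python
# Hedged sketch: skip the unsupported NVML process query when running in WSL.
import platform


def _running_in_wsl() -> bool:
    # WSL kernels report "microsoft" in their release string,
    # e.g. "5.10.16.3-microsoft-standard-WSL2".
    return "microsoft" in platform.uname().release.lower()


def has_cuda_context():
    if _running_in_wsl():
        # Querying active processes via NVML is not supported in WSL,
        # so report "no CUDA context found" instead of raising.
        return False
    ...  # existing NVML-based check for other platforms (elided)
```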
Environment:
main
cc @pentschev