Move SystemMonitor's GPU initialization back to constructor #4866
Conversation
pentschev commented Jun 1, 2021 (edited by quasiben)
- Closes System Monitor Error (rapidsai/dask-cuda#634)
Thanks for doing this @pentschev 😄
distributed/diagnostics/nvml.py (outdated diff)

-        cuda_visible_devices = list(range(count))
-        gpu_idx = cuda_visible_devices[0]
-        return pynvml.nvmlDeviceGetHandleByIndex(gpu_idx)
+        return pynvml.nvmlDeviceGetHandleByIndex(0)
Nitpicky, but at this point could we just call `pynvml.nvmlDeviceGetHandleByIndex(0)` directly in the places where we used to call `nvml._pynvml_handles()`?
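For illustration, a minimal sketch of what that inlined call could look like (the memory query at the end is only example usage, not part of the diff):

```python
# Hedged sketch: query the first visible device with pynvml directly instead
# of going through a one-line wrapper like nvml._pynvml_handles().
import pynvml

pynvml.nvmlInit()                              # raises if no NVML/driver is present
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first device enumerated by NVML
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(mem.used, mem.total)
```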
Done in ddf9a43
distributed/system_monitor.py (outdated diff)

@@ -92,10 +93,6 @@ def update(self):
        # give external modules (like dask-cuda) a chance to initialize CUDA context
        if nvml is not None and nvml.nvmlInit is not None:
I think the `nvml.nvmlInit` check here is redundant now, though it shouldn't cause any problems to leave it.
It's not redundant; this is what I mentioned earlier when I also thought it was. It refers to the module-level `nvmlInit = None` object in `distributed/diagnostics/nvml.py`, not the `pynvml.nvmlInit` method. I think that's a confusing naming choice nevertheless, but I won't touch it right now.
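For readers following along, a rough sketch of the module-level sentinel pattern being referred to (simplified, with an assumed `init_once` helper name; not the actual distributed/diagnostics/nvml.py source):

```python
# Simplified sketch of the naming collision: the module-level nvml.nvmlInit is
# a sentinel object, distinct from the pynvml.nvmlInit() function it shadows.
import pynvml

nvmlInit = None  # stays None until NVML has actually been initialized


def init_once():
    global nvmlInit
    if nvmlInit is not None:
        return                      # already initialized
    pynvml.nvmlInit()               # the pynvml function with the near-identical name
    nvmlInit = pynvml.nvmlInit      # any non-None value works as a marker
```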
I agree with that; I'm referring to the fact that by the time we call `update()`, we will have also called `nvml.one_time()`, meaning that `nvml.nvmlInit` will always not be `None` if `nvml` is not `None`.
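A rough sketch of the ordering being described (simplified, with assumed shapes for `one_time()`/`real_time()`; not the actual SystemMonitor source): since the constructor already initializes NVML, the extra `nvml.nvmlInit` guard inside `update()` can never be the deciding check.

```python
# Rough sketch (not the actual Distributed source) of the constructor/update
# ordering: NVML is initialized when the monitor is constructed, so by the
# time update() runs, nvml.nvmlInit is guaranteed to be set whenever nvml is.
try:
    from distributed.diagnostics import nvml   # assumed import shape
except Exception:
    nvml = None


class SystemMonitorSketch:
    def __init__(self):
        self.gpu_name = None
        if nvml is not None:
            gpu_extra = nvml.one_time()        # initializes NVML in the constructor
            self.gpu_name = gpu_extra.get("name")

    def update(self):
        metrics = {}
        if nvml is not None:                   # also checking nvml.nvmlInit would be redundant
            metrics.update(nvml.real_time())
        return metrics
```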
Ah sorry, you're right, good catch. I've updated that in 79b315b.
Thanks @pentschev! Is there a regression test we should add here or in dask-cuda?
@jrbourbeau since it requires PyNVML and a GPU, unfortunately we can't test it in Distributed right now, but @charlesbluca is working on testing that in Dask-CUDA in rapidsai/dask-cuda#635.
I think there is an issue here still. With these changes, every worker now reports the same GPU's memory:

In [1]: from dask.distributed import Client, fire_and_forget, wait
   ...: from dask_cuda import LocalCUDACluster
   ...: from dask.utils import parse_bytes
   ...: import dask

In [2]: cluster = LocalCUDACluster()

In [3]: client = Client(cluster)

In [4]: import rmm

In [5]: rmm.reinitialize(pool_allocator=1e9)  # create data on the client/GPU 0

In [6]: for w in cluster.scheduler.workers:
   ...:     print(cluster.scheduler.workers[w].metrics['gpu_memory_used'])
17728536576
17728536576
17728536576
17728536576
17728536576
17728536576
17728536576
17728536576
17728536576
17728536576
17728536576
17728536576
17728536576
17728536576
17728536576
17728536576

In the above, we should only see one GPU with a large allocation.
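For context, a hedged sketch of why a fixed index 0 is not enough here: each Dask-CUDA worker gets its own `CUDA_VISIBLE_DEVICES`, so the NVML handle has to be chosen from that mapping (the integer-only parsing below is a simplification; real values can also be device UUIDs):

```python
import os
import pynvml

pynvml.nvmlInit()


def worker_gpu_handle():
    # The first entry of CUDA_VISIBLE_DEVICES is the GPU this worker "owns";
    # always using index 0 instead makes every worker report the same device.
    visible = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    idx = int(visible.split(",")[0]) if visible else 0
    return pynvml.nvmlDeviceGetHandleByIndex(idx)


handle = worker_gpu_handle()
print(pynvml.nvmlDeviceGetMemoryInfo(handle).used)
```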
Thanks for pointing me to rapidsai/dask-cuda#635 @pentschev -- that's what I was looking for.
You're right. I've reverted the changes now. However, this breaks https://github.com/rapidsai/dask-cuda/blob/81bbc6f85575826b13b3fb45894b54135514e668/dask_cuda/tests/test_dask_cuda_worker.py#L21-L59, a test that verifies the `CUDA_VISIBLE_DEVICES` ordering of each worker, including device numbers that may not exist on the test machine.
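Roughly, the ordering that test checks looks like the following (a hypothetical `rotate_devices` helper, not the actual `dask_cuda.utils.cuda_visible_devices` implementation): worker i sees a `CUDA_VISIBLE_DEVICES` string rotated so its own device comes first.

```python
def rotate_devices(i, devices):
    # Hypothetical helper mirroring the per-worker rotation the test expects.
    return ",".join(str(d) for d in devices[i:] + devices[:i])


assert rotate_devices(0, [0, 1, 2, 3]) == "0,1,2,3"
assert rotate_devices(2, [0, 1, 2, 3]) == "2,3,0,1"
```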
Alright, this will break the test I mentioned above, but there's not much we can do right now to prevent it without adding considerable complexity to Distributed or Dask-CUDA. I say we should merge this as is and then xfail those tests in Dask-CUDA for now, and I'll file an issue to figure out a solution later.
cc @quasiben
Thanks @pentschev. I'm good with merging this in as well, and I'll help (as best I can) with the failing dask-cuda test.
I'm ok with that. The failed test doesn't seem to be related, so it's good to merge from my side.
Thanks again @pentschev!
Thanks everyone for the reviews!
* Always use index 0 to get NVML GPU handle
* Move SystemMonitor's GPU initialization back to constructor
* Use nvmlDeviceGetHandleByIndex directly
* Remove redundant nvmlInit check
* Revert "Use nvmlDeviceGetHandleByIndex directly" (reverts commit ddf9a43)
* Revert "Always use index 0 to get NVML GPU handle" (reverts commit d860e58)
After recent changes in Distributed, particularly dask/distributed#4866, worker processes will now attempt to get information from PyNVML based on the index specified in `CUDA_VISIBLE_DEVICES`. Some of our tests purposely use device numbers that may not exist in some systems (e.g., gpuCI, where only single-GPU is supported) to ensure the `CUDA_VISIBLE_DEVICES` of each worker indeed respects the ordering of `dask_cuda.utils.cuda_visible_devices`. The changes here introduce a new `MockWorker` class that monkey-patches the NVML usage of `distributed.Worker`, which can then be used to return those tests to a working state.

Authors:
- Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
- Benjamin Zaitlen (https://github.com/quasiben)

URL: #638
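A hedged sketch of the `MockWorker` idea described above (assumed names and return shapes; not the actual dask-cuda implementation): subclass `distributed.Worker` and stub out the NVML helpers it relies on, so tests can use `CUDA_VISIBLE_DEVICES` values that don't exist on the CI machine.

```python
from distributed import Worker
from distributed.diagnostics import nvml


class MockWorker(Worker):
    """Worker whose GPU metrics come from stubbed NVML helpers, not real devices."""

    def __init__(self, *args, **kwargs):
        # Replace the module-level helpers used by the system monitor with stubs.
        self._originals = (nvml.one_time, nvml.real_time)
        nvml.one_time = lambda: {"name": "mock-gpu", "memory-total": 0}
        nvml.real_time = lambda: {"utilization": 0, "memory-used": 0}
        super().__init__(*args, **kwargs)

    async def close(self, *args, **kwargs):
        # Restore the real helpers when the worker shuts down.
        nvml.one_time, nvml.real_time = self._originals
        await super().close(*args, **kwargs)
```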