
Move SystemMonitor's GPU initialization back to constructor #4866

Merged

merged 6 commits into dask:main from pentschev:fix-nvml-system-monitor on Jun 3, 2021

Conversation

@pentschev (Member) commented Jun 1, 2021

@pentschev (Member, Author)

cc @charlesbluca @quasiben

@charlesbluca (Member) left a comment

Thanks for doing this @pentschev 😄

```diff
-    cuda_visible_devices = list(range(count))
-    gpu_idx = cuda_visible_devices[0]
-    return pynvml.nvmlDeviceGetHandleByIndex(gpu_idx)
+    return pynvml.nvmlDeviceGetHandleByIndex(0)
```
Member

Nitpicky, but at this point could we just call pynvml.nvmlDeviceGetHandleByIndex(0) in the places where we used to call nvml._pynvml_handles()?

Member Author

Done in ddf9a43

```diff
@@ -92,10 +93,6 @@ def update(self):
         # give external modules (like dask-cuda) a chance to initialize CUDA context
         if nvml is not None and nvml.nvmlInit is not None:
```
@charlesbluca (Member) commented Jun 1, 2021

I think the nvml.nvmlInit check here is redundant now, though it shouldn't cause any problems to leave it.

Member Author

It's not redundant; this is what I mentioned earlier, when I also thought it was. It refers to the nvmlInit object in the linked code, not to the pynvml.nvmlInit method. I think that's a confusing naming choice nevertheless, but I won't touch it right now.

Member

I agree with that; I'm referring to the fact that by the time we call update(), we will have also called nvml.one_time(), meaning that nvml.nvmlInit will never be None when nvml is not None.

Member Author

Ah sorry, you're right, good catch. I've updated that in 79b315b.
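
For reference, a minimal sketch of the simplified guard after 79b315b, assuming it sits inside SystemMonitor.update(); the accessor name below is an assumption, not necessarily the exact call:

```python
# Sketch only: nvml.one_time() has already run by the time update() is
# called, so checking the module reference alone is sufficient here.
if nvml is not None:
    gpu_metrics = nvml.real_time()  # assumed accessor; the real call may differ
```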

@jrbourbeau (Member) left a comment

Thanks @pentschev! Is there a regression test we should add here or in dask-cuda?

@pentschev (Member, Author)

@jrbourbeau since it requires PyNVML and a GPU, unfortunately we can't test it in Distributed right now, but @charlesbluca is working on testing that in Dask-CUDA in rapidsai/dask-cuda#635.

@quasiben (Member) commented Jun 2, 2021

I think there is still an issue here: the changes around pynvml.nvmlDeviceGetHandleByIndex, I think, need to be reverted. In #3810 we saw that pynvml doesn't respect CUDA_VISIBLE_DEVICES, so we get incorrect reporting. Here's a small reproducer of the issue:

```
In [1]: from dask.distributed import Client, fire_and_forget, wait
   ...: from dask_cuda import LocalCUDACluster
   ...: from dask.utils import parse_bytes
   ...: import dask

In [2]: cluster = LocalCUDACluster()

In [3]: client = Client(cluster)

In [4]: import rmm

In [5]: rmm.reinitialize(pool_allocator=1e9)  # create data on the client/GPU 0

In [6]: for w in cluster.scheduler.workers:
    ...:     print(cluster.scheduler.workers[w].metrics['gpu_memory_used'])
17728536576
17728536576
17728536576
17728536576
17728536576
17728536576
17728536576
17728536576
17728536576
17728536576
17728536576
17728536576
17728536576
17728536576
17728536576
17728536576
```

In the above, we should see only one GPU with a large allocation.
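
For illustration, here is a minimal sketch of the CUDA_VISIBLE_DEVICES-aware lookup that the revert restores; the function names are illustrative, not distributed's exact code:

```python
import os

import pynvml


def visible_device_handle():
    # NVML enumerates physical devices and ignores CUDA_VISIBLE_DEVICES, so
    # map the worker's first visible device back to its physical index.
    pynvml.nvmlInit()
    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    # Assumes integer indices in CUDA_VISIBLE_DEVICES (UUIDs not handled here).
    idx = int(visible.split(",")[0]) if visible else 0
    return pynvml.nvmlDeviceGetHandleByIndex(idx)


def gpu_memory_used():
    # With a fixed index 0 instead, every worker would report GPU 0's usage,
    # which is exactly the behavior seen in the reproducer above.
    return pynvml.nvmlDeviceGetMemoryInfo(visible_device_handle()).used
```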

@jrbourbeau (Member)

Thanks for pointing me to rapidsai/dask-cuda#635 @pentschev -- that's what I was looking for. I knew dask-cuda ran some subset of the distributed tests; I just wanted to make sure something for this issue was included, which it looks like rapidsai/dask-cuda#635 is handling.

@pentschev (Member, Author)

> I think there is still an issue here: the changes around pynvml.nvmlDeviceGetHandleByIndex, I think, need to be reverted.

You're right. I've reverted the changes now. However, this breaks https://github.com/rapidsai/dask-cuda/blob/81bbc6f85575826b13b3fb45894b54135514e668/dask_cuda/tests/test_dask_cuda_worker.py#L21-L59, a test that verifies CUDA_VISIBLE_DEVICES behavior even on a single-GPU setup (e.g., gpuCI); it now fails because we try to address a GPU index beyond the existing ones. I'm still trying to think of a way to fix that.
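
To illustrate the failure mode (hypothetical single-GPU session, not the test's actual output): a worker launched with CUDA_VISIBLE_DEVICES="3" now asks NVML for physical index 3, which exists only on hosts with at least four GPUs:

```python
import pynvml

pynvml.nvmlInit()
try:
    # On a single-GPU host only index 0 exists, so this lookup raises.
    pynvml.nvmlDeviceGetHandleByIndex(3)
except pynvml.NVMLError as exc:
    print(f"NVML handle lookup failed: {exc}")
```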

@pentschev (Member, Author)

Alright, this will break the test I mentioned above, but there's not much we can do right now to prevent that without adding considerable complexity to Distributed or Dask-CUDA. I say we merge this as is and then xfail those tests in Dask-CUDA for now; I'll file an issue to figure out a solution later.

@jrbourbeau (Member)

cc @quasiben

@quasiben (Member) commented Jun 3, 2021

Thanks @pentschev. I'm good with merging this in as well, and I'll help (as best I can) with the failing dask-cuda test.

@pentschev (Member, Author)

I'm ok with that. The failed test doesn't seem to be related, so it's good to merge from my side.

@quasiben (Member) commented Jun 3, 2021

Thanks again @pentschev !

@quasiben merged commit 1754b48 into dask:main on Jun 3, 2021
@pentschev (Member, Author)

Thanks everyone for the reviews!

douglasdavis pushed a commit to douglasdavis/distributed that referenced this pull request Jun 8, 2021
* Always use index 0 to get NVML GPU handle

* Move SystemMonitor's GPU initialization back to constructor

* Use nvmlDeviceGetHandleByIndex directly

* Remove redundant nvmlInit check

* Revert "Use nvmlDeviceGetHandleByIndex directly"

This reverts commit ddf9a43.

* Revert "Always use index 0 to get NVML GPU handle"

This reverts commit d860e58.
rapids-bot bot pushed a commit to rapidsai/dask-cuda that referenced this pull request Jun 8, 2021
After recent changes in Distributed, particularly dask/distributed#4866, worker processes will now attempt to get information from PyNVML based on the index specified in `CUDA_VISIBLE_DEVICES`. Some of our tests purposely use device numbers that may not exist on some systems (e.g., gpuCI, where only a single GPU is supported) to ensure the `CUDA_VISIBLE_DEVICES` of each worker indeed respects the ordering of `dask_cuda.utils.cuda_visible_devices`. The changes here introduce a new `MockWorker` class that monkey-patches the NVML usage of `distributed.Worker`, which can then be used to return those tests to a working state.

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - Benjamin Zaitlen (https://github.com/quasiben)

URL: #638
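
A minimal sketch of the monkey-patching idea described above, assuming the patch target; the real MockWorker in dask_cuda may differ:

```python
from unittest.mock import patch

from distributed import Worker


class MockWorker(Worker):
    """Worker that stubs out distributed's NVML hooks so tests can set
    CUDA_VISIBLE_DEVICES to device numbers the host doesn't actually have."""

    def __init__(self, *args, **kwargs):
        # Pretend NVML is unavailable so no real GPU lookup happens
        # (the patch target here is an assumption for illustration).
        self._nvml_patch = patch("distributed.worker.nvml", None)
        self._nvml_patch.start()
        super().__init__(*args, **kwargs)

    async def close(self, *args, **kwargs):
        self._nvml_patch.stop()
        await super().close(*args, **kwargs)
```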
@pentschev deleted the fix-nvml-system-monitor branch on June 30, 2021