Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move SystemMonitor's GPU initialization back to constructor #4866

Merged
merged 6 commits into from
Jun 3, 2021
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
15 changes: 1 addition & 14 deletions distributed/diagnostics/nvml.py
Original file line number Diff line number Diff line change
@@ -1,5 +1,3 @@
import os

import pynvml

nvmlInit = None
Expand All @@ -17,18 +15,7 @@ def init_once():


def _pynvml_handles():
count = pynvml.nvmlDeviceGetCount()
try:
cuda_visible_devices = [
int(idx) for idx in os.environ.get("CUDA_VISIBLE_DEVICES", "").split(",")
]
except ValueError:
# CUDA_VISIBLE_DEVICES is not set
cuda_visible_devices = False
if not cuda_visible_devices:
cuda_visible_devices = list(range(count))
gpu_idx = cuda_visible_devices[0]
return pynvml.nvmlDeviceGetHandleByIndex(gpu_idx)
return pynvml.nvmlDeviceGetHandleByIndex(0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpicky but at this point could we just call pynvml.nvmlDeviceGetHandleByIndex(0) in the places where we used to call nvml._pynvml_handles()?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in ddf9a43



def real_time():
Expand Down
9 changes: 3 additions & 6 deletions distributed/system_monitor.py
Original file line number Diff line number Diff line change
Expand Up @@ -40,8 +40,9 @@ def __init__(self, n=10000):
self.quantities["num_fds"] = self.num_fds

if nvml is not None:
self.gpu_name = None
self.gpu_memory_total = None
gpu_extra = nvml.one_time()
self.gpu_name = gpu_extra["name"]
self.gpu_memory_total = gpu_extra["memory-total"]
self.gpu_utilization = deque(maxlen=n)
self.gpu_memory_used = deque(maxlen=n)
self.quantities["gpu_utilization"] = self.gpu_utilization
Expand Down Expand Up @@ -92,10 +93,6 @@ def update(self):

# give external modules (like dask-cuda) a chance to initialize CUDA context
if nvml is not None and nvml.nvmlInit is not None:
Copy link
Member

@charlesbluca charlesbluca Jun 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the nvml.nvmlInit check here is redundant now, though it shouldn't cause any problems to leave it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not redundant, this is what I mentioned earlier when I also thought it was. It refers to the object in

and not to the pynvml.nvmlInit method. I think that's a confusing naming choice nevertheless but I won't touch it right now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with that; I'm referring to the fact that by the time we call update(), we will have also called nvml.one_time(), meaning that nvml.nvmlInit will always not be None if nvml is not None.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sorry, you're right, good catch. I've updated that in 79b315b .

if self.gpu_name is None:
gpu_extra = nvml.one_time()
self.gpu_name = gpu_extra["name"]
self.gpu_memory_total = gpu_extra["memory-total"]
gpu_metrics = nvml.real_time()
self.gpu_utilization.append(gpu_metrics["utilization"])
self.gpu_memory_used.append(gpu_metrics["memory-used"])
Expand Down