Limit GPU metrics to visible devices only #3810
Changes from 6 commits
distributed/diagnostics/nvml.py (hunk @@ -1,28 +1,60 @@), updated version as of this commit:

```python
import os
import pynvml

handles = None
nvmlInit = None


def init_once():
    global nvmlInit
    if nvmlInit is not None:
        return

    from pynvml import nvmlInit as _nvmlInit

    nvmlInit = _nvmlInit
    nvmlInit()


def _pynvml_handles():
    global handles
    if handles is None:
        count = pynvml.nvmlDeviceGetCount()
        try:
            cuda_visible_devices = [
                int(idx) for idx in os.environ.get("CUDA_VISIBLE_DEVICES", "").split(",")
            ]
        except ValueError:
            # CUDA_VISIBLE_DEVICES is not set
            cuda_visible_devices = False
        if not cuda_visible_devices:
            cuda_visible_devices = list(range(count))
        handles = [
            pynvml.nvmlDeviceGetHandleByIndex(i)
            for i in range(count)
            if i in cuda_visible_devices
        ]
    return handles


def real_time():
    init_once()
    handles = _pynvml_handles()
    return {
        "utilization": [pynvml.nvmlDeviceGetUtilizationRates(h).gpu for h in handles],
        "memory-used": [pynvml.nvmlDeviceGetMemoryInfo(h).used for h in handles],
        "procs": [
            [p.pid for p in pynvml.nvmlDeviceGetComputeRunningProcesses(h)]
            for h in handles
        ],
    }


def one_time():
    init_once()
    handles = _pynvml_handles()
    return {
        "memory-total": [pynvml.nvmlDeviceGetMemoryInfo(h).total for h in handles],
        "name": [pynvml.nvmlDeviceGetName(h).decode() for h in handles],
        "procs": [
            [p.pid for p in pynvml.nvmlDeviceGetComputeRunningProcesses(h)]
            for h in handles
        ],
    }
```

Review thread on the `CUDA_VISIBLE_DEVICES` parsing in `_pynvml_handles`:

Comment: As dask-cuda just reorders the devices to change the CUDA device enumeration, this is still getting all the devices, like what NVML does, right?

Reply: I'm not sure that I understand this comment. My understanding is that
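To illustrate the behavior under discussion, here is a standalone sketch of the filtering done in `_pynvml_handles` (the helper name `_visible_device_indices` is hypothetical and only mirrors the parsing above). With a dask-cuda-style reordering of `CUDA_VISIBLE_DEVICES`, the parsed list still contains every device, just enumerated in a different order:

```python
import os


def _visible_device_indices(count):
    # Hypothetical helper mirroring _pynvml_handles: read CUDA_VISIBLE_DEVICES
    # and fall back to every device index when the variable is unset or empty.
    try:
        visible = [
            int(idx) for idx in os.environ.get("CUDA_VISIBLE_DEVICES", "").split(",")
        ]
    except ValueError:
        visible = False
    if not visible:
        visible = list(range(count))
    return [i for i in range(count) if i in visible]


os.environ["CUDA_VISIBLE_DEVICES"] = "2,3,0,1"  # dask-cuda-style reordering
print(_visible_device_indices(4))               # [0, 1, 2, 3] -- still all devices
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
print(_visible_device_indices(4))               # [1]
del os.environ["CUDA_VISIBLE_DEVICES"]
print(_visible_device_indices(4))               # [0, 1, 2, 3]
```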
New test file (hunk @@ -0,0 +1,26 @@):

```python
import pytest
import os

pytest.importorskip("pynvml")

from distributed.diagnostics import nvml


def test_one_time():
    output = nvml.one_time()
    assert "memory-total" in output
    assert "name" in output

    assert len(output["name"]) > 0


def test_1_visible_devices():
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"
    output = nvml.one_time()
    assert len(output["memory-total"]) == 1


def test_2_visible_devices():
    os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
    output = nvml.one_time()
    assert len(output["memory-total"]) == 2
```
I think I resolved the logic issues in the diagnostics display by adding in process information from each GPU. What we had before: each worker on a node would collect all the data from the GPU(s) (if multiple were available) and pass it back to this for-loop, so we were displaying n^2 the amount of data:

Node 1-worker 1 -> GPU0->8
Node 1-worker 2 -> GPU0->7
....
Node 2-worker 1 -> GPU0->7
Node 2-worker 2 -> GPU0->7
....

This is not ideal, but by collecting process information from the GPU and passing it back to the dashboard logic, we can now appropriately match up the PID of the worker with the PIDs reported by the GPU, and only display a GPU for a worker when the PIDs match.
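A rough sketch of that matching, assuming the dashboard knows each worker's PID and receives the per-GPU "procs" lists from `one_time()` / `real_time()` (the helper below is hypothetical, not the actual dashboard code):

```python
def gpus_for_worker(worker_pid, gpu_procs):
    # gpu_procs is the "procs" entry reported above: one list of PIDs per
    # visible GPU.  Keep only the GPUs whose process list contains this
    # worker's PID, so each worker is shown against its own GPU(s) only.
    return [
        gpu_index
        for gpu_index, pids in enumerate(gpu_procs)
        if worker_pid in pids
    ]


# Example: two GPUs, worker 4242 only has a compute context on the second one.
print(gpus_for_worker(4242, [[1111, 2222], [4242]]))  # [1]
```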
@pentschev do you think this is the best way to gather unique worker processes per GPU?
Why can't you do what I said in #3810 (comment)? There's only one relevant GPU per process -- the first GPU in CUDA_VISIBLE_DEVICES. We would need something equivalent to that. In other words, we shouldn't be capturing data for all GPUs in CUDA_VISIBLE_DEVICES and reporting them all for each worker, but only the first one.
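A hedged sketch of that "first visible GPU only" approach (illustrative only; the reviewer's original snippet is not shown in this view, and the helper name is hypothetical):

```python
import os

import pynvml


def _pynvml_handle_for_this_worker():
    # Hypothetical helper: each worker reports on a single GPU, namely the
    # first device listed in CUDA_VISIBLE_DEVICES (falling back to device 0
    # when the variable is unset or empty).
    pynvml.nvmlInit()
    visible = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    try:
        gpu_index = int(visible.split(",")[0])
    except ValueError:
        gpu_index = 0
    return pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
```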
And note that, to answer your question more directly, this may work fine until we have PID collisions, as is bound to happen in sufficiently large clusters with multiple nodes. You'd probably need to match that with a unique identifier for each node to be more resilient; not sure if we have a way for that in Dask. But this solution is fine with me as well; the only other way it could be more reliable is to use pynvml.nvmlDeviceGetSerial to match by a GPU's serial number.
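For illustration, a hedged sketch of keying the reported processes on (GPU serial, PID) instead of PID alone (hypothetical helper; nvmlDeviceGetSerial is only supported on boards with an infoROM, such as Tesla-class GPUs):

```python
import pynvml


def gpu_proc_keys(handle):
    # Pair each compute PID on this device with the device's serial number,
    # so identical PIDs on different nodes map to distinct keys.
    serial = pynvml.nvmlDeviceGetSerial(handle)
    return [
        (serial, p.pid)
        for p in pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
    ]
```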
What about including an IP address as well?
Sorry, forgot to push the change.