Limit GPU metrics to visible devices only #3810
Conversation
Running the tests locally shows a problem with the global handle. If I run each test individually it passes; however, if I run them all together, every test from the second onwards fails. This is likely because the global variable is not being reset between tests.
Thanks for working on this @jacobtomlinson. One thing I wanted to point out is that the global handle will probably cause issues between tests. You might consider adding something to check whether initialization has already happened, similar to what is done in distributed/distributed/comm/ucx.py, lines 49 to 61 at 1bcbaee.
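For reference, a minimal sketch of the kind of guard being suggested: lazy NVML initialization behind a module-level global, plus a hypothetical reset helper so tests don't reuse state from a previous run. The names are illustrative, not the actual distributed code.

```python
import pynvml

handles = None  # module-level cache, initialized lazily


def _pynvml_handles():
    global handles
    if handles is None:
        pynvml.nvmlInit()
        count = pynvml.nvmlDeviceGetCount()
        handles = [pynvml.nvmlDeviceGetHandleByIndex(i) for i in range(count)]
    return handles


def _reset_handles_for_tests():
    # Hypothetical test helper: clear the cached handles so each test
    # re-initializes NVML instead of reusing state from a previous test.
    global handles
    if handles is not None:
        handles = None
        pynvml.nvmlShutdown()
```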
Checking in. What is the status here?
Still on my radar. I ended up going quite deep down the pynvml rabbit hole here, mainly trying to see if I could do this in a neat way without globals.
@@ -8,7 +9,16 @@ def _pynvml_handles():
     if handles is None:
         pynvml.nvmlInit()
         count = pynvml.nvmlDeviceGetCount()
-        handles = [pynvml.nvmlDeviceGetHandleByIndex(i) for i in range(count)]
+        cuda_visible_devices = [
Since dask-cuda just reorders the devices to change the CUDA device enumeration, this is still getting all the devices, just like NVML does, right?
I'm not sure that I understand this comment. My understanding is that pynvml doesn't respect the CUDA_VISIBLE_DEVICES environment variable, and so we need to handle this manually.
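To make the manual handling concrete, here is a sketch of filtering NVML handles down to CUDA_VISIBLE_DEVICES. The function name and exact parsing are assumptions for illustration, not the code in this PR.

```python
import os

import pynvml


def _visible_handles():
    # Sketch only: NVML enumerates every physical GPU, so filter its handles
    # down to the integer indices listed in CUDA_VISIBLE_DEVICES, or keep
    # them all when the variable is unset.
    pynvml.nvmlInit()
    count = pynvml.nvmlDeviceGetCount()
    env = os.environ.get("CUDA_VISIBLE_DEVICES")
    if env is None:
        visible = list(range(count))
    else:
        # Skip entries that don't refer to an existing device, e.g. "8" on
        # a machine with fewer GPUs (assumes integer indices, not UUIDs).
        visible = [int(i) for i in env.split(",") if i and int(i) < count]
    return [pynvml.nvmlDeviceGetHandleByIndex(i) for i in visible]
```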
It looks like this is mostly done, but maybe has some small issues with testing. @quasiben is there anyone else on your team who would want to take this over? It seems like an easy thing for @pentschev perhaps?
I pushed a small fix to the code which addresses the testing issue. Once this is merged in we can add it to the ucx-py CI or dask-cuda. @pentschev do you have time to review?
@mrocklin Thanks for the reply. From what I understand, if I have 2 GPUs, the first worker will have CUDA_VISIBLE_DEVICES=0,1 and the second worker 1,0.
Sorry for the delay here. The changes look good generally, and the results also look good to me:
# 1 GPU
$ CUDA_VISIBLE_DEVICES=0 python -c "from distributed.diagnostics import nvml; print(nvml.one_time())"
{'memory-total': [34089730048], 'name': ['Tesla V100-SXM2-32GB']}
# 2 GPUs
$ CUDA_VISIBLE_DEVICES=0,2 python -c "from distributed.diagnostics import nvml; print(nvml.one_time())"
{'memory-total': [34089730048, 34089730048], 'name': ['Tesla V100-SXM2-32GB', 'Tesla V100-SXM2-32GB']}
# 2 GPUs (GPU 8 doesn't exist, we still get only 1 output as expected)
$ CUDA_VISIBLE_DEVICES=0,8 python -c "from distributed.diagnostics import nvml; print(nvml.one_time())"
{'memory-total': [34089730048], 'name': ['Tesla V100-SXM2-32GB']}
# 9 GPUs (GPU 8 doesn't exist, we still get only 8 outputs as expected)
$ CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8 python -c "from distributed.diagnostics import nvml; print(nvml.one_time())"
{'memory-total': [34089730048, 34089730048, 34089730048, 34089730048, 34089730048, 34089730048, 34089730048, 34089730048], 'name': ['Tesla V100-SXM2-32GB', 'Tesla V100-SXM2-32GB', 'Tesla V100-SXM2-32GB', 'Tesla V100-SXM2-32GB', 'Tesla V100-SXM2-32GB', 'Tesla V100-SXM2-32GB', 'Tesla V100-SXM2-32GB', 'Tesla V100-SXM2-32GB']}
# 1 non-existing GPU
$ CUDA_VISIBLE_DEVICES=8 python -c "from distributed.diagnostics import nvml; print(nvml.one_time())"
{'memory-total': [], 'name': []}
Thanks @jacobtomlinson and @quasiben for the work here.
@trivialfis if you have a minute, can you test out this PR? I think we are ready to merge but it would be good to hear from you before doing that.
I have 2 GPUs. I installed this patch, called the diagnostics, and printed the result (output omitted).
It seems that the visual diagnostics are reporting all devices in each process, but each process should only be reporting GPU 0 (relative to CUDA_VISIBLE_DEVICES).
I wanted to help fix it, but didn't know how to map each worker to its GPU.
Speaking of dask-cuda exclusively, each worker is already mapped to a single GPU. Internally, the process will always use GPU 0, relative to CUDA_VISIBLE_DEVICES. I must admit I have no clue where the code for displaying that even lives, but I think it's doing what I wrote above.
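This is roughly the device assignment scheme being described: each worker's CUDA_VISIBLE_DEVICES is a rotation of the full device list, so "GPU 0" inside a worker process always refers to that worker's own GPU. The sketch below assumes that rotation scheme and is not dask-cuda's actual code.

```python
def rotated_visible_devices(worker_index, n_gpus):
    # Worker i sees the device list rotated so its own GPU comes first.
    devices = list(range(worker_index, n_gpus)) + list(range(worker_index))
    return ",".join(str(d) for d in devices)

# With 2 GPUs:
#   rotated_visible_devices(0, 2) -> "0,1"
#   rotated_visible_devices(1, 2) -> "1,0"
```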
I'm not familiar with the internals of worker resource management, so I came up with this workaround. Since the problem is that NVML doesn't respect the CUDA parameters, we need to map the CUDA device index to the NVML device index.
@trivialfis can you pull the latest and test again (e020434)? I am still not entirely clear what is happening here, but it looks like all GPU data for all workers is stored in each worker -- maybe something in the scheduler update is not quite right? For now, I've made a small change to how we iterate through the GPU info for the display. For anyone interested, the logic is here: distributed/distributed/dashboard/components/nvml.py, lines 136 to 156 at 111029d.
Let me try that later tonight.
Hi, sorry, I don't think I can get back to this today. Feel free to ignore me as I will be OOTO for a few more days.
info["memory-total"], | ||
) | ||
): | ||
# find which GPU maps to which process | ||
if ws.pid not in procs: |
I think I resolved the logic issues in the diagnostics display by adding in process information from each GPU. What we had before was that each worker on a node would collect all the data from the GPU(s) (if multiple were available) and pass it back to this for-loop, so we were displaying n^2 the amount of data:
Node 1-worker 1 -> GPU0->8
Node 1-worker 2 -> GPU0->7
....
Node 2-worker 1 -> GPU0->7
Node 2-worker 2 -> GPU0->7
....
This is not ideal, but by collecting process information from the GPU and passing it back to the dashboard logic, we can now appropriately match up the PID of the worker with the PIDs reported by the GPU and only display a GPU when the PIDs match.
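A sketch of that PID matching, assuming pynvml's per-device process query; the function name and return format are illustrative, not the PR's exact code.

```python
import pynvml


def gpu_info_for_worker(handle, worker_pid):
    # Report this GPU only if the worker's PID appears among the compute
    # processes NVML sees on the device; otherwise the GPU belongs to a
    # different worker on the same node.
    procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
    if worker_pid not in {p.pid for p in procs}:
        return None
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    return {
        "memory-used": mem.used,
        "memory-total": mem.total,
        "utilization": pynvml.nvmlDeviceGetUtilizationRates(handle).gpu,
    }
```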
@pentschev do you think this is the best way to gather unique worker processes per GPU?
Why can't you do what I said in #3810 (comment)? There's only one relevant GPU per process -- the first GPU in CUDA_VISIBLE_DEVICES. We would need something equivalent to:

import os
import pynvml

def get_used_memory():
    gpu_idx = int(os.environ.get("CUDA_VISIBLE_DEVICES", "").split(",")[0])
    handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_idx)
    return pynvml.nvmlDeviceGetMemoryInfo(handle).used

used_memory_per_gpu = client.run(get_used_memory)

In other words, we shouldn't be capturing data for all GPUs in CUDA_VISIBLE_DEVICES and reporting them all for each worker, but only the first one.
And note that, to answer your question more directly, this may work fine until we have PID collisions, as is bound to happen in sufficiently large clusters with multiple nodes. You'd probably need to match that with a unique identifier for each node to be more resilient; I'm not sure if we have a way to do that in Dask. But this solution is fine with me as well; the only other way it could be made more reliable is to use pynvml.nvmlDeviceGetSerial to match by a GPU's serial number.
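For example, a hypothetical helper that keys a device by its serial number instead of a worker PID; this is a sketch, and nvmlDeviceGetSerial may not be supported on all GPUs.

```python
import pynvml


def device_serial(handle):
    # GPU serial numbers are unique across nodes, so keying dashboard
    # entries by serial avoids PID collisions between machines; depending
    # on the pynvml version the value may come back as bytes.
    serial = pynvml.nvmlDeviceGetSerial(handle)
    return serial.decode() if isinstance(serial, bytes) else serial
```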
What about including an IP address as well?
sorry, forgot to push the change
Attached are plots from the latest changes. I ran this on a cluster with two nodes and 8 workers per node. In the above, I've circled two workers, each on a different box; notice that the IPs differ. @benjha you were recently asking about collecting GPU stats from remote workers and I thought this PR may be helpful for you.
Thanks @quasiben (cc @jakirkham, @pentschev; rapidsai/dask-cuda#36 (comment)). I am trying newer versions of the packages (dask 2.25.0 / dask-labextension 3.0) and the metrics look closer to what is being reported here. I'll update to these.
OK, cool. @quasiben should this be merged in?
If it's ok, I'd like @pentschev to weigh in on one last question.
We can't guarantee that the user will only want one GPU per process. This is true for dask-cuda situations, but may not be universal.
Can you cite one example where this is the case today? Coming up with a universal solution when we don't have an established set of ways one can use Dask with GPUs is rather unlikely. For example, we could theoretically have someone running multiple Dask workers accessing the same GPU; that's also true for multiple workers as threads. Do we report one GPU, or each worker as if it had an exclusive GPU even though they are all using the same one?
Yeah, agreed that if a universal solution isn't possible then what you propose would probably work well in most cases that we see today. What I had in mind was people using Dask workers to drive other multi-GPU computations, like PyTorch or TensorFlow. My hope/guess was that Ben's solution was general, but I don't have as much context here as you all do.
Yeah, I would suggest waiting until some users step forward asking for additional functionality before trying to design beyond known use cases. We would want them engaged in the design process to make sure we are solving problems they care about.
Thanks @pentschev. In the last commit, Dask is now sending one NVML data point and one NVML handle per worker.
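A rough sketch of what one handle and one data point per worker could look like, assuming the handle is taken from the first entry of CUDA_VISIBLE_DEVICES as discussed above; the names are illustrative, not the exact code in the commit.

```python
import os

import pynvml

_handle = None


def _worker_handle():
    # One NVML handle per worker process: the first device listed in
    # CUDA_VISIBLE_DEVICES, or device 0 when the variable is unset.
    global _handle
    if _handle is None:
        pynvml.nvmlInit()
        first = (os.environ.get("CUDA_VISIBLE_DEVICES") or "0").split(",")[0]
        _handle = pynvml.nvmlDeviceGetHandleByIndex(int(first))
    return _handle


def worker_gpu_metrics():
    # One data point per worker, suitable for sending with the heartbeat.
    h = _worker_handle()
    return {
        "utilization": pynvml.nvmlDeviceGetUtilizationRates(h).gpu,
        "memory-used": pynvml.nvmlDeviceGetMemoryInfo(h).used,
    }
```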
Seems reasonable to me @quasiben. I didn't test it though; I'm going to trust that it's working in your tests. :)
Removed the accidentally added file, fixed the test, and added a check for multiple GPUs.
LGTM, I'm fine with merging this as is. Thanks @quasiben!
I'll wait for CI to pass here, then merge. Afterwards, I'll add this to dask-cuda for GPU CI.
Staging tests here: rapidsai/dask-cuda#408
* Limit GPU metrics to visible devices only
* Move importorskip
* init nvmlInit once rather than handles
* ws object has data for all workers
* match pid from worker process with process from GPUs
* send nvml data per worker/gpu not all gpus
* remove notebook

Co-authored-by: Benjamin Zaitlen <quasiben@gmail.com>
Thanks for pushing this through everyone!
It looks like the GPU diagnostics read from all GPUs on the system. When starting a LocalCUDACluster, one worker is started per GPU, which means we are seeing duplication in the metrics as all workers report totals for all GPUs.

This PR restricts the nvml diagnostics to only read data on the GPUs specified in CUDA_VISIBLE_DEVICES (or all GPUs if this is not set).

Fixes #3808.