
Limit GPU metrics to visible devices only #3810

Merged: 8 commits into dask:master on Oct 6, 2020

Conversation

@jacobtomlinson (Member) commented May 18, 2020

It looks like the GPU diagnostics read from all GPUs on the system.

When starting a LocalCUDACluster one worker is started per GPU, which means we are seeing duplication in the metrics as all workers report totals for all GPUs.

This PR restricts the NVML diagnostics to read data only from the GPUs specified in CUDA_VISIBLE_DEVICES (or all GPUs if it is not set).

Fixes #3808.
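
For context, a minimal sketch of the idea (not the exact code merged in this PR; the helper name visible_gpu_handles is illustrative, and integer device indices in CUDA_VISIBLE_DEVICES are assumed):

import os

import pynvml


def visible_gpu_handles():
    """Return NVML handles only for the GPUs listed in CUDA_VISIBLE_DEVICES."""
    pynvml.nvmlInit()
    count = pynvml.nvmlDeviceGetCount()
    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    if visible is None:
        indices = list(range(count))  # unset: report every GPU
    else:
        indices = [int(i) for i in visible.split(",") if i.strip()]
    # skip entries that don't correspond to a real device (e.g. "8" on an 8-GPU box)
    return [pynvml.nvmlDeviceGetHandleByIndex(i) for i in indices if i < count]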

@jacobtomlinson (Member Author)

Running the tests locally shows that the global handles property may be causing some issues.

If I run each test individually it passes. However if I run them all together the second test onwards fails. This is likely because of the global variable not being reset between tests.

@jacobtomlinson (Member Author)

Also the dashboard is still showing the incorrect metric.

[dashboard screenshot]

There are 8 32GB GPUs, so it should show 256GB, but it shows 8 x 8 x 32GB = 2TB.

@quasiben (Member)

Thanks for working on this @jacobtomlinson. One thing I wanted to point out is that the global handle will probably cause issues between tests. You might consider checking whether nvmlInit() has been called, rather than caching handles. UCX does something like this:

def init_once():
    global ucp, host_array, device_array
    if ucp is not None:
        return

    import ucp as _ucp

    ucp = _ucp
    # remove/process dask.ucx flags for valid ucx options
    ucx_config = _scrub_ucx_config()
    ucp.init(options=ucx_config, env_takes_precedence=True)
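
A minimal sketch of that suggestion applied to NVML (the module-level flag and function names are hypothetical, not the code that ended up in this PR):

import pynvml

nvml_initialized = False  # module-level flag, analogous to `ucp` above


def init_once():
    global nvml_initialized
    if nvml_initialized:
        return
    pynvml.nvmlInit()
    nvml_initialized = True

Tests could then reset nvml_initialized between runs instead of poking at a cached list of handles.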

@mrocklin (Member)

Checking in. What is the status here?

@jacobtomlinson (Member Author)

Still on my radar. I ended up going quite deep down the pynvml rabbit hole here, mainly trying to see if I could do this in a neat way without globals.

@@ -8,7 +9,16 @@ def _pynvml_handles():
    if handles is None:
        pynvml.nvmlInit()
        count = pynvml.nvmlDeviceGetCount()
        handles = [pynvml.nvmlDeviceGetHandleByIndex(i) for i in range(count)]
        cuda_visible_devices = [

@trivialfis

Since dask-cuda just reorders the devices to change the CUDA device enumeration, this is still getting all the devices, just like NVML does, right?

@mrocklin (Member)


I'm not sure that I understand this comment. My understanding is that pynvml doesn't respect the CUDA_VISIBLE_DEVICES environment variable, and so we need to handle this manually.

@mrocklin (Member)

It looks like this is mostly done, but there may be some small issues with testing. @quasiben, is there anyone else on your team who would want to take this over? It seems like an easy thing for @pentschev perhaps?

@quasiben (Member)

I pushed a small fix to the code which resolves the testing issues. Once this is merged in we can add it to the ucx-py CI or dask-cuda. Something like the following:
https://github.com/rapidsai/ucx-py/blob/branch-0.16/ci/gpu/build.sh

@pentschev do you have time to review?

@trivialfis commented Sep 29, 2020

@mrocklin Thanks for the reply.

I'm not sure that I understand this comment. My understanding is that pynvml doesn't respect the CUDA_VISIBLE_DEVICES environment variable, and so we need to handle this manually.

From what I understand, dask-cuda's LocalCUDACluster manipulates CUDA_VISIBLE_DEVICES by reordering it. Feel free to correct me if I'm wrong:

If I have 2 GPUs, the first worker will have CUDA_VISIBLE_DEVICES=0,1 while the second one will have CUDA_VISIBLE_DEVICES=1,0. So len(handles) is still 2 after this PR. #4117 will continue to report 4 GPUs, and #3808 might be the same?
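
To illustrate the reordering described above (a sketch of the assumed dask-cuda behavior, not its actual code):

def rotated_visible_devices(worker_index, n_gpus):
    """Rotate device indices so each worker sees 'its' GPU first."""
    devices = list(range(n_gpus))
    rotated = devices[worker_index:] + devices[:worker_index]
    return ",".join(str(d) for d in rotated)


print(rotated_visible_devices(0, 2))  # "0,1" for worker 0
print(rotated_visible_devices(1, 2))  # "1,0" for worker 1

Both workers still list both GPUs, so filtering handles by CUDA_VISIBLE_DEVICES alone leaves len(handles) == 2 per worker, which is the duplication being described here.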

@pentschev (Member) left a comment

Sorry for the delay here. The changes look good generally, and the results also look good to me:

# 1 GPU
$ CUDA_VISIBLE_DEVICES=0 python -c "from distributed.diagnostics import nvml; print(nvml.one_time())"
{'memory-total': [34089730048], 'name': ['Tesla V100-SXM2-32GB']}

# 2 GPUs
$ CUDA_VISIBLE_DEVICES=0,2 python -c "from distributed.diagnostics import nvml; print(nvml.one_time())"
{'memory-total': [34089730048, 34089730048], 'name': ['Tesla V100-SXM2-32GB', 'Tesla V100-SXM2-32GB']}

# 2 GPUs requested (GPU 8 doesn't exist, so we still get only 1 output as expected)
$ CUDA_VISIBLE_DEVICES=0,8 python -c "from distributed.diagnostics import nvml; print(nvml.one_time())"
{'memory-total': [34089730048], 'name': ['Tesla V100-SXM2-32GB']}

# 9 GPUs requested (GPU 8 doesn't exist, so we still get only 8 outputs as expected)
$ CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8 python -c "from distributed.diagnostics import nvml; print(nvml.one_time())"
{'memory-total': [34089730048, 34089730048, 34089730048, 34089730048, 34089730048, 34089730048, 34089730048, 34089730048], 'name': ['Tesla V100-SXM2-32GB', 'Tesla V100-SXM2-32GB', 'Tesla V100-SXM2-32GB', 'Tesla V100-SXM2-32GB', 'Tesla V100-SXM2-32GB', 'Tesla V100-SXM2-32GB', 'Tesla V100-SXM2-32GB', 'Tesla V100-SXM2-32GB']}

# 1 non-existing GPU
$ CUDA_VISIBLE_DEVICES=8 python -c "from distributed.diagnostics import nvml; print(nvml.one_time())"
{'memory-total': [], 'name': []}

Thanks @jacobtomlinson and @quasiben for the work here.

@quasiben (Member)

@trivialfis, if you have a minute, can you test out this PR? I think we are ready to merge, but it would be good to hear from you before doing that.

@mrocklin marked this pull request as ready for review on September 30, 2020
@trivialfis commented Sep 30, 2020

I have 2 GPUs. After installing this patch:

[screenshot: Bokeh GPU dashboard]

Calling print('handles:', handles) in real_time():

handles: [<pynvml.nvml.LP_struct_c_nvmlDevice_t object at 0x7f3b5170a440>, <pynvml.nvml.LP_struct_c_nvmlDevice_t object at 0x7f3b5170a540>]
handles: [<pynvml.nvml.LP_struct_c_nvmlDevice_t object at 0x7fb49c092740>, <pynvml.nvml.LP_struct_c_nvmlDevice_t object at 0x7fb49c092840>]
handles: [<pynvml.nvml.LP_struct_c_nvmlDevice_t object at 0x7fb488547840>, <pynvml.nvml.LP_struct_c_nvmlDevice_t object at 0x7fb49c09ff40>]

...

Printing CUDA_VISIBLE_DEVICES in my local process:

CUDA: 0,1
CUDA: 1,0

dask-cuda is installed from the 0.16 branch. Hope that helps.

@pentschev (Member)

It seems that the visual diagnostics are reporting all devices in each process, but each process should only report GPU 0 (relative to CUDA_VISIBLE_DEVICES), at least for dask-cuda.

@trivialfis commented Sep 30, 2020

I wanted to help fix it, but didn't know how to map each worker to its GPU.

@pentschev (Member)

Speaking of dask-cuda exclusively, each worker is already mapped to a single GPU. Internally, the process will always use GPU 0, relative to CUDA_VISIBLE_DEVICES. What I think is happening in this report is that it's getting all the GPUs listed in CUDA_VISIBLE_DEVICES and displaying them all. So in your 2-GPU case, the dashboard reports CUDA_VISIBLE_DEVICES=0,1 and CUDA_VISIBLE_DEVICES=1,0 for workers 0 and 1, respectively, rather than just using the first GPU listed in CUDA_VISIBLE_DEVICES for each of these two workers.

I must admit I have no clue where the code for displaying that even lives, but I think it's doing what I wrote above.

@trivialfis commented Sep 30, 2020

I'm not familiar with the internals of worker resource management, so I came up with this workaround. Since the problem is that NVML doesn't respect the CUDA visible-devices setting, we need to map the CUDA device to an NVML device index (see the sketch after this list):

  1. Get the first device using the CUDA API cudaGetDevice; this is controlled by the visible devices.
  2. Get its UUID by calling the CUDA API cudaGetDeviceProperties; call this worker_uuid.
  3. Enumerate all devices with NVML and record each device index from the enumeration.
  4. List the UUIDs of those enumerated device indices using the NVML API nvmlDeviceGetUUID.
  5. Filter out the NVML device indices whose UUID doesn't match worker_uuid, leaving a single valid NVML device index.
  6. Use that index to get GPU usage from NVML.
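
A rough sketch of those steps (the get_current_cuda_device_uuid argument is a hypothetical stand-in for a cudaGetDevice/cudaGetDeviceProperties call, e.g. via CuPy or Numba; the NVML side uses real pynvml calls):

import pynvml


def nvml_index_for_current_device(get_current_cuda_device_uuid):
    """Return the NVML index of the GPU this process is actually using."""
    worker_uuid = get_current_cuda_device_uuid()  # e.g. "GPU-abcd1234-..."

    pynvml.nvmlInit()
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        uuid = pynvml.nvmlDeviceGetUUID(handle)
        if isinstance(uuid, bytes):  # some pynvml versions return bytes
            uuid = uuid.decode()
        if uuid == worker_uuid:
            return i
    raise RuntimeError("no NVML device matches the current CUDA device")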

@quasiben (Member) commented Oct 1, 2020

@trivialfis, can you pull the latest changes and test again? (e020434)

I am still not entirely clear on what is happening here, but it looks like GPU data for all workers is stored in each worker -- maybe something in the scheduler update is not quite right?

For now, I've made a small change to how we iterate through the GPU info for the display. For anyone interested, the logic is here:

for ws in workers:
    try:
        info = ws.extra["gpu"]
    except KeyError:
        continue
    metrics = ws.metrics["gpu"]
    for j, (u, mem_used, mem_total) in enumerate(
        zip(
            metrics["utilization"],
            metrics["memory-used"],
            info["memory-total"],
        )
    ):
        memory_max = max(memory_max, mem_total)
        memory_total += mem_total
        utilization.append(int(u))
        memory.append(mem_used)
        worker.append(ws.address)
        gpu_index.append(j)
        y.append(i)
        i += 1

@trivialfis commented Oct 1, 2020

Let me try that later tonight.

@trivialfis

Hi, sorry, I don't think I can get back to this today. Feel free to ignore me, as I will be out of the office for a few more days.

info["memory-total"],
)
):
# find which GPU maps to which process
if ws.pid not in procs:
@quasiben (Member)


I think I resolved the logic issues in the diagnostics display by adding in process information from each GPU. What we had before was that each worker on a node would collect data for all the GPUs (if multiple were available) and pass it back to this for-loop, so we were displaying n^2 the amount of data:

Node 1-worker 1 -> GPU0->8
Node 1-worker 2 -> GPU0->7
....
Node 2-worker 1 -> GPU0->7
Node 2-worker 2 -> GPU0->7
....

This is not ideal, but by collecting process information from the GPU and passing it back to the dashboard logic, we can now match up the pid of the worker with the pids reported by the GPU and only display a GPU when the pids match.
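
For context, a minimal sketch of how a worker could gather the pids running on its GPU so the dashboard can match them against ws.pid (nvmlDeviceGetComputeRunningProcesses is a real pynvml call; picking handle index 0 is a simplification):

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # simplified: first GPU only

# pids of compute processes currently running on this GPU
gpu_pids = [p.pid for p in pynvml.nvmlDeviceGetComputeRunningProcesses(handle)]

The dashboard loop can then skip a GPU entry whenever the worker's pid is not in that list, which is what the if ws.pid not in procs check quoted above does.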

@quasiben (Member) commented Oct 2, 2020

@pentschev, do you think this is the best way to gather unique worker processes per GPU?

@pentschev (Member)


Why can't you do what I said in #3810 (comment)? There's only one relevant GPU per process -- the first GPU in CUDA_VISIBLE_DEVICES. We would need something equivalent to:

import os
import pynvml

def get_used_memory():
    # the first entry in CUDA_VISIBLE_DEVICES is the GPU this worker uses
    gpu_idx = int(os.environ.get("CUDA_VISIBLE_DEVICES", "0").split(",")[0])
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_idx)
    return pynvml.nvmlDeviceGetMemoryInfo(handle).used

used_memory_per_gpu = client.run(get_used_memory)

In other words, we shouldn't be capturing data for all GPUs in CUDA_VISIBLE_DEVICES and reporting them all for each worker, but only the first one.

@pentschev (Member)


And note that, to answer your question more directly, this may work fine until we have PID collisions, which are bound to happen in sufficiently large clusters with multiple nodes. You'd probably need to match that with a unique identifier for each node to be more resilient; I'm not sure whether we have a way to do that in Dask. But this solution is fine with me as well; the only other way to make it more reliable would be to use pynvml.nvmlDeviceGetSerial to match by a GPU's serial number.
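
A minimal sketch of the serial-number idea (nvmlDeviceGetSerial is a real pynvml call, though not every board exposes a serial; the key construction is illustrative):

import os

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

serial = pynvml.nvmlDeviceGetSerial(handle)
if isinstance(serial, bytes):  # some pynvml versions return bytes
    serial = serial.decode()

# combining the board serial with the worker pid stays unique even if
# pids collide across nodes
key = (serial, os.getpid())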

Member


What about including an IP address as well?

Member


sorry, forgot to push the change

@quasiben (Member) commented Oct 2, 2020

Attached are plots from the latest changes:

[screenshot: GPU dashboard plots, Oct 1, 2020]

I ran this on a cluster with 8 workers per node on two nodes. In the above, I've circled two workers, each on a different box. Notice one IP is 127.0.0.1 and the other is a remote worker with an IP of 10.33.XXX.XX.

@benjha you were recently asking about collecting GPU stats from remote workers and I thought this PR may be helpful for you.

@benjha commented Oct 2, 2020

@benjha you were recently asking about collecting GPU stats from remote workers and I thought this PR may be helpful for you.

Thanks @quasiben (cc @jakirkham, @pentschev rapidsai/dask-cuda#36 (comment)).

I am trying newer versions of the packages (dask 2.25.0 / dask-labextension 3.0) and the metrics look closer to what is being reported here. I'll update to these.

@mrocklin (Member) commented Oct 2, 2020

OK, cool. @quasiben, should this be merged in?

@quasiben (Member) commented Oct 2, 2020

If it's OK, I'd like @pentschev to weigh in on one last question.

@mrocklin (Member) commented Oct 2, 2020 via email

@pentschev (Member)

We can't guarantee that the user will only want one gpu per process. This is true for dask-cuda situations, but may not be universal.

Can you cite one example where this is the case today? Coming up with a universal solution when we don't have an established set of ways to use Dask with GPUs is rather unlikely. For example, we could theoretically have someone running multiple Dask workers accessing the same GPU; that's also true for multiple workers as threads. Do we report one GPU, or each worker as if it had an exclusive GPU even though they are sharing the same one?

@mrocklin (Member) commented Oct 2, 2020 via email

@jakirkham (Member)

Yeah I would suggest waiting until some users step forward asking for additional functionality to be handled before trying to design beyond known use cases. We would want them engaged in the design process to make sure we are solving problems they care about.

@quasiben (Member) commented Oct 2, 2020

Thanks @pentschev. In the last commit, Dask is now sending one NVML data point and one NVML handle per worker.

@pentschev (Member) left a comment

Seems reasonable to me @quasiben. I didn't test it myself, though; I'm going to trust that it's working in your tests. :)

Review comment on distributed/dashboard/components/Untitled.ipynb (outdated, resolved)
@quasiben (Member) commented Oct 6, 2020

Removed the accidental file, fixed the test, and added a check for multiple GPUs

@pentschev (Member)

LGTM, I'm fine with merging this as is. Thanks @quasiben!

@quasiben (Member) commented Oct 6, 2020

I'll wait for CI to pass here, then merge. After that, I'll add this to dask-cuda for GPU CI.

@quasiben merged commit 6efd5b6 into dask:master on Oct 6, 2020
@quasiben (Member) commented Oct 6, 2020

Staging tests here: rapidsai/dask-cuda#408

lr4d pushed a commit to lr4d/distributed that referenced this pull request Oct 9, 2020
* Limit GPU metrics to visible devices only

* Move importorskip

* init nvmInit once rather than handles

* ws object has data for all workers

* match pid from worker process with process from GPUs

* send nvml data per worker/gpu not all gpus

* remove notebook

Co-authored-by: Benjamin Zaitlen <quasiben@gmail.com>
@jacobtomlinson deleted the nvml-diag-visible branch on October 19, 2020
@jacobtomlinson (Member Author)

Thanks for pushing this through everyone!

Successfully merging this pull request may close these issues: Incorrect GPU Memory Totals (#3808)

7 participants