
Limit GPU metrics to visible devices only #3810

Merged: 8 commits into dask:master on Oct 6, 2020

Conversation

@jacobtomlinson (Member) commented May 18, 2020

It looks like the GPU diagnostics read from all GPUs on the system.

When starting a LocalCUDACluster one worker is started per GPU, which means we are seeing duplication in the metrics as all workers report totals for all GPUs.

This PR restricts the NVML diagnostics to read data only from the GPUs specified in CUDA_VISIBLE_DEVICES (or all GPUs if it is not set).

Fixes #3808.
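
For context, a minimal sketch of the idea (not the exact code merged in this PR; the helper name visible_gpu_handles is illustrative, and integer device indices in CUDA_VISIBLE_DEVICES are assumed):

import os

import pynvml


def visible_gpu_handles():
    """Return NVML handles only for the GPUs listed in CUDA_VISIBLE_DEVICES."""
    pynvml.nvmlInit()
    count = pynvml.nvmlDeviceGetCount()
    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    if visible is None:
        indices = list(range(count))  # unset: report every GPU
    else:
        indices = [int(i) for i in visible.split(",") if i.strip()]
    # skip entries that don't correspond to a real device (e.g. "8" on an 8-GPU box)
    return [pynvml.nvmlDeviceGetHandleByIndex(i) for i in indices if i < count]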

@jacobtomlinson (Member Author)

Running the tests locally shows that the global handles property may be causing some issues.

If I run each test individually it passes. However if I run them all together the second test onwards fails. This is likely because of the global variable not being reset between tests.

@jacobtomlinson (Member Author)

Also the dashboard is still showing the incorrect metric.

[dashboard screenshot]

There are 8 32GB GPUs, so it should show 256GB, but it shows 8 x 8 x 32GB = 2TB.

@quasiben (Member)

Thanks for working on this @jacobtomlinson. One thing I wanted to point out is that the global handle will probably cause issues between tests. You might consider checking whether nvmlInit() has been called, rather than caching handles. UCX does something like this:

def init_once():
    global ucp, host_array, device_array
    if ucp is not None:
        return

    import ucp as _ucp

    ucp = _ucp
    # remove/process dask.ucx flags for valid ucx options
    ucx_config = _scrub_ucx_config()
    ucp.init(options=ucx_config, env_takes_precedence=True)
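
A minimal sketch of that suggestion applied to NVML (the module-level flag and function names are hypothetical, not the code that ended up in this PR):

import pynvml

nvml_initialized = False  # module-level flag, analogous to `ucp` above


def init_once():
    global nvml_initialized
    if nvml_initialized:
        return
    pynvml.nvmlInit()
    nvml_initialized = True

Tests could then reset nvml_initialized between runs instead of poking at a cached list of handles.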

@mrocklin (Member)

Checking in. What is the status here?

@jacobtomlinson (Member Author)

Still on my radar. I ended up going quite deep down the pynvml rabbit hole here, mainly trying to see if I could do this in a neat way without globals.

@@ -8,7 +9,16 @@ def _pynvml_handles():
    if handles is None:
        pynvml.nvmlInit()
        count = pynvml.nvmlDeviceGetCount()
        handles = [pynvml.nvmlDeviceGetHandleByIndex(i) for i in range(count)]
        cuda_visible_devices = [

@trivialfis

Since dask-cuda just reorders the devices to change the CUDA device enumeration, this is still getting all the devices, just like NVML does, right?

@mrocklin (Member)


I'm not sure that I understand this comment. My understanding is that pynvml doesn't respect the CUDA_VISIBLE_DEVICES environment variable, and so we need to handle this manually.

@mrocklin (Member)

It looks like this is mostly done, but there may be some small issues with testing. @quasiben, is there anyone else on your team who would want to take this over? It seems like an easy thing for @pentschev perhaps?

@quasiben (Member)

I pushed a small fix to the code which resolves the testing issues. Once this is merged in we can add it to the ucx-py CI or dask-cuda. Something like the following:
https://github.com/rapidsai/ucx-py/blob/branch-0.16/ci/gpu/build.sh

@pentschev do you have time to review?

@trivialfis commented Sep 29, 2020

@mrocklin Thanks for the reply.

I'm not sure that I understand this comment. My understanding is that pynvml doesn't respect the CUDA_VISIBLE_DEVICES environment variable, and so we need to handle this manually.

From what I understand, dask-cuda's LocalCUDACluster manipulates CUDA_VISIBLE_DEVICES by reordering it. Feel free to correct me if I'm wrong:

If I have 2 GPUs, the first worker will have CUDA_VISIBLE_DEVICES=0,1 while the second one will have CUDA_VISIBLE_DEVICES=1,0. So len(handles) is still 2 after this PR. #4117 will continue to report 4 GPUs, and #3808 might be the same?
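
To illustrate the reordering described above (a sketch of the assumed dask-cuda behavior, not its actual code):

def rotated_visible_devices(worker_index, n_gpus):
    """Rotate device indices so each worker sees 'its' GPU first."""
    devices = list(range(n_gpus))
    rotated = devices[worker_index:] + devices[:worker_index]
    return ",".join(str(d) for d in rotated)


print(rotated_visible_devices(0, 2))  # "0,1" for worker 0
print(rotated_visible_devices(1, 2))  # "1,0" for worker 1

Both workers still list both GPUs, so filtering handles by CUDA_VISIBLE_DEVICES alone leaves len(handles) == 2 per worker, which is the duplication being described here.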

@pentschev (Member) left a comment

Sorry for the delay here. The changes look good generally, and the results also look good to me:

# 1 GPU
$ CUDA_VISIBLE_DEVICES=0 python -c "from distributed.diagnostics import nvml; print(nvml.one_time())"
{'memory-total': [34089730048], 'name': ['Tesla V100-SXM2-32GB']}

# 2 GPUs
$ CUDA_VISIBLE_DEVICES=0,2 python -c "from distributed.diagnostics import nvml; print(nvml.one_time())"
{'memory-total': [34089730048, 34089730048], 'name': ['Tesla V100-SXM2-32GB', 'Tesla V100-SXM2-32GB']}

# 2 GPUs requested (GPU 8 doesn't exist, so we still get only 1 output as expected)
$ CUDA_VISIBLE_DEVICES=0,8 python -c "from distributed.diagnostics import nvml; print(nvml.one_time())"
{'memory-total': [34089730048], 'name': ['Tesla V100-SXM2-32GB']}

# 9 GPUs requested (GPU 8 doesn't exist, so we still get only 8 outputs as expected)
$ CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8 python -c "from distributed.diagnostics import nvml; print(nvml.one_time())"
{'memory-total': [34089730048, 34089730048, 34089730048, 34089730048, 34089730048, 34089730048, 34089730048, 34089730048], 'name': ['Tesla V100-SXM2-32GB', 'Tesla V100-SXM2-32GB', 'Tesla V100-SXM2-32GB', 'Tesla V100-SXM2-32GB', 'Tesla V100-SXM2-32GB', 'Tesla V100-SXM2-32GB', 'Tesla V100-SXM2-32GB', 'Tesla V100-SXM2-32GB']}

# 1 non-existing GPU
$ CUDA_VISIBLE_DEVICES=8 python -c "from distributed.diagnostics import nvml; print(nvml.one_time())"
{'memory-total': [], 'name': []}

Thanks @jacobtomlinson and @quasiben for the work here.

@quasiben (Member)

@trivialfis, if you have a minute, can you test out this PR? I think we are ready to merge, but it would be good to hear from you before doing that.

@mrocklin marked this pull request as ready for review on September 30, 2020
@trivialfis commented Sep 30, 2020

I have 2 GPUs. After installing this patch:

[screenshot: Bokeh GPU dashboard]

Calling print('handles:', handles) in real_time():

handles: [<pynvml.nvml.LP_struct_c_nvmlDevice_t object at 0x7f3b5170a440>, <pynvml.nvml.LP_struct_c_nvmlDevice_t object at 0x7f3b5170a540>]
handles: [<pynvml.nvml.LP_struct_c_nvmlDevice_t object at 0x7fb49c092740>, <pynvml.nvml.LP_struct_c_nvmlDevice_t object at 0x7fb49c092840>]
handles: [<pynvml.nvml.LP_struct_c_nvmlDevice_t object at 0x7fb488547840>, <pynvml.nvml.LP_struct_c_nvmlDevice_t object at 0x7fb49c09ff40>]

...

Printing CUDA_VISIBLE_DEVICES in my local process:

CUDA: 0,1
CUDA: 1,0

dask-cuda is installed from the 0.16 branch. Hope that helps.

@pentschev (Member)

It seems that the visual diagnostics are reporting all devices in each process, but each process should only report GPU 0 (relative to CUDA_VISIBLE_DEVICES), at least for dask-cuda.

@trivialfis commented Sep 30, 2020

I wanted to help fix it, but didn't know how to map each worker to its GPU.

@pentschev (Member)

Speaking of dask-cuda exclusively, each worker is already mapped to a single GPU. Internally, the process will always use GPU 0, relative to CUDA_VISIBLE_DEVICES. What I think is happening in this report is that it's getting all the GPUs listed in CUDA_VISIBLE_DEVICES and displaying them all. So in your 2-GPU case, the dashboard reports CUDA_VISIBLE_DEVICES=0,1 and CUDA_VISIBLE_DEVICES=1,0 for workers 0 and 1, respectively, rather than just using the first GPU listed in CUDA_VISIBLE_DEVICES for each of these two workers.

I must admit I have no clue where the code for displaying that even lives, but I think it's doing what I wrote above.

@trivialfis commented Sep 30, 2020

I'm not familiar with the internals of worker resource management, so I came up with this workaround. Since the problem is that NVML doesn't respect the CUDA visible-devices setting, we need to map the CUDA device to an NVML device index (see the sketch after this list):

  1. Get the first device using the CUDA API cudaGetDevice; this is controlled by the visible devices.
  2. Get its UUID by calling the CUDA API cudaGetDeviceProperties; call this worker_uuid.
  3. Enumerate all devices with NVML and record each device index from the enumeration.
  4. List the UUIDs of those enumerated device indices using the NVML API nvmlDeviceGetUUID.
  5. Filter out the NVML device indices whose UUID doesn't match worker_uuid, leaving a single valid NVML device index.
  6. Use that index to get GPU usage from NVML.
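
A rough sketch of those steps (the get_current_cuda_device_uuid argument is a hypothetical stand-in for a cudaGetDevice/cudaGetDeviceProperties call, e.g. via CuPy or Numba; the NVML side uses real pynvml calls):

import pynvml


def nvml_index_for_current_device(get_current_cuda_device_uuid):
    """Return the NVML index of the GPU this process is actually using."""
    worker_uuid = get_current_cuda_device_uuid()  # e.g. "GPU-abcd1234-..."

    pynvml.nvmlInit()
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        uuid = pynvml.nvmlDeviceGetUUID(handle)
        if isinstance(uuid, bytes):  # some pynvml versions return bytes
            uuid = uuid.decode()
        if uuid == worker_uuid:
            return i
    raise RuntimeError("no NVML device matches the current CUDA device")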

@quasiben (Member) commented Oct 1, 2020

@trivialfis, can you pull the latest changes and test again? (e020434)

I am still not entirely clear on what is happening here, but it looks like GPU data for all workers is stored in each worker -- maybe something in the scheduler update is not quite right?

For now, I've made a small change to how we iterate through the GPU info for the display. For anyone interested, the logic is here:

for ws in workers:
    try:
        info = ws.extra["gpu"]
    except KeyError:
        continue
    metrics = ws.metrics["gpu"]
    for j, (u, mem_used, mem_total) in enumerate(
        zip(
            metrics["utilization"],
            metrics["memory-used"],
            info["memory-total"],
        )
    ):
        memory_max = max(memory_max, mem_total)
        memory_total += mem_total
        utilization.append(int(u))
        memory.append(mem_used)
        worker.append(ws.address)
        gpu_index.append(j)
        y.append(i)
        i += 1

@trivialfis commented Oct 1, 2020

Let me try that later tonight.

@trivialfis

Hi, sorry, I don't think I can get back to this today. Feel free to ignore me, as I will be out of the office for a few more days.

info["memory-total"],
)
):
# find which GPU maps to which process
if ws.pid not in procs:
@quasiben (Member)


I think I resolved the logic issues in the diagnostics display by adding in process information from each GPU. What we had before was that each worker on a node would collect data for all the GPUs (if multiple were available) and pass it back to this for-loop, so we were displaying n^2 the amount of data:

Node 1-worker 1 -> GPU0->8
Node 1-worker 2 -> GPU0->7
....
Node 2-worker 1 -> GPU0->7
Node 2-worker 2 -> GPU0->7
....

This is not ideal, but by collecting process information from the GPU and passing it back to the dashboard logic, we can now match up the pid of the worker with the pids reported by the GPU and only display a GPU when the pids match.
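
For context, a minimal sketch of how a worker could gather the pids running on its GPU so the dashboard can match them against ws.pid (nvmlDeviceGetComputeRunningProcesses is a real pynvml call; picking handle index 0 is a simplification):

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # simplified: first GPU only

# pids of compute processes currently running on this GPU
gpu_pids = [p.pid for p in pynvml.nvmlDeviceGetComputeRunningProcesses(handle)]

The dashboard loop can then skip a GPU entry whenever the worker's pid is not in that list, which is what the if ws.pid not in procs check quoted above does.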

@quasiben (Member) commented Oct 2, 2020

@pentschev, do you think this is the best way to gather unique worker processes per GPU?

@pentschev (Member)


Why can't you do what I said in #3810 (comment)? There's only one relevant GPU per process -- the first GPU in CUDA_VISIBLE_DEVICES. We would need something equivalent to:

import os
import pynvml

def get_used_memory():
    # the first entry in CUDA_VISIBLE_DEVICES is the GPU this worker uses
    gpu_idx = int(os.environ.get("CUDA_VISIBLE_DEVICES", "0").split(",")[0])
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_idx)
    return pynvml.nvmlDeviceGetMemoryInfo(handle).used

used_memory_per_gpu = client.run(get_used_memory)

In other words, we shouldn't be capturing data for all GPUs in CUDA_VISIBLE_DEVICES and reporting them all for each worker, but only the first one.

@pentschev (Member)


And note that, to answer your question more directly, this may work fine until we have PID collisions, which are bound to happen in sufficiently large clusters with multiple nodes. You'd probably need to match that with a unique identifier for each node to be more resilient; I'm not sure whether we have a way to do that in Dask. But this solution is fine with me as well; the only other way to make it more reliable would be to use pynvml.nvmlDeviceGetSerial to match by a GPU's serial number.
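
A minimal sketch of the serial-number idea (nvmlDeviceGetSerial is a real pynvml call, though not every board exposes a serial; the key construction is illustrative):

import os

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

serial = pynvml.nvmlDeviceGetSerial(handle)
if isinstance(serial, bytes):  # some pynvml versions return bytes
    serial = serial.decode()

# combining the board serial with the worker pid stays unique even if
# pids collide across nodes
key = (serial, os.getpid())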

Member


What about including an IP address as well?

Member


sorry, forgot to push the change

@quasiben (Member) commented Oct 2, 2020

Attached are plots from the latest changes:

[screenshot: GPU dashboard plots, Oct 1, 2020]

I ran this on a cluster with 8 workers per node on two nodes. In the above, I've circled two workers, each on a different box. Notice one IP is 127.0.0.1 and the other is a remote worker with an IP of 10.33.XXX.XX.

@benjha you were recently asking about collecting GPU stats from remote workers and I thought this PR may be helpful for you.

@benjha commented Oct 2, 2020

@benjha you were recently asking about collecting GPU stats from remote workers and I thought this PR may be helpful for you.

Thanks @quasiben (cc @jakirkham, @pentschev rapidsai/dask-cuda#36 (comment)).

I am trying newer versions of the packages (dask 2.25.0 / dask-labextension 3.0) and the metrics look closer to what is being reported here. I'll update to these.

@mrocklin (Member) commented Oct 2, 2020

OK, cool. @quasiben, should this be merged in?

@quasiben (Member) commented Oct 2, 2020

If it's OK, I'd like @pentschev to weigh in on one last question.

@mrocklin (Member) commented Oct 2, 2020 via email

@pentschev (Member)

We can't guarantee that the user will only want one gpu per process. This is true for dask-cuda situations, but may not be universal.

Can you cite one example where this is the case today? Coming up with a universal solution when we don't have an established set of ways to use Dask with GPUs is rather unlikely. For example, we could theoretically have someone running multiple Dask workers accessing the same GPU; that's also true for multiple workers as threads. Do we report one GPU, or each worker as if it had an exclusive GPU even though they are sharing the same one?

@mrocklin (Member) commented Oct 2, 2020 via email

@jakirkham (Member)

Yeah I would suggest waiting until some users step forward asking for additional functionality to be handled before trying to design beyond known use cases. We would want them engaged in the design process to make sure we are solving problems they care about.

@quasiben (Member) commented Oct 2, 2020

Thanks @pentschev. In the last commit, Dask is now sending one NVML data point and one NVML handle per worker.

@pentschev (Member) left a comment

Seems reasonable to me @quasiben. I didn't test it myself, though; I'm going to trust that it's working in your tests. :)

Review comment on distributed/dashboard/components/Untitled.ipynb (outdated, resolved)
@quasiben (Member) commented Oct 6, 2020

Removed the accidental file, fixed the test, and added a check for multiple GPUs

@pentschev (Member)

LGTM, I'm fine with merging this as is. Thanks @quasiben!

@quasiben (Member) commented Oct 6, 2020

I'll wait for CI to pass here, then merge. After that, I'll add this to dask-cuda for GPU CI.

@quasiben merged commit 6efd5b6 into dask:master on Oct 6, 2020
@quasiben (Member) commented Oct 6, 2020

Staging tests here: rapidsai/dask-cuda#408

lr4d pushed a commit to lr4d/distributed that referenced this pull request Oct 9, 2020
* Limit GPU metrics to visible devices only

* Move importorskip

* init nvmInit once rather than handles

* ws object has data for all workers

* match pid from worker process with process from GPUs

* send nvml data per worker/gpu not all gpus

* remove notebook

Co-authored-by: Benjamin Zaitlen <quasiben@gmail.com>
@jacobtomlinson deleted the nvml-diag-visible branch on October 19, 2020
@jacobtomlinson (Member Author)

Thanks for pushing this through everyone!

Successfully merging this pull request may close these issues: Incorrect GPU Memory Totals (#3808)

7 participants