dask-cuda should automatically register GPU memory resource tracking #36

Closed
randerzander opened this issue Apr 18, 2019 · 14 comments

@randerzander
Contributor

It would be nice for dask-cuda workers to automatically add GPU memory tracking like this.

Nicer still would be for the ${dask_scheduler_ip}:8787/status dashboard to show GPU memory bytes stored in addition to host memory bytes stored.

@mrocklin
Contributor

I agree that this would be great to have. It would also be interesting to think about what other dashboards we might provide for users about their GPUs, and about dashboards that might be useful outside the context of Dask. I know that some folks have been interested in this generally. I'd be happy to help anyone who wanted to push on this effort longer term.

Long term, we might want to use a library like pynvml to do this with a little less overhead.

Those are both long-term comments. I have no strong objection to using nvidia-smi as well in the short term, provided that it's not too expensive to run (I suspect that it blocks the worker every time we poll for resources).
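
For reference, a minimal sketch of what an in-process pynvml poll could look like, as opposed to shelling out to nvidia-smi. The device index 0 is an assumption; a dask-cuda worker would query whichever GPU it was assigned.

```python
# Minimal sketch (not dask-cuda's implementation): query GPU memory in-process
# with pynvml instead of spawning an nvidia-smi subprocess on every poll.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # assumption: first visible GPU


def gpu_memory_used() -> int:
    """Bytes of GPU memory currently in use on the selected device."""
    return pynvml.nvmlDeviceGetMemoryInfo(handle).used


print(gpu_memory_used())
```

Because this stays in-process, polling it from the worker's periodic callbacks should be far cheaper than forking nvidia-smi on every update.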

@jakirkham
Member

One other thing that might be interesting when thinking about memory specifically is leveraging Python's tracemalloc. This would give us a general way of tracking memory allocations, and that input could be fed into other things like dashboards or even used by scripts, libraries, or other user applications. To use this we would need to register those allocations ourselves. It could be something that RMM does at the Python interface level. Alternatively, several different libraries could do this and we could filter out their individual contributions to memory usage.

That said, pynvml is useful for more than just memory allocations. So it would certainly be useful for a variety of interesting diagnostic feedback.
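
For illustration, this is roughly what host-side tracking with tracemalloc looks like today. GPU allocations would only appear here if an allocator such as RMM explicitly registered them (the registration step mentioned above); the snippet below only traces ordinary Python heap allocations.

```python
# Sketch of plain tracemalloc usage. Only host-side Python allocations are
# traced; GPU allocations are invisible unless a library registers them.
import tracemalloc

tracemalloc.start()

buffers = [bytearray(10_000_000) for _ in range(4)]  # host-side allocations

current, peak = tracemalloc.get_traced_memory()
print(f"current: {current / 1e6:.1f} MB, peak: {peak / 1e6:.1f} MB")

# Top allocation sites -- the kind of breakdown a dashboard could surface
for stat in tracemalloc.take_snapshot().statistics("lineno")[:3]:
    print(stat)
```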

@randerzander
Contributor Author

As dask_cudf usage has grown, one-off scripts for monitoring GPU memory have started to proliferate.

Is there someone who can work on a first version of built-in GPU memory monitoring?

@mrocklin
Contributor

mrocklin commented Aug 8, 2019 via email

@mrocklin
Contributor

mrocklin commented Aug 8, 2019 via email

@randerzander
Contributor Author

It's a start. When debugging workflows, there's often a lengthy pause while someone driving a notebook jumps back to (multiple) terminals to check GPU memory usage via nvidia-smi.

Having the dashboard show total GPU memory used by persisted DataFrames would be a great first step in alleviating that pain.

I would think a future improvement would be surfacing peak memory used, since DataFrame operations often cause significant, albeit temporary, spikes in GPU memory usage.

Beyond that, having access to this type of data might let us determine whether a given workflow could safely benefit from using multiple threads or processes per GPU worker.
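
As an illustration of the peak-memory idea (a hypothetical sketch, not an existing dask-cuda feature), a background thread could sample pynvml frequently enough to catch spikes that fall between dashboard polls:

```python
# Hypothetical sketch: sample GPU memory in a background thread so short-lived
# spikes between dashboard polls still show up as a peak value.
import threading
import time

import pynvml


class PeakGpuMemory:
    def __init__(self, device_index=0, interval=0.1):
        pynvml.nvmlInit()
        self._handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        self._interval = interval
        self.peak = 0
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._poll, daemon=True)
        self._thread.start()

    def _poll(self):
        while not self._stop.is_set():
            used = pynvml.nvmlDeviceGetMemoryInfo(self._handle).used
            self.peak = max(self.peak, used)
            time.sleep(self._interval)

    def stop(self):
        """Stop sampling and return the peak bytes observed."""
        self._stop.set()
        self._thread.join()
        return self.peak
```

Starting a PeakGpuMemory before a workflow and calling stop() afterwards would give the peak observed, which a dashboard could report alongside the current value.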

@mrocklin
Contributor

mrocklin commented Aug 9, 2019

If you're on a single GPU, then you probably want the solution in https://github.com/rjzamora/jupyterlab-bokeh-server/tree/pynvml

This has been done for a while; we just needed to package it up (which requires people comfortable with both Python and JS, which is somewhat rare).

@jacobtomlinson seemed interested in doing this. Jacob, if this is easy for you and not too much of a distraction could you prioritize it?

@mrocklin
Contributor

mrocklin commented Aug 9, 2019

Regardless, I imagine I'll have the Dask version done in a week or two. It could be done sooner if this is a high priority for you, Randy. I get the sense that it's only a mild annoyance for now and not a burning fire, but I could be wrong.

@randerzander
Contributor Author

You read me well =)

I'm glad to know that there's been significant progress in the meantime.

@mrocklin
Contributor

mrocklin commented Aug 9, 2019

There is an initial pair of plots in dask/distributed#2944.

They won't be very discoverable until dask/dask-labextension#75

But you can navigate to them directly by going to /individual-gpu-memory and /individual-gpu-utilization of the scheduler's dashboard address, often :8787.

We might ask someone like @rjzamora to expand on that work, but I still think that, if you're on a single node, you're probably better off with his existing project.

@pentschev
Member

I think https://github.com/rapidsai/jupyterlab-nvdashboard addresses most of the requests (if not all). Regardless, I think that's now a more appropriate project for future feature requests than dask-cuda, so I'm closing this, but feel free to reopen should something in dask-cuda still be necessary.

@benjha

benjha commented Oct 1, 2020

Hi All,

Looks like this functionality has been out for a while. One of our users needs to get GPU metrics from dask-cuda-workers. In this setup, the scheduler is running on one node and the dask-cuda-workers on different nodes.

The question here is: which GPUs do /individual-gpu-memory and /individual-gpu-utilization refer to? The GPU(s) available in a local CUDA cluster?

On the other hand, I also tried the dask-labextension, but it only shows GPU metrics from the machine running the scheduler.

These are the Dask-related packages in use:

$ conda list | grep dask
dask                      2.19.0                     py_0    conda-forge
dask-core                 2.19.0                     py_0  
dask-cuda                 0.14.1                   pypi_0    pypi
dask-cudf                 0.14.0a0+5439.g6244cfc.dirty          pypi_0    pypi
dask-labextension         2.0.2                      py_0    conda-forge
dask_labextension         2.0.2                         0    conda-forge

and Jupyter

jupyter-server-proxy      1.5.0                      py_0    conda-forge
jupyter_client            6.1.6                      py_0  
jupyter_core              4.6.1                    py37_0  
jupyterlab                2.1.5                      py_0    conda-forge
jupyterlab-nvdashboard    0.3.1                      py_0    conda-forge
jupyterlab_server         1.2.0                      py_0 
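
One way to sanity-check which GPU each worker actually reports, independent of the dashboard pages, is to run a small query in every worker process with Client.run. This is only a sketch; it assumes pynvml is installed in each worker's environment and uses a placeholder scheduler address.

```python
# Sketch: query GPU memory directly on every dask-cuda worker, whichever node
# it lives on. Client.run executes the function in each worker process and
# returns a dict keyed by worker address.
from dask.distributed import Client


def gpu_memory_used():
    import pynvml
    pynvml.nvmlInit()
    # dask-cuda typically pins each worker to one GPU via CUDA_VISIBLE_DEVICES,
    # so index 0 is that worker's own device
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    return pynvml.nvmlDeviceGetMemoryInfo(handle).used


client = Client("tcp://scheduler-address:8786")  # placeholder scheduler address
print(client.run(gpu_memory_used))
```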

@pentschev
Member

I'm not sure about any of those questions. @quasiben @jakirkham are you familiar with those?

@jakirkham
Member

jakirkham commented Oct 1, 2020

Could we please move this over to a new issue?

Edit: Also would recommend including a screenshot if you can. That should make it a bit clearer what's going on 🙂
