Log when spilling from GPU memory to host memory and from host memory to disk happens. #438
Comments
We can do a little bit of this in dask-cuda, but the host memory to disk spilling occurs in dask/distributed and in zict, which I don't think we log. @pentschev is that correct?
No, we entirely replace Dask's spilling mechanism with our own, so we can log things without messing with Dask core. We might need to add a new configuration flag in Distributed though; I'm not sure if we can extend dask-cuda to have its own configs without adding that to Distributed, can we @quasiben?
We could have our own config, but I would caution against this. It makes sense for things like dask-jobqueue, dask-cloudprovider, etc., but we already have so much in dask-core (ucx and rmm config) that I would think we would want to keep those things, which mostly dask-cuda uses, centralized. What kind of config flag were you thinking about?
I was thinking of something like …
Are we sure this couldn't also be used by distributed?
I'm not saying it couldn't be, but rather that I think it wouldn't be, at least not immediately. I'm not opposed to having that functionality in distributed as well, but someone's gotta do the work there, which I assume won't happen unless someone really needs it.
@EvenOldridge I had a discussion offline with @quasiben around this request. I believe what your users want is some kind of log that displays directly on the client console, or some other user-accessible place, without much change on their part -- is that assumption correct? Today there's no functionality like that in Dask, neither in Dask-CUDA nor in Distributed, and it's not clear to us whether that should ever be Dask's responsibility. One idea we had is that users could register an async PeriodicCallback function that would call …; a sketch of that client-side idea follows below. Second, we still need to log the spill calls themselves, and that is Dask-CUDA's responsibility -- potentially Distributed's in the future. We'll start by logging only in Dask-CUDA for now and generalize later.
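For concreteness, here is a minimal sketch of that client-side idea: periodically pull worker logs and print spill-related lines on the client console. It assumes spill events are logged at INFO level under `distributed.worker` with the word "Spilling" in the message (the format shown in #442); the function name, the `device_memory_limit` value, and the plain polling loop standing in for an async PeriodicCallback are illustrative only.

```python
# Sketch only: surface spill-related worker log lines on the client console.
import time

from dask.distributed import Client
from dask_cuda import LocalCUDACluster


def report_spill_logs(client, seen):
    """Print any not-yet-seen worker log lines that mention spilling."""
    for worker, records in client.get_worker_logs().items():
        for _level, message in records:
            if "Spilling" in message and (worker, message) not in seen:
                seen.add((worker, message))
                print(f"{worker}: {message}")


if __name__ == "__main__":
    cluster = LocalCUDACluster(device_memory_limit="1GB")  # low limit to force spilling
    client = Client(cluster)
    seen = set()
    # ... submit GPU work that exceeds device_memory_limit here ...
    for _ in range(12):  # simple polling loop; a PeriodicCallback could do this asynchronously
        report_spill_logs(client, seen)
        time.sleep(5)
```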
I opened #442 with an initial attempt to solve this, covering only …
This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.
This PR allows enabling logs for spilling operations, as requested in #438. This is definitely not the cleanest solution, but it allows us to test without changing anything in Distributed or Zict, although those changes could be made in the two projects in the future, removing the need for us to create subclasses. We might also want to have a specific configuration for logging; once we're confident about the best way to handle that, we would need to decide whether it should be a Distributed or a Dask-CUDA configuration. Here's an example of what the logs look like:

```
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 0) from Device to Host
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 1) from Device to Host
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 2) from Device to Host
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 3) from Device to Host
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 4) from Device to Host
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 5) from Device to Host
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 6) from Device to Host
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 7) from Device to Host
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 8) from Device to Host
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 9) from Device to Host
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 10) from Device to Host
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 11) from Device to Host
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 12) from Device to Host
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 0) from Host to Disk
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 4) from Host to Device
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 7) from Host to Device
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 6) from Host to Device
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 8) from Host to Device
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 0) from Host to Device
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 0) from Disk to Host
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 2) from Host to Device
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 3) from Host to Device
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 1) from Host to Device
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 12) from Host to Device
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 5) from Host to Device
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 11) from Host to Device
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 10) from Host to Device
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 9) from Host to Device
```

Authors:
- Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
- Mads R. B. Kristensen (https://github.com/madsbk)

URL: #442
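For illustration, here is a minimal sketch of the wrapper idea the PR describes: a mapping that logs whenever a key moves into or out of a slower tier, which is roughly what subclassing the Distributed/Zict spill containers achieves. The `LoggingStore` class, the tier labels, and the plain dict standing in for disk storage are made up for this sketch; they are not dask-cuda's actual classes.

```python
# Hypothetical sketch of the "log on spill" wrapper; not dask-cuda's actual implementation.
import logging
from collections.abc import MutableMapping

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("distributed.worker")


class LoggingStore(MutableMapping):
    """Wrap the slow side of a spill buffer and log every move in (spill) or out (un-spill)."""

    def __init__(self, inner, src="Host", dst="Disk"):
        self.inner = inner  # the mapping that actually holds spilled data
        self.src, self.dst = src, dst

    def __setitem__(self, key, value):
        logger.info("Spilling key %s from %s to %s", key, self.src, self.dst)
        self.inner[key] = value

    def __getitem__(self, key):
        logger.info("Spilling key %s from %s to %s", key, self.dst, self.src)
        return self.inner[key]

    def __delitem__(self, key):
        del self.inner[key]

    def __iter__(self):
        return iter(self.inner)

    def __len__(self):
        return len(self.inner)


# Example: a plain dict stands in for on-disk storage.
disk = LoggingStore({}, src="Host", dst="Disk")
disk["x"] = b"some spilled bytes"  # logs: Spilling key x from Host to Disk
_ = disk["x"]                      # logs: Spilling key x from Disk to Host
```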
This has been addressed in #442, closing.
Add logs, which can be turned on/off with dask.config.set, that record when spilling occurs.
Spilling has such an adverse effect on performance that it's important to highlight to our customers when it happens. It would probably also be useful to point to guidelines for improvement the first time it occurs.
Something along the lines of:
"Worker X | Task Y hit GPU memory limits. ___ MB? spilled to host. This can impact overall task performance."
"Increase your number of GPUs or GPU memory to avoid this issue. Webpage provides guidance for configuring your cluster and workflow to avoid spilling" (only on first error of this type?)
and
"Worker X | Task Y hit host memory limits. ____ MB? spilled to disk. This will significantly impact performance."
"Increase on host memory to avoid this issue. Webpage_ provides guidance for configuring your cluster and workflow to avoid spilling" (only on first error of this type?)
Obviously these need to be refined and workshopped, but they capture the gist of what I'd like to see.
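To make the request concrete, here is a rough sketch of how such a toggle could look. The config key `cuda.spill-logging` and the helper function are hypothetical, made up purely for illustration; they are not existing dask or dask-cuda settings.

```python
# Hypothetical sketch of a dask.config-controlled spill warning; the key
# "cuda.spill-logging" and this helper are illustrative, not real settings.
import logging

import dask

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("distributed.worker")


def warn_on_spill(worker, task, nbytes, src="GPU", dst="host"):
    """Emit a spill warning of the kind described above, if the (hypothetical) flag is set."""
    if not dask.config.get("cuda.spill-logging", default=False):
        return
    logger.warning(
        "Worker %s | Task %s hit %s memory limits. %.0f MB spilled to %s. "
        "This can impact overall task performance.",
        worker, task, src, nbytes / 1e6, dst,
    )


# Turning the messages on for a block of code:
with dask.config.set({"cuda.spill-logging": True}):
    warn_on_spill("tcp://127.0.0.1:39753", "random_sample-...", 512e6)
```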