
Log when spilling from GPU memory to host memory and from host memory to disk happens. #438

Closed
EvenOldridge opened this issue Nov 5, 2020 · 10 comments
Labels: feature request, inactive-30d


@EvenOldridge

Add logging, toggled on and off with dask.config.set, that records when spilling occurs.

Spilling has such an adverse effect on performance that it's important to highlight to our customers when it happens. It would probably also be useful to point to guidelines for improvement the first time it occurs.

Something along the lines of:

"Worker X | Task Y hit GPU memory limits. ___ MB? spilled to host. This can impact overall task performance."
"Increase your number of GPUs or GPU memory to avoid this issue. Webpage provides guidance for configuring your cluster and workflow to avoid spilling" (only on first error of this type?)

and

"Worker X | Task Y hit host memory limits. ____ MB? spilled to disk. This will significantly impact performance."
"Increase on host memory to avoid this issue. Webpage_ provides guidance for configuring your cluster and workflow to avoid spilling" (only on first error of this type?)

Obviously these need to be refined and workshopped, but they capture the gist of what I'd like to see.
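
For illustration, the requested toggle might look something like the sketch below; the configuration key is purely hypothetical, since no name had been decided at this point.

```python
import dask

# Hypothetical key name -- no actual configuration option existed yet when this
# issue was opened; the point is only that spill logging should be toggleable.
dask.config.set({"dask-cuda.logging.spilling": True})
```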

@quasiben (Member) commented Nov 5, 2020

We can do a little bit of this in dask-cuda, but the host-memory-to-disk spilling happens in dask/distributed and in zict, which I don't think we log. @pentschev, is that correct?

@pentschev (Member)

No, we entirely replace Dask's spilling mechanism with our own, so we can log things without messing with Dask core. We might need to add a new configuration flag in Distributed, though; I'm not sure whether we can have our own configs in dask-cuda without adding them to Distributed. Can we, @quasiben?

@quasiben (Member) commented Nov 7, 2020

We could have our own config, but I would caution against this. It makes sense for projects like dask-jobqueue, dask-cloudprovider, etc., but we already have so much in dask core (UCX and RMM config) that I think we would want to keep those things, which are mostly used by dask-cuda, centralized. What kind of config flag were you thinking about?

@pentschev (Member)

> We could have our own config, but I would caution against this. It makes sense for projects like dask-jobqueue, dask-cloudprovider, etc., but we already have so much in dask core (UCX and RMM config) that I think we would want to keep those things, which are mostly used by dask-cuda, centralized. What kind of config flag were you thinking about?

I was thinking of something like `logging.dask-cuda.spilling`, because it would be used by dask-cuda only. I don't have an issue with that staying in distributed, but it feels really weird to have a flag there that will only be used by dask-cuda and has no effect whatsoever in distributed alone. The other flags you mentioned are both used primarily by dask-cuda users, but they were added to Distributed because we needed them to configure dask-scheduler; before that was necessary, we had it all in dask-cuda.

@jakirkham (Member)

Are we sure this couldn't also be used by distributed? It sounds like it could be, in which case maybe it would be better to use a generic name.

@pentschev (Member)

> Are we sure this couldn't also be used by distributed? It sounds like it could be, in which case maybe it would be better to use a generic name.

I'm not saying it couldn't be, but rather that I think it wouldn't be, at least not immediately. I don't oppose having that functionality in distributed as well, but someone has to do the work there, which I assume won't happen unless someone really needs it.

@pentschev (Member)

@EvenOldridge I had an offline discussion with @quasiben about this request. I believe what your users want is some kind of log that displays directly on the client console, or in some other user-accessible place, without much extra work. Is that assumption correct?

Today there's no functionality like that in Dask, neither in Dask-CUDA nor in Distributed, and it's not clear to us whether it should ever be their responsibility. One idea we had is that users could register an async PeriodicCallback that calls client.get_worker_logs(), parses the logs to keep only the relevant parts, and handles output/printing wherever is most convenient for the application at hand. Would that work for you?
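
A rough sketch of that idea, assuming an asynchronous client, a placeholder scheduler address, and that spill messages contain the word "Spilling" (as in the logs shown further down in this thread):

```python
import asyncio

from distributed import Client
from tornado.ioloop import PeriodicCallback


async def main():
    # Placeholder scheduler address for an already-running cluster.
    client = await Client("tcp://scheduler:8786", asynchronous=True)

    async def report_spilling():
        # get_worker_logs() returns {worker address: [(level, message), ...]}.
        logs = await client.get_worker_logs()
        for worker, messages in logs.items():
            for _, message in messages:
                if "Spilling" in message:
                    print(f"{worker}: {message}")

    # Poll the worker logs every 10 seconds (interval chosen arbitrarily).
    pc = PeriodicCallback(report_spilling, 10_000)
    pc.start()

    # ... run the actual workload here ...
    await asyncio.sleep(60)

    pc.stop()
    await client.close()


asyncio.run(main())
```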

Second, we still need to log the spill calls themselves, and this is Dask-CUDA's responsibility -- potentially Distributed's in the future. We'll start by logging only in Dask-CUDA for now and generalize later.

@pentschev (Member)

I opened #442 with an initial attempt to solve this, covering only LocalCUDACluster at the moment, but we can easily extend it if that helps solve the problem @EvenOldridge has been asking about.

@pentschev added the feature request label on Jan 8, 2021
@github-actions

This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.

rapids-bot (bot) pushed a commit that referenced this issue on Apr 5, 2021
This PR allows enabling logs for spilling operations, as requested in #438. This is definitely not the cleanest solution, but it allows us to test without changing anything in Distributed or Zict, although those changes could be made in both projects in the future, sparing us from creating subclasses. We might also want a specific configuration option for logging; once we're confident about the best way to handle that, we will need to decide whether it should be a Distributed or a Dask-CUDA configuration.
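
As a hypothetical usage sketch (the exact keyword name should be verified against the merged dask-cuda API), enabling the new spill logging on a LocalCUDACluster might look like this:

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# `log_spilling` is assumed here as the name of the option introduced by this PR;
# check the released dask-cuda documentation for the exact argument name.
cluster = LocalCUDACluster(device_memory_limit="1GB", log_spilling=True)
client = Client(cluster)
```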

Here's an example of what the logs look like:

```
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 0) from Device to Host
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 1) from Device to Host
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 2) from Device to Host
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 3) from Device to Host
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 4) from Device to Host
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 5) from Device to Host
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 6) from Device to Host
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 7) from Device to Host
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 8) from Device to Host
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 9) from Device to Host
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 10) from Device to Host
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 11) from Device to Host
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 12) from Device to Host
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 0) from Host to Disk
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 4) from Host to Device
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 7) from Host to Device
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 6) from Host to Device
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 8) from Host to Device
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 0) from Host to Device
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 0) from Disk to Host
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 2) from Host to Device
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 3) from Host to Device
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 1) from Host to Device
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 12) from Host to Device
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 5) from Host to Device
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 11) from Host to Device
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 10) from Host to Device
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 9) from Host to Device
```

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - Mads R. B. Kristensen (https://github.com/madsbk)

URL: #442
@pentschev (Member)

This has been addressed in #442, closing.
