Log when spilling from GPU memory to host memory and from host memory to disk happens. #438
Comments
We can do a little bit of this in dask-cuda, but the host memory to disk spilling occurs in dask/distributed and in zict, which I don't think we log. @pentschev is that correct?
No, we entirely replace Dask's spilling mechanism with our own, so we can log things without messing with Dask core. We might need to add a new configuration flag in Distributed though; I'm not sure if we can extend dask-cuda to have its own configs without adding that to Distributed, can we @quasiben?
We could have our own config, but I would caution against this. It makes sense for things like dask-jobqueue, dask-cloudprovider, etc., but we already have so much in dask-core (ucx and rmm config) that I would think we would want to keep those things, which mostly dask-cuda uses, centralized. What kind of config flag were you thinking about?
I was thinking of something like …
Are we sure this couldn't also be used by distributed?
I'm not saying it couldn't be, but rather that I think it wouldn't be, at least not immediately. I'm not opposed to having that functionality in distributed as well, but someone's gotta do the work there, which I assume won't happen unless someone really needs it.
@EvenOldridge I had a discussion offline with @quasiben around this request. I believe what your users want is some kind of log that displays directly on the client console, or some other user-accessible place, without much change on their part -- is that assumption correct? Today there's no functionality like that in Dask, neither in Dask-CUDA nor in Distributed, and it's not clear to us whether that should ever be Dask's responsibility. One idea we had is that users could register an async PeriodicCallback function that would call …; a sketch of that client-side idea follows below. Second, we still need to log the spill calls themselves, and that is Dask-CUDA's responsibility -- potentially Distributed's in the future. We'll start by logging only in Dask-CUDA for now and generalize later.
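For concreteness, here is a minimal sketch of that client-side idea: periodically pull worker logs and print spill-related lines on the client console. It assumes spill events are logged at INFO level under `distributed.worker` with the word "Spilling" in the message (the format shown in #442); the function name, the `device_memory_limit` value, and the plain polling loop standing in for an async PeriodicCallback are illustrative only.

```python
# Sketch only: surface spill-related worker log lines on the client console.
import time

from dask.distributed import Client
from dask_cuda import LocalCUDACluster


def report_spill_logs(client, seen):
    """Print any not-yet-seen worker log lines that mention spilling."""
    for worker, records in client.get_worker_logs().items():
        for _level, message in records:
            if "Spilling" in message and (worker, message) not in seen:
                seen.add((worker, message))
                print(f"{worker}: {message}")


if __name__ == "__main__":
    cluster = LocalCUDACluster(device_memory_limit="1GB")  # low limit to force spilling
    client = Client(cluster)
    seen = set()
    # ... submit GPU work that exceeds device_memory_limit here ...
    for _ in range(12):  # simple polling loop; a PeriodicCallback could do this asynchronously
        report_spill_logs(client, seen)
        time.sleep(5)
```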
I opened #442 with an initial attempt to solve this, covering only …
This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.
This PR allows enabling logs for spilling operations, as requested in #438. This is definitely not the cleanest solution, but it allows us to test without changing anything in Distributed or Zict, although those changes could be made in the two projects in the future, removing the need for us to create subclasses. We might also want to have a specific configuration for logging; once we're confident about the best way to handle that, we would need to decide whether it should be a Distributed or a Dask-CUDA configuration. Here's an example of what the logs look like:

```
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 0) from Device to Host
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 1) from Device to Host
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 2) from Device to Host
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 3) from Device to Host
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 4) from Device to Host
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 5) from Device to Host
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 6) from Device to Host
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 7) from Device to Host
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 8) from Device to Host
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 9) from Device to Host
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 10) from Device to Host
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 11) from Device to Host
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 12) from Device to Host
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 0) from Host to Disk
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 4) from Host to Device
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 7) from Host to Device
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 6) from Host to Device
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 8) from Host to Device
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 0) from Host to Device
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 0) from Disk to Host
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 2) from Host to Device
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 3) from Host to Device
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 1) from Host to Device
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 12) from Host to Device
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 5) from Host to Device
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 11) from Host to Device
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 10) from Host to Device
distributed.worker - INFO - Worker at <tcp://127.0.0.1:39753>: Spilling key ('random_sample-587bb130aacf2dae8cd3ff7b4309027e', 9) from Host to Device
```

Authors:
- Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
- Mads R. B. Kristensen (https://github.com/madsbk)

URL: #442
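For illustration, here is a minimal sketch of the wrapper idea the PR describes: a mapping that logs whenever a key moves into or out of a slower tier, which is roughly what subclassing the Distributed/Zict spill containers achieves. The `LoggingStore` class, the tier labels, and the plain dict standing in for disk storage are made up for this sketch; they are not dask-cuda's actual classes.

```python
# Hypothetical sketch of the "log on spill" wrapper; not dask-cuda's actual implementation.
import logging
from collections.abc import MutableMapping

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("distributed.worker")


class LoggingStore(MutableMapping):
    """Wrap the slow side of a spill buffer and log every move in (spill) or out (un-spill)."""

    def __init__(self, inner, src="Host", dst="Disk"):
        self.inner = inner  # the mapping that actually holds spilled data
        self.src, self.dst = src, dst

    def __setitem__(self, key, value):
        logger.info("Spilling key %s from %s to %s", key, self.src, self.dst)
        self.inner[key] = value

    def __getitem__(self, key):
        logger.info("Spilling key %s from %s to %s", key, self.dst, self.src)
        return self.inner[key]

    def __delitem__(self, key):
        del self.inner[key]

    def __iter__(self):
        return iter(self.inner)

    def __len__(self):
        return len(self.inner)


# Example: a plain dict stands in for on-disk storage.
disk = LoggingStore({}, src="Host", dst="Disk")
disk["x"] = b"some spilled bytes"  # logs: Spilling key x from Host to Disk
_ = disk["x"]                      # logs: Spilling key x from Disk to Host
```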
This has been addressed in #442, closing.
Add logs, which can be turned on/off with dask.config.set, that record when spilling occurs.
Spilling has such an adverse effect on performance that it's important to highlight to our customers when it happens. It would probably also be useful to point to guidelines for improvement the first time it occurs.
Something along the lines of:
"Worker X | Task Y hit GPU memory limits. ___ MB? spilled to host. This can impact overall task performance."
"Increase your number of GPUs or GPU memory to avoid this issue. Webpage provides guidance for configuring your cluster and workflow to avoid spilling" (only on first error of this type?)
and
"Worker X | Task Y hit host memory limits. ____ MB? spilled to disk. This will significantly impact performance."
"Increase on host memory to avoid this issue. Webpage_ provides guidance for configuring your cluster and workflow to avoid spilling" (only on first error of this type?)
Obviously these need to be refined and workshopped, but they capture the gist of what I'd like to see.
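To make the request concrete, here is a rough sketch of how such a toggle could look. The config key `cuda.spill-logging` and the helper function are hypothetical, made up purely for illustration; they are not existing dask or dask-cuda settings.

```python
# Hypothetical sketch of a dask.config-controlled spill warning; the key
# "cuda.spill-logging" and this helper are illustrative, not real settings.
import logging

import dask

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("distributed.worker")


def warn_on_spill(worker, task, nbytes, src="GPU", dst="host"):
    """Emit a spill warning of the kind described above, if the (hypothetical) flag is set."""
    if not dask.config.get("cuda.spill-logging", default=False):
        return
    logger.warning(
        "Worker %s | Task %s hit %s memory limits. %.0f MB spilled to %s. "
        "This can impact overall task performance.",
        worker, task, src, nbytes / 1e6, dst,
    )


# Turning the messages on for a block of code:
with dask.config.set({"cuda.spill-logging": True}):
    warn_on_spill("tcp://127.0.0.1:39753", "random_sample-...", 512e6)
```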