Adaptive stops when workers preempted #4591
Comments
Thanks for raising this. It is not 100% clear to me why we call adaptive stop here. My guess would be that, for cases where scaling up is failing, we wouldn't want adaptive to hammer some service that is rejecting our jobs. It is interesting to see this failing when things are scaling down too; I don't think that should be happening. I think it would be reasonable to refactor the code here to ignore exceptions from scaling down, but stop adapting if we fail to scale up. Do you have any interest in raising a PR?

I'm also curious about your compute. When a node is preempted, does the process receive any signals? In #2844 there is some work to make workers exit more gracefully when receiving a SIGINT. Perhaps this would also help your use case.
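A minimal sketch of what that refactor might look like, assuming an adapt loop that dispatches to scale-up/scale-down helpers. The class, method names, and exception handling below are illustrative placeholders and do not necessarily match the actual `distributed.deploy.adaptive` internals:

```python
import logging

logger = logging.getLogger(__name__)


class AdaptiveSketch:
    """Hypothetical adaptive loop; not the real distributed implementation."""

    def stop(self):
        # In the real Adaptive this would cancel the periodic adapt callback.
        logger.info("adaptivity stopped")

    async def recommendations(self):
        # Placeholder: the real method inspects scheduler load and returns
        # a recommendation such as {"status": "up", ...} or {"status": "down", ...}.
        return {"status": "same"}

    async def scale_up(self, **kwargs): ...

    async def scale_down(self, **kwargs): ...

    async def adapt(self):
        recommendations = await self.recommendations()
        status = recommendations.pop("status", None)

        if status == "up":
            try:
                await self.scale_up(**recommendations)
            except Exception:
                # Scaling up is being rejected (e.g. by the resource manager),
                # so stop adapting rather than hammering it with requests.
                logger.exception("Failed to scale up; stopping adaptivity")
                self.stop()
        elif status == "down":
            try:
                await self.scale_down(**recommendations)
            except Exception:
                # A preempted worker can make scale-down fail transiently;
                # log it and retry on the next adapt interval instead of stopping.
                logger.warning("Scale-down failed; retrying next interval",
                               exc_info=True)
```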
Thanks for the quick response. I mean if the cluster fails to scale up after

It might only be hammering the resource manager if an exception is thrown at the time of the call to

Yup, happy to contribute.

It receives a SIGTERM, then a forceful SIGKILL. In #2844, I see that SIGTERM and SIGINT are given the same treatment. I would think that SIGTERM should also instruct the workers to shut down gracefully?
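For illustration, a minimal sketch (not the actual handler in distributed) of wiring SIGTERM to the same graceful close that a SIGINT would trigger; `worker` and `loop` are assumed to be an already-running `distributed.Worker` and the asyncio event loop it runs on:

```python
import asyncio
import signal


def install_graceful_shutdown(worker, loop):
    """Close the worker cleanly when the node is reclaimed.

    ``worker`` and ``loop`` are placeholders for whatever the deployment
    actually provides; this is a sketch, not distributed's real signal handling.
    """

    def _handler(signum, frame):
        # Schedule a graceful close on the worker's event loop instead of
        # letting the process die abruptly and lose its in-memory data.
        asyncio.run_coroutine_threadsafe(worker.close(), loop)

    # Treat preemption (SIGTERM) the same as a Ctrl-C (SIGINT).
    for sig in (signal.SIGINT, signal.SIGTERM):
        signal.signal(sig, _handler)
```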
Yeah, that's reasonable. Perhaps @mrocklin can remember why stop is called there.
Yeah, sounds reasonable.
What happened:
I'm using a custom Dask provider to run a distributed Dask cluster on on-prem compute. These compute nodes can be preempted. I've switched on autoscaling with a minimum of 0 workers and a maximum of 500. Whenever I persist a DataFrame (200 GB+) onto the cluster and workers are preempted, this is observed in the logs:
Looking at pull request #4422, I noticed this logging was made more verbose. I made that change locally, and was seeing this in the log:
When this happens, the cluster is unable to scale down to the minimum number of workers, even after the persisted DataFrame is deleted.
What you expected to happen:
Adaptive to be resilient to preemption.
Minimal Complete Verifiable Example:
You can reproduce this by generating 200+ GB of random timeseries data and persisting it to a Dask cluster. Kill a few workers, to simulate preemption, once the DataFrame is about 90% persisted (as seen on the Dask scheduler dashboard's status tab). A rough sketch of these steps is below.
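A hedged sketch of those reproduction steps; `MyOnPremCluster` is a placeholder for the custom on-prem deployment described above, and the date range is only sized to produce a few hundred GB of data:

```python
import dask
from distributed import Client

cluster = MyOnPremCluster()            # hypothetical custom deployment (placeholder)
cluster.adapt(minimum=0, maximum=500)  # autoscaling bounds as described above
client = Client(cluster)

# A long date range yields a few hundred GB of random timeseries data.
df = dask.datasets.timeseries(start="2000-01-01", end="2015-12-31")
df = df.persist()

# While the persist is ~90% done (watch the dashboard's status tab),
# kill a few worker processes/nodes to simulate preemption.
```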
Anything else we need to know?:
Are there good reasons for calling stop on adaptive when this happens? I've tried removing the call to `.stop()` locally, and the cluster seems to run fine afterwards.

I am also aware of issue #4421. The solution there was for the workers to poll a metadata URL from the public cloud provider for an early warning of preemption before shutting down gracefully. However, that is not an option in my case.
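A hedged sketch of one way to apply that local workaround without patching the library, by subclassing Adaptive so that transient scaling failures do not stop adaptivity; the class name and the `stop()` override are illustrative, not an endorsed API:

```python
# Illustrative workaround sketch: keep adapting even if a scaling call raises,
# by making ``stop()`` a no-op on a custom Adaptive subclass. This mirrors
# "removing the call to .stop()" described above, without editing distributed.
from distributed.deploy.adaptive import Adaptive


class PreemptionTolerantAdaptive(Adaptive):
    def stop(self, *args, **kwargs):
        # Swallow stop requests caused by transient scale-up/scale-down errors
        # (e.g. preempted workers) so the adapt loop keeps running.
        pass


# Hypothetical usage: Cluster implementations generally accept an Adaptive
# class via ``cluster.adapt(...)``.
# cluster.adapt(minimum=0, maximum=500, Adaptive=PreemptionTolerantAdaptive)
```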
Environment: