
Adaptive stops when workers preempted #4591

Closed

gerald732 opened this issue Mar 15, 2021 · 4 comments

Comments

@gerald732
Contributor

gerald732 commented Mar 15, 2021

What happened:
I'm using a custom Dask provider to run a distributed Dask cluster on on-prem compute. These compute nodes can be preempted. I've switched on autoscaling with a minimum of 0 workers and a maximum of 500. Whenever I persist a DataFrame (200 GB+) onto the cluster and workers are preempted, this shows up in the logs:

distributed.deploy.adaptive_core - INFO - Adaptive stop

Looking at PR #4422, I noticed this log message was made more verbose there. I applied that change locally and saw this in the log:

distributed.deploy.adaptive_core - ERROR - Adaptive stopping due to error in <closed TCP>: Stream is closed: while trying to call remote method 'retire_workers'

When this happens, the cluster is unable to scale down to the minimum number of workers, even after the persisted DataFrame is deleted.

What you expected to happen:
Adaptive to be resilient to preemption.

Minimal Complete Verifiable Example:
You can reproduce this by generating 200+ GB of random timeseries data and persisting it onto a Dask cluster. Once the DataFrame is about 90% persisted (per the status tab on the Dask scheduler's dashboard), kill a few workers to simulate preemption.
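Something along these lines (a rough sketch, not my exact code; the scheduler address is a placeholder, and the date range/frequency would need tuning until the frame actually reaches 200+ GB):

```python
import dask
from dask.distributed import Client, wait

# Connect to the adaptively scaled cluster (adapt(minimum=0, maximum=500))
client = Client("tcp://scheduler-address:8786")

# Random timeseries; widen the date range or shrink `freq` until it is 200+ GB
df = dask.datasets.timeseries(
    start="2000-01-01",
    end="2010-12-31",
    freq="1s",
    partition_freq="1d",
)

df = df.persist()
# While the dashboard's status tab shows roughly 90% of the frame in memory,
# kill a few worker processes to simulate preemption, then watch the scheduler
# logs for "Adaptive stop".
wait(df)
```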

Anything else we need to know?:
Are there good reasons for calling .stop() on Adaptive when this happens? I've tried removing the call to .stop() locally, and the cluster seems to run fine afterwards.

I am also aware of issue #4421. The solution there was for workers to poll a metadata URL exposed by public cloud providers to get an early warning of preemption before shutting down gracefully. However, that is not an option in my case.

Environment:

  • Dask version: 2.30.0
  • Distributed version: 2.30.0
  • Python version: 3.7
@jacobtomlinson
Member

Thanks for raising this.

It is not 100% clear to me why we call Adaptive's stop here. My guess would be that for cases where scaling up is failing, we wouldn't want adaptive to hammer some service that is rejecting our jobs.

It is interesting to see this failing when things are scaling down too. I don't think this should be happening. I think it would be reasonable to refactor the code here to ignore exceptions from scaling down, but stop adapting if we fail to scale up. Do you have any interest in raising a PR?
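Roughly what I have in mind, as a sketch only: the names loosely mirror adaptive_core, and the recommendations/safe_target/scale_up/scale_down/stop attributes are assumed to be provided by the surrounding AdaptiveCore machinery.

```python
import logging

logger = logging.getLogger("distributed.deploy.adaptive_core")


class TolerantAdaptiveSketch:
    """Sketch: tolerate scale-down failures, stop adapting only on scale-up failures."""

    async def adapt(self):
        recommendations = await self.recommendations(await self.safe_target())
        status = recommendations.pop("status")
        if status == "up":
            try:
                await self.scale_up(**recommendations)
            except OSError:
                # Retrying a failing scale-up could hammer the provider,
                # so keep the current behaviour and stop adapting.
                logger.error("Adaptive stopping due to error while scaling up", exc_info=True)
                self.stop()
        elif status == "down":
            try:
                await self.scale_down(**recommendations)
            except OSError:
                # A preempted worker can leave a closed comm behind ("Stream is
                # closed"); log it and retry on the next adaptive cycle instead
                # of stopping adaptivity for the whole cluster.
                logger.warning("Error while scaling down; will retry", exc_info=True)
```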

I'm also curious about your compute. When a node is preempted, does the process receive any signals? In #2844 there is some work to make workers exit more gracefully when receiving a SIGINT. Perhaps this would also help your use case.

@gerald732
Contributor Author

gerald732 commented Mar 16, 2021

Thanks for the quick response.

What I mean is: if the cluster fails to scale up after Adaptive calls .scale() in spec.py, it wouldn't be hammering other services (i.e. the resource manager). scale() simply appends to _worker_spec until it reaches the maximum number of workers set in adapt(). Even if the workers subsequently fail to spin up after being submitted to the resource manager, the state of _worker_spec stays the same.

It would only hammer the resource manager if an exception were thrown at the time start() is called, which is unlikely unless there is a connectivity issue with the provider.
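To illustrate what I mean (a simplified sketch of the behaviour, not the actual spec.py source):

```python
class SpecClusterSketch:
    """Illustration only: scale() edits in-memory bookkeeping, not the provider."""

    def __init__(self, new_spec, maximum):
        self.worker_spec = {}    # name -> spec of the workers we want to exist
        self.new_spec = new_spec
        self.maximum = maximum
        self._i = 0

    def scale(self, n):
        # Adaptive calling scale() repeatedly only adds entries here, capped at
        # `maximum`; the resource manager is contacted later, when the workers
        # are actually started, so a failed scale-up is not retried in a tight
        # loop against the provider.
        while len(self.worker_spec) < min(n, self.maximum):
            self.worker_spec[self._i] = self.new_spec
            self._i += 1
```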

Yup, happy to contribute.

It receives a SIGTERM, then a forceful SIGKILL.

In #2844, I see that SIGTERM and SIGINT are given the same treatment. I would think that SIGTERM should also instruct the workers to shut down gracefully?
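For what it's worth, here is roughly what I'd expect to happen on SIGTERM (a sketch only, assuming the worker's event loop runs in the main thread; install_sigterm_handler is an illustrative helper of mine, not part of distributed):

```python
import signal

from tornado.ioloop import IOLoop


def install_sigterm_handler(worker):
    """Treat SIGTERM like SIGINT: retire the worker gracefully before it exits."""

    def _handle(signum, frame):
        # add_callback_from_signal is safe to call from inside a signal handler;
        # Worker.close_gracefully() asks the scheduler to move this worker's
        # data elsewhere before the worker shuts down.
        IOLoop.current().add_callback_from_signal(worker.close_gracefully)

    signal.signal(signal.SIGTERM, _handle)
```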

@jacobtomlinson
Member

Yeah that's reasonable. Perhaps @mrocklin can remember why stop is called there.

> I would think that SIGTERM should also instruct the workers to shut down gracefully?

Yeah sounds reasonable.

@gerald732
Contributor Author

gerald732 commented Mar 26, 2021

> Yeah that's reasonable. Perhaps @mrocklin can remember why stop is called there.
>
> I would think that SIGTERM should also instruct the workers to shut down gracefully?
>
> Yeah sounds reasonable.

I've created a PR for this: #4633

jacobtomlinson pushed a commit that referenced this issue Mar 31, 2021
Co-authored-by: James Bourbeau <jrbourbeau@gmail.com>