
Adaptive stops when workers preempted #4591

Closed

gerald732 opened this issue Mar 15, 2021 · 4 comments

Comments

@gerald732
Contributor

gerald732 commented Mar 15, 2021

What happened:
I'm using a custom Dask provider to run a distributed Dask cluster on on-prem compute. These compute nodes can be preempted. I've switched on autoscaling with a minimum of 0 workers and a maximum of 500. Whenever I persist a DataFrame (200 GB+) onto the cluster and workers are preempted, this shows up in the logs:

distributed.deploy.adaptive_core - INFO - Adaptive stop

Looking at PR #4422, I noticed this log message was made more verbose there. I applied that change locally and saw this in the log:

distributed.deploy.adaptive_core - ERROR - Adaptive stopping due to error in <closed TCP>: Stream is closed: while trying to call remote method 'retire_workers'

When this happens, the cluster is unable to scale down to the minimum number of workers, even after the persisted DataFrame is deleted.

What you expected to happen:
Adaptive to be resilient to preemption.

Minimal Complete Verifiable Example:
You can reproduce this by generating 200+ GB of random timeseries data and persisting it onto a Dask cluster. Once the DataFrame is about 90% persisted (per the status tab on the Dask scheduler's dashboard), kill a few workers to simulate preemption.
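Something along these lines (a rough sketch, not my exact code; the scheduler address is a placeholder, and the date range/frequency would need tuning until the frame actually reaches 200+ GB):

```python
import dask
from dask.distributed import Client, wait

# Connect to the adaptively scaled cluster (adapt(minimum=0, maximum=500))
client = Client("tcp://scheduler-address:8786")

# Random timeseries; widen the date range or shrink `freq` until it is 200+ GB
df = dask.datasets.timeseries(
    start="2000-01-01",
    end="2010-12-31",
    freq="1s",
    partition_freq="1d",
)

df = df.persist()
# While the dashboard's status tab shows roughly 90% of the frame in memory,
# kill a few worker processes to simulate preemption, then watch the scheduler
# logs for "Adaptive stop".
wait(df)
```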

Anything else we need to know?:
Are there good reasons for calling .stop() on Adaptive when this happens? I've tried removing the call to .stop() locally, and the cluster seems to run fine afterwards.

I am also aware of issue #4421. The solution there was for workers to poll a metadata URL exposed by public cloud providers to get an early warning of preemption before shutting down gracefully. However, that is not an option in my case.

Environment:

  • Dask version: 2.30.0
  • Distributed version: 2.30.0
  • Python version: 3.7
@jacobtomlinson
Member

Thanks for raising this.

It is not 100% clear to me why we call Adaptive's stop here. My guess would be that for cases where scaling up is failing, we wouldn't want adaptive to hammer some service that is rejecting our jobs.

It is interesting to see this failing when things are scaling down too. I don't think this should be happening. I think it would be reasonable to refactor the code here to ignore exceptions from scaling down, but stop adapting if we fail to scale up. Do you have any interest in raising a PR?
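Roughly what I have in mind, as a sketch only: the names loosely mirror adaptive_core, and the recommendations/safe_target/scale_up/scale_down/stop attributes are assumed to be provided by the surrounding AdaptiveCore machinery.

```python
import logging

logger = logging.getLogger("distributed.deploy.adaptive_core")


class TolerantAdaptiveSketch:
    """Sketch: tolerate scale-down failures, stop adapting only on scale-up failures."""

    async def adapt(self):
        recommendations = await self.recommendations(await self.safe_target())
        status = recommendations.pop("status")
        if status == "up":
            try:
                await self.scale_up(**recommendations)
            except OSError:
                # Retrying a failing scale-up could hammer the provider,
                # so keep the current behaviour and stop adapting.
                logger.error("Adaptive stopping due to error while scaling up", exc_info=True)
                self.stop()
        elif status == "down":
            try:
                await self.scale_down(**recommendations)
            except OSError:
                # A preempted worker can leave a closed comm behind ("Stream is
                # closed"); log it and retry on the next adaptive cycle instead
                # of stopping adaptivity for the whole cluster.
                logger.warning("Error while scaling down; will retry", exc_info=True)
```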

I'm also curious about your compute. When a node is preempted, does the process receive any signals? In #2844 there is some work to make workers exit more gracefully when receiving a SIGINT. Perhaps this would also help your use case.

@gerald732
Contributor Author

gerald732 commented Mar 16, 2021

Thanks for the quick response.

What I mean is: if the cluster fails to scale up after Adaptive calls .scale() in spec.py, it wouldn't be hammering other services (i.e. the resource manager). scale() simply appends to _worker_spec until it reaches the maximum number of workers set in adapt(). Even if the workers subsequently fail to spin up after being submitted to the resource manager, the state of _worker_spec stays the same.

It would only hammer the resource manager if an exception were thrown at the time start() is called, which is unlikely unless there is a connectivity issue with the provider.
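To illustrate what I mean (a simplified sketch of the behaviour, not the actual spec.py source):

```python
class SpecClusterSketch:
    """Illustration only: scale() edits in-memory bookkeeping, not the provider."""

    def __init__(self, new_spec, maximum):
        self.worker_spec = {}    # name -> spec of the workers we want to exist
        self.new_spec = new_spec
        self.maximum = maximum
        self._i = 0

    def scale(self, n):
        # Adaptive calling scale() repeatedly only adds entries here, capped at
        # `maximum`; the resource manager is contacted later, when the workers
        # are actually started, so a failed scale-up is not retried in a tight
        # loop against the provider.
        while len(self.worker_spec) < min(n, self.maximum):
            self.worker_spec[self._i] = self.new_spec
            self._i += 1
```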

Yup, happy to contribute.

It receives a SIGTERM, then a forceful SIGKILL.

In #2844, I see that SIGTERM and SIGINT are given the same treatment. I would think that SIGTERM should also instruct the workers to shut down gracefully?
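For what it's worth, here is roughly what I'd expect to happen on SIGTERM (a sketch only, assuming the worker's event loop runs in the main thread; install_sigterm_handler is an illustrative helper of mine, not part of distributed):

```python
import signal

from tornado.ioloop import IOLoop


def install_sigterm_handler(worker):
    """Treat SIGTERM like SIGINT: retire the worker gracefully before it exits."""

    def _handle(signum, frame):
        # add_callback_from_signal is safe to call from inside a signal handler;
        # Worker.close_gracefully() asks the scheduler to move this worker's
        # data elsewhere before the worker shuts down.
        IOLoop.current().add_callback_from_signal(worker.close_gracefully)

    signal.signal(signal.SIGTERM, _handle)
```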

@jacobtomlinson
Member

Yeah that's reasonable. Perhaps @mrocklin can remember why stop is called there.

> I would think that SIGTERM should also instruct the workers to shut down gracefully?

Yeah sounds reasonable.

@gerald732
Contributor Author

gerald732 commented Mar 26, 2021

> Yeah that's reasonable. Perhaps @mrocklin can remember why stop is called there.
>
> I would think that SIGTERM should also instruct the workers to shut down gracefully?
>
> Yeah sounds reasonable.

I've created a PR for this: #4633

jacobtomlinson pushed a commit that referenced this issue Mar 31, 2021
Co-authored-by: James Bourbeau <jrbourbeau@gmail.com>