[WIP] Flag workers as non-active and pause in case of graceful shutdown #3564

Closed
fjetter wants to merge 5 commits

Conversation

@fjetter (Member) commented Mar 11, 2020

This is still WIP and currently fails due to improper treatment of the pause/worker-busy logic, but I would like to address that in a separate PR since it turns out to be non-trivial.

This PR is intended to improve the treatment of worker retirement, which I believe is a valuable addition even if the replication/retirement logic changes in the mid/long term.

There are a few assumptions in here:

  1. Graceful worker shutdown is time-limited. In our particular scenario, the cluster manager issues a graceful shutdown and, after a certain grace period, the worker is terminated.
  2. No matter how the details are implemented, a worker requires some time before it can actually shut down gracefully. There are two primary reasons for this: first, the worker may currently be executing a task which could run for minutes; second, the worker needs to somehow evict/replicate all tasks it holds in memory, which also takes some time.

Combining the two leaves us with the fact that the Scheduler.retire_workers coroutine may not finish before the worker has to complete its shutdown, i.e. the coroutine may never reach the Scheduler.remove_worker(address=w, safe=True) call, which removes the worker and all of its remaining tasks safely and transitions those tasks into the ready state without increasing the suspicious counter.
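To make the timing race concrete, here is a minimal sketch of the cluster-manager scenario under the assumptions above; it is not code from this PR, and `GRACE_PERIOD` / `retire_with_deadline` are hypothetical names.

```python
import asyncio

GRACE_PERIOD = 60  # seconds; chosen by the cluster manager, illustrative value

async def retire_with_deadline(scheduler, address):
    """Gracefully retire one worker, but terminate it after the grace period regardless."""
    try:
        # retire_workers replicates the worker's in-memory data elsewhere before
        # removing it; long-running tasks or large amounts of data can push this
        # past the grace period.
        await asyncio.wait_for(
            scheduler.retire_workers(workers=[address]), timeout=GRACE_PERIOD
        )
    except asyncio.TimeoutError:
        # The worker is terminated anyway. retire_workers never reached
        # remove_worker(address, safe=True), so the worker's death looks
        # unexpected to the scheduler and the suspicious counters increase.
        pass
```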

Even worse, due to the scheduler-wide lock around replication, the scheduler cannot proceed to retire further workers while the first retirement blocks, so the above issue may cascade.

This change introduces the concept of a retirement notification: the first step of retirement is to remove the worker from an "active" list and to notify the worker to prepare for shutdown (see the sketch after this list). This has two benefits:

  1. Once a worker is no longer active, it is not regarded as a viable option for work stealing, ordinary task assignment, replication, scatter, etc. For all intents and purposes, this worker is already dead/dying from the scheduler's point of view and should be treated that way. The only reason not to remove it from the workers dict directly is that we still want to accept submissions from it and still want to be able to close it properly after it has replicated its data, i.e. the scheduler still knows about it but does not consider it for new assignments.
  2. If a non-active worker actually dies, its tasks are no longer flagged as suspicious, i.e. the retry counter does not increase, since this was an expected shutdown/failure.
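As a rough sketch of the idea (not the actual diff; the class and method names below are simplified stand-ins for the scheduler internals):

```python
class SchedulerSketch:
    """Simplified stand-in illustrating the 'active' collection."""

    def __init__(self):
        self.workers = {}  # address -> worker state; includes retiring workers
        self.active = {}   # address -> worker state; excludes retiring workers

    def add_worker(self, address, ws):
        self.workers[address] = ws
        self.active[address] = ws

    def notify_retirement(self, address):
        # Step 1 of retirement: stop offering this worker for new work, but keep
        # it in self.workers so we can still receive its data and close it
        # cleanly once replication has finished.
        self.active.pop(address, None)

    def assignment_candidates(self):
        # Work stealing, task assignment, replication, scatter, etc. would draw
        # from this view instead of self.workers.
        return list(self.active.values())
```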

Closes #3526

@mrocklin (Member) commented May 5, 2020

@fjetter checking in here. I notice that this was WIP/Draft. Did you want review here?

@fjetter (Member, Author) commented May 5, 2020

It's still WIP because I haven't been able to clean up the code yet and some tests are still failing. I wouldn't recommend diving too deep into the code changes, but there are a few specific things I'd like to get feedback on:

The new self.active attribute, which filters out retired workers but is otherwise similar to self.workers, is a quite significant change to the scheduler code. Is this something we would accept? Also, should we filter at runtime or try to maintain an additional set similar to idle, saturated, etc.? (We loop over self.workers in many, many places, so I figured it shouldn't be that expensive, but some people run very large clusters, and this might be an opportunity to clean things up.)
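For illustration, the two alternatives could look roughly like this (the `retiring` and `active` attributes here are illustrative, not the scheduler's real API):

```python
# (a) Filter at runtime: derive the active view on demand from self.workers.
def active_workers(scheduler):
    return {
        addr: ws
        for addr, ws in scheduler.workers.items()
        if addr not in scheduler.retiring
    }

# (b) Maintain an additional collection, kept in sync the same way idle and
#     saturated are, so hot paths avoid re-filtering self.workers every time.
def flag_retiring(scheduler, addr):
    scheduler.retiring.add(addr)
    scheduler.active.pop(addr, None)
```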

I noticed some irregularities between the nanny and a worker in terms of the states (and handlers) for closed / closing / closing-gracefully / retire / etc. Is there a general concept of how things are supposed to work? For instance, if there is a nanny, am I supposed to close the nanny AND the worker, or should I just close the nanny and expect the nanny to then close the worker? Especially w.r.t. graceful downscaling this seems quite entangled.

@mrocklin (Member) commented May 9, 2020

> I noticed some irregularities between the nanny and a worker in terms of the states (and handlers) for closed / closing / closing-gracefully / retire / etc. Is there a general concept of how things are supposed to work? For instance, if there is a nanny, am I supposed to close the nanny AND the worker, or should I just close the nanny and expect the nanny to then close the worker? Especially w.r.t. graceful downscaling this seems quite entangled.

My guess is that there isn't a principled approach here, but that there should be.

@mrocklin (Member) commented May 9, 2020

> The new self.active attribute, which filters out retired workers but is otherwise similar to self.workers, is a quite significant change to the scheduler code. Is this something we would accept? Also, should we filter at runtime or try to maintain an additional set similar to idle, saturated, etc.? (We loop over self.workers in many, many places, so I figured it shouldn't be that expensive, but some people run very large clusters, and this might be an opportunity to clean things up.)

I think that keeping track of which workers are appropriate targets for work is generally a good idea. There are lots of things that we might want to check here, including if they're retiring, as you have now, but also if they're paused due to being out of memory, and probably other issues. I haven't yet had enough time to think about this well enough to have a good answer here. Probably it would be useful for someone to try to gather requirements on all the times when we might want to exclude workers from some activity, and the criteria by which we would want to exclude them. Having a larger set of situations might help us to make a decision like this with more clarity.
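One way to gather those requirements would be a single predicate that collects every exclusion criterion in one place; the attribute names below are purely illustrative, not the scheduler's actual API:

```python
def is_viable_target(ws) -> bool:
    """Would this worker be an appropriate target for new work?"""
    return (
        ws.status == "running"   # not paused due to memory pressure
        and not ws.retiring      # not flagged for graceful shutdown
        and not ws.closing       # not already shutting down
    )
```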

cc'ing @crusaderky because I think that he's interested in this problem.

Base automatically changed from master to main March 8, 2021 19:04
@fjetter fjetter closed this Oct 7, 2021
Linked issue: Bug: putative race condition when scaling down LocalCluster