[WIP] Flag workers as non-active and pause in case of graceful shutdown #3564

Closed
fjetter wants to merge 5 commits

Conversation

@fjetter (Member) commented Mar 11, 2020

This is still WIP and currently fails due to improper treatment of the pause/worker-busy logic, but I would like to address that in a separate PR since it turns out to be non-trivial.

This PR is intended to improve the treatment of worker retirement, which I believe is a valuable addition even if the replication/retirement logic changes in the mid/long term.

There are a few assumptions in here:

  1. Graceful worker shutdown is time-limited. In our particular scenario, the cluster manager issues a graceful shutdown and, after a certain grace period, the worker is terminated.
  2. No matter how the details are implemented, a worker requires some time before it can actually shut down gracefully. There are two primary reasons for this: first, the worker may currently be executing a task which could run for minutes; second, the worker needs to somehow evict/replicate all tasks it holds in memory, which also takes some time.

Combining the two leaves us with the fact that the Scheduler.retire_workers coroutine may not finish before the worker has to complete its shutdown, i.e. the coroutine may never reach the Scheduler.remove_worker(address=w, safe=True) call, which removes the worker and all of its remaining tasks safely and transitions those tasks into the ready state without increasing the suspicious counter.
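To make the timing race concrete, here is a minimal sketch of the cluster-manager scenario under the assumptions above; it is not code from this PR, and `GRACE_PERIOD` / `retire_with_deadline` are hypothetical names.

```python
import asyncio

GRACE_PERIOD = 60  # seconds; chosen by the cluster manager, illustrative value

async def retire_with_deadline(scheduler, address):
    """Gracefully retire one worker, but terminate it after the grace period regardless."""
    try:
        # retire_workers replicates the worker's in-memory data elsewhere before
        # removing it; long-running tasks or large amounts of data can push this
        # past the grace period.
        await asyncio.wait_for(
            scheduler.retire_workers(workers=[address]), timeout=GRACE_PERIOD
        )
    except asyncio.TimeoutError:
        # The worker is terminated anyway. retire_workers never reached
        # remove_worker(address, safe=True), so the worker's death looks
        # unexpected to the scheduler and the suspicious counters increase.
        pass
```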

Even worse, due to the scheduler-wide lock around replication, the scheduler cannot proceed to retire further workers while the first retirement blocks, so the above issue may cascade.

This change introduces the concept of a retirement notification: the first step of retirement is to remove the worker from an "active" list and to notify the worker to prepare for shutdown (see the sketch after this list). This has two benefits:

  1. Once a worker is no longer active, it is not regarded as a viable option for work stealing, ordinary task assignment, replication, scatter, etc. For all intents and purposes, this worker is already dead/dying from the scheduler's point of view and should be treated that way. The only reason not to remove it from the workers dict directly is that we still want to accept submissions from it and still want to be able to close it properly after it has replicated its data, i.e. the scheduler still knows about it but does not consider it for new assignments.
  2. If a non-active worker actually dies, its tasks are no longer flagged as suspicious, i.e. the retry counter does not increase, since this was an expected shutdown/failure.
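As a rough sketch of the idea (not the actual diff; the class and method names below are simplified stand-ins for the scheduler internals):

```python
class SchedulerSketch:
    """Simplified stand-in illustrating the 'active' collection."""

    def __init__(self):
        self.workers = {}  # address -> worker state; includes retiring workers
        self.active = {}   # address -> worker state; excludes retiring workers

    def add_worker(self, address, ws):
        self.workers[address] = ws
        self.active[address] = ws

    def notify_retirement(self, address):
        # Step 1 of retirement: stop offering this worker for new work, but keep
        # it in self.workers so we can still receive its data and close it
        # cleanly once replication has finished.
        self.active.pop(address, None)

    def assignment_candidates(self):
        # Work stealing, task assignment, replication, scatter, etc. would draw
        # from this view instead of self.workers.
        return list(self.active.values())
```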

Closes #3526

@mrocklin (Member) commented May 5, 2020

@fjetter checking in here. I notice that this was WIP/Draft. Did you want review here?

@fjetter (Member, Author) commented May 5, 2020

It's still WIP because I haven't been able to clean up the code yet and some tests are still failing. I wouldn't recommend diving too deep into the code changes, but there are a few specific things I'd like to get feedback on:

The new self.active attribute, which filters out retired workers but is otherwise similar to self.workers, is a quite significant change to the scheduler code. Is this something we would accept? Also, should we filter at runtime or try to maintain an additional set similar to idle, saturated, etc.? (We loop over self.workers in many, many places, so I figured it shouldn't be that expensive, but some people run very large clusters, and this might be an opportunity to clean things up.)
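For illustration, the two alternatives could look roughly like this (the `retiring` and `active` attributes here are illustrative, not the scheduler's real API):

```python
# (a) Filter at runtime: derive the active view on demand from self.workers.
def active_workers(scheduler):
    return {
        addr: ws
        for addr, ws in scheduler.workers.items()
        if addr not in scheduler.retiring
    }

# (b) Maintain an additional collection, kept in sync the same way idle and
#     saturated are, so hot paths avoid re-filtering self.workers every time.
def flag_retiring(scheduler, addr):
    scheduler.retiring.add(addr)
    scheduler.active.pop(addr, None)
```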

I noticed some irregularities between the nanny and a worker in terms of the states (and handlers) for closed / closing / closing-gracefully / retire / etc. Is there a general concept of how things are supposed to work? For instance, if there is a nanny, am I supposed to close the nanny AND the worker, or should I just close the nanny and expect the nanny to then close the worker? Especially w.r.t. graceful downscaling this seems quite entangled.

@mrocklin (Member) commented May 9, 2020

> I noticed some irregularities between the nanny and a worker in terms of the states (and handlers) for closed / closing / closing-gracefully / retire / etc. Is there a general concept of how things are supposed to work? For instance, if there is a nanny, am I supposed to close the nanny AND the worker, or should I just close the nanny and expect the nanny to then close the worker? Especially w.r.t. graceful downscaling this seems quite entangled.

My guess is that there isn't a principled approach here, but that there should be.

@mrocklin (Member) commented May 9, 2020

> The new self.active attribute, which filters out retired workers but is otherwise similar to self.workers, is a quite significant change to the scheduler code. Is this something we would accept? Also, should we filter at runtime or try to maintain an additional set similar to idle, saturated, etc.? (We loop over self.workers in many, many places, so I figured it shouldn't be that expensive, but some people run very large clusters, and this might be an opportunity to clean things up.)

I think that keeping track of which workers are appropriate targets for work is generally a good idea. There are lots of things that we might want to check here, including if they're retiring, as you have now, but also if they're paused due to being out of memory, and probably other issues. I haven't yet had enough time to think about this well enough to have a good answer here. Probably it would be useful for someone to try to gather requirements on all the times when we might want to exclude workers from some activity, and the criteria by which we would want to exclude them. Having a larger set of situations might help us to make a decision like this with more clarity.
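One way to gather those requirements would be a single predicate that collects every exclusion criterion in one place; the attribute names below are purely illustrative, not the scheduler's actual API:

```python
def is_viable_target(ws) -> bool:
    """Would this worker be an appropriate target for new work?"""
    return (
        ws.status == "running"   # not paused due to memory pressure
        and not ws.retiring      # not flagged for graceful shutdown
        and not ws.closing       # not already shutting down
    )
```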

cc'ing @crusaderky because I think that he's interested in this problem.

Base automatically changed from master to main March 8, 2021 19:04
@fjetter fjetter closed this Oct 7, 2021
Linked issue: Bug: putative race condition when scaling down LocalCluster