Worker lifecycle hooks #3300

Closed
jacobtomlinson opened this issue Dec 4, 2019 · 8 comments
@jacobtomlinson (Member)

Sometimes workers get killed, memory is lost, and tasks need to be run again. Many cloud providers offer a cheaper compute tier (spot/preemptible instances) that can be reclaimed at any time in exchange for a discount, and this happens regularly when using those services.

Most of these services give a warning before a machine is pulled. It would be nice to take advantage of that warning: stop the worker from executing new tasks and ask it to shuffle its memory to other workers.

In Kubernetes a node can be cordoned (do not accept new work) and drained (move existing work to another node) via API calls. This is the kind of functionality that would be useful here also.

A couple of questions:

  • Is it currently possible to tell a worker not to pick up new tasks?
  • Are workers able to be drained of tasks and memory currently via an RPC call?
  • Given that this logic would be cloud-provider specific, could we implement worker hooks which run another process alongside the worker automatically (via entrypoints?) and communicate with the worker via RPC to manage these kinds of lifecycle events?
@mrocklin (Member) commented Dec 4, 2019

The best current work on this is here: #2844

How do the cloud providers signal that it's time for the process to wrap up? Is it with SIGINT? If so then we might be able to just clean up that PR and be done.

@mrocklin (Member) commented Dec 4, 2019

Or rather, the cleanest way to do this with that PR would be to send a SIGINT signal. We could also expose the same functionality via an HTTP route or whatnot, though.
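For illustration, trapping SIGINT and handing off to a graceful drain could look something like this. This is only a sketch under assumptions: it assumes a worker object exposing an async close_gracefully (as distributed's Worker does) and an asyncio event loop; the actual handling lives in the PR linked above, and install_graceful_sigint is a hypothetical helper name.

```python
# Sketch: translate SIGINT into a graceful drain rather than an abrupt exit.
# Assumes `worker.close_gracefully()` is a coroutine (as on distributed's
# Worker) and that `loop` is the asyncio event loop the worker runs on.
import signal


def install_graceful_sigint(worker, loop):
    def handler(signum, frame):
        # Signal handlers should return quickly; hand the real work
        # off to the event loop thread.
        loop.call_soon_threadsafe(
            lambda: loop.create_task(worker.close_gracefully())
        )

    signal.signal(signal.SIGINT, handler)
```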

@jacobtomlinson (Member, Author) commented Dec 4, 2019

It varies between cloud provider. On AWS, for example, there is a magic URL that you need to poll which will tell you when the instance will be terminated (if it is scheduled for termination). The most notice you will get is two minutes.
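Concretely, that AWS poll can be sketched as a small check function. This is a sketch only: the URL is the spot instance metadata endpoint mentioned later in this thread, which returns HTTP 200 with a timestamp once termination is scheduled and an error (404) otherwise; off AWS the address is simply unreachable, and the function name is illustrative.

```python
# Sketch of the AWS spot interruption check. On AWS the instance metadata
# service returns HTTP 200 with a termination timestamp once the instance
# is scheduled for reclamation, and 404 otherwise.
import urllib.error
import urllib.request

TERMINATION_URL = "http://169.254.169.254/latest/meta-data/spot/termination-time"


def termination_scheduled(timeout=1.0):
    """Return True if the metadata service reports a scheduled termination."""
    try:
        urllib.request.urlopen(TERMINATION_URL, timeout=timeout)
        return True
    except urllib.error.HTTPError:
        return False  # 404: no termination notice yet
    except OSError:
        return False  # not on AWS, or metadata service unreachable
```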

This is why I think we should enable some way of adding hooks because that polling logic doesn't really belong in here. Maybe it should be part of dask-cloudprovider and hook into the workers when they start via an entry point. Or maybe that package should provide a dask-aws-worker subclass that adds this logic?

Either way, if the process gets notified of imminent shutdown, sending a SIGINT sounds reasonable.

I'm not sure what the right way is to implement this.

@mrocklin (Member) commented Dec 4, 2019

There is a close_gracefully method on the worker and nanny that we should probably use here.

We would then create a WorkerPlugin or preload script that periodically checked that address (once every five seconds?) and then called worker.close_gracefully if it was true.
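Such a preload could be sketched like this. It is a sketch, not distributed's actual implementation: dask_setup is the real preload hook and close_gracefully the real worker method, but check_for_notice is a hypothetical stand-in for the cloud-specific poll, and close_gracefully's exact signature may vary between versions.

```python
# Sketch of a worker preload that drains the worker when a termination
# notice appears. `check_for_notice` is a hypothetical stand-in for the
# cloud-specific poll (e.g. the AWS metadata check); `dask_setup` is the
# hook distributed calls for --preload modules.
import asyncio


def check_for_notice():
    """Hypothetical: return True once the cloud signals imminent shutdown."""
    return False


async def watch(worker, check=check_for_notice, interval=5):
    # Poll every few seconds; once a notice appears, stop accepting work
    # and move data to other workers via close_gracefully.
    while not check():
        await asyncio.sleep(interval)
    await worker.close_gracefully()


def dask_setup(worker):
    # The worker's event loop is already running when preloads load, so
    # schedule the watcher as a background task on it.
    worker.loop.add_callback(watch, worker)
```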

@jacobtomlinson (Member, Author)

Ah great ok!

I think --preload can take a module. So for AWS I could add something in dask-cloudprovider like dask_cloudprovider.aws.termination_watch which would poll the API and call worker.close_gracefully if it got notified.

We would need to schedule this as a repeating async task as I assume dask_setup is synchronous. Do you have a preference for how this should be implemented?

Then in ECSCluster and FargateCluster we would add --preload 'dask_cloudprovider.aws.termination_watch' to the startup?

@mrocklin (Member) commented Dec 4, 2019

grep for self.periodic_callbacks. There are a few examples there in the worker.
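For anyone following along, the pattern being pointed at looks roughly like this. It is a sketch using tornado's PeriodicCallback, which is what distributed's periodic_callbacks dict holds internally; check_for_notice and register_termination_watch are hypothetical names, and the callback key is illustrative.

```python
# Sketch of registering a periodic check the way distributed's Worker does
# internally (see self.periodic_callbacks in worker.py). tornado's
# PeriodicCallback takes a callback and an interval in milliseconds.
# `check_for_notice` is hypothetical; a real poll would hit the cloud
# provider's metadata endpoint.
from tornado.ioloop import PeriodicCallback


def register_termination_watch(worker, check_for_notice, interval_ms=5000):
    async def maybe_drain():
        if check_for_notice():
            await worker.close_gracefully()

    pc = PeriodicCallback(maybe_drain, interval_ms)
    worker.periodic_callbacks["termination-watch"] = pc
    pc.start()
    return pc
```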

@sodre (Contributor) commented Dec 4, 2019

For the AWS ECS/EC2 case, this post should be helpful. For Fargate spot pricing I am not sure what the callbacks are yet.

@jacobtomlinson (Member, Author)

Thanks @sodre. I was just planning on polling http://169.254.169.254/latest/meta-data/spot/termination-time. I think this should still work within Fargate.

Thanks @mrocklin will do!

I think I have enough information to move over to dask-cloudprovider and it seems like no changes are necessary here after all 🎉.
