Worker lifecycle hooks #3300

Closed
jacobtomlinson opened this issue Dec 4, 2019 · 8 comments
@jacobtomlinson (Member)

Sometimes workers get killed, memory is lost, and tasks need to be run again. Many cloud providers offer a cheaper compute tier (spot/preemptible instances) that can be reclaimed at any time in exchange for a discount, and this happens regularly when using those services.

Most of these services give a warning before a machine is pulled. It would be nice to take advantage of that warning: stop the worker from executing new tasks and ask it to shuffle its memory to other workers.

In Kubernetes a node can be cordoned (do not accept new work) and drained (move existing work to another node) via API calls. This is the kind of functionality that would be useful here also.

A couple of questions:

  • Is it currently possible to tell a worker not to pick up new tasks?
  • Are workers able to be drained of tasks and memory currently via an RPC call?
  • Given that this logic would be cloud-provider specific, could we implement worker hooks which run another process alongside the worker automatically (via entrypoints?) and communicate with the worker via RPC to manage these kinds of lifecycle events?
@mrocklin (Member) commented Dec 4, 2019

The best current work on this is here: #2844

How do the cloud providers signal that it's time for the process to wrap up? Is it with SIGINT? If so then we might be able to just clean up that PR and be done.

@mrocklin (Member) commented Dec 4, 2019

Or rather, the cleanest way to do this with that PR would be to send a SIGINT signal. We could also expose the same functionality via an HTTP route or whatnot, though.
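For illustration, trapping SIGINT and handing off to a graceful drain could look something like this. This is only a sketch under assumptions: it assumes a worker object exposing an async close_gracefully (as distributed's Worker does) and an asyncio event loop; the actual handling lives in the PR linked above, and install_graceful_sigint is a hypothetical helper name.

```python
# Sketch: translate SIGINT into a graceful drain rather than an abrupt exit.
# Assumes `worker.close_gracefully()` is a coroutine (as on distributed's
# Worker) and that `loop` is the asyncio event loop the worker runs on.
import signal


def install_graceful_sigint(worker, loop):
    def handler(signum, frame):
        # Signal handlers should return quickly; hand the real work
        # off to the event loop thread.
        loop.call_soon_threadsafe(
            lambda: loop.create_task(worker.close_gracefully())
        )

    signal.signal(signal.SIGINT, handler)
```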

@jacobtomlinson (Member, Author) commented Dec 4, 2019

It varies between cloud provider. On AWS, for example, there is a magic URL that you need to poll which will tell you when the instance will be terminated (if it is scheduled for termination). The most notice you will get is two minutes.
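Concretely, that AWS poll can be sketched as a small check function. This is a sketch only: the URL is the spot instance metadata endpoint mentioned later in this thread, which returns HTTP 200 with a timestamp once termination is scheduled and an error (404) otherwise; off AWS the address is simply unreachable, and the function name is illustrative.

```python
# Sketch of the AWS spot interruption check. On AWS the instance metadata
# service returns HTTP 200 with a termination timestamp once the instance
# is scheduled for reclamation, and 404 otherwise.
import urllib.error
import urllib.request

TERMINATION_URL = "http://169.254.169.254/latest/meta-data/spot/termination-time"


def termination_scheduled(timeout=1.0):
    """Return True if the metadata service reports a scheduled termination."""
    try:
        urllib.request.urlopen(TERMINATION_URL, timeout=timeout)
        return True
    except urllib.error.HTTPError:
        return False  # 404: no termination notice yet
    except OSError:
        return False  # not on AWS, or metadata service unreachable
```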

This is why I think we should enable some way of adding hooks because that polling logic doesn't really belong in here. Maybe it should be part of dask-cloudprovider and hook into the workers when they start via an entry point. Or maybe that package should provide a dask-aws-worker subclass that adds this logic?

Either way, if the process gets notified of imminent shutdown, sending a SIGINT sounds reasonable.

I'm not sure what the right way is to implement this.

@mrocklin (Member) commented Dec 4, 2019

There is a close_gracefully method on the worker and nanny that we should probably use here.

We would then create a WorkerPlugin or preload script that periodically checked that address (once every five seconds?) and then called worker.close_gracefully if it was true.
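Such a preload could be sketched like this. It is a sketch, not distributed's actual implementation: dask_setup is the real preload hook and close_gracefully the real worker method, but check_for_notice is a hypothetical stand-in for the cloud-specific poll, and close_gracefully's exact signature may vary between versions.

```python
# Sketch of a worker preload that drains the worker when a termination
# notice appears. `check_for_notice` is a hypothetical stand-in for the
# cloud-specific poll (e.g. the AWS metadata check); `dask_setup` is the
# hook distributed calls for --preload modules.
import asyncio


def check_for_notice():
    """Hypothetical: return True once the cloud signals imminent shutdown."""
    return False


async def watch(worker, check=check_for_notice, interval=5):
    # Poll every few seconds; once a notice appears, stop accepting work
    # and move data to other workers via close_gracefully.
    while not check():
        await asyncio.sleep(interval)
    await worker.close_gracefully()


def dask_setup(worker):
    # The worker's event loop is already running when preloads load, so
    # schedule the watcher as a background task on it.
    worker.loop.add_callback(watch, worker)
```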

@jacobtomlinson (Member, Author)

Ah great ok!

I think --preload can take a module. So for AWS I could add something in dask-cloudprovider like dask_cloudprovider.aws.termination_watch which would poll the API and call worker.close_gracefully if it got notified.

We would need to schedule this as a repeating async task as I assume dask_setup is synchronous. Do you have a preference for how this should be implemented?

Then in ECSCluster and FargateCluster we would add --preload 'dask_cloudprovider.aws.termination_watch' to the startup?

@mrocklin (Member) commented Dec 4, 2019

grep for self.periodic_callbacks. There are a few examples there in the worker.
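For anyone following along, the pattern being pointed at looks roughly like this. It is a sketch using tornado's PeriodicCallback, which is what distributed's periodic_callbacks dict holds internally; check_for_notice and register_termination_watch are hypothetical names, and the callback key is illustrative.

```python
# Sketch of registering a periodic check the way distributed's Worker does
# internally (see self.periodic_callbacks in worker.py). tornado's
# PeriodicCallback takes a callback and an interval in milliseconds.
# `check_for_notice` is hypothetical; a real poll would hit the cloud
# provider's metadata endpoint.
from tornado.ioloop import PeriodicCallback


def register_termination_watch(worker, check_for_notice, interval_ms=5000):
    async def maybe_drain():
        if check_for_notice():
            await worker.close_gracefully()

    pc = PeriodicCallback(maybe_drain, interval_ms)
    worker.periodic_callbacks["termination-watch"] = pc
    pc.start()
    return pc
```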

@sodre (Contributor) commented Dec 4, 2019

For the AWS ECS/EC2 case, this post should be helpful. For Fargate spot pricing I am not sure what the callbacks are yet.

@jacobtomlinson (Member, Author)

Thanks @sodre. I was just planning on polling http://169.254.169.254/latest/meta-data/spot/termination-time. I think this should still work within Fargate.

Thanks @mrocklin will do!

I think I have enough information to move over to dask-cloudprovider and it seems like no changes are necessary here after all 🎉.
