Gracefully restart a worker from a nanny #2861

Closed
mrocklin opened this issue Jul 19, 2019 · 1 comment · Fixed by #2892

@mrocklin
Member

A common request is to restart a single worker into a clean state. This might be to refresh the imported software environment or to clear out leaked memory. To do this cleanly, a worker needs to stop accepting work, offload its data to peers, and then close itself so that the nanny can restart it. We can do all of these steps today, but it's a bit of a manual process. It would be nice to make it easier.

As an example use case, long-running Dask workers (days or weeks) may want to refresh themselves every hour, for example with something like:

dask-worker scheduler-address:8786 --refresh-restart 1h

@filmor
Contributor

filmor commented Jul 23, 2019

This is also connected to #1823 in my opinion (i.e. allowing explicit restart via a Client).

Automatic restart would be nice as well, though :)

mrocklin added a commit that referenced this issue Jul 31, 2019
This allows workers to optionally terminate themselves gracefully after a predetermined time.  This can be helpful in a few contexts:

1.  We receive a SIGINT, and know that we need to clean up quickly (though note that the signal handlers are not implemented as part of this commit)
2.  We know that we'll be kicked off at a certain time, such as in one hour from now, as is often specified by HPC job schedulers
3.  We just want to refresh our workers every once in a while, because we know that our code leaks some memory

Fixes #2861

This is configurable as keywords to the `Worker` or `Nanny` classes, in config values, or with the CLI.  Here are two examples with the CLI.

### Restart to clear state

```
dask-worker scheduler:8786 --lifetime 1hr --lifetime-restart --lifetime-stagger 5m
```

This will kill the worker roughly 1 hour from now, give or take up to 5 minutes (to avoid killing all of our workers at the same time).  It will also allow that worker to be restarted afterwards.
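
To illustrate the stagger idea, here is a minimal standalone sketch. It is not the code from the commit: `staggered_lifetime` is a hypothetical helper, and it assumes the offset is drawn uniformly from plus or minus the stagger, which matches the description above but may differ from the actual implementation.

```python
import random
from typing import Optional


def staggered_lifetime(
    lifetime_s: float,
    stagger_s: float,
    rng: Optional[random.Random] = None,
) -> float:
    """Shift lifetime_s by a random offset in [-stagger_s, +stagger_s].

    Hypothetical helper for illustration only; the uniform distribution
    is an assumption, not something confirmed by the commit message.
    """
    rng = rng or random.Random()
    return lifetime_s + rng.uniform(-stagger_s, stagger_s)


# One hour, give or take five minutes, for a handful of workers.
rng = random.Random(0)  # seeded only so the sketch is repeatable
deadlines = [staggered_lifetime(3600, 300, rng) for _ in range(5)]
for d in deadlines:
    print(f"worker closes after about {d / 60:.1f} minutes")
```

Because each worker draws its own offset, the shutdowns spread over a ten-minute window instead of all landing on the same instant.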

### Restart to avoid walltime death

```
dask-worker scheduler:8786 --lifetime 58m 
```

Here we don't try to restart the worker (no point) and we choose a time a bit before our 60m walltime.
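
The arithmetic behind choosing that value can be sketched as follows. `safe_lifetime` is a hypothetical helper, not part of Dask; it assumes that when a stagger is also used, the latest possible shutdown is the lifetime plus the full stagger, so both the stagger and a safety margin are subtracted from the walltime.

```python
def safe_lifetime(walltime_s: float, margin_s: float, stagger_s: float = 0.0) -> float:
    """Return a lifetime whose latest possible shutdown still leaves margin_s
    before the walltime limit.

    Hypothetical helper for illustration; assumes the stagger can delay
    shutdown by at most stagger_s.
    """
    lifetime = walltime_s - margin_s - stagger_s
    if lifetime <= 0:
        raise ValueError("margin and stagger leave no time for the worker to run")
    return lifetime


# 60-minute walltime with a 2-minute margin gives the 58m used above.
print(safe_lifetime(60 * 60, 2 * 60) / 60)

# Adding a 5-minute stagger pulls the nominal lifetime in further.
print(safe_lifetime(60 * 60, 2 * 60, stagger_s=5 * 60) / 60)
```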