Gracefully restart a worker from a nanny #2861

mrocklin · 2019-07-19T20:55:23Z

A common request is to restart a single worker into a clean state. This might be to refresh the imported software environment or to clear out leaked memory. To do this cleanly, a worker needs to stop accepting work, offload its data to peers, and then close itself and let the nanny restart it. We can do all of these steps today, but it's a bit of a manual process. It would be nice to make it easier.

As an example use case, long-running Dask workers (days or weeks) may want to refresh themselves every hour:

dask-worker scheduler-address:8786 --refresh-restart 1h


filmor · 2019-07-23T09:22:31Z

This is also connected to #1823 in my opinion (i.e. allowing explicit restart via a Client).

Automatic restart would be nice as well, though :)

This allows workers to optionally terminate themselves gracefully after a predetermined time. This can be helpful in a few contexts:

1. We receive a SIGINT, and know that we need to clean up quickly (though note that the signal handlers are not implemented as part of this commit)
2. We know that we'll be kicked off at a certain time, such as in one hour from now, as is often specified by HPC job schedulers
3. We just want to refresh our workers every once in a while, because we know that our code leaks some memory.

Fixes #2861

This is configurable as keywords to the `Worker` or `Nanny` classes, in config values, or with CLI. Here is an example with CLI.

### Restart to clear state

```
dask-worker scheduler:8786 --lifetime 1hr --lifetime-restart --lifetime-stagger 5m
```

This will kill the worker roughly 1 hour from now +/- a range of 5 minutes (to avoid killing all of our workers at the same time). It will also allow that worker to be restarted afterwards.

### Restart to avoid walltime death

```
dask-worker scheduler:8786 --lifetime 58m
```

Here we don't try to restart the worker (no point) and we choose a time a bit before our 60m walltime.
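To make the stagger idea above concrete, here is a minimal sketch of how a lifetime could be shifted by a random offset so that workers started together do not all close at once. It uses only the standard library. The function name `staggered_lifetime` and its parameters are hypothetical, chosen for illustration, and are not taken from the change in #2892. It assumes the offset is drawn uniformly from the stagger range, which may not match the actual implementation.

```python
import random


def staggered_lifetime(lifetime, stagger, rng=random):
    """Return `lifetime` (seconds) shifted by a uniform random offset.

    Hypothetical helper for illustration only. The offset is drawn
    uniformly from [-stagger, +stagger), so workers started at the
    same moment end up with slightly different deadlines.
    """
    if lifetime <= 0:
        raise ValueError("lifetime must be positive")
    if stagger < 0:
        raise ValueError("stagger must be non-negative")
    # rng.random() is in [0, 1); rescale it to [-1, 1) and apply the stagger
    offset = (rng.random() * 2 - 1) * stagger
    # Never return a negative deadline, even if stagger exceeds lifetime
    return max(0.0, lifetime + offset)


if __name__ == "__main__":
    hour = 60 * 60
    five_minutes = 5 * 60
    for i in range(4):
        seconds = staggered_lifetime(hour, five_minutes)
        print(f"worker {i}: close after {seconds / 60:.1f} minutes")
```

With a lifetime of one hour and a stagger of five minutes, each deadline lands somewhere between 55 and 65 minutes. A stagger of zero leaves the lifetime unchanged, which is the natural choice for the walltime case where the deadline is fixed by the job scheduler.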

mrocklin mentioned this issue Jul 27, 2019

Close workers gracefully #2892

Merged

mrocklin closed this as completed in #2892 Jul 31, 2019
