Gracefully restart a worker from a nanny #2861
Comments
This is also connected to #1823 in my opinion (i.e. allowing explicit restart via a Automatic restart would be nice as well, though :)
mrocklin added a commit that referenced this issue on Jul 31, 2019
This allows workers to optionally terminate themselves gracefully after a predetermined time. This can be helpful in a few contexts:

1. We receive a SIGINT and know that we need to clean up quickly (though note that the signal handlers are not implemented as part of this commit)
2. We know that we'll be kicked off at a certain time, such as one hour from now, as is often specified by HPC job schedulers
3. We just want to refresh our workers every once in a while, because we know that our code leaks some memory

Fixes #2861

This is configurable as keywords to the `Worker` or `Nanny` classes, in config values, or with the CLI. Here is an example with the CLI.

### Restart to clear state

```
dask-worker scheduler:8786 --lifetime 1hr --lifetime-restart --lifetime-stagger 5m
```

This will kill the worker roughly 1 hour from now, +/- up to 5 minutes (to avoid killing all of our workers at the same time). It will also allow that worker to be restarted afterwards.

### Restart to avoid walltime death

```
dask-worker scheduler:8786 --lifetime 58m
```

Here we don't try to restart the worker (no point), and we choose a time a bit before our 60m walltime.
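To make the effect of `--lifetime-stagger` concrete, here is a minimal sketch of how a staggered lifetime could be computed. It is not taken from the commit: the function name `staggered_lifetime`, the use of plain seconds, and the uniform distribution are all assumptions made for illustration.

```python
import random


def staggered_lifetime(lifetime: float, stagger: float, rng: random.Random) -> float:
    """Return a lifetime in seconds drawn from [lifetime - stagger, lifetime + stagger].

    Hypothetical helper: the uniform draw is an assumption, not the
    library's documented behavior.
    """
    if lifetime < 0 or stagger < 0:
        raise ValueError("lifetime and stagger must be non-negative")
    # Pick a random offset so that workers started together do not all
    # shut down at the same instant.
    offset = rng.uniform(-stagger, stagger)
    # Never return a negative lifetime, even if stagger > lifetime.
    return max(0.0, lifetime + offset)


# Five workers with a 1 hour lifetime and a 5 minute stagger (both in seconds).
rng = random.Random(0)
lifetimes = [staggered_lifetime(3600, 300, rng) for _ in range(5)]
```

Each value in `lifetimes` falls between 3300 and 3900 seconds, so the workers close at different moments instead of all at once. With a stagger of 0, as in the walltime example, every worker gets exactly the requested lifetime.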
A common request is that people want to restart a single worker into a clean state. This might be to refresh the imported software environment or to clear out leaked memory. To do this cleanly, a worker needs to stop accepting work, offload its data to peers, and then close itself and let the nanny restart it. We can do all of these steps today, but it's a bit of a manual process. It would be nice to make it easier.
As an example use case, long-running Dask workers (days or weeks) may want to refresh themselves every hour.
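The sequence described above (stop accepting work, offload data to peers, close, let the nanny restart) can be sketched as a toy model in plain Python. Nothing here calls the distributed API: `ToyWorker` and `graceful_restart` are hypothetical names, and the round-robin offload is an assumption chosen only to keep the example short.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyWorker:
    """Hypothetical stand-in for a worker process; not a distributed class."""
    name: str
    accepting: bool = True
    data: dict = field(default_factory=dict)
    # Stands in for leaked memory or a stale software environment.
    leaked_state: list = field(default_factory=list)


def graceful_restart(worker: ToyWorker, peers: List[ToyWorker]) -> ToyWorker:
    """Retire `worker` without losing data and return its clean replacement."""
    if not peers:
        raise ValueError("need at least one peer to receive the data")
    # Step 1: stop accepting new work.
    worker.accepting = False
    # Step 2: offload every piece of data to a peer (round-robin, an assumption).
    for i, key in enumerate(list(worker.data)):
        peers[i % len(peers)].data[key] = worker.data.pop(key)
    # Steps 3 and 4: the old worker closes and the nanny starts a fresh
    # process under the same name, with none of the leaked state.
    return ToyWorker(name=worker.name)


old = ToyWorker("a", data={"x": 1, "y": 2, "z": 3}, leaked_state=["leak"])
peer_b, peer_c = ToyWorker("b"), ToyWorker("c")
fresh = graceful_restart(old, [peer_b, peer_c])
```

The ordering is the point of the sketch: data leaves the worker before it closes, so the replacement can start empty without any results being lost.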