-
-
Notifications
You must be signed in to change notification settings - Fork 719
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Close workers gracefully #2892
Close workers gracefully #2892
Conversation
Merging this tomorrow if there are no further comments (Although comments would be quite welcome here) |
Do you anticipate this making fixing #2788 easier? Your comment in #2788 (comment) makes it sound like "yes". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implementation looks good at a glance.
It's the action that I think we would want to take when handling a SIGINT. |
Merging this shortly if there are no further comments |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this @mrocklin, just a fiew comment on the documentation.
Co-Authored-By: Guillaume Eynard-Bontemps <g.eynard.bontemps@gmail.com>
Now that we've added --lifetime-restart this is more clear there
…uted into worker-close-gracefully
87c3082
to
2061fd4
Compare
e11eaed
to
3e7aa67
Compare
Just a quick comment (I don't think this is a problem for this PR, since the signal handling is only part of it AFAICT): for SGE the signal you get when the job exceed the walltime is a |
Ouch! AFAIK Slurm and PBS send a |
Good to know that other schedulers are a bit nicer regarding job termination. I found this which seems to indicate that for SGE |
This allows workers to optionally terminate themselves gracefully after a predetermined time. This can be helpful in a few contexts:
This is configurable as keywords to the
Worker
orNanny
classes, in config values, or with CLI. Here is an example with CLI.Restart to clear state
This will kill the worker roughly 1 hour from now +- a range of 5 minutes (to avoid killing all of our workers at the same time). It will also allow that worker to be restarted afterwards
Restart to avoid walltime death
Here we don't try to restart the worker (no point) and we choose a time a bit before our 60m walltime.
Also cc @jacobtomlinson @jcrist for general deployment information
For folks who want to review. I recommend going commit-by-commit (it should be fairly clean). (this is also a decently simple/educational PR to review)