WIP: Make workers gracefully handle sigint #2844
@TomAugspurger I'd very much appreciate you picking this up, as I don't think I can dedicate much time to it in the next few weeks...
OK, I'll try to get #2788 sorted out today.
Well, that whole "step away from the problem for a bit" thing actually worked.
diff --git a/distributed/cli/dask_worker.py b/distributed/cli/dask_worker.py
index e23a8ab9..5a396686 100755
--- a/distributed/cli/dask_worker.py
+++ b/distributed/cli/dask_worker.py
@@ -394,6 +394,7 @@ def main(
raise TimeoutError("Timed out starting worker.") from None
finally:
logger.info("End worker")
+ return 0
def go():
is all I was missing for #2788 :) Now to write a test. edit: never mind, that's not working :/
On distributed master, sending SIGINT to a worker results in a TimeoutError raised from Tornado, which is not at all what happened. This test checks that the error is not raised.
Force-pushed the branch from e4064b6 to 8770d44.
OK, I've fiddled with this a bit. Things seem to behave well on Linux, but Windows CI is unhappy.
for sig in [signal.SIGINT, signal.SIGTERM]:
    asyncio.get_event_loop().add_signal_handler(
        sig, functools.partial(on_signal, sig)
    )
I expect Windows will not be happy with this. https://stackoverflow.com/questions/45987985/asyncio-loops-add-signal-handler-in-windows
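One possible way around this, sketched here as an assumption and not as what the PR does: try `loop.add_signal_handler` first and fall back to the plain `signal.signal` API when the loop raises `NotImplementedError`, which is what asyncio event loops do on Windows. The helper name `install_handlers` and its return value are hypothetical and exist only for illustration; `on_signal` is assumed to be a plain callable taking the signal number.

```python
import asyncio
import functools
import signal


def install_handlers(loop, on_signal, signals=(signal.SIGINT, signal.SIGTERM)):
    """Register on_signal(signum) for each signal.

    Hypothetical helper: returns a dict recording which mechanism was
    used per signal ("loop" or "signal"), purely to make the sketch testable.
    """
    used = {}
    for sig in signals:
        try:
            # Preferred path on Unix: the loop invokes the callback itself.
            loop.add_signal_handler(sig, functools.partial(on_signal, sig))
            used[sig] = "loop"
        except NotImplementedError:
            # Assumed Windows path: asyncio loops do not implement
            # add_signal_handler there, so use signal.signal and hand the
            # call over to the loop thread-safely.
            signal.signal(
                sig,
                lambda signum, frame: loop.call_soon_threadsafe(on_signal, signum),
            )
            used[sig] = "signal"
    return used
```

Both `add_signal_handler` and `signal.signal` have to be called from the main thread, so this would need to run before any worker threads take over signal handling.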
if signum == signal.SIGINT:
    logger.info("Gracefully closing worker because of SIGINT call")
    await asyncio.gather(*[n.close_gracefully() for n in nannies])
Would it be reasonable to give SIGTERM the same treatment as SIGINT as well?
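For concreteness, a minimal sketch of what treating both signals the same might look like. Everything except `close_gracefully` is an assumption for illustration: `FakeNanny` is a stand-in for the real nanny objects, and the `GRACEFUL_SIGNALS` set and the boolean return value are hypothetical.

```python
import asyncio
import signal

# Hypothetical: the set of signals that trigger a graceful shutdown.
GRACEFUL_SIGNALS = {signal.SIGINT, signal.SIGTERM}


class FakeNanny:
    """Stand-in for a nanny; only records that it was closed gracefully."""

    def __init__(self):
        self.closed_gracefully = False

    async def close_gracefully(self):
        self.closed_gracefully = True


async def on_signal(signum, nannies):
    """Close all nannies gracefully for SIGINT or SIGTERM.

    Returns True if a graceful close was performed, False otherwise.
    """
    if signum in GRACEFUL_SIGNALS:
        await asyncio.gather(*[n.close_gracefully() for n in nannies])
        return True
    return False
```

Whether SIGTERM should really behave this way depends on how job schedulers and process managers signal workers, which this sketch does not settle.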
Was hacking with @mrocklin trying to fix dask/dask-jobqueue#122, ran into #2788. These are early attempts to fix both things. @TomAugspurger you might find my test code helpful, if overly verbose (I just copied the worker/scheduler creation from the test above it).
The basic idea behind the unregister_with_scheduler coroutine is that the cluster (apparently? @mrocklin told me this) sends SIGINT to processes before killing them for exceeding their time allocation. We can use this to close out the workers with safe=True so that the tasks running on them are not marked as suspicious.