Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Make workers gracefully handle sigint #2844

Open
wants to merge 6 commits into
base: main
Choose a base branch
from
Open

Conversation

jni
Copy link

@jni jni commented Jul 15, 2019

Was hacking with @mrocklin trying to fix dask/dask-jobqueue#122, ran into #2788. These are early attempts to fix both things. @TomAugspurger you might find my test code helpful, if overly verbose (I just copied the worker/scheduler creation from the test above it).

The basic idea behind the unregister_with_scheduler coroutine is that the cluster (apparently? @mrocklin told me this) sends SIGINT to processes before killing them for exceeding their time allocation. We can use this to close out the workers with safe=True so that the tasks running on them are not marked as suspicious.

@jni
Copy link
Author

jni commented Jul 15, 2019

(Note: I do not want to unduly associate @mrocklin with the quality of this PR (or lack thereof) — @mrocklin was just patiently sitting next to me answering my questions and pointing me to the right files and SO answers. =)

@TomAugspurger
Copy link
Member

Thanks @jni. I haven't worked any more on #2788 since a couple weeks ago when it completely stumped me.

Do you plan to continue working on this? If not, I'll try to pick it up again by the end of the week.

@jni
Copy link
Author

jni commented Jul 23, 2019

@TomAugspurger I'd very much appreciate you picking this up, as I don't think I can dedicate much time to it in the next few weeks...

@TomAugspurger
Copy link
Member

OK, I'll try to get #2788 sorted out today.

@TomAugspurger
Copy link
Member

TomAugspurger commented Jul 23, 2019

Well, that whole "step away from the problem for a bit" thing actually worked.

diff --git a/distributed/cli/dask_worker.py b/distributed/cli/dask_worker.py
index e23a8ab9..5a396686 100755
--- a/distributed/cli/dask_worker.py
+++ b/distributed/cli/dask_worker.py
@@ -394,6 +394,7 @@ def main(
         raise TimeoutError("Timed out starting worker.") from None
     finally:
         logger.info("End worker")
+        return 0
 
 
 def go():

is all I was missing for #2788 :) Now to write a test.

edit: never mind, that's not working :/

jni and others added 5 commits November 30, 2019 08:39
On distributed master, sending SigInt to a worker results in a
TimeoutError raised from Tornado, which is not at all what happened.
This test checks that this error is not raised.
@mrocklin
Copy link
Member

mrocklin commented Dec 1, 2019

OK, I've fiddled with this a bit. Things seem to behave well on linux, but windows CI is unhappy.

Comment on lines +412 to +415
for sig in [signal.SIGINT, signal.SIGTERM]:
asyncio.get_event_loop().add_signal_handler(
sig, functools.partial(on_signal, sig)
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mrocklin
Copy link
Member

cc @jacobtomlinson

Base automatically changed from master to main March 8, 2021 19:03
Comment on lines +399 to +401
if signum == signal.SIGINT:
logger.info("Gracefully closing worker because of SIGINT call")
await asyncio.gather(*[n.close_gracefully() for n in nannies])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be reasonable to give SIGTERM the same treatment as SIGINT as well?

@jni jni requested a review from fjetter as a code owner January 23, 2024 10:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Handling workers with expiring allocation requests
5 participants