Interrupt and sigterm before sigkill kernel. #620

Carreau · 2021-02-22T17:01:56Z

This implements the discussion in #618, to stop kernels more progressively:

1) Always send an interrupt before any shutdown request; the goal is to stop
   any processing that may be blocking an event loop.
2) Send the shutdown_request (no changes here).
3) Wait 50% of the wait time, then send a "terminate" (in quotes, as this
   depends on your platform/OS and on the type of kernel you have):
   - if the kernel is a subprocess.Popen instance, call .terminate(), which is
     SIGTERM on Unix and the same as .kill() on Windows; if it is not a Popen
     instance, send SIGTERM.
4) Wait the other 50% and kill (same as before).

This is done in both the sync client and the async client.

TBD: write tests and docs, and test more thoroughly.
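
For readers who want to see the staged sequence in isolation, here is a minimal sketch using only the standard library. It is not the code in this PR: `progressive_stop` and `wait_for_exit` are hypothetical names, there is no messaging channel, so the interrupt stands in for both the interrupt and the shutdown_request, and the interrupt step assumes a POSIX platform.

```python
import os
import signal
import subprocess
import sys
import time


def wait_for_exit(proc, timeout, pollinterval=0.05):
    """Poll until the process exits or the timeout elapses; True if it exited."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if proc.poll() is not None:
            return True
        time.sleep(pollinterval)
    return proc.poll() is not None


def progressive_stop(proc, waittime=2.0):
    """Stop a subprocess.Popen child in stages; return the last stage used."""
    # Stage 1: interrupt, so a busy child stops what it is doing.
    # Assumption: skipped on Windows, where SIGINT cannot be sent this way.
    if os.name != "nt":
        proc.send_signal(signal.SIGINT)
    if wait_for_exit(proc, waittime / 2):
        return "interrupt"
    # Stage 2: "terminate" (SIGTERM on Unix, same as kill() on Windows).
    proc.terminate()
    if wait_for_exit(proc, waittime / 2):
        return "terminate"
    # Stage 3: kill, which the child cannot catch or ignore.
    proc.kill()
    proc.wait()
    return "kill"


if __name__ == "__main__":
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print(progressive_stop(child, waittime=4.0))
```

A child that ignores both SIGINT and SIGTERM only goes away at the last stage, which is the case the 50%/50% split is meant to bound.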

Carreau · 2021-02-22T18:41:39Z

I'm thinking about changing shutdown_kernel to return an object describing how far it had to go to stop the kernel; something that we can grow later, and which basically tells us whether the kernel replied to the shutdown_request, or whether we had to go to the sigterm or the sigkill. That might make it easier to test (as it is hard to get info from a dead process), and thus to implement several test kernels that deliberately do not shut down on request, or do not shut down on sigterm.
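
A rough sketch of that idea, under stated assumptions: the enum name `ShutdownStatus`, its members (borrowed from names that come up later in this thread) and the helper `stop` are illustrative only, and the helper covers just the terminate and kill steps.

```python
import enum
import subprocess
import sys


class ShutdownStatus(enum.Enum):
    """How far the shutdown had to go (illustrative names, not a public API)."""

    Unset = "Unset"
    ShutdownRequest = "ShutdownRequest"
    SigtermRequest = "SigtermRequest"
    SigkillRequest = "SigkillRequest"


def stop(proc, waittime=1.0):
    """Terminate, then kill if needed; return the last step that was required."""
    status = ShutdownStatus.SigtermRequest
    proc.terminate()
    try:
        proc.wait(timeout=waittime)
    except subprocess.TimeoutExpired:
        # The child did not exit after terminate(): escalate to kill().
        status = ShutdownStatus.SigkillRequest
        proc.kill()
        proc.wait()
    return status


if __name__ == "__main__":
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print(stop(child))
```

A test can then assert on the returned status instead of trying to inspect a process that is already dead.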

MSeal

Overall looks fine beyond some minor comments. It might be worth making a manual test plan for this one, so that someone tries it out with a few event-loop patterns and OSes, just because I could see some odd edge case on Windows or M1 OSX that our automation might not catch.

MSeal · 2021-02-27T23:57:25Z

jupyter_client/tests/signalkernel.py

+    @gen.coroutine
+    def shutdown_request(self, stream, ident, parent):
+        if os.environ.get("NO_SHUTDOWN_REPLY") != "1":
+            yield gen.maybe_future(super().shutdown_request(stream, ident, parent))


Not super familiar with gen from tornado. Is there a risk of issues here depending on the wrapping event loop?

I don't believe so, that's the exact way ipykernel uses coroutines:

    @gen.coroutine
    def shutdown_request(self, stream, ident, parent):
        content = yield gen.maybe_future(self.do_shutdown(parent['content']['restart']))
        ...

So if this breaks, ipykernel would be broken too; and if it works with ipykernel, I see no reason it wouldn't work here.

I would prefer async def and await, but unfortunately the two do not mix.

MSeal · 2021-02-27T23:59:57Z

jupyter_client/manager.py

+                else:
+                    # If there's still a proc, wait and clear
+                    if self.has_kernel:
+                        self.kernel.wait()


Just double checking that we can't get stuck here... I think we can't since the kernel isn't alive but I don't recall why we wait in that case.

I don't think so, that's the same logic as line 360; I could make it a local function and call it in both locations to be clearer.

kevin-bates · 2021-02-28T00:55:19Z

jupyter_client/manager.py

+            await asyncio.wait_for(
+                self._async_wait(pollinterval=pollinterval), timeout=waittime / 2
+            )


I think this spin-wait needs to be associated with the except block - otherwise, it will occur again for the successful shutdown-request case (which will immediately exit, but probably not the intent).

Yeah, it's with the one just below, let me try to change the layout to be clearer.

I've split the two try/except blocks to make the intention clearer, is that better?
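
To make the layout being discussed concrete, here is a minimal asyncio sketch with two independent try/except blocks, each bounded by half of the wait time. `FakeKernel`, `wait_until_dead` and `finish_shutdown` are hypothetical stand-ins, not the manager's actual methods.

```python
import asyncio


class FakeKernel:
    """Stand-in kernel that dies at a chosen step: 'request', 'terminate' or 'kill'."""

    def __init__(self, dies_on):
        self.dies_on = dies_on
        self.alive = dies_on != "request"
        self.log = []

    def terminate(self):
        self.log.append("terminate")
        if self.dies_on == "terminate":
            self.alive = False

    def kill(self):
        self.log.append("kill")
        self.alive = False


async def wait_until_dead(kernel, pollinterval=0.01):
    # Spin-wait; returns immediately if the kernel is already gone.
    while kernel.alive:
        await asyncio.sleep(pollinterval)


async def finish_shutdown(kernel, waittime=0.2, pollinterval=0.01):
    # First half of the budget: did the shutdown request work?
    try:
        await asyncio.wait_for(wait_until_dead(kernel, pollinterval), timeout=waittime / 2)
    except asyncio.TimeoutError:
        kernel.terminate()

    # Second half: did terminate work? If not, kill.
    try:
        await asyncio.wait_for(wait_until_dead(kernel, pollinterval), timeout=waittime / 2)
    except asyncio.TimeoutError:
        kernel.kill()
    return kernel.log


if __name__ == "__main__":
    for step in ("request", "terminate", "kill"):
        print(step, asyncio.run(finish_shutdown(FakeKernel(step))))
```

When the shutdown request succeeds, both waits return at once and neither signal is sent, which is the behaviour the comment above was checking for.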

kevin-bates

(Sorry, I fat-fingered the 'single-comment' button, and it's like it turned previous comments into single comments as well, so I continued in that mode.)

This is looking good and I believe introducing interrupt and terminate to the shutdown mix is great.

I don't think this is anything to change, but just noting that the shutdown-request status is set after the request, while the signal-based statuses are set prior to their respective requests. Although not consistent, I think it's probably the right approach across the three and I'm not advocating for additional (pre/post) statuses, but with Kernel Provisioning, the terminate() and kill() are not necessarily "signal" requests and we shouldn't assume that a shutdown_status of SigtermRequest has been completely delivered. (Assuming there were a means of checking the status - which I could see some admin-level apps wanting to do.)

Hmm - on that note, should we consider name changes to TerminateRequest and KillRequest? These sound like kernel messages. I suppose we could use Request as a prefix on all three: RequestShutdown, RequestTerminate, and RequestKill.

kevin-bates

Thanks for the update Matthias - they look good.

I just had the one comment regarding possibly cutting to the chase and using the subprocess methods unconditionally. There might be some history here that I'm not seeing. Thanks.

kevin-bates · 2021-03-01T17:18:34Z

jupyter_client/manager.py

@@ -507,6 +514,11 @@ def _send_kernel_sigterm(self):
                    self.kernel.terminate()
                elif hasattr(signal, "SIGTERM"):
                    self.signal_kernel(signal.SIGTERM)
+                else:
+                    self.log.debug(
+                        "Cannot set term signal to kernel, no"


I see we prefer SIGKILL over kill() and assume that kill() exists. Should we do the same with SIGTERM and terminate()? (I suspect this is due to Windows not having SIGTERM?)

Actually, given that Popen is well-documented to use SIGKILL and SIGTERM for kill() and terminate(), respectively, and there don't appear to be python version compatibility issues, I think we'd be fine just calling the desired methods directly. This would then forgo the need to log anything and remove the conditional attribute checks.

Digging in:

I don't think that's a compatibility issue: signal_kernel signals the process group, not the kernel alone.
SIGTERM can be handled by a subprocess; SIGKILL cannot.

signal_kernel sends the signal to the process group (when possible) instead of only the single process. If the kernel has children, you don't want to send SIGTERM to the (grand)children, but rather leave the child process a chance to clean up its children, in case there is some cleanup logic.

Thus terminate() is preferred, and if you can't call it, you SIGTERM the process group.

Kill is the opposite: as the kernel has no chance to terminate its children by catching SIGKILL, you want to tell them to die as well. See #314

I guess in a perfect world you would:

1. send the shutdown request
2. wait
3. term the child
4. wait
4b. if it doesn't respond, term the grandchildren
5. wait
5b. kill everybody.

We just skip 4 and 4b, and replace 3 by "term the child; if you can't, term everybody".
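
The distinction between signalling only the child and signalling its whole process group can be sketched as follows. This is a POSIX-only illustration with hypothetical helper names (`start_in_own_group`, `term_child_only`, `kill_process_group`), and it assumes the child was started in its own session so that its group does not include the caller.

```python
import os
import signal
import subprocess
import sys


def start_in_own_group(args):
    """Start a child in a new session, hence in its own process group (POSIX)."""
    return subprocess.Popen(args, start_new_session=True)


def term_child_only(proc):
    """SIGTERM the direct child only; it can catch it and clean up its own children."""
    proc.terminate()


def kill_process_group(proc):
    """SIGKILL every process in the child's group, since SIGKILL cannot be caught."""
    pgid = os.getpgid(proc.pid)
    if pgid == os.getpgrp():
        # Safety check: never signal the group we are running in ourselves.
        raise RuntimeError("child shares our process group; refusing to signal it")
    os.killpg(pgid, signal.SIGKILL)


if __name__ == "__main__":
    sleeper = [sys.executable, "-c", "import time; time.sleep(60)"]

    first = start_in_own_group(sleeper)
    term_child_only(first)
    print("terminate ->", first.wait(timeout=5))

    second = start_in_own_group(sleeper)
    kill_process_group(second)
    print("killpg    ->", second.wait(timeout=5))
```

On POSIX, `Popen.wait()` returns the negated signal number when the child was ended by a signal, which makes it easy to check which path was taken.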

Good point on the process group portion of things. I had conflated signal_kernel() with send_signal().

Yeah, I had to dig deep in the code and history to figure that one out :-)

Carreau · 2021-03-01T20:55:12Z

> I don't think this is anything to change, but just noting that the shutdown-request status is set after the request, while the signal-based statuses are set prior to their respective requests. Although not consistent, I think it's probably the right approach across the three and I'm not advocating for additional (pre/post) statuses, but with Kernel Provisioning, the terminate() and kill() are not necessarily "signal" requests and we shouldn't assume that a shutdown_status of SigtermRequest has been completely delivered. (Assuming there were a means of checking the status - which I could see some admin-level apps wanting to do.)
>
> Hmm - on that note, should we consider name changes to TerminateRequest and KillRequest? These sound like kernel messages. I suppose we could use Request as a prefix on all three: RequestShutdown, RequestTerminate, and RequestKill.

I do not have any preference. This is mostly for testing, in order to understand the internal state, which is why I kept them private. I initially thought of returning the values, but I'm not feeling confident about making these public.

kevin-bates

Thank you Matthias (and @mlucool for proposing these changes) - this makes a kernel's management more robust.

mlucool · 2021-03-03T20:52:17Z

Thanks @Carreau - I tested this and it does fix #618

Carreau · 2021-03-08T16:00:56Z

There have been no objections in a week; merging. Will do a release soon.

Carreau force-pushed the sigterm branch from bfb2d3d to 93dd3ac on February 22, 2021 18:25

Carreau added 4 commits February 22, 2021 12:18

sync

55ba451

add testing

d2c1250

add same async tests

331c558

generalize param

c91c993

Carreau marked this pull request as ready for review February 23, 2021 15:51

Carreau changed the title ~~[DRAFT] interrupt and sigterm before sigkill kernel.~~ Interrupt and sigterm before sigkill kernel. Feb 23, 2021

MSeal approved these changes Feb 28, 2021

kevin-bates reviewed Feb 28, 2021

kevin-bates reviewed Feb 28, 2021

kevin-bates reviewed Feb 28, 2021

Carreau added 3 commits March 1, 2021 08:23

Lower case K

dae1def

consistent log + timeout layout

d6cd86b

refactor local logic to be clearer

6024cca

kevin-bates reviewed Mar 1, 2021

comment on private reason for enum

60aa969

kevin-bates approved these changes Mar 3, 2021

Carreau merged commit 7df16d5 into jupyter:master Mar 8, 2021

Carreau added this to the 6.1.11 milestone Mar 10, 2021

Carreau linked an issue Mar 10, 2021 that may be closed by this pull request

Kernel Shutdown Proposal #618

Closed

Carreau modified the milestones: 6.1.11, 6.1.12 Mar 14, 2021

davidbrochart mentioned this pull request Mar 19, 2021

Add type annotations, refactor sync/async #623

Merged

Zsailer mentioned this pull request Aug 19, 2021

Jupyter Server Notes 2021 jupyter-server/team-compass#4

Closed

kevin-bates mentioned this pull request Feb 2, 2022

subprocesses of kernels are not killed by restart #104

Closed
