-
Notifications
You must be signed in to change notification settings - Fork 283
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Interrupt and sigterm before sigkill kernel. #620
Conversation
This implements the discussion in jupyter#18, to try to more progressively stop the kernels. 1) this always sends and interrupt before any shutdown requests; goal being to stop any processing happening that may block any event loop. 2) sends the shutdown_requests (no changes here). 3) wait 50% of the wait time, and sends a "terminate" in quote as this depends on your platform/os, and the type of kernel you have. - if subprocess.Popen calls `.terminate()` which is SIGTERM on unix; the same as `.kill()` on windows; if not a Popen instance send SIGTERM. 4) wait the other 50% and kills, (same as before). This does this both on the sync client and async client. TBD: write tests and docs; and test more properly;
I'm thinking about changing the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks fine beyond some minor comments. Might be worth making a manual test plan for this one to make sure we have someone try it out in a few event loops patterns / os's? Just because I could see some odd edge case in windows for M1 osx that our automation might not catch.
@gen.coroutine | ||
def shutdown_request(self, stream, ident, parent): | ||
if os.environ.get("NO_SHUTDOWN_REPLY") != "1": | ||
yield gen.maybe_future(super().shutdown_request(stream, ident, parent)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not super familiar with gen
from tornado. Is there a risk of issues here depending on the wrapping event loop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't believe so, that's the exact way ipykernel uses coroutines:
@gen.coroutine
def shutdown_request(self, stream, ident, parent):
content = yield gen.maybe_future(self.do_shutdown(parent['content']['restart']))
...
So if this breaks; IPykernel would be broken, and if it works with ipykernel, I see no reason this wouldn't.
I would prefer async def
and await
, but that unfortunately does not mix.
jupyter_client/manager.py
Outdated
else: | ||
# If there's still a proc, wait and clear | ||
if self.has_kernel: | ||
self.kernel.wait() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just double checking that we can't get stuck here... I think we can't since the kernel isn't alive but I don't recall why we wait in that case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so, that's the same logic as line 360; I could make it a local function and call it in both location to be clearer.
await asyncio.wait_for( | ||
self._async_wait(pollinterval=pollinterval), timeout=waittime / 2 | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this spin-wait needs to be associated with the except
block - otherwise, it will occur again for the successful shutdown-request case (which will immediately exit, but probably not the intent).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it's with the one just below, let me try to change the layout to be clearer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've split the two try/except to make the intention clearer, is that better ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Sorry, I fat-fingered the 'single-comment' button, and its like it turned previous comments into single comments as well, so I continued in that mode.)
This is looking good and I believe introducing interrupt and terminate to the shutdown mix is great.
I don't think this is anything to change, but just noting that the shutdown-request status is set after the request, while the signal-based statuses are set prior to their respective requests. Although not consistent, I think it's probably the right approach across the three and I'm not advocating for additional (pre/post) statuses, but with Kernel Provisioning, the terminate()
and kill()
are not necessarily "signal" requests and we shouldn't assume that a shtudown_status of SigtermRequest
has been completely delivered. (Assuming there were a means of checking the status - which I could see some admin-level apps wanting to do.)
Hmm - on that note, should we consider name changes to TerminateRequest
and KillRequest
. These sound like kernel messages. I suppose we could use Request
as a prefix on all three: RequestShutdown
, RequestTerminate
, and RequestKill
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the update Matthias - they look good.
I just had the one comment regarding possibly cutting to the chase and using the subprocess methods unconditionally. There might be some history here that I'm not seeing. Thanks.
@@ -507,6 +514,11 @@ def _send_kernel_sigterm(self): | |||
self.kernel.terminate() | |||
elif hasattr(signal, "SIGTERM"): | |||
self.signal_kernel(signal.SIGTERM) | |||
else: | |||
self.log.debug( | |||
"Cannot set term signal to kernel, no" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see we prefer SIGKILL
over kill()
and assume that kill()
exists. Should we do the same with SIGTERM
and terminate()
? (I suspect this is due to Windows not having SIGTERM
?)
Actually, given that Popen
is well-documented to use SIGKILL
and SIGTERM
for kill()
and terminate()
, respectively, and there don't appear to be python version compatibility issues, I think we'd be fine just calling the desired methods directly. This would then forgo the need to log anything and remove the conditional attribute checks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Diggning in:
I don't think that's compatibility issue, signal_kernel signal the process group, not the kernel alone.
Sigterm can be handled by a subprocess, not sigkill.
Signal_kernel send the signal to the process group (when possible) instead of only the single process, so if the kernel has children you don't want to send sigterm to the (grand)children, put leave the child process a chance to clean up it children, in case there is some cleanup logic.
Thus terminate()
is prefered, and if you can't you sigterm the process group.
Kill is the opposite, as the kernel has no chance to terminate children by catching sigkill, you want to tell them to die as well. See #314
I guess in a perfect world you would
- 1 shutdown request
- 2 wait
- 3 term child
- 4 wait
- 4b if don't respond term grand children
- 5 wait
- 5b kill everybody.
we just skip 4,4b, and replace 3 by "term child, if you can't term everybody".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point on the process group portion of things. I had conflated signal_kernel()
with send_signal()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, i I had to dig deep in the code and history to figure that one :-)
I do not have any preference, this is mostly for testing in order to understand the internal state and why I kept them private. ; I initially thought of returning the values but I'm not feeling confident with making these public. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you Matthias (and @mlucool for proposing these changes) - this makes a kernel's management more robust.
there has been no objecting in a week; merging. Will do a release soon. |
This implements the discussion in #618, to try to more progressively stop
the kernels.
this always sends and interrupt before any shutdown requests;
goal being to stop any processing happening that may block any event
loop.
sends the shutdown_requests (no changes here).
wait 50% of the wait time, and sends a "terminate" in quote as this
depends on your platform/os, and the type of kernel you have.
.terminate()
which is SIGTERM on unix; the same as.kill()
on windows; if not a Popen instance send SIGTERM.wait the other 50% and kills, (same as before).
This does this both on the sync client and async client.
TBD: write tests and docs; and test more properly;