Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Interrupt and sigterm before sigkill kernel. #620

Merged
merged 9 commits into from
Mar 8, 2021

Conversation

Carreau
Copy link
Member

@Carreau Carreau commented Feb 22, 2021

This implements the discussion in #618, to try to more progressively stop
the kernels.

  1. this always sends and interrupt before any shutdown requests;
    goal being to stop any processing happening that may block any event
    loop.

  2. sends the shutdown_requests (no changes here).

  3. wait 50% of the wait time, and sends a "terminate" in quote as this
    depends on your platform/os, and the type of kernel you have.

    • if subprocess.Popen calls .terminate() which is SIGTERM on unix; the same as
      .kill() on windows; if not a Popen instance send SIGTERM.
  4. wait the other 50% and kills, (same as before).

This does this both on the sync client and async client.

TBD: write tests and docs; and test more properly;

This implements the discussion in jupyter#18, to try to more progressively stop
the kernels.

1) this always sends and interrupt before any shutdown requests;
  goal being to stop any processing happening that may block any event
  loop.

2) sends the shutdown_requests (no changes here).

3) wait 50% of the wait time, and sends a "terminate" in quote as this
depends on your platform/os, and the type of kernel you have.
    - if subprocess.Popen calls `.terminate()` which is SIGTERM on unix; the same as
    `.kill()` on windows; if not a Popen instance send SIGTERM.

4) wait the other 50% and kills, (same as before).

This does this both on the sync client and async client.

TBD: write tests and docs; and test more properly;
@Carreau
Copy link
Member Author

Carreau commented Feb 22, 2021

I'm thinking about changing the shutdown_kernel to return an object with how far it went to stop the kernel; something that we can grow later; which basically tell us if the kernel replied to the shutdown_request; or had to go to the sigterm, or the sigkill. That might make it easier to test (as it is hard to get info from a dead process), and thus implement multiple kernels that will forcibly not shutdown on request, and not shut down on sigterm.

@Carreau Carreau marked this pull request as ready for review February 23, 2021 15:51
@Carreau Carreau changed the title [DRAFT] interrupt and sigterm before sigkill kernel. Interrupt and sigterm before sigkill kernel. Feb 23, 2021
Copy link
Contributor

@MSeal MSeal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks fine beyond some minor comments. Might be worth making a manual test plan for this one to make sure we have someone try it out in a few event loops patterns / os's? Just because I could see some odd edge case in windows for M1 osx that our automation might not catch.

jupyter_client/manager.py Show resolved Hide resolved
jupyter_client/manager.py Show resolved Hide resolved
@gen.coroutine
def shutdown_request(self, stream, ident, parent):
if os.environ.get("NO_SHUTDOWN_REPLY") != "1":
yield gen.maybe_future(super().shutdown_request(stream, ident, parent))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not super familiar with gen from tornado. Is there a risk of issues here depending on the wrapping event loop?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe so, that's the exact way ipykernel uses coroutines:

 @gen.coroutine
    def shutdown_request(self, stream, ident, parent):
        content = yield gen.maybe_future(self.do_shutdown(parent['content']['restart']))
        ...

So if this breaks; IPykernel would be broken, and if it works with ipykernel, I see no reason this wouldn't.

I would prefer async def and await, but that unfortunately does not mix.

else:
# If there's still a proc, wait and clear
if self.has_kernel:
self.kernel.wait()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just double checking that we can't get stuck here... I think we can't since the kernel isn't alive but I don't recall why we wait in that case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so, that's the same logic as line 360; I could make it a local function and call it in both location to be clearer.

Comment on lines +660 to +662
await asyncio.wait_for(
self._async_wait(pollinterval=pollinterval), timeout=waittime / 2
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this spin-wait needs to be associated with the except block - otherwise, it will occur again for the successful shutdown-request case (which will immediately exit, but probably not the intent).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's with the one just below, let me try to change the layout to be clearer.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've split the two try/except to make the intention clearer, is that better ?

Copy link
Member

@kevin-bates kevin-bates left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Sorry, I fat-fingered the 'single-comment' button, and its like it turned previous comments into single comments as well, so I continued in that mode.)

This is looking good and I believe introducing interrupt and terminate to the shutdown mix is great.

I don't think this is anything to change, but just noting that the shutdown-request status is set after the request, while the signal-based statuses are set prior to their respective requests. Although not consistent, I think it's probably the right approach across the three and I'm not advocating for additional (pre/post) statuses, but with Kernel Provisioning, the terminate() and kill() are not necessarily "signal" requests and we shouldn't assume that a shtudown_status of SigtermRequest has been completely delivered. (Assuming there were a means of checking the status - which I could see some admin-level apps wanting to do.)

Hmm - on that note, should we consider name changes to TerminateRequest and KillRequest. These sound like kernel messages. I suppose we could use Request as a prefix on all three: RequestShutdown, RequestTerminate, and RequestKill.

Copy link
Member

@kevin-bates kevin-bates left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update Matthias - they look good.

I just had the one comment regarding possibly cutting to the chase and using the subprocess methods unconditionally. There might be some history here that I'm not seeing. Thanks.

@@ -507,6 +514,11 @@ def _send_kernel_sigterm(self):
self.kernel.terminate()
elif hasattr(signal, "SIGTERM"):
self.signal_kernel(signal.SIGTERM)
else:
self.log.debug(
"Cannot set term signal to kernel, no"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see we prefer SIGKILL over kill() and assume that kill() exists. Should we do the same with SIGTERM and terminate()? (I suspect this is due to Windows not having SIGTERM?)

Actually, given that Popen is well-documented to use SIGKILL and SIGTERM for kill() and terminate(), respectively, and there don't appear to be python version compatibility issues, I think we'd be fine just calling the desired methods directly. This would then forgo the need to log anything and remove the conditional attribute checks.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Diggning in:

I don't think that's compatibility issue, signal_kernel signal the process group, not the kernel alone.
Sigterm can be handled by a subprocess, not sigkill.

Signal_kernel send the signal to the process group (when possible) instead of only the single process, so if the kernel has children you don't want to send sigterm to the (grand)children, put leave the child process a chance to clean up it children, in case there is some cleanup logic.

Thus terminate() is prefered, and if you can't you sigterm the process group.

Kill is the opposite, as the kernel has no chance to terminate children by catching sigkill, you want to tell them to die as well. See #314

I guess in a perfect world you would

  • 1 shutdown request
  • 2 wait
  • 3 term child
  • 4 wait
    • 4b if don't respond term grand children
  • 5 wait
    • 5b kill everybody.

we just skip 4,4b, and replace 3 by "term child, if you can't term everybody".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point on the process group portion of things. I had conflated signal_kernel() with send_signal().

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, i I had to dig deep in the code and history to figure that one :-)

@Carreau
Copy link
Member Author

Carreau commented Mar 1, 2021

I don't think this is anything to change, but just noting that the shutdown-request status is set after the request, while the signal-based statuses are set prior to their respective requests. Although not consistent, I think it's probably the right approach across the three and I'm not advocating for additional (pre/post) statuses, but with Kernel Provisioning, the terminate() and kill() are not necessarily "signal" requests and we shouldn't assume that a shtudown_status of SigtermRequest has been completely delivered. (Assuming there were a means of checking the status - which I could see some admin-level apps wanting to do.)

Hmm - on that note, should we consider name changes to TerminateRequest and KillRequest. These sound like kernel messages. I suppose we could use Request as a prefix on all three: RequestShutdown, RequestTerminate, and RequestKill.

I do not have any preference, this is mostly for testing in order to understand the internal state and why I kept them private. ; I initially thought of returning the values but I'm not feeling confident with making these public.

Copy link
Member

@kevin-bates kevin-bates left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you Matthias (and @mlucool for proposing these changes) - this makes a kernel's management more robust.

@mlucool
Copy link
Contributor

mlucool commented Mar 3, 2021

Thanks @Carreau - I tested this and it does fix #618

@Carreau Carreau merged commit 7df16d5 into jupyter:master Mar 8, 2021
@Carreau
Copy link
Member Author

Carreau commented Mar 8, 2021

there has been no objecting in a week; merging. Will do a release soon.

@Carreau Carreau added this to the 6.1.11 milestone Mar 10, 2021
@Carreau Carreau linked an issue Mar 10, 2021 that may be closed by this pull request
@Carreau Carreau modified the milestones: 6.1.11, 6.1.12 Mar 14, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Kernel Shutdown Proposal
4 participants