Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kernel Shutdown Proposal #618

Closed
mlucool opened this issue Feb 5, 2021 · 14 comments · Fixed by #620
Closed

Kernel Shutdown Proposal #618

mlucool opened this issue Feb 5, 2021 · 14 comments · Fixed by #620
Milestone

Comments

@mlucool
Copy link
Contributor

mlucool commented Feb 5, 2021

Shutdown kernel states the following in it's doc:

This attempts to shutdown the kernels cleanly by:

  1. Sending it a shutdown message over the control channel.
  2. If that fails, the kernel is shutdown forcibly by sending it a signal.

While this seems like the right thing to do, it does not work well for kernels like IPykernel (but may be fine for kernels like Xeus-Python). The problem is the signal sent via the control channel may be blocked by a running process, so there is no real chance to cleanly shut it down.

I propose that we add a step between 1 and 2 which tries SIGINT and then tries SIGTERM before finally using SIGKILL (step 2 today). What do you think?

@kevin-bates
Copy link
Member

hi @mlucool, this makes sense to me - thanks for raising it.

I have some questions, comments...

  • I'm hoping we could just invoke interrupt_kernel() for the SIGINT step. Chances are that a kernel that supports message-based interrupts have probably responded to the message-based termination anyway, and this would be cleaner.
  • I'm assuming interrupt would occur after the 5-second period to determine if the message-based termination succeeded, correct?
    • Is the SIGTERM even necessary at that point? I see that ipykernel and IPython don't handle SIGTERM anyway, so, relative to that kernel, there's really no need, although SIGTERM prior to SIGKILL is more correct and should probably be honored.
    • I'd prefer we not add another 5-second period following the SIGTERM and prior to SIGKILL. Perhaps we could split the poll loops in 60/40 or something - so that the overall time for these edge cases isn't necessarily increased.

Another approach would be to unconditionally issue an interrupt_kernel() call prior to request_shutdown() and be done. Although cleaner, this would lengthen the shutdown sequence by a message.

((I'm hoping we could drop the if hasattr(signal, 'SIGKILL'): code and just use self.kernel.kill() and self.kernel.terminate(). Sending the signal and calling the method is the same thing anyway.))

@mlucool
Copy link
Contributor Author

mlucool commented Feb 6, 2021

Another approach would be to unconditionally issue an interrupt_kernel() call prior to request_shutdown() and be done. Although cleaner, this would lengthen the shutdown sequence by a message.

I like this idea. Still, it seems like SIGTERM is needed for things inside the kernel to be able to cleanly shutdown.

New proposal:

  1. Interrupt then shutdown message over the control channel (with no delay between the two)
  2. SIGTERM
  3. SIGKILL

I don't have a good intuition as to how long we need to wait between the steps. Is the 60/40 idea to wait 3 seconds after the control channel and 2 seconds after SIGTERM?

cc @Carreau - what do you think w.r.t. ipykernel?

@kevin-bates
Copy link
Member

Still, it seems like SIGTERM is needed for things inside the kernel to be able to cleanly shutdown.

Assuming the interrupt request would proceed today's graceful shutdown sequence and restore the kernel's ability to respond to control messages, I would argue that the request-shutdown message is tantamount to issuing SIGTERM prior to SIGKILL and we don't need the additional SIGTERM.

Is the 60/40 idea to wait 3 seconds after the control channel and 2 seconds after SIGTERM?

Yes, but only if we don't perform the interrupt as a prerequisite to the request-shutdown message.

I'm not familiar with kernel message handling in general, so would like to hear other opinions as well.

@Carreau
Copy link
Member

Carreau commented Feb 6, 2021

I agree that sending interrupt before request_shutdown make sens.
I think it is ok to also send sigterm shortly after request_shutdown, before sigkill, just in case the kernel is blocked in a compiled-routine that may ignore sigint.

@mlucool
Copy link
Contributor Author

mlucool commented Feb 6, 2021

In case xeus authors have any thoughts here as well - cc @martinRenou @SylvainCorlay

@SylvainCorlay
Copy link
Member

While this seems like the right thing to do, it does not work well for kernels like IPykernel (but may be fine for kernels like Xeus-Python).

Hey @mlucool, thanks for the ping!

FYI, I opened a PR to ipykernel last week to run the control channel in a separate thread, so that ipykernel can properly handle shutdown request while processing execute requests. We are also working on plugging the debugger on top of that PR.

I think it makes sense for kernels to implement signal handlers, but I am not sure it is really necessary from a jupyter_client perspective, although I am not opposed to this change.

@mlucool
Copy link
Contributor Author

mlucool commented Feb 8, 2021

Based on above, I think it's still best to send SIGTERM in case some code is blocking the shutdown. What if we set the default to 7s and split the time between proposed steps 1&2 and 1&3 - so 3.5 for the control channel to shut down and 3.5 for graceful SIGTERM?

@kevin-bates
Copy link
Member

I think this seems fine but would just like to clarify the sequencing. Here's my understanding...

  1. Issue interrupt via interrupt_kernel()
  2. Issue shutdown-request via request_shutdown()
    Poll for up 50% of waittime to see if kernel is still alive
    If alive:
    1. Issue SIGTERM via self.kernel.terminate()
      Poll for up 50% of waittime to see if kernel is still alive
      If alive:
      1. Issue SIGKILL via self.kernel.kill()

Is that the plan?

(Thanks for dealing with this.)

@mlucool
Copy link
Contributor Author

mlucool commented Feb 9, 2021

Correct @kevin-bates. Only final question is should we increase the total waittime or not. Moving to 6 or 7 seconds seems like a good idea to me at least, but I don't feel strongly.

@kevin-bates
Copy link
Member

I'm beginning to think it might be best to keep waittime at its current default of 5. Use that value for the graceful shutdown attempt (after interrupt and request-shutdown), then use, say, 50-60% of that value for the forceful shutdown attempt sequence (after SIGTERM).

The help text for shutdown_wait_time says: "Time to wait for a kernel to terminate before killing it, in seconds." which is still all true. Just that the time to kill can now take 50% more of that time. 😄

Does that sound reasonable?

@mlucool
Copy link
Contributor Author

mlucool commented Feb 9, 2021

I think users care about the total max time they would have to wait to have the kernel be stopped (SIGKILL). So long as the language in the documentation is updated to make it clear it's all the same to me.

@kevin-bates
Copy link
Member

ok - I'm fine with that.

@mlucool
Copy link
Contributor Author

mlucool commented Feb 16, 2021

Great. @kevin-bates, in terms of implementing this change, are you planning to do it? If not, I'm happy to have someone on my end contribute it.

@kevin-bates
Copy link
Member

I wasn't planning on implementing this; a contribution would be welcome. Thanks Marc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants