-
-
Notifications
You must be signed in to change notification settings - Fork 71
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
os.proc.call
's timeout
has a termination race-condition from SIGTERM
and SIGKILL
#284
Comments
The option 3 feels like the most sensible/useful to me. |
Option 3 can be ok if the kill delay is configurable or sufficient length. Some people might argue that's an unexpectedly longer timeout than they expected though. In my case, I don't want forcible termination at all. At the very least, in addition to The default could then be |
I think a configurable kill delay with a default sounds good. Such an addition can be made binary compatible and semantically compatible, and mirrors the Linux timeout command as you have mentioned. Feel free to send a PR |
What do you envision the binary compatible change looking like?
|
Take all the methods currently taking AFAICT, this affects |
The problem is that you can't have two overloadings with default arguments. This is why default arguments is a bin-compat nightmare. I think if the argument is placed last and the underlying forwarder is stripped of its defaults this might work... |
|
Yes, keep the default arguments only on the longest overload and strip them from the shorter ones. |
Also, as a side note, |
Probably not, but most people won't extend them anyway so a bit of sloppiness isn't a huge deal |
Well, it might well be because of the new overload: it might be better to |
sure that works |
Alternatively, I notice we are still in |
Let's preserve bincompat and release it as a 0.10.x for now. Although we can break compat whenever we want and just bump a version, it's also good to preserve compat where possible to give our users an easier time upgrading. |
Fair! (it's been so long since I've been in 0.x that I forgot you can release minor in the patch digit) |
The current implementation of
os.proc.call
'stimeout
flag uses two consecutive calls top.destroy()
andp.forciblyDestroy()
. The effects of these calls are to first sendSIGTERM
, and then sendSIGKILL
to the process.The roles of these two signals are:
SIGTERM
: instruct the process to terminate, the process may intercept this and perform necessary clean-up operations, or may decide to ignore it entirelySIGKILL
: instruct the process to terminate immediately -- this signal cannot be intercepted.By sending these two signals back-to-back, the parent process produces a race-condition between how quickly the child can execute its
SIGTERM
handler and clean up resources and the issue of theSIGKILL
. In my experiments, I've found thatSIGKILL
it the cause of process exit the vast majority of the time. This means that the (potentially necessary) clean-up of the process is often not performed or worse interrupted. If the handler itself contained code to write file contents back to disk, modify a database, and so on, these operations may be corrupted. If the child process itself has children that need terminating, this could not be issued, leading to the parent process hanging.What are the possible desired outcomes?
There are three ways that the timeout should be terminating the process:
SIGTERM
: it doesn't matter how long it takes, we need to ensure safe clean-upSIGKILL
: the process has no important state, it should be terminated immediatelySIGTERM
, wait an appropriate amount of time, then sendSIGKILL
: we want to offer the process an opportunity to clean-up, but if this takes too long (perhaps the clean-up process itself is hanging), we want to forcibly terminate -- this is the scenario done byos.lib
, albeit without allowing sufficient time to perform the handler.What is normally done?
The
SIGKILL
signal is useful to issue when a process is not responding in a timely fashion to itsSIGTERM
event and the two are usually sent together with a delay. The Linuxtimeout
command offers this with the-k n
flag, which sends aSIGKILL
signaln
seconds after the original timeout sentSIGTERM
.Solutions
The race condition caused by the consecutive calls to
destroy
andforciblyDestroy
is a bug, and could be addressed by supporting a similar system allowing outcomes (1), (2), or (3) configurably and safely.For backwards compatibility, however, it might be wise to just support (1) with the current system.
The text was updated successfully, but these errors were encountered: