Strange (probably unnecessary) behavior in sysv stop script #174
Comments
@pforman What would happen if Edit: Updated question for clarity |
If leave_on_terminate is true, then a TERM should issue another leave if it actually runs. Normally leave_on_terminate is false. (There's another default but modifiable behavior around SIGINT as well, basically SIGINT will cause a leave unless specified not to). Normally the leave works correctly. In some rare cases, I've seen it not work, but I can't reliably duplicate it (yet). Luckily my test cluster just triggered it for me. Here's what I got:
Here's the log:
This looks identical to any other client leave in the logs. So it looks like the shutdown code returned "1", but then shut down correctly anyway. Checking from another client gives a "left" status:
This may just be a consul thing from internal goroutines and the shutdown order. I'll look over on their issues.

As far as I can tell, the TERM is just sort of superfluous, unless you really want to force a "failed" state instead of "left". I haven't ever seen a leave actually fail, but with the old script I did see some "failed" states come about when clients should have left. I think that's because the SIGKILL can't be caught, and the process just terminates abruptly.

Seems like having clients simply do "consul leave" and having servers do "kill -TERM" is probably the best. Having clients use INT (which the handler will generate a leave from) is also an option. Not sure if that's more or less confusing.

I spent some quality time reading the consul signals code today to get what understanding I do have. I'll dig in a little more, and go check the issues of consul to see if they've noticed this behavior. |
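For illustration only, the client/server split suggested in this comment could be sketched roughly as below. This is not the cookbook's script: the function name `stop_command_for` and the way the role is passed in are assumptions made for the example, and the helper only prints the command it would choose rather than running it.

```shell
#!/bin/sh
# Hypothetical helper, not taken from the actual sysv init script.
# Prints the stop command for a given role instead of executing it,
# so the choice is easy to inspect.
# Usage: stop_command_for ROLE PID
stop_command_for() {
    role=$1
    pid=$2
    case "$role" in
        client)
            # Clients announce a graceful departure and should end up "left".
            echo "consul leave"
            ;;
        server)
            # Servers get TERM; with leave_on_terminate left at false they
            # are expected to stop without issuing a leave.
            echo "kill -TERM $pid"
            ;;
        *)
            echo "unknown role: $role" >&2
            return 1
            ;;
    esac
}
```

Swapping the client branch to `kill -INT $pid` would give the INT variant mentioned above, where the signal handler is what generates the leave.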
@pforman So then why not remove |
I think that's viable. Let me beat on it a bit and see if I can get it to misbehave in any way. The nice part about using INT is that it's controllable by config_hash, using the "skip_leave_on_interrupt" parameter. All of this discussion about leave-vs-TERM is currently only the case for "sysv", though. Everything else appears to just use "consul leave". Seems like a lot of us with weird needs are using sysv-based systems... I'm not sure if consistency across the init types is a good goal or not. |
@pforman Thank You! The more testing the better 😄 Currently So please keep us posted 👍 |
Initial results with INT look good. We get this in the log (turned up to DEBUG):
The one oddity is that the script contains this:
That's too fast, if the
Still looking into the original condition that causes "consul leave" to occasionally generate a return code of 1. Seems like that should be sent upstream. "consul leave" is a bit more readable than "kill -INT", so I'd hate to obfuscate everything if we can find the root issue. |
@pforman 💡 We can do one of the following:
OR
Also, I kinda understand your point about |
@pforman I checked the other init scripts, and most of them sleep or wait before removing the pid file, so I think it's fine to do that here as well (option 1). Upstart is the exception to that rule, but I will add support to it as well. |
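A minimal sketch of the "wait before removing the pid file" idea, assuming a pid file that contains a single PID. The function name `remove_pidfile_when_stopped` and its arguments are hypothetical and are not taken from any of the existing init scripts.

```shell
#!/bin/sh
# Hypothetical helper: poll until the process named in the pid file has
# exited, then remove the pid file. Gives up after TIMEOUT seconds and
# leaves the pid file in place.
# Usage: remove_pidfile_when_stopped PIDFILE TIMEOUT
remove_pidfile_when_stopped() {
    pidfile=$1
    timeout=$2
    pid=$(cat "$pidfile" 2>/dev/null) || return 1
    if [ -z "$pid" ]; then
        return 1
    fi
    waited=0
    # kill -0 sends no signal; it only checks that the process exists.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    rm -f "$pidfile"
}
```

Keeping the pid file whenever the process is still alive means a later stop attempt can still find the PID.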
I opened hashicorp/consul#1189 about this. We'll see what happens. I think INT and backoff is satisfactory, and if they fix this issue upstream going back to |
@pforman Awesome. Looking forward to your PR 😎 ⛵ |
I was just working on this exact same issue today. I'd prefer that we let the leave_on_terminate property determine whether the node leaves the cluster or is marked as failed. Otherwise there are situations where a simple stop/start causes a node to leave the cluster when you didn't really want it to. @pforman, have you made headway on this, and will a PR be ready soon? |
PR is #181, waiting for merge. The behavior for both clients and servers is somewhat configurable with the Consul properties, because the script just sends INT or TERM respectively. If the signal isn't handled in a reasonable time it escalates the signal, eventually to KILL. Consul is pretty well behaved as a daemon so I haven't seen that happen. |
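The escalation described here could look something like the sketch below. This is not the code from the PR: the function name `stop_with_escalation`, the argument order, and the one-second polling interval are assumptions chosen for the example.

```shell
#!/bin/sh
# Hypothetical helper: send each signal in turn and wait up to TIMEOUT
# seconds after each one for the process to exit.
# Usage: stop_with_escalation PID TIMEOUT SIGNAL...
stop_with_escalation() {
    pid=$1
    timeout=$2
    shift 2
    for sig in "$@"; do
        # Nothing to do if the process is already gone.
        kill -0 "$pid" 2>/dev/null || return 0
        kill "-$sig" "$pid" 2>/dev/null || true
        waited=0
        while [ "$waited" -lt "$timeout" ]; do
            kill -0 "$pid" 2>/dev/null || return 0
            sleep 1
            waited=$((waited + 1))
        done
    done
    # Report failure if the process survived every signal in the list.
    if kill -0 "$pid" 2>/dev/null; then
        return 1
    fi
    return 0
}
```

Under these assumptions a client stop would be something like `stop_with_escalation "$pid" 10 INT TERM KILL` and a server stop `stop_with_escalation "$pid" 10 TERM KILL`, so KILL is only reached if the earlier signals are not handled in time.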
Merged in #181. |
Something like this could be used to standardize init scripts https://github.com/jordansissel/pleaserun |
The sysv stop script has this in it for the stop section:
I read the original PR (#87), and I agree with the theory to have clients leave but servers stay in "failed" state to preserve their state for a rejoin. However, the implementation doesn't seem to address this correctly, and "kill -9" is extremely heavy-handed for a distributed consensus system. It doesn't seem like that's ever going to be the right move.
There's also a problem where if the leave works on a client, the kill will fail, resulting in a "FAILED" response from $retcode.
I also have observed some cases of clients in a "failed" state where they should have left, which I think is down to a race condition between issuing a leave and the subsequent 'kill -9'.
I have a PR almost ready to go for this, but then I saw #173 in the queue working in exactly the same files (and same lines), so to avoid a conflict I've held off.
I also figured some discussion about what the actual effect should be was in order. Consul will normally quit quickly (without issuing a leave) when given a TERM; however, this can be controlled by the leave_on_terminate config option. Issuing a TERM seems correct for servers that want to preserve state, and the behavior can still be controlled in the config_hash if desired.
What to do with a client that fails to leave is a little harder. In a few cases, I've seen a failure to leave immediately, which manifested as this message.
However, looking in the logs it appears this is a temporary issue in resending gossip, and doesn't actually affect the leave process.
The logic I've used is like this (irrelevant stuff removed):
But honestly, I'm questioning the use of the TERM case at all in the client section. Any thoughts on this before I send in a PR?
As far as I can tell, this "kill -9" usage is unique to the sysv script. Every other method uses "consul" leave, or possibly TERM. Debian escalates TERM to KILL after a timeout, but doesn't start there.
tl;dr : The sysv script uses kill -9 on consul and I don't think it should.