
Strange (probably unnecessary) behavior in sysv stop script #174

Closed

pforman opened this issue Aug 20, 2015 · 14 comments

Comments

@pforman

pforman commented Aug 20, 2015

The sysv stop script has this in it for the stop section:

stop() {
        echo -n "Shutting down consul: "
        # If consul is not acting as a server, exit gracefully
        if ("${CONSUL}" info 2>/dev/null | grep -q 'server = false' 2>/dev/null) ; then
            "$CONSUL" leave
        fi

        # If acting as a server, or if leave failed, kill it.
        mkpidfile
        killproc $KILLPROC_OPT $CONSUL -9

        retcode=$?
        rm -f /var/lock/subsys/consul $PID_FILE
        return $retcode
}

I read the original PR (#87), and I agree with the theory to have clients leave but servers stay in "failed" state to preserve their state for a rejoin. However, the implementation doesn't seem to address this correctly, and "kill -9" is extremely heavy-handed for a distributed consensus system. It doesn't seem like that's ever going to be the right move.

There's also a problem where if the leave works on a client, the kill will fail, resulting in a "FAILED" response from $retcode.

I also have observed some cases of clients in a "failed" state where they should have left, which I think is down to a race condition between issuing a leave and the subsequent 'kill -9'.

I have a PR almost ready to go for this, but then I saw #173 in the queue working in exactly the same files (and same lines), so to avoid a conflict I've held off.

I also figured some discussion about what the actual effect should be was in order. Consul will normally quit quickly (without issuing a leave) when given a TERM; however, this can be controlled by the leave_on_terminate config option. Issuing a TERM seems correct for servers wanting to preserve state, and the behavior can still be adjusted via the config_hash if desired.

What to do with a client that fails to leave is a little harder. In a few cases, I've seen a failure to leave immediately, which manifested as this message.

Shutting down consul: Error leaving: client closed

However, looking in the logs it appears this is a temporary issue in resending gossip, and doesn't actually affect the leave process.

The logic I've used is like this (irrelevant stuff removed):

  if client
      leave OR kill -TERM
      retcode=$?
  else
      kill -TERM
      retcode=$?
  end
  return $retcode
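A minimal shell sketch of that branch might look like the following. The helper names are hypothetical; `CONSUL` and `killproc` are assumed to come from the real init script's environment, as in the existing sysv script.

```shell
#!/bin/sh
# Sketch only: CONSUL points at the consul binary, and killproc is the
# distribution's init helper, both provided by the surrounding script.
CONSUL=${CONSUL:-/usr/local/bin/consul}

consul_is_client() {
    # `consul info` reports "server = false" for client agents
    "$CONSUL" info 2>/dev/null | grep -q 'server = false'
}

stop_consul() {
    if consul_is_client; then
        # Clients leave gracefully; fall back to SIGTERM if the leave fails
        "$CONSUL" leave || killproc "$CONSUL" -TERM
    else
        # Servers keep their state for a rejoin: SIGTERM, never SIGKILL
        killproc "$CONSUL" -TERM
    fi
}
```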

But honestly, I'm questioning the use of the TERM case at all in the client section. Any thoughts on this before I send in a PR?

As far as I can tell, this "kill -9" usage is unique to the sysv script. Every other method uses "consul" leave, or possibly TERM. Debian escalates TERM to KILL after a timeout, but doesn't start there.

tl;dr : The sysv script uses kill -9 on consul and I don't think it should.

@aj-jester

> There's also a problem where if the leave works on a client, the kill will fail, resulting in a "FAILED" response from $retcode.
>
> I also have observed some cases of clients in a "failed" state where they should have left, which I think is down to a race condition between issuing a leave and the subsequent 'kill -9'.

@pforman What would happen if leave_on_terminate is set to true, we keep consul leave, and then execute kill -TERM right after? Would that cause a race condition as well? Would the state of the client be failed or left?

Edit: Updated question for clarity

@pforman

pforman commented Aug 21, 2015

If leave_on_terminate is true, then a TERM should issue another leave if it actually runs. Normally leave_on_terminate is false. (There's another default-but-modifiable behavior around SIGINT as well: by default, SIGINT will cause a leave unless configured not to.)
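For reference, both knobs live in the agent's JSON config; the values shown here are just the defaults made explicit, as an illustration rather than a recommendation:

```json
{
  "leave_on_terminate": false,
  "skip_leave_on_interrupt": false
}
```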

Normally the leave works correctly. In some rare cases, I've seen it not work, but I can't reliably duplicate it (yet). Luckily my test cluster just triggered it for me.

Here's what I got:

[root@ip-10-103-65-99 ~]# /etc/init.d/consul stop
Shutting down consul: Error leaving: client closed

Here's the log:

    2015/08/21 03:25:39 [INFO] agent: Synced check 'service:nginx'
    2015/08/21 03:31:53 [INFO] agent: Synced check 'service:nginx'
    2015/08/21 03:34:34 [INFO] agent.rpc: Accepted client: 127.0.0.1:49830
    2015/08/21 03:34:34 [INFO] agent.rpc: Accepted client: 127.0.0.1:49831
    2015/08/21 03:34:34 [INFO] agent.rpc: Graceful leave triggered
    2015/08/21 03:34:34 [INFO] consul: client starting leave
    2015/08/21 03:34:34 [INFO] serf: EventMemberLeave: ip-10-103-65-99 10.103.65.99
    2015/08/21 03:34:34 [INFO] agent: requesting shutdown
    2015/08/21 03:34:34 [INFO] consul: shutting down client
    2015/08/21 03:34:34 [INFO] agent: shutdown complete

This looks identical to any other client leave in the logs.

So it looks like the shutdown code returned "1", but then shut down correctly anyway. Checking from another client gives a "left" status:

[zinc@ip-10-103-67-243 ~]$ consul members
Node              Address             Status  Type    Build  Protocol  DC
ip-10-103-66-5    10.103.66.5:8301    alive   server  0.5.2  2         pfdemo-us-west-2
ip-10-103-65-5    10.103.65.5:8301    alive   server  0.5.2  2         pfdemo-us-west-2
ip-10-103-65-99   10.103.65.99:8301   left    client  0.5.2  2         pfdemo-us-west-2
ip-10-103-67-243  10.103.67.243:8301  alive   client  0.5.2  2         pfdemo-us-west-2
ip-10-103-67-5    10.103.67.5:8301    alive   server  0.5.2  2         pfdemo-us-west-2

This may just be a consul thing from internal goroutines and the shutdown order. I'll look over on their issues. As far as I can tell, the TERM is just sort of superfluous, unless you really want to force a "failed" state instead of left.

I haven't ever seen a leave actually fail, but with the old script I did see some "failed" states come about when clients should have left. I think that's because the SIGKILL can't be caught, and the process just terminates abruptly.

Seems like having clients simply do "consul leave" and having servers do "kill -TERM" is probably the best. Having clients use INT (which the handler will generate a leave from) is also an option.

Not sure if that's more or less confusing. I spent some quality time reading the consul signals code today to get what understanding I do have.
It's at https://github.com/hashicorp/consul/blob/master/command/agent/command.go

I'll dig in a little more, and go check the issues of consul to see if they've noticed this behavior.

@aj-jester

> As far as I can tell, the TERM is just sort of superfluous, unless you really want to force a "failed" state instead of left.
>
> Seems like having clients simply do "consul leave" and having servers do "kill -TERM" is probably the best. Having clients use INT (which the handler will generate a leave from) is also an option.

@pforman So then why not remove consul leave and just use INT? Isn't that what we want, for consul to leave gracefully? That way we don't have to deal with any race condition around consul leave.

@pforman

pforman commented Aug 21, 2015

I think that's viable. Let me beat on it a bit and see if I can get it to misbehave in any way.

The nice part about using INT is that it's controllable by config_hash, using the "skip_leave_on_interrupt" parameter.

All of this discussion about leave-vs-TERM is currently only the case for "sysv", though. Everything else appears to just use "consul leave". Seems like a lot of us with weird needs are using sysv-based systems...

I'm not sure if consistency across the init types is a good goal or not.

@aj-jester

@pforman Thank You! The more testing the better 😄

Currently consul leave is only implemented for the sysv, debian, and upstart scripts. So if you can confirm INT isn't doing any funny business, I will update the upstart and debian scripts to use INT too.

So please keep us posted 👍

@pforman

pforman commented Aug 21, 2015

Initial results with INT look good. We get this in the log (turned up to DEBUG):

==> Caught signal: interrupt
==> Gracefully shutting down agent...
    2015/08/21 05:05:15 [INFO] consul: client starting leave
    2015/08/21 05:05:15 [DEBUG] serf: messageLeaveType: ip-10-103-65-99
    2015/08/21 05:05:15 [DEBUG] serf: messageLeaveType: ip-10-103-65-99
    2015/08/21 05:05:15 [DEBUG] serf: messageLeaveType: ip-10-103-65-99
    2015/08/21 05:05:15 [INFO] serf: EventMemberLeave: ip-10-103-65-99 10.103.65.99
    2015/08/21 05:05:15 [DEBUG] memberlist: Initiating push/pull sync with: 10.103.65.5:8301
    2015/08/21 05:05:15 [DEBUG] http: Shutting down http server (127.0.0.1:8500)
    2015/08/21 05:05:15 [INFO] agent: requesting shutdown
    2015/08/21 05:05:15 [INFO] consul: shutting down client
    2015/08/21 05:05:15 [INFO] agent: shutdown complete

The one oddity is that the script contains this:

        rm -f /var/lock/subsys/consul $PID_FILE

That's too fast: if the rm is allowed to remove $PID_FILE first, consul itself complains in the logs when it subsequently tries to remove it. The signal is sent to consul, but then the script marches on; it turns out the script is quicker at issuing rm than consul is at shutting down.

    2015/08/21 04:58:03 [WARN] agent: could not delete pid file  Could not remove pid file: stat /var/run/consul/consul.pid: no such file or directory

Still looking into the original condition that causes "consul leave" to occasionally generate a return code of 1. Seems like that should be sent upstream. "consul leave" is a bit more readable than "kill -INT", so I'd hate to obfuscate everything if we can find the root issue.

@aj-jester

@pforman 💡 We can do one of the following:

  1. We can check whether the process still exists with kill -0 $PID (possibly a fixed loop count with a one-second back-off)?
  • This raises another question. If after x tries the process is still running, we have to decide whether to exit right then and echo "hey, something went wrong, pid is still running", or to forcefully kill -9 and clean up the pid and lock files.

OR

  2. We can just issue INT and assume consul took care of everything else, but on start-up ensure the pid is not running (if it is, do kill -9) and clean up the lock and pid files.
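Option 1 could be sketched roughly like this; the function and variable names are hypothetical, not taken from any existing script:

```shell
#!/bin/sh
# Hypothetical sketch of option 1: poll for the process to exit with a
# bounded one-second back-off before removing the pid and lock files.
wait_then_cleanup() {
    pid=$1
    pidfile=$2
    lockfile=$3
    tries=0
    while kill -0 "$pid" 2>/dev/null && [ "$tries" -lt 10 ]; do
        sleep 1
        tries=$((tries + 1))
    done
    if kill -0 "$pid" 2>/dev/null; then
        echo "consul (pid $pid) is still running; leaving files in place" >&2
        return 1
    fi
    rm -f "$pidfile" "$lockfile"
}
```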

Also, I kind of understand your point about consul leave being more clear. But the log doesn't really show where the leave was triggered from. It would have been nice if consul added a flag called -reason or something, so we could do consul leave -reason 'triggered from init'. Right now it's quite ambiguous.

@aj-jester

@pforman I checked the other init scripts and most are sleeping/waiting before removing the pid file, so I think it's fine to do that here as well (option 1).

Upstart is the exception to that rule, but I will add support to it as well.

@pforman

pforman commented Aug 21, 2015

I opened hashicorp/consul#1189 about this. We'll see what happens.

I think INT plus back-off is satisfactory, and if they fix this issue upstream, going back to consul leave might be preferable.

@aj-jester

@pforman Awesome. Looking forward to your PR 😎 ⛵

@hopperd

hopperd commented Sep 4, 2015

Was just thinking about and working on this exact same issue today. I'd prefer to let the leave_on_terminate property determine the cluster leave/fail behavior; otherwise there are situations in which a plain stop/start can cause a node to leave the cluster when you didn't really want it to.

@pforman have you made headway on this and have a PR ready soon?

@pforman

pforman commented Sep 5, 2015

PR is #181, waiting for merge.

The behavior for both clients and servers is somewhat configurable with the Consul properties, because the script just sends INT or TERM respectively.

If the signal isn't handled in a reasonable time it escalates the signal, eventually to KILL.

Consul is pretty well behaved as a daemon so I haven't seen that happen.
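The escalation described above can be sketched as a stand-alone function like the one below. This is a hypothetical illustration; the actual PR relies on the distribution's init helpers rather than raw kill calls.

```shell
#!/bin/sh
# Hypothetical sketch: send INT or TERM first, wait a bounded time, and
# only escalate to KILL if the daemon has not exited.
stop_with_escalation() {
    pid=$1
    sig=${2:-TERM}
    kill -"$sig" "$pid" 2>/dev/null || return 0   # already gone
    tries=0
    while kill -0 "$pid" 2>/dev/null && [ "$tries" -lt 5 ]; do
        sleep 1
        tries=$((tries + 1))
    done
    # Last resort only: SIGKILL cannot be caught, so agent state is lost
    kill -0 "$pid" 2>/dev/null && kill -KILL "$pid" 2>/dev/null
    return 0
}
```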

@solarkennedy

Merged in #181.
I look forward to the day when we can just "use the upstream init script" :(

@aj-jester

Something like this could be used to standardize init scripts https://github.com/jordansissel/pleaserun
