consul should not 'leave' for init script 'stop' action #85

Closed
runswithd6s opened this issue Feb 24, 2015 · 6 comments · Fixed by #87

@runswithd6s
Contributor

When consul is provided with the leave subcommand, the node is removed from the cluster. This requires that the node be added back to the cluster at reboot with a join action. This breaks the expected behavior for the service, in which the node automatically rejoins the cluster upon service start. A more acceptable init-script behavior would be to kill the consul process so it does not remove itself from the cluster.
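A minimal sketch of the proposed 'stop' behavior, assuming a PID file at /var/run/consul/consul.pid (the path is illustrative, not what the module currently ships):

```sh
stop() {
    # Terminate the process rather than running 'consul leave', so the node
    # keeps its cluster membership and rejoins automatically on 'start'.
    # (Which signal is most appropriate is discussed further down.)
    pid="$(cat /var/run/consul/consul.pid)"
    kill "$pid"
}
```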

@solarkennedy
Contributor

I think I agree. That init script was probably stolen from
https://gist.github.com/stojg/d0d96a123761830c0ff1

Can you PR?

@runswithd6s
Contributor Author

runswithd6s commented Feb 25, 2015 via email

@EvanKrall
Contributor

Would consul leave on a server cause the other servers to decrease the expected number of servers w.r.t quorum calculations? That would be somewhat scary.

@runswithd6s
Contributor Author

It appears that bootstrap-expect is a "maybe" condition that only applies if no raft data has been initialized; if you've already bootstrapped, it is ignored. The expected quorum is managed dynamically based on the number of hosts in the cluster. We've been playing with a Vagrant build of a 3-server cluster and ran into issues that led us to recognize this behavior. Both `consul leave` and `kill -INT PROCESSID` will cause the node to leave the cluster. A leader can still be elected by the remaining two hosts, thanks to the time-based elections, but they will not expect a third server until you restart and re-join it.
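A quick sketch of the stop methods we compared (the PID file path is illustrative):

```sh
# Graceful departure: the node removes itself from the cluster.
consul leave

# Sending SIGINT has the same effect: the node leaves the cluster.
kill -INT "$(cat /var/run/consul/consul.pid)"

# A hard kill preserves the node's state, so it can rejoin after a restart.
kill -KILL "$(cat /var/run/consul/consul.pid)"
```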

We've run into issues with using bootstrap-expect to elect leaders at all and have needed to manually bootstrap the cluster. This closed bug plagues us now: hashicorp/consul#370. Here's the comment that is telling for us:

All of the server nodes ever leaving is not considered a standard operating case. The servers are long running, and if you expect to run more than one for HA, it is an outage scenario if only one is running. In the case of all the machines losing power / failing, when they start up it will automatically heal. In the case of all nodes leaving the cluster and shutting down, that is not considered a normal mode of operation.

Recovery for this: https://www.consul.io/docs/guides/outage.html

If all three servers have 'left' the cluster, recovering from this logically forced outage isn't as simple as clearing out the peers.json file. Leaders can't be elected, and the KV data is effectively lost. Perhaps it is because the nodes are no longer in a bootstrap mode. Who knows. We have found that sending SIGKILL to the consul process preserves its state, at least to the extent that the nodes can rejoin the cluster without losing information.
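For reference, the recovery path in the outage guide linked above involves stopping every server and rewriting the raft peers file before restarting. A minimal sketch, assuming a data directory of /var/lib/consul and the address:port array format the guide describes (addresses are illustrative):

```sh
# Run on each server while consul is stopped; adjust the data dir and
# addresses for your cluster, then start the servers again.
cat > /var/lib/consul/raft/peers.json <<'EOF'
["10.0.0.1:8300", "10.0.0.2:8300", "10.0.0.3:8300"]
EOF
```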

From a pragmatic position, it is better for a 'server' process to 'die' rather than 'leave'. For agent services, it's a different story.

@solarkennedy
Contributor

I'm down with this. The upstream upstart scripts don't leave
https://github.com/hashicorp/consul/tree/0c7ca91c74587d0a378831f63e189ac6bf7bab3f/terraform/aws/scripts

@runswithd6s if you make a PR I would accept it or I will do it myself.

@runswithd6s
Contributor Author

Ok. We're almost done with the investigation spike. I'll talk to our team to see if I can carve out some time to do it.

runswithd6s pushed a commit to runswithd6s/puppet-consul that referenced this issue Feb 25, 2015
This fixes voxpupuli#85. The 'stop' action in init scripts for sysv and Debian
will only 'leave' the cluster if acting as an agent. When running as a
server, as determined by a call to `consul info`, kill the process
instead.

Both updates also enforce the use of a PID file located in a
/var/run/consul directory, writeable by the consul::user configured in
Puppet.
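A rough sketch of the stop logic described in the commit message above; the exact `consul info` check and PID file name here are assumptions, not copied from the PR:

```sh
stop() {
    pid="$(cat /var/run/consul/consul.pid)"
    if consul info | grep -q 'server = true'; then
        # Server: kill the process so it does not remove itself from the cluster.
        kill "$pid"
    else
        # Agent: a graceful leave is acceptable.
        consul leave
    fi
}
```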