consul should not 'leave' for init script 'stop' action #85

Closed
runswithd6s opened this issue Feb 24, 2015 · 6 comments · Fixed by #87

@runswithd6s
Contributor

When consul is provided with the leave subcommand, the node is removed from the cluster. This requires that the node be added back to the cluster at reboot with a join action. This breaks the expected behavior for the service, in which the node automatically rejoins the cluster upon service start. A more acceptable init-script behavior would be to kill the consul process so it does not remove itself from the cluster.
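A minimal sketch of the proposed 'stop' behavior, assuming a PID file at /var/run/consul/consul.pid (the path is illustrative, not what the module currently ships):

```sh
stop() {
    # Terminate the process rather than running 'consul leave', so the node
    # keeps its cluster membership and rejoins automatically on 'start'.
    # (Which signal is most appropriate is discussed further down.)
    pid="$(cat /var/run/consul/consul.pid)"
    kill "$pid"
}
```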

@solarkennedy
Contributor

I think I agree. That init script was probably stolen from
https://gist.github.com/stojg/d0d96a123761830c0ff1

Can you PR?

@runswithd6s
Contributor Author

runswithd6s commented Feb 25, 2015 via email

@EvanKrall
Contributor

Would consul leave on a server cause the other servers to decrease the expected number of servers w.r.t quorum calculations? That would be somewhat scary.

@runswithd6s
Contributor Author

It appears that bootstrap-expect is a "maybe" condition that only applies if no raft data has been initialized; if you've already bootstrapped, it is ignored. The expected quorum is managed dynamically based on the number of hosts in the cluster. We've been playing with a Vagrant build of a 3-server cluster and ran into issues that led us to recognize this behavior. Both `consul leave` and `kill -INT PROCESSID` will cause the node to leave the cluster. A leader can still be elected by the remaining two hosts, thanks to the time-based elections, but they will not expect a third server until you restart and re-join it.
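A quick sketch of the stop methods we compared (the PID file path is illustrative):

```sh
# Graceful departure: the node removes itself from the cluster.
consul leave

# Sending SIGINT has the same effect: the node leaves the cluster.
kill -INT "$(cat /var/run/consul/consul.pid)"

# A hard kill preserves the node's state, so it can rejoin after a restart.
kill -KILL "$(cat /var/run/consul/consul.pid)"
```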

We've run into issues with using bootstrap-expect to elect leaders at all and have needed to manually bootstrap the cluster. This closed bug plagues us now: hashicorp/consul#370. Here's the comment that is telling for us:

All of the server nodes ever leaving is not considered a standard operating case. The servers are long running, and if you expect to run more than one for HA, it is an outage scenario if only one is running. In the case of all the machines losing power / failing, when they start up it will automatically heal. In the case of all nodes leaving the cluster and shutting down, that is not considered a normal mode of operation.

Recovery for this: https://www.consul.io/docs/guides/outage.html

If all three servers have 'left' the cluster, recovering from this logically forced outage isn't as simple as clearing out the peers.json file. Leaders can't be elected, and the KV data is effectively lost. Perhaps it is because the nodes are no longer in a bootstrap mode. Who knows. We have found that sending SIGKILL to the consul process preserves its state, at least to the extent that the nodes can rejoin the cluster without losing information.
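For reference, the recovery path in the outage guide linked above involves stopping every server and rewriting the raft peers file before restarting. A minimal sketch, assuming a data directory of /var/lib/consul and the address:port array format the guide describes (addresses are illustrative):

```sh
# Run on each server while consul is stopped; adjust the data dir and
# addresses for your cluster, then start the servers again.
cat > /var/lib/consul/raft/peers.json <<'EOF'
["10.0.0.1:8300", "10.0.0.2:8300", "10.0.0.3:8300"]
EOF
```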

From a pragmatic position, it is better for a 'server' process to 'die' rather than 'leave'. For agent services, it's a different story.

@solarkennedy
Contributor

I'm down with this. The upstream upstart scripts don't leave
https://github.com/hashicorp/consul/tree/0c7ca91c74587d0a378831f63e189ac6bf7bab3f/terraform/aws/scripts

@runswithd6s if you make a PR I would accept it or I will do it myself.

@runswithd6s
Contributor Author

Ok. We're almost done with the investigation spike. I'll talk to our team to see if I can carve out some time to do it.

runswithd6s pushed a commit to runswithd6s/puppet-consul that referenced this issue Feb 25, 2015
This fixes voxpupuli#85. The 'stop' action in init scripts for sysv and Debian
will only 'leave' the cluster if acting as an agent. When running as a
server, as determined by a call to `consul info`, kill the process
instead.

Both updates also enforce the use of a PID file located in a
/var/run/consul directory, writeable by the consul::user configured in
Puppet.
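A rough sketch of the stop logic described in the commit message above; the exact `consul info` check and PID file name here are assumptions, not copied from the PR:

```sh
stop() {
    pid="$(cat /var/run/consul/consul.pid)"
    if consul info | grep -q 'server = true'; then
        # Server: kill the process so it does not remove itself from the cluster.
        kill "$pid"
    else
        # Agent: a graceful leave is acceptable.
        consul leave
    fi
}
```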