consul should not 'leave' for init script 'stop' action #85
Comments
I think I agree. That init script was probably stolen from somewhere else anyway. Can you PR?
After working more with consul, there are other considerations. When acting as an agent publishing a service, it is safe and preferred to cleanly 'leave' the cluster. When acting as a server, we don't want to 'leave' unless done so administratively. With sysvinit on Linux, we can use /etc/sysconfig/consul (Red Hat) or /etc/default/consul (Debian) to set a stop-behavior variable that selects the appropriate action depending upon the 'server' parameter in the Puppet init class.

I'll put some thought into it at work tomorrow.

Thanks,
Chad
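For illustration, a minimal sketch of the variable-based approach Chad describes, assuming a hypothetical `CONSUL_STOP_BEHAVIOR` variable and a /var/run/consul/consul.pid PID file; neither name is taken from the module:

```sh
#!/bin/sh
# Sketch only: CONSUL_STOP_BEHAVIOR and the pid file path are illustrative
# assumptions, set by Puppet via the distro's environment file.
[ -r /etc/sysconfig/consul ] && . /etc/sysconfig/consul   # Red Hat
[ -r /etc/default/consul ]   && . /etc/default/consul     # Debian

stop() {
    case "${CONSUL_STOP_BEHAVIOR:-leave}" in
        leave)
            # Agent: deregister cleanly so its services disappear from the catalog.
            consul leave
            ;;
        kill)
            # Server: terminate without leaving so the raft peer state is kept.
            kill "$(cat /var/run/consul/consul.pid)"
            ;;
    esac
}
```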
Would
It appears that the… We've run into issues with using bootstrap-expect to elect leaders at all, and have needed to manually bootstrap the cluster. This closed bug plagues us now: hashicorp/consul#370. Here's the comment that is telling for us:

Recovery for this: https://www.consul.io/docs/guides/outage.html

If all three servers have 'left' the cluster, recovering from this logically forced outage isn't as simple as clearing out the peers.json file. Leaders can't be elected, and the KV data is effectively lost. Perhaps it is because the nodes are no longer in bootstrap mode. Who knows. We have found that sending SIGKILL to the consul service preserves its state, at least to the extent that the nodes can rejoin the cluster without losing information. From a pragmatic position, it is better for a 'server' process to 'die' rather than 'leave'. For agent services, it's a different story.
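For context, the manual recovery the linked outage guide describes amounts to seeding the raft peer list by hand; the data directory, addresses, and service commands below are illustrative assumptions for older consul releases, not commands taken from this thread:

```sh
# On each surviving server (paths and addresses are hypothetical):
service consul stop
# Write the peer set by hand; in older consul releases peers.json lives
# under <data-dir>/raft/ and holds a JSON array of "ip:port" entries.
cat > /opt/consul/raft/peers.json <<'EOF'
["10.0.0.10:8300", "10.0.0.11:8300", "10.0.0.12:8300"]
EOF
service consul start
```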
I'm down with this. The upstream upstart scripts don't leave. @runswithd6s, if you make a PR I would accept it, or I will do it myself.
Ok. We're almost done with the investigation spike. I'll talk to our team to see if I can carve out some time to do it.
This fixes voxpupuli#85. The 'stop' action in the sysv and Debian init scripts will only 'leave' the cluster when acting as an agent. When running as a server, as determined by a call to `consul info`, the process is killed instead. Both updates also enforce the use of a PID file located in /var/run/consul, writable by the consul::user configured in Puppet.
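A rough sketch of what that stop logic could look like, assuming `consul info` reports `server = true` for server nodes and that the PID file sits at /var/run/consul/consul.pid; this is illustrative, not the exact patch:

```sh
# Sketch of the 'stop' branch described in the PR: ask the running agent
# whether it is a server, then choose kill vs leave accordingly.
stop() {
    if consul info 2>/dev/null | grep -q 'server = true'; then
        # Server: kill the process so it stays in the raft peer set.
        kill "$(cat /var/run/consul/consul.pid)"
    else
        # Agent: a clean leave deregisters the node and its services.
        consul leave
    fi
}
```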
When consul is provided with the `leave` subcommand, the node is removed from the cluster. This requires that the node be added back into the cluster at reboot with a `join` action. This breaks expected behavior for the service, in which the node automatically rejoins the cluster upon service start. A more acceptable init-style script behavior would be to kill the consul process so it does not remove itself from the cluster.
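To make the reported behavior concrete, a hedged walk-through with placeholder service commands and a made-up join address:

```sh
service consul stop      # init script currently calls 'consul leave'
service consul start     # after a graceful leave the node does not rejoin on its own
consul join 10.0.0.10    # manual join needed to put it back in the cluster

# Killing the process instead preserves its membership state, so it rejoins
# automatically the next time the service starts.
```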