Servers can't agree on cluster leader after restart when gossiping on WAN #454
Can you provide more log output from the servers after a restart? Specifically, the lines prefixed with "raft:" are of interest. Also, you don't usually need …
I am having the same issue. Note that this only happens with … Here are my steps and the logs: https://gist.github.com/adrienbrault/ad8d13802913b095415a
It looks like both of you are forcing the cluster into an outage state. This is what is happening:
When you start all the servers again, you have 3 servers again, but this time there is no leader and no quorum. At this point it is unsafe for any of the servers to gain leadership (split-brain risk), so they will sit there until an operator intervenes. The best way is to avoid causing an outage in the first place. If quorum is lost, manual intervention is required; any other approach on the part of Consul would introduce safety issues.
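For reference, the quorum arithmetic behind this: Raft requires a majority of the known servers to elect a leader, i.e. quorum = floor(N/2) + 1, so a cluster can only tolerate a minority of servers failing or departing.

| Servers | Quorum | Failures tolerated |
|---------|--------|--------------------|
| 1       | 1      | 0                  |
| 3       | 2      | 1                  |
| 5       | 3      | 2                  |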
Very interesting @armon! Thanks for the explanation.
@armon What about being able to specify the expected quorum size? It is up to the user to use a correct value, like it is for …
@adrienbrault Not currently. There are issues around changing the quorum size once it is specified. I think the current approach makes a very reasonable trade-off: zero-touch bootstrap, and zero-touch scale up and down as long as quorum isn't lost. With a sensible amount of redundancy, it should be incredibly unlikely that an operator needs to intervene.
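As an illustration of the zero-touch bootstrap described above (the data dir, config dir, and join address are placeholders, not taken from this thread), each of the three servers can be started identically, and an election happens only once the expected number of servers have joined:

```sh
# Same command on all three servers; no single node is special.
consul agent -server -bootstrap-expect 3 \
  -data-dir /opt/consul \
  -config-dir /etc/consul \
  -join 203.0.113.11
```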
I can understand why the cluster can't currently recover from losing its quorum, but I too would like to see a way to allow automatic recovery by compromising elsewhere, e.g. having a fixed quorum size (or a fixed maximum number of Consul servers?) and losing the ability to zero-touch scale up and down.

Here's our use case: we have a bunch of servers running in AWS. I'd like to have Consul servers running on three of them, and Consul clients on the rest. So far so good. The catch is that we shut down all but one of the servers each night (to save money while they are idle). It seems that even if I put a Consul server on the lone surviving machine, I'd have to jump through some extra hoops to have the cluster recover each morning?

I don't see us wanting to dynamically add additional Consul servers anytime soon, but automatic recovery even after a complete shutdown (intentional or otherwise) would be very beneficial to us. Thoughts?
Reported the exact same issue when I interrupted the servers in #476. I expected …
Basically, we can safely provide one of two things: …

I don't see a way for us to safely provide both. Currently … So we could have a flag like …
Hi, I'm doing some testing of Consul and just ran into this myself, since restarting the whole cluster at once is trivial with config management tools. I would expect this type of user error to be frequent, especially with folks used to tools that regain quorum when possible (mongo, heartbeat, etc.). I think "never lose quorum" is not tenable in the long run.

It seems like the 80% use case is "nothing bad happened, I just restarted my cluster and would like it to return to operation." In that case it appears that Consul already has most of the information required to do so without having to sacrifice zero-touch scale or bootstrap? It knows the previous server list from raft/peers.json and could remember who the last leader was as a replicated fact. Quorum re-convergence then could simply be "all the previous peer members are gossiping again, and I was the previous leader, let's try an election"?

-n
Can we at least have the steps documented somewhere for restoring the cluster in case we lose quorum?
@mohitarora Is that not covered here? http://www.consul.io/docs/guides/outage.html
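For readers following along, the linked outage guide (for pre-0.7 Consul) comes down to writing a raft/peers.json file on every server before restarting it: a plain JSON array of each server's IP and Raft port (8300 by default). The addresses below are placeholders:

```json
[
  "203.0.113.11:8300",
  "203.0.113.12:8300",
  "203.0.113.13:8300"
]
```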
@ryanbreen That didn't help. Here is what I did: …

Everything looks good at this point. I forced the cluster to lose quorum by restarting 2 of the 3 nodes at the same time. The nodes came back online, but a leader was not elected on its own. I want to know what my next step should be here. Should I again start node 1 in bootstrap mode and re-execute the steps mentioned above?
I would suggest bootstrapping with -bootstrap-expect instead.
Thanks @ryanbreen. -bootstrap-expect is better than -bootstrap and I will start using it, but both of these are used when the cluster is initialized. I still need the steps for recovery once quorum is lost. In my case, no leader was elected once all the nodes came back to life after the quorum loss.
I also had a similar question to @mohitarora's. If we do lose quorum, and there is no leader, what do we do?
@saulshanabrook Then you're in an outage scenario (since you've lost quorum) and need to decide which server is authoritative, then follow the outage recovery guide. One thing I've also found: if your leaders actually leave the cluster when shut down cleanly (rather than …
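A rough sketch of that recovery flow for the older peers.json mechanism, assuming three servers whose IPs have not changed (paths, service names, and addresses are illustrative; the outage guide linked earlier is authoritative):

```sh
# 1. Stop Consul on every server.
sudo service consul stop

# 2. On each server, write the full peer list into the Raft data directory.
cat > /opt/consul/raft/peers.json <<'EOF'
["203.0.113.11:8300", "203.0.113.12:8300", "203.0.113.13:8300"]
EOF

# 3. Start the servers again and confirm a leader was elected.
sudo service consul start
consul info   # the raft section should show "state = Leader" on exactly one node
```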
@highlyunavailable What if all three go down at once? How do I restart then?
If they went down hard, or if you had leave-on-terminate set to false, you should be able to just start them back up in any order, assuming their IPs didn't change. If they did change, you need to do outage recovery.
I'm getting the same problem with … I'm not following why it doesn't use this value to know when it's safe to elect a leader, whether there was ever a leader before or not. Here are some logs as requested in the thread, after a restart of a node: …
This is becoming a problem for me in production and I may have to move away from Consul. If I stop all 3 Consul nodes (in my 3-node cluster), I cannot start the cluster back up without a major headache. Are there any thoughts as to how to properly handle this?
Most of these problems are caused by our default behavior of attempting a graceful leave. Our mental model is that servers are long-lived and don't shut down for any reason other than unexpected power loss, or graceful maintenance, in which case you need to leave the cluster. In retrospect that was a bad default. Almost all of this can be avoided by just killing the process rather than letting it leave gracefully.

There is clearly a UX issue here with Consul that we need to address, but this behavior is not a bug. It is a manifestation of bad UX leading to operator error that is causing a quorum loss in a way that is predictable and expected. You can either tune the settings to non-default behavior and force a non-graceful exit, or just "pull the plug" with kill.

This is a classic "damned if you do, damned if you don't": if we change the defaults to the inverse, we will have a new corresponding ticket where anybody who expected the leave to be graceful has now caused quorum loss by operator error in the reverse sense. I'm not sure what the best answer is here.
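A sketch of the relevant agent settings (these config keys exist, but defaults have shifted between releases, so treat the values as illustrative): telling servers not to gracefully leave on SIGINT/SIGTERM means a restart is treated like a crash, so the old quorum can re-form when the servers come back.

```json
{
  "server": true,
  "skip_leave_on_interrupt": true,
  "leave_on_terminate": false
}
```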
Hey @armon, I think there may actually be a bug here though, no? In order to fix the issue I have to: …

The issue with peers.json is that its actual contents seem to be written as "null" (i.e. the literal string "null").
@jwestboston From the perspective of Consul, all three servers have left that cluster. They should not rejoin any gossip peers or take part in any future replication. If they had not left the cluster, they would still have those peers and would rejoin the cluster on start. Because all of the servers, or a majority of them, have done this, it is an outage that now requires manual intervention. Does this make sense?
@armon Ahh, yes, that does make sense! :-) Thanks for the clarification. The peers.json file is actually a list of the peers expected to be in the cluster at the current time. Stopping Consul == gracefully exiting the cluster == removal from that list across the cluster. So really, indeed, things are operating as designed. And all we need (maybe? perhaps?) is an easier experience for cold-restarting an existing Consul cluster.
So we should potentially set skip_leave_on_interrupt to true?
I have an Ansible playbook that will bounce the cluster in this situation. It pushes out a valid peers.json based on the hosts file (I had to do a kludge to get the double quotes right with sed), then restarts the consul service (I run Consul as an installed service on Linux).
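Not the poster's actual playbook, but a minimal shell sketch of the same idea (the hosts file path, data directory, and service name are assumptions): build peers.json from a list of server IPs, one per line, then restart the service.

```sh
# Turn a one-IP-per-line hosts file into ["ip:8300","ip:8300",...] and install it.
awk '{printf "%s\"%s:8300\"", (NR==1 ? "[" : ","), $1} END {print "]"}' \
  /etc/consul/servers.txt > /opt/consul/raft/peers.json

sudo service consul restart
```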
I'm running into similar problems by just Ctrl-C shutting down agents and then starting them up again. So has this issue ever been solved?
@tkyang99 do you have … set?
@armon Regarding your post, this seems to be expected behaviour. Thanks for your clarification at this point! Is there any way to prepare the Consul cluster for such restarts, like shutting down 2 of the 3 instances before stopping the instances? Is there a feature planned to be implemented in future releases?

Cheers,
Hi, I am getting the following error:

2016/05/09 09:56:50 [ERR] consul: 'cf-vaultdemo-vault_consul-0' and 'cf-vaultdemo-vault_consul-1' are both in bootstrap mode. Only one node should be in bootstrap mode, not adding Raft peer.

Node 1: …
Node 2: …
Node 3: …
I am having the same issue with Consul 0.6.4. After playing with it for a couple of days, I found that the easiest way to fix this is: …

In my Ansible playbook, I have a shell task that deals with this problem: … Hope this helps.
I was having an issue with a 3-node cluster, so I figured I'd restart the nodes to see if that addressed it. First I created a /data/raft/peers.json file populated as described in this guide: … I tried @angelosanramon's suggestion, which got me slightly further but not far enough.

==> Log data will now stream in as it occurs: …
At this point I am actually going to tear down my entire Consul cluster and start from scratch. This is definitely an issue.
Hi @ljsommer, sorry about that.
It looks like you have the leave entry in your Raft log, which is un-doing your peers.json change. This should be fixed in master, as the peers.json contents are applied last; any chance you can try this with the 0.7.0-rc2 build?
I have the good fortune to be able to actually rebuild from scratch without losing any critical data, and yes, I'll definitely be using the latest version of Consul to do it. When I get fully rebuilt I will be simulating this same scenario and documenting the results. I'll make sure to update this thread with a step-by-step guide when I do.
Any updates on your attempts, @ljsommer? We are also facing this issue. We used the Consul on Kubernetes recipe from https://github.com/kelseyhightower/consul-on-kubernetes, hosted on GKE. It's a typical three-peer cluster. GKE crashed all the nodes when scaling up the cluster, and since then the nodes have stopped electing a leader. Finally I had to remove and re-deploy them.
Facing this issue as well with Consul on Kubernetes.
Same here!
I too am using Nomad + Consul in a multi-region setup (three AWS regions so far) with cloud auto-join settings. The option …
Not setting …
At least setting up …
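For context on the cloud auto-join setup being described: in that era of Consul, retry_join accepts a provider string and bootstrap_expect still controls when the first election happens. The tag names and server count below are hypothetical:

```json
{
  "server": true,
  "bootstrap_expect": 3,
  "retry_join": ["provider=aws tag_key=consul-role tag_value=server"]
}
```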
@armon This is definitely still an issue. Why was this bug closed? |
@mud5150 This is a very old issue - closed in 2014. If you are still seeing this behaviour, could you create a new issue with your setup and a reference to this issue?
This is totally still an issue. It would be great if there were a way to manually designate a leader in this case. I'm having a production outage right now because of this. Thanks @hashicorp-support
@davidhhessler You are commenting on a closed issue from 2014. If you are having an issue, I would suggest opening a new issue, providing the information required by the template, and sharing more details so the development team can evaluate it. If this is not a bug, then the best place to ask would be the community channels:
Hi,
I'm running Consul in an all-"WAN" environment, one DC. All my boxes are in the same rack, but they do not have a private LAN to gossip over.
The first time they join each other, with an empty /opt/consul directory, they manage to join and agree on a leader. If I restart the cluster, they still connect and find each other, but they never seem to agree on a leader. They just keep repeating

2014/11/05 13:09:41 [ERR] agent: failed to sync remote state: No cluster leader

in the consul monitor output.

All nodes are started with /usr/local/bin/consul agent -config-dir /etc/consul

server 1: …
server 2: …
server 3: …
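The three server configs are not shown above. As a purely illustrative sketch of a minimal server config for a single-DC cluster that gossips over public ("WAN") addresses (all IPs are placeholders):

```json
{
  "server": true,
  "bootstrap_expect": 3,
  "datacenter": "dc1",
  "data_dir": "/opt/consul",
  "bind_addr": "203.0.113.11",
  "advertise_addr": "203.0.113.11",
  "retry_join": ["203.0.113.12", "203.0.113.13"]
}
```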