Consul servers won't elect a leader #993
I have 3 consul servers running (+ a handful of other nodes), and they can all speak to each other - or so I think; at least they're sending UDP messages between themselves. The logs still show `[ERR] agent: failed to sync remote state: No cluster leader`, so even if the servers know about each other, it looks like they fail to perform an actual leader election... Is there a way to trigger a leader election manually? I'm running consul 0.5.2 on all nodes.
This is our consul.json config:
Even numbers belong to the sit0 data center, odd numbers belong to the sit1 data center. So it should already be in bootstrap mode? (We always start the consul servers in bootstrap mode; I guess it cannot hurt?)
I've had this same issue. I have brought up a cluster of 3 servers with
After some fiddling, I got it to work now. There are 2 suspects:
@eirslett what did you change to get it to work?
I removed the "retry_join_wan" setting from all servers, and removed the "bootstrap_expect" setting from all servers except one of them. Then I restarted the servers, and the leader election worked.
The servers should all be started with the same bootstrap_expect value. Could you link to some logs from your original configuration so we can see what was going on?
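For reference, the conventional bring-up is the same -bootstrap-expect value on every server plus a join to a known peer; a minimal sketch with placeholder addresses and data dir, not the reporter's actual setup:

```sh
# Run on each of the three servers, substituting its own -bind address;
# the -bootstrap-expect value is identical everywhere.
consul agent -server \
  -data-dir=/var/consul \
  -bind=10.0.1.10 \
  -bootstrap-expect=3 \
  -retry-join=10.0.1.11 -retry-join=10.0.1.12
```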
My configuration didn't have any WAN connections at all, so I doubt it had anything to do with it (unless it's automatic).
Here's the log from the 3 servers in my cluster when this happens:
Also, when this happens, killing the servers with either QUIT or INT still leaves them in this state; when they come back up, they won't elect a leader.
@reversefold Those look like the same node's log got pasted three times - can you grab the logs for the other two nodes?
Command-line I'm using:
If I try killing them all and replacing
Note that no other consul servers are up at this point, so I don't understand why it's losing leadership...
@slackpad Apologies, I've updated the comment above with the other 2 logs.
I've also tried removing the data directory on one node and starting it with bootstrap. It elects itself as leader but when I start up the other 2 servers they won't accept it as leader.
Also worth mentioning that my config dir only has service entries in it.
@reversefold - After digging more, I think you are seeing the same thing as is being discussed on #750 and #454. I was able to reproduce your situation locally and verify that the consul.data/raft/peers.json file ends up with null.

I'll let @armon and/or @ryanuber weigh in on the best practice to follow to get into a working state. Maybe we should add a new "Failure of all Servers in a Multi-Server Cluster" section in the outage recovery docs, since this seems to be confusing to people. I think it will end up being something similar to the manual bootstrapping procedure, picking one of the servers as the lead. In my local testing I was able to get them going again by manually editing the peers.json file, but I'm not sure if that's a safe thing to do.
@slackpad is correct here. @reversefold when you delete the entire data directory, you are also deleting the raft log entirely, which is why you get the error about the voting terms not matching up (servers with data joining a leader with no data). Editing the peers.json file and leaving the raft data intact is the current best practice for recovering in outage scenarios, and the
I understand, I was just throwing things at the wall to try to make the cluster come back up. This does seem to be a bug in the system, though.

In general, if I am shutting down a consul agent I want to do it "nicely" so the consul cluster doesn't complain about failing checks. I know I shut it down and I don't need spurious errors in my logs or being reported by the consul cluster. Regardless of that, though, this is a state that the consul cluster is putting itself in. If, when a multi-node consul server cluster is shut down with INT, the last server is leaving and is about to put itself into this state, it should throw a warning or error and not do that.

I've also tested this with a single node and the same problem does not happen, which I would expect given your explanation. If I have a single server with
I'm seeing this same issue. Can someone either provide a sample or else point me to the documentation of what the peers.json is supposed to look like (you know, before it gets corrupted with a null value)?
Hi @orclev - there's a sample in the Outage Recovery Guide. It's basically a list of IPs and port numbers for the peers:
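For anyone landing here without the guide handy: with the original peers.json format (pre Raft protocol 3), the file is simply a JSON array of server address:port strings, where 8300 is the server RPC port. A sketch with placeholder addresses and data dir:

```sh
# Stop the server first, then write <data-dir>/raft/peers.json (placeholder values):
cat > /var/consul/raft/peers.json <<'EOF'
[
  "10.0.1.10:8300",
  "10.0.1.11:8300",
  "10.0.1.12:8300"
]
EOF
```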
@slackpad thanks, I somehow missed that. I figured it was probably somewhere in the docs and I was just missing it.
This bug is hitting me as well on 0.5.2.
@slackpad
Hi @juaby. Unfortunately, there's not a good way we could automatically recover from the case where the servers have all left the cluster. I was just going to improve the documentation on how to recover from this case by manually editing the peers.json file, per the comments above.
If this is a state that a human can recognize and fix, then it's a state that a computer can recognize and fix. At the very least there should be a command-line option that will fix this state.
Perhaps we should always start with --bootstrap-expect=1. That way we would work around the bring-up problem. The remaining servers will join subsequently, so in the end we'll get our redundancy back. I believe joining servers need to nuke their raft data. I know this is not recommended, but in our case manual fiddling with descriptors is not an option. @slackpad do you see any issues with this approach?
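A rough sketch of that proposal, with placeholder addresses and data dir; note this is the commenter's workaround, not a documented recovery path:

```sh
# First server bootstraps alone and elects itself leader:
consul agent -server -data-dir=/var/consul -bind=10.0.1.10 -bootstrap-expect=1

# Remaining servers wipe their raft state and join the first one:
rm -rf /var/consul/raft
consul agent -server -data-dir=/var/consul -bind=10.0.1.11 -retry-join=10.0.1.10
```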
I suppose --bootstrap-expect=1 would lead to split brain?
I hope not, as the joining servers would run with --join whereas the first one would not.
After a day of testing, this almost works. It starts with bootstrap-expect=1 and elects itself as leader. The others join and I have my cluster back. Unfortunately, I am running into a case where it decides to give up as leader. For some reason it detects long-dead peers as active and wants to run an election, which it cannot win because, well... the peers are really dead. Is this a bug or is there some reason for that?
I've read through #750, #454 and this issue, but I don't feel like I'm any closer to understanding the Right Way (tm) to do things. Here's the workflow I need and expect:
#2 is where everyone is getting stuck. Consul should Just Work even when all servers bounce due to a planned or unplanned outage - even if a split brain situation occurs temporarily (such as when node 2 / 3 disappears and the cluster loses quorum), the cluster should heal itself when any 2 (or all?) are back up, without requiring manual intervention. If this is infeasible due to technical limitations in raft or Consul, I would like to see explicit documentation detailing which failure (and maintenance!) modes support automatic remediation, best practices around said maintenance, and a clear description of cases requiring manual remediation.
Ok - that RPC error looks like it may have talked to an old server. I'll need to look at the stale issue, as that seems to have been reported by you and another person. If a single server of a 3-server cluster going down causes an outage, that's likely from a stale peer in the Raft configuration. You can use
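The end of that comment is cut off, but the operator subcommands in later releases are presumably the kind of tooling meant for inspecting and dropping a stale Raft peer. A sketch using the 0.8+ syntax, with a placeholder address:

```sh
# List the current Raft peer set, then remove the dead server's entry:
consul operator raft list-peers
consul operator raft remove-peer -address=10.0.1.99:8300
```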
Our staging environment went down in a similar fashion; here's my write-up: https://gist.github.com/haf/1983206cf11846f6f3f291f78acee5cf
Raising my hand as another one hitting this issue. We had the same thing, where the current leader AWS node died, a new one was spun up, and nothing converged.

We ran into the same issue trying to manually fix it, with the "no leader" state making it difficult to find out which raft node is dead and remove it. We also tried doing the peers.json recovery, and that failed because the server wouldn't even start with that file in the format as documented. :(

Our ultimate solution/fix was to blow away all 3 nodes and let it bootstrap from scratch. This left it disconnected from all the agents, but doing a join to the agents that were all still part of the old cluster brought everything back into sync (services anyway, didn't check KV data).

Our cluster is all 0.7.2+. We're still in test mode, so no production impact from it, just some slowed development cycles and an injection of a yellow flag into the consul solution rollout.

This is very easy to reproduce. Set up a new 3-node cluster with --bootstrap 3, wait until it's all converged with a leader, then kill off the leader (terminate the instance). The cluster will never recover.
Isn't this the most basic feature consul should support? Unbelievable it's still not working. Any workarounds?
We've got automation coming in Consul 0.8 that'll fix this - https://github.com/hashicorp/consul/blob/master/website/source/docs/guides/autopilot.html.markdown.
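The relevant Autopilot behaviour here is dead-server cleanup; on 0.8+ it can be inspected and toggled roughly like this:

```sh
# Show the current Autopilot settings, then make dead-server cleanup explicit:
consul operator autopilot get-config
consul operator autopilot set-config -cleanup-dead-servers=true
```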
That is so good to hear :). Our workaround is to scratch the consul data dirs on EVERY master host and re-run puppet, which then sets consul up again. Our set-up automation can handle that pretty well; without this we'd have been lost a couple of times.
Hi, we are also facing this issue (no leader elected after system restart). However, our consul instances are running in docker containers on multiple EC2 instances. Can anyone suggest a simple workaround for the dockerized case?
Closing this out now that Autopilot is available in 0.8.x - https://www.consul.io/docs/guides/autopilot.html. We've also (in 0.7.x):
We also (in 0.7.x) made this change:
@slackpad In our situation, we have a 3-member consul cluster deployed on Kubernetes, with each member in its own pod. We recently made changes to our cluster and did a rolling update. After that, the 3 consuls are running fine according to their status in Kubernetes, but looking at the logs on each member it says no cluster leader. I am able to list all members with consul members (please see below).
Should I try the peers.json file?
Hi @edbergavera if the servers are trying to elect a leader and there are dead servers in the quorum from the rolling update that's preventing it, then you would need to use peers.json per https://www.consul.io/docs/guides/outage.html#manual-recovery-using-peers-json. |
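One wrinkle worth flagging: the peers.json format changed with Raft protocol 3, where servers are identified by node ID rather than bare addresses. A sketch with placeholder IDs, addresses, and data dir; each id has to be the corresponding server's actual node-id:

```sh
cat > /var/consul/raft/peers.json <<'EOF'
[
  { "id": "11111111-2222-3333-4444-555555555551", "address": "10.1.0.1:8300", "non_voter": false },
  { "id": "11111111-2222-3333-4444-555555555552", "address": "10.1.0.2:8300", "non_voter": false },
  { "id": "11111111-2222-3333-4444-555555555553", "address": "10.1.0.3:8300", "non_voter": false }
]
EOF
```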
Hello James, I did follow the instructions described in the outage document, but to no avail. I think this is specific to a Kubernetes pod issue with Consul. So I ended up re-creating the cluster in Kubernetes and restored the KVs, and that worked. Thank you for your suggestion and for looking into this.
Having this exact same issue with
Experienced this issue when I activated raft_protocol version 3; reverting to raft_protocol version 2 fixed the issue. Still investigating why the switch to v3 triggered it.
A cluster of 5 running 0.9.0 will not elect a leader with raft_protocol = 3, but will elect one with raft_protocol = 2. Working config:
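The working config itself isn't preserved here, but the setting being toggled is the server's raft_protocol (also exposed as the -raft-protocol agent flag); a minimal fragment with a hypothetical file path:

```sh
# Hypothetical config fragment pinning the protocol version reported as working above:
cat > /etc/consul.d/raft.json <<'EOF'
{ "raft_protocol": 2 }
EOF
```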
Hi @dgulinobw can you please open a new issue and include a gist with the server logs when you see this? Thanks! |
fix an issue like this hashicorp/consul#993
I also ran into this on a new consul cluster running 1.6.0. As soon as I made sure all the consul servers had both a
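The comment is truncated, so exactly which two settings are meant isn't recoverable; for context, on 1.6 the server-side ACL tokens live under acl.tokens, roughly like this (placeholder values, hypothetical file path):

```sh
# Hypothetical fragment only - not necessarily the settings the commenter meant:
cat > /etc/consul.d/acl.json <<'EOF'
{
  "acl": {
    "tokens": {
      "agent": "REPLACE-WITH-AGENT-TOKEN",
      "default": "REPLACE-WITH-DEFAULT-TOKEN"
    }
  }
}
EOF
```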
@spuder Interesting. What was your token policy for
You saved my day, dear!