cluster goes into inconsistent state #1271
Comments
Guys, could anyone comment on that? Even "you don't understand what you're talking about" would be helpful. By the way, I realized the problem doesn't reproduce if the start order of the nodes is different. What is the rationale behind connecting only to the nodes that were alive some time ago? Why not try the failed nodes as well?
Have you tried the conntrack fix? #1337
Sorry, wrong issue: #1335
@djenriquez
Gotcha. I apologize; I admit I didn't fully understand the problem on first read. Very interested to understand what is happening here as well. I'm first thinking maybe all servers except the first have to call …
@jakubzytka take a look at this: https://consul.io/docs/guides/outage.html
I'm sorry, but I believe you are wrong in pretty much every sentence. Please point out any mistake in my reasoning if you see one.
The outage recovery guide describes a solution for a situation where a majority of the cluster is permanently dead, so there is no quorum anymore (the yellow box on that page says the same thing in different words).
I believe you are confusing failed nodes with dead ones (or, possibly, with nodes that left). The outage recovery guide you mentioned says that in the case of failed nodes it is enough to simply restart the servers; in fact, what the guide recommends is to fail all nodes, change the config, and restart them. The cluster will indeed work if you start the servers in a certain order (i.e. in the reverse order in which they failed), but the order should not matter, because each node has full knowledge of the raft peers. I believe this is the root cause of this issue.
bootstrap-expect 3 is no good, because some node may still be down. bootstrap-expect 2 is a better choice, because the quorum is 2 anyway.
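For illustration, a minimal sketch of what I mean; the addresses and data dir are placeholders, not the exact commands from my test:

```sh
# Sketch only: a 3-server cluster started with -bootstrap-expect=2, so that
# bootstrapping can proceed even while one server is still down.
# Addresses and the data dir are placeholders.
consul agent -server -bootstrap-expect=2 \
  -data-dir=/var/consul \
  -advertise=10.0.0.1 \
  -retry-join=10.0.0.2 -retry-join=10.0.0.3
```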
You are mistaken. An election did start, and a new leader was elected:
2015/10/05 07:55:50 [DEBUG] raft: Votes needed: 2
The problem is that according to raft all peers are there (so the election works), but according to serf each node is in isolation. This is an inconsistency even within a single node.
@jakubzytka Sorry I wasted your time; I didn't read it thoroughly.
@half-dead I hope it doesn't have loopholes and doesn't break serf in some other regard.
I managed to reproduce the steps that get the cluster to fail to elect a leader.
The setup I have:
Steps to reproduce the inconsistent state:
After inspecting peers.json I could see all the good machines there, and the failed machines as well. Since the quorum for a 3-server cluster is 2 nodes, having 2 working nodes and 2 failed ones ...
The steps to recover the cluster are:
I managed to partially automate this by:
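Roughly along these lines; this is only a sketch with placeholder addresses, paths, and service name, not my exact script:

```sh
# Rough sketch of the recovery, not the exact script. Assumptions: consul runs
# as a systemd service, the data dir is /var/consul, and only the hosts listed
# below are alive. On 0.5-era servers, <data-dir>/raft/peers.json is a JSON
# array of "address:port" strings.
LIVE_PEERS='["10.0.0.1:8300","10.0.0.2:8300"]'
LIVE_HOSTS="10.0.0.1 10.0.0.2"

for host in $LIVE_HOSTS; do ssh "$host" 'systemctl stop consul'; done   # stop all surviving servers first
for host in $LIVE_HOSTS; do
  ssh "$host" "echo '$LIVE_PEERS' > /var/consul/raft/peers.json"        # keep only the live peers
done
for host in $LIVE_HOSTS; do ssh "$host" 'systemctl start consul'; done  # then restart them
```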
What I think is the main issue is the quorum of 2 nodes for a 3-server cluster ...
Also, I was wondering if there is a way to reduce the 72 hours for failed nodes to some other value.
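As far as I know 0.5.x has no knob for this, but newer Consul releases added a reconnect_timeout agent option that controls how long failed nodes linger before being reaped (minimum 8h, if I remember correctly). A sketch, with a placeholder config path:

```sh
# Assumption: applies to newer Consul releases, not the 0.5.x discussed here.
# reconnect_timeout shortens the window after which failed nodes are reaped
# (default 72h). The config path is a placeholder.
cat > /etc/consul.d/reap.json <<'EOF'
{
  "reconnect_timeout": "8h"
}
EOF
```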
@engine07 could you please create a separate issue for your case, which is just normal behaviour (the quorum for a 5-node cluster is 3, not 2), and not mix it with my bug report, which relates to something completely different?
Hi, I apologize; it looked to me like an issue with the same origin (it is 3 server nodes and 2 followers, and I thought only server nodes are counted for the quorum).
We are going to close this out. Since 0.5.2 we've changed the leave defaults for servers to not leave the cluster when interrupted or shut down, and we've fixed several issues with Raft's server configuration management. The test scenario in the description works as expected now, and the servers will re-elect a leader.
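For readers who want to pin that behaviour explicitly rather than rely on the defaults, the relevant agent options are, as far as I know, leave_on_terminate and skip_leave_on_interrupt; a sketch, with a placeholder config path:

```sh
# Sketch (assumption: these option names exist in the Consul version you run).
# Servers should not gracefully leave the cluster on SIGTERM/SIGINT, so a
# restart does not shrink the raft peer set. The config path is a placeholder.
cat > /etc/consul.d/leave.json <<'EOF'
{
  "leave_on_terminate": false,
  "skip_leave_on_interrupt": true
}
EOF
```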
Intro
I observe a cluster going into a sort-of inconsistent state where one node becomes the leader but the other nodes claim there is no leader. It also looks as if there is some permanent inconsistency between the raft and serf layers within a single agent regarding node presence, when they should eventually become consistent.
The scenario below involves a failure of all nodes, which, I believe, was (is?) not considered a realistic scenario for Consul. Nevertheless, I kindly ask you to review it because:
Test scenario
The scenario is as follows:
The expected result of step 3 is that:
Results
The Consul version used is 0.5.2. The test environment is Linux, the network is virtual, the agents run inside Docker containers, and -advertise is always used when starting an agent; I expect this shouldn't matter.
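For concreteness, each agent was started roughly like this (a sketch; the image name, container name, network, and addresses are placeholders, not the exact commands used):

```sh
# Sketch only: image, container name, network, and addresses are placeholders.
# Each server advertises its own virtual-network address explicitly.
docker run -d --name consul-A --net=my-virt-net my-consul-image \
  agent -server -bootstrap-expect=3 \
  -data-dir=/var/consul \
  -advertise=10.0.0.1 \
  -retry-join=10.0.0.2 -retry-join=10.0.0.3
```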
What really happens in step 3:
this is surprising; what does that mean? B and A agreed on a leader, so B now wants to commit "A is dead" to the history?
agent: failed to sync remote state: No cluster leader
agent: failed to sync remote state: No cluster leader
agent: Service 'consul' in sync
(???) ... the `consul members` call, but `consul info` tells there are 2 raft peers.
I suppose this is related to the fact that serf data is not persisted. Maybe after a restart the serf layer is oblivious of some nodes that raft knows about? Nevertheless, I'd expect the raft layer to feed serf with some information regarding presence in the cluster (e.g. "hey, serf, could you re-check this guy A? He's taking part in the voting, so maybe you shouldn't reap him after all...")
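A quick way to compare the two views on a given agent (the `consul info` field names and the HTTP status endpoints are as I understand them, so double-check on your version):

```sh
# Compare serf's view with raft's view on one agent. Field names printed by
# `consul info` (serf_lan "members", raft "num_peers") and the status endpoints
# are as I understand them; double-check on your Consul version.
consul members                                    # serf's LAN member list
consul info | grep -E 'num_peers|members'         # raft peer count vs serf member count
curl -s http://127.0.0.1:8500/v1/status/leader    # leader as seen by this agent
curl -s http://127.0.0.1:8500/v1/status/peers     # raft peer set as seen by this agent
```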
Please note that there is a scenario with only one server being down at a time which most likely exhibits the same behaviour (1 node down + a network split).
The logs of all three nodes are here: https://gist.github.com/jakubzytka/c0e6046f4d6d7f5ab788
In the logs the servers are named sles12_[1-3] instead of A, B or C.
From my perspective this is a rather common scenario (a massive power failure or a maintenance shutdown), and I can't see a reason why it shouldn't work straight away. I also suspect the fix should be rather uncomplicated (e.g. feed serf with raft data on start; restart the 72h reap timer).
Workaround
Would passing a list of all known raft peers to `-retry-join` be a good workaround?
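Something along these lines, i.e. a sketch with placeholder addresses:

```sh
# Sketch of the proposed workaround (placeholder addresses): pass every known
# raft peer to -retry-join so a restarted server keeps retrying the others
# even if serf has forgotten about them.
consul agent -server \
  -data-dir=/var/consul \
  -advertise=10.0.0.1 \
  -retry-join=10.0.0.2 -retry-join=10.0.0.3 \
  -retry-interval=15s
```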