
cluster goes into inconsistent state #1271

Closed
jakubzytka opened this issue Oct 5, 2015 · 13 comments


@jakubzytka

Intro

I observe a cluster going into a sort-of inconsistent state where one node becomes the leader but the other nodes claim there is no leader. It also looks as if there is some permanent inconsistency of knowledge regarding node presence between the raft and serf layers within one agent, when they should, eventually, become consistent.

The scenario below involves a failure of all nodes, which, I believe, was (is?) not considered a realistic scenario for consul. Nevertheless, I kindly ask you to review it because:

  • it really happens
  • a similar scenario with one node down at a time and a network split can be conceived
  • it should, in principle, work (I don't know consul's details, but there shouldn't be a problem as far as raft is concerned)
  • the fix should, in principle, be simple (once again, I don't know consul's details, but I suspect what is wrong)
  • I have a workaround in mind, but to introduce it in production it would help to have first-hand confirmation that it makes sense

Test scenario

The scenario is as follows:

  1. start a cluster with 3 servers; let's call them A, B and C; it is bootstrapped with --bootstrap-expect 3
  2. terminate ('without leave') the servers one by one, with a delay of several seconds between subsequent terminations (first A, then B, then C)
  3. start the servers in the same order, so that the node with the oldest history starts first; use the -server option when starting; no -rejoin, no -join, no -bootstrap etc. (a sketch of the commands is shown below)
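
For concreteness, a minimal sketch of the agent invocations in this scenario (node names, addresses, the data-dir, and the initial -join flags are illustrative assumptions; the actual test runs the agents in Docker containers with -advertise set):

# step 1: one agent per host; how B and C first discover A is an assumption here
consul agent -server -bootstrap-expect 3 -node A -data-dir /var/lib/consul -advertise 192.168.9.1
consul agent -server -bootstrap-expect 3 -node B -data-dir /var/lib/consul -advertise 192.168.9.2 -join 192.168.9.1
consul agent -server -bootstrap-expect 3 -node C -data-dir /var/lib/consul -advertise 192.168.9.3 -join 192.168.9.1

# step 2: kill the agents without a graceful leave, several seconds apart (A, then B, then C)
kill -9 <pid of agent A>   # then B, then C

# step 3: restart in the same order, with -server only (no -join, -rejoin, or -bootstrap)
consul agent -server -node A -data-dir /var/lib/consul -advertise 192.168.9.1
consul agent -server -node B -data-dir /var/lib/consul -advertise 192.168.9.2
consul agent -server -node C -data-dir /var/lib/consul -advertise 192.168.9.3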

The expected result of step 3 is that:

  • the first started server (A) starts looking for 2 remaining peers and gets 1 vote in raft voting
  • the second started server (B) becomes the leader, because there is a quorum and it has newer history than A
  • the third started server (C) just reconnects (I'm not sure, maybe another election should be held?)
  • all nodes should act as if turning them off never happened

Results

The consul version used is 0.5.2. The test environment is Linux, the network is virtual, the agents run inside Docker containers, and -advertise is always used when starting an agent. I expect this shouldn't matter.

What really happens in step 3:

  • the first started server (A) starts looking for 2 remaining peers and gets 1 vote in raft voting
  • the second started server (B) becomes the leader, because there is a quorum and it has newer history than A
  • after server B becomes the leader it logs:
consul: member 'A' reaped, deregistering
consul: member 'C' reaped, deregistering

This is surprising; what does that mean? B and A agreed on a leader, so B now wants to commit "A is dead" to the history?

  • server A logs: agent: failed to sync remote state: No cluster leader
  • server C logs: agent: failed to sync remote state: No cluster leader
  • server B logs: agent: Service 'consul' in sync (???)
  • server C reports only itself on a consul members call, but consul info says there are 2 raft peers
  • this result is 100% reproducible

I suppose this is related to the fact that serf data is not persisted. Maybe after a restart the serf layer is oblivious to some nodes that raft knows about? Nevertheless, I'd expect the raft layer to feed serf with some information regarding presence in the cluster (e.g. "hey, serf, could you re-check this guy A? He's taking part in the voting, so maybe you shouldn't reap him after all...")

Please note that there is a scenario with only one server down at a time which most likely exhibits the same behaviour (1 node down + a network split).

The logs of all three nodes are here: https://gist.github.com/jakubzytka/c0e6046f4d6d7f5ab788
In the logs the servers are named sles12_[1-3] instead of A, B or C.

From my perspective this is a rather common scenario (a massive power failure or a maintenance shutdown), and I can't see a reason why it shouldn't work straight away. I also suspect the fix should be rather uncomplicated (e.g. feed serf with raft data on start; restart the 72h reap timer).

Workaround

Would passing a list of all known raft peers to -retry-join be a good workaround?
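
For concreteness, something along these lines (addresses and data-dir are illustrative; -retry-join can be given multiple times):

consul agent -server -data-dir /var/lib/consul -advertise 192.168.9.1 -retry-join 192.168.9.2 -retry-join 192.168.9.3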

@jakubzytka
Author

Guys, could anyone comment on this? Even a "you don't understand what you're talking about" would be helpful.

Btw, I realized the problem doesn't reproduce if the start order of the nodes is different.
It would seem that after a restart a consul server tries to reconnect to the agents that were alive at the time it went down, but not to those that had already failed by then.
Since in the test scenario the nodes that had already gone down are the ones that are back online, and the ones that were still online are the ones that are now failed, there is no reconnection after restart.

What is the rationale behind reconnecting only to the nodes that were alive some time ago? Why not try the failed nodes as well?

@djenriquez

Have you tried the conntrack fix? #1337

@djenriquez

Sorry, wrong issue #1335

@jakubzytka
Author

@djenriquez
This issue is not Docker-related. It is 100% reproducible on bare metal.
There is also no "flapping" observed. Furthermore, there is an inconsistency, not just nodes going down and up.

@djenriquez

Gotcha. I apologize, I admit I didn't fully understand the problem on the first read. Very interested to understand what is happening here as well.

My first thought is that maybe all the servers except the first have to call -join <first server>, but that would be hard to detect, and why wouldn't it just continue where it left off?

Any ideas @armon @slackpad ?

@half-dead

@jakubzytka take a look at this: https://consul.io/docs/guides/outage.html
In my opinion, when the majority of a cluster has failed, simply restarting all the servers will not bring the cluster back, and I've tried many times to verify this.
You can either restart all the servers with the bootstrap-expect 3 flag or follow the steps in the Outage Recovery guide.
Btw, in your test scenario, I think the problem is not that the servers didn't know about each other, it's that the election cannot be started.

@jakubzytka
Author

@half-dead

I'm sorry, but I believe you are wrong in pretty much every sentence. Please point out any mistake in my reasoning if you see one.

The outage recovery guide describes a solution for a situation where a majority of the cluster is permanently dead, so there is no quorum anymore (the yellow box on the page says this in different words).
That requires manual intervention and is perfectly understandable. It is a different case from the one I reported here though, because in my case all 3 nodes come back alive.

  • "In my opinion, when the majority of a cluster have failed, simply restart all the servers will not bring the cluster back, and I've tried many times to verify this."

I believe you are mixing up failed nodes with dead ones (or, possibly, with nodes that left). The outage recovery guide you mentioned says that in the case of failed nodes it is enough to just restart the servers. In fact, what the guide recommends is to fail all the nodes, change the config, and just restart them:
"At this point, you can restart all the remaining servers. If any servers managed to perform a graceful leave, you may need to have them rejoin the cluster using the join command"
Once again, this is perfectly understandable: the nodes that left need to rejoin. There is no need to join failed nodes, because they are part of the cluster and know all their raft peers. Even if the cluster evolved while they were down, they will update their raft log after restart.
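
For the graceful-leave case the guide mentions, rejoining amounts to pointing the restarted server at any live member (address illustrative):

consul join 192.168.9.2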

In fact the cluster will work if you start the servers in a certain order (i.e. in the reverse order in which they failed). But the order should not matter, since each node has full knowledge of its raft peers. I believe this is the root cause of the issue.

  • "You can either restart all the servers with bootstrap-expect 3 flag or follow the steps in the Outage Recovery guide."

bootstrap-expect 3 is no good, because some node may still be down. bootstrap-expect 2 is a better solution, because the quorum is 2 anyway.
It wouldn't work though, as bootstrap-expect does not influence the visibility of nodes in the serf layer, which seems to be the problem.

  • "Btw, in your test scenario, I think the problem is not that the servers didn't know about each other, it's the election can not be started."

You are mistaken. The election did start, and a new leader was elected:

2015/10/05 07:55:50 [DEBUG] raft: Votes needed: 2
2015/10/05 07:55:50 [DEBUG] raft: Vote granted. Tally: 1
2015/10/05 07:55:50 [DEBUG] raft: Vote granted. Tally: 2
2015/10/05 07:55:50 [INFO] raft: Election won. Tally: 2
2015/10/05 07:55:50 [INFO] raft: Node at 192.168.9.2:8300 [Leader] entering Leader state
2015/10/05 07:55:50 [INFO] consul: cluster leadership acquired
2015/10/05 07:55:50 [INFO] consul: New leader elected: sles12_2

The problem is that according to raft all the peers are there (so the election works), but according to serf each node is in isolation. This is an inconsistency even within a single node.
As I stated before, the node that acquired leadership reaps a node that voted for it a few seconds earlier.

@half-dead

@jakubzytka Sorry I wasted your time, I didn't read it thoroughly.
It seems to me that this is by design; simply adding -retry-join when restarting the servers will solve this problem.

@jakubzytka
Author

@half-dead
I suppose rejoin will work, but this means that I (and everybody else) need to parse raft/peers.json to get the addresses, which is somewhat inconvenient and perhaps not the best solution in general (e.g. suppose you are a client - I suppose you do not know about raft peers).
I browsed the consul and serf code and proposed a more general solution in hashicorp/serf#333.

I hope it doesn't have loopholes and doesn't break serf in some other regard.
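
For reference, the kind of parsing I mean is roughly this (a sketch only; the data-dir path is illustrative, and in 0.5.x peers.json is a JSON array of "ip:port" strings):

# extract the peer IPs recorded by raft and feed them to -retry-join
PEERS=$(jq -r '.[]' /var/lib/consul/raft/peers.json | cut -d: -f1)
consul agent -server -data-dir /var/lib/consul $(printf ' -retry-join %s' $PEERS)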

@ilijaljubicic

I managed to reproduce steps that get the cluster to fail to elect a leader.
The thing is, it looks to me like this is not supposed to happen.

The setup I have:
  • 3 server nodes and 2 client nodes
  • server nodes bootstrapped with bootstrap-expect=3
  • data volume persisted on the servers
  • server and client nodes are part of an ASG on AWS with a max number of 1, so that a new server or client machine is always brought up in case of termination
    ---> it will get another IP, and the data volume won't exist in this case

Reproducing the inconsistent state:

  1. kill one of the server machines
    ---> it will show a failed state in the consul GUI
    ---> a new machine will come up and eventually be registered as a new server
    ---> the GUI will show 3 server nodes in a working state, 1 in a failed state, and 2 client nodes
  2. now kill another one of the server nodes
    ---> a new machine will come up, but the cluster will not be able to elect a leader
    ---> I'm not sure now, but it seems that the cluster is already down even without bringing the new machine up

After inspecting peers.json I could see all the good machines there, and the failed machines as well.

Since the quorum for a 3-server cluster is 2 nodes, having 2 working nodes and 2 failed ones
in peers.json should not fail the cluster, but it seems to be doing so.

The steps to recover the cluster are:

  1. edit peers.json on all machines and remove the failed nodes (an example is shown below)
  2. restart all servers and clients
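
For illustration (made-up addresses), the edit in step 1 amounts to trimming the JSON array in raft/peers.json down to the live servers, e.g. from

["10.0.1.10:8300", "10.0.1.11:8300", "10.0.2.20:8300", "10.0.2.21:8300"]

to

["10.0.1.10:8300", "10.0.1.11:8300"]

and making the same change on every remaining server before restarting it.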

I managed to partially automate this by:

  1. regenerating peers.json on every boot by reading the ASG and the tags of the server nodes (a rough sketch is shown below)
  2. adding retry-join with the same list when starting client and server nodes
  3. restarting all other client and server nodes if a cluster failure happens
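
Roughly, the regeneration in step 1 looks like this (a sketch only; the tag name, data-dir path and port are assumptions, and it relies on the AWS CLI to list the running server instances):

# list private IPs of the running consul server instances (the tag name is an assumption)
IPS=$(aws ec2 describe-instances \
  --filters "Name=tag:Role,Values=consul-server" "Name=instance-state-name,Values=running" \
  --query "Reservations[].Instances[].PrivateIpAddress" --output text)
# rewrite peers.json as a JSON array of "ip:8300" entries
printf '[%s]\n' "$(printf '"%s:8300",' $IPS | sed 's/,$//')" > /var/lib/consul/raft/peers.json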

What I think is the main issue is that the quorum of 2 nodes for a 3-server cluster
is somehow lost if peers.json has more than 1 failed node in the list together with the 2 working nodes.

Also, I was wondering if there is a way to reduce the 72 hours for failed nodes to some other value.
For my setup, failed nodes will come back within minutes (in case of a reboot) or will never come back (well, if a machine gets the same IP before the failed node is removed, that is another issue,
since there will be no data folder persisted in that case).
So I would really benefit from being able to set the value for removing failed nodes
to something as low as 10 minutes.

@jakubzytka
Author

@engine07 could you please create an issue for your case, which is just normal behaviour (the quorum for a 5-node cluster is 3, not 2), and not mix it with my bug report, which relates to something completely different?

@ilijaljubicic

Hi, I apologize, it looked to me like an issue with the same origin,
since the fix of parsing peers.json and rejoining worked here as well.

(It is 3 server nodes and 2 followers, and I thought only the server nodes are counted for the quorum.)

@slackpad
Contributor

slackpad commented May 5, 2017

We are going to close this out. Since 0.5.2 we've changed the leave defaults for servers to not leave the cluster when interrupted or shut down, and we've fixed several issues with Raft's server configuration management. The test scenario in the description works as expected now, and the servers will re-elect a leader.

@slackpad slackpad closed this as completed May 5, 2017