Prevent disruptive rejoining node #9333

gyuho · 2018-02-16T23:05:13Z

Restart a follower to an existing cluster with an active leader.
Follower advances election ticks with only tick left for campaign.
Follower's last tick elapses, and becomes a candidate.
Once candidate, it drops all incoming proposals until leader gets elected (thus unavailable).

Ideally, the acting leader sends a heartbeat message to this follower before step 3, so that follower resets its election ticks and stays available without becoming a candidate.

advanceTicksForElection helps cross-datacenter deployments with larger election timeouts, by speeding up its bootstrap process. However, some prefer high availabilities, and disabling advanceTicksForElection can increase the availability of restarting follower node.

/cc @jpbetz

The text was updated successfully, but these errors were encountered:

xiang90 · 2018-02-16T23:10:06Z

We expect the follower to receive a heartbeat from the leader within the heartbeat timeout. This is a reasonable assumption to make. If restarting nodes keep on disrupting the cluster, the user probably set heartbeat timeout to some low value. Also we will have pre vote enable enabled in the future.

I do not think we should make what etcd server does today does today configurable unless I missed something.

gyuho · 2018-02-16T23:14:00Z

Also we will have pre vote enable enabled in the future.

Agree. This should reduce disruptive elections and resolve most issues.

wojtek-t · 2018-02-20T07:29:20Z

Copying some things from the email thread here - I think we either misunderstand some logs or there is some bug in etcd:

Hmm... maybe I'm misunderstanding the log entries, but in my case leader is dropping leadership in that case:

2018-02-11 23:38:16.450210 I | raft: f813a994322b2b40 [term: 9] received a MsgAppResp message with higher term from 9937f2a5f3634020 [term: 10]
2018-02-11 23:38:16.450261 I | raft: f813a994322b2b40 became follower at term 10
2018-02-11 23:38:16.450274 I | raft: raft.node: f813a994322b2b40 changed leader from f813a994322b2b40 to 9937f2a5f3634020 at term 10

where f813a994322b2b40 is the current leader and 9937f2a5f3634020 is a member that restarted.
Is it a bug?

Interestingly, the third member ignores such vote requests:

2018-02-11 23:38:16.430232 I | raft: ad782d6b7abde5c3 [logterm: 9, index: 2853, vote: f813a994322b2b40] ignored MsgVote from 9937f2a5f3634020 [logterm: 9, index: 2364] at term 9: lease is not expired (remaining ticks: 10)

I believe this is what happens - see log entries above.

One more interesting I found is that in the logs of restarted member, election is started before 'peer X become active' and 'established a TCP streaming connection with peer X' log entries:

2018-02-11 23:38:15.892374 I | etcdmain: etcd Version: 3.1.11
(...)
2018-02-11 23:38:16.163995 I | rafthttp: started streaming with peer ad782d6b7abde5c3 (writer)
2018-02-11 23:38:16.164045 I | rafthttp: started streaming with peer ad782d6b7abde5c3 (writer)
2018-02-11 23:38:16.164142 I | rafthttp: started streaming with peer ad782d6b7abde5c3 (stream MsgApp v2 reader)
2018-02-11 23:38:16.192182 I | rafthttp: started streaming with peer ad782d6b7abde5c3 (stream Message reader)
2018-02-11 23:38:16.220700 I | rafthttp: started streaming with peer f813a994322b2b40 (writer)
2018-02-11 23:38:16.220760 I | rafthttp: started streaming with peer f813a994322b2b40 (writer)
2018-02-11 23:38:16.220806 I | rafthttp: started streaming with peer f813a994322b2b40 (stream MsgApp v2 reader)
2018-02-11 23:38:16.249231 I | rafthttp: started streaming with peer f813a994322b2b40 (stream Message reader)
2018-02-11 23:38:16.311931 I | raft: 9937f2a5f3634020 is starting a new election at term 9
2018-02-11 23:38:16.312034 I | raft: 9937f2a5f3634020 became candidate at term 10
2018-02-11 23:38:16.312155 I | raft: 9937f2a5f3634020 received MsgVoteResp from 9937f2a5f3634020 at term 10
2018-02-11 23:38:16.312176 I | raft: 9937f2a5f3634020 [logterm: 9, index: 2364] sent MsgVote request to f813a994322b2b40 at term 10
2018-02-11 23:38:16.312191 I | raft: 9937f2a5f3634020 [logterm: 9, index: 2364] sent MsgVote request to ad782d6b7abde5c3 at term 10
2018-02-11 23:38:16.366186 I | rafthttp: peer f813a994322b2b40 became active
2018-02-11 23:38:16.366257 I | rafthttp: established a TCP streaming connection with peer f813a994322b2b40 (stream Message writer)
2018-02-11 23:38:16.366418 I | rafthttp: established a TCP streaming connection with peer f813a994322b2b40 (stream MsgApp v2 writer)
2018-02-11 23:38:16.369265 I | rafthttp: peer ad782d6b7abde5c3 became active
2018-02-11 23:38:16.369301 I | rafthttp: established a TCP streaming connection with peer ad782d6b7abde5c3 (stream Message writer)
2018-02-11 23:38:16.369448 I | rafthttp: established a TCP streaming connection with peer ad782d6b7abde5c3 (stream MsgApp v2 writer)
2018-02-11 23:38:16.425179 I | rafthttp: established a TCP streaming connection with peer ad782d6b7abde5c3 (stream MsgApp v2 reader)
2018-02-11 23:38:16.434434 I | rafthttp: established a TCP streaming connection with peer ad782d6b7abde5c3 (stream Message reader)
2018-02-11 23:38:16.453593 I | rafthttp: established a TCP streaming connection with peer f813a994322b2b40 (stream MsgApp v2 reader)
2018-02-11 23:38:16.496026 I | rafthttp: established a TCP streaming connection with peer f813a994322b2b40 (stream Message reader)

Does it mean that election started before we even initialized connections to peers?
Shouldn't we wait for initialization of connections (which?) before we start measuring heartbeat timeout?

gyuho · 2018-02-22T22:58:21Z

We are hitting this issue from #5468.

https://github.com/coreos/etcd/blob/df4aafbbdfe5a96699f01bee66b4ddb04d24444f/raft/raft.go#L806-L822

UPDATE: On second thought, #5468 makes sense in case there is an isolated candidate. Responding with pb.MsgAppResp of higher term would free the stuck candidate at the cost of disrupting the current leader. And this is expected; pre-vote should help.

Will enable pre-vote in etcd.

xiang90 · 2018-02-26T20:37:42Z

@gyuho even without pre-vote, the restarted node should not ALWAYS disrupt the leadership. If it is the case, we need to figure out why. Enabling pre-vote might mask the bug.

xiang90 · 2018-02-26T20:38:46Z

@gyuho

@wojtek-t is probably right. we start the timer too fast.

/cc @jpbetz

gyuho · 2018-02-26T20:44:33Z

@xiang90 Right, loosening advance election timeout ticks should help, in addition to pre-vote. Otherwise, we are just hoping restarted follower receives heartbeat before last tick elapse.

raft: f813a994322b2b40 [term: 9] received a MsgAppResp message with higher term

This is clearly follower(or candidate) with higher term responding to leader heartbeat, thus forcing the leader to step down, which is expected in Raft to prevent isolated follower being stuck with repeating elections.

Maybe add more logs around this?

xiang90 · 2018-02-26T20:49:11Z

Right, loosening advance election timeout ticks should help, i

Not really. What we should do it to account for the connection/initialization time when setting the initial election timeout to a low value when the node is restarted. Right now, we set it low to the heartbeat timeout to shorten the initial leader election, where the timeout might be even faster than the initialization time as @wojtek-t pointed out.

mborsz · 2018-04-13T08:51:29Z

Hi,

I was able to reproduce this issue (rejoining node triggers leader election) in version 3.1.13:

8cc587412352bc16 is elected as a leader in the cluster
f34ba41f615b4eff is a follower
2fae52714728f955 rejoins cluster

2fae52714728f955's logs:

A  2018-04-12 22:33:12.130522 I | etcdserver: 2fae52714728f955 initialzed peer connection; fast-forwarding 8 ticks (election ticks 10) with 2 active peer(s) 
A  2018-04-12 22:33:12.260686 I | rafthttp: established a TCP streaming connection with peer 8cc587412352bc16 (stream Message reader) 
A  2018-04-12 22:33:12.260771 I | rafthttp: established a TCP streaming connection with peer 8cc587412352bc16 (stream MsgApp v2 reader) 
A  2018-04-12 22:33:12.260998 I | rafthttp: established a TCP streaming connection with peer f34ba41f615b4eff (stream MsgApp v2 reader) 
A  2018-04-12 22:33:12.261115 I | rafthttp: established a TCP streaming connection with peer f34ba41f615b4eff (stream Message reader) 
A  2018-04-12 22:33:12.384623 I | raft: 2fae52714728f955 is starting a new election at term 8 
A  2018-04-12 22:33:12.384687 I | raft: 2fae52714728f955 became candidate at term 9 
A  2018-04-12 22:33:12.384704 I | raft: 2fae52714728f955 received MsgVoteResp from 2fae52714728f955 at term 9 
A  2018-04-12 22:33:12.384717 I | raft: 2fae52714728f955 [logterm: 8, index: 2307] sent MsgVote request to f34ba41f615b4eff at term 9 
A  2018-04-12 22:33:12.384730 I | raft: 2fae52714728f955 [logterm: 8, index: 2307] sent MsgVote request to 8cc587412352bc16 at term 9 
A  2018-04-12 22:33:12.384741 I | raft: raft.node: 2fae52714728f955 lost leader 8cc587412352bc16 at term 9 
A  2018-04-12 22:33:13.886168 I | raft: 2fae52714728f955 is starting a new election at term 9 
A  2018-04-12 22:33:13.886218 I | raft: 2fae52714728f955 became candidate at term 10 
A  2018-04-12 22:33:13.886235 I | raft: 2fae52714728f955 received MsgVoteResp from 2fae52714728f955 at term 10 
A  2018-04-12 22:33:13.886247 I | raft: 2fae52714728f955 [logterm: 8, index: 2307] sent MsgVote request to f34ba41f615b4eff at term 10 
A  2018-04-12 22:33:13.886259 I | raft: 2fae52714728f955 [logterm: 8, index: 2307] sent MsgVote request to 8cc587412352bc16 at term 10 
A  2018-04-12 22:33:13.896923 I | raft: 2fae52714728f955 received MsgVoteResp rejection from 8cc587412352bc16 at term 10 
A  2018-04-12 22:33:13.896955 I | raft: 2fae52714728f955 [quorum:2] has received 1 MsgVoteResp votes and 1 vote rejections 
A  2018-04-12 22:33:13.897041 I | raft: 2fae52714728f955 received MsgVoteResp rejection from f34ba41f615b4eff at term 10 
A  2018-04-12 22:33:13.897066 I | raft: 2fae52714728f955 [quorum:2] has received 1 MsgVoteResp votes and 2 vote rejections 
A  2018-04-12 22:33:13.897080 I | raft: 2fae52714728f955 became follower at term 10 
A  2018-04-12 22:33:15.023927 I | raft: 2fae52714728f955 [term: 10] received a MsgVote message with higher term from 8cc587412352bc16 [term: 11] 
A  2018-04-12 22:33:15.023972 I | raft: 2fae52714728f955 became follower at term 11 
A  2018-04-12 22:33:15.023986 I | raft: 2fae52714728f955 [logterm: 8, index: 2307, vote: 0] cast MsgVote for 8cc587412352bc16 [logterm: 8, index: 2748] at term 11 
A  2018-04-12 22:33:15.026218 I | raft: raft.node: 2fae52714728f955 elected leader 8cc587412352bc16 at term 11 
A  2018-04-12 22:33:15.075479 I | etcdserver: published {Name:etcd-10.127.240.58 ClientURLs:[http://127.0.0.1:2379]} to cluster e7a652e35e44715f 
A  2018-04-12 22:33:15.075640 I | embed: ready to serve client requests

f34ba41f615b4eff's logs:

A  2018-04-12 22:33:10.810396 W | rafthttp: health check for peer 2fae52714728f955 could not connect: dial tcp 10.127.240.58:2380: getsockopt: connection refused 
A  2018-04-12 22:33:11.868355 I | rafthttp: peer 2fae52714728f955 became active 
A  2018-04-12 22:33:11.868510 I | rafthttp: established a TCP streaming connection with peer 2fae52714728f955 (stream MsgApp v2 reader) 
A  2018-04-12 22:33:11.870258 I | rafthttp: established a TCP streaming connection with peer 2fae52714728f955 (stream Message reader) 
A  2018-04-12 22:33:12.258999 I | rafthttp: established a TCP streaming connection with peer 2fae52714728f955 (stream Message writer) 
A  2018-04-12 22:33:12.259318 I | rafthttp: established a TCP streaming connection with peer 2fae52714728f955 (stream MsgApp v2 writer) 
A  2018-04-12 22:33:12.555504 I | raft: f34ba41f615b4eff [logterm: 8, index: 2748, vote: 8cc587412352bc16] ignored MsgVote from 2fae52714728f955 [logterm: 8, index: 2307] at term 8: lease is not expired (remaining ticks: 9) 
A  2018-04-12 22:33:13.893774 I | raft: f34ba41f615b4eff [term: 8] received a MsgVote message with higher term from 2fae52714728f955 [term: 10] 
A  2018-04-12 22:33:13.893823 I | raft: f34ba41f615b4eff became follower at term 10 
A  2018-04-12 22:33:13.893855 I | raft: f34ba41f615b4eff [logterm: 8, index: 2748, vote: 0] rejected MsgVote from 2fae52714728f955 [logterm: 8, index: 2307] at term 10 
A  2018-04-12 22:33:13.893867 I | raft: raft.node: f34ba41f615b4eff lost leader 8cc587412352bc16 at term 10 
A  2018-04-12 22:33:15.026885 I | raft: f34ba41f615b4eff [term: 10] received a MsgVote message with higher term from 8cc587412352bc16 [term: 11] 
A  2018-04-12 22:33:15.026968 I | raft: f34ba41f615b4eff became follower at term 11 
A  2018-04-12 22:33:15.026991 I | raft: f34ba41f615b4eff [logterm: 8, index: 2748, vote: 0] cast MsgVote for 8cc587412352bc16 [logterm: 8, index: 2748] at term 11 
A  2018-04-12 22:33:15.029122 I | raft: raft.node: f34ba41f615b4eff elected leader 8cc587412352bc16 at term 11

8cc587412352bc16's logs:

A  2018-04-12 22:33:10.603211 W | rafthttp: health check for peer 2fae52714728f955 could not connect: dial tcp 10.127.240.58:2380: getsockopt: connection refused 
A  2018-04-12 22:33:11.769421 I | rafthttp: peer 2fae52714728f955 became active 
A  2018-04-12 22:33:11.860876 I | rafthttp: established a TCP streaming connection with peer 2fae52714728f955 (stream MsgApp v2 reader) 
A  2018-04-12 22:33:11.864928 I | rafthttp: established a TCP streaming connection with peer 2fae52714728f955 (stream Message reader) 
A  2018-04-12 22:33:12.243043 I | rafthttp: established a TCP streaming connection with peer 2fae52714728f955 (stream MsgApp v2 writer) 
A  2018-04-12 22:33:12.243180 I | rafthttp: established a TCP streaming connection with peer 2fae52714728f955 (stream Message writer) 
A  2018-04-12 22:33:12.539016 I | raft: 8cc587412352bc16 [logterm: 8, index: 2748, vote: 8cc587412352bc16] ignored MsgVote from 2fae52714728f955 [logterm: 8, index: 2307] at term 8: lease is not expired (remaining ticks: 6) 
A  2018-04-12 22:33:12.539133 I | raft: 8cc587412352bc16 [term: 8] received a MsgAppResp message with higher term from 2fae52714728f955 [term: 9] 
A  2018-04-12 22:33:12.539162 I | raft: 8cc587412352bc16 became follower at term 9 
A  2018-04-12 22:33:12.539173 I | raft: raft.node: 8cc587412352bc16 changed leader from 8cc587412352bc16 to 2fae52714728f955 at term 9 
A  2018-04-12 22:33:13.877402 I | raft: 8cc587412352bc16 [term: 9] received a MsgVote message with higher term from 2fae52714728f955 [term: 10] 
A  2018-04-12 22:33:13.877469 I | raft: 8cc587412352bc16 became follower at term 10 
A  2018-04-12 22:33:13.877484 I | raft: 8cc587412352bc16 [logterm: 8, index: 2748, vote: 0] rejected MsgVote from 2fae52714728f955 [logterm: 8, index: 2307] at term 10 
A  2018-04-12 22:33:13.877498 I | raft: raft.node: 8cc587412352bc16 lost leader 2fae52714728f955 at term 10 
A  2018-04-12 22:33:15.008423 I | raft: 8cc587412352bc16 is starting a new election at term 10 
A  2018-04-12 22:33:15.008490 I | raft: 8cc587412352bc16 became candidate at term 11 
A  2018-04-12 22:33:15.008506 I | raft: 8cc587412352bc16 received MsgVoteResp from 8cc587412352bc16 at term 11 
A  2018-04-12 22:33:15.008518 I | raft: 8cc587412352bc16 [logterm: 8, index: 2748] sent MsgVote request to 2fae52714728f955 at term 11 
A  2018-04-12 22:33:15.008529 I | raft: 8cc587412352bc16 [logterm: 8, index: 2748] sent MsgVote request to f34ba41f615b4eff at term 11 
A  2018-04-12 22:33:15.011935 I | raft: 8cc587412352bc16 received MsgVoteResp from 2fae52714728f955 at term 11 
A  2018-04-12 22:33:15.011986 I | raft: 8cc587412352bc16 [quorum:2] has received 2 MsgVoteResp votes and 0 vote rejections 
A  2018-04-12 22:33:15.012007 I | raft: 8cc587412352bc16 became leader at term 11 
A  2018-04-12 22:33:15.012035 I | raft: raft.node: 8cc587412352bc16 elected leader 8cc587412352bc16 at term 11

Questions:

Why 2fae52714728f955 says it has two active peers at 22:33:12.130522 while 'established a TCP connection' logs starts at 22:33:12.260686?
What is origin of 8 in 'fast-forwarding 8 ticks (election ticks 10)'? Why "election ticks - 2"?
Can we fix this? :)
Can we make it possible to disable fast-forwarding mechanism completely e.g. by flag?
I believe in our use case it doesn't provide any useful value, but as we can see it degrades availability of the cluster.

gyuho · 2018-04-13T22:09:13Z

@wojtek-t @mborsz We will re-investigate on this.

Why "election ticks - 2"?

Previously, we had only one tick left (election ticks - 1). And patch was adding "one more" tick in order to have more time for the rejoining follower to receive a leader heartbeat thus not starting a disruptive election. Our assumption was that just one more tick would be enough (it's 100ms more when heartbeat is 100 ms and election timeout is 1 sec).

Why 2fae52714728f955 says it has two active peers at 22:33:12.130522 while 'established a TCP connection' logs starts at 22:33:12.260686?

It's because the rebooting follower 2fae52714728f955 rejoins the cluster of active 2 members, and it was able to establish streams onto them.

etcdserver: 2fae52714728f955 initialzed peer connection; fast-forwarding 8 ticks (election ticks 10) with 2 active peer(s)

And this is where the rejoining follower finds two other peers out of 3-node cluster, and fast-forwards 8 ticks, and with only 2 ticks left before it triggers an election. And our expectation was within the last 2 ticks, the rejoining follower is able to receive leader heartbeat and does not start a disruptive election. This is useful for cross-datacenter deploy with larger heartbeat intervals; when one tick is 1-second, without fast-forwarding you would wait the 10 seconds to find out there's no leader, with fast-forward you just wait 2-second. But this is still theoretical scenario, and if this happens very often in your use case, might make sense to make this configurable.

Logs tell that rejoining follower was expecting leader heartbeat within 100ms x 2 (200 ms) at 2018-04-12 22:33:12.130522, but didn't get it, so started an election at 2018-04-12 22:33:12.384623. What were your heartbeat and leader election configuration? Does this happen often?

Before we make this configurable for low network latency environments, I want to be sure that it won't mask any other bugs.

wojtek-t · 2018-04-14T12:29:49Z

Logs tell that rejoining follower was expecting leader heartbeat within 100ms x 2 (200 ms) at 2018-04-12 22:33:12.130522, but didn't get it, so started an election at 2018-04-12 22:33:12.384623. What were your heartbeat and leader election configuration? Does this happen often?

Defaults (heartbeat - 100ms, leader election - 1s).
We see it happening from time to time - it's definitely NOT a one-off.

By default, etcd --initial-election-tick-advance=true, then local member fast-forwards election ticks to speed up "initial" leader election trigger. This benefits the case of larger election ticks. For instance, cross datacenter deployment may require longer election timeout of 10-second. If true, local node does not need wait up to 10-second. Instead, forwards its election ticks to 8-second, and have only 2-second left before leader election. Major assumptions are that: cluster has no active leader thus advancing ticks enables faster leader election. Or cluster already has an established leader, and rejoining follower is likely to receive heartbeats from the leader after tick advance and before election timeout. However, when network from leader to rejoining follower is congested, and the follower does not receive leader heartbeat within left election ticks, disruptive election has to happen thus affecting cluster availabilities. Disabling this would slow down initial bootstrap process for cross datacenter deployments. We don't care too much about the delay in cluster bootstrap, but we do care about the availability of etcd clusters. With "initial-election-tick-advance" set to false, a rejoining node has more chance to receive leader heartbeats before disrupting the cluster. etcd-io/etcd#9333

gyuho self-assigned this Feb 20, 2018

gyuho added the area/raft label Feb 20, 2018

gyuho changed the title ~~Make "advanceTicksForElection" configurable~~ Prevent disruptive rejoining node Feb 23, 2018

gyuho mentioned this issue Feb 23, 2018

*: configure Raft Pre-Vote to reduce disruptive rejoining servers #9352

Merged

gyuho added the stage/investigating label Feb 26, 2018

gyuho mentioned this issue Feb 26, 2018

etcdserver: adjust election timeout on restart #9364

Closed

This was referenced Mar 8, 2018

Release planning for March #9411

Closed

etcdserver: adjust election timeout on restart #9415

Merged

gyuho closed this as completed Mar 11, 2018

gyuho reopened this Apr 13, 2018

gyuho added this to the etcd-v3.4 milestone Apr 13, 2018

This was referenced Apr 19, 2018

*: add --initial-election-tick-advance to configure election fast-forward on bootstrap #9591

Merged

*: support "continue" to label within for-loop etcd-io/gofail#12

Closed

gyuho closed this as completed in #9591 Apr 23, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Prevent disruptive rejoining node #9333

Prevent disruptive rejoining node #9333

gyuho commented Feb 16, 2018 •

edited

Loading

xiang90 commented Feb 16, 2018

gyuho commented Feb 16, 2018 •

edited

Loading

wojtek-t commented Feb 20, 2018 •

edited by gyuho

Loading

gyuho commented Feb 22, 2018 •

edited

Loading

xiang90 commented Feb 26, 2018

xiang90 commented Feb 26, 2018

gyuho commented Feb 26, 2018

xiang90 commented Feb 26, 2018

mborsz commented Apr 13, 2018

gyuho commented Apr 13, 2018 •

edited

Loading

wojtek-t commented Apr 14, 2018

Prevent disruptive rejoining node #9333

Prevent disruptive rejoining node #9333

Comments

gyuho commented Feb 16, 2018 • edited Loading

xiang90 commented Feb 16, 2018

gyuho commented Feb 16, 2018 • edited Loading

wojtek-t commented Feb 20, 2018 • edited by gyuho Loading

gyuho commented Feb 22, 2018 • edited Loading

xiang90 commented Feb 26, 2018

xiang90 commented Feb 26, 2018

gyuho commented Feb 26, 2018

xiang90 commented Feb 26, 2018

mborsz commented Apr 13, 2018

gyuho commented Apr 13, 2018 • edited Loading

wojtek-t commented Apr 14, 2018

gyuho commented Feb 16, 2018 •

edited

Loading

gyuho commented Feb 16, 2018 •

edited

Loading

wojtek-t commented Feb 20, 2018 •

edited by gyuho

Loading

gyuho commented Feb 22, 2018 •

edited

Loading

gyuho commented Apr 13, 2018 •

edited

Loading