Consul server leave try to trigger an election #10970

Closed
dhiaayachi opened this issue Sep 1, 2021 · 0 comments · Fixed by #11375
Summary

When a user triggers a server node to leave a cluster, the following error shows up on the other cluster nodes:
agent.server.raft: rejecting vote request since we have a leader

How To reproduce

Use the script documented in #9755

Working Assumptions

  • The cluster keeps quorum at all times, even after one node leaves. This is crucial for the cluster to remain stable after the node has left (see the quorum sketch below).
  • A leader leaving the cluster will always trigger an election (this should happen only once).
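
As a quick illustration of the first assumption: raft quorum is a strict majority of the voting servers, so a cluster sized with a spare voter still has quorum after a single node leaves. A minimal sketch (not Consul code, just the arithmetic):

```go
package main

import "fmt"

// quorum returns the number of voters a raft majority requires: (n/2)+1.
func quorum(voters int) int {
	return voters/2 + 1
}

func main() {
	// A 5-server cluster needs 3 votes; after one server leaves, the remaining
	// 4 servers still satisfy that quorum, so the cluster stays available.
	fmt.Println(quorum(5)) // 3
	fmt.Println(quorum(4)) // 3
}
```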

Analysis:

When calling consul leave, two scenarios are possible, and each leads to the same issue:

  • The node is the cluster Leader:
    1. The node is removed from the raft servers list using s.autopilot.RemoveServer. As a result, the node stops receiving heartbeats and raft updates, and an election happens to establish a new leader.

    2. The node is removed from serf on both LAN and WAN (not relevant in this case).

    3. The node will wait for 5 seconds (leave_drain_time) here. During those 5 seconds the node is still running its raft goroutines (including the leader loop), and the following will happen:

      • The node will detect that it was removed here
      • The node will set its raft state to Follower because ShutdownOnRemove is false
      • After some time the node will time out on updates/heartbeats and set its state to Candidate (this is because of 1)
      • The Candidate loop will try to trigger an election
      • Two possible cases at this point:
        • The other nodes refuse to trigger an election because a leader is established and the requesting node is not the leader here (a sketch of this check follows the analysis). The leaving node keeps running the Candidate loop, retries, and keeps failing to trigger an election until it shuts down.
        • No leader is established yet, so the vote requests of the leaving node and of the other nodes compete and only one is accepted (newer term). The worst case here is that the leaving node establishes leadership and then shuts down (see 4), which triggers a second election. This should trigger at most 2 elections overall instead of 1, and the cluster will stabilize in the end.
    4. After 5 seconds the node is stopped and all of its raft goroutines are stopped too.

  • The node is a follower:
    1. The node is removed from serf on both LAN and WAN using serf.Leave, which sets the node's serf state to Left.
    2. Serf triggers a reconcile based on the node's serf state change here and removes the node from the raft server list here. As a result, the node stops receiving heartbeats and raft updates.
    3. The node will wait for 5 seconds (leave_drain_time) here. During those 5 seconds the node is still running its raft goroutines (including the follower loop), and the following will happen:
      • After some time the node will time out on updates/heartbeats and set its state to Candidate (this is because of 2)
      • The Candidate loop will try to trigger an election
      • The other nodes refuse to trigger an election because a leader is established and the requesting node is not the leader here
      • The leaving node keeps running the Candidate loop and retries, but fails until it shuts down
    4. After 5 seconds the node is stopped and all of its raft goroutines are stopped too.
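
For reference, the refusal described in both scenarios comes from raft's vote handling: a server that still knows about an established leader will not grant a vote to another candidate, which is what produces the "rejecting vote request since we have a leader" log line from the summary. A simplified, illustrative paraphrase of that check (not verbatim hashicorp/raft code):

```go
package main

import "fmt"

// shouldRejectVote paraphrases the check behind the
// "rejecting vote request since we have a leader" warning: a node that still
// knows about an established leader refuses to grant votes to other candidates.
func shouldRejectVote(currentLeader, candidate string) bool {
	return currentLeader != "" && currentLeader != candidate
}

func main() {
	// A healthy follower still sees server-1 as the leader, so the leaving
	// server-3's vote request is rejected until server-3 finally shuts down.
	fmt.Println(shouldRejectVote("server-1", "server-3")) // true
}
```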

Workaround

The only workaround that can effectively reduce the number of errors is to shrink the window in which this bug can happen by reducing leave_drain_time. That said, this could lead to more severe issues:

  • RPC connections not being drained, causing RPC errors
  • In the case of consul leave on a leader node, the leader might not be able to replicate all of its raft logs

Therefore, the workaround is not advised.

To minimize the impact of a possible unnecessary election and, in general, to keep the cluster as stable as possible, it is advised to replace all the follower nodes first (one node at a time, to keep quorum) and to replace the leader node at the end. This should trigger only 1 election (2 in the condition described in the leader scenario above).
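
As an illustration of that rolling-replacement order, the sketch below uses the Consul Go API client to list the raft peers and print the followers before the leader; it only prints the order, and actually draining and leaving each node is left to the operator (the local-agent client and the replacement logic are assumptions of this sketch, not part of the fix):

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Connect to the local agent with default settings (assumption: a local agent).
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Read the current raft configuration: followers are replaced first,
	// one node at a time to keep quorum, and the leader goes last.
	raftCfg, err := client.Operator().RaftGetConfiguration(nil)
	if err != nil {
		log.Fatal(err)
	}

	var leader string
	for _, s := range raftCfg.Servers {
		if s.Leader {
			leader = s.Node
			continue
		}
		fmt.Println("replace follower:", s.Node)
	}
	fmt.Println("replace leader last:", leader)
}
```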

Fix

Set the raft config flag ShutdownOnRemove to true. This makes raft stop its goroutines cleanly when the node is removed from raft; the replication goroutines are not affected by this.
The only caveat is to thoroughly test the interaction with features like enterprise autopilot and to make sure it does not impact the single-server scenario (the flag was historically set when the ability for Consul to run as a single server was added).
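
A minimal sketch of where the fix would land, assuming the configuration handed to raft is built from hashicorp/raft's DefaultConfig (the exact spot in Consul's server setup is not shown here):

```go
package server

import "github.com/hashicorp/raft"

// buildRaftConfig sketches the proposed fix: with ShutdownOnRemove enabled,
// raft shuts its goroutines down cleanly when the node is removed from the
// configuration, instead of reverting to Follower and later campaigning as a
// Candidate during leave_drain_time.
func buildRaftConfig() *raft.Config {
	conf := raft.DefaultConfig()
	conf.ShutdownOnRemove = true
	return conf
}
```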
