Consul server leave try to trigger an election #10970

Closed
dhiaayachi opened this issue Sep 1, 2021 · 0 comments · Fixed by #11375
Summary

When a user triggers a server node to leave a cluster, the following error shows up on the other cluster nodes:
agent.server.raft: rejecting vote request since we have a leader

How To reproduce

Use the script documented in #9755

Working Assumptions

  • The cluster keeps quorum at all times, even after one node leaves. This is crucial for the cluster to remain stable after the node has left (see the quorum sketch below).
  • A leader leaving the cluster will always trigger an election (this should happen only once).
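
As a quick illustration of the first assumption: raft quorum is a strict majority of the voting servers, so a cluster sized with a spare voter still has quorum after a single node leaves. A minimal sketch (not Consul code, just the arithmetic):

```go
package main

import "fmt"

// quorum returns the number of voters a raft majority requires: (n/2)+1.
func quorum(voters int) int {
	return voters/2 + 1
}

func main() {
	// A 5-server cluster needs 3 votes; after one server leaves, the remaining
	// 4 servers still satisfy that quorum, so the cluster stays available.
	fmt.Println(quorum(5)) // 3
	fmt.Println(quorum(4)) // 3
}
```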

Analysis:

When calling consul leave, two scenarios are possible, and each leads to the same issue:

  • The node is the cluster Leader:
    1. The node is removed from the raft servers list using s.autopilot.RemoveServer. As a result, the node stops receiving heartbeats and raft updates, and an election happens to establish a new leader.

    2. The node is removed from serf on both LAN and WAN (not relevant in this case).

    3. The node will wait for 5 seconds (leave_drain_time) here. During those 5 seconds the node is still running its raft goroutines (including the leader loop), and the following will happen:

      • The node will detect that it was removed here
      • The node will set its raft state to Follower because ShutdownOnRemove is false
      • After some time the node will time out on updates/heartbeats and set its state to Candidate (this is because of 1)
      • The Candidate loop will try to trigger an election
      • Two possible cases at this point:
        • The other nodes refuse to trigger an election because a leader is established and the requesting node is not the leader here (a sketch of this check follows the analysis). The leaving node keeps running the Candidate loop, retries, and keeps failing to trigger an election until it shuts down.
        • No leader is established yet, so the vote requests of the leaving node and of the other nodes compete and only one is accepted (newer term). The worst case here is that the leaving node establishes leadership and then shuts down (see 4), which triggers a second election. This should trigger at most 2 elections overall instead of 1, and the cluster will stabilize in the end.
    4. After 5 seconds the node is stopped and all of its raft goroutines are stopped too.

  • The node is a follower:
    1. The node is removed from serf on both LAN and WAN using serf.Leave, which sets the node's serf state to Left.
    2. Serf triggers a reconcile based on the node's serf state change here and removes the node from the raft server list here. As a result, the node stops receiving heartbeats and raft updates.
    3. The node will wait for 5 seconds (leave_drain_time) here. During those 5 seconds the node is still running its raft goroutines (including the follower loop), and the following will happen:
      • After some time the node will time out on updates/heartbeats and set its state to Candidate (this is because of 2)
      • The Candidate loop will try to trigger an election
      • The other nodes refuse to trigger an election because a leader is established and the requesting node is not the leader here
      • The leaving node keeps running the Candidate loop and retries, but fails until it shuts down
    4. After 5 seconds the node is stopped and all of its raft goroutines are stopped too.
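
For reference, the refusal described in both scenarios comes from raft's vote handling: a server that still knows about an established leader will not grant a vote to another candidate, which is what produces the "rejecting vote request since we have a leader" log line from the summary. A simplified, illustrative paraphrase of that check (not verbatim hashicorp/raft code):

```go
package main

import "fmt"

// shouldRejectVote paraphrases the check behind the
// "rejecting vote request since we have a leader" warning: a node that still
// knows about an established leader refuses to grant votes to other candidates.
func shouldRejectVote(currentLeader, candidate string) bool {
	return currentLeader != "" && currentLeader != candidate
}

func main() {
	// A healthy follower still sees server-1 as the leader, so the leaving
	// server-3's vote request is rejected until server-3 finally shuts down.
	fmt.Println(shouldRejectVote("server-1", "server-3")) // true
}
```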

Workaround

The only workaround that can effectively reduce the number of errors is to shrink the window in which this bug can happen by reducing leave_drain_time. That said, this could lead to more severe issues:

  • RPC connections not being drained, causing RPC errors
  • In the case of consul leave on a leader node, the leader might not be able to replicate all of its raft logs

Therefore, the workaround is not advised.

To minimize the impact of a possible unnecessary election and, in general, to keep the cluster as stable as possible, it is advised to replace all the follower nodes first (one node at a time, to keep quorum) and to replace the leader node at the end. This should trigger only 1 election (2 in the condition described in the leader scenario above).
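
As an illustration of that rolling-replacement order, the sketch below uses the Consul Go API client to list the raft peers and print the followers before the leader; it only prints the order, and actually draining and leaving each node is left to the operator (the local-agent client and the replacement logic are assumptions of this sketch, not part of the fix):

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Connect to the local agent with default settings (assumption: a local agent).
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Read the current raft configuration: followers are replaced first,
	// one node at a time to keep quorum, and the leader goes last.
	raftCfg, err := client.Operator().RaftGetConfiguration(nil)
	if err != nil {
		log.Fatal(err)
	}

	var leader string
	for _, s := range raftCfg.Servers {
		if s.Leader {
			leader = s.Node
			continue
		}
		fmt.Println("replace follower:", s.Node)
	}
	fmt.Println("replace leader last:", leader)
}
```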

Fix

Set the raft config flag ShutdownOnRemove to true. This makes raft stop its goroutines cleanly when the node is removed from raft; the replication goroutines are not affected by this.
The only caveat is to thoroughly test the interaction with features like enterprise autopilot and to make sure it does not impact the single-server scenario (the flag was historically set when the ability for Consul to run as a single server was added).
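
A minimal sketch of where the fix would land, assuming the configuration handed to raft is built from hashicorp/raft's DefaultConfig (the exact spot in Consul's server setup is not shown here):

```go
package server

import "github.com/hashicorp/raft"

// buildRaftConfig sketches the proposed fix: with ShutdownOnRemove enabled,
// raft shuts its goroutines down cleanly when the node is removed from the
// configuration, instead of reverting to Follower and later campaigning as a
// Candidate during leave_drain_time.
func buildRaftConfig() *raft.Config {
	conf := raft.DefaultConfig()
	conf.ShutdownOnRemove = true
	return conf
}
```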
