unexpected election timeouts when rejecting vote requests in raft #89

Closed
freddyrios opened this issue Dec 21, 2021 · 8 comments
Labels: enhancement (New feature or request)

@freddyrios
Contributor

When handling vote requests in the following method, the election timeout is reset even when the vote request is rejected (i.e. when the condition at line 544 evaluates to false):

protected async Task<Result<bool>> VoteAsync(ClusterMemberId sender, long senderTerm, long lastLogIndex, long lastLogTerm, CancellationToken token)
{
    var currentTerm = auditTrail.Term;

    if (currentTerm > senderTerm)
        return new(currentTerm, false);

    using var tokenSource = token.LinkTo(LifecycleToken);
    using var transitionLock = await transitionSync.AcquireAsync(token).ConfigureAwait(false);
    var result = false;

    if (currentTerm != senderTerm)
    {
        Leader = null;
        await StepDown(senderTerm).ConfigureAwait(false);
    }
    else if (state is FollowerState follower)
    {
        follower.Refresh();
    }
    else if (state is StandbyState)
    {
        Metrics?.ReportHeartbeat();
    }
    else
    {
        goto exit;
    }

    if (auditTrail.IsVotedFor(sender) && await auditTrail.IsUpToDateAsync(lastLogIndex, lastLogTerm, token).ConfigureAwait(false))
    {
        await auditTrail.UpdateVotedForAsync(sender).ConfigureAwait(false);
        result = true;
    }

exit:
    return new(currentTerm, result);
}

Note that this happens both explicitly at line 533 and indirectly at line 529. It seems problematic in both cases, although in the latter case, when the node is a leader, it does need to start a new election timeout (the timeout is supposed to change, though, instead of remaining the same, which appears to be what happens following the implementation of StepDown).

Some of this is not explicit in the Raft papers, not even in the PhD dissertation. However, one can simulate various scenarios in the Raft visualization at https://raft.github.io/ by sending requests and stopping, resuming, or timing out nodes, and observe how exactly it handles those cases.

More detail on some of the scenarios related to votes being rejected by the checks at line 529:

  • (low impact) We already voted for a node and a second node timed out (this goes to line 533). The node first saw the election when it cast its first vote, so it reset its election timeout then (so that it can propose itself as a leader if the candidate fails to establish itself). By resetting again on the second vote request, we extend our election timeout a second time, potentially beyond a single election window, which reduces our chances of becoming a candidate if the election fails.
  • (high impact) We have committed entries the vote requester does not have. Without the reset, this node is more likely to time out before the sender (which will reset its own timeout after losing the election), and also more likely to time out before other nodes that don't have the committed entries (since the accepted vote resets their election timeout). By resetting here we remove the advantage of nodes with committed entries, so more rounds are needed until, by chance, one of the nodes with the committed entries becomes the leader.

Consider how the latter scenario can play out when only 2 of 3 nodes remain. One of them has an entry the other does not. If timings are very unlucky, the node without the extra entry keeps becoming a candidate first, round after round. This alone does not explain the many rounds I have seen it take, but it might combine with other implementation details in dotNext Raft, e.g. an unlucky election timeout might keep getting reused.
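
One way to make the explicit reset at line 533 conditional is sketched below. This is an illustrative sketch only, not the actual dotNext code or a finished patch: the follower's election timer is refreshed only when the vote is actually granted, so a rejected vote request no longer postpones this node's own candidacy. The indirect reset through StepDown at line 529 is left untouched here.

    protected async Task<Result<bool>> VoteAsync(ClusterMemberId sender, long senderTerm, long lastLogIndex, long lastLogTerm, CancellationToken token)
    {
        var currentTerm = auditTrail.Term;

        if (currentTerm > senderTerm)
            return new(currentTerm, false);

        using var tokenSource = token.LinkTo(LifecycleToken);
        using var transitionLock = await transitionSync.AcquireAsync(token).ConfigureAwait(false);
        var result = false;

        if (currentTerm != senderTerm)
        {
            Leader = null;
            await StepDown(senderTerm).ConfigureAwait(false);
        }
        else if (state is StandbyState)
        {
            Metrics?.ReportHeartbeat();
        }
        else if (state is not FollowerState)
        {
            goto exit;
        }

        if (auditTrail.IsVotedFor(sender) && await auditTrail.IsUpToDateAsync(lastLogIndex, lastLogTerm, token).ConfigureAwait(false))
        {
            await auditTrail.UpdateVotedForAsync(sender).ConfigureAwait(false);
            result = true;

            // Only a granted vote is evidence of a legitimate candidate, so only then
            // is it treated like a heartbeat and the election timer restarted.
            if (state is FollowerState follower)
                follower.Refresh();
        }

    exit:
        return new(currentTerm, result);
    }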

@freddyrios
Contributor Author

@sakno I will likely try this change locally, so I could turn it into a PR. A follow-up I might do, depending on the results, is exploring how the reuse of timeouts interacts with this and other Raft behaviors (but that is likely a topic for another time).

@sakno
Collaborator

sakno commented Dec 21, 2021

@freddyrios, thanks for your work! As for the indicated potential issue, there are a few protections already implemented:

  • PreVote message, which is not covered in the original Raft paper (see this doc for more details). This type of RPC call prevents term inflation.
  • Leader stickiness:

        // provide leader stickiness
        result = Timestamp.Current - Timestamp.VolatileRead(ref lastUpdated).Value >= ElectionTimeout &&
            currentTerm <= nextTerm &&
            await auditTrail.IsUpToDateAsync(lastLogIndex, lastLogTerm, token).ConfigureAwait(false);

These two extensions prevent the cluster from re-electing when the leader is unavailable for a short time. The Candidate state has this additional pre-voting step:

async Task<bool> PreVoteAsync(long currentTerm)
{
    var lastIndex = auditTrail.LastUncommittedEntryIndex;
    var lastTerm = await auditTrail.GetTermAsync(lastIndex, LifecycleToken).ConfigureAwait(false);
    ICollection<Task<Result<bool>>> responses = new LinkedList<Task<Result<bool>>>();

    foreach (var member in Members)
        responses.Add(member.PreVoteAsync(currentTerm, lastIndex, lastTerm, LifecycleToken));

    var votes = 0;

    // analyze responses
    foreach (var response in responses)
    {
        try
        {
            var result = await response.ConfigureAwait(false);
            votes += result.Value ? +1 : -1;
        }
        catch (OperationCanceledException)
        {
            return false;
        }
        catch (MemberUnavailableException)
        {
            votes -= 1;
        }
        finally
        {
            response.Dispose();
        }
    }

    return votes > 0;
}

So I expect that the situations you described previously can be mitigated by this extra step. Anyway, PRs are welcome! You can submit changes and we can take a look at them together.

@sakno added the enhancement (New feature or request) label on Dec 21, 2021
@freddyrios
Contributor Author

@sakno thanks for the reply.

I agree prevote prevents the second scenario, since a node can't become a candidate (and send related vote requests) if the majority has more committed entries than the node.

I guess the first scenario could still happen if the leader does go down. So some nodes do not have a fair chance to win a new election if the current election ends with a split vote.

That said, we should probably close this issue, as I don't currently have a strong scenario where that is shown to be the root cause of issues. More below.

I found out more about this point: "This alone does not explain the many rounds I have seen it take, but it might combine with other implementation details in dotNext Raft, e.g. an unlucky election timeout might keep getting reused."

Apparently, depending on the size and frequency of the entries being saved relative to some of the settings dotNext Raft supports, the implementation is affected by some of the instability issues we were seeing. This includes problems keeping an elected leader.

Specifically, we found that tuning these greatly helped the stability of some of our test deployments:

  • TcpConfiguration.LowerElectionTimeout
  • TcpConfiguration.UpperElectionTimeout
  • TcpConfiguration.TransmissionBlockSize
  • MemoryBasedStateMachine.Options.BufferSize
  • MemoryBasedStateMachine.Options.InitialPartitionSize

Note that increasing the election timeouts alone was not enough to achieve a cluster that is unlikely to go into recurrent elections and/or recurrent request timeouts. With the parameters tuned, we see more stable behavior even when saving 8 KB entries than we saw when sending just the 8 bytes the example sends without changing those configurations. For the 8 KB entries we increased the transmission block size, buffer size, and initial partition size by a factor of 1000 compared to the example's values (since our data is that much larger; we have not explored whether a smaller increase would also be suitable). For the lower and upper election timeouts, doubling the values was enough.
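
For illustration, here is a sketch of the kind of tuning described above. The concrete numbers are placeholders rather than the values we used, and the construction details (constructor arguments, option defaults) are assumptions that depend on the dotNext version and the example being modified; only the property names come from the list above.

    using System.Net;
    using DotNext.Net.Cluster.Consensus.Raft;

    // Hypothetical values for illustration only.
    var clusterConfig = new RaftCluster.TcpConfiguration(new IPEndPoint(IPAddress.Loopback, 3262))
    {
        LowerElectionTimeout = 300,     // roughly double the example's timeout
        UpperElectionTimeout = 600,     // roughly double the example's timeout
        TransmissionBlockSize = 65536,  // scaled up to match ~8 KB log entries
    };

    var persistenceOptions = new MemoryBasedStateMachine.Options
    {
        BufferSize = 4 * 1024 * 1024,            // scaled up together with the entry size
        InitialPartitionSize = 16 * 1024 * 1024,
    };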


So the only thing I am left wondering about, regarding the instability with smaller transmission block size, buffer size, and initial partition size, is whether the cluster would still have gotten stuck in recurrent elections in the same way without the behavior I mentioned in the second scenario of this ticket, or whether all nodes would have faced the same fate.

P.S. With the current configuration, killing only one node out of 3 tends to recover consistently, while killing 2 nodes out of 3 and restarting 1 of the killed ones does not always recover successfully (extra program restarts after a failed start do tend to recover it).

@sakno
Collaborator

sakno commented Jan 7, 2022

while killing 2 nodes out of 3 and restarting 1 of the killed ones does not always recover successfully

It would be great if you have a working example that reproduces this situation.

@sakno
Collaborator

sakno commented Jan 7, 2022

One more question - is this behavior specific to TCP transport? Did you observe the same for UDP/HTTP?

@sakno
Collaborator

sakno commented Jan 8, 2022

Another potential reason is the suboptimal implementation of the TCP transport for Raft. UDP and TCP share the same transmission flow control procedure. Historically, my primary focus was the HTTP implementation; the TCP/UDP transports were added later. To reduce development effort, I decided to share the same logic between both transports.

Transmission flow control is based on the Exchange concept. An Exchange is a dialogue between the client and the server. During the exchange, the client and the server send packets (logical portions of data) to each other using the Request-Reply pattern. These packets are then translated down to the transport layer. For instance, AppendEntries, the most complex RPC call, has the following flow:

    /*
    Message flow:
    1.REQ(None) Announce number of entries, prevLogIndex, prevLogTerm etc.
    1.RES(Ack) Wait for command: NextEntry to start sending content, None to abort transmission
    2.REQ(StreamStart) with information about content-type and length of the record
    2.REP(Ack) Wait for command: NextEntry to start sending content, Continue to send next chunk, None to finalize transmission
    3.REQ(Fragment) with the chunk of record data
    3.REP(Ack) Wait for command: NextEntry to start sending content, Continue to send next chunk, None to finalize transmission
    4.REQ(StreamEnd) with the final chunk of record data
    4.REP(Ack) Wait for command: NextEntry to start sending content, None to finalize transmission
    */
    private protected readonly Pipe pipe;

That was absolutely reasonable for UDP because it has no concept of connections or packet ordering. As a result, packet ordering is guaranteed through a sequence of requests/acknowledgments. For example, transmitting a small log entry that fits into a single UDP packet requires one REQ packet and one REP packet. The situation is much worse if a log entry doesn't fit into a single UDP packet: transmitting a single log entry then requires multiple REQ/REP pairs. As a result, a single AppendEntries call causes a bunch of packets to be transferred over the wire in both directions.

This architecture is not very efficient for TCP. Probably, it's time to rethink the current implementation.
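
A back-of-the-envelope sketch (illustrative only, not dotNext code) of how the number of round trips grows with entry size under the flow above:

    // Rough, simplified count of request/reply round trips needed to ship one log entry
    // through the Exchange-based flow: one announce exchange, a StreamStart exchange,
    // then one exchange per chunk (Fragments plus the final StreamEnd). Small entries
    // collapse to fewer exchanges; the point is how the count grows with entry size.
    static int ApproxRoundTripsPerEntry(int entrySizeBytes, int chunkSizeBytes)
    {
        var chunks = Math.Max(1, (entrySizeBytes + chunkSizeBytes - 1) / chunkSizeBytes);
        return 2 + chunks;
    }

    // Example: an 8 KB entry with roughly 1 KB of usable payload per packet needs about
    // 2 + 8 = 10 round trips, i.e. ~20 packets on the wire, for a single log entry.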

@freddyrios
Contributor Author

while killing 2 nodes out of 3 and restarting 1 of the killed ones does not always recover successfully

It would be great if you have a working example that reproduces this situation.

I don't have a test, but it happens when running a 3 node cluster with a modification of the example in this branch: https://github.com/freddyrios/dotNext/tree/feature/bigstruct-example.
The branch only has one commit, with notes on what changed, but the modified example is a fairly busy system, which could play a part.

One more question - is this behavior specific to TCP transport? Did you observe the same for UDP/HTTP?

I have not tried it with other transports.

@freddyrios
Contributor Author

thanks for the fixes, working great now
