unexpected election timeouts when rejecting vote requests in raft #89

Closed
freddyrios opened this issue Dec 21, 2021 · 8 comments
Labels: enhancement (New feature or request)

@freddyrios
Contributor

When handling vote requests in the following method, the election timeout is reset even when the vote request is rejected (i.e. when the condition at line 544 evaluates to false):

protected async Task<Result<bool>> VoteAsync(ClusterMemberId sender, long senderTerm, long lastLogIndex, long lastLogTerm, CancellationToken token)
{
    var currentTerm = auditTrail.Term;

    if (currentTerm > senderTerm)
        return new(currentTerm, false);

    using var tokenSource = token.LinkTo(LifecycleToken);
    using var transitionLock = await transitionSync.AcquireAsync(token).ConfigureAwait(false);
    var result = false;

    if (currentTerm != senderTerm)
    {
        Leader = null;
        await StepDown(senderTerm).ConfigureAwait(false);
    }
    else if (state is FollowerState follower)
    {
        follower.Refresh();
    }
    else if (state is StandbyState)
    {
        Metrics?.ReportHeartbeat();
    }
    else
    {
        goto exit;
    }

    if (auditTrail.IsVotedFor(sender) && await auditTrail.IsUpToDateAsync(lastLogIndex, lastLogTerm, token).ConfigureAwait(false))
    {
        await auditTrail.UpdateVotedForAsync(sender).ConfigureAwait(false);
        result = true;
    }

exit:
    return new(currentTerm, result);
}

Note that this happens both explicitly at line 533 and indirectly at line 529. It seems problematic in both cases, although in the latter case, when the node is a leader, it does need to start a new election timeout (the timeout is supposed to change, though, instead of remaining the same, which appears to be what happens following the implementation of StepDown).

Some of this is not explicit in the Raft papers, not even in the PhD dissertation. However, one can simulate various scenarios in the Raft visualization at https://raft.github.io/ by sending requests and stopping, resuming, or timing out nodes, and observe how exactly it handles those cases.

More detail on some of the scenarios related to votes being rejected by the checks at line 529:

  • (low impact) We already voted for a node and a second node timed out (this goes to line 533). The node first saw the election when it cast its first vote, so it reset its election timeout then (so that it can propose itself as a leader if the candidate fails to establish itself). By resetting again on the second vote request, we extend our election timeout a second time, potentially beyond a single election window, which reduces our chances of becoming a candidate if the election fails.
  • (high impact) We have committed entries the vote requester does not have. Without the reset, this node is more likely to time out before the sender (which will reset its own timeout after losing the election), and also more likely to time out before other nodes that don't have the committed entries (since the accepted vote resets their election timeout). By resetting here we remove the advantage of nodes with committed entries, so more rounds are needed until, by chance, one of the nodes with the committed entries becomes the leader.

Consider how the latter scenario can play out when only 2 of 3 nodes remain. One of them has an entry the other does not. If timings are very unlucky, the node without the extra entry keeps becoming a candidate first, round after round. This alone does not explain the many rounds I have seen it take, but it might combine with other implementation details in dotNext Raft, e.g. an unlucky election timeout might keep getting reused.
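
One way to make the explicit reset at line 533 conditional is sketched below. This is an illustrative sketch only, not the actual dotNext code or a finished patch: the follower's election timer is refreshed only when the vote is actually granted, so a rejected vote request no longer postpones this node's own candidacy. The indirect reset through StepDown at line 529 is left untouched here.

    protected async Task<Result<bool>> VoteAsync(ClusterMemberId sender, long senderTerm, long lastLogIndex, long lastLogTerm, CancellationToken token)
    {
        var currentTerm = auditTrail.Term;

        if (currentTerm > senderTerm)
            return new(currentTerm, false);

        using var tokenSource = token.LinkTo(LifecycleToken);
        using var transitionLock = await transitionSync.AcquireAsync(token).ConfigureAwait(false);
        var result = false;

        if (currentTerm != senderTerm)
        {
            Leader = null;
            await StepDown(senderTerm).ConfigureAwait(false);
        }
        else if (state is StandbyState)
        {
            Metrics?.ReportHeartbeat();
        }
        else if (state is not FollowerState)
        {
            goto exit;
        }

        if (auditTrail.IsVotedFor(sender) && await auditTrail.IsUpToDateAsync(lastLogIndex, lastLogTerm, token).ConfigureAwait(false))
        {
            await auditTrail.UpdateVotedForAsync(sender).ConfigureAwait(false);
            result = true;

            // Only a granted vote is evidence of a legitimate candidate, so only then
            // is it treated like a heartbeat and the election timer restarted.
            if (state is FollowerState follower)
                follower.Refresh();
        }

    exit:
        return new(currentTerm, result);
    }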

@freddyrios
Contributor Author

@sakno I will likely try this change locally, so I could turn it into a PR. A follow-up I might do, depending on the results, is exploring how the reuse of timeouts interacts with this and other Raft behaviors (but that is likely a topic for another time).

@sakno
Collaborator

sakno commented Dec 21, 2021

@freddyrios, thanks for your work! As for the indicated potential issue, there are a few protections already implemented:

  • PreVote message, which is not covered in the original Raft paper (see this doc for more details). This type of RPC call prevents term inflation.
  • Leader stickiness:

        // provide leader stickiness
        result = Timestamp.Current - Timestamp.VolatileRead(ref lastUpdated).Value >= ElectionTimeout &&
            currentTerm <= nextTerm &&
            await auditTrail.IsUpToDateAsync(lastLogIndex, lastLogTerm, token).ConfigureAwait(false);

These two extensions prevent the cluster from re-electing when the leader is unavailable for a short time. The Candidate state has this additional pre-voting step:

async Task<bool> PreVoteAsync(long currentTerm)
{
    var lastIndex = auditTrail.LastUncommittedEntryIndex;
    var lastTerm = await auditTrail.GetTermAsync(lastIndex, LifecycleToken).ConfigureAwait(false);
    ICollection<Task<Result<bool>>> responses = new LinkedList<Task<Result<bool>>>();

    foreach (var member in Members)
        responses.Add(member.PreVoteAsync(currentTerm, lastIndex, lastTerm, LifecycleToken));

    var votes = 0;

    // analyze responses
    foreach (var response in responses)
    {
        try
        {
            var result = await response.ConfigureAwait(false);
            votes += result.Value ? +1 : -1;
        }
        catch (OperationCanceledException)
        {
            return false;
        }
        catch (MemberUnavailableException)
        {
            votes -= 1;
        }
        finally
        {
            response.Dispose();
        }
    }

    return votes > 0;
}

So I expect that the situations you described previously can be mitigated by this extra step. Anyway, PRs are welcome! You can submit changes and we can take a look at them together.

@sakno added the enhancement (New feature or request) label on Dec 21, 2021
@freddyrios
Contributor Author

@sakno thanks for the reply.

I agree prevote prevents the second scenario, since a node can't become a candidate (and send related vote requests) if the majority has more committed entries than the node.

I guess the first scenario could still happen if the leader does go down. So some nodes do not have a fair chance to win a new election if the current election ends with a split vote.

That said, we should probably close this issue, as I don't currently have a strong scenario where that is shown to be the root cause of issues. More below.

I found out more about this point: "This alone does not explain the many rounds I have seen it take, but it might combine with other implementation details in dotNext Raft, e.g. an unlucky election timeout might keep getting reused."

Apparently, depending on the size and frequency of the entries being saved relative to some of the settings dotNext Raft supports, the implementation is affected by some of the instability issues we were seeing. This includes problems keeping an elected leader.

Specifically, we found that tuning these greatly helped the stability of some of our test deployments:

  • TcpConfiguration.LowerElectionTimeout
  • TcpConfiguration.UpperElectionTimeout
  • TcpConfiguration.TransmissionBlockSize
  • MemoryBasedStateMachine.Options.BufferSize
  • MemoryBasedStateMachine.Options.InitialPartitionSize

Note that increasing the election timeouts alone was not enough to achieve a cluster that is unlikely to go into recurrent elections and/or recurrent request timeouts. With the parameters tuned, we see more stable behavior even when saving 8 KB entries than we saw when sending just the 8 bytes the example sends without changing those configurations. For the 8 KB entries we increased the transmission block size, buffer size, and initial partition size by a factor of 1000 compared to the example's values (since our data is that much larger; we have not explored whether a smaller increase would also be suitable). For the lower and upper election timeouts, doubling the values was enough.
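
For illustration, here is a sketch of the kind of tuning described above. The concrete numbers are placeholders rather than the values we used, and the construction details (constructor arguments, option defaults) are assumptions that depend on the dotNext version and the example being modified; only the property names come from the list above.

    using System.Net;
    using DotNext.Net.Cluster.Consensus.Raft;

    // Hypothetical values for illustration only.
    var clusterConfig = new RaftCluster.TcpConfiguration(new IPEndPoint(IPAddress.Loopback, 3262))
    {
        LowerElectionTimeout = 300,     // roughly double the example's timeout
        UpperElectionTimeout = 600,     // roughly double the example's timeout
        TransmissionBlockSize = 65536,  // scaled up to match ~8 KB log entries
    };

    var persistenceOptions = new MemoryBasedStateMachine.Options
    {
        BufferSize = 4 * 1024 * 1024,            // scaled up together with the entry size
        InitialPartitionSize = 16 * 1024 * 1024,
    };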


So the only thing I am left wondering about, regarding the instability with smaller transmission block size, buffer size, and initial partition size, is whether the cluster would still have gotten stuck in recurrent elections in the same way without the behavior I mentioned in the second scenario of this ticket, or whether all nodes would have faced the same fate.

P.S. With the current configuration, killing only one node out of 3 tends to recover consistently, while killing 2 nodes out of 3 and restarting 1 of the killed ones does not always recover successfully (extra program restarts after a failed start do tend to recover it).

@sakno
Collaborator

sakno commented Jan 7, 2022

while killing 2 nodes out of 3 and restarting 1 of the killed ones does not always recover successfully

It would be great if you have a working example that reproduces this situation.

@sakno
Collaborator

sakno commented Jan 7, 2022

One more question - is this behavior specific to TCP transport? Did you observe the same for UDP/HTTP?

@sakno
Collaborator

sakno commented Jan 8, 2022

Another potential reason is the suboptimal implementation of the TCP transport for Raft. UDP and TCP share the same transmission flow control procedure. Historically, my primary focus was the HTTP implementation; the TCP/UDP transports were added later. To reduce development effort, I decided to share the same logic between both transports.

Transmission flow control is based on the Exchange concept. An Exchange is a dialogue between the client and the server. During the exchange, the client and the server send packets (logical portions of data) to each other using the Request-Reply pattern. These packets are then translated down to the transport layer. For instance, AppendEntries, the most complex RPC call, has the following flow:

    /*
    Message flow:
    1.REQ(None) Announce number of entries, prevLogIndex, prevLogTerm etc.
    1.RES(Ack) Wait for command: NextEntry to start sending content, None to abort transmission
    2.REQ(StreamStart) with information about content-type and length of the record
    2.REP(Ack) Wait for command: NextEntry to start sending content, Continue to send next chunk, None to finalize transmission
    3.REQ(Fragment) with the chunk of record data
    3.REP(Ack) Wait for command: NextEntry to start sending content, Continue to send next chunk, None to finalize transmission
    4.REQ(StreamEnd) with the final chunk of record data
    4.REP(Ack) Wait for command: NextEntry to start sending content, None to finalize transmission
    */
    private protected readonly Pipe pipe;

That was absolutely reasonable for UDP because it has no concept of connections or packet ordering. As a result, packet ordering is guaranteed through a sequence of requests/acknowledgments. For example, transmitting a small log entry that fits into a single UDP packet requires one REQ packet and one REP packet. The situation is much worse if a log entry doesn't fit into a single UDP packet: transmitting a single log entry then requires multiple REQ/REP pairs. As a result, a single AppendEntries call causes a bunch of packets to be transferred over the wire in both directions.

This architecture is not very efficient for TCP. Probably, it's time to rethink the current implementation.
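
A back-of-the-envelope sketch (illustrative only, not dotNext code) of how the number of round trips grows with entry size under the flow above:

    // Rough, simplified count of request/reply round trips needed to ship one log entry
    // through the Exchange-based flow: one announce exchange, a StreamStart exchange,
    // then one exchange per chunk (Fragments plus the final StreamEnd). Small entries
    // collapse to fewer exchanges; the point is how the count grows with entry size.
    static int ApproxRoundTripsPerEntry(int entrySizeBytes, int chunkSizeBytes)
    {
        var chunks = Math.Max(1, (entrySizeBytes + chunkSizeBytes - 1) / chunkSizeBytes);
        return 2 + chunks;
    }

    // Example: an 8 KB entry with roughly 1 KB of usable payload per packet needs about
    // 2 + 8 = 10 round trips, i.e. ~20 packets on the wire, for a single log entry.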

@freddyrios
Contributor Author

while killing 2 nodes out of 3 and restarting 1 of the killed ones does not always recover successfully

It would be great if you have a working example that reproduces this situation.

I don't have a test, but it happens when running a 3 node cluster with a modification of the example in this branch: https://github.com/freddyrios/dotNext/tree/feature/bigstruct-example.
The branch only has one commit, with notes on what changed, but the modified example is a fairly busy system, which could play a part.

One more question - is this behavior specific to TCP transport? Did you observe the same for UDP/HTTP?

I have not tried it with other transports.

@freddyrios
Contributor Author

thanks for the fixes, working great now
