unexpected election timeouts when rejecting vote requests in raft #89
@sakno I will likely be trying this change locally, so I could turn it into a PR. A follow-up I might do, depending on the results, is exploring how the reuse of timeouts interacts with this and other raft behaviors (but that should probably be a topic for a separate issue).
@freddyrios, thanks for your work! As for the indicated potential issue, there are a few protections implemented already:
These two extensions prevent the cluster from triggering re-elections when the leader is unavailable for a short time. Additionally, the Candidate state has an extra step called pre-voting:
dotNext/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/RaftCluster.cs
Lines 695 to 729 in 6b01f63
So I expect that the situations you described previously can be mitigated by this extra step. Anyway, PRs are welcome! You can submit changes and we can take a look at them together.
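To make the pre-voting idea above easier to follow, here is a minimal sketch of the general mechanism; the `IRaftPeer` abstraction and all names below are hypothetical, not the actual dotNext API or the code in RaftCluster.cs.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Sketch of pre-voting: before incrementing its term and becoming a real candidate,
// a node asks every peer whether it *would* grant a vote for term + 1. Only a majority
// of "yes" answers lets the real election start, so a stale or isolated node cannot
// force the cluster into re-elections.
static class PreVoteSketch
{
    public interface IRaftPeer
    {
        Task<bool> PreVoteAsync(long nextTerm, long lastLogIndex, long lastLogTerm);
    }

    public static async Task<bool> CanStartElectionAsync(
        IReadOnlyCollection<IRaftPeer> peers, long currentTerm, long lastLogIndex, long lastLogTerm)
    {
        var granted = 1; // the local node counts as one granted pre-vote
        foreach (var peer in peers)
        {
            try
            {
                if (await peer.PreVoteAsync(currentTerm + 1, lastLogIndex, lastLogTerm))
                    granted++;
            }
            catch (Exception)
            {
                // an unreachable peer simply contributes no vote
            }
        }

        // a majority of the whole cluster (peers + self) must agree
        return granted > (peers.Count + 1) / 2;
    }
}
```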
@sakno thanks for the reply. I agree prevote prevents the second scenario, since a node can't become a candidate (and send the related vote requests) if the majority has more committed entries than the node. I guess the first scenario could still happen if the leader does go down: some nodes do not get a fair chance to win a new election if the current election ends in a split vote. That said, we should probably close this issue, as I don't currently have a strong scenario where that is shown to be the root cause of problems. More below.

I found out more about this: depending on the size and frequency of the entries being saved versus some of the settings dotNext raft supports, the implementation is affected by some of the instability issues we were seeing. This includes problems holding on to an elected leader. Specifically, we found that tuning these settings greatly helped the stability of some of our test deployments:
Note that increasing the election timeouts alone was not enough to achieve a cluster that is unlikely to go into recurrent elections and/or recurrent request timeouts. We see more stable behavior even when saving 8 KB entries once the parameters are tuned, compared to sending just the 8 bytes the example sends without changing those configurations. When sending 8 KB entries we increased the values of transmission block size, buffer size and initial partition size by a factor of 1000 compared to those in the example (as our data is that much larger, but we have not explored whether a smaller increase would also be suitable). For the lower and upper election timeouts, doubling the values was enough.

So the only thing I am left wondering about, regarding the instability with the smaller transmission block size, buffer size and initial partition size, is whether the cluster would still have gotten stuck in recurrent elections in the same way without the behavior I mentioned in the second scenario of the ticket, or whether all nodes would have faced the same fate.

p.s. with the current configuration, killing only one node out of 3 tends to recover consistently, while killing 2 nodes out of 3 and restarting 1 of the killed nodes does not always recover successfully (extra program restarts when it fails to start do tend to recover it).
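To illustrate the scaling described above, here is a minimal sketch; the setting names and the record shape are placeholders chosen for illustration, not the actual dotNext configuration surface, and the baseline values are left to the caller rather than copied from the example.

```csharp
using System;

// Sketch of the tuning described in the comment above: payload grew from ~8 bytes to
// ~8 KB (a factor of 1000), so the I/O-related settings are scaled by that factor,
// while the election timeout bounds only needed to be doubled.
static class TuningSketch
{
    public sealed record RaftSettings(
        int TransmissionBlockSize,
        int BufferSize,
        long InitialPartitionSize,
        TimeSpan LowerElectionTimeout,
        TimeSpan UpperElectionTimeout);

    public static RaftSettings ScaleForLargeEntries(RaftSettings baseline, int payloadGrowthFactor)
    {
        return baseline with
        {
            TransmissionBlockSize = baseline.TransmissionBlockSize * payloadGrowthFactor,
            BufferSize = baseline.BufferSize * payloadGrowthFactor,
            InitialPartitionSize = baseline.InitialPartitionSize * payloadGrowthFactor,
            LowerElectionTimeout = baseline.LowerElectionTimeout * 2,
            UpperElectionTimeout = baseline.UpperElectionTimeout * 2,
        };
    }
}
```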
It would be great if you had a working example that reproduces this situation.
One more question - is this behavior specific to TCP transport? Did you observe the same for UDP/HTTP? |
Another potential reason is the suboptimal implementation of the TCP transport for Raft. UDP and TCP both share the same transmission flow control procedure. Historically, my primary focus was the HTTP implementation; the TCP/UDP transports were added later. To reduce development effort, I decided to share the same logic for both transports.

Transmission flow control is based on the Exchange concept. An Exchange is a dialogue between the client and the server. During the exchange, the client and the server send packets (a logical portion of data) to each other using the Request-Reply pattern. These packets are then translated down to the transport layer. For instance:
Lines 13 to 26 in 30e4188
That was absolutely reasonable for UDP because it has no concept of a connection or packet ordering. As a result, packet ordering is guaranteed through a sequence of requests/acknowledgments. For example, transmission of a small log entry that fits into a single UDP packet requires a REQ packet and a REP packet. The situation is much worse if a log entry doesn't fit into a single UDP packet: in that case, the transmission of a single log entry requires multiple REQ/REP packets. As a result, a single call of

This architecture is not very efficient for TCP. Probably, it's time to rethink the current implementation.
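A simplified model of that flow control, just to make the cost explicit; the channel abstraction below is hypothetical (not the dotNext exchange types). Each block of a log entry goes out as a REQ packet and the sender waits for the REP before sending the next block, so an entry larger than one packet costs several round trips even over a stream transport such as TCP.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

// Hypothetical request/reply channel used only for this sketch.
public interface IPacketChannel
{
    Task SendRequestAsync(ReadOnlyMemory<byte> packet); // REQ
    Task<bool> ReceiveReplyAsync();                      // REP (true = block acknowledged)
}

public static class ExchangeSketch
{
    // Transmits a single log entry block-by-block using the Request-Reply pattern
    // and returns how many round trips the transfer cost.
    public static async Task<int> SendLogEntryAsync(IPacketChannel channel, Stream logEntry, int blockSize)
    {
        var buffer = new byte[blockSize];
        var roundTrips = 0;
        int bytesRead;
        while ((bytesRead = await logEntry.ReadAsync(buffer, 0, buffer.Length)) > 0)
        {
            await channel.SendRequestAsync(buffer.AsMemory(0, bytesRead));
            if (!await channel.ReceiveReplyAsync())
                throw new IOException("receiver rejected the block");
            roundTrips++; // one full round trip per block, even over TCP
        }
        return roundTrips;
    }
}
```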
I don't have a test, but it happens when running a 3-node cluster with a modification of the example, in this branch: https://github.com/freddyrios/dotNext/tree/feature/bigstruct-example.
I have not tried it with other transports. |
thanks for the fixes, working great now |
When handling vote requests in the following method, the election timeout is restarted even when rejecting vote requests (when the condition at line 544 is false):
dotNext/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/RaftCluster.cs
Lines 516 to 552 in 6b01f63
Note that this happens both explicitly in line 533 and indirectly in line 529. It seems problematic in both cases, although in the latter case, when the node is a leader, it does need to start a new election timeout (the timeout is supposed to change though, instead of remaining the same, which seems to be what happens following the implementation of StepDown).
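A minimal sketch of the behavior being questioned, with hypothetical names rather than the actual RaftCluster.cs logic: the follower refreshes its election timer on every vote request, even when the vote is rejected, which postpones its own chance to start an election.

```csharp
using System;
using System.Threading;

public static class VoteHandlerSketch
{
    // Simplified one-shot election timer used only for this sketch.
    public sealed class ElectionTimer : IDisposable
    {
        private readonly Timer timer;

        public ElectionTimer(TimeSpan timeout, Action onTimeout)
            => timer = new Timer(_ => onTimeout(), null, timeout, Timeout.InfiniteTimeSpan);

        public void Restart(TimeSpan timeout) => timer.Change(timeout, Timeout.InfiniteTimeSpan);

        public void Dispose() => timer.Dispose();
    }

    public static bool HandleVoteRequest(
        long candidateTerm, long currentTerm, bool candidateLogUpToDate,
        ElectionTimer electionTimer, TimeSpan newTimeout)
    {
        var granted = candidateTerm >= currentTerm && candidateLogUpToDate;

        // The behavior discussed in this issue: the timer is restarted unconditionally,
        // so even a rejected request delays this node's own election timeout. The
        // alternative is to restart it only when the vote is actually granted.
        electionTimer.Restart(newTimeout);

        return granted;
    }
}
```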
Some of this is not explicit at all in the Raft papers, not even in the PhD dissertation. However, one can simulate various scenarios in the Raft visualization by sending requests, stopping, resuming, or timing out nodes and seeing exactly how it deals with those cases: https://raft.github.io/
More info on some of the scenarios related to votes being rejected by the checks in line 529:
Consider how the latter scenario can play out when only 2 out of 3 nodes remain, and one of them has an entry the other does not. If one is very unlucky with timings, the node without the extra entry keeps becoming a candidate first over many rounds. This alone does not match the many rounds I have seen it take, but it might combine with other implementation details in dotNext raft, as an unlucky election timeout might keep getting reused.
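To illustrate the last point about timeout reuse, here is a small sketch, not the actual implementation: if a node draws its random election timeout once and keeps reusing it, the node that happened to draw the smaller value keeps timing out first in every round, whereas redrawing the timeout before each round breaks that fixed ordering.

```csharp
using System;

public static class TimeoutReuseSketch
{
    private static readonly Random Rng = new Random();

    // Re-drawing the timeout before every election round gives each node a fresh
    // chance to time out first in the next round.
    public static TimeSpan NextElectionTimeout(TimeSpan lower, TimeSpan upper)
        => lower + TimeSpan.FromMilliseconds(Rng.NextDouble() * (upper - lower).TotalMilliseconds);

    // Reusing a single draw for the lifetime of the node fixes the ordering between
    // nodes: the node that drew the smaller value wins the race every round.
    public static TimeSpan ReusedElectionTimeout { get; } =
        NextElectionTimeout(TimeSpan.FromMilliseconds(150), TimeSpan.FromMilliseconds(300));
}
```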