-
Notifications
You must be signed in to change notification settings - Fork 216
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Raft persistence bug #589
Comments
Switching to last verified index isn't quite enough, because if the failure at 4. occurs after a majority of secondaries have acked the last verified index but before they have been notified that the commit index has moved to it, they will not be able to confirm to the user exactly what has been committed on failover, even though the transactions won't be lost. Instead they will need to re-establish consensus over the verified suffix, which is effectively commited, since they cannot tell it is. We therefore need to modify raft further to keep track of the secondaries' views on what the commit index is, and only confirm to the user that transactions are committed once a majority of followers has seen that commit index. On election, voting must also happen by comparing the last commit index first, and the last verified index second. |
This is also insufficient, because a node post-election would not be able to use its commit index as the new commit index (it could be further than the majority). I think an additional field and round trip is necessary. |
Quick notes from today's long discussions around this: We believe the 'Fancy Election Scheme' is inevitable. After winning an election as new leader of the new consensus term, you may not immediately know where every follower is, so you may not know the consensus global commit that the previous primary reached. As far as Raft is concerned, this 'leader elect' is a full primary - they are sending append entries to the other nodes to catch them up. But publicly, to the clients, they must continue to look like a backup, advertising their previous global commit - initially, this is what they last heard from the previous primary, then this is advanced as they catch up the followers. Only once this 'leader elect' has a global commit at least as high as their last signed index can they claim to be a normal primary to clients. If they follow this rule, then the advertised global commit can never move backwards (though the last-thing-a-client-heard may be briefly unavailable, and then jump forward. around elections). In practice this 'leader-elect' phase should be very short, as sufficient nodes are caught up that an election winner can immediately transition to Primary. We can make this even faster by including the follower's current received index in vote responses (otherwise we need an AppendEntries roundtrip to find out where they are, and thus what we can advertise). We can shrink this unavailability window even further by doing BFT-style all-to-all announcements of the follower's current indices, shrinking some of the state we need to gather to confirm the global commit index after winning an election. |
Reading this issue again I realise my last comment doesn't really flow from the rest. As we now have a fix for this in #1641, I'll try to summarise our work on this clearly for posterity. The core problem is summarised correctly above:
This "last verified index" is what we're now calling The initial fix seemed to be to use
We explored some complex workarounds for this, to try and find an advertisable commit point that winnable follower would always know was committed. Eventually we decided this was impossible - there is inevitably a stage where the primary knows something the followers don't, and the new primary must rediscover this in/after an election. The fear that "they will need to re-establish consensus over the verified suffix" isn't such a big problem - the new primary is sharing some suffix of the previous primary's transactions just as they would share any other transactions from prior terms. Our implementation of this is what we call the 'Fancy Election Scheme', where the service is potentially in a limbo state for a period after an election. The knock-on implications of this for the service, and what it can advertise as globally committed, are described in my comment above. The fix for the core issue, that commits which were advertised as committed could be rolled back, is fixed in #1641 by using |
Our raft implementation can lose transactions that have been advertised as committed.
This was spotted before as #519 / #520, but the fix is incorrect.
Instead of using commit_idx as their log_idx/value to roll back to, secondaries must use the last verified index (ie. the index of the last signature transaction they verified), which is the logical equivalent of log_idx with added verifiability.
Using commit_idx instead is unnecessarily strong and creates a potential loss window.
Fixing this will also mitigate #521, because the suffix committed on the primary, verified but not yet committed on the followers will be on a majority of followers and therefore part of the future ledger post-election.
The text was updated successfully, but these errors were encountered: