Fast joiner catchup #63
Depends on #236
Following discussion with @olgavrou, here's a sketch of how this could work for CFT.

A snapshot is a transaction made of a write-set containing all keys in all replicated tables at a globally committed version. Its version is the committed version for which it was produced, and it carries a flag identifying it as a snapshot.

A new node connecting to a network can request the most recent snapshot, along with evidence, from the node it initially connects to. It writes the evidence for the snapshot to a derived table (to preserve offline verifiability), and starts operating from that commit onwards.

On catastrophic recovery, a list of snapshots with attached evidence can be collated. A recovery proposed from a snapshot would also expose the aggregated evidence from old nodes, which the members can examine when voting for or against the recovery. A straightforward case could be recovery from a snapshot with evidence from at least f+1 nodes as of the last known configuration.

To allow for this, it is preferable that snapshots be produced at the same version on all nodes, for example every N globally committed versions. During snapshot production, compaction needs to be momentarily turned off; the KV can record the most recent compaction event triggered during snapshot production and replay it once production ends.

It may also be useful for the primary to replicate the digest of the most recent snapshot. A late joiner would be able to remove the snapshot evidence once that digest commits, because it is then attributable to the primary.
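Below is a minimal, hypothetical sketch of this scheme in Python; `Store`, `SNAPSHOT_INTERVAL`, `sign_fn` and the snapshot/evidence shapes are illustrative assumptions, not CCF's actual KV API. It shows snapshots being cut at a fixed interval of globally committed versions with compaction deferred for the duration, and a joining node installing a snapshot while recording its evidence in a derived table.

```python
# Hypothetical sketch of the CFT snapshot scheme sketched above; names and
# data shapes are illustrative, not CCF APIs.
import hashlib
import json

SNAPSHOT_INTERVAL = 1000  # produce a snapshot every N globally committed versions


class Store:
    def __init__(self):
        self.replicated_tables = {}   # table name -> {key: value}
        self.derived_tables = {}      # snapshot evidence lives here, not in the snapshot
        self.globally_committed = 0
        self.compaction_paused = False
        self.pending_compaction_at = None

    def maybe_snapshot(self, node_id, sign_fn):
        """Every Nth globally committed version, emit a snapshot: a write-set of
        all keys in all replicated tables at that version, flagged as a snapshot,
        plus evidence (here, a digest signed by the producing node)."""
        if self.globally_committed == 0 or self.globally_committed % SNAPSHOT_INTERVAL != 0:
            return None
        self.compaction_paused = True  # hold compaction while the snapshot is cut
        try:
            write_set = {t: dict(kv) for t, kv in self.replicated_tables.items()}
            snapshot = {
                "is_snapshot": True,
                "version": self.globally_committed,
                "write_set": write_set,  # values assumed JSON-serialisable here
            }
            digest = hashlib.sha256(
                json.dumps(snapshot, sort_keys=True).encode()
            ).hexdigest()
            evidence = {"node": node_id, "digest": digest, "signature": sign_fn(digest)}
            return snapshot, evidence
        finally:
            self.compaction_paused = False
            if self.pending_compaction_at is not None:
                # replay the compaction that was deferred during production
                self.compact(self.pending_compaction_at)
                self.pending_compaction_at = None

    def compact(self, version):
        if self.compaction_paused:
            # record the most recent compaction event and defer it
            self.pending_compaction_at = version
            return
        # ... actual compaction up to `version` elided ...


def join_from_snapshot(store, snapshot, evidence):
    """A new node installs the snapshot, records its evidence in a derived table
    (preserving offline verifiability), and operates from that version onwards."""
    store.replicated_tables = {t: dict(kv) for t, kv in snapshot["write_set"].items()}
    store.derived_tables.setdefault("snapshot_evidence", {})[snapshot["version"]] = evidence
    store.globally_committed = snapshot["version"]
```

Producing snapshots at the same versions on every node is what would make it possible, on catastrophic recovery, to aggregate evidence from f+1 nodes for the same snapshot.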
I think the snapshot should be over all tables, not just the replicated ones. If you are using the snapshot to catch up, then you don't want to execute all the requests from the beginning of time to arrive at the same derived state.
This sketch is for CFT; for BFT the scheme needs some variation (for example, the snapshots must be signed by more than just the producing node). If we also use the derived tables to store the snapshot evidence, I don't think they can be part of the snapshot itself.
The BFT variant should not require too many changes from what is described here. Every replica should generate a checkpoint every X pre-prepares, where X is part of the CCF configuration. On the X + K pre-prepare, the primary would send out the digest of its checkpoint and the replicas would send out their digests in their prepare messages. Assuming 2f+1 agree with the primary (and the other requirements hold), the pre-prepare at X + K can be committed; otherwise a view-change is required and the agreement process at X + K is restarted. K is also a CCF configuration parameter. Once we have agreement on the checkpoint, each replica can persist its own checkpoint along with proof that it is correct.
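As a rough illustration of that agreement step (X, K, the digest comparison and function names below are assumptions made for the sketch, not the actual PBFT/CCF message handling):

```python
# Hypothetical sketch of BFT checkpoint agreement at pre-prepare X + K.
X = 100   # checkpoint every X pre-prepares (a CCF configuration parameter)
K = 10    # agreement on the checkpoint is sought at pre-prepare X + K


def is_agreement_point(seq_no):
    """Agreement on the checkpoint taken at sequence n*X happens at n*X + K."""
    return seq_no >= X + K and (seq_no - K) % X == 0


def checkpoint_agreed(primary_digest, prepare_digests, f):
    """The primary circulates its checkpoint digest; replicas include theirs in
    their prepare messages. Agreement needs at least 2f + 1 matching digests,
    counting the primary's own."""
    matching = 1 + sum(1 for d in prepare_digests if d == primary_digest)
    return matching >= 2 * f + 1


def on_pre_prepare(seq_no, primary_digest, prepare_digests, f, persist_checkpoint):
    if not is_agreement_point(seq_no):
        return "ordinary"
    if checkpoint_agreed(primary_digest, prepare_digests, f):
        # each replica persists its checkpoint with the matching digests as proof
        persist_checkpoint(primary_digest, prepare_digests)
        return "commit"
    # otherwise a view-change is required and agreement at X + K restarts
    return "view-change"
```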
Points discussed with @achamayou to consider in implementing this:
I will investigate this further in the next few days and aim to provide answers to all of these.
What can probably be replicated through the consensus is a digest of the snapshot, signed by the producing node in the case of Raft. That way it is at least possible to get consensus on the digest.
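A small sketch of that idea, with the signing primitive as a stand-in (a real node would sign with its private identity key; the entry format is an assumption): the producing node submits a signed digest of its latest snapshot as an ordinary log entry, and a late joiner drops its stored evidence once a matching digest commits.

```python
# Hypothetical sketch: replicate a signed snapshot digest through consensus (Raft case).
import hashlib
import hmac


def snapshot_digest_entry(snapshot_bytes, node_id, node_key):
    """The producing node digests its most recent snapshot, signs the digest,
    and submits the result as an ordinary log entry, so consensus is reached on
    the digest rather than on the full snapshot."""
    digest = hashlib.sha256(snapshot_bytes).hexdigest()
    # HMAC here is only a stand-in for a signature with the node's identity key
    signature = hmac.new(node_key, digest.encode(), hashlib.sha256).hexdigest()
    return {"type": "snapshot_digest", "node": node_id,
            "digest": digest, "signature": signature}


def on_commit(entry, local_snapshot_digest, snapshot_evidence_table):
    """Once the digest entry is globally committed, a late joiner whose installed
    snapshot has a matching digest can drop the evidence it stored, since the
    digest is now attributable to the primary through the ledger."""
    if entry.get("type") == "snapshot_digest" and entry["digest"] == local_snapshot_digest:
        snapshot_evidence_table.pop(entry["digest"], None)
```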
There should be a more efficient late-join/recovery mechanism than replaying through history.