Snapshot auditability #1539
Comments
On recovery, we also need to make sure that we do not roll back the snapshot evidence.
Discussed further with @achamayou today. The simplest solution is probably for new joiners to be passed the snapshot and the ledger suffix until the next signature that confirms that the snapshot evidence has been globally committed. The scheme is then very similar to the existing recovery scheme:
The main drawbacks are:
The main advantage is that there is no added complexity in Aft on either side (i.e. existing network or joiner) to avoid counting the new joiner in the quorum until it has seen the global commit of the snapshot evidence. Starting from a snapshot also means that we can use the state of the store to find out which nodes' RPC addresses we could use on join (provided that the snapshot is not too outdated, as it would be if the configuration in the snapshot did not overlap with the latest configuration).
As discussed with @achamayou today, some more details on this. (The following can be simplified once we can ask for a receipt for the evidence of the snapshot. The receipt for the evidence could probably be embedded in the snapshot file directly.)

For the new joiner, the aim is simply to verify that the snapshot it wants to join from is valid. To do so, on startup, the new joiner deserialises the snapshot in public mode, then the following ledger entries until it finds proof that the snapshot evidence has been (globally) committed (i.e. a signature entry which contains a commit seqno greater than that of the snapshot evidence). If successful, the node can reset its store and attempt to join the service. Otherwise, the node should refuse to start (*). When the service successfully returns the ledger secrets to the new joiner, the full snapshot can be applied to the store (existing behaviour).

On recovery, the node should apply the snapshot in public mode and, when deserialising public entries, check that the evidence for that snapshot is indeed present in the ledger. If it reaches the end of the ledger without finding any proof that the snapshot evidence was committed, it should abort with an error.

(*) We could make the node auto-retry with the previous snapshot, etc. However, this would mean adding a couple of extra messages on the ring-buffer to retrieve the previous snapshot, which adds some complexity. Instead, we'll simply check that when a snapshot is selected for join/recovery, the
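The startup check described in this comment can be sketched roughly as follows. This is only an illustration of the scan over the ledger suffix, not CCF code: `LedgerEntry`, `evidence_commit_seqno` and `check_snapshot_or_refuse` are hypothetical names, and the strict `>` comparison simply follows the wording above.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class LedgerEntry:
    """Hypothetical, simplified view of a public ledger entry."""
    seqno: int
    is_signature: bool = False
    commit_seqno: Optional[int] = None  # only meaningful on signature entries


def evidence_commit_seqno(
    suffix: Iterable[LedgerEntry], evidence_seqno: int
) -> Optional[int]:
    """Return the seqno of the first signature entry in the ledger suffix
    whose commit seqno is greater than the snapshot evidence seqno,
    or None if the suffix ends without such a signature."""
    for entry in suffix:
        if (
            entry.is_signature
            and entry.commit_seqno is not None
            and entry.commit_seqno > evidence_seqno
        ):
            return entry.seqno
    return None


def check_snapshot_or_refuse(
    suffix: Iterable[LedgerEntry], evidence_seqno: int
) -> int:
    """Refuse to start (join) or abort (recovery) when the ledger suffix
    holds no proof that the snapshot evidence was committed."""
    proof = evidence_commit_seqno(suffix, evidence_seqno)
    if proof is None:
        raise RuntimeError(
            f"no committed evidence for snapshot evidence at seqno {evidence_seqno}"
        )
    return proof
```

Under these assumptions, a joiner would call `check_snapshot_or_refuse` on the entries that follow the snapshot before resetting its store and attempting to join. A recovering node could run the same scan while deserialising public entries and abort if it raises.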
Follow up from #1302
Snapshots are currently generated at regular intervals for a state that is globally committed. However, the snapshot evidence (hash of snapshot) is only committed after the snapshot has been generated. The snapshot is written to disk as soon as it is generated.
This first implementation means that the evidence of a snapshot that is available for new joiners to resume from (i.e. an operator can copy the snapshot file and start a new joiner from it straight away) can actually be rolled back. In this case, the snapshot would be blameless as there's no evidence for it in the ledger.
What we should do instead is:
… `version` of the evidence) until `version` is globally committed.

However, this may not be enough to guarantee that a joiner that resumed from a snapshot can join the consensus: