
Snapshot auditability #1539

Closed
1 of 2 tasks
jumaffre opened this issue Aug 27, 2020 · 3 comments · Fixed by #1925
Comments

@jumaffre
Contributor

jumaffre commented Aug 27, 2020

Follow up from #1302

Snapshots are currently generated at regular intervals for a state that is globally committed. However, the snapshot evidence (hash of snapshot) is only committed after the snapshot has been generated. The snapshot is written to disk as soon as it is generated.

With this first implementation, the evidence for a snapshot that is already available for new joiners to resume from (i.e. an operator can copy the snapshot file and start a new joiner from it straight away) can actually be rolled back. In that case, the snapshot could not be audited, as there would be no evidence for it in the ledger.

What we should do instead is:

  • Only write the snapshot to disk once the snapshot evidence is globally committed. This means we would have to keep the serialised snapshot (plus the version of its evidence) in memory until that version is globally committed.
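The deferred-write behaviour described above could be sketched roughly as follows. This is purely illustrative Python, not CCF's actual implementation; `PendingSnapshot`, `SnapshotStore`, and the in-memory `committed` dict (standing in for files on disk) are all invented names:

```python
# Minimal sketch: hold serialised snapshots in memory until the seqno of
# their evidence transaction is globally committed, then persist them.
from dataclasses import dataclass, field

@dataclass
class PendingSnapshot:
    snapshot_seqno: int   # seqno the snapshot was taken at
    evidence_seqno: int   # seqno of the evidence (hash) transaction
    data: bytes           # serialised snapshot, kept in memory for now

@dataclass
class SnapshotStore:
    pending: list = field(default_factory=list)
    committed: dict = field(default_factory=dict)  # stands in for files on disk

    def generate(self, snapshot_seqno, evidence_seqno, data):
        # Do NOT write to disk yet: the evidence could still be rolled back.
        self.pending.append(PendingSnapshot(snapshot_seqno, evidence_seqno, data))

    def commit(self, committed_seqno):
        # Flush every snapshot whose evidence is now globally committed.
        still_pending = []
        for p in self.pending:
            if p.evidence_seqno <= committed_seqno:
                self.committed[p.snapshot_seqno] = p.data  # "write to disk"
            else:
                still_pending.append(p)
        self.pending = still_pending

    def rollback(self, rollback_seqno):
        # Drop pending snapshots whose evidence was rolled back.
        self.pending = [p for p in self.pending
                        if p.evidence_seqno <= rollback_seqno]
```

A rolled-back snapshot is simply discarded before it ever reaches disk, so no unauditable snapshot file can exist.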

However, this may not be enough to guarantee that a joiner that resumed from a snapshot can join the consensus:

  • From the new node's perspective, it should not become part of the network (e.g. process app requests) until it has seen the globally committed evidence of the snapshot it has resumed from.
  • From the existing network perspective, this new node shouldn't count as part of the Raft quorum until it has confirmed that it has resumed from a trustworthy snapshot.
@jumaffre jumaffre added this to the Checkpointing milestone Aug 27, 2020
@jumaffre jumaffre added the 0.14 label Sep 16, 2020
@jumaffre
Contributor Author

On recovery, we also need to make sure that we do not roll back the snapshot evidence.

@jumaffre
Contributor Author

jumaffre commented Oct 1, 2020

Discussed further with @achamayou today. The simplest solution is probably for new joiners to be passed the snapshot and the ledger suffix until the next signature that confirms that the snapshot evidence has been globally committed. The scheme is then very similar to the existing recovery scheme:

  1. The new joiner deserialises the snapshot and the ledger suffix (in public mode) until it sees the signature that confirms that the snapshot evidence has been globally committed. It also verifies that the evidence matches the snapshot (SHA-256).
  2. It joins the new network from there and then initiates the deserialisation of the snapshot and the private ledger suffix, but this time in private mode.
  3. When the private ledger has been deserialised to the same seqno as the public store (i.e. the signature that validates the snapshot evidence), the Merkle roots of both stores are compared, the private maps are swapped into the public store, and the node becomes part of the consensus.
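The three steps above could be simulated roughly as below. The entry shapes are invented, and a simple rolling hash stands in for a real Merkle root; none of this is CCF's actual API:

```python
# Rough simulation of join-from-snapshot: replay the public suffix until a
# signature commits the evidence, then replay the private suffix to the same
# seqno and compare roots before joining consensus.
import hashlib

def rolling_root(entries):
    # Stand-in for a Merkle root: fold entry payloads into one digest.
    root = hashlib.sha256(b"snapshot").hexdigest()
    for e in entries:
        root = hashlib.sha256((root + e["payload"]).encode()).hexdigest()
    return root

def join_from_snapshot(snapshot_bytes, evidence_seqno,
                       public_suffix, private_suffix):
    # Step 1: verify the evidence digest against the snapshot (SHA-256) and
    # stop at the signature that globally commits the evidence.
    digest = hashlib.sha256(snapshot_bytes).hexdigest()
    stop = None
    for e in public_suffix:
        if e["type"] == "evidence" and e["seqno"] == evidence_seqno:
            if e["digest"] != digest:
                return False  # evidence does not match the snapshot
        if e["type"] == "signature" and e["commit_seqno"] >= evidence_seqno:
            stop = e["seqno"]
            break
    if stop is None:
        return False  # evidence never globally committed

    # Steps 2-3: replay the private suffix up to the same seqno and compare
    # roots before swapping the private maps in.
    public_root = rolling_root([e for e in public_suffix if e["seqno"] <= stop])
    private_root = rolling_root([e for e in private_suffix if e["seqno"] <= stop])
    return public_root == private_root
```

A tampered snapshot fails the digest check in step 1, and a divergent private replay fails the root comparison in step 3.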

The main drawbacks are:

  • A node cannot join from just a snapshot but also needs an arbitrarily long ledger suffix: there's no way for operators to tell in which ledger chunk the snapshot evidence is committed, although it is likely to be in the chunk that immediately follows the snapshot.
  • There is also a short period of time in which the new joiner is part of the consensus from the perspective of the existing network but hasn't yet initialised its own consensus as it is replaying private ledger entries during this time.

The main advantage is that there is no added complexity in Aft on either side (i.e. existing network or joiner) to avoid counting the new joiner in the quorum until it has seen the global commit of the snapshot evidence.

Starting from a snapshot also means that we can use the state of the store to find out which nodes' RPC addresses we could use on join (provided that the snapshot is not too outdated, e.g. that the configuration in the snapshot still overlaps with the latest configuration).
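As a toy illustration of reading join targets out of the snapshotted state (the `nodes` table layout and status values here are invented, not CCF's actual KV schema):

```python
# Hypothetical: once a joiner has deserialised a snapshot, it could read the
# node table from the snapshotted state to find addresses to join through.
def join_targets(snapshot_state):
    # snapshot_state["nodes"]: node_id -> {"rpc_address": ..., "status": ...}
    return [n["rpc_address"]
            for n in snapshot_state["nodes"].values()
            if n["status"] == "TRUSTED"]
```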

@jumaffre
Contributor Author

As discussed with @achamayou today, some more details on this:

(The following can be simplified once we can request a receipt for the evidence of the snapshot. The receipt for the evidence could probably be embedded in the snapshot file directly.)

For the new joiner, the aim is simply to verify that the snapshot it wants to join from is valid. To do so, on startup, the new joiner deserialises the snapshot in public mode, followed by the subsequent ledger entries, until it finds proof that the snapshot evidence has been globally committed (i.e. a signature entry whose commit seqno is greater than the seqno of the snapshot evidence). If successful, the node can reset its store and attempt to join the service. Otherwise, the node should refuse to start (*). When the service successfully returns the ledger secrets to the new joiner, the full snapshot can be applied to the store (existing behaviour).

On recovery, the node should apply the snapshot in public mode and, when deserialising public entries, check that the evidence for that snapshot is indeed present in the ledger. If the end of the ledger is reached without any proof that the snapshot evidence was committed, recovery should abort with an error.
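The recovery-time check could look something like the following sketch (entry shapes invented; the exception name is hypothetical):

```python
# Sketch: after applying the snapshot in public mode, scan the remaining
# public ledger entries and require committed evidence matching the snapshot
# digest before the ledger runs out; otherwise abort recovery.
import hashlib

class SnapshotNotCommitted(Exception):
    pass

def check_recovery(snapshot_bytes, public_entries):
    digest = hashlib.sha256(snapshot_bytes).hexdigest()
    evidence_seqno = None
    for entry in public_entries:
        if entry["type"] == "evidence" and entry["digest"] == digest:
            evidence_seqno = entry["seqno"]
        elif (entry["type"] == "signature"
              and evidence_seqno is not None
              and entry["commit_seqno"] >= evidence_seqno):
            return evidence_seqno  # evidence found and globally committed
    raise SnapshotNotCommitted(
        "end of ledger reached without committed snapshot evidence")
```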

(*) We could make the node auto-retry with the previous snapshot, etc. However, this would mean adding a couple of extra messages on the ring-buffer to retrieve the previous snapshot, which adds some complexity. Instead, we'll simply check that when a snapshot is selected for join/recovery, the seqno of its evidence is included in a ledger chunk that is available.
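That simpler check amounts to something like the sketch below (the chunk-range representation is an assumption for illustration):

```python
# Sketch of the check in (*): given the seqno ranges of the ledger chunks
# available on disk, only select a snapshot whose evidence seqno falls
# inside one of them.
def evidence_available(evidence_seqno, chunk_ranges):
    # chunk_ranges: list of (start_seqno, end_seqno) inclusive tuples
    return any(start <= evidence_seqno <= end
               for start, end in chunk_ranges)
```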
