Chitchat relies on an extension of the Scuttlebutt reconciliation algorithm, to handle:
- Key-Value deletion
- Member leaving the cluster.
For simplification, we will just discuss the replication of a single map on a cluster of n node, on which only one node can write to.
(In scuttlebutt, each node has ownership over such a map.)
There are two kinds of update event on this map:
- Set(K, V)
- Deletion(K)
The writer associates an incremental version to all of the updates.
Let's assume the writer node has done V updates and its map has reached a final state $\mathcal{S}f$. We can use the notation $ ({u{0}}, ..., {u_i}, ..., {u}_{v=V}) $ to describe the sequence of updates, ordered by version.
At any point in time, each node in the cluster has a view of the state map
Every second or so, a node
By design the algorithm attempts to work with a limited budget. The update is truncated to fit a maximum size called the mtu
. In chitchat
, the protocol used is UDP, and the default mtu is of around 65KB
We call such an update a delta
.
Upon reception of the updates, the node
In order to inform the peer about where the updates should start from, the node will send two parameters describing how much of the state is replicated:
-
max_version
: One big difference with reliable broadcast, is that thismax version
does NOT mean that the node's state reflects all of the events earlier thanmax_version
. Rather, it says that if the node were to receive all of the events within betweenmax_version
and$V$ then it will end up to date. The difference is subtle, but comes from an important optimization of scuttlebutt.
If a node asks for an update above
last_gc_version
: Rather than being deleted right away, the algorithm is using a tombstone mechanism. Key entries are kept by marked as deleted. From the client point of view, everything looks like the key really has been deleted. Eventually, in order to free memory, we garbage collect all of the key values marked for deletion that are older than a few hours. We then keep track of the maximumlast_gc_version
.
At each round of the protocol, a node maintains a pair
We also defined the subsequence of missing updates
-
$i > max_version$ if$u_i$ is a set operation -
$i > max(max_version, last_gc_version)$ if$u_i$ is a delete operation.
The key of the algorithm is to maintain the following invariant. $$ \mathcal{S}~\top (u') = \mathcal{S}$$
In plain words, the protocol attempts to make sure that, if any nodes were to apply
- all insert updates above its
max_version
- all delete updates above its
max(max_version, last_gc_version)
it would have managed to fully replicate$S_f$ .
Knowing the max_version
and the last_gc_version
of a peer, a node will emit a delta that, if applied to such valid state, would maintain the invariant above and
increase
Due to the nature of UDP and to the concurrent handshake, deltas may not arrive in order or be duplicated. For this reason, upon reception of a delta, a node must ensure that the delta can be applied to the current state, without breaking the invariant,
and increasing