Skip to content

Latest commit

 

History

History
93 lines (69 loc) · 6.13 KB

proof.md

File metadata and controls

93 lines (69 loc) · 6.13 KB

Quokka's correctness proof.

Let's assume we want to generate a set of objects X. Each object x in X can depend on a set of other objects IN(x). Some objects might not depend on any other objects. Different objects take different inputs. i.e. if x1 != x2, then IN(x1) intersect IN(x2) = 0. When A is a set, IN(A) means the inputs for all the objects in this set, i.e. union of IN(a) for all a in A.

Let's assume we have a set of nodes capable of generating objects. An object x can be generated by a node if all its inputs are stored on the node or on other nodes. We assume that we have capacity to store many objects, though we might have to garbage collect some of them.

We are mostly concerned with the liveness property, i.e. all the objects will be generated when one or more of the nodes die. We assume we have a coordinator that is reliable. It can communicate with all the nodes and can determine who has died.

Then we shall keep the following three data structures on the coordinator, in a transactional data store. Note the coordinator itself can be distributed, so can this transactional data store, as long as it handles its own fault recovery.

CT: objects that can be GC'ed, starts at {} WBG: objects that will be generated, WBG[A] is all objects that will be generated on node A. WBG.all will denote all stored objects. We can distribute objects to the nodes to generate at the onset. WBG.all starts at X. PT: objects that are stored on nodes, PT[A] is all objects on node A. PT.all will denote all stored objects. Starts at {}

Quokka maintains several invariants among those data stores. Invariants: Invariant 1: IN(WBG.all) contained in PT.all union WBG.all, i.e. the inputs to objects that will be generated are present or will be generated. Invariant 2: IN(WBG.all) intersect CT = 0, i.e. the inputs to objects that will be generated cannot be GC'ed. Invariant 3: WBG.all unique <> for X in WBG.all, only one X in WBG.all, i.e. there exists only one outstanding task to generate X.

Notes:

  • This means that if object X is in CT, X does not have to be in PT.all or WBG.all to satisfy Invariant 1
  • The data store changes through serializable transactions. The nodes and the coordinators can issue transactions to the data store. Since the transactions are serializable, as long as each transaction type maintains the invariants, the invariants will always be maintained.
  • The data store is said to be accurate if it describes the real world. i.e. the present objects are really present. The data store is said to be consistent if it respects the aforementioned invariants. The data store can be inaccurate due to node failures, but can never be inconsistent.
  • If one or more nodes fail, the data store becomes inaccurate. The coordinator's job is to make the data store accurate while maintaining its consistency. Failures can be serialized in this fashion. If two concurrent failures occur, say nodes A and B, the coordinator can assume information pertaining to node B is accurate when changing the data store to account for A's failure. After it has done so, it can then address B's failure. The end result will still be accurate.
  • Quokka assumes that in normal operation without failures, all objects have dependencies such that they will all be generated. i.e. no missing dependencies in normal operation.
  • It might be infeasible to maintain data structures such as WBG since the number of objects are potentially infinite. In this case you can maintain its complement -- objects that won't be generated. It might not make sense to think about what won't be generated on a single node -- in that case you can use a combination of data structures to achieve the goal.

--- What needs to be done, is to show that if these invariants are true, then Quokka actually has liveness. I'm pretty sure that's true I just need to put it in words.

A node A in the system then do something like this:

for x in WBG[A]: if IN(A) in PT.all: - produce x on node A lock CT, WBG, PT. MULTI add IN(x) to CT remove x from WBG[A] add x to PT[A] EXEC

If Invariant 3 is true and Invariant 2 was true before this transaction, then Invariant 2 will be true after the transaction since all objects have different inputs. Invariant 1 will also stay true since we just moved X from WBG to PT. Invariant 3 remains true since we don't add objects to WBG.


garbage collect x on node A lock CT, PT. MULTI check if k in CT remove x from PT[A] EXEC

This is to guard against people removing stuff from CT while you are removing x from PT[A]. i.e. the coordinator! Using Redis watch is probably really expensive. Since the coordinator is the only person who can remove stuff, you jsut need to make sure that the coordinator is not doing recovery.


recover stuff if node A fails. Things that could now be wrong. WBG[A], PT[A] We want to get to an end state CT, WBG and PT, A not in WBG or PT.

lock CT, WBG and PT

  • perform virtual garbage collection first MULTI remove all x in PT such that x in CT

objects_to_recover = WBG[node] union PT[node] for x in WBG[node]: WBG[node'].add(x) # we have to ensure that IN(x) in PT.all union WBG.all. Before failure, IN(x) in WBG.all or PT.all, so this is fine by Invariant 1 if we are going to reconstruct PT[node] anyways. # this is fine by Invariant 3 since before x is unique, after x will be unique as well

PT.del(node) WBG.del(node)

  • the input of x is not guaranteed to be in PT or WBG! If we just put the x in some node's WBG we will violate Invariant 1. deque = [x for x in PT[node] and x not in WBG.all] while len(deque) > 0: x = deque.popleft() WBG[node'].add(x) for y in IN(x): if y in CT: CT.rem(y) if y in PT.all (in some other PT, assumed to be accurate) or y in WBG.all: continue else: # y also need to reconstructed deque.add(y)

EXEC

This recovery procedure respects the three invariants. At the end of the recovery procedure all information relating to node A will be purged from the tables, and the invariants still hold, thus it's accurate (if nothing else has failed) and consistent.