[toy-experiment] storage: replica refcount+teardown
The comment below goes into more detail, but here's the TL;DR:

Problems:

1. right now, answering "is this Replica object unused?" is impossible
2. right now, replicaIDs change on existing replicas, which is very complex to reason about
3. right now, it'll be difficult to split replicas into different states because that requires answering 1 during state transitions

Proposed:

1. use a clean API to refcount Replica usage, and
2. allow initiating teardown "basically whenever" without blocking (think merge trigger),
3. so that the replica clears out quickly,
4. which in turn solves 1.
5. Then we can solve 2, because we'll be able to replace the Replica object in the Store whenever the replicaID would previously have changed in place (this will not be trivial, but I hope it can be done),
6. and we should also be able to do 3 (again, not trivial, but a lot harder right now).

I expect the replication code to benefit from 6, as the Raft instance on a Replica would never change.

This PR is a toy experiment for 1. It certainly wouldn't survive contact with the real code, but it's sufficient to discuss this project and iterate on the provisional Guard interface.

----

GuardedReplica is the external interface through which callers interact with a Replica. By acquiring references to the underlying Replica object while it is being used, it allows safe removal of Replica objects and/or their underlying data. This is an important fundamental for five reasons:

Today, we use no such mechanism, though this is largely due to failing in the past to establish one[1]. The status quo "works" by maintaining a destroyStatus inside of Replica.mu, which is checked in a few places, such as before proposing a command or serving a read. Since these checks are only "point in time", and nothing prevents the write to the status from occurring just a moment after the check, there is a high cognitive overhead to reasoning about the possible outcomes. In fact, in the case in which things could go bad most spectacularly, namely removing a replica including its data, we hold essentially all of the locks available to us and rely on the readOnlyCmdMu (which we would rather get rid of). This, then, is the first reason for proposing this change: make the replica lifetime easier to reason about and establish confidence that the Replica can't just disappear out from under us.
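To make the refcounting idea concrete, here is a minimal sketch of what such a guard could look like. This is not the code in this PR; the names (guardedReplica, Acquire/Release, InitiateTeardown, errTearingDown) are placeholders for discussion, and the real interface will differ.

```go
// Hypothetical sketch of a refcounted replica guard. All names here are
// illustrative and not taken from the actual codebase.
package storage

import (
	"errors"
	"sync"
)

var errTearingDown = errors.New("replica is being torn down")

// Replica is a stand-in so the sketch compiles on its own.
type Replica struct{}

// guardedReplica hands out references that pin the underlying *Replica while
// a request uses it. InitiateTeardown can be called at any time; it does not
// block, and the cleanup callback runs once the last reference is released.
type guardedReplica struct {
	mu struct {
		sync.Mutex
		refs        int
		tearingDown bool
		cleanup     func() // set by InitiateTeardown, run when refs drain to zero
	}
	repl *Replica
}

// Acquire pins the replica for the duration of a request. Every successful
// Acquire must be paired with a Release.
func (g *guardedReplica) Acquire() (*Replica, error) {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.mu.tearingDown {
		return nil, errTearingDown
	}
	g.mu.refs++
	return g.repl, nil
}

// Release drops a reference; the release that drains the count after a
// teardown was initiated runs the cleanup.
func (g *guardedReplica) Release() {
	g.mu.Lock()
	g.mu.refs--
	cleanup := g.mu.cleanup
	done := g.mu.tearingDown && g.mu.refs == 0
	g.mu.Unlock()
	if done && cleanup != nil {
		cleanup()
	}
}

// InitiateTeardown marks the replica as going away without waiting for
// in-flight users (think merge trigger); cleanup runs once references drain.
func (g *guardedReplica) InitiateTeardown(cleanup func()) {
	g.mu.Lock()
	g.mu.tearingDown = true
	g.mu.cleanup = cleanup
	drained := g.mu.refs == 0
	g.mu.Unlock()
	if drained && cleanup != nil {
		cleanup()
	}
}
```

The point is simply that teardown never blocks: it flags the replica, rejects new acquisitions, and defers the destructive work to whoever drops the last reference.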
The second motivator is ReplicaID transitions, which can be extremely complicated. For example, a Replica may

1. start off as an uninitialized Replica with ReplicaID 12 (i.e. no data)
2. receive a preemptive snapshot, which confusingly results in an initialized Replica with ReplicaID 12 (though preemptive snapshots nominally should result in a preemptive replica -- one with ReplicaID zero)
3. update its ReplicaID to 18 (i.e. be added back to the Raft group)
4. get ReplicaGC'ed because
5. it blocks a preemptive snapshot, which now recreates it.

From my point of view, changing the ReplicaID for a live Replica is a bad idea and incurs too much complexity. An architecture in which Replica objects have a single ReplicaID throughout their lifetime is not only conceptually much simpler, but also much more straightforward to maintain, since it does away with a whole class of concurrency that needs to be tamed in today's code and that may have performance repercussions. On the other hand, ReplicaID changes are not frequent and only need to be moderately fast.

The alternative is to instantiate a new incarnation of the Replica whenever the ReplicaID changes. The difficult part about this is destroying the old Replica; since Replica provides proper serialization, we mustn't have commands in flight in two instances for the same data (and generally we want to avoid even having to think about concurrent use of old incarnations). This is explored here. The above history would read something like this (a rough sketch of the swap follows at the end of this message):

1. start off as an uninitialized Replica R with ReplicaID 12 (no data)
2. preemptive snapshot is received: tear down R, instantiate R'
3. ReplicaID is updated: tear down R', instantiate R''
4. R'' is marked for ReplicaGC: replace with placeholder R''' in the Store, tear down R'', wait for references to drain, remove the data, remove R'''
5. instantiate R''' (no change from before).

A third upshot is almost visible in the above description. Once we can reinstantiate cleanly on ReplicaID-based state changes, we might as well go ahead and pull apart the various types of Replica:

- preemptive snapshots (though we may replace those by learner replicas in the future[2])
- uninitialized Replicas (have a replicaID, but no data)
- initialized Replicas (have a replicaID and data)
- placeholders (have no replicaID and no data)

To simplify replicaGC and to remove the preemptive snapshot state even before we stop using preemptive snapshots, we may allow placeholders to hold data (but with no means to run client requests against it).

Once there, reducing the memory footprint for "idle" replicas by keeping only a "shim" in memory (replacing the "full" version) until the next request comes in becomes a natural proposition: introduce a new replica state that, upon access, gets promoted to a full replica (which loads the full in-memory state from disk), picking up past attempts at this which failed due to the technical debt present at the moment[3].

[1]: cockroachdb#8630
[2]: cockroachdb#34058
[3]: cockroachdb#31663

Release note: None
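For illustration only, and building on the hypothetical guard sketch above, here is roughly how a Store-level incarnation swap could look. The Store layout, the RangeID key type, and the method name swapIncarnation are assumptions for discussion, not the actual implementation.

```go
// Purely illustrative sketch of the incarnation swap described in the history
// above; assumes the guardedReplica sketch from earlier in this message.
package storage

import "sync"

// Store keeps, per range, the current guarded incarnation of the Replica.
type Store struct {
	mu       sync.Mutex
	replicas map[int64]*guardedReplica // keyed by a simplified RangeID; assumed initialized
}

// swapIncarnation installs a fresh incarnation (for example after a ReplicaID
// change) and tears down the previous one without blocking. New requests route
// to the new incarnation immediately; in-flight requests against the old one
// finish under their guards, and its cleanup (e.g. removing data during
// ReplicaGC) runs once those references drain.
func (s *Store) swapIncarnation(rangeID int64, next *guardedReplica, cleanupOld func()) {
	s.mu.Lock()
	prev := s.replicas[rangeID]
	s.replicas[rangeID] = next
	s.mu.Unlock()

	if prev != nil {
		prev.InitiateTeardown(cleanupOld)
	}
}
```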