raft: use learner replicas instead of preemptive snapshots #34058

Closed
tbg opened this issue Jan 16, 2019 · 9 comments
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments

tbg (Member) commented Jan 16, 2019

We could stop using preemptive snapshots if we used Raft's support for learner replicas. These catch up on the log and can receive snapshots, but they don't vote, campaign, or count in the commit quorum.
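For reference, here is a rough sketch of what adding and later promoting a learner looks like against the upstream etcd/raft conf-change API (import paths vary by etcd version, and the wiring into our change-replicas code is elided):

```go
package sketch

import (
	"context"

	"go.etcd.io/etcd/raft"
	"go.etcd.io/etcd/raft/raftpb"
)

// addLearnerThenPromote proposes a conf change that adds nodeID as a learner
// (it receives log entries and snapshots but does not vote, campaign, or
// count towards the commit quorum), and later a second conf change that
// promotes it to a full voter.
func addLearnerThenPromote(ctx context.Context, n raft.Node, nodeID uint64) error {
	// Phase 1: add as a learner; the leader starts replicating to it right away.
	if err := n.ProposeConfChange(ctx, raftpb.ConfChange{
		Type:   raftpb.ConfChangeAddLearnerNode,
		NodeID: nodeID,
	}); err != nil {
		return err
	}

	// ... wait for the conf change to commit (ApplyConfChange) and for the
	// learner to catch up via log entries or a snapshot ...

	// Phase 2: promote to a voter; from here on it counts in the quorum.
	return n.ProposeConfChange(ctx, raftpb.ConfChange{
		Type:   raftpb.ConfChangeAddNode,
		NodeID: nodeID,
	})
}
```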

Doing so would unify Raft and preemptive snapshots, would allow us to treat learners as regular replicas in much of the code, and would prevent the replica GC queue from erroneously removing preemptive snapshots (which is a medium-size nuisance today). Instead, we'd need a mechanism that regularly inspects the range descriptor for learners and decides whether a learner should be removed (for example, after a replication change fails without cleaning up the learner).

Naively we could try to rely on the Raft snapshot queue to catch up the learners, but that queue works very poorly. Until it has been redesigned, we would probably be better off sending the snapshot manually like today.

Learner replicas could also be used in conjunction with follower reads to enable historical queries at a remote location. For example, a satellite DC could hold a learner replica of each range whose full members live in another DC, and would allow the satellite DC to carry out historical reads on a copy of the data. Whether this is useful in the presence of CDC is speculation, but I think @bdarnell had opinions in the past.

cc @nvanbenschoten (who suggested learner replicas)

tbg added the A-kv-replication (Relating to Raft, consensus, and coordination.) label Jan 16, 2019
tbg added this to the Later milestone Jan 16, 2019
knz added the C-enhancement (Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)) label Jan 16, 2019
petermattis (Collaborator) commented

I'm 👍 on trying to reduce the complexity around preemptive snapshots.

Naively we could try to rely on the Raft snapshot queue to catch up the learners, but that queue works very poorly. Until it has been redesigned, we would probably be better off sending the snapshot manually like today.

The Raft snapshot queue exists to prevent a thundering herd of snapshot activity under various scenarios (i.e. a node being offline for a few minutes, such that Raft logs are truncated). I'm not sure what the "manually like today" option means. Using the preemptive mechanism? Can you elaborate?

tbg (Member, Author) commented Jan 16, 2019

Yes. A preemptive snapshot is essentially a real snapshot, just for replicaID zero, with special behavior around snapshot reservations and a special receive path. I'm suggesting that we initially send a manual snapshot from the replicate queue. This would be a Raft snapshot, but with the same rate limiting and reservation behavior that would otherwise apply to a preemptive snap; once the snapshot reaches the Raft group it looks like "just another snapshot".

Raft snapshots have lots of unfortunate behavior. The one that matters most in the context of this change is that a Raft snapshot may be held up by the reservation semaphore on the receiver, which would then block the sender's replicate queue.

Ultimately Raft and learner-Raft snapshots should both be sent through the snapshot queue, but until the Raft snapshot queue "just works" that's a bad idea. Making the Raft snapshot queue "good" is a separate engineering effort. My observation is that it seems straightforward to keep the two untangled right from the get-go while still keeping the benefit of removing preemptive snapshots.
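To make the proposed flow concrete, here is a hedged sketch of what the replicate queue might do; all of the names (replicaOps, AddLearner, SendSnapshot, and so on) are hypothetical stand-ins, not actual code:

```go
package sketch

import "context"

// replicaOps abstracts the handful of operations this sketch needs; in real
// code these would live on *Replica / the replicate queue. All names here are
// hypothetical.
type replicaOps interface {
	AddLearner(ctx context.Context, target uint64) error
	RemoveLearner(ctx context.Context, target uint64) error
	PromoteLearner(ctx context.Context, target uint64) error
	// SendSnapshot streams a snapshot to the target at the given rate
	// (bytes/s), with the receiver declining quickly (as for preemptive
	// snapshots today) rather than holding the sender on a reservation.
	SendSnapshot(ctx context.Context, target uint64, rateBytesPerSec int64) error
}

// addViaLearner is the flow described above: add the target as a learner,
// send its initial snapshot from the replicate queue itself (not the Raft
// snapshot queue), then promote it; if the snapshot fails, remove the learner
// so it doesn't linger in the range descriptor.
func addViaLearner(ctx context.Context, r replicaOps, target uint64) error {
	if err := r.AddLearner(ctx, target); err != nil {
		return err
	}
	const rebalanceRate = 2 << 20 // 2mb/s, matching today's preemptive snapshot rate limit
	if err := r.SendSnapshot(ctx, target, rebalanceRate); err != nil {
		_ = r.RemoveLearner(ctx, target) // best-effort cleanup
		return err
	}
	return r.PromoteLearner(ctx, target)
}
```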

petermattis (Collaborator) commented

Got it. So you would have the replicate queue still send the snapshot (which limits the number of outstanding "preemptive" snapshots on the sender), and leave the existing Raft snapshot queue unchanged?

Btw, can you refresh my memory about the problems with the Raft snapshot queue? It is unfortunate that a problem with one range sending a snapshot can block sending snapshots for another range, but that is also part of the point.

tbg (Member, Author) commented Jan 16, 2019

Yep, exactly. This is really just a workaround until we can trust the raft snap queue.

Btw, can you refresh my memory about the problems with the Raft snapshot queue?

#32046 (comment) has a little bit on this. Basically there's a lot of little badness adding up to giant badness. reserveSnapshot blocks Raft snapshots on the receiver (it rejects preemptive ones, which is a lot better much of the time). There is also this backoff around "store throttling", where we won't rebalance to a store for 5s after an error. I believe we trigger that behavior accidentally all the time, which of course is terrible for swift rebalancing.

There's also, still, priority inversion. A preemptive snapshot reservation sticks around for ~32s (a full 64mb range at the 2mb/s rate limit), so a Raft snapshot may block for 32s or even a multiple of that, and in turn that blocks a Raft snapshot queue somewhere. It's even possible that we block on the reservation only for the snapshot to then be rejected because some replica is in the way (I'd have to double-check, but it wouldn't surprise me).

And lastly we don't have flow control for the Raft snapshots except the hardcoded 8mb/s. We'd want to use more when we can. (This is a secondary concern at the moment, if we got close to 8mb/s of effective throughput on snapshot queue processing that'd be a big win).

The goal should be that 64 1mb snaps (perhaps to different receivers) can be processed in approximately the time of a single 64mb snap (it won't get quite as fast, but you get the idea), even when the receivers sometimes can't accept the snapshots right away. As long as there is one receiver that can accept something, it should get it.
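As a purely illustrative sketch of that "whoever can accept something should get it" behavior (none of these types exist today; a real queue would also need prioritization and retry backoff):

```go
package sketch

// pendingSnap identifies a snapshot that still needs to be sent; the fields
// are illustrative.
type pendingSnap struct {
	rangeID  int64
	receiver string
}

// receiverPool models the per-receiver reservation: TryReserve returns false
// immediately when the receiver can't accept a snapshot right now, instead of
// blocking the whole queue on it.
type receiverPool interface {
	TryReserve(receiver string) bool
	Release(receiver string)
}

// processOnce makes a single pass over the pending snapshots, sending to any
// receiver that can currently accept one and requeueing the rest. With 64 1mb
// snapshots to mostly-available receivers, total time is dominated by the
// send rate limit rather than by waiting on the slowest reservation.
func processOnce(pool receiverPool, pending []pendingSnap, send func(pendingSnap) error) []pendingSnap {
	var stillPending []pendingSnap
	for _, s := range pending {
		if !pool.TryReserve(s.receiver) {
			stillPending = append(stillPending, s) // busy receiver: skip, retry next pass
			continue
		}
		err := send(s) // rate-limited stream to the receiver
		pool.Release(s.receiver)
		if err != nil {
			stillPending = append(stillPending, s) // failed: retry next pass
		}
	}
	return stillPending
}
```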

tbg added a commit to tbg/cockroach that referenced this issue Jan 16, 2019
The comment below goes into more detail, but here's the TL;DR:

Problems:

1. right now answering "is this Replica object unused?" is impossible
2. right now replicaIDs change on existing replicas, which is
   very complex to reason about
3. right now it'll be difficult to split replicas into different
   states because that requires answering 1 during state transitions

Proposed:

1. use a clean API to refcount Replica usage, and
2. allow initiating teardown "basically whenever" without blocking
   (think merge trigger),
3. so that the replica clears out quickly,
4. which in turn solves 1.
5. Then we can solve 2 because we'll be able to
   replace the Replica object in the Store whenever
   the replicaID would previously change in-place
   (this will not be trivial, but I hope it can be done),
6. and we should also be able to do 3 (again, not trivial,
   but a lot harder right now).

I expect the replication code to benefit from 6) as the Raft instance on
a Replica would never change.

This PR is a toy experiment for 1. It certainly wouldn't survive contact
with the real code, but it's sufficient to discuss this project and
iterate on the provisional Guard interface.

----

GuardedReplica is the external interface through which callers interact with
a Replica. By acquiring references to the underlying Replica object while it
is being used, it allows safe removal of Replica objects and/or their
underlying data. This is an important fundamental building block for several
reasons:

Today, we use no such mechanism, though this is largely due to failing in the
past to establish one[1]. The status quo "works" by maintaining a
destroyStatus inside of Replica.mu, which is checked in a few places such as
before proposing a command or serving a read. Since these checks are only
"point in time", and nothing prevents the write to the status from occurring
just a moment after the check, there is a high cognitive overhead to
reasoning about the possible outcomes. In fact, in the case in which things
could go bad most spectacularly, namely removing a replica including data, we
hold essentially all of the locks available to us and rely on the
readOnlyCmdMu (which we would rather get rid of). This then is the first
reason for proposing this change: make the replica lifetime easier to reason
about and establish confidence that the Replica can't just disappear out from
under us.
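As a toy illustration of the refcounting idea (not the PR's actual Guard interface; names are made up):

```go
package sketch

import (
	"errors"
	"sync"
)

// replicaGuard is a toy version of the refcounting described above: callers
// acquire a reference before touching the Replica, and teardown waits for
// outstanding references to drain before the object (and its data) may be
// removed. Names are illustrative.
type replicaGuard struct {
	mu        sync.Mutex
	refs      int
	destroyed bool
	drained   chan struct{} // closed once destroyed && refs == 0
}

func newReplicaGuard() *replicaGuard {
	return &replicaGuard{drained: make(chan struct{})}
}

var errDestroyed = errors.New("replica is being torn down")

// Acquire registers a caller. It fails fast once teardown has begun, which is
// what replaces today's point-in-time destroyStatus checks.
func (g *replicaGuard) Acquire() error {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.destroyed {
		return errDestroyed
	}
	g.refs++
	return nil
}

// Release drops a reference and, if teardown has started and this was the
// last reference, signals that the replica has drained.
func (g *replicaGuard) Release() {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.refs--
	if g.destroyed && g.refs == 0 {
		close(g.drained)
	}
}

// Destroy initiates teardown "basically whenever" without blocking; the
// returned channel is closed once all references have drained and it is safe
// to remove the underlying data.
func (g *replicaGuard) Destroy() <-chan struct{} {
	g.mu.Lock()
	defer g.mu.Unlock()
	if !g.destroyed {
		g.destroyed = true
		if g.refs == 0 {
			close(g.drained)
		}
	}
	return g.drained
}
```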

The second motivator is ReplicaID transitions, which can be extremely
complicated. For example, a Replica may

1. start off as an uninitialized Replica with ReplicaID 12 (i.e. no data)
2. receive a preemptive snapshot, which confusingly results in an initialized
   Replica with ReplicaID 12 (though preemptive snapshots nominally should
   result in a preemptive replica -- one with ReplicaID zero).
3. update its ReplicaID to 18 (i.e. being added back to the Raft group).
4. get ReplicaGC'ed because
5. it blocks a preemptive snapshot, which now recreates it.

In my view, changing the ReplicaID for a live Replica is a bad idea and
incurs too much complexity. An architecture in which Replica objects have a
single ReplicaID throughout their lifetime is conceptually much simpler, and
it is also much more straightforward to maintain since it does away with
a whole class of concurrency that needs to be tamed in today's code, and which
may have performance repercussions. On the other hand, replicaID changes are
not frequent, and only need to be moderately fast.

The alternative is to instantiate a new incarnation of the Replica whenever
the ReplicaID changes. The difficult part about this is destroying the old
Replica; since Replica provides proper serialization, we mustn't have commands
in-flight in two instances for the same data (and generally we want to avoid
even having to think about concurrent use of old incarnations). This is
explored here. The above history would read something like this:

1. start off as an uninitialized Replica R with Replica ID 12 (no data)
2. preemptive snapshot is received: tear down R, instantiate R'
3. ReplicaID is updated: tear down R', instantiate R''
4. R'' is marked for ReplicaGC: replace with placeholder R''' in Store, tear
   down R'', wait for references to drain, remove the data, remove R'''.
5. instantiate R'''' (no change from before).

A third upshot is almost visible in the above description. Once we can re-
instantiate cleanly on ReplicaID-based state changes, we might as well go
ahead and pull apart various types of Replica:

- preemptive snapshots (though we may replace those by learner replicas in
  the future[2])
- uninitialized Replicas (has replicaID, but no data)
- initialized Replicas (has replicaID and data)
- placeholders (have no replicaID and no data)

To simplify replicaGC and to remove the preemptive snapshot state even before
we stop using preemptive snapshots, we may allow placeholders to hold data
(but with no means to run client requests against it).
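For illustration only, the proposed split might be modeled as a simple state enum (hypothetical names, not actual code):

```go
package sketch

// replicaKind enumerates the states the text above proposes pulling apart;
// the names are illustrative.
type replicaKind int

const (
	// kindPreemptiveSnapshot: has data but replicaID zero (slated to be
	// replaced by learner replicas).
	kindPreemptiveSnapshot replicaKind = iota
	// kindUninitialized: has a replicaID but no data yet.
	kindUninitialized
	// kindInitialized: has both a replicaID and data.
	kindInitialized
	// kindPlaceholder: has no replicaID; today no data, though per the above
	// it may be allowed to hold data during replicaGC.
	kindPlaceholder
)
```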

Once there, reducing the memory footprint of "idle" replicas becomes a
natural proposition: introduce a new replica state in which only a "shim"
(replacing the "full" version) is kept in memory until the next request comes
in, at which point it is promoted to a full replica (loading the full
in-memory state from disk). This picks up past attempts at this which failed
due to the technical debt present at the moment[3].

[1]: cockroachdb#8630
[2]: cockroachdb#34058
[3]: cockroachdb#31663

Release note: None
bdarnell (Contributor) commented

These catch up on the log and can receive snapshots, but they don't vote, campaign, or count in the commit quorum.

Do they count towards the quota pool? If they do, then these non-voting learner replicas could still impact write performance. If not, there are various sorts of pathological loops that could result in snapshots being sent repeatedly.

I think this is generally a good idea (if learner replicas had been implemented in etcd/raft when we were starting out, we'd have used them instead of implementing preemptive snapshots), but this is going to be a lot of work to tune and debug.

tbg (Member, Author) commented Jan 17, 2019

It's definitely a larger change, especially since it matters in so many places (there'll also be plenty of UI work).

The quota pool should ignore the learners. We already don't truncate the log while a snapshot is in flight, so the only effect is that learners will be allowed to fall behind arbitrarily, which means the Raft log can grow without bound until the learner either gets upgraded or removed.
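A minimal sketch of what "the quota pool ignores learners" could mean, assuming the pool releases quota up to the slowest voter's match index (the types here are simplified stand-ins for the Raft progress data, not the actual quota pool code):

```go
package sketch

// minimalProgress carries just the per-follower fields this sketch needs; in
// real code they would come from the Raft status. Names are illustrative.
type minimalProgress struct {
	Match     uint64 // highest log index known to be replicated to this follower
	IsLearner bool
}

// quotaReleaseIndex returns the log index up to which quota may be released:
// the minimum Match across voters only. Learners are skipped, so a lagging
// learner never backpressures foreground writes (at the cost of letting the
// Raft log grow until the learner is promoted or removed).
func quotaReleaseIndex(progress map[uint64]minimalProgress) uint64 {
	var min uint64
	first := true
	for _, pr := range progress {
		if pr.IsLearner {
			continue
		}
		if first || pr.Match < min {
			min = pr.Match
			first = false
		}
	}
	return min
}
```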

bdarnell (Contributor) commented

We already don't truncate the log while a snapshot is in flight, so the only effect is that learners will be allowed to fall behind arbitrarily, which means the Raft log can grow without bound until the learner either gets upgraded or removed.

As long as they're temporary. My concern about the quota pool is if we start to have permanent non-voting replicas for follower reads (as suggested).

tbg (Member, Author) commented Jan 17, 2019

Ah, yes. This is a bigger problem. It seems to make sense to backpressure in that case, though perhaps less aggressively than for quorum members. Anyway, that won't be happening anytime soon, and I also don't know whether there's a strong enough incentive after CDC.

ajwerner (Contributor) commented Jan 2, 2020

I'm closing this. Learner replicas shipped in 19.2.

ajwerner closed this as completed Jan 2, 2020