raft: use learner replicas instead of preemptive snapshots #34058

Closed
tbg opened this issue Jan 16, 2019 · 9 comments
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments

tbg (Member) commented Jan 16, 2019

We could stop using preemptive snapshots if we used Raft's support for learner replicas. These catch up on the log and can receive snapshots, but they don't vote, campaign, or count in the commit quorum.
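For reference, here is a rough sketch of what adding and later promoting a learner looks like against the upstream etcd/raft conf-change API (import paths vary by etcd version, and the wiring into our change-replicas code is elided):

```go
package sketch

import (
	"context"

	"go.etcd.io/etcd/raft"
	"go.etcd.io/etcd/raft/raftpb"
)

// addLearnerThenPromote proposes a conf change that adds nodeID as a learner
// (it receives log entries and snapshots but does not vote, campaign, or
// count towards the commit quorum), and later a second conf change that
// promotes it to a full voter.
func addLearnerThenPromote(ctx context.Context, n raft.Node, nodeID uint64) error {
	// Phase 1: add as a learner; the leader starts replicating to it right away.
	if err := n.ProposeConfChange(ctx, raftpb.ConfChange{
		Type:   raftpb.ConfChangeAddLearnerNode,
		NodeID: nodeID,
	}); err != nil {
		return err
	}

	// ... wait for the conf change to commit (ApplyConfChange) and for the
	// learner to catch up via log entries or a snapshot ...

	// Phase 2: promote to a voter; from here on it counts in the quorum.
	return n.ProposeConfChange(ctx, raftpb.ConfChange{
		Type:   raftpb.ConfChangeAddNode,
		NodeID: nodeID,
	})
}
```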

Doing so would unify Raft and preemptive snapshots, would allow us to treat learners as regular replicas in much of the code, and would prevent the replica GC queue from erroneously removing preemptive snapshots (which is a medium-size nuisance today). Instead, we'd need a mechanism that regularly inspects the range descriptor for learners and decides whether a learner should be removed (for example, after a replication change fails without cleaning up the learner).

Naively we could try to rely on the Raft snapshot queue to catch up the learners, but that queue works very poorly. Until it has been redesigned, we would probably be better off sending the snapshot manually like today.

Learner replicas could also be used in conjunction with follower reads to enable historical queries at a remote location. For example, a satellite DC could hold a learner replica of each range whose full members live in another DC, and would allow the satellite DC to carry out historical reads on a copy of the data. Whether this is useful in the presence of CDC is speculation, but I think @bdarnell had opinions in the past.

cc @nvanbenschoten (who suggested learner replicas)

tbg added the A-kv-replication (Relating to Raft, consensus, and coordination.) label Jan 16, 2019
tbg added this to the Later milestone Jan 16, 2019
knz added the C-enhancement (Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)) label Jan 16, 2019
petermattis (Collaborator) commented

I'm 👍 on trying to reduce the complexity around preemptive snapshots.

Naively we could try to rely on the Raft snapshot queue to catch up the learners, but that queue works very poorly. Until it has been redesigned, we would probably be better off sending the snapshot manually like today.

The Raft snapshot queue exists to prevent a thundering herd of snapshot activity under various scenarios (i.e. a node being offline for a few minutes, such that Raft logs are truncated). I'm not sure what the "manually like today" option means. Using the preemptive mechanism? Can you elaborate?

tbg (Member, Author) commented Jan 16, 2019

Yes. A preemptive snapshot is essentially a real snapshot, just for replicaID zero, with special behavior around snapshot reservations and a special receive path. I'm suggesting that we initially send a manual snapshot from the replicate queue. This would be a Raft snapshot, but with the same rate limiting and reservation behavior that would otherwise apply to a preemptive snap; once the snapshot reaches the Raft group it looks like "just another snapshot".

Raft snapshots have lots of unfortunate behavior. The one that matters most in the context of this change is that a Raft snapshot may be held up by the reservation semaphore on the receiver, which would then block the sender's replicate queue.

Ultimately Raft and learner-Raft snapshots should both be sent through the snapshot queue, but until the Raft snapshot queue "just works" that's a bad idea. Making the Raft snapshot queue "good" is a separate engineering effort. My observation is that it seems straightforward to keep the two untangled right from the get-go while still keeping the benefit of removing preemptive snapshots.
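To make the proposed flow concrete, here is a hedged sketch of what the replicate queue might do; all of the names (replicaOps, AddLearner, SendSnapshot, and so on) are hypothetical stand-ins, not actual code:

```go
package sketch

import "context"

// replicaOps abstracts the handful of operations this sketch needs; in real
// code these would live on *Replica / the replicate queue. All names here are
// hypothetical.
type replicaOps interface {
	AddLearner(ctx context.Context, target uint64) error
	RemoveLearner(ctx context.Context, target uint64) error
	PromoteLearner(ctx context.Context, target uint64) error
	// SendSnapshot streams a snapshot to the target at the given rate
	// (bytes/s), with the receiver declining quickly (as for preemptive
	// snapshots today) rather than holding the sender on a reservation.
	SendSnapshot(ctx context.Context, target uint64, rateBytesPerSec int64) error
}

// addViaLearner is the flow described above: add the target as a learner,
// send its initial snapshot from the replicate queue itself (not the Raft
// snapshot queue), then promote it; if the snapshot fails, remove the learner
// so it doesn't linger in the range descriptor.
func addViaLearner(ctx context.Context, r replicaOps, target uint64) error {
	if err := r.AddLearner(ctx, target); err != nil {
		return err
	}
	const rebalanceRate = 2 << 20 // 2mb/s, matching today's preemptive snapshot rate limit
	if err := r.SendSnapshot(ctx, target, rebalanceRate); err != nil {
		_ = r.RemoveLearner(ctx, target) // best-effort cleanup
		return err
	}
	return r.PromoteLearner(ctx, target)
}
```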

petermattis (Collaborator) commented

Got it. So you would have the replicate queue still send the snapshot (which limits the number of outstanding "preemptive" snapshots on the sender), and leave the existing Raft snapshot queue unchanged?

Btw, can you refresh my memory about the problems with the Raft snapshot queue? It is unfortunate that a problem with one range sending a snapshot can block sending snapshots for another range, but that is also part of the point.

tbg (Member, Author) commented Jan 16, 2019

Yep, exactly. This is really just a workaround until we can trust the raft snap queue.

Btw, can you refresh my memory about the problems with the Raft snapshot queue?

#32046 (comment) has a little bit on this. Basically there's a lot of little badness adding up to giant badness. reserveSnapshot blocks Raft snapshots on the receiver (it rejects preemptive ones, which is a lot better much of the time). There is also this backoff around "store throttling", where we won't rebalance to a store for 5s after an error. I believe we trigger that behavior accidentally all the time, which of course is terrible for swift rebalancing.

There's also, still, priority inversion. A preemptive snapshot reservation sticks around for ~32s (a full 64mb range at the 2mb/s rate limit), so a Raft snapshot may block for 32s or even a multiple of that, and in turn that blocks a Raft snapshot queue somewhere. It's even possible that we block on the reservation only for the snapshot to then be rejected because some replica is in the way (I'd have to double-check, but it wouldn't surprise me).

And lastly we don't have flow control for the Raft snapshots except the hardcoded 8mb/s. We'd want to use more when we can. (This is a secondary concern at the moment, if we got close to 8mb/s of effective throughput on snapshot queue processing that'd be a big win).

The goal should be that 64 1mb snaps (perhaps to different receivers) can be processed in approximately the time of a single 64mb snap (it won't get quite as fast, but you get the idea), even when the receivers sometimes can't accept the snapshots right away. As long as there is one receiver that can accept something, it should get it.
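As a purely illustrative sketch of that "whoever can accept something should get it" behavior (none of these types exist today; a real queue would also need prioritization and retry backoff):

```go
package sketch

// pendingSnap identifies a snapshot that still needs to be sent; the fields
// are illustrative.
type pendingSnap struct {
	rangeID  int64
	receiver string
}

// receiverPool models the per-receiver reservation: TryReserve returns false
// immediately when the receiver can't accept a snapshot right now, instead of
// blocking the whole queue on it.
type receiverPool interface {
	TryReserve(receiver string) bool
	Release(receiver string)
}

// processOnce makes a single pass over the pending snapshots, sending to any
// receiver that can currently accept one and requeueing the rest. With 64 1mb
// snapshots to mostly-available receivers, total time is dominated by the
// send rate limit rather than by waiting on the slowest reservation.
func processOnce(pool receiverPool, pending []pendingSnap, send func(pendingSnap) error) []pendingSnap {
	var stillPending []pendingSnap
	for _, s := range pending {
		if !pool.TryReserve(s.receiver) {
			stillPending = append(stillPending, s) // busy receiver: skip, retry next pass
			continue
		}
		err := send(s) // rate-limited stream to the receiver
		pool.Release(s.receiver)
		if err != nil {
			stillPending = append(stillPending, s) // failed: retry next pass
		}
	}
	return stillPending
}
```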

tbg added a commit to tbg/cockroach that referenced this issue Jan 16, 2019
The comment below goes into more detail, but here's the TL;DR:

Problems:

1. right now answering "is this Replica object unused?" is impossible
2. right now replicaIDs change on existing replicas, which is
   very complex to reason about
3. right now it'll be difficult to split replicas into different
   states because that requires answering 1 during state transitions

Proposed:

1. use a clean API to refcount Replica usage, and
2. allow initiating teardown "basically whenever" without blocking
   (think merge trigger),
3. so that the replica clears out quickly,
4. which in turn solves 1.
5. Then we can solve 2 because we'll be able to
   replace the Replica object in the Store whenever
   the replicaID would previously change in-place
   (this will not be trivial, but I hope it can be done),
6. and we should also be able to do 3 (again, not trivial,
   but a lot harder right now).

I expect the replication code to benefit from 6) as the Raft instance on
a Replica would never change.

This PR is a toy experiment for 1. It certainly wouldn't survive contact
with the real code, but it's sufficient to discuss this project and
iterate on the provisional Guard interface.

----

GuardedReplica is the external interface through which callers interact with
a Replica. By acquiring references to the underlying Replica object while it
is being used, it allows safe removal of Replica objects and/or their
underlying data. This is an important fundamental building block for several
reasons:

Today, we use no such mechanism, though this is largely due to failing in the
past to establish one[1]. The status quo "works" by maintaining a
destroyStatus inside of Replica.mu, which is checked in a few places such as
before proposing a command or serving a read. Since these checks are only
"point in time", and nothing prevents the write to the status from occurring
just a moment after the check, there is a high cognitive overhead to
reasoning about the possible outcomes. In fact, in the case in which things
could go bad most spectacularly, namely removing a replica including data, we
hold essentially all of the locks available to us and rely on the
readOnlyCmdMu (which we would rather get rid of). This then is the first
reason for proposing this change: make the replica lifetime easier to reason
about and establish confidence that the Replica can't just disappear out from
under us.
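As a toy illustration of the refcounting idea (not the PR's actual Guard interface; names are made up):

```go
package sketch

import (
	"errors"
	"sync"
)

// replicaGuard is a toy version of the refcounting described above: callers
// acquire a reference before touching the Replica, and teardown waits for
// outstanding references to drain before the object (and its data) may be
// removed. Names are illustrative.
type replicaGuard struct {
	mu        sync.Mutex
	refs      int
	destroyed bool
	drained   chan struct{} // closed once destroyed && refs == 0
}

func newReplicaGuard() *replicaGuard {
	return &replicaGuard{drained: make(chan struct{})}
}

var errDestroyed = errors.New("replica is being torn down")

// Acquire registers a caller. It fails fast once teardown has begun, which is
// what replaces today's point-in-time destroyStatus checks.
func (g *replicaGuard) Acquire() error {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.destroyed {
		return errDestroyed
	}
	g.refs++
	return nil
}

// Release drops a reference and, if teardown has started and this was the
// last reference, signals that the replica has drained.
func (g *replicaGuard) Release() {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.refs--
	if g.destroyed && g.refs == 0 {
		close(g.drained)
	}
}

// Destroy initiates teardown "basically whenever" without blocking; the
// returned channel is closed once all references have drained and it is safe
// to remove the underlying data.
func (g *replicaGuard) Destroy() <-chan struct{} {
	g.mu.Lock()
	defer g.mu.Unlock()
	if !g.destroyed {
		g.destroyed = true
		if g.refs == 0 {
			close(g.drained)
		}
	}
	return g.drained
}
```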

The second motivator is ReplicaID transitions, which can be extremely
complicated. For example, a Replica may

1. start off as an uninitialized Replica with ReplicaID 12 (i.e. no data)
2. receive a preemptive snapshot, which confusingly results in an initialized
   Replica with ReplicaID 12 (though preemptive snapshots nominally should
   result in a preemptive replica -- one with ReplicaID zero).
3. update its ReplicaID to 18 (i.e. being added back to the Raft group).
4. get ReplicaGC'ed because
5. it blocks a preemptive snapshot, which now recreates it.

In my view, changing the ReplicaID for a live Replica is a bad idea and
incurs too much complexity. An architecture in which Replica objects have a
single ReplicaID throughout their lifetime is conceptually much simpler, and
it is also much more straightforward to maintain since it does away with
a whole class of concurrency that needs to be tamed in today's code, and which
may have performance repercussions. On the other hand, replicaID changes are
not frequent, and only need to be moderately fast.

The alternative is to instantiate a new incarnation of the Replica whenever
the ReplicaID changes. The difficult part about this is destroying the old
Replica; since Replica provides proper serialization, we mustn't have commands
in-flight in two instances for the same data (and generally we want to avoid
even having to think about concurrent use of old incarnations). This is
explored here. The above history would read something like this:

1. start off as an uninitialized Replica R with Replica ID 12 (no data)
2. preemptive snapshot is received: tear down R, instantiate R'
3. ReplicaID is updated: tear down R', instantiate R''
4. R'' is marked for ReplicaGC: replace with placeholder R''' in Store, tear
   down R'', wait for references to drain, remove the data, remove R'''.
5. instantiate R'''' (no change from before).

A third upshot is almost visible in the above description. Once we can re-
instantiate cleanly on ReplicaID-based state changes, we might as well go
ahead and pull apart various types of Replica:

- preemptive snapshots (though we may replace those by learner replicas in
  the future[2])
- uninitialized Replicas (has replicaID, but no data)
- initialized Replicas (has replicaID and data)
- placeholders (have no replicaID and no data)

To simplify replicaGC and to remove the preemptive snapshot state even before
we stop using preemptive snapshots, we may allow placeholders to hold data
(but with no means to run client requests against it).
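For illustration only, the proposed split might be modeled as a simple state enum (hypothetical names, not actual code):

```go
package sketch

// replicaKind enumerates the states the text above proposes pulling apart;
// the names are illustrative.
type replicaKind int

const (
	// kindPreemptiveSnapshot: has data but replicaID zero (slated to be
	// replaced by learner replicas).
	kindPreemptiveSnapshot replicaKind = iota
	// kindUninitialized: has a replicaID but no data yet.
	kindUninitialized
	// kindInitialized: has both a replicaID and data.
	kindInitialized
	// kindPlaceholder: has no replicaID; today no data, though per the above
	// it may be allowed to hold data during replicaGC.
	kindPlaceholder
)
```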

Once there, reducing the memory footprint of "idle" replicas becomes a
natural proposition: introduce a new replica state in which only a "shim"
(replacing the "full" version) is kept in memory until the next request comes
in, at which point it is promoted to a full replica (loading the full
in-memory state from disk). This picks up past attempts at this which failed
due to the technical debt present at the moment[3].

[1]: cockroachdb#8630
[2]: cockroachdb#34058
[3]: cockroachdb#31663

Release note: None
bdarnell (Contributor) commented

These catch up on the log and can receive snapshots, but they don't vote, campaign, or count in the commit quorum.

Do they count towards the quota pool? If they do, then these non-voting learner replicas could still impact write performance. If not, there are various sorts of pathological loops that could result in snapshots being sent repeatedly.

I think this is generally a good idea (if learner replicas had been implemented in etcd/raft when we were starting out, we'd have used them instead of implementing preemptive snapshots), but this is going to be a lot of work to tune and debug.

tbg (Member, Author) commented Jan 17, 2019

It's definitely a larger change, especially since it matters in so many places (there'll also be plenty of UI work).

The quota pool should ignore the learners. We already don't truncate the log while a snapshot is in flight, so the only effect is that learners will be allowed to fall behind arbitrarily, which means the Raft log can grow without bound until the learner either gets upgraded or removed.
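A minimal sketch of what "the quota pool ignores learners" could mean, assuming the pool releases quota up to the slowest voter's match index (the types here are simplified stand-ins for the Raft progress data, not the actual quota pool code):

```go
package sketch

// minimalProgress carries just the per-follower fields this sketch needs; in
// real code they would come from the Raft status. Names are illustrative.
type minimalProgress struct {
	Match     uint64 // highest log index known to be replicated to this follower
	IsLearner bool
}

// quotaReleaseIndex returns the log index up to which quota may be released:
// the minimum Match across voters only. Learners are skipped, so a lagging
// learner never backpressures foreground writes (at the cost of letting the
// Raft log grow until the learner is promoted or removed).
func quotaReleaseIndex(progress map[uint64]minimalProgress) uint64 {
	var min uint64
	first := true
	for _, pr := range progress {
		if pr.IsLearner {
			continue
		}
		if first || pr.Match < min {
			min = pr.Match
			first = false
		}
	}
	return min
}
```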

bdarnell (Contributor) commented

We already don't truncate the log while a snapshot is in flight, so the only effect is that learners will be allowed to fall behind arbitrarily, which means the Raft log can grow without bound until the learner either gets upgraded or removed.

As long as they're temporary. My concern about the quota pool is if we start to have permanent non-voting replicas for follower reads (as suggested).

tbg (Member, Author) commented Jan 17, 2019

Ah, yes. This is a bigger problem. It seems to make sense to backpressure in that case, though perhaps less aggressively than for quorum members. Anyway, that won't be happening anytime soon, and I also don't know whether there's a strong enough incentive after CDC.

ajwerner (Contributor) commented Jan 2, 2020

I'm closing this. Learner replicas shipped in 19.2.

ajwerner closed this as completed Jan 2, 2020