[wip] avoid accidental GC of preemptive snapshots #32139
Conversation
This probably isn't going to be close to the code that will eventually get checked in, but I wanted to get the conversation started.

I don't have concrete evidence that this problem is a root cause of #32046, but I want to address it (at the very least for #31875).

I have to dig in more, but what I'm seeing in various import/tpch-flavor tests is that the split-scatter phase is extremely aggressive in splitting and then downreplicating overreplicated ranges. For example, r100 might have descriptor [n1,n2,n3,n4] and will rapidly be split (and its LHS and RHS split again, multiple times) while, say, n4 is removed. I think that in this kind of situation, needing one Raft snapshot quickly implies needing ~O(splits) Raft snapshots, because splitting a range on which one replica requires a Raft snapshot leaves you with two ranges that do. The implication is that we don't want to need Raft snapshots (and perhaps also: we want to go easy on splitting ranges for which one replica already needs a snapshot).

On a recent "successful" run of tpccbench/nodes=11/cpus=32, a spike in pending snapshots from zero to 5k (resolved within minutes) was observed. A run of import/tpch/nodes=8 typically shows a rapid increase from zero to ~1k which only dissipates after the import returns. This variation may be random, or it may indicate that the import test is a lot more aggressive for some reason.

I have to look into the details, but the following script results in a number of Raft snapshots (dozens). This may already be fixed by other PRs such as #31875, though. Easy to verify.

----

An upreplication begins by sending a preemptive snapshot, followed by a transaction which "officially" adds the new member to the Raft group. This leaves a (typically small) window during which the replicaGC queue could pick up the preemptive snapshot and delete it. This is unfortunate, as it leaves the range in a fragile state, with one follower requiring a Raft snapshot to catch up.

This commit introduces a heuristic that holds off on GC'ing replicas that look like preemptive snapshots until they've been around for a while.

Release note: None
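To make the hold-off heuristic concrete, here is a minimal Go sketch of the idea described above. The names (`replicaInfo`, `shouldGC`, `preemptiveSnapshotGCGrace`) and the exact grace period are invented for illustration and are not the code in this PR; the sketch only assumes that a replica created from a preemptive snapshot can be recognized as one that is not yet in the range descriptor and has no replica ID, and that GC is simply deferred until a grace period has elapsed.

```go
// Hypothetical sketch of the hold-off heuristic; names and constants are
// illustrative only and do not correspond to the actual CockroachDB code.
package main

import (
	"fmt"
	"time"
)

// preemptiveSnapshotGCGrace is an assumed grace period: a replica that looks
// like a preemptive snapshot is not GC'ed until it has been around at least
// this long, by which time the replication change has likely failed.
const preemptiveSnapshotGCGrace = 10 * time.Minute

// replicaInfo carries just the fields the heuristic needs.
type replicaInfo struct {
	replicaID    int  // 0 here stands for "created from a preemptive snapshot"
	inDescriptor bool // whether the local store appears in the range descriptor
	createdAt    time.Time
}

// shouldGC reports whether the replica GC queue may remove this replica now.
// A replica that is not in the descriptor would normally be GC'able, but if it
// looks like a preemptive snapshot we hold off until the grace period expires,
// since the upreplication transaction may still be about to add it.
func shouldGC(r replicaInfo, now time.Time) bool {
	if r.inDescriptor {
		return false // live member of the range; never GC
	}
	likelyPreemptive := r.replicaID == 0
	if likelyPreemptive && now.Sub(r.createdAt) < preemptiveSnapshotGCGrace {
		return false // give the pending replication change time to complete
	}
	return true
}

func main() {
	now := time.Now()
	fresh := replicaInfo{replicaID: 0, createdAt: now.Add(-time.Minute)}
	stale := replicaInfo{replicaID: 0, createdAt: now.Add(-time.Hour)}
	fmt.Println(shouldGC(fresh, now)) // false: likely a preemptive snapshot, too young
	fmt.Println(shouldGC(stale, now)) // true: grace period has passed
}
```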
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/replica_gc_queue.go, line 240 at r1 (raw file):
// in a fragile state until it has recovered via a Raft snapshot, so
// we want to hold off on the GC for a while until it's likely that
// the replication change has failed.
Could we add an RPC to check with the leaseholder? The leaseholder knows when a replica is being added to the range, but hasn't yet been added to the range descriptor.
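For illustration only, here is a sketch of what such a leaseholder check might look like. The interface and method names (`PendingChangeChecker`, `IsReplicaBeingAdded`, `maySafelyGC`) are hypothetical and not part of CockroachDB; the point is just that on error we would err on the side of keeping the replica.

```go
// Hypothetical sketch of the RPC-based alternative: before GC'ing a replica
// that looks like a preemptive snapshot, ask the leaseholder whether a
// replication change for this store is in flight. All names are invented.
package main

import (
	"context"
	"fmt"
)

// PendingChangeChecker abstracts the RPC to the leaseholder.
type PendingChangeChecker interface {
	// IsReplicaBeingAdded returns true if the leaseholder has an in-flight
	// replication change that will add the given store to the given range.
	IsReplicaBeingAdded(ctx context.Context, rangeID, storeID int64) (bool, error)
}

// maySafelyGC asks the leaseholder before GC'ing a suspected preemptive
// snapshot. On RPC error we keep the replica rather than risk deleting it.
func maySafelyGC(ctx context.Context, c PendingChangeChecker, rangeID, storeID int64) bool {
	pending, err := c.IsReplicaBeingAdded(ctx, rangeID, storeID)
	if err != nil {
		return false // can't reach the leaseholder; keep the replica for now
	}
	return !pending
}

// fakeChecker is a trivial in-memory stand-in for the leaseholder RPC.
type fakeChecker struct{ pending bool }

func (f fakeChecker) IsReplicaBeingAdded(context.Context, int64, int64) (bool, error) {
	return f.pending, nil
}

func main() {
	ctx := context.Background()
	fmt.Println(maySafelyGC(ctx, fakeChecker{pending: true}, 100, 4))  // false: change in flight
	fmt.Println(maySafelyGC(ctx, fakeChecker{pending: false}, 100, 4)) // true: safe to GC
}
```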
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/replica_gc_queue.go, line 240 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Could we add an RPC to check with the leaseholder? The leaseholder knows when a replica is being added to the range, but hasn't yet been added to the range descriptor.
I think I like the tick-based heuristic better than the complexity of a new RPC.
We don't have preemptive snapshots any more.