[wip] avoid accidental GC of preemptive snapshots #32139
Conversation
This probably isn't going to be close to the code that will eventually get checked in, but I wanted to get the conversation started.

I don't have concrete evidence that this problem is a root cause of #32046, but I want to address it (at the very least for #31875).

I have to dig in more, but what I'm seeing in various import/tpch-flavor tests is that the split-scatter phase is extremely aggressive in splitting and then downreplicating overreplicated ranges. For example, r100 might have descriptor [n1,n2,n3,n4] and will rapidly be split (and its LHS and RHS split again, multiple times) while, say, n4 is removed. I think that in this kind of situation, needing one Raft snapshot quickly implies needing ~O(splits) Raft snapshots, because splitting a range on which one replica requires a Raft snapshot leaves you with two ranges that do. The implication is that we don't want to need Raft snapshots (and perhaps also: we want to go easy on splitting ranges for which one replica already needs a snapshot).

On a recent "successful" run of tpccbench/nodes=11/cpus=32, a spike in pending snapshots from zero to 5k (resolved within minutes) was observed. A run of import/tpch/nodes=8 typically shows a rapid increase from zero to ~1k which only dissipates after the import returns. This variation may be random, or it may indicate that the import test is a lot more aggressive for some reason.

I have to look into the details, but the following script results in a number of Raft snapshots (dozens). This may already be fixed by other PRs such as #31875, though. Easy to verify.

----

An upreplication begins by sending a preemptive snapshot, followed by a transaction which "officially" adds the new member to the Raft group. This leaves a (typically small) window during which the replicaGC queue could pick up the preemptive snapshot and delete it. This is unfortunate, as it leaves the range in a fragile state, with one follower requiring a Raft snapshot to catch up.

This commit introduces a heuristic that holds off on GC'ing replicas that look like preemptive snapshots until they've been around for a while.

Release note: None
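To make the hold-off heuristic concrete, here is a minimal Go sketch of the idea described above. The names (`replicaInfo`, `shouldGC`, `preemptiveSnapshotGCGrace`) and the exact grace period are invented for illustration and are not the code in this PR; the sketch only assumes that a replica created from a preemptive snapshot can be recognized as one that is not yet in the range descriptor and has no replica ID, and that GC is simply deferred until a grace period has elapsed.

```go
// Hypothetical sketch of the hold-off heuristic; names and constants are
// illustrative only and do not correspond to the actual CockroachDB code.
package main

import (
	"fmt"
	"time"
)

// preemptiveSnapshotGCGrace is an assumed grace period: a replica that looks
// like a preemptive snapshot is not GC'ed until it has been around at least
// this long, by which time the replication change has likely failed.
const preemptiveSnapshotGCGrace = 10 * time.Minute

// replicaInfo carries just the fields the heuristic needs.
type replicaInfo struct {
	replicaID    int  // 0 here stands for "created from a preemptive snapshot"
	inDescriptor bool // whether the local store appears in the range descriptor
	createdAt    time.Time
}

// shouldGC reports whether the replica GC queue may remove this replica now.
// A replica that is not in the descriptor would normally be GC'able, but if it
// looks like a preemptive snapshot we hold off until the grace period expires,
// since the upreplication transaction may still be about to add it.
func shouldGC(r replicaInfo, now time.Time) bool {
	if r.inDescriptor {
		return false // live member of the range; never GC
	}
	likelyPreemptive := r.replicaID == 0
	if likelyPreemptive && now.Sub(r.createdAt) < preemptiveSnapshotGCGrace {
		return false // give the pending replication change time to complete
	}
	return true
}

func main() {
	now := time.Now()
	fresh := replicaInfo{replicaID: 0, createdAt: now.Add(-time.Minute)}
	stale := replicaInfo{replicaID: 0, createdAt: now.Add(-time.Hour)}
	fmt.Println(shouldGC(fresh, now)) // false: likely a preemptive snapshot, too young
	fmt.Println(shouldGC(stale, now)) // true: grace period has passed
}
```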
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/replica_gc_queue.go, line 240 at r1 (raw file):
// in a fragile state until it has recovered via a Raft snapshot, so
// we want to hold off on the GC for a while until it's likely that
// the replication change has failed.
Could we add an RPC to check with the leaseholder? The leaseholder knows when a replica is being added to the range, but hasn't yet been added to the range descriptor.
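For illustration only, here is a sketch of what such a leaseholder check might look like. The interface and method names (`PendingChangeChecker`, `IsReplicaBeingAdded`, `maySafelyGC`) are hypothetical and not part of CockroachDB; the point is just that on error we would err on the side of keeping the replica.

```go
// Hypothetical sketch of the RPC-based alternative: before GC'ing a replica
// that looks like a preemptive snapshot, ask the leaseholder whether a
// replication change for this store is in flight. All names are invented.
package main

import (
	"context"
	"fmt"
)

// PendingChangeChecker abstracts the RPC to the leaseholder.
type PendingChangeChecker interface {
	// IsReplicaBeingAdded returns true if the leaseholder has an in-flight
	// replication change that will add the given store to the given range.
	IsReplicaBeingAdded(ctx context.Context, rangeID, storeID int64) (bool, error)
}

// maySafelyGC asks the leaseholder before GC'ing a suspected preemptive
// snapshot. On RPC error we keep the replica rather than risk deleting it.
func maySafelyGC(ctx context.Context, c PendingChangeChecker, rangeID, storeID int64) bool {
	pending, err := c.IsReplicaBeingAdded(ctx, rangeID, storeID)
	if err != nil {
		return false // can't reach the leaseholder; keep the replica for now
	}
	return !pending
}

// fakeChecker is a trivial in-memory stand-in for the leaseholder RPC.
type fakeChecker struct{ pending bool }

func (f fakeChecker) IsReplicaBeingAdded(context.Context, int64, int64) (bool, error) {
	return f.pending, nil
}

func main() {
	ctx := context.Background()
	fmt.Println(maySafelyGC(ctx, fakeChecker{pending: true}, 100, 4))  // false: change in flight
	fmt.Println(maySafelyGC(ctx, fakeChecker{pending: false}, 100, 4)) // true: safe to GC
}
```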
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/replica_gc_queue.go, line 240 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Could we add an RPC to check with the leaseholder? The leaseholder knows when a replica is being added to the range, but hasn't yet been added to the range descriptor.
I think I like the tick-based heuristic better than the complexity of a new RPC.
We don't have preemptive snapshots any more.