
[wip] avoid accidental GC of preemptive snapshots #32139

Closed
wants to merge 1 commit

Conversation

tbg
Member

@tbg tbg commented Nov 3, 2018

This probably isn't going to be close to the code that will eventually
get checked in, but I wanted to get the conversation started.

I don't have concrete evidence that this problem is a root cause of
#32046, but I want to address it (at the very least for #31875).

I have to dig in more, but what I'm seeing in various import/tpch flavor
tests is that the split-scatter phase is extremely aggressive in
splitting and then downreplicating overreplicated ranges.

For example, r100 might have descriptor [n1,n2,n3,n4] and will rapidly
be split (and its LHS and RHS split again, multiple times) while, say,
n4 is removed. I think that in this kind of situation, needing one
Raft snapshot quickly implies needing ~O(splits) Raft snapshots.
This is because splitting a range on which one replica requires a
Raft snapshot leaves you with two ranges that do.

The implication is that we don't want to need Raft snapshots (and
perhaps also: we want to go easy on splitting ranges for which one
replica already needs a snapshot).

On a recent "successful" run of tpccbench/nodes=11/cpus=32, a spike in
pending snapshots from zero to 5k (resolved within minutes) was
observed. A run of import/tpch/nodes=8 typically shows a rapid increase
from zero to ~1k which only dissipates after the import returns.
This variation may be random, or it may indicate that the import test is
a lot more aggressive for some reason.

I have to look into the details, but the following script results in
a number of Raft snapshots (dozens). This may already be fixed by other
PRs such as #31875, though. Easy to verify.


An upreplication begins by sending a preemptive snapshot, followed by
a transaction which "officially" adds the new member to the Raft group.

This leaves a (typically small) window during which the replicaGC queue
could pick up the preemptive snapshot and delete it. This is unfortunate
as it leaves the range in a fragile state, with one follower requiring a
Raft snapshot to catch up.

This commit introduces a heuristic that holds off on GC'ing replicas
that look like preemptive snapshots until they've been around for a
while.

Release note: None
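
For illustration, a minimal sketch of the hold-off heuristic described above. This is not the actual diff: the type, the preemptive-snapshot check, and the grace period below are all made up (the discussion talks about a tick-based hold-off, not a fixed duration).

```go
// Sketch only: replicaState, preemptiveGracePeriod, and shouldGC are
// illustrative stand-ins, not the names in pkg/storage.
package storage

import "time"

type replicaState struct {
	replicaID  int       // 0: not (yet) listed in the range descriptor
	hasData    bool      // the store holds data for this range
	receivedAt time.Time // when the snapshot/replica appeared on this store
}

// The discussion refers to a tick-based heuristic ("been around for a
// while"); a duration is used here purely for illustration.
const preemptiveGracePeriod = 10 * time.Minute

// shouldGC holds off on GC'ing a replica that looks like a preemptive
// snapshot (data on disk but replica ID zero, i.e. the ChangeReplicas
// transaction hasn't committed yet) until enough time has passed that the
// replication change has likely failed.
func shouldGC(r replicaState, now time.Time) bool {
	looksPreemptive := r.hasData && r.replicaID == 0
	if looksPreemptive && now.Sub(r.receivedAt) < preemptiveGracePeriod {
		return false // GC'ing now would leave the new member needing a Raft snapshot
	}
	return true
}
```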

@tbg tbg requested a review from a team November 3, 2018 00:19
@cockroach-teamcity
Member

This change is Reviewable

Collaborator

@petermattis petermattis left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/replica_gc_queue.go, line 240 at r1 (raw file):

			// in a fragile state until it has recovered via a Raft snapshot, so
			// we want to hold off on the GC for a while until it's likely that
			// the replication change has failed.

Could we add an RPC to check with the leaseholder? The leaseholder knows when a replica is being added to the range, but hasn't yet been added to the range descriptor.

Contributor

@bdarnell bdarnell left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/replica_gc_queue.go, line 240 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Could we add an RPC to check with the leaseholder? The leaseholder knows when a replica is being added to the range, but hasn't yet been added to the range descriptor.

I think I like the tick-based heuristic better than the complexity of a new RPC.
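
For comparison, a rough sketch of the leaseholder-check alternative raised above; every name here is invented, and as noted the tick-based hold-off avoids this extra round trip entirely.

```go
package storage

// Illustrative only: a request/response pair for an RPC the replicaGC queue
// could send to the leaseholder before GC'ing a replica that is absent from
// the range descriptor. None of these names exist in the codebase.
type PendingAdditionRequest struct {
	RangeID int64
	StoreID int32 // store holding the suspected preemptive snapshot
}

type PendingAdditionResponse struct {
	// Pending is true if the leaseholder is in the middle of a replication
	// change that will add a replica on StoreID; if so, the preemptive
	// snapshot must not be GC'ed yet.
	Pending bool
}
```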

@tbg tbg added the X-noremind Bots won't notify about PRs with X-noremind label Jun 19, 2019
@tbg tbg closed this Dec 2, 2019
@tbg
Member Author

tbg commented Dec 2, 2019

We don't have preemptive snapshots any more.
