
storage: avoid errant Raft snapshots during splits #31875

Merged (2 commits) on Nov 14, 2018

Conversation

@tbg (Member) commented Oct 25, 2018

A known race occurs during splits when some nodes apply the split
trigger faster than others. The "slow" node(s) may learn about the
newly created right hand side replica through Raft messages arriving
from the "fast" nodes. In such cases, the leader will immediately try to
catch the follower (which it sees at log position zero) up via a
snapshot, but this isn't possible since there's an overlapping replica
(the pre-split replica waiting to apply the trigger). I think we
would find that out early in the life of the snapshot, i.e. we wouldn't
transfer large amounts of data around. Nevertheless, this smells bad.

The following script reproduces this by running 1000 splits back
to back in a three node local cluster. In an ideal world, we'd
see zero Raft snapshots here, but we see north of a hundred typically.

Not all of those are due to the race. ReplicaGC and Raft log truncation
can also contribute to this issue (disable both of those queues for pure
results).

We shouldn't ask for a snapshot when we suspect the target replica is
going to catch up on a split trigger. This commit doesn't really fix
that, but it suggests that one cheap way out is to simply report the
requested snapshot as completed immediately. Raft will go back to
probing the commit index, and this will continue until the split
trigger has won the race (of course this trick can only ever help
in situations in which we know in our heart of hearts that there's
a commit trigger that will render the snapshot useless).
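As a rough illustration of that "cheap way out", here is a minimal sketch, assuming etcd's raft package; the names are hypothetical and this is not the PR's actual code:

```
// Minimal sketch, assuming etcd's raft package; names are illustrative and
// this is not the PR's actual code.
package sketch

import "go.etcd.io/etcd/raft"

// reportFakeSnapshotSuccess pretends that follower `to` received the
// snapshot it was queued for. Raft then resumes probing the follower's log
// instead of waiting for a snapshot; once the split trigger has created the
// right-hand side replica, the probe succeeds and normal appends resume.
func reportFakeSnapshotSuccess(rn *raft.RawNode, to uint64) {
	// SnapshotFinish means "delivered"; SnapshotFailure would make Raft ask
	// for another snapshot right away, which is exactly what we want to avoid.
	rn.ReportSnapshot(to, raft.SnapshotFinish)
}
```

The reproduction script mentioned above follows.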


```
set -euxo pipefail

killall -9 cockroach || true
killall -9 workload || true
sleep 1
rm -rf cockroach-data || true
mkdir -p cockroach-data

./cockroach start --insecure --host=localhost --port=26257 --http-port=26258 --store=cockroach-data/1 --cache=256MiB --background
./cockroach start --insecure --host=localhost --port=26259 --http-port=26260 --store=cockroach-data/2 --cache=256MiB --join=localhost:26257 --background
./cockroach start --insecure --host=localhost --port=26261 --http-port=26262 --store=cockroach-data/3 --cache=256MiB --join=localhost:26257 --background

sleep 5

./cockroach sql --insecure -e 'alter range default configure zone using range_max_bytes = 1048471142400;'
./cockroach sql --insecure -e 'set cluster setting kv.range_merge.queue_enabled = false;'
./bin/workload run kv --splits 1000 --init --drop --max-ops 1

for port in 26257 26259 26261; do
  ./cockroach sql --insecure -e "select name, value from crdb_internal.node_metrics where name like '%raftsn%' order by name desc" --port "${port}"
done

sleep 5
tail -Fq cockroach-data/*/logs/cockroach.log
```

Touches #32046
Touches #31409

Release note: None

@tbg tbg requested a review from a team October 25, 2018 15:00
@cockroach-teamcity (Member) commented: This change is Reviewable

@tbg (Member, Author) commented Oct 26, 2018

This happens pretty regularly in roachtest restores too. We'll see lots of snapshots at applied index 11, often blocked by the unsplit RHS on the receiving node, which then gets added to the replica GC queue. It seems mostly harmless, but we should alleviate the problem.

@tbg tbg force-pushed the repro/splittrigger-snapshot-race branch from f49d644 to 6489745 Compare October 27, 2018 19:00
@tbg (Member, Author) commented Oct 27, 2018

I added a second commit that introduces a heuristic which removes almost all of the snapshots and isn't obviously going to prevent real snapshots from being sent.

In essence, I'm trying to figure out whether the range was recently split, and whether the follower is likely to get a "free snapshot" via the split trigger very soon.

This works almost 100% of the time, except in the case in which the LHS bumps its generation rapidly (due to being split a second time). This is entering the area of diminishing returns, but I do want to get rid of these snapshots reliably when they're not essential.

This raises the question: if not via the generation counter, what's another indicator that the range has "very recently" split? Sure, I could go and read the history of the local range descriptor and see if the first (generation-zero) value is "recent". But maybe there's something less invasive I can work with.

@benesch, I suspect that if there is something nice then you'd know about it.

@tbg (Member, Author) commented Oct 27, 2018

Filed #31947 for some general concerns about the Raft log queue.

@tbg tbg changed the title [dnm] storage: demonstrate workaround for splittrigger-snapshot race storage: avoid errant Raft snapshots during splits Oct 27, 2018
@tbg (Member, Author) commented Oct 27, 2018

This is still WIP, but I think worth taking a look at.

@tbg (Member, Author) commented Oct 27, 2018

(In particular, I haven't vetted whether it's OK to lie to Raft and just tell it that it got a snapshot, though it seems to work just fine.)

@petermattis (Collaborator) left a comment


pkg/storage/replica.go, line 5119 at r2 (raw file):

		recentlyCreatedViaSplit := desc.IsInitialized() && (desc.Generation == nil || *desc.Generation == 0) &&
			raftStatus != nil && raftStatus.Progress[msg.To].Match == 0 &&
			ticks <= 15

Newly created Raft groups have a small Raft log index, right? Can that be used here instead of looking at Generation?


pkg/storage/replica.go, line 5123 at r2 (raw file):

		if recentlyCreatedViaSplit {
			r.withRaftGroup(false, func(rn *raft.RawNode) (bool, error) {
				rn.ReportSnapshot(msg.To, raft.SnapshotFinish)

SnapshotFinish indicates that the snapshot succeeded. That doesn't seem right.

@tbg (Member, Author) left a comment


pkg/storage/replica.go, line 5123 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

SnapshotFinish indicates that the snapshot succeeded. That doesn't seem right.

That's the part we have to make "correct" (it might be correct today, but it certainly doesn't look the part). If we say the snapshot failed, Raft is going to keep asking for another snapshot until it gets one (I think; the experiment suggests as much). But here the follower receives a "pseudosnapshot" via the split trigger. So we lie and say "hey, you got a snapshot", and apparently Raft uses this information in a way that's compatible with what we're trying to achieve, i.e. it's as if it thought: I seem to have gotten this follower back on track, let's try to append to it again, not using what I think would've been in the snapshot I asked for, but just using what I knew about this follower before.

@benesch (Contributor) commented Oct 29, 2018

Nothing nice that I know about. We could write a range-ID local key with the time at which the range was created, but I don't think that would be any more precise than what you've got or what Peter suggested.

Here's an alternate proposal, though, just for kicks. Whenever you split a range, you record the LHS's range ID and applied index on the RHS. Then, whenever you send a snapshot, as part of the handshake, you send along the recorded range ID and applied index. The receiving store then checks whether it has a replica of that range at an earlier applied index. If it does, it blocks the snapshot for a bit (before we actually do any of the hard work of generating the snap) in the hope that its replica of the LHS will apply the split trigger. If the LHS replica doesn't make progress in a few hundred ms, then we queue it for GC; only if it gets GC'd or advances past the split trigger with a snapshot of its own do we allow the RHS snapshot through.

There’s some additional complexity here to avoid stalling out all four goroutines in the Raft snapshot queue, but I think it could be made to work.

@tbg (Member, Author) commented Oct 29, 2018

I thought about something like that but it seems overly complex. Also, the thing I'm really after is to never have to do any work for this case in the raft snapshot queue. We should see zero Raft snapshots in all situations which don't actually require recovery.

I wonder if we can attack the problem from the other side. The snapshot gets requested by the leader after the follower rejects its MsgApp. Could we selectively drop these rejection messages for a little bit to let the split trigger "always win"? I'll find out.

@tbg tbg force-pushed the repro/splittrigger-snapshot-race branch from cd3a65d to ff5e60d Compare October 29, 2018 14:20
@tbg (Member, Author) commented Oct 29, 2018

Added a commit. It works 100% of the time in my limited test. What I'm less sure about is whether I got all of the conditions right (how do I tell that a replica has a preemptive snapshot but isn't active yet? I think the tell is that it's not in its own RangeDescriptor), but this does look promising.

@benesch (Contributor) left a comment

Reviewed 1 of 1 files at r3, 1 of 1 files at r4, 1 of 1 files at r5.


pkg/storage/replica.go, line 5115 at r5 (raw file):

		r.mu.RUnlock()

		if !desc.IsInitialized() && len(desc.Replicas) == 0 && msg.RejectHint == 0 && ticks < 10 {

Ah, this feels much cleaner.

I think, though I'm not sure, that the check for !desc.IsInitialized() is redundant with the check for len(desc.Replicas) == 0. I think replicas with only a preemptive snapshot will return true for desc.IsInitialized().


pkg/storage/replica.go, line 5145 at r5 (raw file):

			// TODO(tschottdorf): preemptive snapshots may also be
			// garbage-collected before the replica addition is complete. In
			// such a case, we need to accept the Raft snapshot ASAP.

Can we just sidestep this race by refusing to GC replicas with preemptive snapshots for some time period after the snapshot was applied? Say, 60s?


pkg/storage/replica.go, line 5146 at r5 (raw file):

			// garbage-collected before the replica addition is complete. In
			// such a case, we need to accept the Raft snapshot ASAP.
			log.Warningf(ctx, "dropping rejection from %d to %d", msg.Index, msg.RejectHint)

Not sure how WIP this line is, but a warning seems overly strong.

@tbg (Member, Author) left a comment


pkg/storage/replica.go, line 5119 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Newly create Raft groups have a small Raft log index, right? Can that be used here instead of looking at Generation?

Obsolete, but here's my old comment:

I had added that at some point, but the log index can race ahead pretty quickly. It wasn't clear to me where to cut it off.


pkg/storage/replica.go, line 5123 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

That's the part that we have to make "correct" (it might be correct today, but certainly doesn't look the part). If we say the snapshot failed, (I think - but the experiment suggests it) Raft is going to ask for another snapshot until it gets one. But here we have this situation where it receives a "pseudosnapshot" via the split trigger. So we lie and say "hey you got a snapshot" and apparently Raft uses this information in a way that is compatible with what we're trying to achieve (i.e. it's like hey, I seem to have gotten this follower back on track, let's try to append to him again but not using what I think would've been in the snapshot I asked for, but just using what I knew about this follower before).

Thankfully this is now obsolete.


pkg/storage/replica.go, line 5115 at r5 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

Ah, this feels much cleaner.

I think, though I'm not sure, that the check for !desc.IsInitialized() is redundant with the check for len(desc.Replicas) == 0. I think replicas with only a preemptive snapshot will return true for desc.IsInitialized().

IsInitialized just checks for nonempty EndKey, but after a preemptive snapshot, don't you have "all of the data" except you're not a member of the RangeDescriptor?


pkg/storage/replica.go, line 5145 at r5 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

Can we just sidestep this race by refusing to GC replicas with preemptive snapshots for some time period after the snapshot was applied? Say, 60s?

Yes, I think so (haven't thought about the precise mitigation yet, maybe we can do something smarter, who knows). Just didn't wanna jam this into this PR. But I should land it before this one.


pkg/storage/replica.go, line 5146 at r5 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

Not sure how WIP this line is, but a warning seems overly strong.

Definitely WIP. Thanks for pointing it out.

@petermattis (Collaborator) left a comment


pkg/storage/replica.go, line 5115 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

IsInitialized just checks for nonempty EndKey, but after a preemptive snapshot, don't you have "all of the data" except you're not a member of the RangeDescriptor?

Ditto that this is cleaner than trying to reject the snapshot.

I imagine that you'll be adding a test for this behavior. Writing that test isn't going to be fun, but your future self will thank you.

@benesch (Contributor) left a comment


pkg/storage/replica.go, line 5115 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

IsInitialized just checks for nonempty EndKey, but after a preemptive snapshot, don't you have "all of the data" except you're not a member of the RangeDescriptor?

Right. So if you have a preemptive snapshot, then IsInitialized is true, and so we correctly won't take this branch because we know the replica was not created as the result of a split trigger.

@bdarnell (Contributor) left a comment


pkg/storage/replica.go, line 5115 at r5 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

IsInitialized just checks for nonempty EndKey, but after a preemptive snapshot, don't you have "all of the data" except you're not a member of the RangeDescriptor?

Right. So if you have a preemptive snapshot, then IsInitialized is true, and so we correctly won't take this branch because we know the replica was not created as the result of a split trigger.

The checks here seem redundant to me too. They're all equivalent: an uninitialized replica will have empty desc.Replicas and a zero log index, while an initialized one is guaranteed to have non-empty desc.Replicas and non-zero log index. Just one of the first three checks should be sufficient (in combination with ticks < 10).

If we're sending a MsgAppResp, the leader sent us a MsgApp, so the leader believes we're a member of the group. That means we're either in the "split, but not yet triggered" case or the "received preemptive snapshot and immediately GC'd it" case. I don't see a good way to distinguish the two cases at this stage, although a guard against GC'ing a brand new replica would go a long way to minimizing the risk (and mu.ticks gives us a good way to handle this).
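A minimal sketch of the tick-based GC guard alluded to here, with a hypothetical name and threshold; it is not this PR's code:

```
// Minimal sketch; the threshold below is an assumption, not a CockroachDB value.
package sketch

// tooYoungForGC reports whether a replica was created so recently that the
// replica GC queue should leave it alone, giving an in-flight replica
// addition or a pending split trigger time to complete.
func tooYoungForGC(ticksSinceCreation int) bool {
	const graceTicks = 50 // e.g. roughly ten seconds at a 200ms Raft tick interval
	return ticksSinceCreation < graceTicks
}
```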

@tbg (Member, Author) left a comment

I need to add a test, but the non-test code is ready for review (I'm already using it in my roachtesting).


pkg/storage/replica.go, line 5115 at r5 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

The checks here seem redundant to me too. They're all equivalent: an uninitialized replica will have empty desc.Replicas and a zero log index, while an initialized one is guaranteed to have non-empty desc.Replicas and non-zero log index. Just one of the first three checks should be sufficient (in combination with ticks < 10).

If we're sending a MsgAppResp, the leader sent us a MsgApp, so the leader believes we're a member of the group. That means we're either in the "split, but not yet triggered" case or the "received preemptive snapshot and immediately GC'd it" case. I don't see a good way to distinguish the two cases at this stage, although a guard against GC'ing a brand new replica would go a long way to minimizing the risk (and mu.ticks gives us a good way to handle this).

Ah, yeah, that was pretty confused. Removed.


pkg/storage/replica.go, line 5145 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Yes, I think so (haven't thought about the precise mitigation yet, maybe we can do something smarter, who knows). Just didn't wanna jam this into this PR. But I should land it before this one.

Heads up that I have a WIP for this that I'm testing with. Will send that separately and make sure it lands before this.


pkg/storage/replica.go, line 5146 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Definitely WIP. Thanks for pointing it out.

Done.

@tbg (Member, Author) commented Nov 14, 2018

I'm also noticing that the spikes in snapshots are per node (i.e. the cluster view is a superposition of per-node spikes). This lends further credence to this being caused by splitting ranges that already need a snapshot.

@tbg tbg force-pushed the repro/splittrigger-snapshot-race branch from 6b46b9d to bb0c84f Compare November 14, 2018 13:13
@tbg (Member, Author) left a comment

Updated, PTAL.


pkg/storage/replica.go, line 361 at r7 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

Can we rename this to pendingSnapshotIndexes or similar?

Done.


pkg/storage/replica.go, line 361 at r7 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

Keying this by pending index is kinda weird. Why not map snapshot ID to {deadline, index}? Then you won't need to comment about why it's mostly OK that you might have two snapshots that clobber each other.

You're right, this was silly all along. Done.


pkg/storage/replica.go, line 6971 at r7 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

nit: I find it somewhat weird that you read from the map before initializing it. (I know reading from nil maps is safe even though writing to them is not—thanks, Go—but I still find it weird.) Don't bother changing it unless you also find it weird now that I've pointed it out.

I did that intentionally, but there wasn't a real good reason. Done.


pkg/storage/replica.go, line 6976 at r7 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

Consider renaming to "markPendingSnapshotCompleted."

Done.


pkg/storage/replica.go, line 6980 at r7 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

Perhaps explain why 3s.

Done.


pkg/storage/replica.go, line 5178 at r8 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

s/desc.Replicas would be nonempty/initialized would be true/

Done.


pkg/storage/replica_raftstorage.go, line 402 at r7 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

As mentioned above, I think keying pending snapshots by snap UUID would be cleaner. Unless I'm missing something, it should be straightforward to hoist the UUID generation out of snapshot().

Done.

@tbg tbg force-pushed the repro/splittrigger-snapshot-race branch from bb0c84f to b407a37 Compare November 14, 2018 17:13
@tbg (Member, Author) commented Nov 14, 2018

An additional four instances of import/tpch/nodes=4 just finished successfully. 🎆

--- PASS: import/tpch/nodes=4 (17403.51s)
--- PASS: import/tpch/nodes=4 (18198.41s)
--- PASS: import/tpch/nodes=4 (19152.02s)
--- PASS: import/tpch/nodes=4 (18090.76s)

@benesch (Contributor) left a comment

:lgtm_strong:

Reviewed 6 of 6 files at r9, 3 of 3 files at r10.

When a (preemptive or Raft) snapshot is inflight, a range should not
truncate its log index to one ahead of the pending snapshot as doing so
would require yet another snapshot.

The code prior to this commit attempted to achieve that by treating the
pending snapshot as an additional follower in the computation of the
quorum index. This works as intended as long as the Raft log is below
its configured maximum size but usually leads to abandoning the snapshot
index when truncating based on size (to the quorum commit index). This
situation occurs frequently during the split+scatter phase of data
imports, where (mostly empty) ranges are rapidly upreplicated and split.

This isn't limited to small replicas, however. Assume a range is ~32mb
in size and a (say, preemptive) snapshot is underway. Preemptive
snapshots are (at the time of writing) throttled to 2mb/s, so it will
take approximately 16s to go through. At the same time, the Raft log may
easily grow in size by the size truncation threshold (4mb at time of
writing), allowing a log truncation that abandons the snapshot. A
similar phenomenon applies to Raft snapshots, though there the quota pool
restricts forward Raft progress; it does not, however, when a single
command exceeds the size restriction (as is common during RESTORE
operations, where individual Raft commands are large). I haven't
conclusively observed this in practice (though there has been enough
badness to suspect that it happened), but in principle this could lead
to a potentially infinite number of snapshots being sent out, a very
expensive way of keeping a follower up to date.

After this change, the pending snapshot index is managed more carefully
(and is managed for Raft snapshots as well, not only for preemptive
ones) and is never truncated away. As a result, in regular operation
snapshots should now be able to be followed by regular Raft log catch-up
in all cases. More precisely, we keep a map of pending snapshot index to
deadline. The deadline is zero while the snapshot is ongoing and is set
when the snapshot is completed. This avoids races between the snapshot
completing and the replication change completing (which would occur
even if the snapshot is only registering as completed after the
replication change completes).
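A minimal sketch of that bookkeeping, with hypothetical names (keyed by snapshot ID, as settled on during review; the deadline is zero while the snapshot is in flight and set once it completes):

```
// Minimal sketch with hypothetical names; not the actual implementation.
package sketch

import "time"

type pendingSnapshot struct {
	index    uint64    // log index the outgoing snapshot was generated at
	deadline time.Time // zero while in flight; set when the snapshot completes
}

type snapshotTracker struct {
	pending map[string]pendingSnapshot // keyed by snapshot ID
}

// minProtectedIndex returns the smallest log index that truncation must not
// exceed: a snapshot protects its index while in flight (zero deadline) and
// until its deadline (e.g. a few seconds after completion) has passed.
// ok is false if no snapshot is currently protected.
func (t *snapshotTracker) minProtectedIndex(now time.Time) (idx uint64, ok bool) {
	for _, s := range t.pending {
		if !s.deadline.IsZero() && now.After(s.deadline) {
			continue // completed and past its deadline; no longer protected
		}
		if !ok || s.index < idx {
			idx, ok = s.index, true
		}
	}
	return idx, ok
}
```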

There's an opportunity for a bigger refactor here by using learner
replicas instead of preemptive snapshots. The idea is to first add the
replica as a learner (ignoring it for the quota pool until it has received
its first snapshot), and to use the regular Raft snapshot
mechanism to catch it up. Once achieved, another configuration change
would convert the learner into a regular follower. This change in itself
will likely make our code more principled, but it is a more invasive
change that is left for the future.

Similarly, there is knowledge in the quota pool that it seems we should
be using for log truncation decisions. The quota pool essentially knows
about the size of each log entry whereas the Raft truncations only know
the accumulated approximate size of the full log. For instance, instead
of blindly truncating to the quorum commit index when the Raft log is
too large, we could truncate it to an index that reduces the log size to
about 50% of the maximum, in effect reducing the number of snapshots
that are necessary due to quorum truncation. It's unclear whether this
improvement will matter in practice.
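As a purely illustrative sketch of that suggestion (not implemented by this change; names are hypothetical), a size-aware truncation target could be computed like so:

```
// Illustrative only: pick a truncation index that drops just enough of the
// log prefix to bring it down to half of maxLogSize, rather than truncating
// all the way to the quorum commit index. Names are hypothetical.
package sketch

// truncationIndexForSize returns the index of the first entry to keep.
// entrySizes[i] is the size in bytes of the entry at firstIndex+i.
func truncationIndexForSize(firstIndex uint64, entrySizes []uint64, maxLogSize uint64) uint64 {
	var total uint64
	for _, sz := range entrySizes {
		total += sz
	}
	target := maxLogSize / 2
	keep := firstIndex
	for i, sz := range entrySizes {
		if total <= target {
			break
		}
		total -= sz
		keep = firstIndex + uint64(i) + 1 // drop this entry; keep from the next one
	}
	return keep
}
```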

The following script reproduces redundant Raft snapshots (needs a
somewhat beefy machine such as a gceworker). Note that this script will
also incur Raft snapshots due to the splits, which are fixed in a
follow-up commit.

```
set -euxo pipefail

killall -9 cockroach || true
killall -9 workload || true
sleep 1
rm -rf cockroach-data || true
mkdir -p cockroach-data

./cockroach start --insecure --host=localhost --port=26257 --http-port=26258 --store=cockroach-data/1 --cache=256MiB --background
./cockroach start --insecure --host=localhost --port=26259 --http-port=26260 --store=cockroach-data/2 --cache=256MiB --join=localhost:26257 --background
./cockroach start --insecure --host=localhost --port=26261 --http-port=26262 --store=cockroach-data/3 --cache=256MiB --join=localhost:26257 --background
./cockroach start --insecure --host=localhost --port=26263 --http-port=26264 --store=cockroach-data/4 --cache=256MiB --join=localhost:26257 --background

sleep 5

./cockroach sql --insecure -e 'set cluster setting kv.range_merge.queue_enabled = false;'

./cockroach sql --insecure <<EOF
CREATE DATABASE csv;
IMPORT TABLE csv.lineitem CREATE USING 'gs://cockroach-fixtures/tpch-csv/schema/lineitem.sql'
CSV DATA (
 'gs://cockroach-fixtures/tpch-csv/sf-1/lineitem.tbl.1',
 'gs://cockroach-fixtures/tpch-csv/sf-1/lineitem.tbl.2',
 'gs://cockroach-fixtures/tpch-csv/sf-1/lineitem.tbl.3',
 'gs://cockroach-fixtures/tpch-csv/sf-1/lineitem.tbl.4',
 'gs://cockroach-fixtures/tpch-csv/sf-1/lineitem.tbl.5',
 'gs://cockroach-fixtures/tpch-csv/sf-1/lineitem.tbl.6',
 'gs://cockroach-fixtures/tpch-csv/sf-1/lineitem.tbl.7',
 'gs://cockroach-fixtures/tpch-csv/sf-1/lineitem.tbl.8'
) WITH delimiter='|' ;
EOF

sleep 5

for port in 26257 26259 26261 26263; do
 ./cockroach sql --insecure -e "select name, value from crdb_internal.node_metrics where name like '%raftsn%' order by name desc" --port "${port}"
done
```

Release note: None
@tbg tbg force-pushed the repro/splittrigger-snapshot-race branch 2 times, most recently from 6300ee3 to 0518ff6 Compare November 14, 2018 19:21
A known race occurs during splits when some nodes apply the split
trigger faster than others. The "slow" node(s) may learn about the
newly created right hand side replica through Raft messages arriving
from the "fast" nodes. In such cases, the leader will immediately try to
catch up the follower (which it sees at log position zero) via a
snapshot, but this isn't possible since there's an overlapping replica
(the pre-split replica waiting to apply the trigger). This both leads to
unnecessary transfer of data and can clog the Raft snapshot queue which
tends to get stuck due to the throttling mechanisms both at the sender
and receivers.

To prevent this race (or make it exceedingly unlikely), we selectively
drop certain messages from uninitialized followers, namely those that
refuse an append to the log, for a number of ticks (corresponding to
at most a few seconds of real time). Not dropping such a message leads
to a Raft snapshot as the leader will learn that the follower has last
index zero, which is never an index that can be caught up to from the
log (our log "starts" at index 10).

The script below reproduces the race (prior to this commit) by running
1000 splits back to back in a three node local cluster, usually showing
north of a hundred Raft snapshots, i.e. a >10% chance to hit the race
for each split. There's also a unit test that exposes this problem and
can be stressed more conveniently (it also exposes the problems in the
preceding commit related to overly aggressive log truncation).

The false positives here are a) the LHS of the split needs a snapshot
which catches it up across the split trigger and b) the LHS is
rebalanced away (and GC'ed) before applying the split trigger. In both
cases the timeout-based mechanism would allow the snapshot after a few
seconds, once the Raft leader contacts the follower for the next time.

Note that the interaction with Raft group quiescence is benign. We're
only dropping MsgAppResp which is only sent by followers, implying that
the Raft group is already unquiesced.
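A minimal sketch of that drop decision, with hypothetical names; the real check lives in pkg/storage/replica.go and differs in its details:

```
// Minimal sketch, assuming etcd's raftpb types; not the PR's actual code.
package sketch

import "go.etcd.io/etcd/raft/raftpb"

// maybeDropMsgAppResp reports whether an outgoing MsgAppResp should be
// swallowed: the replica is uninitialized (it only exists because Raft
// messages for the post-split range arrived early), it is rejecting an
// append with hint zero, and it hasn't been around long enough for the
// split trigger to have plausibly lost the race.
func maybeDropMsgAppResp(msg raftpb.Message, initialized bool, ticks int) bool {
	return msg.Type == raftpb.MsgAppResp &&
		msg.Reject &&
		msg.RejectHint == 0 &&
		!initialized &&
		ticks < 10 // a few seconds of real time; after that, let Raft recover normally
}
```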

```
set -euxo pipefail

killall -9 cockroach || true
killall -9 workload || true
sleep 1
rm -rf cockroach-data || true
mkdir -p cockroach-data

./cockroach start --insecure --host=localhost --port=26257 --http-port=26258 --store=cockroach-data/1 --cache=256MiB --background
./cockroach start --insecure --host=localhost --port=26259 --http-port=26260 --store=cockroach-data/2 --cache=256MiB --join=localhost:26257 --background
./cockroach start --insecure --host=localhost --port=26261 --http-port=26262 --store=cockroach-data/3 --cache=256MiB --join=localhost:26257 --background

sleep 5

./cockroach sql --insecure -e 'set cluster setting kv.range_merge.queue_enabled = false;'
./bin/workload run kv --splits 1000 --init --drop --max-ops 1

sleep 5

for port in 26257 26259 26261; do
  ./cockroach sql --insecure -e "select name, value from crdb_internal.node_metrics where name like '%raftsn%' order by name desc" --port "${port}"
done
```

Release note (bug fix): Avoid occasional unnecessary Raft snapshots after Range splits.
@tbg tbg force-pushed the repro/splittrigger-snapshot-race branch from 0518ff6 to 3fd4bd5 Compare November 14, 2018 19:40
@ajwerner (Contributor) commented

bors r=benesch

craig bot pushed a commit that referenced this pull request Nov 14, 2018
31875: storage: avoid errant Raft snapshots during splits r=benesch a=tbg

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
@craig (bot, Contributor) commented Nov 14, 2018

Build succeeded

@craig craig bot merged commit 3fd4bd5 into cockroachdb:master Nov 14, 2018
@tbg tbg deleted the repro/splittrigger-snapshot-race branch November 15, 2018 08:10
tbg added a commit to tbg/cockroach that referenced this pull request Nov 15, 2018
With cockroachdb#31875 in, import/tpch/nodes=4 has been passing reliably, and the
expectation is that the nodes=8,32 versions will follow suit (at least
they passed just now).

Release note: None
tbg added a commit to tbg/cockroach that referenced this pull request Nov 20, 2018
tbg added a commit to tbg/cockroach that referenced this pull request Nov 30, 2018
craig bot pushed a commit that referenced this pull request Nov 30, 2018
32749: roachtest: reenable skipped import tests r=petermattis a=tbg


Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>