
storage: avoid errant Raft snapshots during splits #31875

Merged (2 commits) on Nov 14, 2018

Conversation

@tbg (Member) commented Oct 25, 2018

A known race occurs during splits when some nodes apply the split
trigger faster than others. The "slow" node(s) may learn about the
newly created right hand side replica through Raft messages arriving
from the "fast" nodes. In such cases, the leader will immediately try to
catch the follower (which it sees at log position zero) up via a
snapshot, but this isn't possible since there's an overlapping replica
(the pre-split replica waiting to apply the trigger). I think we
would find that out early in the life of the snapshot, i.e. we wouldn't
transfer large amounts of data around. Nevertheless, this smells bad.

The following script reproduces this by running 1000 splits back
to back in a three node local cluster. In an ideal world, we'd
see zero Raft snapshots here, but we see north of a hundred typically.

Not all of those are due to the race. ReplicaGC and Raft log truncation
can also contribute to this issue (disable both of those queues for pure
results).

We shouldn't ask for a snapshot when we suspect the target replica is
going to catch up on a split trigger. This commit doesn't really fix
that, but it suggests that one cheap way out is to simply report the
requested snapshot as completed immediately. Raft will go back to
probing the commit index, and this will continue until the split
trigger has won the race (of course this trick can only ever help
in situations in which we know in our heart of hearts that there's
a commit trigger that will render the snapshot useless).
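As a rough illustration of that "cheap way out", here is a minimal sketch, assuming etcd's raft package; the names are hypothetical and this is not the PR's actual code:

```
// Minimal sketch, assuming etcd's raft package; names are illustrative and
// this is not the PR's actual code.
package sketch

import "go.etcd.io/etcd/raft"

// reportFakeSnapshotSuccess pretends that follower `to` received the
// snapshot it was queued for. Raft then resumes probing the follower's log
// instead of waiting for a snapshot; once the split trigger has created the
// right-hand side replica, the probe succeeds and normal appends resume.
func reportFakeSnapshotSuccess(rn *raft.RawNode, to uint64) {
	// SnapshotFinish means "delivered"; SnapshotFailure would make Raft ask
	// for another snapshot right away, which is exactly what we want to avoid.
	rn.ReportSnapshot(to, raft.SnapshotFinish)
}
```

The reproduction script mentioned above follows.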


```
set -euxo pipefail

killall -9 cockroach || true
killall -9 workload || true
sleep 1
rm -rf cockroach-data || true
mkdir -p cockroach-data

./cockroach start --insecure --host=localhost --port=26257 --http-port=26258 --store=cockroach-data/1 --cache=256MiB --background
./cockroach start --insecure --host=localhost --port=26259 --http-port=26260 --store=cockroach-data/2 --cache=256MiB --join=localhost:26257 --background
./cockroach start --insecure --host=localhost --port=26261 --http-port=26262 --store=cockroach-data/3 --cache=256MiB --join=localhost:26257 --background

sleep 5

./cockroach sql --insecure -e 'alter range default configure zone using range_max_bytes = 1048471142400;'
./cockroach sql --insecure -e 'set cluster setting kv.range_merge.queue_enabled = false;'
./bin/workload run kv --splits 1000 --init --drop --max-ops 1

for port in 26257 26259 26261; do
  ./cockroach sql --insecure -e "select name, value from crdb_internal.node_metrics where name like '%raftsn%' order by name desc" --port "${port}"
done

sleep 5
tail -Fq cockroach-data/*/logs/cockroach.log
```

Touches #32046
Touches #31409

Release note: None

@tbg tbg requested a review from a team October 25, 2018 15:00
@cockroach-teamcity (Member) commented: This change is Reviewable

@tbg (Member, Author) commented Oct 26, 2018

This happens pretty regularly in roachtest restores too. We'll see lots of snapshots at applied index 11, often blocked by the unsplit RHS on the receiving node, which then gets added to the replica GC queue. It seems mostly harmless, but we should alleviate the problem.

@tbg tbg force-pushed the repro/splittrigger-snapshot-race branch from f49d644 to 6489745 Compare October 27, 2018 19:00
@tbg (Member, Author) commented Oct 27, 2018

I added a second commit that introduces a heuristic which removes almost all of the snapshots and isn't obviously going to prevent real snapshots from being sent.

In essence, I'm trying to figure out whether the range was recently split, and whether the follower is likely to get a "free snapshot" via the split trigger very soon.

This works almost 100% of the time, except in the case in which the LHS bumps its generation rapidly (due to being split a second time). This is entering the area of diminishing returns, but I do want to get rid of these snapshots reliably when they're not essential.

This raises the question: if not via the generation counter, what's another indicator that the range has "very recently" split? Sure, I could go and read the history of the local range descriptor and see if the first (generation-zero) value is "recent". But maybe there's something less invasive I can work with.

@benesch, I suspect that if there is something nice then you'd know about it.

@tbg (Member, Author) commented Oct 27, 2018

Filed #31947 for some general concerns about the Raft log queue.

@tbg tbg changed the title [dnm] storage: demonstrate workaround for splittrigger-snapshot race storage: avoid errant Raft snapshots during splits Oct 27, 2018
@tbg (Member, Author) commented Oct 27, 2018

This is still WIP, but I think worth taking a look at.

@tbg (Member, Author) commented Oct 27, 2018

(In particular, I haven't vetted whether it's OK to lie to Raft and just tell it that it got a snapshot, though it seems to work just fine.)

@petermattis (Collaborator) left a comment


pkg/storage/replica.go, line 5119 at r2 (raw file):

		recentlyCreatedViaSplit := desc.IsInitialized() && (desc.Generation == nil || *desc.Generation == 0) &&
			raftStatus != nil && raftStatus.Progress[msg.To].Match == 0 &&
			ticks <= 15

Newly created Raft groups have a small Raft log index, right? Can that be used here instead of looking at Generation?


pkg/storage/replica.go, line 5123 at r2 (raw file):

		if recentlyCreatedViaSplit {
			r.withRaftGroup(false, func(rn *raft.RawNode) (bool, error) {
				rn.ReportSnapshot(msg.To, raft.SnapshotFinish)

SnapshotFinish indicates that the snapshot succeeded. That doesn't seem right.

@tbg (Member, Author) left a comment


pkg/storage/replica.go, line 5123 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

SnapshotFinish indicates that the snapshot succeeded. That doesn't seem right.

That's the part we have to make "correct" (it might be correct today, but it certainly doesn't look the part). If we say the snapshot failed, Raft is going to keep asking for another snapshot until it gets one (I think; the experiment suggests as much). But here the follower receives a "pseudosnapshot" via the split trigger. So we lie and say "hey, you got a snapshot", and apparently Raft uses this information in a way that's compatible with what we're trying to achieve, i.e. it's as if it thought: I seem to have gotten this follower back on track, let's try to append to it again, not using what I think would've been in the snapshot I asked for, but just using what I knew about this follower before.

@benesch (Contributor) commented Oct 29, 2018

Nothing nice that I know about. We could write a range-ID local key with the time at which the range was created, but I don't think that would be any more precise than what you've got or what Peter suggested.

Here's an alternate proposal, though, just for kicks. Whenever you split a range, you record the LHS's range ID and applied index on the RHS. Then, whenever you send a snapshot, as part of the handshake, you send along the recorded range ID and applied index. The receiving store then checks whether it has a replica of that range at an earlier applied index. If it does, it blocks the snapshot for a bit (before we actually do any of the hard work of generating the snap) in the hope that its replica of the LHS will apply the split trigger. If the LHS replica doesn't make progress in a few hundred ms, then we queue it for GC; only if it gets GC'd or advances past the split trigger with a snapshot of its own do we allow the RHS snapshot through.

There’s some additional complexity here to avoid stalling out all four goroutines in the Raft snapshot queue, but I think it could be made to work.

@tbg (Member, Author) commented Oct 29, 2018

I thought about something like that but it seems overly complex. Also, the thing I'm really after is to never have to do any work for this case in the raft snapshot queue. We should see zero Raft snapshots in all situations which don't actually require recovery.

I wonder if we can attack the problem from the other side. The snapshot gets requested by the leader after the follower rejects its MsgApp. Could we selectively drop these rejection messages for a little bit to let the split trigger "always win"? I'll find out.

@tbg tbg force-pushed the repro/splittrigger-snapshot-race branch from cd3a65d to ff5e60d Compare October 29, 2018 14:20
@tbg (Member, Author) commented Oct 29, 2018

Added a commit. It works 100% of the time in my limited test. What I'm less sure about is whether I got all of the conditions right (how do I tell that a replica has a preemptive snapshot but isn't active yet? I think the tell is that it's not in its own RangeDescriptor), but this does look promising.

@benesch (Contributor) left a comment

Reviewed 1 of 1 files at r3, 1 of 1 files at r4, 1 of 1 files at r5.


pkg/storage/replica.go, line 5115 at r5 (raw file):

		r.mu.RUnlock()

		if !desc.IsInitialized() && len(desc.Replicas) == 0 && msg.RejectHint == 0 && ticks < 10 {

Ah, this feels much cleaner.

I think, though I'm not sure, that the check for !desc.IsInitialized() is redundant with the check for len(desc.Replicas) == 0. I think replicas with only a preemptive snapshot will return true for desc.IsInitialized().


pkg/storage/replica.go, line 5145 at r5 (raw file):

			// TODO(tschottdorf): preemptive snapshots may also be
			// garbage-collected before the replica addition is complete. In
			// such a case, we need to accept the Raft snapshot ASAP.

Can we just sidestep this race by refusing to GC replicas with preemptive snapshots for some time period after the snapshot was applied? Say, 60s?


pkg/storage/replica.go, line 5146 at r5 (raw file):

			// garbage-collected before the replica addition is complete. In
			// such a case, we need to accept the Raft snapshot ASAP.
			log.Warningf(ctx, "dropping rejection from %d to %d", msg.Index, msg.RejectHint)

Not sure how WIP this line is, but a warning seems overly strong.

@tbg (Member, Author) left a comment


pkg/storage/replica.go, line 5119 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Newly create Raft groups have a small Raft log index, right? Can that be used here instead of looking at Generation?

Obsolete, but here's my old comment:

I had added that at some point, but the log index can race ahead pretty quickly. It wasn't clear to me where to cut it off.


pkg/storage/replica.go, line 5123 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

That's the part that we have to make "correct" (it might be correct today, but certainly doesn't look the part). If we say the snapshot failed, (I think - but the experiment suggests it) Raft is going to ask for another snapshot until it gets one. But here we have this situation where it receives a "pseudosnapshot" via the split trigger. So we lie and say "hey you got a snapshot" and apparently Raft uses this information in a way that is compatible with what we're trying to achieve (i.e. it's like hey, I seem to have gotten this follower back on track, let's try to append to him again but not using what I think would've been in the snapshot I asked for, but just using what I knew about this follower before).

Thankfully this is now obsolete.


pkg/storage/replica.go, line 5115 at r5 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

Ah, this feels much cleaner.

I think, though I'm not sure, that the check for !desc.IsInitialized() is redundant with the check for len(desc.Replicas) == 0. I think replicas with only a preemptive snapshot will return true for desc.IsInitialized().

IsInitialized just checks for nonempty EndKey, but after a preemptive snapshot, don't you have "all of the data" except you're not a member of the RangeDescriptor?


pkg/storage/replica.go, line 5145 at r5 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

Can we just sidestep this race by refusing to GC replicas with preemptive snapshots for some time period after the snapshot was applied? Say, 60s?

Yes, I think so (haven't thought about the precise mitigation yet, maybe we can do something smarter, who knows). Just didn't wanna jam this into this PR. But I should land it before this one.


pkg/storage/replica.go, line 5146 at r5 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

Not sure how WIP this line is, but a warning seems overly strong.

Definitely WIP. Thanks for pointing it out.

@petermattis (Collaborator) left a comment


pkg/storage/replica.go, line 5115 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

IsInitialized just checks for nonempty EndKey, but after a preemptive snapshot, don't you have "all of the data" except you're not a member of the RangeDescriptor?

Ditto that this is cleaner than trying to reject the snapshot.

I imagine that you'll be adding a test for this behavior. Writing that test isn't going to be fun, but your future self will thank you.

@benesch (Contributor) left a comment


pkg/storage/replica.go, line 5115 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

IsInitialized just checks for nonempty EndKey, but after a preemptive snapshot, don't you have "all of the data" except you're not a member of the RangeDescriptor?

Right. So if you have a preemptive snapshot, then IsInitialized is true, and so we correctly won't take this branch because we know the replica was not created as the result of a split trigger.

@bdarnell (Contributor) left a comment


pkg/storage/replica.go, line 5115 at r5 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

IsInitialized just checks for nonempty EndKey, but after a preemptive snapshot, don't you have "all of the data" except you're not a member of the RangeDescriptor?

Right. So if you have a preemptive snapshot, then IsInitialized is true, and so we correctly won't take this branch because we know the replica was not created as the result of a split trigger.

The checks here seem redundant to me too. They're all equivalent: an uninitialized replica will have empty desc.Replicas and a zero log index, while an initialized one is guaranteed to have non-empty desc.Replicas and non-zero log index. Just one of the first three checks should be sufficient (in combination with ticks < 10).

If we're sending a MsgAppResp, the leader sent us a MsgApp, so the leader believes we're a member of the group. That means we're either in the "split, but not yet triggered" case or the "received preemptive snapshot and immediately GC'd it" case. I don't see a good way to distinguish the two cases at this stage, although a guard against GC'ing a brand new replica would go a long way to minimizing the risk (and mu.ticks gives us a good way to handle this).
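A minimal sketch of the tick-based GC guard alluded to here, with a hypothetical name and threshold; it is not this PR's code:

```
// Minimal sketch; the threshold below is an assumption, not a CockroachDB value.
package sketch

// tooYoungForGC reports whether a replica was created so recently that the
// replica GC queue should leave it alone, giving an in-flight replica
// addition or a pending split trigger time to complete.
func tooYoungForGC(ticksSinceCreation int) bool {
	const graceTicks = 50 // e.g. roughly ten seconds at a 200ms Raft tick interval
	return ticksSinceCreation < graceTicks
}
```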

@tbg (Member, Author) left a comment

I need to add a test, but the non-test code is ready for review (I'm already using it in my roachtesting).


pkg/storage/replica.go, line 5115 at r5 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

The checks here seem redundant to me too. They're all equivalent: an uninitialized replica will have empty desc.Replicas and a zero log index, while an initialized one is guaranteed to have non-empty desc.Replicas and non-zero log index. Just one of the first three checks should be sufficient (in combination with ticks < 10).

If we're sending a MsgAppResp, the leader sent us a MsgApp, so the leader believes we're a member of the group. That means we're either in the "split, but not yet triggered" case or the "received preemptive snapshot and immediately GC'd it" case. I don't see a good way to distinguish the two cases at this stage, although a guard against GC'ing a brand new replica would go a long way to minimizing the risk (and mu.ticks gives us a good way to handle this).

Ah, yeah, that was pretty confused. Removed.


pkg/storage/replica.go, line 5145 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Yes, I think so (haven't thought about the precise mitigation yet, maybe we can do something smarter, who knows). Just didn't wanna jam this into this PR. But I should land it before this one.

Heads up that I have a WIP for this that I'm testing with. Will send that separately and make sure it lands before this.


pkg/storage/replica.go, line 5146 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Definitely WIP. Thanks for pointing it out.

Done.

@tbg (Member, Author) commented Nov 14, 2018

I'm also noticing that the spikes in snapshots are per node (i.e. the cluster view is a superposition of per-node spikes). This lends further credence to this being caused by splitting ranges that already need a snapshot.

@tbg tbg force-pushed the repro/splittrigger-snapshot-race branch from 6b46b9d to bb0c84f Compare November 14, 2018 13:13
@tbg (Member, Author) left a comment

Updated, PTAL.


pkg/storage/replica.go, line 361 at r7 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

Can we rename this to pendingSnapshotIndexes or similar?

Done.


pkg/storage/replica.go, line 361 at r7 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

Keying this by pending index is kinda weird. Why not map snapshot ID to {deadline, index}? Then you won't need to comment about why it's mostly OK that you might have two snapshots that clobber each other.

You're right, this was silly all along. Done.


pkg/storage/replica.go, line 6971 at r7 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

nit: I find it somewhat weird that you read from the map before initializing it. (I know reading from nil maps is safe even though writing to them is not—thanks, Go—but I still find it weird.) Don't bother changing it unless you also find it weird now that I've pointed it out.

I did that intentionally, but there wasn't a real good reason. Done.


pkg/storage/replica.go, line 6976 at r7 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

Consider renaming to "markPendingSnapshotCompleted."

Done.


pkg/storage/replica.go, line 6980 at r7 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

Perhaps explain why 3s.

Done.


pkg/storage/replica.go, line 5178 at r8 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

s/desc.Replicas would be nonempty/initialized would be true/

Done.


pkg/storage/replica_raftstorage.go, line 402 at r7 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

As mentioned above, I think keying pending snapshots by snap UUID would be cleaner. Unless I'm missing something, it should be straightforward to hoist the UUID generation out of snapshot().

Done.

@tbg tbg force-pushed the repro/splittrigger-snapshot-race branch from bb0c84f to b407a37 Compare November 14, 2018 17:13
@tbg (Member, Author) commented Nov 14, 2018

An additional four instances of import/tpch/nodes=4 just finished successfully. 🎆

--- PASS: import/tpch/nodes=4 (17403.51s)
--- PASS: import/tpch/nodes=4 (18198.41s)
--- PASS: import/tpch/nodes=4 (19152.02s)
--- PASS: import/tpch/nodes=4 (18090.76s)

@benesch (Contributor) left a comment

:lgtm_strong:

Reviewed 6 of 6 files at r9, 3 of 3 files at r10.

When a (preemptive or Raft) snapshot is inflight, a range should not
truncate its log index to one ahead of the pending snapshot as doing so
would require yet another snapshot.

The code prior to this commit attempted to achieve that by treating the
pending snapshot as an additional follower in the computation of the
quorum index. This works as intended as long as the Raft log is below
its configured maximum size but usually leads to abandoning the snapshot
index when truncating based on size (to the quorum commit index). This
situation occurs frequently during the split+scatter phase of data
imports, where (mostly empty) ranges are rapidly upreplicated and split.

This isn't limited to small replicas, however. Assume a range is ~32mb
in size and a (say, preemptive) snapshot is underway. Preemptive
snapshots are (at the time of writing) throttled to 2mb/s, so it will
take approximately 16s to go through. At the same time, the Raft log may
easily grow in size by the size truncation threshold (4mb at time of
writing), allowing a log truncation that abandons the snapshot. A
similar phenomenon applies to Raft snapshots, though there the quota pool
restricts forward Raft progress; it does not, however, when a single
command exceeds the size restriction (as is common during RESTORE
operations, where individual Raft commands are large). I haven't
conclusively observed this in practice (though there has been enough
badness to suspect that it happened), but in principle this could lead
to a potentially infinite number of snapshots being sent out, a very
expensive way of keeping a follower up to date.

After this change, the pending snapshot index is managed more carefully
(and is managed for Raft snapshots as well, not only for preemptive
ones) and is never truncated away. As a result, in regular operation
snapshots should now be able to be followed by regular Raft log catch-up
in all cases. More precisely, we keep a map of pending snapshot index to
deadline. The deadline is zero while the snapshot is ongoing and is set
when the snapshot is completed. This avoids races between the snapshot
completing and the replication change completing (which would occur
even if the snapshot is only registering as completed after the
replication change completes).
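A minimal sketch of that bookkeeping, with hypothetical names (keyed by snapshot ID, as settled on during review; the deadline is zero while the snapshot is in flight and set once it completes):

```
// Minimal sketch with hypothetical names; not the actual implementation.
package sketch

import "time"

type pendingSnapshot struct {
	index    uint64    // log index the outgoing snapshot was generated at
	deadline time.Time // zero while in flight; set when the snapshot completes
}

type snapshotTracker struct {
	pending map[string]pendingSnapshot // keyed by snapshot ID
}

// minProtectedIndex returns the smallest log index that truncation must not
// exceed: a snapshot protects its index while in flight (zero deadline) and
// until its deadline (e.g. a few seconds after completion) has passed.
// ok is false if no snapshot is currently protected.
func (t *snapshotTracker) minProtectedIndex(now time.Time) (idx uint64, ok bool) {
	for _, s := range t.pending {
		if !s.deadline.IsZero() && now.After(s.deadline) {
			continue // completed and past its deadline; no longer protected
		}
		if !ok || s.index < idx {
			idx, ok = s.index, true
		}
	}
	return idx, ok
}
```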

There's an opportunity for a bigger refactor here by using learner
replicas instead of preemptive snapshots. The idea is to first add the
replica as a learner (ignoring it for the quota pool until it has received
its first snapshot), and to use the regular Raft snapshot
mechanism to catch it up. Once achieved, another configuration change
would convert the learner into a regular follower. This change in itself
will likely make our code more principled, but it is a more invasive
change that is left for the future.

Similarly, there is knowledge in the quota pool that it seems we should
be using for log truncation decisions. The quota pool essentially knows
about the size of each log entry whereas the Raft truncations only know
the accumulated approximate size of the full log. For instance, instead
of blindly truncating to the quorum commit index when the Raft log is
too large, we could truncate it to an index that reduces the log size to
about 50% of the maximum, in effect reducing the number of snapshots
that are necessary due to quorum truncation. It's unclear whether this
improvement will matter in practice.
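As a purely illustrative sketch of that suggestion (not implemented by this change; names are hypothetical), a size-aware truncation target could be computed like so:

```
// Illustrative only: pick a truncation index that drops just enough of the
// log prefix to bring it down to half of maxLogSize, rather than truncating
// all the way to the quorum commit index. Names are hypothetical.
package sketch

// truncationIndexForSize returns the index of the first entry to keep.
// entrySizes[i] is the size in bytes of the entry at firstIndex+i.
func truncationIndexForSize(firstIndex uint64, entrySizes []uint64, maxLogSize uint64) uint64 {
	var total uint64
	for _, sz := range entrySizes {
		total += sz
	}
	target := maxLogSize / 2
	keep := firstIndex
	for i, sz := range entrySizes {
		if total <= target {
			break
		}
		total -= sz
		keep = firstIndex + uint64(i) + 1 // drop this entry; keep from the next one
	}
	return keep
}
```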

The following script reproduces redundant Raft snapshots (needs a
somewhat beefy machine such as a gceworker). Note that this script will
also incur Raft snapshots due to the splits, which are fixed in a
follow-up commit.

```
set -euxo pipefail

killall -9 cockroach || true
killall -9 workload || true
sleep 1
rm -rf cockroach-data || true
mkdir -p cockroach-data

./cockroach start --insecure --host=localhost --port=26257 --http-port=26258 --store=cockroach-data/1 --cache=256MiB --background
./cockroach start --insecure --host=localhost --port=26259 --http-port=26260 --store=cockroach-data/2 --cache=256MiB --join=localhost:26257 --background
./cockroach start --insecure --host=localhost --port=26261 --http-port=26262 --store=cockroach-data/3 --cache=256MiB --join=localhost:26257 --background
./cockroach start --insecure --host=localhost --port=26263 --http-port=26264 --store=cockroach-data/4 --cache=256MiB --join=localhost:26257 --background

sleep 5

./cockroach sql --insecure -e 'set cluster setting kv.range_merge.queue_enabled = false;'

./cockroach sql --insecure <<EOF
CREATE DATABASE csv;
IMPORT TABLE csv.lineitem CREATE USING 'gs://cockroach-fixtures/tpch-csv/schema/lineitem.sql'
CSV DATA (
 'gs://cockroach-fixtures/tpch-csv/sf-1/lineitem.tbl.1',
 'gs://cockroach-fixtures/tpch-csv/sf-1/lineitem.tbl.2',
 'gs://cockroach-fixtures/tpch-csv/sf-1/lineitem.tbl.3',
 'gs://cockroach-fixtures/tpch-csv/sf-1/lineitem.tbl.4',
 'gs://cockroach-fixtures/tpch-csv/sf-1/lineitem.tbl.5',
 'gs://cockroach-fixtures/tpch-csv/sf-1/lineitem.tbl.6',
 'gs://cockroach-fixtures/tpch-csv/sf-1/lineitem.tbl.7',
 'gs://cockroach-fixtures/tpch-csv/sf-1/lineitem.tbl.8'
) WITH delimiter='|' ;
EOF

sleep 5

for port in 26257 26259 26261 26263; do
 ./cockroach sql --insecure -e "select name, value from crdb_internal.node_metrics where name like '%raftsn%' order by name desc" --port "${port}"
done
```

Release note: None
@tbg tbg force-pushed the repro/splittrigger-snapshot-race branch 2 times, most recently from 6300ee3 to 0518ff6 Compare November 14, 2018 19:21
A known race occurs during splits when some nodes apply the split
trigger faster than others. The "slow" node(s) may learn about the
newly created right hand side replica through Raft messages arriving
from the "fast" nodes. In such cases, the leader will immediately try to
catch up the follower (which it sees at log position zero) via a
snapshot, but this isn't possible since there's an overlapping replica
(the pre-split replica waiting to apply the trigger). This both leads to
unnecessary transfer of data and can clog the Raft snapshot queue which
tends to get stuck due to the throttling mechanisms both at the sender
and receivers.

To prevent this race (or make it exceedingly unlikely), we selectively
drop certain messages from uninitialized followers, namely those that
refuse an append to the log, for a number of ticks (corresponding to
at most a few seconds of real time). Not dropping such a message leads
to a Raft snapshot as the leader will learn that the follower has last
index zero, which is never an index that can be caught up to from the
log (our log "starts" at index 10).

The script below reproduces the race (prior to this commit) by running
1000 splits back to back in a three node local cluster, usually showing
north of a hundred Raft snapshots, i.e. a >10% chance to hit the race
for each split. There's also a unit test that exposes this problem and
can be stressed more conveniently (it also exposes the problems in the
preceding commit related to overly aggressive log truncation).

The false positives here are a) the LHS of the split needs a snapshot
which catches it up across the split trigger and b) the LHS is
rebalanced away (and GC'ed) before applying the split trigger. In both
cases the timeout-based mechanism would allow the snapshot after a few
seconds, once the Raft leader contacts the follower for the next time.

Note that the interaction with Raft group quiescence is benign. We're
only dropping MsgAppResp which is only sent by followers, implying that
the Raft group is already unquiesced.
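A minimal sketch of that drop decision, with hypothetical names; the real check lives in pkg/storage/replica.go and differs in its details:

```
// Minimal sketch, assuming etcd's raftpb types; not the PR's actual code.
package sketch

import "go.etcd.io/etcd/raft/raftpb"

// maybeDropMsgAppResp reports whether an outgoing MsgAppResp should be
// swallowed: the replica is uninitialized (it only exists because Raft
// messages for the post-split range arrived early), it is rejecting an
// append with hint zero, and it hasn't been around long enough for the
// split trigger to have plausibly lost the race.
func maybeDropMsgAppResp(msg raftpb.Message, initialized bool, ticks int) bool {
	return msg.Type == raftpb.MsgAppResp &&
		msg.Reject &&
		msg.RejectHint == 0 &&
		!initialized &&
		ticks < 10 // a few seconds of real time; after that, let Raft recover normally
}
```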

```
set -euxo pipefail

killall -9 cockroach || true
killall -9 workload || true
sleep 1
rm -rf cockroach-data || true
mkdir -p cockroach-data

./cockroach start --insecure --host=localhost --port=26257 --http-port=26258 --store=cockroach-data/1 --cache=256MiB --background
./cockroach start --insecure --host=localhost --port=26259 --http-port=26260 --store=cockroach-data/2 --cache=256MiB --join=localhost:26257 --background
./cockroach start --insecure --host=localhost --port=26261 --http-port=26262 --store=cockroach-data/3 --cache=256MiB --join=localhost:26257 --background

sleep 5

./cockroach sql --insecure -e 'set cluster setting kv.range_merge.queue_enabled = false;'
./bin/workload run kv --splits 1000 --init --drop --max-ops 1

sleep 5

for port in 26257 26259 26261; do
  ./cockroach sql --insecure -e "select name, value from crdb_internal.node_metrics where name like '%raftsn%' order by name desc" --port "${port}"
done
```

Release note (bug fix): Avoid occasional unnecessary Raft snapshots after Range splits.
@tbg tbg force-pushed the repro/splittrigger-snapshot-race branch from 0518ff6 to 3fd4bd5 Compare November 14, 2018 19:40
@ajwerner (Contributor) commented

bors r=benesch

craig bot pushed a commit that referenced this pull request Nov 14, 2018
31875: storage: avoid errant Raft snapshots during splits r=benesch a=tbg

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
@craig (bot, Contributor) commented Nov 14, 2018

Build succeeded

@craig craig bot merged commit 3fd4bd5 into cockroachdb:master Nov 14, 2018
@tbg tbg deleted the repro/splittrigger-snapshot-race branch November 15, 2018 08:10
tbg added a commit to tbg/cockroach that referenced this pull request Nov 15, 2018
With cockroachdb#31875 in, import/tpch/nodes=4 has been passing reliably, and the
expectation is that the nodes=8,32 versions will follow suit (at least
they passed just now).

Release note: None
tbg added a commit to tbg/cockroach that referenced this pull request Nov 20, 2018
tbg added a commit to tbg/cockroach that referenced this pull request Nov 30, 2018
craig bot pushed a commit that referenced this pull request Nov 30, 2018
32749: roachtest: reenable skipped import tests r=petermattis a=tbg


Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>