storage: avoid errant Raft snapshots during splits #31875
Conversation
This happens pretty regularly in roachtest restores too. We'll see lots of snapshots at applied index 11, often blocked by the unsplit RHS on the receiving node, which then gets added to the replica GC queue. It seems mostly harmless, but we should alleviate the problem.
Force-pushed f49d644 to 6489745.
I added a second commit that introduces heuristics which remove almost all of the snapshots and aren't obviously going to prevent real snapshots from being sent. In essence, I'm trying to figure out whether the range was recently split, and whether the follower is likely to get a "free snapshot" via the split trigger very soon. This works almost 100% of the time, except in the case in which the LHS bumps its generation rapidly (due to being split a second time). This is entering the area of diminishing returns, but I do want to get rid of these snapshots reliably when they're not essential. This raises the question: if not via the generation counter, what's another indicator that the range has "very recently" split? Sure, I could go and read the history of the local range descriptor and see if the first (generation-zero) value is "recent". But maybe there's something less invasive I can work with. @benesch, I suspect that if there is something nice then you'd know about it.
Filed #31947 about some general concerns with the Raft log queue.
This is still WIP, but I think worth taking a look at.
(In particular, I haven't vetted whether it's OK to lie to Raft and just tell it it got a snapshot, though it seems to work just fine.)
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/replica.go, line 5119 at r2 (raw file):
recentlyCreatedViaSplit := desc.IsInitialized() && (desc.Generation == nil || *desc.Generation == 0) && raftStatus != nil && raftStatus.Progress[msg.To].Match == 0 && ticks <= 15
Newly created Raft groups have a small Raft log index, right? Can that be used here instead of looking at Generation?
pkg/storage/replica.go, line 5123 at r2 (raw file):
if recentlyCreatedViaSplit {
	r.withRaftGroup(false, func(rn *raft.RawNode) (bool, error) {
		rn.ReportSnapshot(msg.To, raft.SnapshotFinish)
SnapshotFinish indicates that the snapshot succeeded. That doesn't seem right.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/replica.go, line 5123 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
SnapshotFinish indicates that the snapshot succeeded. That doesn't seem right.
That's the part that we have to make "correct" (it might be correct today, but it certainly doesn't look the part). If we say the snapshot failed, Raft is (I think, and the experiment suggests it) going to ask for another snapshot until it gets one. But here we have this situation where the follower receives a "pseudosnapshot" via the split trigger. So we lie and say "hey, you got a snapshot", and apparently Raft uses this information in a way that is compatible with what we're trying to achieve (i.e. it's like: hey, I seem to have gotten this follower back on track, let's try to append to it again, not using what I think would've been in the snapshot I asked for, but just using what I knew about this follower before).
Nothing nice that I know about. We could write a range-ID local key with the time at which the range was created, but I don't think that would be any more precise than what you've got or what Peter suggested. Here's an alternate proposal, though, just for kicks. Whenever you split a range, you record the LHS's range ID and applied index on the RHS. Then, whenever you send a snapshot, as part of the handshake, you send along the recorded range ID and applied index. The receiving store then checks to see if it has a replica of that range at an earlier applied index. If it does, it blocks the snapshot for a bit (before we actually do any of the hard work of generating the snap) in the hopes that its replica of the LHS will apply the split trigger. If the LHS replica doesn't make progress in a few hundred ms, then we queue it for GC; only if it gets GC'd or advances past the split trigger with a snapshot of its own do we allow the RHS snapshot through. There's some additional complexity here to avoid stalling out all four goroutines in the Raft snapshot queue, but I think it could be made to work.
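To make that proposal concrete, here's a minimal sketch of the receiving-store check, assuming the split records the LHS's range ID and applied index on the RHS and the snapshot handshake carries that pair. All type and function names below are hypothetical illustrations of the idea, not actual CockroachDB APIs:

```
// Hypothetical types standing in for the handshake payload described above.
type rangeID int64

type snapshotHeader struct {
	lhsRangeID      rangeID // recorded on the RHS at split time
	lhsAppliedIndex uint64  // LHS applied index at the time of the split
}

// lhsLookup abstracts "do I have a replica of this range, and at what applied
// index?", keeping the sketch self-contained.
type lhsLookup func(rangeID) (appliedIndex uint64, ok bool)

// shouldDelaySnapshot returns true if the receiving store should hold off on
// the RHS snapshot because its local LHS replica has not yet applied the split
// trigger and may therefore still create the RHS "for free".
func shouldDelaySnapshot(h snapshotHeader, lookup lhsLookup) bool {
	applied, ok := lookup(h.lhsRangeID)
	if !ok {
		return false // no LHS replica here; nothing to wait for
	}
	return applied < h.lhsAppliedIndex
}
```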
I thought about something like that, but it seems overly complex. Also, the thing I'm really after is to never have to do any work for this case in the Raft snapshot queue. We should see zero Raft snapshots in all situations which don't actually require recovery. I wonder if we can attack the problem from the other side. The snapshot gets requested by the leader after the follower rejects its MsgApp. Could we selectively drop these rejection messages for a little bit to let the split trigger "always win"? I'll find out.
Force-pushed cd3a65d to ff5e60d.
Added a commit. It works 100% in my limited test. What I'm less sure about is whether I got all of the conditions right (how do I tell whether a replica has a preemptive snapshot but isn't active yet? I think it's that it's not in its own RangeDescriptor), but I think this does look promising.
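As a side note, here is a tiny sketch of the distinction in question, assuming the descriptor exposes a Replicas slice as quoted elsewhere in this review; the helper name is made up for illustration:

```
import "github.com/cockroachdb/cockroach/pkg/roachpb"

// Hypothetical helper capturing the distinction discussed above: a replica
// created by a preemptive snapshot has data (IsInitialized returns true
// because EndKey is set), but its own store does not yet appear in the range
// descriptor's replica set.
func hasOnlyPreemptiveSnapshot(desc roachpb.RangeDescriptor, storeID roachpb.StoreID) bool {
	if !desc.IsInitialized() {
		return false // no data at all, so not even a preemptive snapshot
	}
	for _, rep := range desc.Replicas {
		if rep.StoreID == storeID {
			return false // a full member of the range, not just a preemptive snapshot
		}
	}
	return true
}
```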
Reviewed 1 of 1 files at r3, 1 of 1 files at r4, 1 of 1 files at r5.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/replica.go, line 5115 at r5 (raw file):
r.mu.RUnlock()
if !desc.IsInitialized() && len(desc.Replicas) == 0 && msg.RejectHint == 0 && ticks < 10 {
Ah, this feels much cleaner.
I think, though I'm not sure, that the check for !desc.IsInitialized() is redundant with the check for len(desc.Replicas) == 0. I think replicas with only a preemptive snapshot will return true for desc.IsInitialized().
pkg/storage/replica.go, line 5145 at r5 (raw file):
// TODO(tschottdorf): preemptive snapshots may also be
// garbage-collected before the replica addition is complete. In
// such a case, we need to accept the Raft snapshot ASAP.
Can we just sidestep this race by refusing to GC replicas with preemptive snapshots for some time period after the snapshot was applied? Say, 60s?
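A minimal sketch of what such a guard could look like, assuming the replica records when it applied its preemptive snapshot; the constant and function are illustrative only, not the eventual implementation:

```
import "time"

// Hypothetical grace period ("say, 60s") during which a replica that holds
// only a preemptive snapshot is exempt from replica GC, giving the pending
// replication change time to complete without needing a Raft snapshot.
const preemptiveSnapGCGracePeriod = 60 * time.Second

// canGC returns whether the replica GC queue may collect a replica that is
// not (yet) a member of its own range descriptor.
func canGC(now, preemptiveSnapAppliedAt time.Time, isMemberOfDescriptor bool) bool {
	if isMemberOfDescriptor {
		return true // the replication change completed; normal GC rules apply
	}
	return now.Sub(preemptiveSnapAppliedAt) >= preemptiveSnapGCGracePeriod
}
```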
pkg/storage/replica.go, line 5146 at r5 (raw file):
// garbage-collected before the replica addition is complete. In
// such a case, we need to accept the Raft snapshot ASAP.
log.Warningf(ctx, "dropping rejection from %d to %d", msg.Index, msg.RejectHint)
Not sure how WIP this line is, but a warning seems overly strong.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/replica.go, line 5119 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Newly created Raft groups have a small Raft log index, right? Can that be used here instead of looking at Generation?
Obsolete, but here's my old comment:
I had added that at some point, but the log index can race ahead pretty quickly. It wasn't clear to me where to cut it off.
pkg/storage/replica.go, line 5123 at r2 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
That's the part that we have to make "correct" (it might be correct today, but it certainly doesn't look the part). If we say the snapshot failed, Raft is (I think, and the experiment suggests it) going to ask for another snapshot until it gets one. But here we have this situation where the follower receives a "pseudosnapshot" via the split trigger. So we lie and say "hey, you got a snapshot", and apparently Raft uses this information in a way that is compatible with what we're trying to achieve (i.e. it's like: hey, I seem to have gotten this follower back on track, let's try to append to it again, not using what I think would've been in the snapshot I asked for, but just using what I knew about this follower before).
Thankfully this is now obsolete.
pkg/storage/replica.go, line 5115 at r5 (raw file):
Previously, benesch (Nikhil Benesch) wrote…
Ah, this feels much cleaner.
I think, though I'm not sure, that the check for !desc.IsInitialized() is redundant with the check for len(desc.Replicas) == 0. I think replicas with only a preemptive snapshot will return true for desc.IsInitialized().
IsInitialized just checks for nonempty EndKey, but after a preemptive snapshot, don't you have "all of the data" except you're not a member of the RangeDescriptor?
pkg/storage/replica.go, line 5145 at r5 (raw file):
Previously, benesch (Nikhil Benesch) wrote…
Can we just sidestep this race by refusing to GC replicas with preemptive snapshots for some time period after the snapshot was applied? Say, 60s?
Yes, I think so (haven't thought about the precise mitigation yet, maybe we can do something smarter, who knows). Just didn't wanna jam this into this PR. But I should land it before this one.
pkg/storage/replica.go, line 5146 at r5 (raw file):
Previously, benesch (Nikhil Benesch) wrote…
Not sure how WIP this line is, but a warning seems overly strong.
Definitely WIP. Thanks for pointing it out.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/replica.go, line 5115 at r5 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
IsInitialized just checks for nonempty EndKey, but after a preemptive snapshot, don't you have "all of the data" except you're not a member of the RangeDescriptor?
Ditto that this is cleaner than trying to reject the snapshot.
I imagine that you'll be adding a test for this behavior. Writing that test isn't going to be fun, but your future self will thank you.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/replica.go, line 5115 at r5 (raw file):
IsInitialized just checks for nonempty EndKey, but after a preemptive snapshot, don't you have "all of the data" except you're not a member of the RangeDescriptor?
Right. So if you have a preemptive snapshot, then IsInitialized is true, and so we correctly won't take this branch because we know the replica was not created as the result of a split trigger.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/replica.go, line 5115 at r5 (raw file):
Previously, benesch (Nikhil Benesch) wrote…
IsInitialized just checks for nonempty EndKey, but after a preemptive snapshot, don't you have "all of the data" except you're not a member of the RangeDescriptor?
Right. So if you have a preemptive snapshot, then IsInitialized is true, and so we correctly won't take this branch because we know the replica was not created as the result of a split trigger.
The checks here seem redundant to me too. They're all equivalent: an uninitialized replica will have empty desc.Replicas and a zero log index, while an initialized one is guaranteed to have non-empty desc.Replicas and a non-zero log index. Just one of the first three checks should be sufficient (in combination with ticks < 10).
If we're sending a MsgAppResp, the leader sent us a MsgApp, so the leader believes we're a member of the group. That means we're either in the "split, but not yet triggered" case or the "received preemptive snapshot and immediately GC'd it" case. I don't see a good way to distinguish the two cases at this stage, although a guard against GC'ing a brand new replica would go a long way to minimizing the risk (and mu.ticks gives us a good way to handle this).
Force-pushed d04677d to ecbdf6e.
I need to add a test, but the non-test code is ready for review (I'm already using it in my roachtesting).
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/replica.go, line 5115 at r5 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
The checks here seem redundant to me too. They're all equivalent: an uninitialized replica will have empty desc.Replicas and a zero log index, while an initialized one is guaranteed to have non-empty desc.Replicas and a non-zero log index. Just one of the first three checks should be sufficient (in combination with ticks < 10).
If we're sending a MsgAppResp, the leader sent us a MsgApp, so the leader believes we're a member of the group. That means we're either in the "split, but not yet triggered" case or the "received preemptive snapshot and immediately GC'd it" case. I don't see a good way to distinguish the two cases at this stage, although a guard against GC'ing a brand new replica would go a long way to minimizing the risk (and mu.ticks gives us a good way to handle this).
Ah, yeah, that was pretty confused. Removed.
pkg/storage/replica.go, line 5145 at r5 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
Yes, I think so (haven't thought about the precise mitigation yet, maybe we can do something smarter, who knows). Just didn't wanna jam this into this PR. But I should land it before this one.
Heads up that I have a WIP for this that I'm testing with. Will send that separately and make sure it lands before this.
pkg/storage/replica.go, line 5146 at r5 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
Definitely WIP. Thanks for pointing it out.
Done.
I'm also noticing that the spikes in snapshots are per node (i.e. the cluster view is a superposition of per-node spikes). This lends further credence to this being caused by splitting ranges that already need a snapshot.
Force-pushed 6b46b9d to bb0c84f.
Updated, PTAL.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/replica.go, line 361 at r7 (raw file):
Previously, benesch (Nikhil Benesch) wrote…
Can we rename this to pendingSnapshotIndexes or similar?
Done.
pkg/storage/replica.go, line 361 at r7 (raw file):
Previously, benesch (Nikhil Benesch) wrote…
Keying this by pending index is kinda weird. Why not map snapshot ID to {deadline, index}? Then you won't need to comment about why it's mostly OK that you might have two snapshots that clobber each other.
You're right, this was silly all along. Done.
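For context, roughly the shape the bookkeeping takes once it's keyed by snapshot ID; the names and the deadline handling here are an illustrative sketch of what the commit message describes, not the PR's literal code:

```
import "time"

// snapshotID stands in for the snapshot UUID used as the map key.
type snapshotID string

// pendingSnapshot tracks an outgoing snapshot: the log index it was generated
// at, and a deadline that stays zero while the snapshot is in flight and is
// set once it completes (giving the replication change a little time to land).
type pendingSnapshot struct {
	index    uint64
	deadline time.Time
}

type pendingSnapshots map[snapshotID]pendingSnapshot

// minProtectedIndex returns the smallest index that log truncation must not
// cross, and false if no pending snapshot needs protecting.
func (p pendingSnapshots) minProtectedIndex(now time.Time) (uint64, bool) {
	var min uint64
	var ok bool
	for _, s := range p {
		if !s.deadline.IsZero() && now.After(s.deadline) {
			continue // completed and past its deadline; no longer protected
		}
		if !ok || s.index < min {
			min, ok = s.index, true
		}
	}
	return min, ok
}
```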
pkg/storage/replica.go, line 6971 at r7 (raw file):
Previously, benesch (Nikhil Benesch) wrote…
nit: I find it somewhat weird that you read from the map before initializing it. (I know reading from nil maps is safe even though writing to them is not—thanks, Go—but I still find it weird.) Don't bother changing it unless you also find it weird now that I've pointed it out.
I did that intentionally, but there wasn't a real good reason. Done.
pkg/storage/replica.go, line 6976 at r7 (raw file):
Previously, benesch (Nikhil Benesch) wrote…
Consider renaming to "markPendingSnapshotCompleted."
Done.
pkg/storage/replica.go, line 6980 at r7 (raw file):
Previously, benesch (Nikhil Benesch) wrote…
Perhaps explain why 3s.
Done.
pkg/storage/replica.go, line 5178 at r8 (raw file):
Previously, benesch (Nikhil Benesch) wrote…
s/desc.Replicas would be nonempty/initialized would be true/
Done.
pkg/storage/replica_raftstorage.go, line 402 at r7 (raw file):
Previously, benesch (Nikhil Benesch) wrote…
As mentioned above, I think keying pending snapshots by snap UUID would be cleaner. Unless I'm missing something, it should be straightforward to hoist the UUID generation out of snapshot().
Done.
Force-pushed bb0c84f to b407a37.
An additional four instances of --- PASS: import/tpch/nodes=4 (17403.51s).
Reviewed 6 of 6 files at r9, 3 of 3 files at r10.
Reviewable status: complete! 1 of 0 LGTMs obtained
When a (preemptive or Raft) snapshot is inflight, a range should not truncate its log to an index ahead of the pending snapshot, as doing so would require yet another snapshot. The code prior to this commit attempted to achieve that by treating the pending snapshot as an additional follower in the computation of the quorum index. This works as intended as long as the Raft log is below its configured maximum size, but usually leads to abandoning the snapshot index when truncating based on size (to the quorum commit index).
This situation occurs frequently during the split+scatter phase of data imports, where (mostly empty) ranges are rapidly upreplicated and split. It isn't limited to small replicas, however. Assume a range is ~32mb in size and a (say, preemptive) snapshot is underway. Preemptive snapshots are (at the time of writing) throttled to 2mb/s, so it will take approximately 16s to go through. At the same time, the Raft log may easily grow by the size truncation threshold (4mb at the time of writing), allowing a log truncation that abandons the snapshot. A similar phenomenon applies to Raft snapshots, though there the quota pool will place restrictions on forward Raft progress; however, it does not when a single command exceeds the size restriction (as is common during RESTORE operations, where individual Raft commands are large). I haven't conclusively observed this in practice (though there has been enough badness to suspect that it happened), but in principle this could lead to a potentially infinite number of snapshots being sent out, a very expensive way of keeping a follower up to date.
After this change, the pending snapshot index is managed more carefully (and is managed for Raft snapshots as well, not only for preemptive ones) and is never truncated away. As a result, in regular operation snapshots should now always be followed by regular Raft log catch-up. More precisely, we keep a map of pending snapshot index to deadline. The deadline is zero while the snapshot is ongoing and is set when the snapshot is completed. This avoids races between the snapshot completing and the replication change completing (which would occur even if the snapshot only registers as completed after the replication change completes).
There's an opportunity for a bigger refactor here by using learner replicas instead of preemptive snapshots. The idea is to add the replica as a learner first (ignoring it for the quota pool until it has received the first snapshot) and to use the regular Raft snapshot mechanism to catch it up. Once that has happened, another configuration change would convert the learner into a regular follower. This change in itself would likely make our code more principled, but it is a more invasive change that is left for the future.
Similarly, there is knowledge in the quota pool that it seems we should be using for log truncation decisions. The quota pool essentially knows the size of each log entry, whereas the Raft truncations only know the accumulated approximate size of the full log. For instance, instead of blindly truncating to the quorum commit index when the Raft log is too large, we could truncate to an index that reduces the log size to about 50% of the maximum, in effect reducing the number of snapshots that are necessary due to quorum truncation. It's unclear whether this improvement will matter in practice.
The following script reproduces redundant Raft snapshots (needs a somewhat beefy machine such as a gceworker).
Note that this script will also incur Raft snapshots due to the splits, which are fixed in a follow-up commit.
```
set -euxo pipefail

killall -9 cockroach || true
killall -9 workload || true
sleep 1
rm -rf cockroach-data || true
mkdir -p cockroach-data

./cockroach start --insecure --host=localhost --port=26257 --http-port=26258 --store=cockroach-data/1 --cache=256MiB --background
./cockroach start --insecure --host=localhost --port=26259 --http-port=26260 --store=cockroach-data/2 --cache=256MiB --join=localhost:26257 --background
./cockroach start --insecure --host=localhost --port=26261 --http-port=26262 --store=cockroach-data/3 --cache=256MiB --join=localhost:26257 --background
./cockroach start --insecure --host=localhost --port=26263 --http-port=26264 --store=cockroach-data/4 --cache=256MiB --join=localhost:26257 --background

sleep 5

./cockroach sql --insecure -e 'set cluster setting kv.range_merge.queue_enabled = false;'

./cockroach sql --insecure <<EOF
CREATE DATABASE csv;
IMPORT TABLE csv.lineitem
CREATE USING 'gs://cockroach-fixtures/tpch-csv/schema/lineitem.sql'
CSV DATA (
  'gs://cockroach-fixtures/tpch-csv/sf-1/lineitem.tbl.1',
  'gs://cockroach-fixtures/tpch-csv/sf-1/lineitem.tbl.2',
  'gs://cockroach-fixtures/tpch-csv/sf-1/lineitem.tbl.3',
  'gs://cockroach-fixtures/tpch-csv/sf-1/lineitem.tbl.4',
  'gs://cockroach-fixtures/tpch-csv/sf-1/lineitem.tbl.5',
  'gs://cockroach-fixtures/tpch-csv/sf-1/lineitem.tbl.6',
  'gs://cockroach-fixtures/tpch-csv/sf-1/lineitem.tbl.7',
  'gs://cockroach-fixtures/tpch-csv/sf-1/lineitem.tbl.8'
) WITH delimiter='|';
EOF

sleep 5

for port in 26257 26259 26261 26263; do
  ./cockroach sql --insecure -e "select name, value from crdb_internal.node_metrics where name like '%raftsn%' order by name desc" --port "${port}"
done
```
Release note: None
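To make the truncation interaction concrete, here is a rough sketch (not the PR's actual code) of a size-based truncation decision that never crosses a pending snapshot index:

```
// truncationIndex picks the index to truncate the Raft log to, given the
// quorum commit index, the log's current first index, and (if any) the
// smallest index protected by a pending snapshot. The function and parameter
// names are illustrative; the invariant it encodes is the one described above.
func truncationIndex(quorumIndex, firstIndex, pendingSnapIndex uint64, havePending bool) uint64 {
	truncateTo := quorumIndex
	if havePending && pendingSnapIndex < truncateTo {
		// Never truncate past a pending snapshot; doing so would force yet
		// another snapshot to catch the recipient up after it applies this one.
		truncateTo = pendingSnapIndex
	}
	if truncateTo < firstIndex {
		return firstIndex // nothing to truncate
	}
	return truncateTo
}
```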
Force-pushed 6300ee3 to 0518ff6.
A known race occurs during splits when some nodes apply the split trigger faster than others. The "slow" node(s) may learn about the newly created right hand side replica through Raft messages arriving from the "fast" nodes. In such cases, the leader will immediately try to catch up the follower (which it sees at log position zero) via a snapshot, but this isn't possible since there's an overlapping replica (the pre-split replica waiting to apply the trigger). This both leads to unnecessary transfer of data and can clog the Raft snapshot queue, which tends to get stuck due to the throttling mechanisms at both the sender and the receivers.
To prevent this race (or make it exceedingly unlikely), we selectively drop certain messages from uninitialized followers, namely those that refuse an append to the log, for a number of ticks (corresponding to at most a few seconds of real time). Not dropping such a message leads to a Raft snapshot, as the leader will learn that the follower has last index zero, which is never an index that can be caught up to from the log (our log "starts" at index 10).
The script below reproduces the race (prior to this commit) by running 1000 splits back to back in a three node local cluster, usually showing north of a hundred Raft snapshots, i.e. a >10% chance to hit the race for each split. There's also a unit test that exposes this problem and can be stressed more conveniently (it also exposes the problems in the preceding commit related to overly aggressive log truncation).
The false positives here are a) the LHS of the split needs a snapshot which catches it up across the split trigger and b) the LHS is rebalanced away (and GC'ed) before applying the split trigger. In both cases the timeout-based mechanism would allow the snapshot after a few seconds, once the Raft leader next contacts the follower.
Note that the interaction with Raft group quiescence is benign. We're only dropping MsgAppResp, which is only sent by followers, implying that the Raft group is already unquiesced.
```
set -euxo pipefail

killall -9 cockroach || true
killall -9 workload || true
sleep 1
rm -rf cockroach-data || true
mkdir -p cockroach-data

./cockroach start --insecure --host=localhost --port=26257 --http-port=26258 --store=cockroach-data/1 --cache=256MiB --background
./cockroach start --insecure --host=localhost --port=26259 --http-port=26260 --store=cockroach-data/2 --cache=256MiB --join=localhost:26257 --background
./cockroach start --insecure --host=localhost --port=26261 --http-port=26262 --store=cockroach-data/3 --cache=256MiB --join=localhost:26257 --background

sleep 5

./cockroach sql --insecure -e 'set cluster setting kv.range_merge.queue_enabled = false;'
./bin/workload run kv --splits 1000 --init --drop --max-ops 1

sleep 5

for port in 26257 26259 26261; do
  ./cockroach sql --insecure -e "select name, value from crdb_internal.node_metrics where name like '%raftsn%' order by name desc" --port "${port}"
done
```
Release note (bug fix): Avoid occasional unnecessary Raft snapshots after Range splits.
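The gist of that mechanism, as a hedged sketch: the real change lives in the Raft message handling in pkg/storage and carries more state; this standalone function only mirrors the condition discussed in the review (uninitialized replica, rejecting MsgAppResp with RejectHint == 0, bounded by ticks). The import path is etcd's raftpb as published in later releases; the vendored path at the time may differ.

```
import "go.etcd.io/etcd/raft/raftpb"

// maybeDropRejectingMsgAppResp reports whether an outgoing MsgAppResp from an
// uninitialized follower should be swallowed for now, so that the split
// trigger can win the race instead of the leader queueing a Raft snapshot.
func maybeDropRejectingMsgAppResp(msg raftpb.Message, initialized bool, ticks int) bool {
	if msg.Type != raftpb.MsgAppResp || !msg.Reject {
		return false // only rejecting append responses are interesting
	}
	if initialized {
		return false // the replica has data; it genuinely needs catching up
	}
	// RejectHint == 0 means "my log is empty", exactly the state of a
	// not-yet-triggered RHS. Time-bound the dropping so a real snapshot can
	// still happen if the split trigger never arrives.
	return msg.RejectHint == 0 && ticks < 10
}
```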
Force-pushed 0518ff6 to 3fd4bd5.
bors r=benesch
31875: storage: avoid errant Raft snapshots during splits r=benesch a=tbg
A known race occurs during splits when some nodes apply the split trigger faster than others. The "slow" node(s) may learn about the newly created right hand side replica through Raft messages arriving from the "fast" nodes. In such cases, the leader will immediately try to catch the follower (which it sees at log position zero) up via a snapshot, but this isn't possible since there's an overlapping replica (the pre-split replica waiting to apply the trigger). I *think* we would find that out early in the life of the snapshot, i.e. we wouldn't transfer large amounts of data around. Nevertheless, this smells bad.
The following script reproduces this by running 1000 splits back to back in a three node local cluster. In an ideal world, we'd see zero Raft snapshots here, but we see north of a hundred typically. Not all of those are due to the race. ReplicaGC and Raft log truncation can also contribute to this issue (disable both of those queues for pure results).
We shouldn't ask for a snapshot when we suspect the target replica is going to catch up on a split trigger. This commit doesn't really fix that, but it suggests that one cheap way out is to simply report the requested snapshot as completed immediately. Raft will go back to probing the commit index, and this will continue until the split trigger has won the race (of course this trick can only ever help in situations in which we know in our heart of hearts that there's a commit trigger that will render the snapshot useless).
```
set -euxo pipefail

killall -9 cockroach || true
killall -9 workload || true
sleep 1
rm -rf cockroach-data || true
mkdir -p cockroach-data

./cockroach start --insecure --host=localhost --port=26257 --http-port=26258 --store=cockroach-data/1 --cache=256MiB --background
./cockroach start --insecure --host=localhost --port=26259 --http-port=26260 --store=cockroach-data/2 --cache=256MiB --join=localhost:26257 --background
./cockroach start --insecure --host=localhost --port=26261 --http-port=26262 --store=cockroach-data/3 --cache=256MiB --join=localhost:26257 --background

sleep 5

./cockroach sql --insecure -e 'alter range default configure zone using range_max_bytes = 1048471142400;'
./cockroach sql --insecure -e 'set cluster setting kv.range_merge.queue_enabled = false;'
./bin/workload run kv --splits 1000 --init --drop --max-ops 1

for port in 26257 26259 26261; do
  ./cockroach sql --insecure -e "select name, value from crdb_internal.node_metrics where name like '%raftsn%' order by name desc" --port "${port}"
done

sleep 5
tail -Fq cockroach-data/*/logs/cockroach.log
```
Release note: None
Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
Build succeeded
With cockroachdb#31875 in, import/tpch/nodes=4 has been passing reliably, and the expectation is that the nodes=8,32 versions will follow suit (at least they passed just now). Release note: None
32749: roachtest: reenable skipped import tests r=petermattis a=tbg With #31875 in, import/tpch/nodes=4 has been passing reliably, and the expectation is that the nodes=8,32 versions will follow suit (at least they passed just now). Release note: None Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
A known race occurs during splits when some nodes apply the split
trigger faster than others. The "slow" node(s) may learn about the
newly created right hand side replica through Raft messages arriving
from the "fast" nodes. In such cases, the leader will immediately try to
catch the follower (which it sees at log position zero) up via a
snapshot, but this isn't possible since there's an overlapping replica
(the pre-split replica waiting to apply the trigger). I think we
would find that out early in the life of the snapshot, i.e. we wouldn't
transfer large amounts of data around. Nevertheless, this smells bad.
The following script reproduces this by running 1000 splits back
to back in a three node local cluster. In an ideal world, we'd
see zero Raft snapshots here, but we see north of a hundred typically.
Not all of those are due to the race. ReplicaGC and Raft log truncation
can also contribute to this issue (disable both of those queues for pure
results).
We shouldn't ask for a snapshot when we suspect the target replica is
going to catch up on a split trigger. This commit doesn't really fix
that, but it suggests that one cheap way out is to simply report the
requested snapshot as completed immediately. Raft will go back to
probing the commit index, and this will continue until the split
trigger has won the race (of course this trick can only ever help
in situations in which we know in our heart of hearts that there's
a commit trigger that will render the snapshot useless).
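For reference, the "report the requested snapshot as completed" trick described above boils down to a single etcd/raft call; a sketch, with the interesting part (the surrounding conditions and locking) elided:

```
import "go.etcd.io/etcd/raft"

// reportPseudoSnapshot tells Raft that the snapshot it requested for the given
// follower finished, so it resumes probing/appending instead of waiting for a
// snapshot we never intend to send (we expect the split trigger to catch the
// follower up instead). RawNode.ReportSnapshot and SnapshotFinish are real
// etcd/raft APIs; this wrapper is just illustrative.
func reportPseudoSnapshot(rn *raft.RawNode, followerID uint64) {
	rn.ReportSnapshot(followerID, raft.SnapshotFinish)
}
```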
Touches #32046
Touches #31409
Release note: None