storage: avoid errant Raft snapshots after splits
A known race occurs during splits when some nodes apply the split trigger faster than others. The "slow" node(s) may learn about the newly created right-hand side replica through Raft messages arriving from the "fast" nodes. In that case, the leader immediately tries to catch the follower up (it sees it at log position zero) via a snapshot, but this isn't possible because an overlapping replica exists (the pre-split replica still waiting to apply the trigger). This both transfers data unnecessarily and can clog the Raft snapshot queue, which tends to get stuck due to the throttling mechanisms at both the sender and the receivers.

To prevent this race (or make it exceedingly unlikely), we selectively drop certain messages from uninitialized followers, namely those that refuse an append to the log, for a number of ticks (corresponding to at most a few seconds of real time). Not dropping such a message leads to a Raft snapshot, since the leader learns that the follower's last index is zero, which is never an index that can be caught up to from the log (our log "starts" at index 10). A sketch of this idea follows at the end of this message.

The script below reproduces the race (prior to this commit) by running 1000 splits back to back in a three-node local cluster, usually showing north of a hundred Raft snapshots, i.e. a >10% chance to hit the race for each split. There's also a unit test that exposes this problem and can be stressed more conveniently (it also exposes the problems with overly aggressive log truncation addressed in the preceding commit).

The false positives here are a) the LHS of the split needs a snapshot which catches it up across the split trigger and b) the LHS is rebalanced away (and GC'ed) before applying the split trigger. In both cases the timeout-based mechanism allows the snapshot after a few seconds, once the Raft leader contacts the follower again.

Note that the interaction with Raft group quiescence is benign. We're only dropping MsgAppResp, which is only sent by followers, implying that the Raft group is already unquiesced.

```
set -euxo pipefail

killall -9 cockroach || true
killall -9 workload || true
sleep 1
rm -rf cockroach-data || true
mkdir -p cockroach-data

./cockroach start --insecure --host=localhost --port=26257 --http-port=26258 --store=cockroach-data/1 --cache=256MiB --background
./cockroach start --insecure --host=localhost --port=26259 --http-port=26260 --store=cockroach-data/2 --cache=256MiB --join=localhost:26257 --background
./cockroach start --insecure --host=localhost --port=26261 --http-port=26262 --store=cockroach-data/3 --cache=256MiB --join=localhost:26257 --background

sleep 5

./cockroach sql --insecure -e 'set cluster setting kv.range_merge.queue_enabled = false;'
./bin/workload run kv --splits 1000 --init --drop --max-ops 1

sleep 5

for port in 26257 26259 26261; do
  ./cockroach sql --insecure -e "select name, value from crdb_internal.node_metrics where name like '%raftsn%' order by name desc" --port "${port}"
done
```

Release note (bug fix): Avoid occasional unnecessary Raft snapshots after Range splits.
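Below is a minimal Go sketch of the drop heuristic described above, under stated assumptions: the type and function names (`replicaState`, `shouldDropMsgAppResp`), the tick threshold, and the exact placement in the message-send path are hypothetical, not the actual CockroachDB implementation; the `raftpb` import path depends on the etcd/raft version in use. It only illustrates the rule that rejecting MsgAppResp messages from an uninitialized replica are suppressed for a bounded number of ticks.

```go
// Package storage sketches the suppression of snapshot-triggering
// MsgAppResp messages from uninitialized replicas (assumed names).
package storage

import "go.etcd.io/etcd/raft/raftpb"

// replicaState is a hypothetical view of a replica as seen by the
// message-send path.
type replicaState struct {
	initialized   bool // set once the replica has applied the split trigger or a snapshot
	creationTicks int  // Raft ticks since the replica was created by an incoming message
}

// maxDelayTicks bounds the suppression window (assumed value, roughly a few
// seconds of real time). Once it expires, a genuinely needed snapshot (e.g.
// the LHS was rebalanced away before applying the split trigger) can proceed.
const maxDelayTicks = 30

// shouldDropMsgAppResp reports whether an outbound message from this replica
// should be dropped. Only rejecting MsgAppResp messages are affected: those
// are what tell the leader that the follower is at log index zero and
// therefore needs a snapshot.
func shouldDropMsgAppResp(r replicaState, m raftpb.Message) bool {
	if r.initialized {
		return false // an initialized replica reports a real log position
	}
	if m.Type != raftpb.MsgAppResp || !m.Reject {
		return false // only rejections lead the leader to queue a snapshot
	}
	// Drop the rejection while the split trigger may still initialize the
	// replica, but give up after the timeout so that the false-positive
	// cases above still get their snapshot.
	return r.creationTicks <= maxDelayTicks
}
```

In this sketch, the outgoing-message path would consult `shouldDropMsgAppResp` before handing messages to the Raft transport; everything else flows through unchanged, which is why quiescence is unaffected.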