-
Notifications
You must be signed in to change notification settings - Fork 3.8k
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
storage: never truncate log for inflight snapshots
When a (preemptive or Raft) snapshot is inflight, a range should not truncate its log index to one ahead of the pending snapshot as doing so would require yet another snapshot. The code prior to this commit attempted to achieve that by treating the pending snapshot as an additional follower in the computation of the quorum index. This works as intended as long as the Raft log is below its configured maximum size but usually leads to abandoning the snapshot index when truncating based on size (to the quorum commit index). This situation occurs frequently during the split+scatter phase of data imports, where (mostly empty) ranges are rapidly upreplicated and split. This isn't limited to small replicas, however. Assume a range is ~32mb in size and a (say, preemptive) snapshot is underway. Preemptive snapshots are (at the time of writing) throttled to 2mb/s, so it will take approximately 16s to go through. At the same time, the Raft log may easily grow in size by the size truncation threshold (4mb at time of writing), allowing a log truncation that abandons the snapshot. A similar phenomenon applies to Raft snapshots, though now the quota pool will place restrictions on forward Raft progress, however not if a single command exceeds the size restriction (as is common during RESTORE operations where individual Raft commands are large). I haven't conclusively observed this in practice (though there has been enough badness to suspect that it happened), but in principle this could lead to a potentially infinite number of snapshots being sent out, a very expensive way of keeping a follower up to date. After this change, the pending snapshot index is managed more carefully (and is managed for Raft snapshots as well, not only for preemptive ones) and is never truncated away. As a result, in regular operation snapshots should now be able to be followed by regular Raft log catch-up in all cases. More precisely, we keep a map of pending snapshot index to deadline. The deadline is zero while the snapshot is ongoing and is set when the snapshot is completed. This avoids races between the snapshot completing and the replication change completing (which would occur even if the snapshot is only registering as completed after the replication change completes). There's an opportunity for a bigger refactor here by using learner replicas instead of preemptive snapshots. The idea is to add the replica as a learner first (ignoring it for the quota pool until it has received the first snapshot) first, and to use the regular Raft snapshot mechanism to catch it up. Once achieved, another configuration change would convert the learner into a regular follower. This change in itself will likely make our code more principled, but it is a more invasive change that is left for the future. Similarly, there is knowledge in the quota pool that it seems we should be using for log truncation decisions. The quota pool essentially knows about the size of each log entry whereas the Raft truncations only know the accumulated approximate size of the full log. For instance, instead of blindly truncating to the quorum commit index when the Raft log is too large, we could truncate it to an index that reduces the log size to about 50% of the maximum, in effect reducing the number of snapshots that are necessary due to quorum truncation. It's unclear whether this improvement will matter in practice. The following script reproduces redundant Raft snapshots (needs a somewhat beefy machine such as a gceworker). Note that this script will also incur Raft snapshots due to the splits, which are fixed in a follow-up commit. ``` set -euxo pipefail killall -9 cockroach || true killall -9 workload || true sleep 1 rm -rf cockroach-data || true mkdir -p cockroach-data ./cockroach start --insecure --host=localhost --port=26257 --http-port=26258 --store=cockroach-data/1 --cache=256MiB --background ./cockroach start --insecure --host=localhost --port=26259 --http-port=26260 --store=cockroach-data/2 --cache=256MiB --join=localhost:26257 --background ./cockroach start --insecure --host=localhost --port=26261 --http-port=26262 --store=cockroach-data/3 --cache=256MiB --join=localhost:26257 --background ./cockroach start --insecure --host=localhost --port=26263 --http-port=26264 --store=cockroach-data/4 --cache=256MiB --join=localhost:26257 --background sleep 5 ./cockroach sql --insecure -e 'set cluster setting kv.range_merge.queue_enabled = false;' ./cockroach sql --insecure <<EOF CREATE DATABASE csv; IMPORT TABLE csv.lineitem CREATE USING 'gs://cockroach-fixtures/tpch-csv/schema/lineitem.sql' CSV DATA ( 'gs://cockroach-fixtures/tpch-csv/sf-1/lineitem.tbl.1', 'gs://cockroach-fixtures/tpch-csv/sf-1/lineitem.tbl.2', 'gs://cockroach-fixtures/tpch-csv/sf-1/lineitem.tbl.3', 'gs://cockroach-fixtures/tpch-csv/sf-1/lineitem.tbl.4', 'gs://cockroach-fixtures/tpch-csv/sf-1/lineitem.tbl.5', 'gs://cockroach-fixtures/tpch-csv/sf-1/lineitem.tbl.6', 'gs://cockroach-fixtures/tpch-csv/sf-1/lineitem.tbl.7', 'gs://cockroach-fixtures/tpch-csv/sf-1/lineitem.tbl.8' ) WITH delimiter='|' ; EOF sleep 5 for port in 26257 26259 26261 26263; do ./cockroach sql --insecure -e "select name, value from crdb_internal.node_metrics where name like '%raftsn%' order by name desc" --port "${port}" +done ``` Release note: None
- Loading branch information
Showing
5 changed files
with
193 additions
and
78 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.