storage: optimize MVCCGarbageCollect for large numbers of versions #51184
Conversation
Force-pushed from 88b5d8c to 9a96e1d
This commit adds an optimization to massively reduce the overhead of garbage collecting large numbers of versions. When running garbage collection, we currently iterate through the store to send (Key, Timestamp) pairs in a GCRequest to the store to be evaluated and applied. This can be rather slow when deleting large numbers of keys, particularly due to the need to paginate. The primary motivation for pagination is to ensure that we do not create raft commands with deletions that are too large.

In practice, for the tiny values in a sequence, we can GC around 1800 versions/s with cockroachdb#51184 and around 900 without it (though note that in that PR, the more versions exist, the worse the throughput will be). This remains abysmally slow. Using larger batches could be one approach to increase the throughput, but, as it stands, 256 KiB is not a tiny raft command.

This commit instead turns to the ClearRange operation, which can delete all versions of a key with the replication overhead of just two copies. This approach is somewhat controversial because, as @petermattis puts it:

```
We need to be careful about this. Historically, adding many range tombstones
was very bad for performance. I think we resolved most (all?) of those issues,
but I'm still nervous about embracing using range tombstones below the level
of a Range.
```

Nevertheless, the results are enticing. Rather than pinning a core at full utilization for minutes just to clear the versions written to a sequence over the course of a bit more than an hour, we can clear that in ~2 seconds.

Release note (performance improvement): Improved performance of garbage collection in the face of large numbers of versions.
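For illustration, here is a minimal sketch of the paginated point-delete approach this commit moves away from. The `GCKey`/`GCRequest` structs, field names, and pagination logic are illustrative stand-ins, not the real kvserver types; only the 256 KiB batch limit comes from the description above.

```go
// Sketch of paginating (Key, Timestamp) pairs into size-bounded GC requests.
// All names here are hypothetical stand-ins for the real CockroachDB types.
package main

import "fmt"

type GCKey struct {
	Key       []byte
	Timestamp int64 // stand-in for an MVCC timestamp
}

type GCRequest struct {
	Keys []GCKey
}

// paginate splits the garbage keys into requests whose rough encoded size
// stays under maxBytes so that no single raft command grows too large.
func paginate(keys []GCKey, maxBytes int) []GCRequest {
	var reqs []GCRequest
	var cur GCRequest
	curBytes := 0
	for _, k := range keys {
		sz := len(k.Key) + 8 // rough size of one (Key, Timestamp) pair
		if curBytes+sz > maxBytes && len(cur.Keys) > 0 {
			reqs = append(reqs, cur)
			cur, curBytes = GCRequest{}, 0
		}
		cur.Keys = append(cur.Keys, k)
		curBytes += sz
	}
	if len(cur.Keys) > 0 {
		reqs = append(reqs, cur)
	}
	return reqs
}

func main() {
	var keys []GCKey
	for ts := int64(1); ts <= 100000; ts++ {
		keys = append(keys, GCKey{Key: []byte("sequence/1"), Timestamp: ts})
	}
	fmt.Println("GC requests needed:", len(paginate(keys, 256<<10)))
}
```

Even with tiny keys, a large pile of versions of one key spills across several raft commands this way; the ClearRange path described above replaces all of those point deletes for a key with a single range tombstone.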
9a96e1d
to
3a1e899
Compare
This commit unifies `BenchmarkGarbageCollect` in the style of `BenchmarkExportToSst` and moves both from engine-specific files to bench_test.go. Release note: None
Before this change, BenchmarkGarbageCollect always benchmarked collecting all but the last version of a key. This change expands the test to benchmark different numbers of deleted versions to highlight that the current implementation is linear in the number of versions. Release note: None
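As a rough illustration (not the actual bench_test.go code), a benchmark matrix of this shape might look like the following; `runGarbageCollect` is a hypothetical stand-in for the engine-backed benchmark body.

```go
// Sketch of a benchmark matrix that varies the number of deleted versions
// independently of the total version count. Parameter choices mirror the
// results quoted later in this thread; the helper is illustrative only.
package storage_test

import (
	"fmt"
	"testing"
)

func runGarbageCollect(b *testing.B, numKeys, numVersions, deleteVersions int) {
	// A real benchmark would populate an engine with numKeys keys, each with
	// numVersions versions, and GC the oldest deleteVersions of each.
	b.Skip("illustrative stand-in only")
}

func BenchmarkGarbageCollect(b *testing.B) {
	for _, numKeys := range []int{1, 1024} {
		for _, numVersions := range []int{2, 1024} {
			for _, deleteVersions := range []int{1, 16, 32, 512, 1015, 1023} {
				if deleteVersions >= numVersions {
					continue
				}
				name := fmt.Sprintf("numKeys=%d/numVersions=%d/deleteVersions=%d",
					numKeys, numVersions, deleteVersions)
				b.Run(name, func(b *testing.B) {
					runGarbageCollect(b, numKeys, numVersions, deleteVersions)
				})
			}
		}
	}
}
```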
Sorry for the slightly delayed review - missed this last week.
Reviewed 3 of 3 files at r1, 1 of 1 files at r2, 1 of 1 files at r3.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner)
pkg/storage/bench_test.go, line 36 at r1 (raw file):
) // Note: most benchmarks in this package have an engine-specific Benchmark
👍
pkg/storage/mvcc.go, line 3179 at r3 (raw file):
// in increasing order.
sort.Slice(keys, func(i, j int) bool {
    cmp := keys[i].Key.Compare(keys[j].Key)
Why not just do `keys[i].Less(keys[j])`?
pkg/storage/mvcc.go, line 3262 at r3 (raw file):
// to find the predecessor of gcKey before resorting to seeking.
//
// In a synthetic benchmark where there is one version of garbage and one
Nit: extra spaces
Prior to this change, MVCCGarbageCollect performed a linear scan of all versions of a key, not just the versions being garbage collected. Given the pagination of deleting versions above this call, the linear behavior can result in quadratic runtime of GC when the number of versions vastly exceeds the page size.

The benchmark results demonstrate the change's effectiveness. It's worth noting that for a single key with a single version, the change has a negative performance impact; I suspect this is due to the allocation of a key in order to construct the iterator. In cases involving more keys, I theorize the improvement is due to the fact that the iterator is now never seeked backwards, thanks to the sorting of the keys. It's worth noting that since 20.1, the GC queue has been sending keys in the GC request in reverse order; I anticipate that this sorting is likely a good thing in that case too. The stepping optimization seemed important in the microbenchmarks for cases where most of the data was garbage. Without it, the change had a small negative impact on performance.

```
name                                                                                                old time/op  new time/op  delta
MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1/numVersions=2/deleteVersions=1-24        3.39µs ± 1%   3.96µs ± 0%  +16.99%  (p=0.004 n=6+5)
MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1/numVersions=1024/deleteVersions=1-24      319µs ± 3%     10µs ±12%  -96.88%  (p=0.002 n=6+6)
MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1/numVersions=1024/deleteVersions=16-24     319µs ± 2%     16µs ±10%  -94.95%  (p=0.002 n=6+6)
MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1/numVersions=1024/deleteVersions=32-24     319µs ± 3%     21µs ± 5%  -93.52%  (p=0.002 n=6+6)
MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1/numVersions=1024/deleteVersions=512-24    337µs ± 1%    182µs ± 3%  -46.00%  (p=0.002 n=6+6)
MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1/numVersions=1024/deleteVersions=1015-24   361µs ± 0%    353µs ± 2%   -2.32%  (p=0.010 n=4+6)
MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1/numVersions=1024/deleteVersions=1023-24   361µs ± 3%    350µs ± 2%   -3.14%  (p=0.009 n=6+6)
MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1024/numVersions=2/deleteVersions=1-24      2.00ms ± 3%   2.25ms ± 2%  +12.53%  (p=0.004 n=6+5)
MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1024/numVersions=1024/deleteVersions=1-24    388ms ± 3%     16ms ± 5%  -95.76%  (p=0.002 n=6+6)
MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1024/numVersions=1024/deleteVersions=16-24   387ms ± 1%     27ms ± 3%  -93.14%  (p=0.002 n=6+6)
MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1024/numVersions=1024/deleteVersions=32-24   393ms ± 5%     35ms ± 4%  -91.09%  (p=0.002 n=6+6)
MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1024/numVersions=1024/deleteVersions=512-24  463ms ± 4%    276ms ± 3%  -40.43%  (p=0.004 n=5+6)
MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1024/numVersions=1024/deleteVersions=1015-24 539ms ± 5%    514ms ± 3%   -4.64%  (p=0.016 n=5+5)
MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1024/numVersions=1024/deleteVersions=1023-24 533ms ± 4%    514ms ± 1%     ~     (p=0.093 n=6+6)
MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1/numVersions=2/deleteVersions=1-24          1.97µs ± 3%   2.29µs ± 2%  +16.58%  (p=0.002 n=6+6)
MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1/numVersions=1024/deleteVersions=1-24        139µs ± 1%      5µs ± 6%  -96.40%  (p=0.004 n=5+6)
MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1/numVersions=1024/deleteVersions=16-24       140µs ± 1%      8µs ± 1%  -94.13%  (p=0.004 n=6+5)
MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1/numVersions=1024/deleteVersions=32-24       143µs ± 4%     11µs ± 2%  -92.03%  (p=0.002 n=6+6)
MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1/numVersions=1024/deleteVersions=512-24      178µs ± 9%    109µs ± 1%  -38.75%  (p=0.004 n=6+5)
MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1/numVersions=1024/deleteVersions=1015-24     201µs ± 1%    213µs ± 1%   +5.80%  (p=0.008 n=5+5)
MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1/numVersions=1024/deleteVersions=1023-24     205µs ±11%    215µs ± 6%     ~     (p=0.126 n=5+6)
MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1024/numVersions=2/deleteVersions=1-24       1.43ms ± 1%   1.34ms ± 1%   -5.82%  (p=0.004 n=6+5)
MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1024/numVersions=1024/deleteVersions=1-24     218ms ± 9%      9ms ± 2%  -96.00%  (p=0.002 n=6+6)
MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1024/numVersions=1024/deleteVersions=16-24    216ms ± 3%     15ms ± 2%  -93.19%  (p=0.004 n=5+6)
MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1024/numVersions=1024/deleteVersions=32-24    219ms ± 4%     20ms ± 5%  -90.77%  (p=0.004 n=5+6)
MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1024/numVersions=1024/deleteVersions=512-24   303ms ± 4%    199ms ± 4%  -34.47%  (p=0.004 n=5+6)
MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1024/numVersions=1024/deleteVersions=1015-24  382ms ±16%    363ms ± 8%     ~     (p=0.485 n=6+6)
MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1024/numVersions=1024/deleteVersions=1023-24  363ms ± 4%    354ms ± 4%     ~     (p=0.222 n=5+5)
```

Release note (performance improvement): Improved the efficiency of garbage collection when there are a large number of versions of a single key, commonly found when utilizing sequences.
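Below is a hedged sketch, using forward iteration for simplicity, of the step-before-seek tradeoff mentioned above: take a bounded number of cheap steps toward the next key of interest and only fall back to a seek when that fails. The iterator type and the step limit are simplified stand-ins, not the storage engine's API or the commit's exact logic.

```go
// Step a bounded number of times before paying for a seek.
package main

import "fmt"

type sliceIter struct {
	keys []string
	pos  int
}

func (it *sliceIter) Valid() bool { return it.pos < len(it.keys) }
func (it *sliceIter) Key() string { return it.keys[it.pos] }
func (it *sliceIter) Next()       { it.pos++ }

// SeekGE positions the iterator at the first key >= k (linear for the sketch).
func (it *sliceIter) SeekGE(k string) {
	for it.pos = 0; it.pos < len(it.keys) && it.keys[it.pos] < k; it.pos++ {
	}
}

// advanceTo reaches the first key >= target, preferring cheap Next calls over
// a seek when the target is nearby.
func advanceTo(it *sliceIter, target string) {
	const maxSteps = 4 // assumed small constant; the real threshold may differ
	for i := 0; i < maxSteps && it.Valid() && it.Key() < target; i++ {
		it.Next()
	}
	if it.Valid() && it.Key() < target {
		it.SeekGE(target) // target was far away: fall back to a seek
	}
}

func main() {
	it := &sliceIter{keys: []string{"a", "b", "c", "d", "e", "f", "g", "h", "z"}}
	advanceTo(it, "c")
	fmt.Println(it.Key()) // "c", reached by stepping
	advanceTo(it, "z")
	fmt.Println(it.Key()) // "z", reached via the seek fallback
}
```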
Force-pushed from 3a1e899 to 8e5423b
Thanks for the review. I changed the less function to utilize `MVCCKey.Less`.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @itsbilal)
pkg/storage/mvcc.go, line 3179 at r3 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
Why not just do `keys[i].Less(keys[j])`?
These keys aren't `roachpb.Key`s but rather `roachpb.GCRequest_GCKey`s. That being said, constructing `MVCCKey`s and using `Less` is cleaner, so I changed it to do that.
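For illustration, here is roughly what that comparison looks like; the types below are simplified stand-ins for `roachpb.GCRequest_GCKey` and `storage.MVCCKey` (the real `MVCCKey.Less` also handles empty timestamps), not the actual definitions.

```go
// Sort GC keys by wrapping each entry in an MVCC-style key and delegating
// to its Less method. Types and ordering are illustrative stand-ins.
package main

import (
	"bytes"
	"fmt"
	"sort"
)

type Timestamp struct{ WallTime int64 }

type GCKey struct {
	Key       []byte
	Timestamp Timestamp
}

type MVCCKey struct {
	Key       []byte
	Timestamp Timestamp
}

// Less orders keys ascending by key and, within a key, newer timestamps first,
// mirroring the MVCC encoding's ordering (simplified here).
func (k MVCCKey) Less(o MVCCKey) bool {
	if c := bytes.Compare(k.Key, o.Key); c != 0 {
		return c < 0
	}
	return k.Timestamp.WallTime > o.Timestamp.WallTime
}

func main() {
	keys := []GCKey{
		{Key: []byte("b"), Timestamp: Timestamp{WallTime: 5}},
		{Key: []byte("a"), Timestamp: Timestamp{WallTime: 3}},
		{Key: []byte("a"), Timestamp: Timestamp{WallTime: 7}},
	}
	sort.Slice(keys, func(i, j int) bool {
		ki := MVCCKey{Key: keys[i].Key, Timestamp: keys[i].Timestamp}
		kj := MVCCKey{Key: keys[j].Key, Timestamp: keys[j].Timestamp}
		return ki.Less(kj)
	})
	for _, k := range keys {
		fmt.Printf("%s@%d\n", k.Key, k.Timestamp.WallTime) // a@7, a@3, b@5
	}
}
```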
pkg/storage/mvcc.go, line 3262 at r3 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
Nit: extra spaces
Done.
bors r=itsbilal
Build succeeded
The bug that this fixes was introduced in cockroachdb#51184. This bug was fortunately caught by a gc queue test under stressrace where a batch happens to get sent twice and thus there isn't garbage. The specific problem is that the implementation of `ReadWriter` used by `kvserver` is `spanset.Batch`, which returns an error after `SeekLT` if the found key is not within the span, leading to an error being returned and the keys not being properly GC'd. Fixes cockroachdb#51402. Release note: None
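A hedged sketch of that failure mode follows, using a made-up span-enforcing iterator in the spirit of `spanset.Batch` rather than the real implementation: seeking backwards to the predecessor of the first key in the span can land outside the declared bounds, which the wrapper treats as an error.

```go
// Illustrative only: a span-checking iterator that rejects a SeekLT landing
// outside the declared span, which is what happens when seeking to the
// predecessor of the span's first key.
package main

import (
	"errors"
	"fmt"
)

type spanCheckedIter struct {
	start, end string // declared span [start, end)
	keys       []string
	pos        int
	err        error
}

// SeekLT positions the iterator at the last key strictly less than target and
// records an error if that key falls outside the declared span.
func (it *spanCheckedIter) SeekLT(target string) {
	it.pos = -1
	for i, k := range it.keys {
		if k < target {
			it.pos = i
		}
	}
	if it.pos >= 0 && (it.keys[it.pos] < it.start || it.keys[it.pos] >= it.end) {
		it.err = errors.New("key outside declared span")
	}
}

func main() {
	it := &spanCheckedIter{start: "b", end: "d", keys: []string{"a", "b", "c"}}
	it.SeekLT("b") // lands on "a", which is before the declared span
	fmt.Println(it.err)
}
```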
51462: storage: optimize MVCCGarbageCollect for large number of undeleted version r=itsbilal a=ajwerner

This is a better-tested take-2 of #51184.

* The first commit introduces a `SupportsPrev` method to the `Iterator` interface (a sketch follows below).
* The second commit adjusts the benchmark to use a regular batch.
* The third commit adjusts the logic in #51184 to fix the bugs there and adds testing.

The win is now only pronounced on pebble, as might be expected.

51489: opt, cat: rename Table.AllColumnCount to Table.ColumnCount r=RaduBerinde a=RaduBerinde

This change renames AllColumnCount to ColumnCount. In #51400 I meant to rename before merging but I forgot. Release note: None

Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
Co-authored-by: Radu Berinde <radu@cockroachlabs.com>
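For illustration only, here is what a `SupportsPrev` capability check on an iterator interface might look like; the interface and the forward-only implementation are hypothetical, not the actual `storage.Iterator` definition.

```go
// Sketch of advertising whether an iterator can step backwards, so callers
// can fall back to seeks when it cannot. Names are illustrative.
package main

import "fmt"

type Iterator interface {
	SeekGE(key string)
	Next()
	Prev()
	// SupportsPrev reports whether this implementation can iterate backwards.
	SupportsPrev() bool
}

type forwardOnlyIter struct{}

func (forwardOnlyIter) SeekGE(string)      {}
func (forwardOnlyIter) Next()              {}
func (forwardOnlyIter) Prev()              { panic("Prev not supported") }
func (forwardOnlyIter) SupportsPrev() bool { return false }

func main() {
	var it Iterator = forwardOnlyIter{}
	fmt.Println("can iterate backwards:", it.SupportsPrev())
}
```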
This commit adds an optimization to massively reduce the overhead of garbage collecting large numbers of versions. When running garbage collection, we currently iterate through the store to send (Key, Timestamp) pairs in a GCRequest to the store to be evaluated and applied. This can be rather slow when deleting large numbers of keys, particularly due to the need to paginate. The primary motivation for pagination is to ensure that we do not create raft commands with deletions that are too large. In practice, for the tiny values in a sequence, we can GC around 1800 versions/s with cockroachdb#51184 and around 900 without it (though note that in that PR, the more versions exist, the worse the throughput will be). This remains abysmally slow. Using larger batches could be one approach to increase the throughput, but, as it stands, 256 KiB is not a tiny raft command.

This commit instead turns to the ClearRange operation, which can delete all versions of a key with the replication overhead of just two copies. This approach is somewhat controversial because, as @petermattis puts it:

```
We need to be careful about this. Historically, adding many range tombstones
was very bad for performance. I think we resolved most (all?) of those issues,
but I'm still nervous about embracing using range tombstones below the level
of a Range.
```

Nevertheless, the results are enticing. Rather than pinning a core at full utilization for minutes just to clear the versions written to a sequence over the course of a bit more than an hour, we can clear that in ~2 seconds.

The decision to use a ClearRange is controlled by whether an entire batch would be used to clear versions of a single key. This means we'll only send a ClearRange if `<key size> * <num versions> > 256 KiB`. There's an additional sanity check that `<num versions>` is greater than 32 in order to prevent issuing lots of `ClearRange` operations when the cluster has gigantic keys. This new functionality is gated behind both a version and an experimental cluster setting.

I'm skipping the release note because of the experimental flag. Release note: None
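A minimal sketch of that decision heuristic, assuming hypothetical names: `shouldUseClearRange` and the constant names are illustrative, and the real gate additionally sits behind a cluster version and an experimental cluster setting.

```go
// Only fall back to ClearRange when clearing one key's versions would fill an
// entire 256 KiB batch by itself, and never for fewer than 32 versions.
package main

import "fmt"

const (
	maxBatchBytes       = 256 << 10 // 256 KiB pagination limit per GC batch
	minVersionsForClear = 32        // sanity check against gigantic keys
)

// shouldUseClearRange reports whether GC of a single key's garbage versions
// should be issued as one ClearRange instead of many point deletes.
func shouldUseClearRange(keySize, numVersions int) bool {
	return numVersions > minVersionsForClear && keySize*numVersions > maxBatchBytes
}

func main() {
	fmt.Println(shouldUseClearRange(16, 100))    // tiny sequence key, few versions: false
	fmt.Println(shouldUseClearRange(16, 100000)) // tiny key, huge version count: true
	fmt.Println(shouldUseClearRange(1<<20, 4))   // gigantic key, few versions: false
}
```

The 32-version floor keeps a handful of very large values from each turning into their own range tombstone, while the 256 KiB threshold only engages the ClearRange path when point deletes would already fill an entire batch for a single key.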
The first two commits rework the benchmarks a tad. The third commit is the meat.
Touches and maybe fixes #50194.
Release note (performance improvement): Improved the efficiency of garbage
collection when there are a large number of versions of a single key,
commonly found when utilizing sequences.