storage: optimize MVCCGarbageCollect for large numbers of versions #51184
Conversation
Force-pushed from 88b5d8c to 9a96e1d
This commit adds an optimization to massively reduce the overhead of garbage collecting large numbers of versions. When running garbage collection, we currently iterate through the store to send (Key, Timestamp) pairs in a GCRequest to the store to be evaluated and applied. This can be rather slow when deleting large numbers of keys, particularly due to the need to paginate. The primary motivation for pagination is to ensure that we do not create raft commands with deletions that are too large.

In practice, for the tiny values in a sequence, we can GC around 1800 versions/s with cockroachdb#51184 and around 900 without it (though note that in that PR, the more versions exist, the worse the throughput will be). This remains abysmally slow. Using larger batches could be one approach to increase the throughput, but, as it stands, 256 KiB is not a tiny raft command.

This commit instead turns to the ClearRange operation, which can delete all versions of a key with the replication overhead of just two copies. This approach is somewhat controversial because, as @petermattis puts it:

```
We need to be careful about this. Historically, adding many range tombstones
was very bad for performance. I think we resolved most (all?) of those issues,
but I'm still nervous about embracing using range tombstones below the level
of a Range.
```

Nevertheless, the results are enticing. Rather than pinning a core at full utilization for minutes just to clear the versions written to a sequence over the course of a bit more than an hour, we can clear that in ~2 seconds.

Release note (performance improvement): Improved performance of garbage collection in the face of large numbers of versions.
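For illustration, here is a minimal sketch of the paginated point-delete approach this commit moves away from. The `GCKey`/`GCRequest` structs, field names, and pagination logic are illustrative stand-ins, not the real kvserver types; only the 256 KiB batch limit comes from the description above.

```go
// Sketch of paginating (Key, Timestamp) pairs into size-bounded GC requests.
// All names here are hypothetical stand-ins for the real CockroachDB types.
package main

import "fmt"

type GCKey struct {
	Key       []byte
	Timestamp int64 // stand-in for an MVCC timestamp
}

type GCRequest struct {
	Keys []GCKey
}

// paginate splits the garbage keys into requests whose rough encoded size
// stays under maxBytes so that no single raft command grows too large.
func paginate(keys []GCKey, maxBytes int) []GCRequest {
	var reqs []GCRequest
	var cur GCRequest
	curBytes := 0
	for _, k := range keys {
		sz := len(k.Key) + 8 // rough size of one (Key, Timestamp) pair
		if curBytes+sz > maxBytes && len(cur.Keys) > 0 {
			reqs = append(reqs, cur)
			cur, curBytes = GCRequest{}, 0
		}
		cur.Keys = append(cur.Keys, k)
		curBytes += sz
	}
	if len(cur.Keys) > 0 {
		reqs = append(reqs, cur)
	}
	return reqs
}

func main() {
	var keys []GCKey
	for ts := int64(1); ts <= 100000; ts++ {
		keys = append(keys, GCKey{Key: []byte("sequence/1"), Timestamp: ts})
	}
	fmt.Println("GC requests needed:", len(paginate(keys, 256<<10)))
}
```

Even with tiny keys, a large pile of versions of one key spills across several raft commands this way; the ClearRange path described above replaces all of those point deletes for a key with a single range tombstone.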
9a96e1d
to
3a1e899
Compare
This commit unifies `BenchmarkGarbageCollect` in the style of `BenchmarkExportToSst` and moves both from engine-specific files to bench_test.go. Release note: None
Before this change, BenchmarkGarbageCollect always benchmarked collecting all but the last version of a key. This change expands the test to benchmark different numbers of deleted versions to highlight that the current implementation is linear in the number of versions. Release note: None
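As a rough illustration (not the actual bench_test.go code), a benchmark matrix of this shape might look like the following; `runGarbageCollect` is a hypothetical stand-in for the engine-backed benchmark body.

```go
// Sketch of a benchmark matrix that varies the number of deleted versions
// independently of the total version count. Parameter choices mirror the
// results quoted later in this thread; the helper is illustrative only.
package storage_test

import (
	"fmt"
	"testing"
)

func runGarbageCollect(b *testing.B, numKeys, numVersions, deleteVersions int) {
	// A real benchmark would populate an engine with numKeys keys, each with
	// numVersions versions, and GC the oldest deleteVersions of each.
	b.Skip("illustrative stand-in only")
}

func BenchmarkGarbageCollect(b *testing.B) {
	for _, numKeys := range []int{1, 1024} {
		for _, numVersions := range []int{2, 1024} {
			for _, deleteVersions := range []int{1, 16, 32, 512, 1015, 1023} {
				if deleteVersions >= numVersions {
					continue
				}
				name := fmt.Sprintf("numKeys=%d/numVersions=%d/deleteVersions=%d",
					numKeys, numVersions, deleteVersions)
				b.Run(name, func(b *testing.B) {
					runGarbageCollect(b, numKeys, numVersions, deleteVersions)
				})
			}
		}
	}
}
```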
Sorry for the slightly delayed review - missed this last week.
Reviewed 3 of 3 files at r1, 1 of 1 files at r2, 1 of 1 files at r3.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner)
pkg/storage/bench_test.go, line 36 at r1 (raw file):
) // Note: most benchmarks in this package have an engine-specific Benchmark
👍
pkg/storage/mvcc.go, line 3179 at r3 (raw file):
// in increasing order.
sort.Slice(keys, func(i, j int) bool {
    cmp := keys[i].Key.Compare(keys[j].Key)
Why not just do `keys[i].Less(keys[j])`?
pkg/storage/mvcc.go, line 3262 at r3 (raw file):
// to find the predecessor of gcKey before resorting to seeking.
//
// In a synthetic benchmark where there is one version of garbage and one
Nit: extra spaces
Prior to this change, MVCCGarbageCollect performed a linear scan of all versions of a key, not just the versions being garbage collected. Given the pagination of deleting versions above this call, the linear behavior can result in quadratic runtime of GC when the number of versions vastly exceeds the page size.

The benchmark results demonstrate the change's effectiveness. It's worth noting that for a single key with a single version, the change has a negative performance impact; I suspect this is due to the allocation of a key in order to construct the iterator. In cases involving more keys, I theorize the improvement is due to the fact that the iterator is now never seeked backwards, thanks to the sorting of the keys. It's worth noting that since 20.1, the GC queue has been sending keys in the GC request in reverse order; I anticipate that this sorting is likely a good thing in that case too. The stepping optimization seemed important in the microbenchmarks for cases where most of the data was garbage. Without it, the change had a small negative impact on performance.

```
name                                                                                                old time/op  new time/op  delta
MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1/numVersions=2/deleteVersions=1-24        3.39µs ± 1%   3.96µs ± 0%  +16.99%  (p=0.004 n=6+5)
MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1/numVersions=1024/deleteVersions=1-24      319µs ± 3%     10µs ±12%  -96.88%  (p=0.002 n=6+6)
MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1/numVersions=1024/deleteVersions=16-24     319µs ± 2%     16µs ±10%  -94.95%  (p=0.002 n=6+6)
MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1/numVersions=1024/deleteVersions=32-24     319µs ± 3%     21µs ± 5%  -93.52%  (p=0.002 n=6+6)
MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1/numVersions=1024/deleteVersions=512-24    337µs ± 1%    182µs ± 3%  -46.00%  (p=0.002 n=6+6)
MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1/numVersions=1024/deleteVersions=1015-24   361µs ± 0%    353µs ± 2%   -2.32%  (p=0.010 n=4+6)
MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1/numVersions=1024/deleteVersions=1023-24   361µs ± 3%    350µs ± 2%   -3.14%  (p=0.009 n=6+6)
MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1024/numVersions=2/deleteVersions=1-24      2.00ms ± 3%   2.25ms ± 2%  +12.53%  (p=0.004 n=6+5)
MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1024/numVersions=1024/deleteVersions=1-24    388ms ± 3%     16ms ± 5%  -95.76%  (p=0.002 n=6+6)
MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1024/numVersions=1024/deleteVersions=16-24   387ms ± 1%     27ms ± 3%  -93.14%  (p=0.002 n=6+6)
MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1024/numVersions=1024/deleteVersions=32-24   393ms ± 5%     35ms ± 4%  -91.09%  (p=0.002 n=6+6)
MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1024/numVersions=1024/deleteVersions=512-24  463ms ± 4%    276ms ± 3%  -40.43%  (p=0.004 n=5+6)
MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1024/numVersions=1024/deleteVersions=1015-24 539ms ± 5%    514ms ± 3%   -4.64%  (p=0.016 n=5+5)
MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1024/numVersions=1024/deleteVersions=1023-24 533ms ± 4%    514ms ± 1%     ~     (p=0.093 n=6+6)
MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1/numVersions=2/deleteVersions=1-24          1.97µs ± 3%   2.29µs ± 2%  +16.58%  (p=0.002 n=6+6)
MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1/numVersions=1024/deleteVersions=1-24        139µs ± 1%      5µs ± 6%  -96.40%  (p=0.004 n=5+6)
MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1/numVersions=1024/deleteVersions=16-24       140µs ± 1%      8µs ± 1%  -94.13%  (p=0.004 n=6+5)
MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1/numVersions=1024/deleteVersions=32-24       143µs ± 4%     11µs ± 2%  -92.03%  (p=0.002 n=6+6)
MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1/numVersions=1024/deleteVersions=512-24      178µs ± 9%    109µs ± 1%  -38.75%  (p=0.004 n=6+5)
MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1/numVersions=1024/deleteVersions=1015-24     201µs ± 1%    213µs ± 1%   +5.80%  (p=0.008 n=5+5)
MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1/numVersions=1024/deleteVersions=1023-24     205µs ±11%    215µs ± 6%     ~     (p=0.126 n=5+6)
MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1024/numVersions=2/deleteVersions=1-24       1.43ms ± 1%   1.34ms ± 1%   -5.82%  (p=0.004 n=6+5)
MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1024/numVersions=1024/deleteVersions=1-24     218ms ± 9%      9ms ± 2%  -96.00%  (p=0.002 n=6+6)
MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1024/numVersions=1024/deleteVersions=16-24    216ms ± 3%     15ms ± 2%  -93.19%  (p=0.004 n=5+6)
MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1024/numVersions=1024/deleteVersions=32-24    219ms ± 4%     20ms ± 5%  -90.77%  (p=0.004 n=5+6)
MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1024/numVersions=1024/deleteVersions=512-24   303ms ± 4%    199ms ± 4%  -34.47%  (p=0.004 n=5+6)
MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1024/numVersions=1024/deleteVersions=1015-24  382ms ±16%    363ms ± 8%     ~     (p=0.485 n=6+6)
MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1024/numVersions=1024/deleteVersions=1023-24  363ms ± 4%    354ms ± 4%     ~     (p=0.222 n=5+5)
```

Release note (performance improvement): Improved the efficiency of garbage collection when there are a large number of versions of a single key, commonly found when utilizing sequences.
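Below is a hedged sketch, using forward iteration for simplicity, of the step-before-seek tradeoff mentioned above: take a bounded number of cheap steps toward the next key of interest and only fall back to a seek when that fails. The iterator type and the step limit are simplified stand-ins, not the storage engine's API or the commit's exact logic.

```go
// Step a bounded number of times before paying for a seek.
package main

import "fmt"

type sliceIter struct {
	keys []string
	pos  int
}

func (it *sliceIter) Valid() bool { return it.pos < len(it.keys) }
func (it *sliceIter) Key() string { return it.keys[it.pos] }
func (it *sliceIter) Next()       { it.pos++ }

// SeekGE positions the iterator at the first key >= k (linear for the sketch).
func (it *sliceIter) SeekGE(k string) {
	for it.pos = 0; it.pos < len(it.keys) && it.keys[it.pos] < k; it.pos++ {
	}
}

// advanceTo reaches the first key >= target, preferring cheap Next calls over
// a seek when the target is nearby.
func advanceTo(it *sliceIter, target string) {
	const maxSteps = 4 // assumed small constant; the real threshold may differ
	for i := 0; i < maxSteps && it.Valid() && it.Key() < target; i++ {
		it.Next()
	}
	if it.Valid() && it.Key() < target {
		it.SeekGE(target) // target was far away: fall back to a seek
	}
}

func main() {
	it := &sliceIter{keys: []string{"a", "b", "c", "d", "e", "f", "g", "h", "z"}}
	advanceTo(it, "c")
	fmt.Println(it.Key()) // "c", reached by stepping
	advanceTo(it, "z")
	fmt.Println(it.Key()) // "z", reached via the seek fallback
}
```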
Force-pushed from 3a1e899 to 8e5423b
Thanks for the review. I changed the less function to utilize `MVCCKey.Less`.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @itsbilal)
pkg/storage/mvcc.go, line 3179 at r3 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
Why not just do `keys[i].Less(keys[j])`?
These keys aren't `roachpb.Key`s but rather `roachpb.GCRequest_GCKey`s. That being said, constructing `MVCCKey`s and using `Less` is cleaner, so I changed it to do that.
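For illustration, here is roughly what that comparison looks like; the types below are simplified stand-ins for `roachpb.GCRequest_GCKey` and `storage.MVCCKey` (the real `MVCCKey.Less` also handles empty timestamps), not the actual definitions.

```go
// Sort GC keys by wrapping each entry in an MVCC-style key and delegating
// to its Less method. Types and ordering are illustrative stand-ins.
package main

import (
	"bytes"
	"fmt"
	"sort"
)

type Timestamp struct{ WallTime int64 }

type GCKey struct {
	Key       []byte
	Timestamp Timestamp
}

type MVCCKey struct {
	Key       []byte
	Timestamp Timestamp
}

// Less orders keys ascending by key and, within a key, newer timestamps first,
// mirroring the MVCC encoding's ordering (simplified here).
func (k MVCCKey) Less(o MVCCKey) bool {
	if c := bytes.Compare(k.Key, o.Key); c != 0 {
		return c < 0
	}
	return k.Timestamp.WallTime > o.Timestamp.WallTime
}

func main() {
	keys := []GCKey{
		{Key: []byte("b"), Timestamp: Timestamp{WallTime: 5}},
		{Key: []byte("a"), Timestamp: Timestamp{WallTime: 3}},
		{Key: []byte("a"), Timestamp: Timestamp{WallTime: 7}},
	}
	sort.Slice(keys, func(i, j int) bool {
		ki := MVCCKey{Key: keys[i].Key, Timestamp: keys[i].Timestamp}
		kj := MVCCKey{Key: keys[j].Key, Timestamp: keys[j].Timestamp}
		return ki.Less(kj)
	})
	for _, k := range keys {
		fmt.Printf("%s@%d\n", k.Key, k.Timestamp.WallTime) // a@7, a@3, b@5
	}
}
```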
pkg/storage/mvcc.go, line 3262 at r3 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
Nit: extra spaces
Done.
bors r=itsbilal
Build succeeded
The bug that this fixes was introduced in cockroachdb#51184. This bug was fortunately caught by a gc queue test under stressrace where a batch happens to get sent twice and thus there isn't garbage. The specific problem is that the implementation of `ReadWriter` used by `kvserver` is `spanset.Batch`, which returns an error after `SeekLT` if the found key is not within the span, leading to an error being returned and the keys not being properly GC'd. Fixes cockroachdb#51402. Release note: None
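A hedged sketch of that failure mode follows, using a made-up span-enforcing iterator in the spirit of `spanset.Batch` rather than the real implementation: seeking backwards to the predecessor of the first key in the span can land outside the declared bounds, which the wrapper treats as an error.

```go
// Illustrative only: a span-checking iterator that rejects a SeekLT landing
// outside the declared span, which is what happens when seeking to the
// predecessor of the span's first key.
package main

import (
	"errors"
	"fmt"
)

type spanCheckedIter struct {
	start, end string // declared span [start, end)
	keys       []string
	pos        int
	err        error
}

// SeekLT positions the iterator at the last key strictly less than target and
// records an error if that key falls outside the declared span.
func (it *spanCheckedIter) SeekLT(target string) {
	it.pos = -1
	for i, k := range it.keys {
		if k < target {
			it.pos = i
		}
	}
	if it.pos >= 0 && (it.keys[it.pos] < it.start || it.keys[it.pos] >= it.end) {
		it.err = errors.New("key outside declared span")
	}
}

func main() {
	it := &spanCheckedIter{start: "b", end: "d", keys: []string{"a", "b", "c"}}
	it.SeekLT("b") // lands on "a", which is before the declared span
	fmt.Println(it.err)
}
```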
51462: storage: optimize MVCCGarbageCollect for large number of undeleted version r=itsbilal a=ajwerner

This is a better-tested take-2 of #51184.

* The first commit introduces a `SupportsPrev` method to the `Iterator` interface (a sketch follows below).
* The second commit adjusts the benchmark to use a regular batch.
* The third commit adjusts the logic in #51184 to fix the bugs there and adds testing.

The win is now only pronounced on pebble, as might be expected.

51489: opt, cat: rename Table.AllColumnCount to Table.ColumnCount r=RaduBerinde a=RaduBerinde

This change renames AllColumnCount to ColumnCount. In #51400 I meant to rename before merging but I forgot. Release note: None

Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
Co-authored-by: Radu Berinde <radu@cockroachlabs.com>
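For illustration only, here is what a `SupportsPrev` capability check on an iterator interface might look like; the interface and the forward-only implementation are hypothetical, not the actual `storage.Iterator` definition.

```go
// Sketch of advertising whether an iterator can step backwards, so callers
// can fall back to seeks when it cannot. Names are illustrative.
package main

import "fmt"

type Iterator interface {
	SeekGE(key string)
	Next()
	Prev()
	// SupportsPrev reports whether this implementation can iterate backwards.
	SupportsPrev() bool
}

type forwardOnlyIter struct{}

func (forwardOnlyIter) SeekGE(string)      {}
func (forwardOnlyIter) Next()              {}
func (forwardOnlyIter) Prev()              { panic("Prev not supported") }
func (forwardOnlyIter) SupportsPrev() bool { return false }

func main() {
	var it Iterator = forwardOnlyIter{}
	fmt.Println("can iterate backwards:", it.SupportsPrev())
}
```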
This commit adds an optimization to massively reduce the overhead of garbage collecting large numbers of versions. When running garbage collection, we currently iterate through the store to send (Key, Timestamp) pairs in a GCRequest to the store to be evaluated and applied. This can be rather slow when deleting large numbers of keys, particularly due to the need to paginate. The primary motivation for pagination is to ensure that we do not create raft commands with deletions that are too large. In practice, for the tiny values in a sequence, we can GC around 1800 versions/s with cockroachdb#51184 and around 900 without it (though note that in that PR, the more versions exist, the worse the throughput will be). This remains abysmally slow. Using larger batches could be one approach to increase the throughput, but, as it stands, 256 KiB is not a tiny raft command.

This commit instead turns to the ClearRange operation, which can delete all versions of a key with the replication overhead of just two copies. This approach is somewhat controversial because, as @petermattis puts it:

```
We need to be careful about this. Historically, adding many range tombstones
was very bad for performance. I think we resolved most (all?) of those issues,
but I'm still nervous about embracing using range tombstones below the level
of a Range.
```

Nevertheless, the results are enticing. Rather than pinning a core at full utilization for minutes just to clear the versions written to a sequence over the course of a bit more than an hour, we can clear that in ~2 seconds.

The decision to use a ClearRange is controlled by whether an entire batch would be used to clear versions of a single key. This means we'll only send a ClearRange if `<key size> * <num versions> > 256 KiB`. There's an additional sanity check that `<num versions>` is greater than 32 in order to prevent issuing lots of `ClearRange` operations when the cluster has gigantic keys. This new functionality is gated behind both a version and an experimental cluster setting.

I'm skipping the release note because of the experimental flag. Release note: None
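A minimal sketch of that decision heuristic, assuming hypothetical names: `shouldUseClearRange` and the constant names are illustrative, and the real gate additionally sits behind a cluster version and an experimental cluster setting.

```go
// Only fall back to ClearRange when clearing one key's versions would fill an
// entire 256 KiB batch by itself, and never for fewer than 32 versions.
package main

import "fmt"

const (
	maxBatchBytes       = 256 << 10 // 256 KiB pagination limit per GC batch
	minVersionsForClear = 32        // sanity check against gigantic keys
)

// shouldUseClearRange reports whether GC of a single key's garbage versions
// should be issued as one ClearRange instead of many point deletes.
func shouldUseClearRange(keySize, numVersions int) bool {
	return numVersions > minVersionsForClear && keySize*numVersions > maxBatchBytes
}

func main() {
	fmt.Println(shouldUseClearRange(16, 100))    // tiny sequence key, few versions: false
	fmt.Println(shouldUseClearRange(16, 100000)) // tiny key, huge version count: true
	fmt.Println(shouldUseClearRange(1<<20, 4))   // gigantic key, few versions: false
}
```

The 32-version floor keeps a handful of very large values from each turning into their own range tombstone, while the 256 KiB threshold only engages the ClearRange path when point deletes would already fill an entire batch for a single key.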
The first two commits rework the benchmarks a tad. The third commit is the meat.
Touches and maybe fixes #50194.
Release note (performance improvement): Improved the efficiency of garbage
collection when there are a large number of versions of a single key,
commonly found when utilizing sequences.