
storage: optimize MVCCGarbageCollect for large numbers of versions #51184

Merged

Conversation

ajwerner
Contributor

@ajwerner ajwerner commented Jul 9, 2020

The first two commits rework the benchmarks a tad. The third commit is the meat.

Touches and maybe fixes #50194.

Release note (performance improvement): Improved the efficiency of garbage
collection when there are a large number of versions of a single key,
commonly found when utilizing sequences.


@ajwerner ajwerner force-pushed the ajwerner/optimize-MVCCGarbageCollect branch 2 times, most recently from 88b5d8c to 9a96e1d Compare July 9, 2020 04:09
ajwerner added a commit to ajwerner/cockroach that referenced this pull request Jul 9, 2020
This commit adds an optimization to massively reduce the overhead of
garbage collecting large numbers of versions. When running garbage
collection, we currently iterate through the store to send (Key, Timestamp)
pairs in a GCRequest to the store to be evaluated and applied. This can
be rather slow when deleting large numbers of keys, particularly due to the
need to paginate. The primary motivation for pagination is to ensure that
we do not create raft commands with deletions that are too large.
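The pagination constraint described above can be sketched roughly as follows. This is an illustrative toy, not the real kvserver code: the type names, the byte accounting, and the per-batch budget are all assumptions made for the sketch.

```go
package main

import "fmt"

// gcKey is a stand-in for the (Key, Timestamp) pairs sent in a GCRequest.
type gcKey struct {
	key      string
	wallTime int64
}

// paginate splits GC work into batches whose approximate encoded size stays
// under budget, mirroring the need to keep raft commands from growing too
// large. Each entry is costed as its key length plus a fixed-width timestamp.
func paginate(keys []gcKey, budget int) [][]gcKey {
	var batches [][]gcKey
	var cur []gcKey
	curBytes := 0
	for _, k := range keys {
		sz := len(k.key) + 8
		if curBytes+sz > budget && len(cur) > 0 {
			batches = append(batches, cur)
			cur, curBytes = nil, 0
		}
		cur = append(cur, k)
		curBytes += sz
	}
	if len(cur) > 0 {
		batches = append(batches, cur)
	}
	return batches
}

func main() {
	var keys []gcKey
	for i := int64(0); i < 100; i++ {
		keys = append(keys, gcKey{key: "seq-key", wallTime: i})
	}
	// Each entry costs 7+8 = 15 bytes, so a 150-byte budget holds 10 entries.
	fmt.Println(len(paginate(keys, 150))) // 10 batches of 10 versions each
}
```

The point of the sketch is that the batch count, and hence the number of round trips, grows linearly with the number of versions, which is why GC of a long-lived sequence is slow even when each batch is cheap.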

In practice, we find that for the tiny values in a sequence, we can GC around
1800 versions/s with cockroachdb#51184 and around 900 without it (though note
that in that PR, the more versions exist, the worse the throughput will be).
This remains abysmally slow. I imagine that using larger batches could be one
approach to increase the throughput, but, as it stands, 256 KiB is not a tiny
raft command.

This instead turns to the ClearRange operation, which can delete all versions
of a key with the replication overhead of just two copies. This approach is
somewhat controversial because, as @petermattis puts it:

```
We need to be careful about this. Historically, adding many range tombstones
was very bad for performance. I think we resolved most (all?) of those issues,
but I'm still nervous about embracing using range tombstones below the level
of a Range.
```

Nevertheless, the results are enticing. Rather than pinning a core at full
utilization for minutes just to clear the versions written to a sequence over
the course of a bit more than an hour, we can clear that in ~2 seconds.

Release note (performance improvement): Improved performance of garbage
collection in the face of large numbers of versions.
@ajwerner ajwerner force-pushed the ajwerner/optimize-MVCCGarbageCollect branch from 9a96e1d to 3a1e899 Compare July 9, 2020 04:44
@ajwerner ajwerner marked this pull request as ready for review July 9, 2020 18:03
@ajwerner ajwerner requested a review from itsbilal July 9, 2020 18:03
This commit unifies `BenchmarkGarbageCollect` in the style of
`BenchmarkExportToSst` and moves both from engine-specific
files to bench_test.go

Release note: None
Before this change, BenchmarkGarbageCollect always benchmarked collecting
all but the last version of a key. This change expands the test to benchmark
different numbers of deleted versions to highlight that the current
implementation is linear in the number of versions.

Release note: None
Member

@itsbilal itsbilal left a comment


Sorry for the slightly delayed review - missed this last week.

:lgtm: save for one minor comment.

Reviewed 3 of 3 files at r1, 1 of 1 files at r2, 1 of 1 files at r3.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner)


pkg/storage/bench_test.go, line 36 at r1 (raw file):

)

// Note: most benchmarks in this package have an engine-specific Benchmark

👍


pkg/storage/mvcc.go, line 3179 at r3 (raw file):

	// in increasing order.
	sort.Slice(keys, func(i, j int) bool {
		cmp := keys[i].Key.Compare(keys[j].Key)

Why not just do keys[i].Less(keys[j]) ?


pkg/storage/mvcc.go, line 3262 at r3 (raw file):

			// to find the predecessor of gcKey before resorting to seeking.
			//
			// In a	synthetic benchmark where there is one version of garbage and one

Nit: extra spaces

Prior to this change, MVCCGarbageCollect performed a linear scan of all
versions of a key, not just the versions being garbage collected. Given
the pagination of deleting versions above this call, the linear behavior
can result in quadratic runtime of GC when the number of versions vastly
exceeds the page size. The benchmark results demonstrate the change's
effectiveness.

It's worth noting that for a single key with a single version, the change
has a negative performance impact. I suspect this is due to the allocation
of a key in order to construct the iterator. In cases involving more keys,
I theorize that the improvement comes from the fact that the iterator is
never seeked backwards, thanks to the sorting of the keys. It's also worth
noting that since 20.1, the GC queue has been sending keys in the GC
request in reverse order; I anticipate that this sorting is likely a good
thing in that case too.

The stepping optimization seemed important in the microbenchmarks for cases
where most of the data was garbage. Without it, the change had small negative
impact on performance.
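The intuition behind the win can be sketched outside the storage engine: with a key's versions ordered newest-first, the garbage forms a contiguous suffix, so the first garbage version can be located without stepping over every live version. This toy model uses a binary search over a plain slice as a stand-in for the iterator seek the real code performs; it is illustrative only.

```go
package main

import (
	"fmt"
	"sort"
)

// firstGarbageIdx returns the index of the first version at or below the GC
// timestamp, given version timestamps sorted newest-first. sort.Search does
// O(log n) work, in contrast to a linear scan over all live versions.
func firstGarbageIdx(versions []int64, gcTS int64) int {
	return sort.Search(len(versions), func(i int) bool {
		return versions[i] <= gcTS
	})
}

func main() {
	versions := []int64{10, 9, 8, 5, 3} // newest first
	fmt.Println(firstGarbageIdx(versions, 5)) // prints 3: versions 5 and 3 are garbage
}
```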

```
name                                                                                                     old time/op  new time/op  delta
MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1/numVersions=2/deleteVersions=1-24           3.39µs ± 1%  3.96µs ± 0%  +16.99%  (p=0.004 n=6+5)
MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1/numVersions=1024/deleteVersions=1-24         319µs ± 3%    10µs ±12%  -96.88%  (p=0.002 n=6+6)
MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1/numVersions=1024/deleteVersions=16-24        319µs ± 2%    16µs ±10%  -94.95%  (p=0.002 n=6+6)
MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1/numVersions=1024/deleteVersions=32-24        319µs ± 3%    21µs ± 5%  -93.52%  (p=0.002 n=6+6)
MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1/numVersions=1024/deleteVersions=512-24       337µs ± 1%   182µs ± 3%  -46.00%  (p=0.002 n=6+6)
MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1/numVersions=1024/deleteVersions=1015-24      361µs ± 0%   353µs ± 2%   -2.32%  (p=0.010 n=4+6)
MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1/numVersions=1024/deleteVersions=1023-24      361µs ± 3%   350µs ± 2%   -3.14%  (p=0.009 n=6+6)
MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1024/numVersions=2/deleteVersions=1-24        2.00ms ± 3%  2.25ms ± 2%  +12.53%  (p=0.004 n=6+5)
MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1024/numVersions=1024/deleteVersions=1-24      388ms ± 3%    16ms ± 5%  -95.76%  (p=0.002 n=6+6)
MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1024/numVersions=1024/deleteVersions=16-24     387ms ± 1%    27ms ± 3%  -93.14%  (p=0.002 n=6+6)
MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1024/numVersions=1024/deleteVersions=32-24     393ms ± 5%    35ms ± 4%  -91.09%  (p=0.002 n=6+6)
MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1024/numVersions=1024/deleteVersions=512-24    463ms ± 4%   276ms ± 3%  -40.43%  (p=0.004 n=5+6)
MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1024/numVersions=1024/deleteVersions=1015-24   539ms ± 5%   514ms ± 3%   -4.64%  (p=0.016 n=5+5)
MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1024/numVersions=1024/deleteVersions=1023-24   533ms ± 4%   514ms ± 1%     ~     (p=0.093 n=6+6)
MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1/numVersions=2/deleteVersions=1-24            1.97µs ± 3%  2.29µs ± 2%  +16.58%  (p=0.002 n=6+6)
MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1/numVersions=1024/deleteVersions=1-24          139µs ± 1%     5µs ± 6%  -96.40%  (p=0.004 n=5+6)
MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1/numVersions=1024/deleteVersions=16-24         140µs ± 1%     8µs ± 1%  -94.13%  (p=0.004 n=6+5)
MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1/numVersions=1024/deleteVersions=32-24         143µs ± 4%    11µs ± 2%  -92.03%  (p=0.002 n=6+6)
MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1/numVersions=1024/deleteVersions=512-24        178µs ± 9%   109µs ± 1%  -38.75%  (p=0.004 n=6+5)
MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1/numVersions=1024/deleteVersions=1015-24       201µs ± 1%   213µs ± 1%   +5.80%  (p=0.008 n=5+5)
MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1/numVersions=1024/deleteVersions=1023-24       205µs ±11%   215µs ± 6%     ~     (p=0.126 n=5+6)
MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1024/numVersions=2/deleteVersions=1-24         1.43ms ± 1%  1.34ms ± 1%   -5.82%  (p=0.004 n=6+5)
MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1024/numVersions=1024/deleteVersions=1-24       218ms ± 9%     9ms ± 2%  -96.00%  (p=0.002 n=6+6)
MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1024/numVersions=1024/deleteVersions=16-24      216ms ± 3%    15ms ± 2%  -93.19%  (p=0.004 n=5+6)
MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1024/numVersions=1024/deleteVersions=32-24      219ms ± 4%    20ms ± 5%  -90.77%  (p=0.004 n=5+6)
MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1024/numVersions=1024/deleteVersions=512-24     303ms ± 4%   199ms ± 4%  -34.47%  (p=0.004 n=5+6)
MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1024/numVersions=1024/deleteVersions=1015-24    382ms ±16%   363ms ± 8%     ~     (p=0.485 n=6+6)
MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1024/numVersions=1024/deleteVersions=1023-24    363ms ± 4%   354ms ± 4%     ~     (p=0.222 n=5+5)
```

Release note (performance improvement): Improved the efficiency of garbage
collection when there are a large number of versions of a single key,
commonly found when utilizing sequences.
ajwerner added a commit to ajwerner/cockroach that referenced this pull request Jul 13, 2020
@ajwerner ajwerner force-pushed the ajwerner/optimize-MVCCGarbageCollect branch from 3a1e899 to 8e5423b Compare July 13, 2020 19:38
Contributor Author

@ajwerner ajwerner left a comment


Thanks for the review. I changed the comparison function to use MVCCKey.Less.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @itsbilal)


pkg/storage/mvcc.go, line 3179 at r3 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Why not just do keys[i].Less(keys[j]) ?

These keys aren't roachpb.Key but rather roachpb.GCRequest_GCKey. That said, constructing MVCCKeys and using MVCCKey.Less is cleaner, so I changed it to do that.


pkg/storage/mvcc.go, line 3262 at r3 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Nit: extra spaces

Done.

@ajwerner
Contributor Author

bors r=itsbilal

@craig
Contributor

craig bot commented Jul 13, 2020

Build succeeded

@craig craig bot merged commit 6e39e58 into cockroachdb:master Jul 13, 2020
ajwerner added a commit to ajwerner/cockroach that referenced this pull request Jul 14, 2020
The bug that this fixes was introduced in cockroachdb#51184.

This bug was fortunately caught by a gc queue test under stressrace, where
a batch happens to get sent twice and thus there isn't garbage. The specific
problem is that the implementation of `ReadWriter` used by `kvserver` is
`spanset.Batch`, which returns an error after `SeekLT` if the found key is
not within the span; that error propagated up and the keys were not being
properly GC'd.

Fixes cockroachdb#51402.

Release note: None
craig bot pushed a commit that referenced this pull request Jul 16, 2020
51462: storage: optimize MVCCGarbageCollect for large number of undeleted version r=itsbilal a=ajwerner

This is a better-tested take-2 of #51184.

* The first commit introduces a `SupportsPrev` method to the `Iterator` interface.
* The second commit adjusts the benchmark to use a regular batch.
* The third commit adjusts the logic in #51184 to fix the bugs there and adds testing.

The win is now only pronounced on pebble, as might be expected. 

51489: opt, cat: rename Table.AllColumnCount to Table.ColumnCount r=RaduBerinde a=RaduBerinde

This change renames AllColumnCount to ColumnCount. In #51400 I meant
to rename before merging but I forgot.

Release note: None

Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
Co-authored-by: Radu Berinde <radu@cockroachlabs.com>
ajwerner added a commit to ajwerner/cockroach that referenced this pull request Jul 19, 2020
This commit adds an optimization to massively reduce the overhead of
garbage collecting large numbers of versions. When running garbage
collection, we currently iterate through the store to send (Key, Timestamp)
pairs in a GCRequest to the store to be evaluated and applied. This can
be rather slow when deleting large numbers of keys, particularly due to the
need to paginate. The primary motivation for pagination is to ensure that
we do not create raft commands with deletions that are too large.

In practice, we find that for the tiny values in a sequence, we can GC around
1800 versions/s with cockroachdb#51184 and around 900 without it (though note
that in that PR, the more versions exist, the worse the throughput will be).
This remains abysmally slow. I imagine that using larger batches could be one
approach to increase the throughput, but, as it stands, 256 KiB is not a tiny
raft command.

This instead turns to the ClearRange operation, which can delete all versions
of a key with the replication overhead of just two copies. This approach is
somewhat controversial because, as @petermattis puts it:

```
We need to be careful about this. Historically, adding many range tombstones
was very bad for performance. I think we resolved most (all?) of those issues,
but I'm still nervous about embracing using range tombstones below the level
of a Range.
```

Nevertheless, the results are enticing. Rather than pinning a core at full
utilization for minutes just to clear the versions written to a sequence over
the course of a bit more than an hour, we can clear that in ~2 seconds.

The decision to use a ClearRange is controlled by whether an entire batch would
be used to clear versions of a single key. This means we'll only send a
ClearRange if `<key size> * <num versions> > 256 KiB`. There's an additional
sanity check that `<num versions>` is greater than 32, in order to avoid
issuing lots of `ClearRange` operations when the cluster has gigantic keys.

This new functionality is gated behind both a version and an experimental
cluster setting. I'm skipping the release note because of the experimental
flag.

Release note: None
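The threshold logic in the commit message above can be sketched as follows. The function and constant names are illustrative assumptions, not the identifiers used in the real code; only the `256 KiB` budget and the `> 32` version check come from the message itself.

```go
package main

import "fmt"

const (
	// maxBatchBytes is the raft-command byte budget from the commit message.
	maxBatchBytes = 256 << 10 // 256 KiB
	// minVersionsForClearRange guards against issuing many ClearRanges when
	// the cluster has gigantic keys with few versions.
	minVersionsForClearRange = 32
)

// shouldUseClearRange reports whether clearing a key's versions would consume
// an entire batch, in which case a ClearRange is sent instead of point deletes.
func shouldUseClearRange(keySize, numVersions int) bool {
	return keySize*numVersions > maxBatchBytes &&
		numVersions > minVersionsForClearRange
}

func main() {
	fmt.Println(shouldUseClearRange(16, 100000)) // tiny sequence key, many versions: true
	fmt.Println(shouldUseClearRange(1<<20, 10))  // gigantic key, few versions: false
}
```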
ajwerner added a commit to ajwerner/cockroach that referenced this pull request Oct 28, 2020

Successfully merging this pull request may close these issues.

Sequence table GC speed is slower than user write, and which have conflict when user write and gc