[DNM] sstable: add value blocks for storing older versions of a key #1443
base: master
Conversation
Reviewable status: 0 of 13 files reviewed, 1 unresolved discussion (waiting on @itsbilal, @jbowens, and @petermattis)
sstable/reader_test.go, line 1175 at r1 (raw file):
for _, valueSize := range []int{100, 1000, 10000} {
	b.Run(fmt.Sprintf("valueSize=%d", valueSize), func(b *testing.B) {
		for _, versions := range []int{1, 10, 100} {
to put this in context, https://github.com/cockroachlabs/support/issues/1348#issuecomment-985867830 had > 350 versions per key.
When WriterOptions.ValueBlocksAreEnabled is set to true, older versions of a key are written to a sequence of value blocks, and the key contains a valueHandle which is a tuple (valueLen, blockNum, offsetInBlock). The assumption here is that most reads only need the value of the latest version, and many reads that care about an older version only need the value length.

Value blocks are a simple sequence of (varint encoded length, value bytes) tuples such that given the uncompressed value block, the valueHandle can cheaply read the value. The value blocks index connects the blockNum in the valueHandle to the location of the value block. It uses fixed width encoding to avoid the expense of a general purpose key-value block. See the comment at the top of value_block.go for details.

The following are preliminary results from a read benchmark, after some performance tuning. The old numbers are master. The needValue=false cases are the ones where value blocks are expected to help.

- The versions=1 have no values in value blocks, and the slowdown is the extra call to valueBlockReader that needs to subslice to remove the single byte prefix.
- The hasCache=false cases correspond to a cold cache, where there will be additional wasted decompression of values that we don't need (when needValue=false). As expected, when there is an improvement, it is larger with hasCache=false. For example the -97.83% below (almost 50x faster) compared with -79.89%.
- The needValue=true is where the code can be slower, up to 2x. The higher slowdowns occur when the value size is smaller. In such cases more inline values can be packed into an ssblock and the code overhead of decoding the valueHandle, and the value length in the value block (all of these are varints), becomes a significant component.

This is a prototype in that there are no changes to the InternalIterator interface, and the read path only works for singleLevelIterator.
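For illustration, here is a minimal Go sketch of the valueHandle / value block layout described above. The field names, widths, and helper functions are assumptions for this sketch, not the PR's actual code (see value_block.go for that):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// valueHandle points from a key in a data block to its value in a value block.
type valueHandle struct {
	valueLen      uint32 // length of the value, usable without reading the block
	blockNum      uint32 // which value block holds the value
	offsetInBlock uint32 // byte offset of the entry within that block
}

// appendValue appends one (varint encoded length, value bytes) entry to a
// value block and returns the handle for it.
func appendValue(block []byte, blockNum uint32, value []byte) ([]byte, valueHandle) {
	h := valueHandle{
		valueLen:      uint32(len(value)),
		blockNum:      blockNum,
		offsetInBlock: uint32(len(block)),
	}
	block = binary.AppendUvarint(block, uint64(len(value)))
	block = append(block, value...)
	return block, h
}

// readValue retrieves the value for a handle from an uncompressed value block.
func readValue(block []byte, h valueHandle) []byte {
	n, w := binary.Uvarint(block[h.offsetInBlock:])
	start := int(h.offsetInBlock) + w
	return block[start : start+int(n)]
}

func main() {
	var block []byte
	block, h := appendValue(block, 0, []byte("older-version-value"))
	fmt.Printf("%+v %q\n", h, readValue(block, h))
}
```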
name old time/op new time/op delta
ValueBlocks/valueSize=100/versions=1/needValue=false/hasCache=false-16 25.5ns ± 3% 25.9ns ± 2% +1.50% (p=0.028 n=10+10)
ValueBlocks/valueSize=100/versions=1/needValue=false/hasCache=true-16 15.6ns ± 1% 15.5ns ± 2% ~ (p=0.268 n=9+10)
ValueBlocks/valueSize=100/versions=1/needValue=true/hasCache=false-16 27.3ns ± 3% 29.5ns ± 3% +8.11% (p=0.000 n=10+10)
ValueBlocks/valueSize=100/versions=1/needValue=true/hasCache=true-16 17.1ns ± 2% 19.2ns ± 2% +12.74% (p=0.000 n=10+10)
ValueBlocks/valueSize=100/versions=10/needValue=false/hasCache=false-16 26.7ns ± 2% 29.4ns ± 2% +10.46% (p=0.000 n=9+10)
ValueBlocks/valueSize=100/versions=10/needValue=false/hasCache=true-16 15.9ns ± 2% 15.2ns ± 3% -4.63% (p=0.000 n=9+10)
ValueBlocks/valueSize=100/versions=10/needValue=true/hasCache=false-16 26.7ns ± 2% 53.0ns ± 4% +98.79% (p=0.000 n=9+10)
ValueBlocks/valueSize=100/versions=10/needValue=true/hasCache=true-16 16.6ns ± 1% 26.7ns ± 2% +61.05% (p=0.000 n=9+9)
ValueBlocks/valueSize=100/versions=100/needValue=false/hasCache=false-16 28.3ns ± 4% 25.3ns ± 5% -10.74% (p=0.000 n=10+10)
ValueBlocks/valueSize=100/versions=100/needValue=false/hasCache=true-16 15.8ns ± 2% 14.9ns ± 1% -5.66% (p=0.000 n=10+10)
ValueBlocks/valueSize=100/versions=100/needValue=true/hasCache=false-16 29.4ns ± 4% 47.8ns ± 3% +62.46% (p=0.000 n=10+10)
ValueBlocks/valueSize=100/versions=100/needValue=true/hasCache=true-16 16.7ns ± 4% 26.1ns ± 3% +56.04% (p=0.000 n=10+10)
ValueBlocks/valueSize=1000/versions=1/needValue=false/hasCache=false-16 123ns ± 4% 125ns ± 7% ~ (p=0.735 n=9+10)
ValueBlocks/valueSize=1000/versions=1/needValue=false/hasCache=true-16 23.0ns ± 5% 22.9ns ± 5% ~ (p=0.684 n=10+10)
ValueBlocks/valueSize=1000/versions=1/needValue=true/hasCache=false-16 124ns ± 6% 131ns ± 7% +5.76% (p=0.008 n=9+10)
ValueBlocks/valueSize=1000/versions=1/needValue=true/hasCache=true-16 24.3ns ± 4% 26.4ns ± 3% +8.26% (p=0.000 n=10+10)
ValueBlocks/valueSize=1000/versions=10/needValue=false/hasCache=false-16 130ns ± 8% 27ns ± 4% -79.10% (p=0.000 n=10+10)
ValueBlocks/valueSize=1000/versions=10/needValue=false/hasCache=true-16 23.8ns ± 4% 16.6ns ± 2% -30.00% (p=0.000 n=10+10)
ValueBlocks/valueSize=1000/versions=10/needValue=true/hasCache=false-16 128ns ± 9% 164ns ±12% +27.94% (p=0.000 n=10+10)
ValueBlocks/valueSize=1000/versions=10/needValue=true/hasCache=true-16 25.0ns ± 4% 33.0ns ± 2% +32.22% (p=0.000 n=10+10)
ValueBlocks/valueSize=1000/versions=100/needValue=false/hasCache=false-16 123ns ± 9% 28ns ± 3% -76.89% (p=0.000 n=9+10)
ValueBlocks/valueSize=1000/versions=100/needValue=false/hasCache=true-16 23.0ns ± 2% 15.3ns ± 5% -33.36% (p=0.000 n=10+9)
ValueBlocks/valueSize=1000/versions=100/needValue=true/hasCache=false-16 132ns ± 2% 171ns ± 5% +29.24% (p=0.000 n=8+10)
ValueBlocks/valueSize=1000/versions=100/needValue=true/hasCache=true-16 24.3ns ± 3% 32.6ns ± 3% +33.98% (p=0.000 n=10+10)
ValueBlocks/valueSize=10000/versions=1/needValue=false/hasCache=false-16 1.45µs ± 8% 1.35µs ±10% -6.41% (p=0.015 n=10+10)
ValueBlocks/valueSize=10000/versions=1/needValue=false/hasCache=true-16 75.5ns ± 2% 76.7ns ± 5% ~ (p=0.218 n=10+10)
ValueBlocks/valueSize=10000/versions=1/needValue=true/hasCache=false-16 1.34µs ± 3% 1.46µs ±16% +9.03% (p=0.022 n=9+10)
ValueBlocks/valueSize=10000/versions=1/needValue=true/hasCache=true-16 77.0ns ± 3% 79.9ns ± 3% +3.80% (p=0.000 n=9+10)
ValueBlocks/valueSize=10000/versions=10/needValue=false/hasCache=false-16 1.46µs ± 6% 0.13µs ± 3% -91.15% (p=0.000 n=9+9)
ValueBlocks/valueSize=10000/versions=10/needValue=false/hasCache=true-16 76.4ns ± 3% 21.4ns ± 2% -72.06% (p=0.000 n=10+10)
ValueBlocks/valueSize=10000/versions=10/needValue=true/hasCache=false-16 1.47µs ± 8% 1.56µs ± 7% +5.72% (p=0.013 n=9+10)
ValueBlocks/valueSize=10000/versions=10/needValue=true/hasCache=true-16 78.1ns ± 4% 76.1ns ± 2% -2.52% (p=0.009 n=10+10)
ValueBlocks/valueSize=10000/versions=100/needValue=false/hasCache=false-16 1.34µs ± 5% 0.03µs ± 2% -97.83% (p=0.000 n=9+10)
ValueBlocks/valueSize=10000/versions=100/needValue=false/hasCache=true-16 77.0ns ± 2% 15.5ns ± 2% -79.89% (p=0.000 n=8+10)
ValueBlocks/valueSize=10000/versions=100/needValue=true/hasCache=false-16 1.42µs ± 9% 1.49µs ± 2% +5.28% (p=0.007 n=10+9)
ValueBlocks/valueSize=10000/versions=100/needValue=true/hasCache=true-16 78.5ns ± 4% 73.0ns ± 4% -7.01% (p=0.000 n=10+9)

To put these numbers in context of what fraction of bytes in an sstable are in regular data blocks (the rest are in value blocks):

value-size: 100,versions: 1,data-block-bytes: 1185521,total-bytes: 1186984,fraction-in-data-blocks:1.00
value-size: 100,versions: 10,data-block-bytes: 174822,total-bytes: 1085218,fraction-in-data-blocks:0.16
value-size: 100,versions:100,data-block-bytes: 82118,total-bytes: 1083393,fraction-in-data-blocks:0.08
value-size: 1000,versions: 1,data-block-bytes: 10197788,total-bytes: 10203513,fraction-in-data-blocks:1.00
value-size: 1000,versions: 10,data-block-bytes: 1199435,total-bytes: 10222244,fraction-in-data-blocks:0.12
value-size: 1000,versions:100,data-block-bytes: 169931,total-bytes: 10094442,fraction-in-data-blocks:0.02
value-size:10000,versions: 1,data-block-bytes:100246534,total-bytes:100292713,fraction-in-data-blocks:1.00
value-size:10000,versions: 10,data-block-bytes: 10204053,total-bytes:100256921,fraction-in-data-blocks:0.10
value-size:10000,versions:100,data-block-bytes: 1198846,total-bytes:100252245,fraction-in-data-blocks:0.01

Informs cockroachdb#1170
Force-pushed from cab59da to 1df3b3b
Reviewable status: 0 of 13 files reviewed, 2 unresolved discussions (waiting on @itsbilal, @jbowens, and @sumeerbhola)
sstable/value_block.go, line 77 at r3 (raw file):
// which is modest. Therefore, we don't support more than one value index block.
// The metaindex block contains the valueBlockIndexHandle which in addition to
// the BlockHandle also specifies the widths of these tuple fields.
It would be helpful to have an ascii-art diagram that describes the structure here.
If I'm understanding the above correctly, the data block still contains keys for all of the historic versions. The space for those keys can be significant as well. Did you try to imagine any designs where the keys and values are stored elsewhere for historic versions. For example, the newest key could be stored in the data block along with its inline value and a pointer to another data block where subsequent versions of the key are stored. Iteration would become a bit more complex as iterating in a data block potentially has to move to another data block before popping back to the current one. This is just a drive-by thought. I haven't put much time into thinking it through or truly understanding what you've done in this PR.
This eliminates the value length encoding/decoding cost. Latest benchmark (old is master). We see some unexpected improvement in the needValue=true case too when the valueSize is large (10K).

name old time/op new time/op delta
ValueBlocks/valueSize=100/versions=1/needValue=false/hasCache=false-16 25.5ns ± 3% 25.8ns ± 3% ~ (p=0.118 n=10+10)
ValueBlocks/valueSize=100/versions=1/needValue=false/hasCache=true-16 15.6ns ± 1% 15.4ns ± 3% ~ (p=0.117 n=9+10)
ValueBlocks/valueSize=100/versions=1/needValue=true/hasCache=false-16 27.3ns ± 3% 28.5ns ± 2% +4.28% (p=0.000 n=10+9)
ValueBlocks/valueSize=100/versions=1/needValue=true/hasCache=true-16 17.1ns ± 2% 18.5ns ± 3% +8.60% (p=0.000 n=10+10)
ValueBlocks/valueSize=100/versions=10/needValue=false/hasCache=false-16 26.7ns ± 2% 29.2ns ± 4% +9.64% (p=0.000 n=9+10)
ValueBlocks/valueSize=100/versions=10/needValue=false/hasCache=true-16 15.9ns ± 2% 15.2ns ± 2% -4.31% (p=0.000 n=9+10)
ValueBlocks/valueSize=100/versions=10/needValue=true/hasCache=false-16 26.7ns ± 2% 46.3ns ± 4% +73.42% (p=0.000 n=9+10)
ValueBlocks/valueSize=100/versions=10/needValue=true/hasCache=true-16 16.6ns ± 1% 21.9ns ± 1% +32.13% (p=0.000 n=9+10)
ValueBlocks/valueSize=100/versions=100/needValue=false/hasCache=false-16 28.3ns ± 4% 24.8ns ± 2% -12.55% (p=0.000 n=10+10)
ValueBlocks/valueSize=100/versions=100/needValue=false/hasCache=true-16 15.8ns ± 2% 14.7ns ± 1% -6.68% (p=0.000 n=10+9)
ValueBlocks/valueSize=100/versions=100/needValue=true/hasCache=false-16 29.4ns ± 4% 43.2ns ± 2% +46.91% (p=0.000 n=10+8)
ValueBlocks/valueSize=100/versions=100/needValue=true/hasCache=true-16 16.7ns ± 4% 21.4ns ± 2% +28.37% (p=0.000 n=10+10)
ValueBlocks/valueSize=1000/versions=1/needValue=false/hasCache=false-16 123ns ± 4% 109ns ± 3% -11.54% (p=0.000 n=9+9)
ValueBlocks/valueSize=1000/versions=1/needValue=false/hasCache=true-16 23.0ns ± 5% 22.4ns ± 1% -2.90% (p=0.016 n=10+8)
ValueBlocks/valueSize=1000/versions=1/needValue=true/hasCache=false-16 124ns ± 6% 118ns ± 4% -4.62% (p=0.004 n=9+10)
ValueBlocks/valueSize=1000/versions=1/needValue=true/hasCache=true-16 24.3ns ± 4% 25.8ns ± 3% +5.87% (p=0.000 n=10+10)
ValueBlocks/valueSize=1000/versions=10/needValue=false/hasCache=false-16 130ns ± 8% 28ns ± 5% -78.73% (p=0.000 n=10+10)
ValueBlocks/valueSize=1000/versions=10/needValue=false/hasCache=true-16 23.8ns ± 4% 17.1ns ± 4% -28.06% (p=0.000 n=10+10)
ValueBlocks/valueSize=1000/versions=10/needValue=true/hasCache=false-16 128ns ± 9% 139ns ± 4% +8.14% (p=0.000 n=10+10)
ValueBlocks/valueSize=1000/versions=10/needValue=true/hasCache=true-16 25.0ns ± 4% 26.6ns ± 3% +6.44% (p=0.000 n=10+10)
ValueBlocks/valueSize=1000/versions=100/needValue=false/hasCache=false-16 123ns ± 9% 29ns ± 5% -76.28% (p=0.000 n=9+10)
ValueBlocks/valueSize=1000/versions=100/needValue=false/hasCache=true-16 23.0ns ± 2% 14.9ns ± 3% -35.04% (p=0.000 n=10+9)
ValueBlocks/valueSize=1000/versions=100/needValue=true/hasCache=false-16 132ns ± 2% 153ns ± 5% +15.42% (p=0.000 n=8+10)
ValueBlocks/valueSize=1000/versions=100/needValue=true/hasCache=true-16 24.3ns ± 3% 25.4ns ± 2% +4.54% (p=0.000 n=10+10)
ValueBlocks/valueSize=10000/versions=1/needValue=false/hasCache=false-16 1.45µs ± 8% 1.24µs ± 6% -14.53% (p=0.000 n=10+10)
ValueBlocks/valueSize=10000/versions=1/needValue=false/hasCache=true-16 75.5ns ± 2% 72.0ns ± 2% -4.62% (p=0.000 n=10+10)
ValueBlocks/valueSize=10000/versions=1/needValue=true/hasCache=false-16 1.34µs ± 3% 1.30µs ± 6% ~ (p=0.117 n=9+10)
ValueBlocks/valueSize=10000/versions=1/needValue=true/hasCache=true-16 77.0ns ± 3% 76.4ns ± 5% ~ (p=0.447 n=9+10)
ValueBlocks/valueSize=10000/versions=10/needValue=false/hasCache=false-16 1.46µs ± 6% 0.12µs ± 5% -92.05% (p=0.000 n=9+10)
ValueBlocks/valueSize=10000/versions=10/needValue=false/hasCache=true-16 76.4ns ± 3% 21.9ns ± 3% -71.38% (p=0.000 n=10+10)
ValueBlocks/valueSize=10000/versions=10/needValue=true/hasCache=false-16 1.47µs ± 8% 1.42µs ±11% ~ (p=0.113 n=9+10)
ValueBlocks/valueSize=10000/versions=10/needValue=true/hasCache=true-16 78.1ns ± 4% 58.4ns ± 4% -25.28% (p=0.000 n=10+10)
ValueBlocks/valueSize=10000/versions=100/needValue=false/hasCache=false-16 1.34µs ± 5% 0.03µs ± 3% -97.91% (p=0.000 n=9+9)
ValueBlocks/valueSize=10000/versions=100/needValue=false/hasCache=true-16 77.0ns ± 2% 15.9ns ± 4% -79.39% (p=0.000 n=8+10)
ValueBlocks/valueSize=10000/versions=100/needValue=true/hasCache=false-16 1.42µs ± 9% 1.46µs ± 7% ~ (p=0.118 n=10+10)
ValueBlocks/valueSize=10000/versions=100/needValue=true/hasCache=true-16 78.5ns ± 4% 54.0ns ± 4% -31.18% (p=0.000 n=10+10)
I've added two commits that can be ignored -- they are simple performance optimizations. The last commit also has the latest benchmark numbers -- the worst case slowdown is 1.75x and the best case speedup is 48x.
Reviewable status: 0 of 13 files reviewed, 2 unresolved discussions (waiting on @itsbilal, @jbowens, and @petermattis)
sstable/value_block.go, line 77 at r3 (raw file):
Previously, petermattis (Peter Mattis) wrote…
It would be helpful to have an ascii-art diagram that describes the structure here.
If I'm understanding the above correctly, the data block still contains keys for all of the historic versions. The space for those keys can be significant as well. Did you try to imagine any designs where the keys and values are stored elsewhere for historic versions. For example, the newest key could be stored in the data block along with its inline value and a pointer to another data block where subsequent versions of the key are stored. Iteration would become a bit more complex as iterating in a data block potentially has to move to another data block before popping back to the current one. This is just a drive-by thought. I haven't put much time into thinking it through or truly understanding what you've done in this PR.
I didn't consider moving the keys elsewhere since (a) if there are many versions of a key, we should be getting a lot out of prefix compression (modulo restarts), (b) hopefully keys are not very long (I realize this is not a great assumption for sql tables and indexes), (c) the iterator tree and the top-level iterators make decisions based on userkey comparisons, and hiding userkeys completely would be quite complicated -- anything being moved to the value blocks is not necessarily an older version since the latest version could have been deleted in a higher level sstable (this PR is doing what I sketched out at a high-level in #1170 (comment)).
name old time/op new time/op delta
ValueBlocksDeletedValue/valueSize=100/needValue=false/hasCache=false-16 40.7ns ± 1% 22.8ns ± 2% -44.02% (p=0.000 n=9+10)
ValueBlocksDeletedValue/valueSize=100/needValue=false/hasCache=true-16 15.2ns ± 3% 14.7ns ± 4% -3.09% (p=0.007 n=10+10)
ValueBlocksDeletedValue/valueSize=100/needValue=true/hasCache=false-16 41.2ns ± 4% 33.0ns ± 5% -20.01% (p=0.000 n=10+10)
ValueBlocksDeletedValue/valueSize=100/needValue=true/hasCache=true-16 15.9ns ± 2% 19.2ns ± 4% +20.85% (p=0.000 n=10+10)
ValueBlocksDeletedValue/valueSize=1000/needValue=false/hasCache=false-16 65.9ns ± 3% 24.1ns ± 4% -63.50% (p=0.000 n=10+10)
ValueBlocksDeletedValue/valueSize=1000/needValue=false/hasCache=true-16 19.3ns ± 5% 14.5ns ± 3% -24.99% (p=0.000 n=10+10)
ValueBlocksDeletedValue/valueSize=1000/needValue=true/hasCache=false-16 65.9ns ± 6% 90.0ns ±10% +36.61% (p=0.000 n=10+10)
ValueBlocksDeletedValue/valueSize=1000/needValue=true/hasCache=true-16 19.6ns ± 4% 21.0ns ± 4% +7.51% (p=0.000 n=10+10)
ValueBlocksDeletedValue/valueSize=10000/needValue=false/hasCache=false-16 630ns ± 8% 23ns ± 3% -96.34% (p=0.000 n=10+10)
ValueBlocksDeletedValue/valueSize=10000/needValue=false/hasCache=true-16 43.2ns ± 2% 14.7ns ± 1% -66.04% (p=0.000 n=9+8)
ValueBlocksDeletedValue/valueSize=10000/needValue=true/hasCache=false-16 680ns ± 4% 729ns ± 8% +7.19% (p=0.007 n=10+10)
ValueBlocksDeletedValue/valueSize=10000/needValue=true/hasCache=true-16 43.3ns ± 2% 37.3ns ± 3% -13.92% (p=0.000 n=9+10)

value-size: 100,data-block-bytes: 89448,total-bytes: 1090925,fraction-in-data-blocks:0.08
value-size: 1000,data-block-bytes: 95216,total-bytes: 10100022,fraction-in-data-blocks:0.01
value-size:10000,data-block-bytes: 105246,total-bytes:100138926,fraction-in-data-blocks:0.00

The time savings for the needValue=false/hasCache=false cases are less than the byte savings (see fraction-in-data-blocks).
Reviewable status: 0 of 13 files reviewed, 2 unresolved discussions (waiting on @itsbilal, @jbowens, and @sumeerbhola)
sstable/value_block.go, line 77 at r3 (raw file):
Previously, sumeerbhola wrote…
I didn't consider moving the keys elsewhere since (a) if there are many versions of a key, we should be getting a lot out of prefix compression (modulo restarts), (b) hopefully keys are not very long (I realize this is not a great assumption for sql tables and indexes), (c) the iterator tree and the top-level iterators make decisions based on userkey comparisons, and hiding userkeys completely would be quite complicated -- anything being moved to the value blocks is not necessarily an older version since the latest version could have been deleted in a higher level sstable (this PR is doing what I sketched out at a high-level in #1170 (comment)).
Prefix compression will definitely save a lot of space for different versions of the same key, but even with that compression the space usage isn't trivial and decoding the prefix compressed keys has overhead that shows up on scan benchmarks. For many SQL tables, I think the keys and values are approximately the same size (it would be interesting to verify this, but that is my recollection from looking at TPC-C data at one point). My thinking for storing both the keys/values for historic versions out-of-line is that an sstable needs to have fast access to the newest key/value for a particular userkey, but the historic versions can be slower to access. For a queue-like workload, I'd expect an L6 sstable to have the majority of its data in the slower-to-access region.
My mental model for how we could completely separate older versions is to model a logical sstable as two physical sstables (though we could also store the older versions in separate blocks of the same sstable). sstable.Writer.Add would add the first version of the key to the "upper" (newer) table and the subsequent versions of the key to the "lower" (older) table. A special sstable.Iterator would blend the two tables. We'd need to expose an Iterator.NextPrefix method which would be akin to MVCCIterator.NextKey. If we're iterating over the latest version we never even have to touch the older version blocks, and that lack of decoding of the older version keys would be very significant. The MVCC scanner "iters before seek" mechanism would be pushed into NextPrefix and we'd in turn push some of that logic down into the leaf iterators. For sstables with separated historic versions, the sstable NextPrefix would be almost as fast as a single Next and we could avoid the slow SeekGE calls that destroy scan performance. The downside to this layout is that if you do need to access a historic version you incur the penalty of another Seek.
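For concreteness, a rough Go sketch of the iterator shape being proposed here; the interface and method names are hypothetical (the comment itself notes NextPrefix would need to be exposed), so treat this as a sketch of the idea rather than an existing pebble API:

```go
// Hypothetical internal iterator extension sketching the NextPrefix idea:
// jump to the first entry whose user-key prefix differs from the current
// one, without decoding the intervening older versions.
type internalIteratorWithNextPrefix interface {
	// Next advances to the next entry (existing behavior).
	Next() (key, value []byte, ok bool)
	// NextPrefix advances to the first entry with a different user-key
	// prefix. For sstables with separated historic versions this can be
	// nearly as cheap as a single Next, replacing the Next-then-SeekGE
	// ("iters before seek") pattern used by the MVCC scanner.
	NextPrefix() (key, value []byte, ok bool)
}
```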
Yeah, I think we can avoid the second sstable-seek, and doing so is likely necessary for performance. The continuation handle approach would still suffer from a potential disk seek. On the other hand, by moving the keys out of the scan path we're in the optimal design space for scanning over the most recent versions of the keys. With the separated historic value approach a scan would still see all of the data for the historic keys.

I'm unconvinced that we'll frequently see data blocks that only contain historic keys. We target a data block size of 32 KB. So on average we'll have 16 KB of data block remaining when we start storing historic keys. If I'm remembering correctly, in TPCC keys are ~100 bytes in size. With prefix compression we're only storing that 100 bytes every 16th key, and for the remainder of the keys we're storing the suffix. So figure ~15 bytes per key on average for the historic keys (I'm estimating 9 bytes for the suffix) plus 9 bytes for the value handle. We can store 682 historic versions for such keys in 16 KB, so it would only be tables with truly massive numbers of historic versions that would see a benefit from being able to skip data blocks that contain only historic versions. Additionally, these data blocks would sequentially follow in the sstable, so readahead would be reading them from disk even if they are skipped.

I'm not attached to the two-sstable approach, or to other ideas of completely separating historic keys and values. Seems like there are perf trade-offs with either approach (e.g. GC needs access to the value sizes, but not the values).
This statement wasn't correct. In a queue-like workload, a record is added once (to the head of the queue), and then deleted once (from the tail of the queue). The problem with queue-like workloads is that there is a huge prefix of MVCC deletion tombstones that have to be iterated over at the tail in order to find the first live row. I don't think any of the approaches discussed in this PR address this problem.

PS I'm quite excited that we have some viable approaches to speeding up scanning tables that have many historic versions.
NextSkipOlderVersions is currently a best-effort attempt that skips older SET versions only for sstables written using ValueBlocksAreEnabled. The benchmark tests this with long keys and values, 1000 bytes each, where consecutive keys with different prefixes are completely random so unlikely to have any significant shared prefix. The versions and BlockRestartInterval are varied:

versions: 1,restart-interval:16,data-block-bytes: 20177767,total-bytes: 20188429,fraction-in-data-blocks:1.00
versions: 1,restart-interval:32,data-block-bytes: 20177767,total-bytes: 20188429,fraction-in-data-blocks:1.00
versions: 1,restart-interval:64,data-block-bytes: 20177767,total-bytes: 20188429,fraction-in-data-blocks:1.00
versions: 2,restart-interval:16,data-block-bytes: 10863502,total-bytes: 16208363,fraction-in-data-blocks:0.67
versions: 2,restart-interval:32,data-block-bytes: 10529835,total-bytes: 15874694,fraction-in-data-blocks:0.66
versions: 2,restart-interval:64,data-block-bytes: 10529835,total-bytes: 15874694,fraction-in-data-blocks:0.66
versions: 10,restart-interval:16,data-block-bytes: 2356652,total-bytes: 11453297,fraction-in-data-blocks:0.21
versions: 10,restart-interval:32,data-block-bytes: 2423386,total-bytes: 11469656,fraction-in-data-blocks:0.21
versions: 10,restart-interval:64,data-block-bytes: 2429837,total-bytes: 11512104,fraction-in-data-blocks:0.21
versions:100,restart-interval:16,data-block-bytes: 345979,total-bytes: 10284363,fraction-in-data-blocks:0.03
versions:100,restart-interval:32,data-block-bytes: 318700,total-bytes: 10246809,fraction-in-data-blocks:0.03
versions:100,restart-interval:64,data-block-bytes: 298197,total-bytes: 10221192,fraction-in-data-blocks:0.03

The results are:

BenchmarkValueBlocksNextSkipOlderVersions/versions=1/restartInterval=16/hasCache=true-16 10406703 108.7 ns/op 0 B/op 0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=1/restartInterval=32/hasCache=true-16 11452221 122.4 ns/op 0 B/op 0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=1/restartInterval=64/hasCache=true-16 9762536 115.9 ns/op 0 B/op 0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=2/restartInterval=16/hasCache=true-16 14293557 87.28 ns/op 0 B/op 0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=2/restartInterval=32/hasCache=true-16 15624984 79.73 ns/op 0 B/op 0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=2/restartInterval=64/hasCache=true-16 15065860 80.10 ns/op 0 B/op 0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=10/restartInterval=16/hasCache=true-16 7101529 165.1 ns/op 0 B/op 0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=10/restartInterval=32/hasCache=true-16 7561116 155.1 ns/op 0 B/op 0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=10/restartInterval=64/hasCache=true-16 7571413 149.7 ns/op 0 B/op 0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=100/restartInterval=16/hasCache=true-16 1000000 1112 ns/op 0 B/op 0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=100/restartInterval=32/hasCache=true-16 1211478 1040 ns/op 0 B/op 0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=100/restartInterval=64/hasCache=true-16 1238484 1019 ns/op 0 B/op 0 allocs/op

- The versions=100 cases are ~10x slower than versions=1. A naive implementation would be expected to be 100x slower, but 10x slower is not great; a ratio closer to 1 is desirable. Looking at cpu profiles for BenchmarkValueBlocksNextSkipOlderVersions/versions=100/restartInterval=16/hasCache=true-16, readEntryOrSkipIfOldVersion is ~90% of the time. This includes having to decode various varints and assemble fullKey. Assembling the fullKey is ~43%, even though it is O(key-length) only when there is a restart or the prefix changes. So the throughput is a direct result of the inherent limitation of storing all keys in the data blocks.
- The versions=10 cases are quite good -- ~1.5x slower than versions=1.
I've added a new commit with results for NextSkipOlderVersions (see above).
I thought some more about removing older version keys out of the data blocks: there are places that do seeks using a key that has a non-empty version in CockroachDB (intent resolution; pebbleMVCCScanner using the read timestamp; deletion step of GC that seeks to the version being deleted). To support these seeks, we would need to be able to accurately position among the older versions, which adds more complexity to how these older versions are stored and indexed. So my current thinking is that removing older version keys out of the data blocks is too risky (from a performance perspective when older versions are needed) and too complicated as a first iteration over the sstable format. @petermattis let me know what you think.
…o key prefix compression

10 versions is now ~1x of 1 version and 100 versions is ~6x

BenchmarkValueBlocksNextSkipOlderVersions/versions=1/restartInterval=16/hasCache=true-16 9323478 108.9 ns/op 0 B/op 0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=1/restartInterval=32/hasCache=true-16 11679625 103.3 ns/op 0 B/op 0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=1/restartInterval=64/hasCache=true-16 11009826 110.7 ns/op 0 B/op 0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=2/restartInterval=16/hasCache=true-16 13600150 74.92 ns/op 0 B/op 0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=2/restartInterval=32/hasCache=true-16 13882718 72.97 ns/op 0 B/op 0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=2/restartInterval=64/hasCache=true-16 13649173 81.17 ns/op 0 B/op 0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=10/restartInterval=16/hasCache=true-16 10129204 113.3 ns/op 0 B/op 0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=10/restartInterval=32/hasCache=true-16 10582024 109.6 ns/op 0 B/op 0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=10/restartInterval=64/hasCache=true-16 10869471 107.7 ns/op 0 B/op 0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=100/restartInterval=16/hasCache=true-16 1934696 643.7 ns/op 0 B/op 0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=100/restartInterval=32/hasCache=true-16 2090880 567.4 ns/op 0 B/op 0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=100/restartInterval=64/hasCache=true-16 2177060 554.1 ns/op 0 B/op 0 allocs/op
Added another commit that reduces 10 versions to 1x of 1 version and 100 versions to 6x of 1 version.
Most of the cost (75-80%) in the 100 versions case is in decoding varints as we step through all the key versions.
Most of the cost (75-80%) in the 100 versions case is in decoding varints as we step through all the key versions.
We can avoid this cost by dynamically switching (at read time) from stepping through each key to stepping through restarts. We can correctly (and cheaply) do the latter by storing a byte (bool) (at write time) with each restart key that indicates that all intervening keys since the previous restart in the same block are SETs, and the previous restart has the same prefix.
I've seen varint decoding show up on sstable iterator benchmarks as well. I've wondered in the past if there is a better base block layout, possibly using a different varint encoding scheme. I did some experimentation in this area but never had satisfying results.
Your idea to mark restart keys is interesting. Would that bool be placed in the restart array? You could store this bool in the low order bit of the restart offset so that it would usually not consume any additional storage (though there would be an extra shift while decoding from the restart array). Another idea would be to store some sort of pointer to the next userkey with the value for the first userkey. That would be more complicated, but would make skipping larger numbers of versions very cheap.
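A minimal sketch of the low-order-bit packing suggested above, assuming 32-bit restart offsets as in the current block format; the flag semantics follow the comment earlier in this thread and nothing in the PR implements this yet:

```go
// encodeRestart packs a "all entries since the previous restart are SETs
// with the same key prefix" flag into the low bit of a restart offset.
// Shifting the offset left by one halves the addressable range, which is
// still ample for ~32KB data blocks.
func encodeRestart(offset uint32, sameKeyPrefixSets bool) uint32 {
	v := offset << 1
	if sameKeyPrefixSets {
		v |= 1
	}
	return v
}

// decodeRestart reverses encodeRestart; the extra shift is the decode cost
// mentioned above.
func decodeRestart(v uint32) (offset uint32, sameKeyPrefixSets bool) {
	return v >> 1, v&1 == 1
}
```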
So my current thinking is that removing older version keys out of the data blocks is too risky (from a performance perspective when older versions are needed) and too complicated as a first iteration over the sstable format. @petermattis let me know what you think.
Your performance numbers are compelling. I'm good with continuing down the path you started on. This PR is a bit ginormous, though. Not sure where to concretely start reviewing it. Perhaps an RFC is in order.
Reviewable status: 0 of 13 files reviewed, 3 unresolved discussions (waiting on @itsbilal, @jbowens, and @sumeerbhola)
-- commits, line 177 at r7:
1000 byte values feel larger than what I'd expect, especially for keys that are being overwritten frequently and thus have a lot of historic versions. I wonder if we can track down some actual value sizes from customer workloads that have experienced problems with historic versions. Or add a variant of these benchmarks for values that are 10 bytes in size (i.e. just a single column).
The only cheap thing I've encountered is fixed-width encoding, based on knowledge of the maximum value in some context (like a file), but then one has to store the width somewhere. I did that here for the value blocks index for a different reason (because I wanted to calculate the offset based on the blocknum).
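A tiny sketch of why the fixed-width encoding keeps the value blocks index cheap: the entry for a given blockNum is found by arithmetic rather than by a key search. The entry layout and width here are assumptions for the sketch; per the quoted comment, the real widths are carried in the valueBlockIndexHandle:

```go
// valueBlockIndexEntry returns the fixed-width index entry for blockNum.
// Each entry encodes the location of one value block; because every entry
// has the same width, no binary search or varint decoding is needed.
func valueBlockIndexEntry(index []byte, blockNum, entryWidth int) []byte {
	off := blockNum * entryWidth
	return index[off : off+entryWidth]
}
```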
I really like this idea! I was bothered by wasting a whole byte for a bool. We definitely don't need 32 bits for restart offsets (even if we end up having larger blocks for disaggregated shared sstables, they would still be O(1MB)).
Yes, I was just trying to be extreme in this benchmark. I can look into historic reports of customer problems in this area to see if they mention the schema.
Sorry, this degenerated into a series of hacky experiments piled onto a baseline commit. I agree this should get a design doc/RFC now that we have a design direction that seems to work. After that one can start extracting parts from here into smaller PRs for review.
What I experimented with were group-varint encodings. Stream VByte is the latest hotness in that area that I'm aware of. Block records currently have 3 varints: the shared key prefix length, the unshared key length, and the value length. On the surface that seems amenable to some form of group varint encoding. But what I found in practice is that this didn't make a measurable difference on iteration speed, even though varint decoding was faster.
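For reference, a sketch of the three varints at the head of each block record being discussed; this mirrors the LevelDB-style record layout and is a simplification, not pebble's actual decoding code:

```go
import "encoding/binary"

// decodeEntryHeader reads the three varints at the head of a data-block
// record: shared key prefix length, unshared key suffix length, and value
// length. The key suffix and the value bytes follow immediately after, and
// this per-entry decoding is where the varint cost shows up in profiles.
func decodeEntryHeader(b []byte) (shared, unshared, valueLen uint64, rest []byte) {
	var n int
	shared, n = binary.Uvarint(b)
	b = b[n:]
	unshared, n = binary.Uvarint(b)
	b = b[n:]
	valueLen, n = binary.Uvarint(b)
	return shared, unshared, valueLen, b[n:]
}
```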
Thanks for the links and the review!
I totally agree that the approach here doesn't solve the problem. But until we offer first-class queue support, folks who are still doing naive queues and happen to have small keys (like UUIDs) and large values (diffs/log entries/...) will at least see a reduction in the volume of bytes they need to iterate over (which may not fit well in the block cache). The benchmark for this case (from a later commit) shows 99% of bytes in the value blocks for 1000 byte values. The cpu saving is less. The benchmark is not paying any IO latency (due to using a mem FS).
One could argue that this just delays when a user notices an unscalable approach, which is not a desirable outcome :)
I had a realization today: storing the value separately from the key is exactly what WiscKey does. Only in this case you're only storing historic values separately. For large values, there is an argument to be made that we do this for all values. And we could also condition whether a value is stored separately based on the size of the value. I don't recall the threshold WiscKey uses, but I believe Badger used a threshold of 128 bytes.
Storing the values separately within the same sstable instead of a separate log (like WiscKey) does eliminate a lot of WiscKey's write amplification benefits, because a WiscKey compaction avoids rewriting values.
These are all good points. There was a deliberate choice here not to go as far as WiscKey, and to give up (for now) on the write amp benefits:
I wasn't meaning to imply that we should go the full WiscKey approach. #112 gives an overview of what that would involve. I very much like the simplification taken in this PR. My point was that there may be some learnings from WiscKey that apply here, such as having a size threshold for deciding when to store values inline vs out-of-line. And considering storing the most recent value out of line if its size is large.
Good point. I suppose we could use something like a cumulative size threshold. I'll think some more about it.
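A toy sketch of the kind of threshold policy being floated here; the constant and the function are purely hypothetical and nothing in this PR implements them (the 128-byte figure is the Badger threshold mentioned earlier, used only as a placeholder):

```go
// valueSeparationThreshold is a hypothetical cutoff above which even the
// newest version of a key would be stored in a value block.
const valueSeparationThreshold = 128

// storeOutOfLine sketches the decision: older versions always go to value
// blocks (as in this PR), and the newest version goes out of line only if
// it exceeds the threshold (the WiscKey-inspired extension being discussed).
func storeOutOfLine(isNewestVersion bool, valueLen int) bool {
	if !isNewestVersion {
		return true
	}
	return valueLen >= valueSeparationThreshold
}
```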
reminder: with MVCCValue, the stats adjustment code needs both the encoded size of the (localtimestamp, value) and whether the underlying value was empty (indicating an empty tombstone). We can use a bit for the latter and keep it in the key section.
Coming back to the discussion about group varint encoding in #1443 (comment)
@sumeerbhola - shall we close this out?
When WriterOptions.ValueBlocksAreEnabled is set to true, older
versions of a key are written to a sequence of value blocks, and
the key contains a valueHandle which is a tuple
(valueLen, blockNum, offsetInBlock). The assumption here is
that most reads only need the value of the latest version, and
many reads that care about an older version only need the value
length.
Value blocks are a simple sequence of (varint encoded length,
value bytes) tuples such that given the uncompressed value
block, the valueHandle can cheaply read the value. The value
blocks index connects the blockNum in the valueHandle to the
location of the value block. It uses fixed width encoding to
avoid the expense of a general purpose key-value block.
See the comment at the top of value_block.go for details.
The following are preliminary results from a read benchmark,
after some performance tuning. The old numbers are master.
The needValue=false cases are the ones where value blocks are
expected to help.
- The versions=1 have no values in value blocks, and the slowdown
is the extra call to valueBlockReader that needs to subslice
to remove the single byte prefix.
- The hasCache=false cases correspond to a cold cache, where there
will be additional wasted decompression of values that we don't
need (when needValue=false). As expected, when there is an
improvement, it is larger with hasCache=false. For example the
-97.83% below (almost 50x faster) compared with -79.89%.
- The needValue=true is where the code can be slower, up to 2x.
The higher slowdowns occur when the value size is smaller. In
such cases more inline values can be packed into an ssblock and
the code overhead of decoding the valueHandle, and the value
length in the value block (all of these are varints), becomes
a significant component.
This is a prototype in that there are no changes to the
InternalIterator interface, and the read path only works for
singleLevelIterator.
Informs #1170