[DNM] sstable: add value blocks for storing older versions of a key #1443

Open · wants to merge 6 commits into base: master
Conversation

sumeerbhola (Collaborator)

When WriterOptions.ValueBlocksAreEnabled is set to true, older
versions of a key are written to a sequence of value blocks, and
the key contains a valueHandle which is a tuple
(valueLen, blockNum, offsetInBlock). The assumption here is
that most reads only need the value of the latest version, and
many reads that care about an older version only need the value
length.

Value blocks are a simple sequence of (varint encoded length,
value bytes) tuples such that given the uncompressed value
block, the valueHandle can cheaply read the value. The value
blocks index connects the blockNum in the valueHandle to the
location of the value block. It uses fixed width encoding to
avoid the expense of a general purpose key-value block.

See the comment at the top of value_block.go for details.
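
Below is a minimal Go sketch of the handle and value-block layout described above. The type, field, and function names are illustrative assumptions; the comment at the top of value_block.go is the authoritative reference.

package sketch

import "encoding/binary"

// valueHandle is a sketch of the tuple stored in the data block in place of
// an older version's value. Field names are assumptions for illustration;
// value_block.go in the PR is authoritative.
type valueHandle struct {
	valueLen      uint32 // cached so readers that need only the length never touch the value block
	blockNum      uint32 // which value block holds the value
	offsetInBlock uint32 // offset of the (length, value) tuple within that block
}

// readValue resolves a handle against an uncompressed value block, which is
// just a concatenation of (varint-encoded length, value bytes) tuples.
func readValue(valueBlock []byte, h valueHandle) []byte {
	// The offset points at the varint length preceding the value bytes.
	length, n := binary.Uvarint(valueBlock[h.offsetInBlock:])
	start := int(h.offsetInBlock) + n
	return valueBlock[start : start+int(length)]
}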

The following are preliminary results from a read benchmark,
after some performance tuning. The old numbers are master.
The needValue=false cases are the ones where value blocks are
expected to help.

  • The versions=1 cases have no values in value blocks; the slowdown
    is the extra call to valueBlockReader, which needs to subslice
    to remove the single-byte prefix.
  • The hasCache=false cases correspond to a cold cache, where there
    is additional wasted decompression of values that we don't need
    (when needValue=false). As expected, when there is an improvement,
    it is larger with hasCache=false: for example, the -97.83% below
    (almost 50x faster) compared with -79.89%.
  • The needValue=true cases are where the code can be up to 2x slower.
    The higher slowdowns occur when the value size is smaller: in such
    cases more inline values can be packed into an ssblock, and the
    overhead of decoding the valueHandle and the value length in the
    value block (all varints) becomes a significant component.

This is a prototype in that there are no changes to the
InternalIterator interface, and the read path only works for
singleLevelIterator.

name                                                                        old time/op    new time/op    delta
ValueBlocks/valueSize=100/versions=1/needValue=false/hasCache=false-16        25.5ns ± 3%    25.9ns ± 2%   +1.50%  (p=0.028 n=10+10)
ValueBlocks/valueSize=100/versions=1/needValue=false/hasCache=true-16         15.6ns ± 1%    15.5ns ± 2%     ~     (p=0.268 n=9+10)
ValueBlocks/valueSize=100/versions=1/needValue=true/hasCache=false-16         27.3ns ± 3%    29.5ns ± 3%   +8.11%  (p=0.000 n=10+10)
ValueBlocks/valueSize=100/versions=1/needValue=true/hasCache=true-16          17.1ns ± 2%    19.2ns ± 2%  +12.74%  (p=0.000 n=10+10)
ValueBlocks/valueSize=100/versions=10/needValue=false/hasCache=false-16       26.7ns ± 2%    29.4ns ± 2%  +10.46%  (p=0.000 n=9+10)
ValueBlocks/valueSize=100/versions=10/needValue=false/hasCache=true-16        15.9ns ± 2%    15.2ns ± 3%   -4.63%  (p=0.000 n=9+10)
ValueBlocks/valueSize=100/versions=10/needValue=true/hasCache=false-16        26.7ns ± 2%    53.0ns ± 4%  +98.79%  (p=0.000 n=9+10)
ValueBlocks/valueSize=100/versions=10/needValue=true/hasCache=true-16         16.6ns ± 1%    26.7ns ± 2%  +61.05%  (p=0.000 n=9+9)
ValueBlocks/valueSize=100/versions=100/needValue=false/hasCache=false-16      28.3ns ± 4%    25.3ns ± 5%  -10.74%  (p=0.000 n=10+10)
ValueBlocks/valueSize=100/versions=100/needValue=false/hasCache=true-16       15.8ns ± 2%    14.9ns ± 1%   -5.66%  (p=0.000 n=10+10)
ValueBlocks/valueSize=100/versions=100/needValue=true/hasCache=false-16       29.4ns ± 4%    47.8ns ± 3%  +62.46%  (p=0.000 n=10+10)
ValueBlocks/valueSize=100/versions=100/needValue=true/hasCache=true-16        16.7ns ± 4%    26.1ns ± 3%  +56.04%  (p=0.000 n=10+10)
ValueBlocks/valueSize=1000/versions=1/needValue=false/hasCache=false-16        123ns ± 4%     125ns ± 7%     ~     (p=0.735 n=9+10)
ValueBlocks/valueSize=1000/versions=1/needValue=false/hasCache=true-16        23.0ns ± 5%    22.9ns ± 5%     ~     (p=0.684 n=10+10)
ValueBlocks/valueSize=1000/versions=1/needValue=true/hasCache=false-16         124ns ± 6%     131ns ± 7%   +5.76%  (p=0.008 n=9+10)
ValueBlocks/valueSize=1000/versions=1/needValue=true/hasCache=true-16         24.3ns ± 4%    26.4ns ± 3%   +8.26%  (p=0.000 n=10+10)
ValueBlocks/valueSize=1000/versions=10/needValue=false/hasCache=false-16       130ns ± 8%      27ns ± 4%  -79.10%  (p=0.000 n=10+10)
ValueBlocks/valueSize=1000/versions=10/needValue=false/hasCache=true-16       23.8ns ± 4%    16.6ns ± 2%  -30.00%  (p=0.000 n=10+10)
ValueBlocks/valueSize=1000/versions=10/needValue=true/hasCache=false-16        128ns ± 9%     164ns ±12%  +27.94%  (p=0.000 n=10+10)
ValueBlocks/valueSize=1000/versions=10/needValue=true/hasCache=true-16        25.0ns ± 4%    33.0ns ± 2%  +32.22%  (p=0.000 n=10+10)
ValueBlocks/valueSize=1000/versions=100/needValue=false/hasCache=false-16      123ns ± 9%      28ns ± 3%  -76.89%  (p=0.000 n=9+10)
ValueBlocks/valueSize=1000/versions=100/needValue=false/hasCache=true-16      23.0ns ± 2%    15.3ns ± 5%  -33.36%  (p=0.000 n=10+9)
ValueBlocks/valueSize=1000/versions=100/needValue=true/hasCache=false-16       132ns ± 2%     171ns ± 5%  +29.24%  (p=0.000 n=8+10)
ValueBlocks/valueSize=1000/versions=100/needValue=true/hasCache=true-16       24.3ns ± 3%    32.6ns ± 3%  +33.98%  (p=0.000 n=10+10)
ValueBlocks/valueSize=10000/versions=1/needValue=false/hasCache=false-16      1.45µs ± 8%    1.35µs ±10%   -6.41%  (p=0.015 n=10+10)
ValueBlocks/valueSize=10000/versions=1/needValue=false/hasCache=true-16       75.5ns ± 2%    76.7ns ± 5%     ~     (p=0.218 n=10+10)
ValueBlocks/valueSize=10000/versions=1/needValue=true/hasCache=false-16       1.34µs ± 3%    1.46µs ±16%   +9.03%  (p=0.022 n=9+10)
ValueBlocks/valueSize=10000/versions=1/needValue=true/hasCache=true-16        77.0ns ± 3%    79.9ns ± 3%   +3.80%  (p=0.000 n=9+10)
ValueBlocks/valueSize=10000/versions=10/needValue=false/hasCache=false-16     1.46µs ± 6%    0.13µs ± 3%  -91.15%  (p=0.000 n=9+9)
ValueBlocks/valueSize=10000/versions=10/needValue=false/hasCache=true-16      76.4ns ± 3%    21.4ns ± 2%  -72.06%  (p=0.000 n=10+10)
ValueBlocks/valueSize=10000/versions=10/needValue=true/hasCache=false-16      1.47µs ± 8%    1.56µs ± 7%   +5.72%  (p=0.013 n=9+10)
ValueBlocks/valueSize=10000/versions=10/needValue=true/hasCache=true-16       78.1ns ± 4%    76.1ns ± 2%   -2.52%  (p=0.009 n=10+10)
ValueBlocks/valueSize=10000/versions=100/needValue=false/hasCache=false-16    1.34µs ± 5%    0.03µs ± 2%  -97.83%  (p=0.000 n=9+10)
ValueBlocks/valueSize=10000/versions=100/needValue=false/hasCache=true-16     77.0ns ± 2%    15.5ns ± 2%  -79.89%  (p=0.000 n=8+10)
ValueBlocks/valueSize=10000/versions=100/needValue=true/hasCache=false-16     1.42µs ± 9%    1.49µs ± 2%   +5.28%  (p=0.007 n=10+9)
ValueBlocks/valueSize=10000/versions=100/needValue=true/hasCache=true-16      78.5ns ± 4%    73.0ns ± 4%   -7.01%  (p=0.000 n=10+9)

Informs #1170

@sumeerbhola (Collaborator, Author) left a comment


sstable/reader_test.go, line 1175 at r1 (raw file):

	for _, valueSize := range []int{100, 1000, 10000} {
		b.Run(fmt.Sprintf("valueSize=%d", valueSize), func(b *testing.B) {
			for _, versions := range []int{1, 10, 100} {

to put this in context, https://github.com/cockroachlabs/support/issues/1348#issuecomment-985867830 had > 350 versions per key.

To put these numbers in context of what fraction of bytes in an sstable
are in regular data blocks (the rest are in value blocks):

value-size:  100,versions:  1,data-block-bytes:  1185521,total-bytes:  1186984,fraction-in-data-blocks:1.00
value-size:  100,versions: 10,data-block-bytes:   174822,total-bytes:  1085218,fraction-in-data-blocks:0.16
value-size:  100,versions:100,data-block-bytes:    82118,total-bytes:  1083393,fraction-in-data-blocks:0.08
value-size: 1000,versions:  1,data-block-bytes: 10197788,total-bytes: 10203513,fraction-in-data-blocks:1.00
value-size: 1000,versions: 10,data-block-bytes:  1199435,total-bytes: 10222244,fraction-in-data-blocks:0.12
value-size: 1000,versions:100,data-block-bytes:   169931,total-bytes: 10094442,fraction-in-data-blocks:0.02
value-size:10000,versions:  1,data-block-bytes:100246534,total-bytes:100292713,fraction-in-data-blocks:1.00
value-size:10000,versions: 10,data-block-bytes: 10204053,total-bytes:100256921,fraction-in-data-blocks:0.10
value-size:10000,versions:100,data-block-bytes:  1198846,total-bytes:100252245,fraction-in-data-blocks:0.01

@petermattis (Collaborator) left a comment


sstable/value_block.go, line 77 at r3 (raw file):

// which is modest. Therefore, we don't support more than one value index block.
// The metaindex block contains the valueBlockIndexHandle which in addition to
// the BlockHandle also specifies the widths of these tuple fields.

It would be helpful to have an ascii-art diagram that describes the structure here.

If I'm understanding the above correctly, the data block still contains keys for all of the historic versions. The space for those keys can be significant as well. Did you try to imagine any designs where the keys and values are stored elsewhere for historic versions. For example, the newest key could be stored in the data block along with its inline value and a pointer to another data block where subsequent versions of the key are stored. Iteration would become a bit more complex as iterating in a data block potentially has to move to another data block before popping back to the current one. This is just a drive-by thought. I haven't put much time into thinking it through or truly understanding what you've done in this PR.

This eliminates the value length encoding/decoding cost.

Latest benchmark (old is master). We see some unexpected improvement
in the needValue=true case too when the valueSize is large (10K).

name                                                                        old time/op    new time/op    delta
ValueBlocks/valueSize=100/versions=1/needValue=false/hasCache=false-16        25.5ns ± 3%    25.8ns ± 3%     ~     (p=0.118 n=10+10)
ValueBlocks/valueSize=100/versions=1/needValue=false/hasCache=true-16         15.6ns ± 1%    15.4ns ± 3%     ~     (p=0.117 n=9+10)
ValueBlocks/valueSize=100/versions=1/needValue=true/hasCache=false-16         27.3ns ± 3%    28.5ns ± 2%   +4.28%  (p=0.000 n=10+9)
ValueBlocks/valueSize=100/versions=1/needValue=true/hasCache=true-16          17.1ns ± 2%    18.5ns ± 3%   +8.60%  (p=0.000 n=10+10)
ValueBlocks/valueSize=100/versions=10/needValue=false/hasCache=false-16       26.7ns ± 2%    29.2ns ± 4%   +9.64%  (p=0.000 n=9+10)
ValueBlocks/valueSize=100/versions=10/needValue=false/hasCache=true-16        15.9ns ± 2%    15.2ns ± 2%   -4.31%  (p=0.000 n=9+10)
ValueBlocks/valueSize=100/versions=10/needValue=true/hasCache=false-16        26.7ns ± 2%    46.3ns ± 4%  +73.42%  (p=0.000 n=9+10)
ValueBlocks/valueSize=100/versions=10/needValue=true/hasCache=true-16         16.6ns ± 1%    21.9ns ± 1%  +32.13%  (p=0.000 n=9+10)
ValueBlocks/valueSize=100/versions=100/needValue=false/hasCache=false-16      28.3ns ± 4%    24.8ns ± 2%  -12.55%  (p=0.000 n=10+10)
ValueBlocks/valueSize=100/versions=100/needValue=false/hasCache=true-16       15.8ns ± 2%    14.7ns ± 1%   -6.68%  (p=0.000 n=10+9)
ValueBlocks/valueSize=100/versions=100/needValue=true/hasCache=false-16       29.4ns ± 4%    43.2ns ± 2%  +46.91%  (p=0.000 n=10+8)
ValueBlocks/valueSize=100/versions=100/needValue=true/hasCache=true-16        16.7ns ± 4%    21.4ns ± 2%  +28.37%  (p=0.000 n=10+10)
ValueBlocks/valueSize=1000/versions=1/needValue=false/hasCache=false-16        123ns ± 4%     109ns ± 3%  -11.54%  (p=0.000 n=9+9)
ValueBlocks/valueSize=1000/versions=1/needValue=false/hasCache=true-16        23.0ns ± 5%    22.4ns ± 1%   -2.90%  (p=0.016 n=10+8)
ValueBlocks/valueSize=1000/versions=1/needValue=true/hasCache=false-16         124ns ± 6%     118ns ± 4%   -4.62%  (p=0.004 n=9+10)
ValueBlocks/valueSize=1000/versions=1/needValue=true/hasCache=true-16         24.3ns ± 4%    25.8ns ± 3%   +5.87%  (p=0.000 n=10+10)
ValueBlocks/valueSize=1000/versions=10/needValue=false/hasCache=false-16       130ns ± 8%      28ns ± 5%  -78.73%  (p=0.000 n=10+10)
ValueBlocks/valueSize=1000/versions=10/needValue=false/hasCache=true-16       23.8ns ± 4%    17.1ns ± 4%  -28.06%  (p=0.000 n=10+10)
ValueBlocks/valueSize=1000/versions=10/needValue=true/hasCache=false-16        128ns ± 9%     139ns ± 4%   +8.14%  (p=0.000 n=10+10)
ValueBlocks/valueSize=1000/versions=10/needValue=true/hasCache=true-16        25.0ns ± 4%    26.6ns ± 3%   +6.44%  (p=0.000 n=10+10)
ValueBlocks/valueSize=1000/versions=100/needValue=false/hasCache=false-16      123ns ± 9%      29ns ± 5%  -76.28%  (p=0.000 n=9+10)
ValueBlocks/valueSize=1000/versions=100/needValue=false/hasCache=true-16      23.0ns ± 2%    14.9ns ± 3%  -35.04%  (p=0.000 n=10+9)
ValueBlocks/valueSize=1000/versions=100/needValue=true/hasCache=false-16       132ns ± 2%     153ns ± 5%  +15.42%  (p=0.000 n=8+10)
ValueBlocks/valueSize=1000/versions=100/needValue=true/hasCache=true-16       24.3ns ± 3%    25.4ns ± 2%   +4.54%  (p=0.000 n=10+10)
ValueBlocks/valueSize=10000/versions=1/needValue=false/hasCache=false-16      1.45µs ± 8%    1.24µs ± 6%  -14.53%  (p=0.000 n=10+10)
ValueBlocks/valueSize=10000/versions=1/needValue=false/hasCache=true-16       75.5ns ± 2%    72.0ns ± 2%   -4.62%  (p=0.000 n=10+10)
ValueBlocks/valueSize=10000/versions=1/needValue=true/hasCache=false-16       1.34µs ± 3%    1.30µs ± 6%     ~     (p=0.117 n=9+10)
ValueBlocks/valueSize=10000/versions=1/needValue=true/hasCache=true-16        77.0ns ± 3%    76.4ns ± 5%     ~     (p=0.447 n=9+10)
ValueBlocks/valueSize=10000/versions=10/needValue=false/hasCache=false-16     1.46µs ± 6%    0.12µs ± 5%  -92.05%  (p=0.000 n=9+10)
ValueBlocks/valueSize=10000/versions=10/needValue=false/hasCache=true-16      76.4ns ± 3%    21.9ns ± 3%  -71.38%  (p=0.000 n=10+10)
ValueBlocks/valueSize=10000/versions=10/needValue=true/hasCache=false-16      1.47µs ± 8%    1.42µs ±11%     ~     (p=0.113 n=9+10)
ValueBlocks/valueSize=10000/versions=10/needValue=true/hasCache=true-16       78.1ns ± 4%    58.4ns ± 4%  -25.28%  (p=0.000 n=10+10)
ValueBlocks/valueSize=10000/versions=100/needValue=false/hasCache=false-16    1.34µs ± 5%    0.03µs ± 3%  -97.91%  (p=0.000 n=9+9)
ValueBlocks/valueSize=10000/versions=100/needValue=false/hasCache=true-16     77.0ns ± 2%    15.9ns ± 4%  -79.39%  (p=0.000 n=8+10)
ValueBlocks/valueSize=10000/versions=100/needValue=true/hasCache=false-16     1.42µs ± 9%    1.46µs ± 7%     ~     (p=0.118 n=10+10)
ValueBlocks/valueSize=10000/versions=100/needValue=true/hasCache=true-16      78.5ns ± 4%    54.0ns ± 4%  -31.18%  (p=0.000 n=10+10)
@sumeerbhola (Collaborator, Author) left a comment

I've added two commits that can be ignored -- they are simple performance optimizations. The last commit also has the latest benchmark numbers -- the worst case slowdown is 1.75x and the best case speedup is 48x.



sstable/value_block.go, line 77 at r3 (raw file):

Previously, petermattis (Peter Mattis) wrote…

It would be helpful to have an ascii-art diagram that describes the structure here.

If I'm understanding the above correctly, the data block still contains keys for all of the historic versions. The space for those keys can be significant as well. Did you try to imagine any designs where the keys and values are stored elsewhere for historic versions. For example, the newest key could be stored in the data block along with its inline value and a pointer to another data block where subsequent versions of the key are stored. Iteration would become a bit more complex as iterating in a data block potentially has to move to another data block before popping back to the current one. This is just a drive-by thought. I haven't put much time into thinking it through or truly understanding what you've done in this PR.

I didn't consider moving the keys elsewhere since (a) if there are many versions of a key, we should be getting a lot out of prefix compression (modulo restarts), (b) hopefully keys are not very long (I realize this is not a great assumption for sql tables and indexes), (c) the iterator tree and the top-level iterators make decisions based on userkey comparisons, and hiding userkeys completely would be quite complicated -- anything being moved to the value blocks is not necessarily an older version since the latest version could have been deleted in a higher level sstable (this PR is doing what I sketched out at a high-level in #1170 (comment)).

name                                                                       old time/op    new time/op    delta
ValueBlocksDeletedValue/valueSize=100/needValue=false/hasCache=false-16      40.7ns ± 1%    22.8ns ± 2%  -44.02%  (p=0.000 n=9+10)
ValueBlocksDeletedValue/valueSize=100/needValue=false/hasCache=true-16       15.2ns ± 3%    14.7ns ± 4%   -3.09%  (p=0.007 n=10+10)
ValueBlocksDeletedValue/valueSize=100/needValue=true/hasCache=false-16       41.2ns ± 4%    33.0ns ± 5%  -20.01%  (p=0.000 n=10+10)
ValueBlocksDeletedValue/valueSize=100/needValue=true/hasCache=true-16        15.9ns ± 2%    19.2ns ± 4%  +20.85%  (p=0.000 n=10+10)
ValueBlocksDeletedValue/valueSize=1000/needValue=false/hasCache=false-16     65.9ns ± 3%    24.1ns ± 4%  -63.50%  (p=0.000 n=10+10)
ValueBlocksDeletedValue/valueSize=1000/needValue=false/hasCache=true-16      19.3ns ± 5%    14.5ns ± 3%  -24.99%  (p=0.000 n=10+10)
ValueBlocksDeletedValue/valueSize=1000/needValue=true/hasCache=false-16      65.9ns ± 6%    90.0ns ±10%  +36.61%  (p=0.000 n=10+10)
ValueBlocksDeletedValue/valueSize=1000/needValue=true/hasCache=true-16       19.6ns ± 4%    21.0ns ± 4%   +7.51%  (p=0.000 n=10+10)
ValueBlocksDeletedValue/valueSize=10000/needValue=false/hasCache=false-16     630ns ± 8%      23ns ± 3%  -96.34%  (p=0.000 n=10+10)
ValueBlocksDeletedValue/valueSize=10000/needValue=false/hasCache=true-16     43.2ns ± 2%    14.7ns ± 1%  -66.04%  (p=0.000 n=9+8)
ValueBlocksDeletedValue/valueSize=10000/needValue=true/hasCache=false-16      680ns ± 4%     729ns ± 8%   +7.19%  (p=0.007 n=10+10)
ValueBlocksDeletedValue/valueSize=10000/needValue=true/hasCache=true-16      43.3ns ± 2%    37.3ns ± 3%  -13.92%  (p=0.000 n=9+10)

value-size:  100,data-block-bytes:    89448,total-bytes:  1090925,fraction-in-data-blocks:0.08
value-size: 1000,data-block-bytes:    95216,total-bytes: 10100022,fraction-in-data-blocks:0.01
value-size:10000,data-block-bytes:   105246,total-bytes:100138926,fraction-in-data-blocks:0.00

The time savings for the needValue=false/hasCache=false cases are less than
the byte savings (see fraction-in-data-blocks).
@petermattis (Collaborator) left a comment


sstable/value_block.go, line 77 at r3 (raw file):

Previously, sumeerbhola wrote…

I didn't consider moving the keys elsewhere since (a) if there are many versions of a key, we should be getting a lot out of prefix compression (modulo restarts), (b) hopefully keys are not very long (I realize this is not a great assumption for sql tables and indexes), (c) the iterator tree and the top-level iterators make decisions based on userkey comparisons, and hiding userkeys completely would be quite complicated -- anything being moved to the value blocks is not necessarily an older version since the latest version could have been deleted in a higher level sstable (this PR is doing what I sketched out at a high-level in #1170 (comment)).

Prefix compression will definitely save a lot of space for different versions of the same key, but even with that compression the space usage isn't trivial and decoding the prefix compressed keys has overhead that shows up on scan benchmarks. For many SQL tables, I think the keys and values are approximately the same size (it would be interesting to verify this, but that is my recollection from looking at TPC-C data at one point). My thinking for storing both the keys/values for historic versions out-of-line is that an sstable needs to have fast access to the newest key/value for a particular userkey, but the historic versions can be slower to access. For a queue-like workload, I'd expect an L6 sstable to have the majority of its data in the slower-to-access region.

My mental model for how we could completely separate older versions is to model a logical sstable as two physical sstables (though we could also store the older versions in separate blocks of the same sstable). sstable.Writer.Add would add the first version of the key to the "upper" (newer) table and the subsequent versions of the key to the "lower" (older) table. A special sstable.Iterator would blend the two tables. We'd need to expose an Iterator.NextPrefix method which would be akin to MVCCIterator.NextKey. If we're iterating over the latest version we never even have to touch the older version blocks, and that lack of decoding of the older version keys would be very significant. The MVCC scanner "iters before seek" mechanism would be pushed into NextPrefix and we'd in turn push some of that logic down into the leaf iterators. For sstables with separated historic versions, the sstable NextPrefix would be almost as fast as a single Next and we could avoid the slow SeekGE calls that destroy scan performance. The downside to this layout is that if you do need to access a historic version you incur the penalty of another Seek.
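
As a rough illustration of the NextPrefix idea, a hypothetical Go sketch (the interface and method signatures here are assumptions, not Pebble's actual InternalIterator API):

package sketch

// Iterator is a pared-down stand-in for an sstable iterator, shown only to
// illustrate where a NextPrefix method would sit; it is not the real
// InternalIterator interface.
type Iterator interface {
	// SeekGE positions the iterator at the first key >= key.
	SeekGE(key []byte) (k, v []byte)
	// Next steps to the next key, which may be an older version with the
	// same user key prefix.
	Next() (k, v []byte)
	// NextPrefix steps to the first key whose prefix differs from the
	// current key's prefix, skipping older versions. With historic versions
	// stored out of line, this is intended to be nearly as cheap as a single
	// Next, replacing the MVCC scanner's "iters before seek" heuristic.
	NextPrefix() (k, v []byte)
}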

@sumeerbhola (Collaborator, Author)

... The downside to this layout is that if you do need to access a historic version you incur the penalty of another Seek.

  • The two-physical-sstable model is roughly how the original issue (#1170, "perf: separate latest and old MVCC versions for faster reads") was written. There are some difficulties there, in that the latest version could be deleted by a higher level, so completely hiding an older version when deciding to construct an iterator is not viable. I believe what you are suggesting is exposing the older version lazily when needed, which is possibly workable. I need to think some more.

  • I was trying to come up with a solution where accessing older versions would be almost the same cost as now. Doing a Seek would be much more costly.

  • There are also cases (the GC ones mentioned in #1170 (comment)) that need only the size of each version (including older versions), which this implementation provides without accessing the value blocks.

  • There seem to be two high-level options:

    • Move older keys to a different part too: I am quite convinced that accessing these should avoid a seek. I think avoiding the seek is doable by having some notion of a continuation handle, such as (a@10=>val, a@8=>continuation handle), where the continuation handle points to all remaining key-value pairs with prefix a.
    • (The solution here) only the values are in a different part: one can still optimize NextPrefix to not do O(key-len) work (since we have the valueHandlePrefix byte), and if there are data blocks that contain only older versions one can completely skip them using block property filters (@jbowens outlined one idea in #1170 (comment)). So the main overhead over the other option is that we would repeat the key every BlockRestartInterval. If there are 100s of versions in a swathe of the key space and the default BlockRestartInterval of 16 is insufficient, we could also keep stats about versions in the properties block and use the input stats for all sstables in a compaction to choose the BlockRestartInterval.

    I'm still partial to the latter option since the layout and code is simpler. I'll try to add some optimized code for NextPrefix for the case of long keys and see what the cpu profiles and the block bytes tell us about the cost.

@petermattis (Collaborator)

move older keys to a different part too: I am quite convinced that accessing these should avoid a seek. I think avoiding the seek is doable by having some notion of a continuation handle, such as (a@10=>val, a@8=>continuation handle) where the continuation handle points to all remaining key-value pairs with prefix a.

Yeah, I think we can avoid the second sstable-seek, and doing so is likely necessary for performance. The continuation handle approach would still suffer from a potential disk seek. On the other hand, by moving the keys out of the scan path we're in the optimal design space for scanning over the most recent versions of the keys. With the separated historic value approach a scan would still see all of the data for the historic keys. I'm unconvinced that we'll frequently see data blocks that only contain historic keys. We target a data block size of 32 KB. So on average we'll have 16 KB of data block remaining when we start storing historic keys. If I'm remembering correctly, in TPCC keys are ~100 bytes in size. With prefix compression we're only storing that 100 bytes every 16th key, and the remainder of the keys we're storing the suffix. So figure ~15 bytes per key on average for the historic keys (I'm estimating 9 bytes for the suffix) plus 9 bytes for the value handle. We can store 682 historic versions for such keys in 16 KB, so it would only be tables with truly massive numbers of historic versions that would see a benefit from being able to skip data blocks that contain only historic versions. Additionally, these data blocks would sequentially follow in the sstable, so readahead would be reading them from disk even if they are skipped.

I'm not attached to the two-sstable approach, or to other ideas for completely separating historic keys and values. Seems like there are perf trade-offs with either approach (e.g. GC needs access to the value sizes, but not the values).

For a queue-like workload, I'd expect an L6 sstable to have the majority of its data in the slower-to-access region.

This statement wasn't correct. In a queue-like workload, a record is added once (to the head of the queue), and then deleted once (from the tail of the queue). The problem with queue-like workloads is that there is a huge prefix of MVCC deletion tombstones that have to be iterated over at the tail in order to find the first live row. I don't think any of the approaches discussed in this PR address this problem.

PS I'm quite excited that we have some viable approaches to speeding up scanning tables that have many historic versions.

NextSkipOlderVersions is currently a best-effort attempt that
skips older SET versions only for sstables written using
ValueBlocksAreEnabled.

The benchmark tests this with long keys and values, 1000 bytes
each, where consecutive keys with different prefixes are completely
random, so they are unlikely to share any significant prefix.

The versions and BlockRestartInterval are varied:
versions:  1,restart-interval:16,data-block-bytes: 20177767,total-bytes: 20188429,fraction-in-data-blocks:1.00
versions:  1,restart-interval:32,data-block-bytes: 20177767,total-bytes: 20188429,fraction-in-data-blocks:1.00
versions:  1,restart-interval:64,data-block-bytes: 20177767,total-bytes: 20188429,fraction-in-data-blocks:1.00
versions:  2,restart-interval:16,data-block-bytes: 10863502,total-bytes: 16208363,fraction-in-data-blocks:0.67
versions:  2,restart-interval:32,data-block-bytes: 10529835,total-bytes: 15874694,fraction-in-data-blocks:0.66
versions:  2,restart-interval:64,data-block-bytes: 10529835,total-bytes: 15874694,fraction-in-data-blocks:0.66
versions: 10,restart-interval:16,data-block-bytes:  2356652,total-bytes: 11453297,fraction-in-data-blocks:0.21
versions: 10,restart-interval:32,data-block-bytes:  2423386,total-bytes: 11469656,fraction-in-data-blocks:0.21
versions: 10,restart-interval:64,data-block-bytes:  2429837,total-bytes: 11512104,fraction-in-data-blocks:0.21
versions:100,restart-interval:16,data-block-bytes:   345979,total-bytes: 10284363,fraction-in-data-blocks:0.03
versions:100,restart-interval:32,data-block-bytes:   318700,total-bytes: 10246809,fraction-in-data-blocks:0.03
versions:100,restart-interval:64,data-block-bytes:   298197,total-bytes: 10221192,fraction-in-data-blocks:0.03

The results are

BenchmarkValueBlocksNextSkipOlderVersions/versions=1/restartInterval=16/hasCache=true-16         	10406703	       108.7 ns/op	       0 B/op	       0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=1/restartInterval=32/hasCache=true-16         	11452221	       122.4 ns/op	       0 B/op	       0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=1/restartInterval=64/hasCache=true-16         	 9762536	       115.9 ns/op	       0 B/op	       0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=2/restartInterval=16/hasCache=true-16         	14293557	        87.28 ns/op	       0 B/op	       0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=2/restartInterval=32/hasCache=true-16         	15624984	        79.73 ns/op	       0 B/op	       0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=2/restartInterval=64/hasCache=true-16         	15065860	        80.10 ns/op	       0 B/op	       0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=10/restartInterval=16/hasCache=true-16        	 7101529	       165.1 ns/op	       0 B/op	       0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=10/restartInterval=32/hasCache=true-16        	 7561116	       155.1 ns/op	       0 B/op	       0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=10/restartInterval=64/hasCache=true-16        	 7571413	       149.7 ns/op	       0 B/op	       0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=100/restartInterval=16/hasCache=true-16       	 1000000	      1112 ns/op	       0 B/op	       0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=100/restartInterval=32/hasCache=true-16       	 1211478	      1040 ns/op	       0 B/op	       0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=100/restartInterval=64/hasCache=true-16       	 1238484	      1019 ns/op	       0 B/op	       0 allocs/op

- The versions=100 cases are ~10x slower than versions=1. A naive implementation would
  be expected to be 100x slower. But 10x slower is not great. A ratio closer to 1 is
  desirable.
  Looking at cpu profiles for
  BenchmarkValueBlocksNextSkipOlderVersions/versions=100/restartInterval=16/hasCache=true-16
  readEntryOrSkipIfOldVersion is ~90% of the time. This includes decoding various
  varints and assembling fullKey. Assembling fullKey is ~43%, even though it is
  O(key-length) only when there is a restart or the prefix changes.

  So the throughput is a direct result of the inherent limitation of storing all keys in
  the data blocks.

- The versions=10 cases are quite good -- ~1.5x slower than versions=1.
@sumeerbhola (Collaborator, Author)

I've added a new commit with results for NextSkipOlderVersions. The commit has details, and here is a summary:

  • With 10 versions, the results are competitive in comparison with 1 version (only 1.5x slower -- a naive iteration would be 10x slower since it would call Next 10 times).
  • 100 versions is ~10x slower (so the slowdown is accelerating, since a constant slope would mean 1.5*1.5=2.25x slower). That is still better than a naive 100x slowdown, but could be better. The major cost, based on CPU profiles, is inherent in storing all keys in the data blocks.

I thought some more about removing older version keys out of the data blocks: there are places in CockroachDB that do seeks using a key that has a non-empty version (intent resolution; pebbleMVCCScanner using the read timestamp; the deletion step of GC, which seeks to the version being deleted). To support these seeks, we would need to be able to accurately position among the older versions, which adds more complexity to how these older versions are stored and indexed.

So my current thinking is that removing older version keys out of the data blocks is too risky (from a performance perspective when older versions are needed) and too complicated as a first iteration over the sstable format. @petermattis let me know what you think.

…o key prefix compression

10 versions is now ~1x of 1 version and 100 versions is ~6x

BenchmarkValueBlocksNextSkipOlderVersions/versions=1/restartInterval=16/hasCache=true-16         	 9323478	       108.9 ns/op	       0 B/op	       0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=1/restartInterval=32/hasCache=true-16         	11679625	       103.3 ns/op	       0 B/op	       0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=1/restartInterval=64/hasCache=true-16         	11009826	       110.7 ns/op	       0 B/op	       0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=2/restartInterval=16/hasCache=true-16         	13600150	        74.92 ns/op	       0 B/op	       0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=2/restartInterval=32/hasCache=true-16         	13882718	        72.97 ns/op	       0 B/op	       0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=2/restartInterval=64/hasCache=true-16         	13649173	        81.17 ns/op	       0 B/op	       0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=10/restartInterval=16/hasCache=true-16        	10129204	       113.3 ns/op	       0 B/op	       0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=10/restartInterval=32/hasCache=true-16        	10582024	       109.6 ns/op	       0 B/op	       0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=10/restartInterval=64/hasCache=true-16        	10869471	       107.7 ns/op	       0 B/op	       0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=100/restartInterval=16/hasCache=true-16       	 1934696	       643.7 ns/op	       0 B/op	       0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=100/restartInterval=32/hasCache=true-16       	 2090880	       567.4 ns/op	       0 B/op	       0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=100/restartInterval=64/hasCache=true-16       	 2177060	       554.1 ns/op	       0 B/op	       0 allocs/op
@sumeerbhola (Collaborator, Author)

Added another commit that reduces 10 versions to 1x of 1 version and 100 versions to 6x of 1 version.

Most of the cost (75-80%) in the 100 versions case is in decoding varints as we step through all the key versions.
We can avoid this cost by dynamically switching (at read time) from stepping through each key to stepping through restarts. We can correctly (and cheaply) do the latter by storing a byte (bool) (at write time) with each restart key that indicates that all intervening keys since the previous restart in the same block are SETs, and the previous restart has the same prefix.

@petermattis (Collaborator) left a comment

Most of the cost (75-80%) in the 100 versions case is in decoding varints as we step through all the key versions.
We can avoid this cost by dynamically switching (at read time) from stepping through each key to stepping through restarts. We can correctly (and cheaply) do the latter by storing a byte (bool) (at write time) with each restart key that indicates that all intervening keys since the previous restart in the same block are SETs, and the previous restart has the same prefix.

I've seen varint decoding show up on sstable iterator benchmarks as well. I've wondered in the past if there is a better base block layout, possibly using a different varint encoding scheme. I did some experimentation in this area but never had satisfying results.

Your idea to mark restart keys is interesting. Would that bool be placed in the restart array? You could store this bool in the low order bit of the restart offset so that it would usually not consume any additional storage (though there would be an extra shift while decoding from the restart array). Another idea would be to store some sort of pointer to the next userkey with the value for the first userkey. That would be more complicated, but would make skipping larger numbers of versions very cheap.
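
A small Go sketch of the low-order-bit packing suggested above (the flag semantics follow this proposal; this is not existing Pebble code):

package sketch

// Restart offsets are uint32s in the block's restart array, and a block
// offset never needs all 32 bits, so the low bit can carry the proposed
// marker: every key since the previous restart is a SET sharing the previous
// restart's prefix. A reader skipping older versions can then jump from
// restart to restart instead of decoding every key.

func packRestart(offset uint32, allSetsSamePrefix bool) uint32 {
	packed := offset << 1 // costs an extra shift on decode, not an extra byte
	if allSetsSamePrefix {
		packed |= 1
	}
	return packed
}

func unpackRestart(packed uint32) (offset uint32, allSetsSamePrefix bool) {
	return packed >> 1, packed&1 == 1
}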

So my current thinking is that removing older version keys out of the data blocks is too risky (from a performance perspective when older versions are needed) and too complicated as a first iteration over the sstable format. @petermattis let me know what you think.

Your performance numbers are compelling. I'm good with continuing down the path you started on. This PR is a bit ginormous, though. Not sure where to concretely start reviewing it. Perhaps an RFC is in order.



-- commits, line 177 at r7:
1000-byte values feel larger than what I'd expect, especially for keys that are being overwritten frequently and thus have a lot of historic versions. I wonder if we can track down some actual value sizes from customer workloads that have experienced problems with historic versions. Or add a variant of these benchmarks for values that are 10 bytes in size (i.e. just a single column).

@sumeerbhola (Collaborator, Author)

I've seen varint decoding show up on sstable iterator benchmarks as well. I've wondered in the past if there is a better base block layout, possibly using a different varint encoding scheme. I did some experimentation in this area but never had satisfying results.

The only cheap thing I've encountered is fixed-width encoding, based on knowledge of the maximum value in some context (like a file), but then one has to store the width somewhere. I did that here for the value blocks index for a different reason (because I wanted to calculate the offset based on the blocknum).
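
To illustrate the point about computing the offset from the block number, a Go sketch with assumed entry widths (in the PR the actual widths are recorded in the valueBlockIndexHandle rather than hard-coded):

package sketch

import "encoding/binary"

// With fixed-width index entries, locating value block blockNum is a
// multiplication plus two fixed-width reads, rather than a search over a
// general-purpose key-value block. The 4-byte offset and 4-byte length
// widths below are assumptions for illustration.
const valueBlockIndexEntryWidth = 8

func lookupValueBlock(index []byte, blockNum uint32) (offset, length uint32) {
	entry := index[int(blockNum)*valueBlockIndexEntryWidth:]
	offset = binary.LittleEndian.Uint32(entry[0:4])
	length = binary.LittleEndian.Uint32(entry[4:8])
	return offset, length
}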

Your idea to mark restart keys is interesting. Would that bool be placed in the restart array? You could store this bool in the low order bit of the restart offset so that it would usually not consume any additional storage (though there would be an extra shift while decoding from the restart array).

I really like this idea! I was bothered by wasting a whole byte for a bool. We definitely don't need 32 bits for restart offsets (even if we end up having larger blocks for disaggregated shared sstables, they would still be O(1MB)).

1000-byte values feel larger than what I'd expect, especially for keys that are being overwritten frequently and thus have a lot of historic versions. I wonder if we can track down some actual value sizes from customer workloads that have experienced problems with historic versions. Or add a variant of these benchmarks for values that are 10 bytes in size (i.e. just a single column).

Yes, I was just trying to be extreme in this benchmark. I can look into historic reports of customer problems in this area to see if they mention the schema.

Your performance numbers are compelling. I'm good with continuing down the path you started on. This PR is a bit ginormous, though. Not sure where to concretely start reviewing it. Perhaps an RFC is in order.

Sorry, this degenerated into a series of hacky experiments piled onto a baseline commit. I agree this should get a design doc/RFC now that we have a design direction that seems to work. After that one can start extracting parts from here into smaller PRs for review.
So there isn't really a need to review this. For the other curious folks, value_block.go in the first commit (and the tiny comment in table.go) should be sufficient to understand what is being done here.

@petermattis (Collaborator)

The only cheap thing I've encountered is fixed-width encoding, based on knowledge of the maximum value in some context (like a file), but then one has to store the width somewhere. I did that here for the value blocks index for a different reason (because I wanted to calculate the offset based on the blocknum).

What I experimented with were group-varint encodings. StreamVByte is the latest hotness in that area that I'm aware of. Block records currently have 3 varints: the shared key prefix length, the unshared key length, and the value length. On the surface that seems amenable to some form of group varint encoding. But what I found in practice is that this didn't make a measurable difference on iteration speed, even though varint decoding was faster.

@sumeerbhola (Collaborator, Author)

Thanks for the links and the review!

This statement wasn't correct. In a queue-like workload, a record is added once (to the head of the queue), and then deleted once (from the tail of the queue). The problem with queue-like workloads is that there is a huge prefix of MVCC deletion tombstones that have to be iterated over at the tail in order to find the first live row. I don't think any of the approaches discussed in this PR address this problem.

I totally agree that the approach here doesn't solve the problem. But until we offer first-class queue support, folks will keep doing naive queues, and if they happen to have small keys (like UUIDs) and large values (diffs/log entries/...), this will at least reduce the volume of bytes they need to iterate over (which may not fit well in the block cache). The benchmark for this case (from a later commit) shows 99% of bytes in the value blocks for 1000-byte values. The CPU saving is smaller. The benchmark is not paying any IO latency (due to using a mem FS).

ValueBlocksDeletedValue/valueSize=1000/needValue=false/hasCache=false-16     65.9ns ± 3%    24.1ns ± 4%  -63.50%  (p=0.000 n=10+10)
value-size: 1000,data-block-bytes:    95216,total-bytes: 10100022,fraction-in-data-blocks:0.01

One could argue that this just delays when a user notices an unscalable approach, which is not a desirable outcome :)

@petermattis (Collaborator)

I had a realization today: storing the value separately from the key is exactly what WiscKey does. Only in this case you're only storing historic values separately. For large values, there is an argument to be made that we do this for all values. And we could also condition whether a value is stored separately based on the size of the value. I don't recall the threshold WiscKey uses, but I believe Badger used a threshold of 128 bytes.

@jbowens (Collaborator) commented Jan 18, 2022

I had a realization today: storing the value separately from the key is exactly what WiscKey does. Only in this case you're only storing historic values separately

Storing the values separately within the same sstable instead of a separate log (like WiscKey) does eliminate a lot of WiscKey's write amplification benefits, because a WiscKey compaction avoids rewriting values.

@sumeerbhola (Collaborator, Author)

I had a realization today: storing the value separately from the key is exactly what WiscKey does. Only in this case you're only storing historic values separately. For large values, there is an argument to be made that we do this for all values. And we could also condition whether a value is stored separately based on the size of the value. I don't recall the threshold WiscKey uses, but I believe Badger used a threshold of 128 bytes.

Storing the values separately within the same sstable instead of a separate log (like WiscKey) does eliminate a lot of WiscKey's write amplification benefits, because a WiscKey compaction avoids rewriting values.

These are all good points. There was a deliberate choice here not to go as far as WiscKey, and to give up (for now) on the write amp benefits:

  • WiscKey uses parallel reads to overcome the cost of values being arbitrarily ordered in the vLog. That approach may be fine when one is directly provisioning SSDs, but we have to be cognizant of what IOPS cost us in the public cloud (on virtualized block storage or on blob storage).
  • The vLog abstraction is sufficiently different from what we currently have that I think we would need to experiment a lot to be sure we are not regressing certain workloads. Also, we need to write a background GC for it. I think separate value blocks in the same file is an easier reach.
  • I think it will be viable to further extend the format here with value handles that point into other files, if we decide to in the future. So something like: inline the latest version if not a "very large" value; "very large" value in a separate file; remaining in value blocks in the same file.

@petermattis (Collaborator)

I wasn't meaning to imply that we should go the full WiscKey approach. #112 gives an overview of what that would involve. I very much like the simplification taken in this PR. My point was that there may be some learnings from WiscKey that apply here, such as having a size threshold for deciding when to store values inline vs out-of-line. And considering storing the most recent value out of line if its size is large.

@sumeerbhola (Collaborator, Author)

My point was that there may be some learnings from WiscKey that apply here, such as having a size threshold for deciding when to store values inline vs out-of-line. And considering storing the most recent value out of line if its size is large.

Good point. I suppose we could use something like a cumulative size threshold. I'll think some more about it.

@sumeerbhola (Collaborator, Author)

reminder: with MVCCValue, the stats adjustment code needs both the encoded size of the (localtimestamp, value) and whether the underlying value was empty (indicating an empty tombstone). We can use a bit for the latter and keep it in the key section.
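
A purely hypothetical Go sketch of carrying that bit in a per-key prefix byte; the bit assignments are assumptions for illustration, not the PR's encoding:

package sketch

// One spare bit in the per-key prefix byte could record "the underlying MVCC
// value is empty", so the stats adjustment code never needs to fetch the
// value to learn it was an empty tombstone. Bit assignments here are made up.
const (
	prefixValueIsHandle byte = 1 << 0 // assumed: the value is a handle into value blocks
	prefixValueIsEmpty  byte = 1 << 1 // assumed: the underlying value is empty
)

func isEmptyMVCCValue(prefix byte) bool {
	return prefix&prefixValueIsEmpty != 0
}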

@sumeerbhola (Collaborator, Author)

Coming back to the discussion about group varint encoding in #1443 (comment):

  • There are two places where we use groups of varints that are potential optimization targets, since they are present for each key-value pair (optimizing index entries is not worth it): the valueHandle is 3 uint32s (valueLen, blockNum, offsetInBlock), and blockIter.readEntry reads 3 uint32s (shared key length, unshared key length, value length).
  • StreamVByte and the like are designed for large streams of numbers, while we only have three, so they are not really applicable. But FTR, there is a better Go implementation of StreamVByte in https://github.com/thempatel/streamvbyte-simdgo
  • Group varint with 1 control byte followed by 3 uint32s seems the best prospect (see the sketch below). I hacked up some (unclean) implementations in sumeerbhola@8f82bd7 and did not see enough benefit in benchmarks to bother with the deficiencies (noted in the commit).
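
For reference, a generic Go sketch of the shape discussed in the last bullet: one control byte carrying three 2-bit byte widths, followed by the three values. This is an illustration, not the code in the linked commit.

package sketch

// encodeGroup3 appends three uint32s to dst using a single control byte:
// bits 0-1, 2-3, and 4-5 hold (byte width - 1) for each value, so a decoder
// learns all three widths from one byte instead of scanning varint
// continuation bits.
func encodeGroup3(dst []byte, vals [3]uint32) []byte {
	var ctrl byte
	body := make([]byte, 0, 12)
	for i, v := range vals {
		width := 1
		for width < 4 && v>>(8*uint(width)) != 0 {
			width++
		}
		ctrl |= byte(width-1) << (2 * uint(i))
		for j := 0; j < width; j++ {
			body = append(body, byte(v>>(8*uint(j))))
		}
	}
	return append(append(dst, ctrl), body...)
}

// decodeGroup3 is the inverse; it returns the three values and the total
// number of bytes consumed (control byte included).
func decodeGroup3(src []byte) (vals [3]uint32, n int) {
	ctrl := src[0]
	n = 1
	for i := 0; i < 3; i++ {
		width := int(ctrl>>(2*uint(i))&3) + 1
		var v uint32
		for j := 0; j < width; j++ {
			v |= uint32(src[n+j]) << (8 * uint(j))
		}
		vals[i] = v
		n += width
	}
	return vals, n
}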

@nicktrav (Contributor)

@sumeerbhola - shall we close this out?
