[DNM] sstable: add value blocks for storing older versions of a key #1443

Open · wants to merge 6 commits into base: master
Conversation

sumeerbhola (Collaborator)

When WriterOptions.ValueBlocksAreEnabled is set to true, older
versions of a key are written to a sequence of value blocks, and
the key contains a valueHandle which is a tuple
(valueLen, blockNum, offsetInBlock). The assumption here is
that most reads only need the value of the latest version, and
many reads that care about an older version only need the value
length.

Value blocks are a simple sequence of (varint encoded length,
value bytes) tuples such that given the uncompressed value
block, the valueHandle can cheaply read the value. The value
blocks index connects the blockNum in the valueHandle to the
location of the value block. It uses fixed width encoding to
avoid the expense of a general purpose key-value block.

See the comment at the top of value_block.go for details.
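
Below is a minimal Go sketch of the handle and value-block layout described above. The type, field, and function names are illustrative assumptions; the comment at the top of value_block.go is the authoritative reference.

package sketch

import "encoding/binary"

// valueHandle is a sketch of the tuple stored in the data block in place of
// an older version's value. Field names are assumptions for illustration;
// value_block.go in the PR is authoritative.
type valueHandle struct {
	valueLen      uint32 // cached so readers that need only the length never touch the value block
	blockNum      uint32 // which value block holds the value
	offsetInBlock uint32 // offset of the (length, value) tuple within that block
}

// readValue resolves a handle against an uncompressed value block, which is
// just a concatenation of (varint-encoded length, value bytes) tuples.
func readValue(valueBlock []byte, h valueHandle) []byte {
	// The offset points at the varint length preceding the value bytes.
	length, n := binary.Uvarint(valueBlock[h.offsetInBlock:])
	start := int(h.offsetInBlock) + n
	return valueBlock[start : start+int(length)]
}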

The following are preliminary results from a read benchmark,
after some performance tuning. The old numbers are master.
The needValue=false cases are the ones where value blocks are
expected to help.

  • The versions=1 cases have no values in value blocks; the slowdown
    is the extra call to valueBlockReader, which needs to subslice
    to remove the single-byte prefix.
  • The hasCache=false cases correspond to a cold cache, where there
    is additional wasted decompression of values that we don't need
    (when needValue=false). As expected, when there is an improvement,
    it is larger with hasCache=false: for example, the -97.83% below
    (almost 50x faster) compared with -79.89%.
  • The needValue=true cases are where the code can be up to 2x slower.
    The higher slowdowns occur when the value size is smaller: in such
    cases more inline values can be packed into an ssblock, and the
    overhead of decoding the valueHandle and the value length in the
    value block (all varints) becomes a significant component.

This is a prototype in that there are no changes to the
InternalIterator interface, and the read path only works for
singleLevelIterator.

name                                                                        old time/op    new time/op    delta
ValueBlocks/valueSize=100/versions=1/needValue=false/hasCache=false-16        25.5ns ± 3%    25.9ns ± 2%   +1.50%  (p=0.028 n=10+10)
ValueBlocks/valueSize=100/versions=1/needValue=false/hasCache=true-16         15.6ns ± 1%    15.5ns ± 2%     ~     (p=0.268 n=9+10)
ValueBlocks/valueSize=100/versions=1/needValue=true/hasCache=false-16         27.3ns ± 3%    29.5ns ± 3%   +8.11%  (p=0.000 n=10+10)
ValueBlocks/valueSize=100/versions=1/needValue=true/hasCache=true-16          17.1ns ± 2%    19.2ns ± 2%  +12.74%  (p=0.000 n=10+10)
ValueBlocks/valueSize=100/versions=10/needValue=false/hasCache=false-16       26.7ns ± 2%    29.4ns ± 2%  +10.46%  (p=0.000 n=9+10)
ValueBlocks/valueSize=100/versions=10/needValue=false/hasCache=true-16        15.9ns ± 2%    15.2ns ± 3%   -4.63%  (p=0.000 n=9+10)
ValueBlocks/valueSize=100/versions=10/needValue=true/hasCache=false-16        26.7ns ± 2%    53.0ns ± 4%  +98.79%  (p=0.000 n=9+10)
ValueBlocks/valueSize=100/versions=10/needValue=true/hasCache=true-16         16.6ns ± 1%    26.7ns ± 2%  +61.05%  (p=0.000 n=9+9)
ValueBlocks/valueSize=100/versions=100/needValue=false/hasCache=false-16      28.3ns ± 4%    25.3ns ± 5%  -10.74%  (p=0.000 n=10+10)
ValueBlocks/valueSize=100/versions=100/needValue=false/hasCache=true-16       15.8ns ± 2%    14.9ns ± 1%   -5.66%  (p=0.000 n=10+10)
ValueBlocks/valueSize=100/versions=100/needValue=true/hasCache=false-16       29.4ns ± 4%    47.8ns ± 3%  +62.46%  (p=0.000 n=10+10)
ValueBlocks/valueSize=100/versions=100/needValue=true/hasCache=true-16        16.7ns ± 4%    26.1ns ± 3%  +56.04%  (p=0.000 n=10+10)
ValueBlocks/valueSize=1000/versions=1/needValue=false/hasCache=false-16        123ns ± 4%     125ns ± 7%     ~     (p=0.735 n=9+10)
ValueBlocks/valueSize=1000/versions=1/needValue=false/hasCache=true-16        23.0ns ± 5%    22.9ns ± 5%     ~     (p=0.684 n=10+10)
ValueBlocks/valueSize=1000/versions=1/needValue=true/hasCache=false-16         124ns ± 6%     131ns ± 7%   +5.76%  (p=0.008 n=9+10)
ValueBlocks/valueSize=1000/versions=1/needValue=true/hasCache=true-16         24.3ns ± 4%    26.4ns ± 3%   +8.26%  (p=0.000 n=10+10)
ValueBlocks/valueSize=1000/versions=10/needValue=false/hasCache=false-16       130ns ± 8%      27ns ± 4%  -79.10%  (p=0.000 n=10+10)
ValueBlocks/valueSize=1000/versions=10/needValue=false/hasCache=true-16       23.8ns ± 4%    16.6ns ± 2%  -30.00%  (p=0.000 n=10+10)
ValueBlocks/valueSize=1000/versions=10/needValue=true/hasCache=false-16        128ns ± 9%     164ns ±12%  +27.94%  (p=0.000 n=10+10)
ValueBlocks/valueSize=1000/versions=10/needValue=true/hasCache=true-16        25.0ns ± 4%    33.0ns ± 2%  +32.22%  (p=0.000 n=10+10)
ValueBlocks/valueSize=1000/versions=100/needValue=false/hasCache=false-16      123ns ± 9%      28ns ± 3%  -76.89%  (p=0.000 n=9+10)
ValueBlocks/valueSize=1000/versions=100/needValue=false/hasCache=true-16      23.0ns ± 2%    15.3ns ± 5%  -33.36%  (p=0.000 n=10+9)
ValueBlocks/valueSize=1000/versions=100/needValue=true/hasCache=false-16       132ns ± 2%     171ns ± 5%  +29.24%  (p=0.000 n=8+10)
ValueBlocks/valueSize=1000/versions=100/needValue=true/hasCache=true-16       24.3ns ± 3%    32.6ns ± 3%  +33.98%  (p=0.000 n=10+10)
ValueBlocks/valueSize=10000/versions=1/needValue=false/hasCache=false-16      1.45µs ± 8%    1.35µs ±10%   -6.41%  (p=0.015 n=10+10)
ValueBlocks/valueSize=10000/versions=1/needValue=false/hasCache=true-16       75.5ns ± 2%    76.7ns ± 5%     ~     (p=0.218 n=10+10)
ValueBlocks/valueSize=10000/versions=1/needValue=true/hasCache=false-16       1.34µs ± 3%    1.46µs ±16%   +9.03%  (p=0.022 n=9+10)
ValueBlocks/valueSize=10000/versions=1/needValue=true/hasCache=true-16        77.0ns ± 3%    79.9ns ± 3%   +3.80%  (p=0.000 n=9+10)
ValueBlocks/valueSize=10000/versions=10/needValue=false/hasCache=false-16     1.46µs ± 6%    0.13µs ± 3%  -91.15%  (p=0.000 n=9+9)
ValueBlocks/valueSize=10000/versions=10/needValue=false/hasCache=true-16      76.4ns ± 3%    21.4ns ± 2%  -72.06%  (p=0.000 n=10+10)
ValueBlocks/valueSize=10000/versions=10/needValue=true/hasCache=false-16      1.47µs ± 8%    1.56µs ± 7%   +5.72%  (p=0.013 n=9+10)
ValueBlocks/valueSize=10000/versions=10/needValue=true/hasCache=true-16       78.1ns ± 4%    76.1ns ± 2%   -2.52%  (p=0.009 n=10+10)
ValueBlocks/valueSize=10000/versions=100/needValue=false/hasCache=false-16    1.34µs ± 5%    0.03µs ± 2%  -97.83%  (p=0.000 n=9+10)
ValueBlocks/valueSize=10000/versions=100/needValue=false/hasCache=true-16     77.0ns ± 2%    15.5ns ± 2%  -79.89%  (p=0.000 n=8+10)
ValueBlocks/valueSize=10000/versions=100/needValue=true/hasCache=false-16     1.42µs ± 9%    1.49µs ± 2%   +5.28%  (p=0.007 n=10+9)
ValueBlocks/valueSize=10000/versions=100/needValue=true/hasCache=true-16      78.5ns ± 4%    73.0ns ± 4%   -7.01%  (p=0.000 n=10+9)

Informs #1170

@sumeerbhola (Collaborator, Author) left a comment


sstable/reader_test.go, line 1175 at r1 (raw file):

	for _, valueSize := range []int{100, 1000, 10000} {
		b.Run(fmt.Sprintf("valueSize=%d", valueSize), func(b *testing.B) {
			for _, versions := range []int{1, 10, 100} {

to put this in context, https://github.com/cockroachlabs/support/issues/1348#issuecomment-985867830 had > 350 versions per key.

To put these numbers in context of what fraction of bytes in an sstable
are in regular data blocks (the rest are in value blocks):

value-size:  100,versions:  1,data-block-bytes:  1185521,total-bytes:  1186984,fraction-in-data-blocks:1.00
value-size:  100,versions: 10,data-block-bytes:   174822,total-bytes:  1085218,fraction-in-data-blocks:0.16
value-size:  100,versions:100,data-block-bytes:    82118,total-bytes:  1083393,fraction-in-data-blocks:0.08
value-size: 1000,versions:  1,data-block-bytes: 10197788,total-bytes: 10203513,fraction-in-data-blocks:1.00
value-size: 1000,versions: 10,data-block-bytes:  1199435,total-bytes: 10222244,fraction-in-data-blocks:0.12
value-size: 1000,versions:100,data-block-bytes:   169931,total-bytes: 10094442,fraction-in-data-blocks:0.02
value-size:10000,versions:  1,data-block-bytes:100246534,total-bytes:100292713,fraction-in-data-blocks:1.00
value-size:10000,versions: 10,data-block-bytes: 10204053,total-bytes:100256921,fraction-in-data-blocks:0.10
value-size:10000,versions:100,data-block-bytes:  1198846,total-bytes:100252245,fraction-in-data-blocks:0.01

@petermattis (Collaborator) left a comment


sstable/value_block.go, line 77 at r3 (raw file):

// which is modest. Therefore, we don't support more than one value index block.
// The metaindex block contains the valueBlockIndexHandle which in addition to
// the BlockHandle also specifies the widths of these tuple fields.

It would be helpful to have an ascii-art diagram that describes the structure here.

If I'm understanding the above correctly, the data block still contains keys for all of the historic versions. The space for those keys can be significant as well. Did you try to imagine any designs where the keys and values are stored elsewhere for historic versions. For example, the newest key could be stored in the data block along with its inline value and a pointer to another data block where subsequent versions of the key are stored. Iteration would become a bit more complex as iterating in a data block potentially has to move to another data block before popping back to the current one. This is just a drive-by thought. I haven't put much time into thinking it through or truly understanding what you've done in this PR.

This eliminates the value length encoding/decoding cost.

Latest benchmark (old is master). We see some unexpected improvement
in the needValue=true case too when the valueSize is large (10K).

name                                                                        old time/op    new time/op    delta
ValueBlocks/valueSize=100/versions=1/needValue=false/hasCache=false-16        25.5ns ± 3%    25.8ns ± 3%     ~     (p=0.118 n=10+10)
ValueBlocks/valueSize=100/versions=1/needValue=false/hasCache=true-16         15.6ns ± 1%    15.4ns ± 3%     ~     (p=0.117 n=9+10)
ValueBlocks/valueSize=100/versions=1/needValue=true/hasCache=false-16         27.3ns ± 3%    28.5ns ± 2%   +4.28%  (p=0.000 n=10+9)
ValueBlocks/valueSize=100/versions=1/needValue=true/hasCache=true-16          17.1ns ± 2%    18.5ns ± 3%   +8.60%  (p=0.000 n=10+10)
ValueBlocks/valueSize=100/versions=10/needValue=false/hasCache=false-16       26.7ns ± 2%    29.2ns ± 4%   +9.64%  (p=0.000 n=9+10)
ValueBlocks/valueSize=100/versions=10/needValue=false/hasCache=true-16        15.9ns ± 2%    15.2ns ± 2%   -4.31%  (p=0.000 n=9+10)
ValueBlocks/valueSize=100/versions=10/needValue=true/hasCache=false-16        26.7ns ± 2%    46.3ns ± 4%  +73.42%  (p=0.000 n=9+10)
ValueBlocks/valueSize=100/versions=10/needValue=true/hasCache=true-16         16.6ns ± 1%    21.9ns ± 1%  +32.13%  (p=0.000 n=9+10)
ValueBlocks/valueSize=100/versions=100/needValue=false/hasCache=false-16      28.3ns ± 4%    24.8ns ± 2%  -12.55%  (p=0.000 n=10+10)
ValueBlocks/valueSize=100/versions=100/needValue=false/hasCache=true-16       15.8ns ± 2%    14.7ns ± 1%   -6.68%  (p=0.000 n=10+9)
ValueBlocks/valueSize=100/versions=100/needValue=true/hasCache=false-16       29.4ns ± 4%    43.2ns ± 2%  +46.91%  (p=0.000 n=10+8)
ValueBlocks/valueSize=100/versions=100/needValue=true/hasCache=true-16        16.7ns ± 4%    21.4ns ± 2%  +28.37%  (p=0.000 n=10+10)
ValueBlocks/valueSize=1000/versions=1/needValue=false/hasCache=false-16        123ns ± 4%     109ns ± 3%  -11.54%  (p=0.000 n=9+9)
ValueBlocks/valueSize=1000/versions=1/needValue=false/hasCache=true-16        23.0ns ± 5%    22.4ns ± 1%   -2.90%  (p=0.016 n=10+8)
ValueBlocks/valueSize=1000/versions=1/needValue=true/hasCache=false-16         124ns ± 6%     118ns ± 4%   -4.62%  (p=0.004 n=9+10)
ValueBlocks/valueSize=1000/versions=1/needValue=true/hasCache=true-16         24.3ns ± 4%    25.8ns ± 3%   +5.87%  (p=0.000 n=10+10)
ValueBlocks/valueSize=1000/versions=10/needValue=false/hasCache=false-16       130ns ± 8%      28ns ± 5%  -78.73%  (p=0.000 n=10+10)
ValueBlocks/valueSize=1000/versions=10/needValue=false/hasCache=true-16       23.8ns ± 4%    17.1ns ± 4%  -28.06%  (p=0.000 n=10+10)
ValueBlocks/valueSize=1000/versions=10/needValue=true/hasCache=false-16        128ns ± 9%     139ns ± 4%   +8.14%  (p=0.000 n=10+10)
ValueBlocks/valueSize=1000/versions=10/needValue=true/hasCache=true-16        25.0ns ± 4%    26.6ns ± 3%   +6.44%  (p=0.000 n=10+10)
ValueBlocks/valueSize=1000/versions=100/needValue=false/hasCache=false-16      123ns ± 9%      29ns ± 5%  -76.28%  (p=0.000 n=9+10)
ValueBlocks/valueSize=1000/versions=100/needValue=false/hasCache=true-16      23.0ns ± 2%    14.9ns ± 3%  -35.04%  (p=0.000 n=10+9)
ValueBlocks/valueSize=1000/versions=100/needValue=true/hasCache=false-16       132ns ± 2%     153ns ± 5%  +15.42%  (p=0.000 n=8+10)
ValueBlocks/valueSize=1000/versions=100/needValue=true/hasCache=true-16       24.3ns ± 3%    25.4ns ± 2%   +4.54%  (p=0.000 n=10+10)
ValueBlocks/valueSize=10000/versions=1/needValue=false/hasCache=false-16      1.45µs ± 8%    1.24µs ± 6%  -14.53%  (p=0.000 n=10+10)
ValueBlocks/valueSize=10000/versions=1/needValue=false/hasCache=true-16       75.5ns ± 2%    72.0ns ± 2%   -4.62%  (p=0.000 n=10+10)
ValueBlocks/valueSize=10000/versions=1/needValue=true/hasCache=false-16       1.34µs ± 3%    1.30µs ± 6%     ~     (p=0.117 n=9+10)
ValueBlocks/valueSize=10000/versions=1/needValue=true/hasCache=true-16        77.0ns ± 3%    76.4ns ± 5%     ~     (p=0.447 n=9+10)
ValueBlocks/valueSize=10000/versions=10/needValue=false/hasCache=false-16     1.46µs ± 6%    0.12µs ± 5%  -92.05%  (p=0.000 n=9+10)
ValueBlocks/valueSize=10000/versions=10/needValue=false/hasCache=true-16      76.4ns ± 3%    21.9ns ± 3%  -71.38%  (p=0.000 n=10+10)
ValueBlocks/valueSize=10000/versions=10/needValue=true/hasCache=false-16      1.47µs ± 8%    1.42µs ±11%     ~     (p=0.113 n=9+10)
ValueBlocks/valueSize=10000/versions=10/needValue=true/hasCache=true-16       78.1ns ± 4%    58.4ns ± 4%  -25.28%  (p=0.000 n=10+10)
ValueBlocks/valueSize=10000/versions=100/needValue=false/hasCache=false-16    1.34µs ± 5%    0.03µs ± 3%  -97.91%  (p=0.000 n=9+9)
ValueBlocks/valueSize=10000/versions=100/needValue=false/hasCache=true-16     77.0ns ± 2%    15.9ns ± 4%  -79.39%  (p=0.000 n=8+10)
ValueBlocks/valueSize=10000/versions=100/needValue=true/hasCache=false-16     1.42µs ± 9%    1.46µs ± 7%     ~     (p=0.118 n=10+10)
ValueBlocks/valueSize=10000/versions=100/needValue=true/hasCache=true-16      78.5ns ± 4%    54.0ns ± 4%  -31.18%  (p=0.000 n=10+10)
@sumeerbhola (Collaborator, Author) left a comment

I've added two commits that can be ignored -- they are simple performance optimizations. The last commit also has the latest benchmark numbers -- the worst case slowdown is 1.75x and the best case speedup is 48x.



sstable/value_block.go, line 77 at r3 (raw file):

Previously, petermattis (Peter Mattis) wrote…

It would be helpful to have an ascii-art diagram that describes the structure here.

If I'm understanding the above correctly, the data block still contains keys for all of the historic versions. The space for those keys can be significant as well. Did you try to imagine any designs where the keys and values are stored elsewhere for historic versions. For example, the newest key could be stored in the data block along with its inline value and a pointer to another data block where subsequent versions of the key are stored. Iteration would become a bit more complex as iterating in a data block potentially has to move to another data block before popping back to the current one. This is just a drive-by thought. I haven't put much time into thinking it through or truly understanding what you've done in this PR.

I didn't consider moving the keys elsewhere since (a) if there are many versions of a key, we should be getting a lot out of prefix compression (modulo restarts), (b) hopefully keys are not very long (I realize this is not a great assumption for sql tables and indexes), (c) the iterator tree and the top-level iterators make decisions based on userkey comparisons, and hiding userkeys completely would be quite complicated -- anything being moved to the value blocks is not necessarily an older version since the latest version could have been deleted in a higher level sstable (this PR is doing what I sketched out at a high-level in #1170 (comment)).

name                                                                       old time/op    new time/op    delta
ValueBlocksDeletedValue/valueSize=100/needValue=false/hasCache=false-16      40.7ns ± 1%    22.8ns ± 2%  -44.02%  (p=0.000 n=9+10)
ValueBlocksDeletedValue/valueSize=100/needValue=false/hasCache=true-16       15.2ns ± 3%    14.7ns ± 4%   -3.09%  (p=0.007 n=10+10)
ValueBlocksDeletedValue/valueSize=100/needValue=true/hasCache=false-16       41.2ns ± 4%    33.0ns ± 5%  -20.01%  (p=0.000 n=10+10)
ValueBlocksDeletedValue/valueSize=100/needValue=true/hasCache=true-16        15.9ns ± 2%    19.2ns ± 4%  +20.85%  (p=0.000 n=10+10)
ValueBlocksDeletedValue/valueSize=1000/needValue=false/hasCache=false-16     65.9ns ± 3%    24.1ns ± 4%  -63.50%  (p=0.000 n=10+10)
ValueBlocksDeletedValue/valueSize=1000/needValue=false/hasCache=true-16      19.3ns ± 5%    14.5ns ± 3%  -24.99%  (p=0.000 n=10+10)
ValueBlocksDeletedValue/valueSize=1000/needValue=true/hasCache=false-16      65.9ns ± 6%    90.0ns ±10%  +36.61%  (p=0.000 n=10+10)
ValueBlocksDeletedValue/valueSize=1000/needValue=true/hasCache=true-16       19.6ns ± 4%    21.0ns ± 4%   +7.51%  (p=0.000 n=10+10)
ValueBlocksDeletedValue/valueSize=10000/needValue=false/hasCache=false-16     630ns ± 8%      23ns ± 3%  -96.34%  (p=0.000 n=10+10)
ValueBlocksDeletedValue/valueSize=10000/needValue=false/hasCache=true-16     43.2ns ± 2%    14.7ns ± 1%  -66.04%  (p=0.000 n=9+8)
ValueBlocksDeletedValue/valueSize=10000/needValue=true/hasCache=false-16      680ns ± 4%     729ns ± 8%   +7.19%  (p=0.007 n=10+10)
ValueBlocksDeletedValue/valueSize=10000/needValue=true/hasCache=true-16      43.3ns ± 2%    37.3ns ± 3%  -13.92%  (p=0.000 n=9+10)

value-size:  100,data-block-bytes:    89448,total-bytes:  1090925,fraction-in-data-blocks:0.08
value-size: 1000,data-block-bytes:    95216,total-bytes: 10100022,fraction-in-data-blocks:0.01
value-size:10000,data-block-bytes:   105246,total-bytes:100138926,fraction-in-data-blocks:0.00

The time savings for the needValue=false/hasCache=false cases are less than
the byte savings (see fraction-in-data-blocks).
@petermattis (Collaborator) left a comment


sstable/value_block.go, line 77 at r3 (raw file):

Previously, sumeerbhola wrote…

I didn't consider moving the keys elsewhere since (a) if there are many versions of a key, we should be getting a lot out of prefix compression (modulo restarts), (b) hopefully keys are not very long (I realize this is not a great assumption for sql tables and indexes), (c) the iterator tree and the top-level iterators make decisions based on userkey comparisons, and hiding userkeys completely would be quite complicated -- anything being moved to the value blocks is not necessarily an older version since the latest version could have been deleted in a higher level sstable (this PR is doing what I sketched out at a high-level in #1170 (comment)).

Prefix compression will definitely save a lot of space for different versions of the same key, but even with that compression the space usage isn't trivial and decoding the prefix compressed keys has overhead that shows up on scan benchmarks. For many SQL tables, I think the keys and values are approximately the same size (it would be interesting to verify this, but that is my recollection from looking at TPC-C data at one point). My thinking for storing both the keys/values for historic versions out-of-line is that an sstable needs to have fast access to the newest key/value for a particular userkey, but the historic versions can be slower to access. For a queue-like workload, I'd expect an L6 sstable to have the majority of its data in the slower-to-access region.

My mental model for how we could completely separate older versions is to model a logical sstable as two physical sstables (though we could also store the older versions in separate blocks of the same sstable). sstable.Writer.Add would add the first version of the key to the "upper" (newer) table and the subsequent versions of the key to the "lower" (older) table. A special sstable.Iterator would blend the two tables. We'd need to expose an Iterator.NextPrefix method which would be akin to MVCCIterator.NextKey. If we're iterating over the latest version we never even have to touch the older version blocks, and that lack of decoding of the older version keys would be very significant. The MVCC scanner "iters before seek" mechanism would be pushed into NextPrefix and we'd in turn push some of that logic down into the leaf iterators. For sstables with separated historic versions, the sstable NextPrefix would be almost as fast as a single Next and we could avoid the slow SeekGE calls that destroy scan performance. The downside to this layout is that if you do need to access a historic version you incur the penalty of another Seek.
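
As a rough illustration of the NextPrefix idea, a hypothetical Go sketch (the interface and method signatures here are assumptions, not Pebble's actual InternalIterator API):

package sketch

// Iterator is a pared-down stand-in for an sstable iterator, shown only to
// illustrate where a NextPrefix method would sit; it is not the real
// InternalIterator interface.
type Iterator interface {
	// SeekGE positions the iterator at the first key >= key.
	SeekGE(key []byte) (k, v []byte)
	// Next steps to the next key, which may be an older version with the
	// same user key prefix.
	Next() (k, v []byte)
	// NextPrefix steps to the first key whose prefix differs from the
	// current key's prefix, skipping older versions. With historic versions
	// stored out of line, this is intended to be nearly as cheap as a single
	// Next, replacing the MVCC scanner's "iters before seek" heuristic.
	NextPrefix() (k, v []byte)
}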

@sumeerbhola (Collaborator, Author)

... The downside to this layout is that if you do need to access a historic version you incur the penalty of another Seek.

  • The two-physical-sstable model is roughly how the original issue (#1170, "perf: separate latest and old MVCC versions for faster reads") was written. There are some difficulties there, in that the latest version could be deleted by a higher level, so completely hiding an older version when deciding to construct an iterator is not viable. I believe what you are suggesting is exposing the older version lazily when needed, which is possibly workable. I need to think some more.

  • I was trying to come up with a solution where accessing older versions would be almost the same cost as now. Doing a Seek would be much more costly.

  • There are also cases (the GC ones mentioned in #1170 (comment)) that need only the size of each version (including older versions), which this implementation provides without accessing the value blocks.

  • There seem to be two high-level options:

    • Move older keys to a different part too: I am quite convinced that accessing these should avoid a seek. I think avoiding the seek is doable by having some notion of a continuation handle, such as (a@10=>val, a@8=>continuation handle), where the continuation handle points to all remaining key-value pairs with prefix a.
    • (The solution here) only the values are in a different part: one can still optimize NextPrefix to not do O(key-len) work (since we have the valueHandlePrefix byte), and if there are data blocks that contain only older versions one can completely skip them using block property filters (@jbowens outlined one idea in #1170 (comment)). So the main overhead over the other option is that we would repeat the key every BlockRestartInterval. If there are 100s of versions in a swathe of the key space and the default BlockRestartInterval of 16 is insufficient, we could also keep stats about versions in the properties block and use the input stats for all sstables in a compaction to choose the BlockRestartInterval.

    I'm still partial to the latter option since the layout and code is simpler. I'll try to add some optimized code for NextPrefix for the case of long keys and see what the cpu profiles and the block bytes tell us about the cost.

@petermattis (Collaborator)

move older keys to a different part too: I am quite convinced that accessing these should avoid a seek. I think avoiding the seek is doable by having some notion of a continuation handle, such as (a@10=>val, a@8=>continuation handle) where the continuation handle points to all remaining key-value pairs with prefix a.

Yeah, I think we can avoid the second sstable-seek, and doing so is likely necessary for performance. The continuation handle approach would still suffer from a potential disk seek. On the other hand, by moving the keys out of the scan path we're in the optimal design space for scanning over the most recent versions of the keys. With the separated historic value approach a scan would still see all of the data for the historic keys. I'm unconvinced that we'll frequently see data blocks that only contain historic keys. We target a data block size of 32 KB. So on average we'll have 16 KB of data block remaining when we start storing historic keys. If I'm remembering correctly, in TPCC keys are ~100 bytes in size. With prefix compression we're only storing that 100 bytes every 16th key, and the remainder of the keys we're storing the suffix. So figure ~15 bytes per key on average for the historic keys (I'm estimating 9 bytes for the suffix) plus 9 bytes for the value handle. We can store 682 historic versions for such keys in 16 KB, so it would only be tables with truly massive numbers of historic versions that would see a benefit from being able to skip data blocks that contain only historic versions. Additionally, these data blocks would sequentially follow in the sstable, so readahead would be reading them from disk even if they are skipped.

I'm not attached to the two-sstable approach, or to other ideas for completely separating historic keys and values. Seems like there are perf trade-offs with either approach (e.g. GC needs access to the value sizes, but not the values).

For a queue-like workload, I'd expect an L6 sstable to have the majority of its data in the slower-to-access region.

This statement wasn't correct. In a queue-like workload, a record is added once (to the head of the queue), and then deleted once (from the tail of the queue). The problem with queue-like workloads is that there is a huge prefix of MVCC deletion tombstones that have to be iterated over at the tail in order to find the first live row. I don't think any of the approaches discussed in this PR address this problem.

PS I'm quite excited that we have some viable approaches to speeding up scanning tables that have many historic versions.

NextSkipOlderVersions is currently a best-effort attempt that
skips older SET versions only for sstables written using
ValueBlocksAreEnabled.

The benchmark tests this with long keys and values, 1000 bytes
each, where consecutive keys with different prefixes are completely
random, so they are unlikely to share any significant prefix.

The versions and BlockRestartInterval are varied:
versions:  1,restart-interval:16,data-block-bytes: 20177767,total-bytes: 20188429,fraction-in-data-blocks:1.00
versions:  1,restart-interval:32,data-block-bytes: 20177767,total-bytes: 20188429,fraction-in-data-blocks:1.00
versions:  1,restart-interval:64,data-block-bytes: 20177767,total-bytes: 20188429,fraction-in-data-blocks:1.00
versions:  2,restart-interval:16,data-block-bytes: 10863502,total-bytes: 16208363,fraction-in-data-blocks:0.67
versions:  2,restart-interval:32,data-block-bytes: 10529835,total-bytes: 15874694,fraction-in-data-blocks:0.66
versions:  2,restart-interval:64,data-block-bytes: 10529835,total-bytes: 15874694,fraction-in-data-blocks:0.66
versions: 10,restart-interval:16,data-block-bytes:  2356652,total-bytes: 11453297,fraction-in-data-blocks:0.21
versions: 10,restart-interval:32,data-block-bytes:  2423386,total-bytes: 11469656,fraction-in-data-blocks:0.21
versions: 10,restart-interval:64,data-block-bytes:  2429837,total-bytes: 11512104,fraction-in-data-blocks:0.21
versions:100,restart-interval:16,data-block-bytes:   345979,total-bytes: 10284363,fraction-in-data-blocks:0.03
versions:100,restart-interval:32,data-block-bytes:   318700,total-bytes: 10246809,fraction-in-data-blocks:0.03
versions:100,restart-interval:64,data-block-bytes:   298197,total-bytes: 10221192,fraction-in-data-blocks:0.03

The results are

BenchmarkValueBlocksNextSkipOlderVersions/versions=1/restartInterval=16/hasCache=true-16         	10406703	       108.7 ns/op	       0 B/op	       0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=1/restartInterval=32/hasCache=true-16         	11452221	       122.4 ns/op	       0 B/op	       0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=1/restartInterval=64/hasCache=true-16         	 9762536	       115.9 ns/op	       0 B/op	       0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=2/restartInterval=16/hasCache=true-16         	14293557	        87.28 ns/op	       0 B/op	       0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=2/restartInterval=32/hasCache=true-16         	15624984	        79.73 ns/op	       0 B/op	       0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=2/restartInterval=64/hasCache=true-16         	15065860	        80.10 ns/op	       0 B/op	       0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=10/restartInterval=16/hasCache=true-16        	 7101529	       165.1 ns/op	       0 B/op	       0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=10/restartInterval=32/hasCache=true-16        	 7561116	       155.1 ns/op	       0 B/op	       0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=10/restartInterval=64/hasCache=true-16        	 7571413	       149.7 ns/op	       0 B/op	       0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=100/restartInterval=16/hasCache=true-16       	 1000000	      1112 ns/op	       0 B/op	       0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=100/restartInterval=32/hasCache=true-16       	 1211478	      1040 ns/op	       0 B/op	       0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=100/restartInterval=64/hasCache=true-16       	 1238484	      1019 ns/op	       0 B/op	       0 allocs/op

- The versions=100 cases are ~10x slower than versions=1. A naive implementation would
  be expected to be 100x slower. But 10x slower is not great. A ratio closer to 1 is
  desirable.
  Looking at cpu profiles for
  BenchmarkValueBlocksNextSkipOlderVersions/versions=100/restartInterval=16/hasCache=true-16
  readEntryOrSkipIfOldVersion is ~90% of the time. This includes decoding various
  varints and assembling fullKey. Assembling fullKey is ~43%, even though it is
  O(key-length) only when there is a restart or the prefix changes.

  So the throughput is a direct result of the inherent limitation of storing all keys in
  the data blocks.

- The versions=10 cases are quite good -- ~1.5x slower than versions=1.
@sumeerbhola (Collaborator, Author)

I've added a new commit with results for NextSkipOlderVersions. The commit has details, and here is a summary:

  • With 10 versions, the results are competitive in comparison with 1 version (only 1.5x slower -- a naive iteration would be 10x slower since it would call Next 10 times).
  • 100 versions is ~10x slower (so the slowdown is accelerating, since a constant slope would mean 1.5*1.5=2.25x slower). That is still better than a naive 100x slowdown, but could be better. The major cost, based on CPU profiles, is inherent in storing all keys in the data blocks.

I thought some more about removing older version keys out of the data blocks: there are places in CockroachDB that do seeks using a key that has a non-empty version (intent resolution; pebbleMVCCScanner using the read timestamp; the deletion step of GC, which seeks to the version being deleted). To support these seeks, we would need to be able to accurately position among the older versions, which adds more complexity to how these older versions are stored and indexed.

So my current thinking is that removing older version keys out of the data blocks is too risky (from a performance perspective when older versions are needed) and too complicated as a first iteration over the sstable format. @petermattis let me know what you think.

…o key prefix compression

10 versions is now ~1x of 1 version and 100 versions is ~6x

BenchmarkValueBlocksNextSkipOlderVersions/versions=1/restartInterval=16/hasCache=true-16         	 9323478	       108.9 ns/op	       0 B/op	       0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=1/restartInterval=32/hasCache=true-16         	11679625	       103.3 ns/op	       0 B/op	       0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=1/restartInterval=64/hasCache=true-16         	11009826	       110.7 ns/op	       0 B/op	       0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=2/restartInterval=16/hasCache=true-16         	13600150	        74.92 ns/op	       0 B/op	       0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=2/restartInterval=32/hasCache=true-16         	13882718	        72.97 ns/op	       0 B/op	       0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=2/restartInterval=64/hasCache=true-16         	13649173	        81.17 ns/op	       0 B/op	       0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=10/restartInterval=16/hasCache=true-16        	10129204	       113.3 ns/op	       0 B/op	       0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=10/restartInterval=32/hasCache=true-16        	10582024	       109.6 ns/op	       0 B/op	       0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=10/restartInterval=64/hasCache=true-16        	10869471	       107.7 ns/op	       0 B/op	       0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=100/restartInterval=16/hasCache=true-16       	 1934696	       643.7 ns/op	       0 B/op	       0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=100/restartInterval=32/hasCache=true-16       	 2090880	       567.4 ns/op	       0 B/op	       0 allocs/op
BenchmarkValueBlocksNextSkipOlderVersions/versions=100/restartInterval=64/hasCache=true-16       	 2177060	       554.1 ns/op	       0 B/op	       0 allocs/op
@sumeerbhola (Collaborator, Author)

Added another commit that reduces 10 versions to 1x of 1 version and 100 versions to 6x of 1 version.

Most of the cost (75-80%) in the 100 versions case is in decoding varints as we step through all the key versions.
We can avoid this cost by dynamically switching (at read time) from stepping through each key to stepping through restarts. We can correctly (and cheaply) do the latter by storing a byte (bool) (at write time) with each restart key that indicates that all intervening keys since the previous restart in the same block are SETs, and the previous restart has the same prefix.

@petermattis (Collaborator) left a comment

Most of the cost (75-80%) in the 100 versions case is in decoding varints as we step through all the key versions.
We can avoid this cost by dynamically switching (at read time) from stepping through each key to stepping through restarts. We can correctly (and cheaply) do the latter by storing a byte (bool) (at write time) with each restart key that indicates that all intervening keys since the previous restart in the same block are SETs, and the previous restart has the same prefix.

I've seen varint decoding show up on sstable iterator benchmarks as well. I've wondered in the past if there is a better base block layout, possibly using a different varint encoding scheme. I did some experimentation in this area but never had satisfying results.

Your idea to mark restart keys is interesting. Would that bool be placed in the restart array? You could store this bool in the low order bit of the restart offset so that it would usually not consume any additional storage (though there would be an extra shift while decoding from the restart array). Another idea would be to store some sort of pointer to the next userkey with the value for the first userkey. That would be more complicated, but would make skipping larger numbers of versions very cheap.
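
A small Go sketch of the low-order-bit packing suggested above (the flag semantics follow this proposal; this is not existing Pebble code):

package sketch

// Restart offsets are uint32s in the block's restart array, and a block
// offset never needs all 32 bits, so the low bit can carry the proposed
// marker: every key since the previous restart is a SET sharing the previous
// restart's prefix. A reader skipping older versions can then jump from
// restart to restart instead of decoding every key.

func packRestart(offset uint32, allSetsSamePrefix bool) uint32 {
	packed := offset << 1 // costs an extra shift on decode, not an extra byte
	if allSetsSamePrefix {
		packed |= 1
	}
	return packed
}

func unpackRestart(packed uint32) (offset uint32, allSetsSamePrefix bool) {
	return packed >> 1, packed&1 == 1
}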

So my current thinking is that removing older version keys out of the data blocks is too risky (from a performance perspective when older versions are needed) and too complicated as a first iteration over the sstable format. @petermattis let me know what you think.

Your performance numbers are compelling. I'm good with continuing down the path you started on. This PR is a bit ginormous, though. Not sure where to concretely start reviewing it. Perhaps an RFC is in order.



-- commits, line 177 at r7:
1000-byte values feel larger than what I'd expect, especially for keys that are being overwritten frequently and thus have a lot of historic versions. I wonder if we can track down some actual value sizes from customer workloads that have experienced problems with historic versions. Or add a variant of these benchmarks for values that are 10 bytes in size (i.e. just a single column).

@sumeerbhola (Collaborator, Author)

I've seen varint decoding show up on sstable iterator benchmarks as well. I've wondered in the past if there is a better base block layout, possibly using a different varint encoding scheme. I did some experimentation in this area but never had satisfying results.

The only cheap thing I've encountered is fixed-width encoding, based on knowledge of the maximum value in some context (like a file), but then one has to store the width somewhere. I did that here for the value blocks index for a different reason (because I wanted to calculate the offset based on the blocknum).
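
To illustrate the point about computing the offset from the block number, a Go sketch with assumed entry widths (in the PR the actual widths are recorded in the valueBlockIndexHandle rather than hard-coded):

package sketch

import "encoding/binary"

// With fixed-width index entries, locating value block blockNum is a
// multiplication plus two fixed-width reads, rather than a search over a
// general-purpose key-value block. The 4-byte offset and 4-byte length
// widths below are assumptions for illustration.
const valueBlockIndexEntryWidth = 8

func lookupValueBlock(index []byte, blockNum uint32) (offset, length uint32) {
	entry := index[int(blockNum)*valueBlockIndexEntryWidth:]
	offset = binary.LittleEndian.Uint32(entry[0:4])
	length = binary.LittleEndian.Uint32(entry[4:8])
	return offset, length
}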

Your idea to mark restart keys is interesting. Would that bool be placed in the restart array? You could store this bool in the low order bit of the restart offset so that it would usually not consume any additional storage (though there would be an extra shift while decoding from the restart array).

I really like this idea! I was bothered by wasting a whole byte for a bool. We definitely don't need 32 bits for restart offsets (even if we end up having larger blocks for disaggregated shared sstables, they would still be O(1MB)).

1000-byte values feel larger than what I'd expect, especially for keys that are being overwritten frequently and thus have a lot of historic versions. I wonder if we can track down some actual value sizes from customer workloads that have experienced problems with historic versions. Or add a variant of these benchmarks for values that are 10 bytes in size (i.e. just a single column).

Yes, I was just trying to be extreme in this benchmark. I can look into historic reports of customer problems in this area to see if they mention the schema.

Your performance numbers are compelling. I'm good with continuing down the path you started on. This PR is a bit ginormous, though. Not sure where to concretely start reviewing it. Perhaps an RFC is in order.

Sorry, this degenerated into a series of hacky experiments piled onto a baseline commit. I agree this should get a design doc/RFC now that we have a design direction that seems to work. After that one can start extracting parts from here into smaller PRs for review.
So there isn't really a need to review this. For the other curious folks, value_block.go in the first commit (and the tiny comment in table.go) should be sufficient to understand what is being done here.

@petermattis (Collaborator)

The only cheap thing I've encountered is fixed-width encoding, based on knowledge of the maximum value in some context (like a file), but then one has to store the width somewhere. I did that here for the value blocks index for a different reason (because I wanted to calculate the offset based on the blocknum).

What I experimented with were group-varint encodings. StreamVByte is the latest hotness in that area that I'm aware of. Block records currently have 3 varints: the shared key prefix length, the unshared key length, and the value length. On the surface that seems amenable to some form of group varint encoding. But what I found in practice is that this didn't make a measurable difference on iteration speed, even though varint decoding was faster.

@sumeerbhola (Collaborator, Author)

Thanks for the links and the review!

This statement wasn't correct. In a queue-like workload, a record is added once (to the head of the queue), and then deleted once (from the tail of the queue). The problem with queue-like workloads is that there is a huge prefix of MVCC deletion tombstones that have to be iterated over at the tail in order to find the first live row. I don't think any of the approaches discussed in this PR address this problem.

I totally agree that the approach here doesn't solve the problem. But until we offer first-class queue support, folks will keep doing naive queues, and if they happen to have small keys (like UUIDs) and large values (diffs/log entries/...), this will at least reduce the volume of bytes they need to iterate over (which may not fit well in the block cache). The benchmark for this case (from a later commit) shows 99% of bytes in the value blocks for 1000-byte values. The CPU saving is smaller. The benchmark is not paying any IO latency (due to using a mem FS).

ValueBlocksDeletedValue/valueSize=1000/needValue=false/hasCache=false-16     65.9ns ± 3%    24.1ns ± 4%  -63.50%  (p=0.000 n=10+10)
value-size: 1000,data-block-bytes:    95216,total-bytes: 10100022,fraction-in-data-blocks:0.01

One could argue that this just delays when a user notices an unscalable approach, which is not a desirable outcome :)

@petermattis (Collaborator)

I had a realization today: storing the value separately from the key is exactly what WiscKey does. Only in this case you're only storing historic values separately. For large values, there is an argument to be made that we do this for all values. And we could also condition whether a value is stored separately based on the size of the value. I don't recall the threshold WiscKey uses, but I believe Badger used a threshold of 128 bytes.

@jbowens (Collaborator) commented Jan 18, 2022

I had a realization today: storing the value separately from the key is exactly what WiscKey does. Only in this case you're only storing historic values separately

Storing the values separately within the same sstable instead of a separate log (like WiscKey) does eliminate a lot of WiscKey's write amplification benefits, because a WiscKey compaction avoids rewriting values.

@sumeerbhola (Collaborator, Author)

I had a realization today: storing the value separately from the key is exactly what WiscKey does. Only in this case you're only storing historic values separately. For large values, there is an argument to be made that we do this for all values. And we could also condition whether a value is stored separately based on the size of the value. I don't recall the threshold WiscKey uses, but I believe Badger used a threshold of 128 bytes.

Storing the values separately within the same sstable instead of a separate log (like WiscKey) does eliminate a lot of WiscKey's write amplification benefits, because a WiscKey compaction avoids rewriting values.

These are all good points. There was a deliberate choice here not to go as far as WiscKey, and to give up (for now) on the write amp benefits:

  • WiscKey uses parallel reads to overcome the cost of values being arbitrarily ordered in the vLog. That approach may be fine when one is directly provisioning SSDs, but we have to be cognizant of what IOPS cost us in the public cloud (on virtualized block storage or on blob storage).
  • The vLog abstraction is sufficiently different from what we currently have that I think we would need to experiment a lot to be sure we are not regressing certain workloads. Also, we need to write a background GC for it. I think separate value blocks in the same file is an easier reach.
  • I think it will be viable to further extend the format here with value handles that point into other files, if we decide to in the future. So something like: inline the latest version if not a "very large" value; "very large" value in a separate file; remaining in value blocks in the same file.

@petermattis (Collaborator)

I wasn't meaning to imply that we should go the full WiscKey approach. #112 gives an overview of what that would involve. I very much like the simplification taken in this PR. My point was that there may be some learnings from WiscKey that apply here, such as having a size threshold for deciding when to store values inline vs out-of-line. And considering storing the most recent value out of line if its size is large.

@sumeerbhola (Collaborator, Author)

My point was that there may be some learnings from WiscKey that apply here, such as having a size threshold for deciding when to store values inline vs out-of-line. And considering storing the most recent value out of line if its size is large.

Good point. I suppose we could use something like a cumulative size threshold. I'll think some more about it.

@sumeerbhola (Collaborator, Author)

reminder: with MVCCValue, the stats adjustment code needs both the encoded size of the (localtimestamp, value) and whether the underlying value was empty (indicating an empty tombstone). We can use a bit for the latter and keep it in the key section.
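
A purely hypothetical Go sketch of carrying that bit in a per-key prefix byte; the bit assignments are assumptions for illustration, not the PR's encoding:

package sketch

// One spare bit in the per-key prefix byte could record "the underlying MVCC
// value is empty", so the stats adjustment code never needs to fetch the
// value to learn it was an empty tombstone. Bit assignments here are made up.
const (
	prefixValueIsHandle byte = 1 << 0 // assumed: the value is a handle into value blocks
	prefixValueIsEmpty  byte = 1 << 1 // assumed: the underlying value is empty
)

func isEmptyMVCCValue(prefix byte) bool {
	return prefix&prefixValueIsEmpty != 0
}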

@sumeerbhola (Collaborator, Author)

Coming back to the discussion about group varint encoding in #1443 (comment):

  • There are two places where we use groups of varints that are potential optimization targets, since they are present for each key-value pair (optimizing index entries is not worth it): the valueHandle is 3 uint32s (valueLen, blockNum, offsetInBlock), and blockIter.readEntry reads 3 uint32s (shared key length, unshared key length, value length).
  • StreamVByte and the like are designed for large streams of numbers, while we only have three, so they are not really applicable. But FTR, there is a better Go implementation of StreamVByte in https://github.com/thempatel/streamvbyte-simdgo
  • Group varint with 1 control byte followed by 3 uint32s seems the best prospect (see the sketch below). I hacked up some (unclean) implementations in sumeerbhola@8f82bd7 and did not see enough benefit in benchmarks to bother with the deficiencies (noted in the commit).
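
For reference, a generic Go sketch of the shape discussed in the last bullet: one control byte carrying three 2-bit byte widths, followed by the three values. This is an illustration, not the code in the linked commit.

package sketch

// encodeGroup3 appends three uint32s to dst using a single control byte:
// bits 0-1, 2-3, and 4-5 hold (byte width - 1) for each value, so a decoder
// learns all three widths from one byte instead of scanning varint
// continuation bits.
func encodeGroup3(dst []byte, vals [3]uint32) []byte {
	var ctrl byte
	body := make([]byte, 0, 12)
	for i, v := range vals {
		width := 1
		for width < 4 && v>>(8*uint(width)) != 0 {
			width++
		}
		ctrl |= byte(width-1) << (2 * uint(i))
		for j := 0; j < width; j++ {
			body = append(body, byte(v>>(8*uint(j))))
		}
	}
	return append(append(dst, ctrl), body...)
}

// decodeGroup3 is the inverse; it returns the three values and the total
// number of bytes consumed (control byte included).
func decodeGroup3(src []byte) (vals [3]uint32, n int) {
	ctrl := src[0]
	n = 1
	for i := 0; i < 3; i++ {
		width := int(ctrl>>(2*uint(i))&3) + 1
		var v uint32
		for j := 0; j < width; j++ {
			v |= uint32(src[n+j]) << (8 * uint(j))
		}
		vals[i] = v
		n += width
	}
	return vals, n
}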

@nicktrav (Contributor)

@sumeerbhola - shall we close this out?
