
docs: draft design sketch for ranged keys #1341

Merged
merged 1 commit on Sep 13, 2022

Conversation

sumeerbhola
Collaborator

Informs #1339

@sumeerbhola sumeerbhola requested review from erikgrinaker, jbowens and a team October 19, 2021 02:20

Collaborator Author

@sumeerbhola sumeerbhola left a comment

I've dumped the current text from the issue here so we can collaborate on the design.

Collaborator

@petermattis petermattis left a comment

I like pushing this complexity into Pebble rather than trying to do it above Pebble.


docs/rfcs/20211018_range_keys.md, line 38 at r1 (raw file):

There are two new operations: 

- `Set([k1, k2), [optional suffix], <value>)`: This represents the

We should come up with the real name for this operation at some point. Set and Del imply point operations to me. RangeSet and RangeDel would make sense, except RangeDel will likely get confused with DeleteRange. SpanSet and SpanDel?
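
A minimal sketch, in Go, of what these two write operations might look like as signatures, purely to anchor the naming question; the names (RangeKeySet, RangeKeyDelete) and parameter shapes here are placeholders rather than anything the RFC has settled on:

    package sketch

    // RangeKeyWriter is a hypothetical interface for the two operations under
    // discussion. Names and parameters are illustrative only.
    type RangeKeyWriter interface {
        // RangeKeySet maps [start, end)@suffix => value. A nil suffix stands for
        // the omitted optional suffix.
        RangeKeySet(start, end, suffix, value []byte) error
        // RangeKeyDelete removes the range key [start, end)@suffix; the suffix
        // must match the one used by the corresponding set.
        RangeKeyDelete(start, end, suffix []byte) error
    }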


docs/rfcs/20211018_range_keys.md, line 39 at r1 (raw file):

- `Set([k1, k2), [optional suffix], <value>)`: This represents the
  mapping `[k1, k2)@suffix => value`.

Should the optional suffix be explicitly supplied, or can we require that the suffixes returned by Split on k1 and k2 are equal?
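
A tiny sketch of the alternative being floated here, assuming a split function that behaves like Comparer.Split (returning the length of a key's prefix): derive the shared suffix from k1 and k2 and reject the write if they disagree. The helper name is hypothetical.

    package sketch

    import (
        "bytes"
        "fmt"
    )

    // sharedSuffix returns the suffix common to k1 and k2, erroring if the two
    // keys carry different suffixes. split is assumed to behave like
    // Comparer.Split, returning the length of the key's prefix.
    func sharedSuffix(split func([]byte) int, k1, k2 []byte) ([]byte, error) {
        s1, s2 := k1[split(k1):], k2[split(k2):]
        if !bytes.Equal(s1, s2) {
            return nil, fmt.Errorf("suffix mismatch: %q vs %q", s1, s2)
        }
        return s1, nil
    }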


docs/rfcs/20211018_range_keys.md, line 86 at r1 (raw file):

  iterators used for compaction writes.

- Point deletes only apply to points.

I think you can incorporate this bullet into the one above about "point keys and range keys do not interact". I'd find it clarifying to be explicit that Del operations only apply to range keys.


docs/rfcs/20211018_range_keys.md, line 91 at r1 (raw file):

  can remove part of a range key, just like the new Del we introduced
  earlier. Note that these two delete operations are different since
  the Del requires that the suffix match. A RANGEDEL on the other hand

What are the semantics of a RANGEDEL applying to a part of a range key? Consider a RANGEDEL [a@30, d@90) and a range-key [c,e)@70. Does the range-key get masked to become [d,e)@70?

Do range-key Del operations mask point operations? I think not based on other bullets, but that is another difference between the two operations. Is it problematic for CRDB if Del operations mask point operations? I ask because I find it a bit unfortunate that we're going to have both RANGEDEL and range-key Del. That feels like it will cause confusion. Perhaps it is necessary (I haven't thought through this as much as you have).


docs/rfcs/20211018_range_keys.md, line 153 at r1 (raw file):

Key() []byte
PointValue() []byte
RangeEndKey) []byte

Nit: s/RangeEndKey)/RangeEndKey()/g


docs/rfcs/20211018_range_keys.md, line 406 at r1 (raw file):

cleanly applied to range keys since it is not limited to a subset of
versions. However, currently one can also construct RANGEDELs like
[b@30,d@90), which are acceptable for bulk deletion of points, but

Do we ever construct RANGEDELs in CRDB like [b@30,d@90)? My recollection is that we're always passing in an empty MVCC timestamp for DeleteRange operations.

Collaborator

@jbowens jbowens left a comment


docs/rfcs/20211018_range_keys.md, line 370 at r1 (raw file):
[Continuing our discussion from #1339]

There are difficulties here. For example [b@30,c). How does it represent that bb@49 is masked?

Good point. I do think we want the sstable and its ranges to be self-describing, even if we think we could work around it. It seems subtle and error prone otherwise. I think your diagram from the range deletions document is a useful visualization here if we redefine the y-axis to be timestamp:

     ^
     |   .       > .        > .        > .        > .↥
     |   .      >  .       >  .       >  .       >  .
     |   .     >   .      >   .      >   .      >   .
@50  |   V    >    x↧    >    .     >    .     >    .
@40  |   .   >     x.   >     x.   >     y.   >     .
@30  |   .  >      x.  >      x.  >      y.  >      .
@20  |   . >       x. >       y. >       y. >       .
@10  |   .>        x.>        y.>        y.>        .
     -------------------------------------------------------->
                   k        IS(k)    IS(IS(k))  IS(IS(IS(k)))
                   k          k1         k2         k3

Say we want to split this range into two sstables, x and y, such that k1@20 is the first key in the second sstable. We need to be able to encode that a range spans whole user keys, and also that a range acts only within the timestamps of a key.

We could define [a@t1,b@t2) as covering the keys in [a,b)@t1 + b@[t1,t2). In the above diagram, sstable x would have a range [k@50, k1@20). The sstable y needs to encode the rest of the key k1, and could do that as [k1@20,ImmediateSuccessor(k1)). Then, the remainder of the whole user keys covered could be represented as [ImmediateSuccessor(k1)@50,k3).
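
Since ImmediateSuccessor is doing a lot of work in this scheme, here is a minimal sketch of one common definition under bytewise key ordering: append a zero byte, which yields the smallest key that sorts strictly after k. Whether Pebble would define it exactly this way is an assumption.

    package sketch

    // immediateSuccessor returns the smallest key sorting strictly after k under
    // bytewise comparison: k followed by a 0x00 byte.
    func immediateSuccessor(k []byte) []byte {
        succ := make([]byte, len(k)+1)
        copy(succ, k) // the trailing byte is already zero
        return succ
    }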

Collaborator Author

@sumeerbhola sumeerbhola left a comment


docs/rfcs/20211018_range_keys.md, line 91 at r1 (raw file):

What are the semantics of a RANGEDEL applying to a part of a range key? Consider a RANGEDEL [a@30, d@90) and a range-key [c,e)@70. Does the range-key get masked to become [d,e)@70?

Good question. This is why the later text introduces RANGEDEL_PREFIX that can only be of the form [a, d).

Do range-key Del operations mask point operations? I think not based on other bullets, but that is another difference between the two operations. Is it problematic for CRDB if Del operations mask point operations?

I realized I've used masking in a different sense below for MVCC version masking. So here let me use the word overwrite. I'm actually not sure whether having range keys not overwrite point keys and vice versa is the right thing to do. And now that we need ImmediateSuccessor, should there really be a difference between b and [b,ImmediateSuccessor(b))? The ImmediateSuccessor also makes it viable to construct user keys to accomplish such overwrites. One practical issue is that point keys are deterministic -- if we blur the difference between point and range keys, code that is trying to anchor itself at deterministic places (like the metamorphic test) may no longer be able to do so. We could still preserve the difference from an iterator and storage perspective while doing the overwrites. In that world a range-key Del operation would also apply to point keys.


docs/rfcs/20211018_range_keys.md, line 406 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Do we ever construct RANGEDELs in CRDB like [b@30,d@90)? My recollection is that we're always passing in an empty MVCC timestamp for DeleteRange operations.

We do, in https://github.com/cockroachdb/cockroach/blob/c8c186f2d27f24c489aba5f6a38b57b48f628148/pkg/storage/mvcc.go#L2171-L2213
That is the only place (it's easy to tell now since the Writer makes the caller distinguish between ClearMVCCRangeAndIntents(start, end roachpb.Key) and ClearMVCCRange(start, end MVCCKey)).

Collaborator Author

@sumeerbhola sumeerbhola left a comment


docs/rfcs/20211018_range_keys.md, line 370 at r1 (raw file):

We could define [a@t1,b@t2) as covering the keys in [a,b)@t1 + b@[t1,t2). In the above diagram, sstable x would have a range [k@50, k1@20). The sstable y needs to encode the rest of the key k1, and could do that as [k1@20,ImmediateSuccessor(k1)). Then, the remainder of the whole user keys covered could be represented as [ImmediateSuccessor(k1)@50,k3).

In this example are point keys for k1 spanning the two sstables?

Another alternative:

  • [k,k1)@50 in first sstable. Does not apply to any k1 prefix.
  • [k1, ImmediateSuccessor(k1))@50 in first sstable, because we want it to version-mask point key k1@49 in the first sstable
  • First point key in the second sstable is k@40. So include [k1, ImmediateSuccessor(k1))@41 in the second sstable.

The semantics of a range key are simpler in this case. But we can't in the future merge [k, ImmediateSuccessor(k1))@50, [k1, ImmediateSuccessor(k1))@41, unlike your idea.

What bothers me about both options is that we are starting to construct versions that were not in the original write, and I think that it will increase complexity.

I'm still wondering whether we can get away with [k1, ImmediateSuccessor(k1))@50 in the first sstable being able to version-mask the k@40 in the second sstable. Given that range keys do not follow the LSM invariant, I suspect we'll need to keep internal state in the iterator about ranges from sstables that are no longer loaded. I need to think more about how the iterator would function ignoring any issues wrt improperly split range keys.

Collaborator

@jbowens jbowens left a comment


docs/rfcs/20211018_range_keys.md, line 370 at r1 (raw file):

In this example are point keys for k1 spanning the two sstables?

Yeah.


Given that range keys do not follow the LSM invariant

This might be a dumb question, but why do they not follow the LSM invariant with respect to the keys they cover? Can we do more or less exactly what we do with range deletions?

We have a versioned keyspace like: a, a@t10, a@t8, a@t3, b, b@t11, b@t12. We could choose to define Pebble range keys as occupying the one-dimensional keyspace just like Pebble range deletions. A range key [a@t10, b@t11) covers all keys k s.t. a@t10 ≤ k < b@t11 in the one-dimensional space. It doesn't mask all of them. Those semantics are up to the iterator consumer.

Keys with higher sequence numbers but the same prefix may descend beneath other keys with lower sequence numbers and the same prefix. So keys with the same prefix may be jumbled up and down the LSM, regardless of sequence order. However, since range keys cover that same one-dimensional space, the LSM invariant can be maintained with respect to the relationship between the range key and the point keys it covers. Range keys are fragmented across sstables and contribute to sstable boundaries like range deletions do. [It seems like we could also do the same "effective-within" logical truncation that we do with range deletions to avoid constructing synthetic keys using ImmediateSuccessor?]

With this structure the mergingIter would keep one sstable's range key block open at every level of the LSM just like it does with range deletions.
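
A minimal sketch of the one-dimensional "covers" relation described above, assuming cmp behaves like the Comparer's Compare over full keys (suffix included); whether a covered point is also masked is left to the consumer of the iterator.

    package sketch

    // covers reports whether the range key [start, end) covers key k in the
    // one-dimensional key order, i.e. start <= k < end.
    func covers(cmp func(a, b []byte) int, start, end, k []byte) bool {
        return cmp(start, k) <= 0 && cmp(k, end) < 0
    }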

Collaborator Author

@sumeerbhola sumeerbhola left a comment


docs/rfcs/20211018_range_keys.md, line 370 at r1 (raw file):

but why do they not follow the LSM invariant with respect to the keys they cover? Can we do more or less exactly what we do with range deletions?

I misspoke in my previous comment. I had meant to say "don't follow the LSM invariant for prefixes". They do follow the LSM invariant for the user key since [k1, k2)@v#s where k1 != k2 contributes k1@v#s to the sstable start bound and k2#inf to the sstable end bound.

I thought a bit about the mergingIter case and I think what is written in the current text of the doc suffices. Consider the two cases (let's ignore equal versions for point and range keys for now -- we need to clean up the spec in that area):

  • [k1,k2)@v#s where k1 != k2: this guarantees that there is no other sstable on the same level that has (a) point key k1@v2 and v2 < v, (b) point keys with prefixes in [k1, k2) and prefix > k1. So once the masking iterator has "reached" k1@v it can stay at this sstable (for this level) until the heap for points reaches k2. Which means this sstable stays loaded (at this level) while we need to use [k1,k2)@v#s to compute the masking behavior.
  • [k1,IS(k1))@v#s: This behaves like a point k1@v#s. There can be a point at this level with k1@v2 and v2 < v, which is in the next sstable. But before we transition to that next sstable, the merging iterator only needs to remember that it has emitted something for prefix k1 (which is the last prefix it has emitted), and so it should skip past all other k1 prefixed keys. It doesn't even need to remember @v since the heap ordering will ensure that any k1 it encounters in the future will have version < v.
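
A small sketch of the bookkeeping in the second case: remember the last prefix for which something was emitted and skip later keys sharing that prefix, relying on heap order to guarantee those keys carry older versions. equal and split are assumed to behave like the Comparer's Equal and Split.

    package sketch

    // skipSamePrefix reports whether key k should be skipped because something
    // was already emitted for its prefix.
    func skipSamePrefix(equal func(a, b []byte) bool, split func([]byte) int, lastEmittedPrefix, k []byte) bool {
        return lastEmittedPrefix != nil && equal(lastEmittedPrefix, k[:split(k)])
    }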

Collaborator

@jbowens jbowens left a comment


docs/rfcs/20211018_range_keys.md, line 370 at r1 (raw file):


That makes sense, but isn't it still more complicated than fragmenting these ranges like we do range deletions, such that each sstable is self-describing?

I might be overlooking something, but it seems like fragmenting these ranges within the one-dimensional space and ignoring prefixes would also allow us to not require ImmediateSuccessor implementations. Reusing the same concepts as range tombstones also reduces complexity.

Contributor

@erikgrinaker erikgrinaker left a comment

When discussing range tombstones for CRDB, we noted that a key span which sees frequent range deletes would keep accumulating fragments over time: any new range key that spans an existing range key has to write the same number of fragments or more. Without an opposing mechanism that opportunistically merges fragments (e.g. during compaction/GC), the number of fragments could build up until it becomes pathological. I could not see any mention of this in the document; should we consider merging fragments during compaction where possible to avoid this?


Collaborator Author

@sumeerbhola sumeerbhola left a comment


docs/rfcs/20211018_range_keys.md, line 370 at r1 (raw file):
We probably need a more synchronous discussion.

That makes sense, but isn't it still more complicated than fragmenting these ranges like we do range deletions, such that each sstable is self-describing?

I'm not sure what that would really be like.
RANGEDELs are "easier" since they are not made visible to the user via the Iterator interface, so the improper splitting is hidden. One can't also delete parts of a RANGEDEL with another user operation, like we can do with these range keys.
Which is what makes me happiest with designs that only store and expose range keys of the form [k1,k2)@v, and have the sstable bounds be self-describing as it pertains to these range keys.

Collaborator Author

@sumeerbhola sumeerbhola left a comment

should we consider merging fragments during compaction where possible to avoid this?

good point -- that was an omission that should be easy to rectify.


Collaborator

@jbowens jbowens left a comment


docs/rfcs/20211018_range_keys.md, line 370 at r1 (raw file):


I suspect my thinking is still anchored to some of the earlier thinking, before we went down the path of interleaving range keys in the iterator output.

I was originally thinking of range keys as "coloring" a span of the 1-dimensional keyspace. You can use range keys to say all point keys in the span [a,c) have color blue. All point keys in the span [b,d) have color green. Or all keys in the span [d@t11, d@t3) have color orange. During iteration, the iterator surfaces a point key and all covering 'colors' applied through range keys. The user above Pebble is responsible for applying whatever semantics they attribute to the colors. These range keys are conceptually similar to range tombstones, since their spans are defined over the 1-dimensional key space only and know nothing of key prefixes. Unlike range tombstones, they don't overwrite each other. They stack.

In the case of MVCC delete range, the timestamp of a delete range is stored outside of the span (eg, in the value, or the "param" in your earlier iteration). You might have a range key that was fragmented into the span [d@t11, d@t6) but separately encodes a timestamp <@t4 that the user-level semantics interpret as irrelevant.

The Pebble iterator needs to expose every covering color to the user, because it knows nothing of the semantics of a color. To avoid this, I think you could still add Pebble knowledge of these separate suffixes associated with a range key so that Pebble only surfaces 1 range key for a suffix greater than the current point key's suffix. I think the user would need to provide a comparator for suffixes to be able to do this.

These 1-dimensional range keys are nice because they act very similar to RANGEDELs and can easily be split to accommodate compactions splitting prefixes across multiple sstables.

Collaborator Author

@sumeerbhola sumeerbhola left a comment


docs/rfcs/20211018_range_keys.md, line 370 at r1 (raw file):
(I need to think more about what you are proposing, but some preliminary questions below)

Going back to your earlier example:

We could define [a@t1,b@t2) as covering the keys in [a,b)@t1 + b@[t1,t2). In the above diagram, sstable x would have a range [k@50, k1@20). The sstable y needs to encode the rest of the key k1, and could do that as [k1@20,ImmediateSuccessor(k1)). Then, the remainder of the whole user keys covered could be represented as [ImmediateSuccessor(k1)@50,k3).

  • IIUC, when the user originally writes [k1,k2)@t, in this model it would actually be [k1@t,k2) with value <@t. Is that correct?
  • One of the reasons to not expose all the colors is that point keys are no longer coincident with range keys when we are dealing with MVCC keys. For example, with [a,c)@50, when we expose point a@60 through the iterator, the range key does not overlap, but when we expose b@60, there is an overlap based on just the key range (if we use the 1-dimensional keyspace). Semantically, it seems dubious to have this difference since both a@60 and b@60 are not masked, and the iterator should not behave differently for the two (whether in masked mode or not).
  • Hence the non-masked mode has the repetition semantics in order to be consistent, and output a@60, [a,c)@50, b@60, [b,c)@50. When exactly this repetition occurs needs some cleanup I think -- the text currently says only if there is a lower versioned point that succeeds. It may be that we want to repeat if there is any point at the prefix, regardless of what version it has.
  • I am not sure we need the generality (for the user) implied by "In the case of MVCC delete range, the timestamp of a delete range is stored outside of the span" and "needs to expose every covering color to the user, because it knows nothing of the semantics of a color". Pebble has documented split as being for MVCC, and Pebble needs some semantic knowledge for doing efficient masking. I suspect you are leaning in the same direction with the statement about "comparator for suffixes".

The main question I have is what are we getting in terms of (a) benefit to user semantics, or (b) implementation benefit, by supporting [k1@t1,k2@t2), compared to [k1,k2)@t. I believe the latter is easier to explain. It is very likely that I am missing something.

Collaborator

@jbowens jbowens left a comment


docs/rfcs/20211018_range_keys.md, line 370 at r1 (raw file):

IIUC, when the user originally writes [k1,k2)@t, in this model it would actually be [k1@t,k2) with value <@t. Is that correct?

Yeah, that's right.

Semantically, it seems dubious to have this difference since both a@60 and b@60 are not masked, and the iterator should not behave differently for the two (whether in masked mode or not).

I think the primary difference between these two paradigms is whether or not Pebble has first-class MVCC support, or continues to work within a 1-dimensional realm.

Pebble needs some semantic knowledge for doing efficient masking

I'm not sure about this, but I thought that it only needed to expose the ability to a) dynamically mask using a block property filter and b) to expose the covering ranges. This might be what I'm overlooking.

The main question I have is what are we getting in terms of (a) benefit to user semantics, or (b) implementation benefit, by supporting [k1@t1,k2@t2), compared to [k1,k2)@t. I believe the latter is easier to explain.

The benefits I see are:

  1. Pebble continues to deal with 1-dimensional keyspace only. Although Pebble knows about Comparer.Suffix, its usage is extremely limited. MVCC is a user-level projection onto Pebble's 1-dimensional keyspace, and prefix-range keys are a divergence from that.
  2. The range keys and RANGEDELs mirror each other closely. I suspect this conceptual similarity will make implementation easier, in compactions, fragmenting and maybe iteration. Also, it removes the need for RANGEDEL_PREFIX since range keys and RANGEDELs occupy the same 1-dimensional keyspace.

Collaborator Author

@sumeerbhola sumeerbhola left a comment


docs/rfcs/20211018_range_keys.md, line 370 at r1 (raw file):


I tried to flesh out the behavior of this 1-dimensional key space model.

I considered [k1@t,k2)#s, < @t in this model, which is what the user will write. The semantic equivalent in the other model (which offers less flexibility) is [k1,k2)@t#s. The behavior wrt 1-dimensional keys is the same for both: the start is k1@t#s and the end is k2#inf.

Now we start considering the differences.

  • Say we have [a@50,d)#s <@50 and we have a compaction output where point c@40#s is in the first sstable and c@39#s is in the second sstable. We would write [a@50,c@39)#s < @50 to the first sst and [c@39,d)#s < @50 to the second sst. The other model would write [a,c)@50#s and [c,IS(c))@50#s to the first sst and [IS(c),d)@50#s to the second sst. Both accomplish the requirement that the two ssts are non-overlapping: the first model ends the first sst at c@39#inf and starts the second sst at c@39#s, and the second model ends the first sst at c@40#s and starts the second sst at c@39#s. The second model needs the information from the first sst to mask the c in the second sst (we've discussed how this can be handled in an earlier comment). The first model does not need that. Note that in this example I used the same seqnum for all these points and ranges -- say they came about due to sst ingestion.

  • Non-masked iteration: say there is also a point c@60 in the previous example. And consider the case where we haven't fragmented the range key. How will we present this output in the iterator? One possibility is to break at 1-dimensional key boundaries: so [a@50,c@60), (c@60, [c@60,c@40)), (c@40, [c@40,c@39)), ... This is fine, but IMO less satisfying because c@60 sees this mixing while a hypothetical a@60 doesn't, and both are not masked.

  • For the same non-masked iteration how will it behave from the user's perspective when we have the two sstables? Will we expose the [a@50,c@39)#s < @50 and [c@39,d)#s < @50 to the user, or will we try to merge them? If we expose these, I worry about complexity in explaining this. Successfully merging these will require looking ahead in the iterator, which may not be practical since it involves switching to a different sstable in the same level.

  • Now say we had an [a@10,d)#s <@10 written to the LSM prior to the write of [a@50,d)#s <@50. How should Pebble interpret this newer write in the 1-dimensional key space? In the 1-dimensional key space the second write subsumes the first. What prevents it from being compacted away? Is it because the values are different?

  • The same issue arises when someone writes a del [a@50,d)#s <@50. We need to prevent it from deleting [a@10,d)#s <@10 too. The value does not save us since there is no value written with the del.

Collaborator

@jbowens jbowens left a comment


docs/rfcs/20211018_range_keys.md, line 370 at r1 (raw file):
@sumeerbhola I fleshed out some more thoughts on how I think this timestamp-carrying value would work, and I appended them to this document. The 'coloring' design aims to maintain as much parity between delete ranges and range keys as possible, so that all our learnings from one apply directly to the other.

For the same non-masked iteration how will it behave from the user's perspective when we have the two sstables. Will we expose the [a@50,c@39)#s < @50 and [c@39,d)#s < @50 to the user, or will we try to merge them. If we expose these, I worry about complexity in explaining this. Successfully merging these will require looking ahead in the iterator, which may not be practical since it involves switching to a different sstable in the same level.

I think the explanation might be clear within the context of Pebble not understanding MVCC timestamps at all. Pebble surfaces range keys that cover point keys. The user is responsible for imposing the second dimension of timestamps. It must look at the point key, then look at the range keys (that it knows overlap in key prefix, because otherwise Pebble wouldn't have surfaced it) to determine if they overlap within the timestamp dimension.

The same issue arises when someone writes a del [a@50,d)#s <@50. We need to prevent it from deleting [a@10,d)#s <@10 too. The value does not save us since there is no value written with the del.

I think there would be a param/value written with the del in that alternative design, because the param/value is what differentiates range keys at different timestamps.

@sumeerbhola
Collaborator Author

The 'coloring' design aims to maintain as much parity between delete ranges and range keys as possible, so that all our learnings from one apply directly to the other.

I'm somewhat lost -- could you add a few examples?

Collaborator

@jbowens jbowens left a comment

I'm somewhat lost -- could you add a few examples?

Sorry, I'm doing a poor job explaining here. Maybe walking through an
evolution from range tombstones into these range keys will make their
similarities and differences clearer, and highlight anything I'm
overlooking:

Imagine we duplicate the entire implementation of range deletions
alongside the existing one. There's a separate block within an sstable,
a separate RANGEKEY key kind. All the internal iterator state,
fragmentation, etc is identical and parallel.

We omit the IsDeleted calls that check whether point keys are covered by
these alternate reality RANGEKEY keys, so they don't drop points. We
omit the cascading iterator seeking optimizations, and we omit the
elision of these keys. These alternate reality range keys don't do
anything: they're just a span that we can set that gets fragmented and
compacted alongside other keys.

There would be some unknown additional complexity in compactions, just
because there are now two streams of range keys instead of one: Anywhere
we consult c.rangeDelFrag we also need to consult c.rangeKeyFrag. We
need to flush both up to any sstable split point. All the truncation
semantics are the same. All the sequence number semantics with respect
to other keys of the same kind are the same as well (eg, RANGEDELs
cause other overlapping RANGEDEL fragments within the same snapshot to
be dropped. If they're separated by a snapshot, both are kept.)

Now we add the ability to associate a value with these alternate
range keys. A normal range tombstone deleting [a,c)#3 looks like

  a.RANGEDEL#3: c

For these alternate keys, we prefix the end key with a varint encoding
of the end key's length and stick the value after:

  a.RANGEKEY#3: <varint-encoding(1)>c<value>

When we feed these range keys into the fragmenter, we need to first
parse the varint. We provide the start and end key but also the value.
The fragmenter just passes the opaque value around, preserving the value
on both sides of a fragmentation point. We can use the same fragmenter
implementation for range deletions and range keys. Range tombstones just
always have a nil value.
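
A hedged sketch of that value layout (a uvarint length of the end key, the end key bytes, then the user value); this only illustrates the encoding described above and is not Pebble's actual on-disk format.

    package sketch

    import (
        "encoding/binary"
        "errors"
    )

    // encodeRangeKeyValue packs the end key and user value into a RANGEKEY's
    // value field as <uvarint(len(end))><end><value>.
    func encodeRangeKeyValue(end, value []byte) []byte {
        buf := make([]byte, binary.MaxVarintLen64+len(end)+len(value))
        n := binary.PutUvarint(buf, uint64(len(end)))
        n += copy(buf[n:], end)
        n += copy(buf[n:], value)
        return buf[:n]
    }

    // decodeRangeKeyValue splits the packed form back into end key and value.
    func decodeRangeKeyValue(b []byte) (end, value []byte, err error) {
        endLen, n := binary.Uvarint(b)
        if n <= 0 || uint64(len(b)-n) < endLen {
            return nil, nil, errors.New("corrupt range key value")
        }
        return b[n : n+int(endLen)], b[n+int(endLen):], nil
    }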

Now we have a range key that associates a key range with a value.
There's no way to read these range keys, but since we magically
duplicated all the range deletion iterator code, our levelIters are
maintaining pointers into their current open sstables' RANGEKEY blocks
and keeping them in sync with the levelIter's point key.

Since we duplicated RANGEDEL code and semantics with respect to
tombstones within the same snapshot overwriting each other, new RANGEKEY
keys overwrite each other in compactions. For example if #5 and #10 are
in the same snapshot stripe and meet in a compaction, #10 wins:

  b.RANGEKEY#10:c,bar and a.RANGEKEY#5:z,foo

compacted become

 a.RANGEKEY#5:b,foo  b.RANGEKEY#10:c,bar  c.RANGEKEY#5:z,foo

These are functional range keys. We could now add a method to the pebble
Iterator called RangeValue() []byte that just returns the value of the
highest-levelled range key that overlaps the iterator's current point
key position (and has a seqnum less than our iterator's seq num). This
would just need to iterate through the levelIters, checking if any of
their RANGEKEY block pointers are pointing to a key covering the point.
Now we have a simple interface that exposes a 'range' value associated
with every point key and reuses almost all of the mechanics, semantics,
etc. of range tombstones.

We removed the elision of these RANGEKEY keys earlier, so now there's no
way to unset these RANGEKEYs. We account for this by allowing only
RANGEKEYs with zero-length values to be elided when in the bottommost
level and the last snapshot stripe. Removing these RANGEKEYs is done by
setting an empty value.

These range keys allow us to associate a value with a key range, and
query a point's range value just as efficiently as we can check if a
point is deleted by a range tombstone.

But they're very hard to use to implement MVCC delete range. Writing a
RANGEKEY overwrites the RangeValue for all points covered by the
RANGEKEY. We need to be able to mark deleted [a,f)@t2 and [b,c)@t3 and
remember that [b,c) was deleted at t2 and t3.

To solve that problem, we revisit the decision to inherit RANGEDEL's
semantics of RANGEDEL keys overwriting each other. When two overlapping
range keys meet in a compaction, the fragmenter splits them into two
fragments with identical bounds but different sequence numbers and
different values:

    b.RANGEKEY#10:c,t3
    b.RANGEKEY#5:c,t2

We inherited the behavior of RANGEDELs that drops the second key.
Instead, we alter the compaction iterator handling of RANGEKEYs to
invoke a user-defined 'MergeRangeValue' operation on the two values,
returning a combined value. The compaction invokes the user-defined
MergeRangeValue operation, passing t3,t2. The pkg/storage
MergeRangeValue function returns an encoded representation of both
timestamps [t3, t2]. Instead of outputting just the original fragment
b.RANGEKEY#10:c,t3, the compaction outputs b.RANGEKEY#10:c,[t3,t2].

We're keeping the same number of fragments as before, and the logic
surrounding the range key bounds (inherited from RANGEDELs) is
unchanged. The only difference is that values are merged.
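
A hypothetical sketch of what such a user-defined MergeRangeValue could look like for the delete-range use case, with a scalar Timestamp standing in for CockroachDB's timestamp type and the merge being a descending, de-duplicated union; the real function would operate on encoded byte values.

    package sketch

    import "sort"

    // Timestamp is a stand-in for an MVCC timestamp; larger means newer.
    type Timestamp uint64

    // MergeRangeValue combines the timestamp sets carried by two overlapping
    // range-key fragments into one descending, de-duplicated list. The merge is
    // idempotent: merging an already-merged result with one of its inputs
    // returns the same result.
    func MergeRangeValue(newer, older []Timestamp) []Timestamp {
        merged := append(append([]Timestamp(nil), newer...), older...)
        sort.Slice(merged, func(i, j int) bool { return merged[i] > merged[j] })
        out := make([]Timestamp, 0, len(merged))
        for i, ts := range merged {
            if i == 0 || ts != merged[i-1] {
                out = append(out, ts)
            }
        }
        return out
    }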

Now we can write a RANGEKEY([a,f),@t2) and a RANGEKEY([b,c),@t3) and
ensure we don't lose information about @t2. We still need to change
the iterator interface though. If the two RANGEKEYs are still in
different levels of the LSM, an iterator positioned at b could have a
L2 levelIter whose rangeKeyIter is pointing at RANGEKEY([b,c),@t3) and
a L6 levelIter whose rangeKeyIter is pointing at RANGEKEY([a,f),@t2).

A straightforward solution is to implement (*Iterator).RangeValue() by
calling the user-defined MergeRangeValue, merging all of these
overlapping range keys' values, surfacing the merged value to the user.
There are two problems:

  1. Merging at read-time might be inefficient. The user knows the
    semantics of their own range values, and they might not need to read all
    of the operands.
  2. The API doesn't expose how long this RangeValue extends for. The
    completely merged value is only valid until the earliest end boundary of
    all the merged fragments. The iterator could find which one is earliest
    and surface it, but that requires O(fragments) key comparisons, which
    again might not be necessary depending on the range value's
    user-prescribed semantics.

Instead, we can expose each of these fragments, including start and end
bounds and intermediary values, to the end user. We can expose them
ordered, and provide a convenience function that does compute the merged
value using MergeRangeValue. In the case of CockroachDB's MVCC Delete
Range, this allows CockroachDB to:

a) Access the fragment start and end bounds associated with range values.
b) Find the first fragment encoding a timestamp value that is beneath the
read timestamp and use that timestamp as the effective delete-range
timestamp, avoiding unnecessarily merging all the fragments in the common
case of recent read timestamps.
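
A sketch of point (b), reusing the scalar Timestamp stand-in from the merge sketch above: scan a covering fragment's descending timestamp list for the first entry at or below the read timestamp and treat it as the effective delete-range timestamp (whether the comparison is strict is an MVCC-level detail, not Pebble's).

    package sketch

    // Timestamp is the same stand-in used in the MergeRangeValue sketch.
    type Timestamp uint64

    // effectiveDeleteTimestamp returns the newest delete-range timestamp visible
    // at readTS, given a fragment's timestamps in descending order.
    func effectiveDeleteTimestamp(fragmentTS []Timestamp, readTS Timestamp) (Timestamp, bool) {
        for _, ts := range fragmentTS {
            if ts <= readTS {
                return ts, true
            }
        }
        return 0, false
    }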

Much of the semantics and implementation of these RANGEKEYs are just
copied over from RANGEDELs. These RANGEKEYs only deal with opaque Pebble
user keys, just like RANGEDELs, and know essentially nothing of the MVCC
semantics.

It is possible to have fragmented RANGEKEYs that aren't useful for
CockroachDB. For example, consider a DEL-RANGE([a,b)@t20). If a
compaction is splitting the various a@ MVCC keys into separate
sstables, you could end up with an sstable with keys:

a@90:foo1
a@80:foo2
a@80:foo3
...
a@50:foo

The bounds of this sstable might be [a@90,a@40). The 1-dimensional
Pebble key range of the DEL-RANGE RANGEKEY overlaps this sstable,
so this sstable contains a single RANGEKEY fragment whose
"effective-within" bounds are [a@90,a@40). From the perspective of the user
(CockroachDB's MVCC package), this isn't meaningful. None of the keys
contained within the sstable are deleted by the RANGEKEY key. Whenever a
CockroachDB MVCC iterator asks whether any of these point keys are
deleted by a MVCC Delete Range, it would find the @t20 timestamp as the
earliest timestamp less than the read timestamp and find that the
relevant user key's timestamp is >@t20, deciding the key is not
deleted by a MVCC delete range.

It seems this redundancy is inherent to the MVCC package's projection of
a two-dimensional MVCC key space onto Pebble's one-dimensional key
space. This redundancy is similar to the requirement that a CockroachDB
MVCC iterator skip over redundant MVCC versions in order to arrive at a
live one. Any alternative that avoids this artifact would require
Pebble becoming more intimately intertwined with the two-dimensional
MVCC key space, and would add significant design and implementation
distance between our existing Pebble range tombstones and Pebble
range keys.


Collaborator

@jbowens jbowens left a comment

One bit I didn't give enough attention to above is overlapping RANGEKEY fragments separated by a snapshot within the same sstable. Consider an open snapshot at #7, and suppose these two range keys meet in a compaction:

    b.RANGEKEY#10:c,t3
    b.RANGEKEY#5:c,t2

We need to preserve the history at the earlier snapshot, but that competes with our merging, which ensures that an iterator at any given sequence number only needs to examine 1 RANGEKEY per level to figure out the 'range value' of a point. Since these two fragments are unmerged, a read at sequence number #12 needs both fragments' values. A couple of options:

  1. Merge their values as we would if there was no snapshot, but still preserve the older key. The sstable would contain b.RANGEKEY#10:c,[t3,t2] and b.RANGEKEY#5:c,t2. An iterator at #12 only needs to read the first fragment, and the invariant that an iterator only needs to expose at most O(num-levels) range fragments is maintained. This might add complexity to compactions though, because when this sstable is compacted, we don't know that b.RANGEKEY#10:c,[t3,t2] and b.RANGEKEY#5:c,t2 came from the same sstable and don't require merging. In the case of CockroachDB's MVCC Delete Range, I don't think this case matters since value merging can be made idempotent for that use case.
    1. Require that MergeRangeValue implement semantics such that MergeRangeValue(a, b) = c = MergeRangeValue(c, b).
    2. Plumb knowledge of the originating sstable, so that compactions can avoid unnecessary merging.
  2. Store the fragments unmerged and surface all of them to the iterator. Instead of O(N) range fragments, the iterator may need to consider O(N+M) fragments, where N is the number of levels and M is the number of taken snapshots. Additionally, an iterator may now need to Next in a levelIter's rangeKeyIter to check for additional unmerged fragments.

I lean towards either of option 1's variants because option 2 complicates iteration.


@sumeerbhola
Collaborator Author

I am still thinking about this -- some initial thoughts.

  • I think partly this proposal is saying that we should continue with the same model of [userkey1, userkey2)#seq that was introduced for the non-MVCC case -- i.e., don't alter it for MVCC keys in any way.

  • One could then make current RANGEDELs apply to both point and range keys even in the MVCC case. Or maybe not, since the intention in some cases is efficient clearing of a large number of points.

  • The b.RANGEKEY#10:c,[t3,t2] and the running example are actually closer to the original proposal, but without being prescriptive. I think what you are saying is that CockroachDB will only really write range keys where the start and end key are prefixes. And instead of Pebble knowing the shared suffix as a first-class concept, the user can put it in the value. Keeping the merged values as a set is an implementation detail that Pebble could very easily accomplish itself, so I don't think of that implementation as a real difference.

  • Regarding iteration, "Instead, we can expose each of these fragments, including start and end bounds and intermediary values, to the end user. We can expose them ordered, and provide a convenience function that does compute the merged
    value using MergeRangeValue." I think this is where explicit knowledge of the set and knowing that it is a set of versions potentially allows for a simpler interface -- expose them in version order as the iterator is stepped forward or backward.

  • I am coming to the conclusion that we are not disagreeing much regarding the underlying implementation approach, but mostly about what is understood by Pebble, and consequent effects on the interface.

    • The less prescriptive approach allows for more flexibility, though we won't use that flexibility for DeleteRange (and will only be shifting implementation details from Pebble to CockroachDB). What I am not quite convinced about is why we should offer that flexibility -- the notion of MVCC keys is already present in Pebble via the Split documentation. Pebble isn't prescriptive about how the versions are represented in either scheme.
    • I believe the interface behavior is cleaner with the more prescriptive approach.
    • The one place where the prescriptive approach differs in the implementation detail is that it can't expose [b@t1, c@t2) to the user. Which is why it needs to play the game with the ImmediateSuccessor.
  • I am struggling to understand how "Find the first fragment encoding a timestamp value that is beneath the
    read timestamp and use that timestamp as the effective delete-range timestamp" allows the iterator exposed by Pebble to function in masking mode. I don't think we want the interface complexity of the CockroachDB code explicitly pointing out to Pebble at each step of the iterator which range key should be used for masking -- this will make the interface too complicated I think. It is better for Pebble to know what is the version that is being read at, so it can apply masking itself. Another way to think of this is that DeleteRange is a specialization of a Pebble range key that overwrites older versions with an empty value. The general scheme being offered here is always some kind of overwrite -- that is the whole purpose of sharing the same part of the keyspace for these point and range keys. If the user doesn't want overwrite they should put these range keys in a different part of the key space and do the merging themselves (akin to how we merge point intents with point values at the CockroachDB layer).

  • "Any alternative that avoids this artifact would require Pebble becoming more intimately intertwined with the two-dimensional MVCC key space, and would add significant design and implementation distance between our existing Pebble range tombstones and Pebble range keys." I'm trying to unpack this.

    • Neither proposal tries to change the fact that the LSM invariants are over the 1-dimensional key space.
    • Pebble range tombstones already act in a two-dimensional manner.
    • Even the prescriptive proposal keeps Dels and RANGEDEL_PREFIX of range keys in the 1-dimensional key space.
    • Range keys are acting in a two-dimensional manner in only three places, the first two of which are shared across both proposals:
      • The set union: we're just shifting who is responsible for maintaining this union.
      • In masking iterator mode: I am not sure how to do this cleanly from an interface design perspective without Pebble knowing about versions and their ordering function.
      • The fragmentation of the prefix using ImmediateSuccessor: from an implementation perspective I think this would be a minor change to the fragmenter.

We removed the elision of these RANGEKEY keys earlier, so now there's no
way to unset these RANGEKEYs. We account for this by allowing only
RANGEKEYs with zero-length values to be elided when in the bottommost
level and the last snapshot stripe. Removing these RANGEKEYs is done by
setting an empty value.

Is this central to this idea? Empty value implying deletion is a bit magical from an interface design perspective.
Also, I don't understand what this empty value is: the user is writing [b,c) with opaque value t3. What do they write to delete this -- something like [b,c) with opaque value -t3 that the MergeRangeValue can match on?

@sumeerbhola
Collaborator Author

The sstable would contain b.RANGEKEY#10:c,[t3,t2] and b.RANGEKEY#5:c,t2

Are we landing in this difficulty because we want to promote version t2 from #5 to #10 for read time efficiency? Snapshots are not common enough, so why should we bother with this promotion unless #5 and #10 are in the same snapshot stripe?

Collaborator

@jbowens jbowens left a comment

Some more thoughts:

  • With the nonprescriptive approach, I don't think the generality is a goal itself. I do think there is some small value to maintaining the delineation between pkg/storage's MVCC implementation and Pebble, and maintaining a clear separation of concerns between those levels. It's not always possible: eg, bloom filters necessitating Comparer.Split. I agree we should drop generality if it makes implementation cleaner or easier.

  • I'm not sure, we might still be disagreeing about the underlying implementation approach. I might just still be misunderstanding the implementation of the prescriptive approach or haven't fully grokked how the pieces fit together. My motivation for pushing the MVCC complexity up is to keep the Pebble implementation simple. In my mind, the prescriptive approach doesn't eliminate that MVCC complexity, it works it into Pebble compaction, iteration, and sstable bounds logic, which I think are already pretty complicated.

    • The pkg/storage RangeValueMerger in the nonprescriptive approach is complicated. But it's also just a pure function that is easy to exhaustively unit test, which I think makes the complexity more manageable than if worked into Pebble.

    • Regarding masking in the nonprescriptive approach, I do think masking would be requested by the CockroachDB iterator when it finds a range key that it wants to use for masking. I didn't think it was too complicated, and it mirrors the existing block property filters generality, but I definitely could be overlooking something here.

  • Regarding unsetting range keys, assigning significance to the empty value isn't integral. An alternative is to allow the RangeValueMerger to return a bool signifying that the range key may be dropped. The intention is that CockroachDB can write range keys like +t3 or -t3, which are merged by the range value merger. When a range key has no timestamps, the range key is useless and we want some mechanism for it to be dropped. I think you're right that making it explicit is cleaner and less surprising than assigning special significance to the empty value.

  • "Are we landing in this difficulty because we want to promote version t2 from sstable: prefix bloom filter #5 to compaction: sstable output size #10 for read time efficiency?" — It's for iterator implementation simplicity. I'm hoping to maintain that for any given point key, there is at most 1 overlapping range key fragment in each level that an iterator needs to consider/surface (this is true for range tombstones). A pebble.Iterator could ask each mergingIterLevel for its single overlapping range key fragment, if any. Otherwise, I think we need to Next through a level's fragments and then Prev back to the beginning for the next user key. My understanding in the prescriptive approach is that within each level, you may have n range keys that overlap the current user key (eg [a,e)@100#5, [a,e)@40#4, [a,e)@30#3 [a,e)@30#2), and each prefix change requires Prev-ing back to the beginning.

My understanding of our views is that:

  • We both think the nonprescriptive approach pushes some of the complexity up, forcing pkg/storage to participate directly in the implementation of MVCC range tombstones, and we should get something of value if we do that.
  • We have differing estimates of the Pebble-level implementation complexity in the prescriptive approach. I'm estimating high complexity, which leads me to want to trade reduced Pebble implementation complexity for higher pkg/storage complexity, whereas you estimate the prescriptive approach is similar complexity in the Pebble implementation.

I think the biggest things that I see as complex in the prescriptive approach are sstable bounds/compaction logic, and the merging and interleaving of range keys during iteration. I'll try to dig deeper into these and thoroughly document how I understand these to work in the prescriptive approach.



docs/rfcs/20211018_range_keys.md, line 381 at r5 (raw file):

    - [b,ImmediateSuccessor(b))@50 is effectively representing a point
    b@50.  Hence the end bound for the first sstable is b@40#s1, which
    does not overlap with the b@30#s2 start of the second sstable.

I want to make sure I understand how fragmentation and sstable bounds work here in the case of multiple overlapping range keys.

Point keys: a@100, a@30, b@100, b@40, b@30, b@10, c@40, d@40, e@30
Range keys: [a,e)@50, [b,d)@20

Prefix-fragmented range keys: [a,b)@50, [b,d)@50, [b,d)@20, [d,e)@50

During a compaction, say the prefix b needs to be split across sstables.

sstable one:
  point keys: a@100, a@30, b@100, b@40, b@30
  range keys:
    - [a,b)@50          contributes smallest a@50, largest b#inf
    - [b,IS(b))@50      contributes smallest b@50, largest b@50

sstable two
  point keys: b@10, c@40, d@40, e@30  
  range keys:
    - [b,d)@20          contributes smallest b@20, largest d#inf
    - [IS(b),d)@50      contributes smallest IS(b)@50, largest d#inf
    - [d,e)@50          contributes smallest d@50, largest e#inf

The resulting sstable bounds are [a@100,b@30] and [b@20, e@30]. Is this right? This produces a gap between sstable bounds, unlike range tombstones, but that's ok, because we only care about placing the range keys in the right spot for iteration -- not completely covering a 1-dimensional space of Pebble's user key space?

One awkward bit is that we need to produce the fragment [IS(b),d)@50 when we're processing the split point, but that fragment actually sorts after the next range key [b,d)@20. Do we need to produce point ranges using immediate successor for all remaining range keys with the prefix that is split across sstables? So in that case, we'd also put [b,IS(b))@20 in the first sstable and [IS(b),d)@20 in the second sstable? And I guess the compaction iterator would need to buffer all range keys at the current prefix?

@sumeerbhola
Collaborator Author

sumeerbhola commented Oct 25, 2021

  • I do think there is some small value to maintaining the delineation between pkg/storage's MVCC implementation and Pebble, and maintaining a clear separation of concerns between those levels.

    I strongly believe that Pebble should know which part of the key is a version -- we will need that if we ever hope to do something about #1170 (perf: separate latest and old MVCC versions for faster reads) and cockroach#57260 (storage, kv: do best-effort GC of old versions during storage compactions). Pebble does not need to know about the encoding, but does need to be able to compare versions. Hiding the version in an opaque value that Pebble does not know about, and accessing it via user-specified extensions like MergeRangeValue, will make this harder -- at least prior to range keys everything was in the key, and Split was already documented as being for MVCC, so saying the second part is a version was the obvious corollary. We did not have an explicit version comparison function but we could have added it to unlock certain features.

  • The more I think about MergeRangeValue the less I like it. Just like a@50 and a@30 are separate keys, [a,b)@50 and [a,b)@30 are separate keys. The latter should not require a merge function. "Merging" them and maintaining a set for efficiency reasons can be accomplished as an implementation detail if needed.

  • I think we have different opinions on how much complexity each approach adds, and which one is less complex. I think we'll know only when we implement, and I hope we can reach agreement without implementing both and comparing, since that is going to be expensive. My current opinion is both are comparable in terms of write path and compaction complexity. But from an interface perspective I now vastly prefer the prescriptive one (the previous two bullets are some of the reasons why, and also the next bullet).

  • The masking interface for the non-prescriptive approach is still unclear to me, and it is not like block property filters, which are specified at iterator creation time. With the prescriptive approach it will just be a bool in the iterator options. Tweaking your earlier example, say we are at b@50 and there are two overlapping range keys, RANGEKEY([b,d),@{100, 60, 30}) and RANGEKEY([a,g),@{40}), and say the user is reading @55. Do they tell Pebble to use both [b,d)@60 and [a,g)@30 to skip blocks? Does Pebble remember what they've said in the past wrt masking (since it is possible that [a,g)@30 was also observed earlier) during this iteration and continue using it? Also, we should make the masking iterator deterministic, even for the cases where we are not able to skip blocks -- the more non-determinism we have the harder it is to do proper metamorphic testing. I'm not sure that it is possible in this approach -- it is certainly possible with the prescriptive approach (since the masking iterator only exposes points).

  • Otherwise, I think we need to Next through a level's fragments and then Prev back to the beginning for the next user key. My understanding in the prescriptive approach is that within each level, you may have n range keys that overlap the current user key (eg [a,e)@100#5, [a,e)@40#4, [a,e)@30#3 [a,e)@30#2), and each prefix change requires Prev-ing back to the beginning.

    Btw, I am completely open to an alternative iterator interface that reduces the stepping if that is better for the user in any way. The stepping, with the start key truncated to match the point key prefix, was just the simplest interface I could think of. I am actually unsure if we need anything that iterates over both point and range keys and does not mask. For sending CockroachDB range snapshots the caller can have two iterators, over points and over range keys, and do a simple heap merge to create a sorted output to be written to the sstable (a rough sketch of such a merge follows below) -- we definitely don't want unnecessary repetition of range keys in that case.

(I was hoping we could get to a consensus as we understand each approach more deeply, and I am still hopeful of that, but we'll also need to timebox this eventually, and pick something. Let's talk synchronously on how to approach that.)
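For the snapshot point in the last bullet, a minimal sketch of merging two already-sorted streams (plain string keys stand in for point and range keys here; this is not CockroachDB's snapshot code):

// mergeSorted combines two sorted slices by repeatedly taking the smaller
// head, mirroring the "two iterators plus a simple merge" idea above.
func mergeSorted(points, ranges []string) []string {
    out := make([]string, 0, len(points)+len(ranges))
    i, j := 0, 0
    for i < len(points) && j < len(ranges) {
        if points[i] <= ranges[j] {
            out = append(out, points[i])
            i++
        } else {
            out = append(out, ranges[j])
            j++
        }
    }
    out = append(out, points[i:]...)
    return append(out, ranges[j:]...)
}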

@sumeerbhola
Collaborator Author

The resulting sstable bounds are [a@100,b@30] and [b@20, e@30]. Is this right?

I believe so.

This produces a gap between sstable bounds, unlike range tombstones, but that's ok, because we only care about placing the range keys in the right spot for iteration—not completely covering a 1-dimensional space of Pebble's user key space?

We currently produce a gap between sstable bounds with range tombstones too, and it is worse since the "apply within sstable bounds" semantically means that some seqnums are no longer deleted as discussed in https://github.com/cockroachdb/pebble/blob/master/docs/range_deletions.md#loss-of-power. Of course that gap (for range tombstones) will go away when we no longer have user keys spanning two sstables on the same level.

With range keys I would argue there isn't a gap: [b,IS(b))@50 and [IS(b),d)@50 are semantically adjacent wrt prefixes, so there is no loss of power. But yes, they are not adjacent in the 1-dimensional key space, which is a feature :)
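For byte-ordered keys, the ImmediateSuccessor used here can be sketched as appending a zero byte -- a simplifying assumption for illustration; in Pebble the comparer would define it:

// immediateSuccessor returns the smallest key that sorts strictly after the
// given prefix under bytewise comparison: the prefix with 0x00 appended.
func immediateSuccessor(prefix []byte) []byte {
    return append(append([]byte(nil), prefix...), 0)
}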

One awkward bit is that we need to produce the fragment [IS(b),d)@50 when we're processing the split point, but that fragment actually sorts after the next range key [b,d)@20. Do we need to produce point ranges using immediate successor for all remaining range keys with the prefix that is split across sstables? So in the case, we'd also put [b,IS(b))@20 in the first sstable and [IS(b),d)#20 in the second sstable? And I guess the compaction iterator would need to buffer all range keys at the current prefix?

We don't need to do [b,IS(b))@20 since its start key b@20 doesn't belong in the first sstable. And yes, we'd need to buffer range keys like we do in the rangedel fragmenter. We would probably write a new fragmenter for range keys that copies most of the code from the current fragmenter and then once we have both fragmenters in final form, we'd see if we can refactor and share code.

Collaborator

@jbowens jbowens left a comment

  • For the MVCC/Pebble boundary, do you anticipate the implementation of MVCC point tombstone masking also being moved into Pebble? Pebble definitely needs some level of MVCC knowledge, but I believe it can be reduced to narrowly-scoped user-implemented interfaces, like Comparer does today.

  • "Just like a@50 and a@30 are separate keys, [a,b)@50 and [a,b)@30 are separate keys." — I think that's up for us to decide? In the value merging design, range keys are more descriptions of point keys than they are keys with substance. I don't think the requirement that they be keys is necessary, or necessarily a clearer interface. If these range keys are keys rather than descriptions of ranges of point keys, they exist in a parallel but also intertwined/interleaved keyspace which is at least a concept I have trouble reconciling.

  • "It is not like block property filters which are specified at iterator creation time" — I guess I don't understand why that makes them very dissimilar. Doesn't your example here with two overlapping range keys apply to the prescriptive approach? In both cases we need to make a decision about which range key to use for masking, it's just a matter of whether we're making that decision in Pebble or in pkg/storage? I don't think Pebble should remember the masking: the CockroachDB iterator should request it each time. This might just be my preference for simple Pebble implementation showing :).

  • Regarding determinism, the 'RangeValue' of every point key is deterministic and can be used in metamorphic testing. Just like in the prescriptive approach, the metamorphic testing can assert on the expected range value of a point, rather than the intermediary iterator states. The fragment boundaries are not deterministic, but that's also the case in the prescriptive approach.


We don't need to do [b,IS(b))@20 since its start key b@20 doesn't belong in the first sstable

What do we do about [b,d)@20 ordering before [IS(b),d)@50 then? In this model, we buffer all keys with the same prefix in the fragmenter. We output [b,IS(b))@50 in the first sstable, but need to remember that we need to also output [IS(b),d)@20 in the second sstable. We can't write it yet though even though we're starting the second sstable, because the fragmenter is still on the b prefix and we have unflushed fragments beginning with b. We also don't know whether there might be fragments further down the input iterators starting with IS(b). We need to additionally buffer which fragments have been partially flushed as a point-fragment? This is the type of complexity we're incurring because we're deviating from range tombstones' semantics with respect to sstable boundaries. If we reuse range deletion range semantics, I believe we can reuse the existing fragmenter and sstable boundaries logic near verbatim.


Separately, I've been mulling over the MVCCStats issue, and I haven't thought of anything great. I think one option is to prevent MVCC range tombstones from contributing to MVCCStats at all (other than updating covered keys' MVCCStats contributions). We want MVCC range tombstones to contribute to MVCCStats so long as CockroachDB still needs to do work to clean up the range keys. Can we tweak them such that CockroachDB doesn't need to clean up these range keys manually?

Riffing off the earlier 'coloring' concept, we could redefine range keys so that they're not first-class keys themselves, but instead a 'coloring' on top of and describing point keys. They have no substance themselves and only exist to describe point keys. If there are no described point keys beneath a range key, the range key has no effect and may be elided. As long as CockroachDB garbage collects all the point keys described by a range key, the range key will automatically be dropped too. This is the same mechanic as Pebble range tombstones.

In both the previously discussed designs, this would require changes so that Pebble can discern when all point keys deleted by a MVCC range tombstone fragment have been dropped. In the prescriptive approach, I think this would require a comparator for suffixes? In the value-merging approach, I think it would require the user provide something akin to a Describes(rangeValue []byte, pointKey []byte) bool function.
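A minimal, hypothetical sketch of such a callback, assuming purely for illustration that the range value is just an encoded "@<timestamp>" suffix and that larger timestamps are newer (not the real CockroachDB encoding):

import (
    "bytes"
    "strconv"
)

// Describes reports whether the range key (represented here by an encoded
// timestamp suffix) covers the given point key, ie whether the point's
// timestamp is strictly beneath the range key's.
func Describes(rangeValue []byte, pointKey []byte) bool {
    i := bytes.LastIndexByte(pointKey, '@')
    if i < 0 {
        return false // suffixless point keys are not considered covered here
    }
    rangeTS, err1 := strconv.Atoi(string(bytes.TrimPrefix(rangeValue, []byte("@"))))
    pointTS, err2 := strconv.Atoi(string(pointKey[i+1:]))
    if err1 != nil || err2 != nil {
        return false
    }
    return pointTS < rangeTS
}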

Reviewable status: 0 of 1 files reviewed, 8 unresolved discussions (waiting on @41, @50, @erikgrinaker, @petermattis, @sumeerbhola, @t, and @v)

Collaborator

@jbowens jbowens left a comment

Another thought regarding Pebble performing masking automatically during iteration, doesn't that preclude the possibility of eventually allowing MVCC range deletes to be transactional? Or at least it will require us to provide a mechanism for Pebble to consult intents?

Reviewable status: 0 of 1 files reviewed, 8 unresolved discussions (waiting on @41, @50, @erikgrinaker, @petermattis, @sumeerbhola, @t, and @v)

Collaborator

@jbowens jbowens left a comment

Reviewable status: 0 of 1 files reviewed, 44 unresolved discussions (waiting on @20, @41, @50, @erikgrinaker, @jbowens, @nvanbenschoten, @petermattis, @sumeerbhola, @t, and @v)


docs/rfcs/20211018_range_keys.md, line 167 at r24 (raw file):

Previously, sumeerbhola wrote…

The current RANGEDEL would continue to apply only to points?

Yeah


docs/rfcs/20211018_range_keys.md, line 195 at r24 (raw file):

Previously, sumeerbhola wrote…

if both hasPoint and hasRange are true, does the value returned by Key() equal start?

I was leaning towards no, because I thought it would be preferable to have the fragment bounds stable over the lifetime of an iterator. What do you think?


docs/rfcs/20211018_range_keys.md, line 349 at r24 (raw file):

Previously, sumeerbhola wrote…

should this say RangeKeyMask instead of IteratorMask?

I was imagining IteratorMask would be an interface (with unexported methods). It seems like we could conceivably fold BlockPropertyFilters under this field too and expose constructors for combining several iterator masks, kinda like the functional options pattern.
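A rough, hypothetical sketch of that shape (placeholder names, not a settled API):

// IteratorMask is the interface floated above: implementations come from
// Pebble itself (the unexported method prevents outside implementations) and
// are supplied in the iterator options at construction time.
type IteratorMask interface {
    iteratorMask()
}

type rangeKeyMask struct {
    suffix []byte // eg, an encoded MVCC timestamp
}

func (rangeKeyMask) iteratorMask() {}

// RangeKeyMask constructs a mask allowing range keys at or below suffix to
// hide the older point keys they cover.
func RangeKeyMask(suffix []byte) IteratorMask {
    return rangeKeyMask{suffix: suffix}
}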


docs/rfcs/20211018_range_keys.md, line 379 at r24 (raw file):

Previously, sumeerbhola wrote…

s/form/from/

Done.


docs/rfcs/20211018_range_keys.md, line 444 at r24 (raw file):

Previously, sumeerbhola wrote…

defined

Done.


docs/rfcs/20211018_range_keys.md, line 491 at r24 (raw file):

Previously, sumeerbhola wrote…
  we could
  impose the constraint that a prefix cannot be split across multiple
  range key sstables.

Would this help with anything? It doesn't seem to help with the iterator interface, given that we've now designed an interface which exposes range keys aligned with the point key.

I think it eliminates almost all my concerns with the original representation of range keys ([k1,k2)@t1#s as k1@t1.RANGESET#s→ k2), because it avoids complicating compaction or sstable boundaries. We could remove the merging of keys and tweak the iterator interface to expose an iterator over the range key fragments covering a point:

RangeKeys() *RangeKeyIterator

type RangeKey struct {
    Suffix []byte
    Value []byte
}

func (r *RangeKeyIterator) Next() RangeKey

The advantage would be that if you have many large-valued range keys all overlapping, you don't need to merge them into a single key that must sit in a single sstable block. Instead of requiring O(sum-overlapping-key-values) memory, it requires O(max-overlapping-key-value) memory per level. The downside is that you need to prev back to the first fragment with the fragment bounds whenever you next, even if the next key falls within the same fragment bounds.

We could also choose to initially buffer all the overlapping fragments in memory, represent them as k1@t1.RANGESET#s→ k2 and expose them in an iterator interface that just reads from the buffered fragments. Later, if we begin to encounter issues with many overlapping large-valued range keys and iterator memory usage, we have an out.
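A sketch of that buffered variant, keeping the Next() RangeKey shape from above and returning a zero RangeKey once the overlapping fragments are exhausted (exhaustion handling is left open in the proposal):

// RangeKeyIterator here simply walks fragments buffered up front for the
// current point key; a zero RangeKey signals exhaustion in this sketch.
type RangeKeyIterator struct {
    keys []RangeKey // overlapping fragments, eg newest suffix first
    i    int
}

func (r *RangeKeyIterator) Next() RangeKey {
    if r.i >= len(r.keys) {
        return RangeKey{}
    }
    k := r.keys[r.i]
    r.i++
    return k
}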


docs/rfcs/20211018_range_keys.md, line 494 at r24 (raw file):

Previously, sumeerbhola wrote…
  of higher levels, iterators, etc all needing to deal with the concept
  of two parallel LSMs.

This does seem messy wrt code changes. This would spread into the manifest too.

We could infer the LSM to which the sstable belongs from the Smallest and Largest key kinds in the file metadata to avoid requiring manifest format changes.


docs/rfcs/20211018_range_keys.md, line 1032 at r24 (raw file):

Previously, sumeerbhola wrote…

r3 has 3 ranges?

oops, done.

Collaborator Author

@sumeerbhola sumeerbhola left a comment

Reviewable status: 0 of 1 files reviewed, 44 unresolved discussions (waiting on @20, @41, @50, @erikgrinaker, @jbowens, @nvanbenschoten, @petermattis, @sumeerbhola, @t, and @v)


docs/rfcs/20211018_range_keys.md, line 195 at r24 (raw file):

Previously, jbowens (Jackson Owens) wrote…

I was leaning towards no, because I thought it would be preferable to have the fragment bounds stable over the lifetime of an iterator. What do you think?

I don't have an opinion. Whatever is simpler.


docs/rfcs/20211018_range_keys.md, line 349 at r24 (raw file):

Previously, jbowens (Jackson Owens) wrote…

I was imagining IteratorMask would be an interface (with unexported methods). It seems like we could conceivably fold BlockPropertyFilters under this field too and expose constructors for combining several iterator masks, kinda like the functional options pattern.

So RangeKeyMask is an implementation of IteratorMask?
I don't mind the generalization but we should generalize only if it doesn't introduce much code complexity. The reason to generalize BlockPropertyFilters up front was that we were making an sstable format change. RangeKeyMask is a purely code-level abstraction that we could mark as "experimental and subject to incompatible changes, including removal", so that only CockroachDB uses it. Then we could generalize it over time if needed.

jbowens added a commit to jbowens/pebble that referenced this pull request Nov 23, 2021
Document the new Split requirements necessary for range keys support. See cockroachdb#1341.
jbowens added a commit to jbowens/pebble that referenced this pull request Nov 24, 2021
Document the new Split requirements necessary for range keys support. See cockroachdb#1341.
Contributor

@nicktrav nicktrav left a comment

Reviewable status: 0 of 1 files reviewed, 61 unresolved discussions (waiting on @20, @41, @50, @erikgrinaker, @jbowens, @nicktrav, @nvanbenschoten, @petermattis, @sumeerbhola, @t, and @v)

a discussion (no related file):
Flushing some comments. Still reading.



docs/rfcs/20211018_range_keys.md, line 271 at r27 (raw file):

(a,[a,e)), (b,[a,e)), (c,[a,e)), (d,[a,e)), e

Should this also emit the f point key?


docs/rfcs/20211018_range_keys.md, line 318 at r27 (raw file):

(point, range)
( [a,e)@50)

It's not immediately apparent why the range key [a,e)@50 is emitted first, without a point key. Perhaps a comment?

edit: @sumeerbhola was able to clear this up in an offline discussion. It might be worth referring back to the section that talks about widening the span on the lower bound to cover a, and not the initial lower bound of a@50.


docs/rfcs/20211018_range_keys.md, line 340 at r27 (raw file):

Only range keys with suffixes that sort after the
mask's suffix mask point keys. A range key that meets this condition
only masks points with suffixes that sort after the range key's suffix.

I found these two sentences rather dense and hard to unpack. I was almost expecting to see another sentence that followed that could help relieve some of the cognitive overhead, i.e. "Put another way ... blah". The "sorts after" term also requires some mental gymnastics, as we are doing a sort-descending.

In looking at the examples, I think I have arrived at a mental model where the RangeKeyMask sets an "upper bound" on the suffix. A point key contained in the range that falls "below" both the mask suffix and the next lowest range key suffix will be masked. Anything "above" the range key suffix will not be masked. Something like:

                            RangeKey        RangeKeyMask(@50, ...)
                            [a,c)@30              |
      apple@10     a@20       |        a@40       |
          x         x         |         o         |
|---------|---------|---------|---------|---------|---------|   suffix
@0        @10       @20       @30       @40       @50       @60         
                              |
                  Masked <----|----> Unmasked

Am I right in thinking about this in terms of overlapping regions? Could this help with explaining the masking concept?


docs/rfcs/20211018_range_keys.md, line 375 at r27 (raw file):

Example: A user may construct an iterator with a mask RangeKeyMask(@t50, ...).

It may be useful if there was an additional, extensive example that shows how this works (i.e. by demonstrating which keys are masked / elided / not emitted).

It may also be nice to have it in the style of the (point, range) tabulated example you have above.


docs/rfcs/20211018_range_keys.md, line 434 at r27 (raw file):

Fragmentation may occur at user key contained within the `[k1,k2)`
range, including at intermediary keys that contain suffixes higher or
lower than the suffix of the range key.

This point seems important enough to justify having an explicit example highlighting what kind of fragmentation is permissible.


docs/rfcs/20211018_range_keys.md, line 440 at r27 (raw file):

or

Is the intention of this paragraph to say that we will support both? Or is it to motivate picking one or the other?


docs/rfcs/20211018_range_keys.md, line 447 at r27 (raw file):

  range keys have no
  effect on point keys during compactions.

Is it worth stating the inverse too? Point keys have no impact on range keys?


docs/rfcs/20211018_range_keys.md, line 450 at r27 (raw file):

shared

Can you drop in an inline definition of what we mean by "shared" here? Are you talking about the case where an sstable contains both point key blocks and range key blocks? Or something else?


docs/rfcs/20211018_range_keys.md, line 451 at r27 (raw file):

that contains range keys

This reads a little confusingly. Do you mean something like "reads must iterate through the range key blocks of sstables with range keys that fall within the bounds of the iterator"? It read as if there was more to the sentence.


docs/rfcs/20211018_range_keys.md, line 457 at r27 (raw file):

range-key LSM

In the paragraph before these bullet points, we talk about this "parallel range-key LSM". I wonder if we can think of it as something like a "logical range-key LSM". My understanding is that it's still in the same "physical" LSM, just stored in blocks that are separate to the point keys.


docs/rfcs/20211018_range_keys.md, line 486 at r27 (raw file):

requiring ingests to target

I was a little confused by this phrasing. Do ingests have the ability to themselves target a level in the LSM? Could / should this be something along the lines of "... resulting in ingestions writing to at least L5 or higher". Seems like the main point is that because these keys hang around longer and aren't deleted / elided, this forces newer ranges to be placed in higher levels.


docs/rfcs/20211018_range_keys.md, line 491 at r27 (raw file):

to be order of magnitude

nit: to be an order of magnitude


docs/rfcs/20211018_range_keys.md, line 521 at r27 (raw file):

k1.RANGESET#s2 → (k2,[(@t2,v2),(@t1,v1)])

Given the description below, I wonder if we should update this to be something like:

k1.RANGESET#s2 → (k2,[(@t2,len(v2)),(@t1,len(v1))][v1,v2])

It took me a bit to understand what was meant by "value itself is encoded at the end of the RANGESET's value", and how the reverse order part is useful (i.e. can index from the back, on the fly). It may be a little verbose though.
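To unpack the "index from the back" idea in this suggestion, a small hypothetical sketch, assuming the header's (suffix, length) pairs are listed newest-first while the raw values sit at the tail of the buffer in the opposite order:

// valueAt returns the i-th header entry's value by slicing lengths off the
// back of the shared value buffer, with no precomputed offsets; lens[j] is
// the length recorded in the j-th header entry (newest first).
func valueAt(values []byte, lens []int, i int) []byte {
    end := len(values)
    for j := 0; j < i; j++ {
        end -= lens[j]
    }
    return values[end-lens[i] : end]
}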


docs/rfcs/20211018_range_keys.md, line 575 at r27 (raw file):

As described above, overlapping `RANGESET`s and
`RANGEUNSET`s within the same snapshot stripe and sstable are merged

Should the definition of "overlapping" here be more strict in the sense that only when the start and end key are the same can we merge into a single key-value pair. For example, k1.RANGESET#s1 -> (k2,[@t1]) can merge with k1.RANGESET#s2 -> (k2,[@t2]) into a single key-value. But can we merge k1.RANGESET#s1 -> (k2,[@t1]) with k1.RANGESET#s2 -> (k3,[@t2]) (i.e. different end-key)?


docs/rfcs/20211018_range_keys.md, line 623 at r27 (raw file):

must not be masked

We talk about masking here in this example, but don't specify what the mask is. Is "mask" the correct term to be using here? Or does it only makes sense to refer to masking when we are specifying a mask (i.e. RangeKeyMask).


docs/rfcs/20211018_range_keys.md, line 625 at r27 (raw file):

not be masked because it's not covered by the new range key, and indeed
that's the case because the covering range key's fragment is unchanged
`[a,j)@{t1}#10`.

Might be nice to talk about what happens to the point key m@t2#11 in the right-hand half of the fragmented range.

My current understanding is that RangeKeyMask(@t3, ...) would mask this point key, whereas RangeKeyMask(@t1, ...) would not mask it?

Collaborator

@jbowens jbowens left a comment

Reviewable status: 0 of 1 files reviewed, 61 unresolved discussions (waiting on @20, @41, @50, @erikgrinaker, @jbowens, @nicktrav, @nvanbenschoten, @petermattis, @sumeerbhola, @t, and @v)


docs/rfcs/20211018_range_keys.md, line 271 at r27 (raw file):

Previously, nicktrav (Nick Travers) wrote…

Should this also emit the f point key?

Yeah, updated.


docs/rfcs/20211018_range_keys.md, line 318 at r27 (raw file):

Previously, nicktrav (Nick Travers) wrote…

It's not immediately apparent why the range key [a,e)@50 is emitted first, without a point key. Perhaps a comment?

edit: @sumeerbhola was able to clear this up in an offline discussion. It might be worth referring back to the section that talks about widening the span on the lower bound to cover a, and not the initial lower bound of a@50.

Added a line below this example:

Remember that the suffixless `a` sorts before all suffixed keys with the same
prefix (eg, `a@100`).

docs/rfcs/20211018_range_keys.md, line 340 at r27 (raw file):

Previously, nicktrav (Nick Travers) wrote…
Only range keys with suffixes that sort after the
mask's suffix mask point keys. A range key that meets this condition
only masks points with suffixes that sort after the range key's suffix.

I found these two sentences rather dense and hard to unpack. I was almost expecting to see another sentence that followed that could help relieve some of the cognitive overhead, i.e. "Put another way ... blah". The "soft after" term also requires some mental gymnastics, as we are doing a sort-descending.

In looking at the examples, I think I have arrived at a mental model where the RangeKeyMask sets an "upper bound" on the suffix. A point key contained in the range that falls both "below" the mask suffix and the next lowest range key suffix will be masked. Anything "above" the range key suffix will not be masked. Something like:

                            RangeKey        RangeKeyMask(@50, ...)
                            [a,c)@30              |
      apple@10     a@20       |        a@40       |
          x         x         |         o         |
|---------|---------|---------|---------|---------|---------|   suffix
@0        @10       @20       @30       @40       @50       @60         
                              |
                  Masked <----|----> Unmasked

Am I right in thinking about this in terms of overlapping regions? Could this help with explaining the masking concept?

Yeah, that's right. I think it might be useful to include a two-dimensional ascii graph, eg, with user key on the x-axis and suffix on the y-axis. That might help demonstrate how masking works.
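Taking the RFC's "sorts after" wording literally, and assuming MVCC-style "@<timestamp>" suffixes where a larger timestamp sorts earlier, the rule in the diagram reduces to roughly:

// masked reports whether a point key at pointTS, whose prefix is covered by
// a range key at rangeKeyTS, is hidden under a mask at maskTS. Containment
// of the point's prefix within the range key's [start, end) bounds is
// assumed to have been checked separately.
func masked(maskTS, rangeKeyTS, pointTS int) bool {
    // The range key's suffix must sort after the mask's suffix (an older
    // timestamp), and the point's suffix must sort after the range key's.
    return rangeKeyTS < maskTS && pointTS < rangeKeyTS
}

With the diagram's numbers, masked(50, 30, 10) and masked(50, 30, 20) are true, while masked(50, 30, 40) is false.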


docs/rfcs/20211018_range_keys.md, line 440 at r27 (raw file):

Previously, nicktrav (Nick Travers) wrote…
or

Is this intention of this paragraph to say that we will support both? Or is this to motivate the picking of on one or the other?

To motivate picking one or the other. Clarified the language here.


docs/rfcs/20211018_range_keys.md, line 450 at r27 (raw file):

Previously, nicktrav (Nick Travers) wrote…
shared

Can you drop in an inline definition of what we mean by "shared" here? Are you talking about the case where an sstable contains both point key blocks and range key blocks? Or something else?

Clarified up above.


docs/rfcs/20211018_range_keys.md, line 457 at r27 (raw file):

Previously, nicktrav (Nick Travers) wrote…
range-key LSM

In the paragraph before these bullet points, we talk about this "parallel range-key LSM". I wonder if we can think of it as something like a "logical range-key LSM". My understanding is that it's still in the same "physical" LSM, just stored in blocks that are separate to the point keys.

In this context, it is referring to a separate physical LSM.


docs/rfcs/20211018_range_keys.md, line 486 at r27 (raw file):

Previously, nicktrav (Nick Travers) wrote…
requiring ingests to target

I was a little confused by this phrasing. Do ingests have the ability to themselves target a level in the LSM? Could / should this be something along the lines of "... resulting in ingestions writing to at least L5 or higher". Seems like the main point is that because these keys hang around longer and aren't deleted / elided, this forces newer ranges to be placed in higher levels.

"Target" is a bit of jargon derived from the code, because during ingestion we determine the ingestTargetLevel for each level. I'll update the language to use less jargon.


docs/rfcs/20211018_range_keys.md, line 575 at r27 (raw file):

Previously, nicktrav (Nick Travers) wrote…
As described above, overlapping `RANGESET`s and
`RANGEUNSET`s within the same snapshot stripe and sstable are merged

Should the definition of "overlapping" here be more strict in the sense that only when the start and end key are the same can we merge into a single key-value pair. For example, k1.RANGESET#s1 -> (k2,[@t1]) can merge with k1.RANGESET#s2 -> (k2,[@t2]) into a single key-value. But can we merge k1.RANGESET#s1 -> (k2,[@t1]) with k1.RANGESET#s2 -> (k3,[@t2]) (i.e. different end-key)?

they're always exactly overlapping because they're fragmented. updated to resurface that point here.


docs/rfcs/20211018_range_keys.md, line 623 at r27 (raw file):

Previously, nicktrav (Nick Travers) wrote…
must not be masked

We talk about masking here in this example, but don't specify what the mask is. Is "mask" the correct term to be using here? Or does it only makes sense to refer to masking when we are specifying a mask (i.e. RangeKeyMask).

Yeah, "Masking" only refers to RangeKeyMask. The range keys themselves are the "masks", and the RangeKeyMask configured by the user specifies which range keys may act as masks.

jbowens added a commit that referenced this pull request Nov 30, 2021
Document the new Split requirements necessary for range keys support. See #1341.
Contributor

@nicktrav nicktrav left a comment

Reviewable status: 0 of 1 files reviewed, 52 unresolved discussions (waiting on @20, @41, @50, @erikgrinaker, @jbowens, @nicktrav, @nvanbenschoten, @petermattis, @sumeerbhola, @t, and @v)


docs/rfcs/20211018_range_keys.md, line 623 at r27 (raw file):

range keys themselves are the "masks", and the RangeKeyMask configured by the user specifies which range keys may act as masks.

I think this is a really nice way of re-stating things, if you could incorporate somehow.

nicktrav added a commit to nicktrav/pebble that referenced this pull request Dec 8, 2021
Range keys (see cockroachdb#1341) will be stored in their own, single block of an
sstable. Add a new, optional meta block, indexed as "pebble.range_key"
in the metablock index, to the sstable structure. This block is only
present when at least one range key has been written to the sstable.

Add the ability to add range keys to an sstable via
`(*sstable.Writer).Write`.

Update existing data-driven tests to support printing of the range key
summary. Add additional test coverage demonstrating writing of range
keys with an `sstable.Writer`.

Add minimal functionality to `sstable.Reader` to support writing the
data-driven test cases for the writer. Additional read-oriented
functionality will be added in a subsequent patch.

Related to cockroachdb#1339.
nicktrav added a commit to nicktrav/pebble that referenced this pull request Dec 8, 2021
Range keys (see cockroachdb#1341) will be stored in their own, single block of an
sstable. Add a new, optional meta block, indexed as "pebble.range_key"
in the metablock index, to the sstable structure. This block is only
present when at least one range key has been written to the sstable.

Add the ability to add range keys to an sstable via
`(*sstable.Writer).Write`.

Update existing data-driven tests to support printing of the range key
summary. Add additional test coverage demonstrating writing of range
keys with an `sstable.Writer`.

Add minimal functionality to `sstable.Reader` to support writing the
data-driven test cases for the writer. Additional read-oriented
functionality will be added in a subsequent patch.

Related to cockroachdb#1339.
nicktrav added a commit to nicktrav/pebble that referenced this pull request Dec 8, 2021
Range keys (see cockroachdb#1341) will be stored in their own, single block of an
sstable. Add a new, optional meta block, indexed as "pebble.range_key"
in the metablock index, to the sstable structure. This block is only
present when at least one range key has been written to the sstable.

Add the ability to add range keys to an sstable via
`(*sstable.Writer).Write`.

Update existing data-driven tests to support printing of the range key
summary. Add additional test coverage demonstrating writing of range
keys with an `sstable.Writer`.

Add minimal functionality to `sstable.Reader` to support writing the
data-driven test cases for the writer. Additional read-oriented
functionality will be added in a subsequent patch.

Related to cockroachdb#1339.
nicktrav added a commit to nicktrav/pebble that referenced this pull request Dec 13, 2021
Range keys (see cockroachdb#1341) will be stored in their own, single block of an
sstable. Add a new, optional meta block, indexed as "pebble.range_key"
in the metablock index, to the sstable structure. This block is only
present when at least one range key has been written to the sstable.

Add the ability to add range keys to an sstable via
`(*sstable.Writer).Write`.

Update existing data-driven tests to support printing of the range key
summary. Add additional test coverage demonstrating writing of range
keys with an `sstable.Writer`.

Add minimal functionality to `sstable.Reader` to support writing the
data-driven test cases for the writer. Additional read-oriented
functionality will be added in a subsequent patch.

Related to cockroachdb#1339.
nicktrav added a commit to nicktrav/pebble that referenced this pull request Dec 13, 2021
Range keys (see cockroachdb#1341) will be stored in their own, single block of an
sstable. Add a new, optional meta block, indexed as "pebble.range_key"
in the metablock index, to the sstable structure. This block is only
present when at least one range key has been written to the sstable.

Add the ability to add range keys to an sstable via
`(*sstable.Writer).Write`.

Update existing data-driven tests to support printing of the range key
summary. Add additional test coverage demonstrating writing of range
keys with an `sstable.Writer`.

Add minimal functionality to `sstable.Reader` to support writing the
data-driven test cases for the writer. Additional read-oriented
functionality will be added in a subsequent patch.

Related to cockroachdb#1339.
nicktrav added a commit that referenced this pull request Dec 13, 2021
Range keys (see #1341) will be stored in their own, single block of an
sstable. Add a new, optional meta block, indexed as "pebble.range_key"
in the metablock index, to the sstable structure. This block is only
present when at least one range key has been written to the sstable.

Add the ability to add range keys to an sstable via
`(*sstable.Writer).Write`.

Update existing data-driven tests to support printing of the range key
summary. Add additional test coverage demonstrating writing of range
keys with an `sstable.Writer`.

Add minimal functionality to `sstable.Reader` to support writing the
data-driven test cases for the writer. Additional read-oriented
functionality will be added in a subsequent patch.

Related to #1339.
Collaborator Author

@sumeerbhola sumeerbhola left a comment

:lgtm:

Reviewable status: 0 of 2 files reviewed, 62 unresolved discussions (waiting on @20, @41, @50, @erikgrinaker, @jbowens, @nicktrav, @nvanbenschoten, @petermattis, @sumeerbhola, @t, and @v)


docs/rfcs/20211018_range_keys.md line 93 at r29 (raw file):

### Writes

This design introduces a three new write operations:

nit: introduces three ...


docs/rfcs/20211018_range_keys.md line 294 at r29 (raw file):

A pebble.Iterator, which shows user keys, will visit the following iterator
positions during forward iteration:

nit: the fruit example from the previous section is nice, and the only thing it lacks is point keys. It may be better to expand on that example here rather than introduce these a, b, ... keys.


docs/rfcs/20211018_range_keys.md line 314 at r29 (raw file):

  the range key to those bounds. For example, if the upper bound was c, the
  sequence seen would be `(a,[a,c))`, `(b,[a,c))`. The same applies to lower
  bound and backward iteration.

nit: these 3 bullets stating iteration properties of range keys can probably be stated with the pure range key example from the previous section.


docs/rfcs/20211018_range_keys.md line 475 at r29 (raw file):

numbers, like other internal keys. These keys are written to the same sstables
as point keys and stored in a separate block, similar to the existing range
deletion tombstones. Log-structured merge tree level invariants are valid across

nit: I think this "same sstables" and "separate block" sentence is repeating what was said earlier in the section.


docs/rfcs/20211018_range_keys.md line 478 at r29 (raw file):

range and point keys (just like they are for range tombstones and point keys).
That is, `RangeKeySet([k1,k2))#s2` cannot be at a lower level than
`RangeKeySet(k)#s1` where `k \in [k1,k2)` and `s1 < s2`.

RangeKeySet(k)? Did you mean to use a point key?
We should probably mention both LSM invariants among range keys, and between range and point keys.
IIUC, we have both, but we don't really rely on the latter in our implementation.


docs/rfcs/20211018_range_keys.md line 511 at r29 (raw file):

* End key, `varstring`, encodes the end user key of the fragment.
* A series of (suffix, value-length) tuples representing the logical range keys

s/value-length/value/

(iirc, we are no longer doing the opposite order stuff that was in the previous iteration of this doc)


docs/rfcs/20211018_range_keys.md line 649 at r29 (raw file):

  `Split(k) < len(k)`.

# TKTK: Discuss merging iterator

todo in a later PR?


docs/rfcs/20211018_range_keys.md line 705 at r29 (raw file):

a@100, [a,e)@50, a@30, b@100, [b,e)@50, b@40, b@30, [c,e)@50, c@40, [d,e)@50, d@40, e@30

this needs to be rephrased since we are exposing range keys together with point keys.


docs/rfcs/20211018_range_keys.md line 797 at r29 (raw file):

- With shared sstables, `SeekPrefixGE` cannot use bloom filters to entirely
  eliminate sstables that contain range keys. Pebble does not use bloom filters

this "does not use bloom filters in L6" is not true anymore.


docs/rfcs/20211018_range_keys.md line 828 at r29 (raw file):

We decide to share sstables, because preserving the LSM invariant between range
keys and point keys is expected to be useful in the long-term.

We don't have this invariant now in CockroachDB, yes? I vaguely remember dt mentioning a case where a range key that covers a point key is written before the point key.
I thought we are doing this because it's simpler from a code perspective to not maintain a separate LSM stack.

Describe design of range keys.

Informs cockroachdb#1339

Co-authored-by: Jackson Owens <jackson@cockroachlabs.com>
Collaborator

@jbowens jbowens left a comment

Reviewable status: 0 of 2 files reviewed, 62 unresolved discussions (waiting on @20, @41, @50, @erikgrinaker, @jbowens, @nicktrav, @nvanbenschoten, @petermattis, @sumeerbhola, @t, and @v)


docs/rfcs/20211018_range_keys.md line 797 at r29 (raw file):

Previously, sumeerbhola wrote…

this "does not use bloom filters in L6" is not true anymore.

changed to "does not always use bloom filters in L6", since most of the time we still don't use them.


docs/rfcs/20211018_range_keys.md line 828 at r29 (raw file):

Previously, sumeerbhola wrote…

We don't have this invariant now in CockroachDB, yes? I vaguely remember dt mentioning a case where a range key that covers a point key is written before the point key.
I thought we are doing this because its simpler from a code perspective to not maintain a separate LSM stack.

We don't have it between suffixes and sequence numbers, but otherwise we have it.

Collaborator

@jbowens jbowens left a comment

stamping myself, since Sumeer opened the pull request

@jbowens jbowens merged commit f079ef5 into cockroachdb:master Sep 13, 2022