Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

internal/manifest: Relax SeqNum overlap invariant in L0 #591

Merged

Conversation

itsbilal
Copy link
Member

@itsbilal itsbilal commented Mar 27, 2020

Changes Manifest.CheckOrdering to allow L0 SSTables to overlap
in sequence numbers.

Implemented for better compatibility with future versions of
Pebble which could support partitioned flushes and L0 sublevels.

Fixes #587 .

@petermattis
Copy link
Collaborator

This change is Reviewable

Copy link
Collaborator

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like what I was imagining. Mostly small comments.

Reviewable status: 0 of 5 files reviewed, 5 unresolved discussions (waiting on @itsbilal, @jbowens, @petermattis, and @sumeerbhola)


internal/manifest/version.go, line 596 at r1 (raw file):

						overlapRangeStartIdx = i
					}
					if f.SmallestSeqNum < smallestOverlapRangeFlushedSeqNum {

Nit: for cases where we're trying to find the smallest or largest, I like to arrange the comparison to match the subsequent assignment:

if smallestOverlapRangeFlushedSeqNum > f.SmallestSeqNum {
  smallestOverlapRangeFlushedSeqNum = f.SmallestSeqNum
}

internal/manifest/version.go, line 605 at r1 (raw file):

						// range and check for any key range overlaps in set of
						// files.
						overlapRange := expandOverlapRange(i-1)

I'm confused about what overlapRangeStartIdx is being used for. It seems to either be unnecessary, or could be used to optimize expandOverlapRange. Perhaps it is too late on Friday afternoon and I'm missing something.


internal/manifest/version.go, line 607 at r1 (raw file):

						overlapRange := expandOverlapRange(i-1)
						overlapRangeCopy := make([]*FileMetadata, len(overlapRange))
						copy(overlapRangeCopy, overlapRange)

Rather than having checkOverlapRangeForKeyOverlap mutate the slice it is passed, I think it would be cleaner and safer to have that function make a copy internally. This would get rid of a small amount of duplication, and ensure that a new caller of checkOverlapRangeForKeyOverlap doesn't have to be aware of this requirement.


internal/manifest/testdata/version_check_ordering, line 104 at r1 (raw file):

  m.SET.20-n.SET.20
----
L0 flushed file 000007 has an ingested file coincident with smallest seqnum: 15-17

I realize you didn't touch this error message in this change, but it reads a bit strangely. How does a "flushed file" have an "ingested file coincident"? I think this should be something like: L0 flushed file 000007 has smallest seqnum coincident with ingested file: 15 vs 15-17.


tool/testdata/manifest_dump, line 173 at r1 (raw file):

  added:         L0 000002:0<#1-#4>[#0,DEL-#0,DEL]
EOF
pebble: internal error: L0 files 000002 and 000001 have overlapping key and seqnum ranges: #0,DEL-#0,DEL vs #0,DEL-#0,DEL, and 1-4 vs 2-5

I think we should use the format used in manifest dump: <#SmallestSeq-#LargestSeq>[SmallestKey-LargestKey].

@petermattis
Copy link
Collaborator

Something else I wanted to mention. We don't really document this anywhere, but in the commit/PR title, the bit before the colon (i.e. the "scope") is supposed to be the directory containing the major portion of the change. In this case it should be internal/manifest: rather than version:. For your other PR it should be internal/manifest: rather than version_set:. There is a search/grouping facility on roachdash for issues when they follow this practice.

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 5 files reviewed, 7 unresolved discussions (waiting on @itsbilal, @jbowens, and @sumeerbhola)


internal/manifest/version.go, line 482 at r1 (raw file):

		//
		// Partitioned flushes are currently not implemented in Pebble, but this case
		// is handled here for compatibility with future versions of pebble.

I was going to suggest adding something like:
Key-partitioned intra-L0 compactions, also not currently implemented, similarly result in overlapping seqnum files that satisfy the same invariant.

But then realized that the invariant there is even weaker and can't be checked without peering into the files, so inclined to say that we should not do the overlap checking below.
Consider the case of two sublevel key intervals [a, c] [d, e]

  [a, c]           [d, e]
  f3[10,15]      f4[11,16]
  f1[0,5]        f2[1,6]

An intra-L0 compaction is allowed to compact f2, f3, f4 which will produce an output file, say f5 with key range [a, e] and seqnum range [1,16]. Both the keys and the seqnum intervals overlap. There are no keys in f5 in the interval [a, c] that overlap with the seqnum [0, 5] which is why the sublevel invariant still holds.


___
*[internal/manifest/version.go, line 516 at r1](https://reviewable.io/reviews/cockroachdb/pebble/591#-M3fPPkT4B4dJYD0gKnJ:-M3fPPkT4B4dJYD0gKnK:b793nhx) ([raw file](https://github.com/cockroachdb/pebble/blob/92c153d65426f92e617cc0f864908de87d099429/internal/manifest/version.go#L516)):*
> ```Go
> 				if f.SmallestSeqNum == f.LargestSeqNum {
> 					// Ingested file.
> 					continue
> ```

why the special case for ingested files? Since the files are ordered with non-decreasing `LargestSeqNum` isn't it sufficient to just use the if-condition below for all files?


<!-- Sent from Reviewable.io -->

Copy link
Collaborator

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 5 files reviewed, 7 unresolved discussions (waiting on @itsbilal, @jbowens, and @sumeerbhola)


internal/manifest/version.go, line 482 at r1 (raw file):

Previously, sumeerbhola wrote…

I was going to suggest adding something like:
Key-partitioned intra-L0 compactions, also not currently implemented, similarly result in overlapping seqnum files that satisfy the same invariant.

But then realized that the invariant there is even weaker and can't be checked without peering into the files, so inclined to say that we should not do the overlap checking below.
Consider the case of two sublevel key intervals [a, c] [d, e]

  [a, c]           [d, e]
  f3[10,15]      f4[11,16]
  f1[0,5]        f2[1,6]

An intra-L0 compaction is allowed to compact f2, f3, f4 which will produce an output file, say f5 with key range [a, e] and seqnum range [1,16]. Both the keys and the seqnum intervals overlap. There are no keys in f5 in the interval [a, c] that overlap with the seqnum [0, 5] which is why the sublevel invariant still holds.

Interesting. Seems like there are very few invariants that can be checked.

Copy link
Member Author

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 5 files reviewed, 7 unresolved discussions (waiting on @itsbilal, @jbowens, and @sumeerbhola)


internal/manifest/version.go, line 482 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Interesting. Seems like there are very few invariants that can be checked.

Yeah. That's also in line with rocksdb; it practically does no invariant checks in L0 (except that the files after a sort are sorted in the order the sort produces them in). I'll remove the overlap cases then, and keep the ingestion checks around.

@itsbilal itsbilal force-pushed the version-relax-seqnum-invariant branch from 92c153d to 6593c22 Compare March 30, 2020 16:18
@itsbilal itsbilal changed the title version: Relax SeqNum overlap invariant in L0 assuming keys dont overlap internal/manifest: Relax SeqNum overlap invariant in L0 Mar 30, 2020
Copy link
Member Author

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the reviews and for pointing this out @sumeerbhola ! I've significantly relaxed the seqnum overlap invariants now. Also fixed the PR title and description. PTAL!

Reviewable status: 0 of 7 files reviewed, 6 unresolved discussions (waiting on @jbowens, @petermattis, and @sumeerbhola)


internal/manifest/version.go, line 482 at r1 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Yeah. That's also in line with rocksdb; it practically does no invariant checks in L0 (except that the files after a sort are sorted in the order the sort produces them in). I'll remove the overlap cases then, and keep the ingestion checks around.

Done.


internal/manifest/version.go, line 516 at r1 (raw file):

Previously, sumeerbhola wrote…

why the special case for ingested files? Since the files are ordered with non-decreasing LargestSeqNum isn't it sufficient to just use the if-condition below for all files?

Not applicable anymore.


internal/manifest/version.go, line 596 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Nit: for cases where we're trying to find the smallest or largest, I like to arrange the comparison to match the subsequent assignment:

if smallestOverlapRangeFlushedSeqNum > f.SmallestSeqNum {
  smallestOverlapRangeFlushedSeqNum = f.SmallestSeqNum
}

Not applicable anymore.


internal/manifest/version.go, line 605 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I'm confused about what overlapRangeStartIdx is being used for. It seems to either be unnecessary, or could be used to optimize expandOverlapRange. Perhaps it is too late on Friday afternoon and I'm missing something.

Not applicable anymore.


internal/manifest/version.go, line 607 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Rather than having checkOverlapRangeForKeyOverlap mutate the slice it is passed, I think it would be cleaner and safer to have that function make a copy internally. This would get rid of a small amount of duplication, and ensure that a new caller of checkOverlapRangeForKeyOverlap doesn't have to be aware of this requirement.

Not applicable anymore.


internal/manifest/testdata/version_check_ordering, line 104 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I realize you didn't touch this error message in this change, but it reads a bit strangely. How does a "flushed file" have an "ingested file coincident"? I think this should be something like: L0 flushed file 000007 has smallest seqnum coincident with ingested file: 15 vs 15-17.

Done. I always found that error message weird too, so it's pretty worthwhile to address it now.


tool/testdata/manifest_dump, line 173 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I think we should use the format used in manifest dump: <#SmallestSeq-#LargestSeq>[SmallestKey-LargestKey].

Not applicable anymore.

Copy link
Collaborator

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: 0 of 7 files reviewed, 4 unresolved discussions (waiting on @itsbilal, @jbowens, and @sumeerbhola)


internal/manifest/testdata/version_check_ordering, line 35 at r2 (raw file):

  k.SET.16-n.SET.19
----
OK

Are these new test cases interesting given that the invariant check is so minimal now?


internal/manifest/testdata/version_check_ordering, line 116 at r2 (raw file):

  a.SET.3-d.SET.5
----
L0 flushed file 000002 has smallest sequence number coincident with an ingested file : 3-5 vs 3

Per my other comment, perhaps we should display the sequence numbers here as <#3-#5> and <#3-?> or something like that.


tool/testdata/manifest_dump, line 173 at r1 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Not applicable anymore.

I think it might still be worthwhile to represent key ranges as [<start>-<end>] (i.e. add the square braces here).

Changes Manifest.CheckOrdering to allow L0 SSTables to overlap
in sequence numbers.

Implemented for better compatibility with future versions of
Pebble which could support partitioned flushes and L0 sublevels.

Fixes cockroachdb#587.
@itsbilal itsbilal force-pushed the version-relax-seqnum-invariant branch from 6593c22 to 10a1170 Compare March 30, 2020 18:50
Copy link
Member Author

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 8 files reviewed, 4 unresolved discussions (waiting on @jbowens, @petermattis, and @sumeerbhola)


internal/manifest/testdata/version_check_ordering, line 35 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Are these new test cases interesting given that the invariant check is so minimal now?

Not as much, but it's still worthwhile to show some of the real-world complexity that can emerge from this.


internal/manifest/testdata/version_check_ordering, line 116 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Per my other comment, perhaps we should display the sequence numbers here as <#3-#5> and <#3-?> or something like that.

Done. That makes sense.


tool/testdata/manifest_dump, line 173 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I think it might still be worthwhile to represent key ranges as [<start>-<end>] (i.e. add the square braces here).

Done.

Copy link
Collaborator

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

Reviewable status: 0 of 8 files reviewed, 2 unresolved discussions (waiting on @jbowens and @sumeerbhola)

@itsbilal
Copy link
Member Author

TFTR!

@itsbilal itsbilal merged commit d6182fe into cockroachdb:master Mar 30, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants