internal/manifest: Introduce L0SubLevels data structure #614
fee6a52 to cbbeb50 (force-push)
I am going to mainly lurk in this review, since I know the code. I suspect it will need more code comments to be easier to digest. Here are some initial comments.
Reviewed 1 of 5 files at r1.
Reviewable status: 0 of 5 files reviewed, 17 unresolved discussions (waiting on @itsbilal, @jbowens, @petermattis, and @sumeerbhola)
internal/manifest/l0_sublevels.go, line 17 at r1 (raw file):
// TODO(sbhola): work items:
// - Integration with Pebble (probably as a followup to #546)
// - TPCC import experiments
Can you change the first TODO to TODO(bilal) and remove the second one?
internal/manifest/l0_sublevels.go, line 23 at r1 (raw file):
// occurs because the union of file boundary keys is used to pick intervals.
// However the largest key in a file is not inclusive, so when it is used as
// an interval, the actual key is ImmediateSuccessor(key). We don't have the
This is likely to be confusing, so before the "We don't have ..." sentence, can you add an example like the following:
Consider two files with boundaries, f1 [a, c], f2 [b, f]. The interval boundary keys are a, b, IS(c), IS(f), which makes intervals [a, b), [b, IS(c)), [IS(c), IS(f)), where f1 perfectly overlaps with the first two intervals and f2 with the second two intervals.
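The interval construction in this example can be sketched in Go. This is a minimal illustration over string user keys with hypothetical names (`intervalKey`, `intervalBounds`); the actual code works on internal keys and the user-supplied comparator:

```go
package main

import (
	"fmt"
	"sort"
)

// intervalKey is an illustrative stand-in for the structure discussed in
// the review: a user key plus a flag marking ImmediateSuccessor(key).
type intervalKey struct {
	key         string
	isSuccessor bool
}

func (k intervalKey) String() string {
	if k.isSuccessor {
		return "IS(" + k.key + ")"
	}
	return k.key
}

// intervalBounds takes files as [smallest, largest] user-key pairs, maps
// each largest key to its immediate successor, then sorts and de-duplicates
// the union of boundary keys. Consecutive pairs of the result form the
// half-open intervals [b[i], b[i+1]).
func intervalBounds(files [][2]string) []intervalKey {
	var keys []intervalKey
	for _, f := range files {
		keys = append(keys, intervalKey{key: f[0]})
		keys = append(keys, intervalKey{key: f[1], isSuccessor: true})
	}
	sort.Slice(keys, func(i, j int) bool {
		if keys[i].key != keys[j].key {
			return keys[i].key < keys[j].key
		}
		return !keys[i].isSuccessor && keys[j].isSuccessor // k < IS(k)
	})
	if len(keys) == 0 {
		return nil
	}
	// De-duplicate in place.
	j := 0
	for i := 1; i < len(keys); i++ {
		if keys[i] != keys[j] {
			j++
			keys[j] = keys[i]
		}
	}
	return keys[:j+1]
}

func main() {
	// f1 [a, c], f2 [b, f] from the example above.
	fmt.Println(intervalBounds([][2]string{{"a", "c"}, {"b", "f"}}))
	// prints [a b IS(c) IS(f)], i.e. intervals [a, b), [b, IS(c)), [IS(c), IS(f))
}
```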
internal/manifest/l0_sublevels.go, line 85 at r1 (raw file):
}

// A wrapper around FileMetadata.
I think Peter wanted these extra fields folded into FileMetadata, because when integrating this we'll want L0SubLevels to return [][]*FileMetadata (for integrating with the read path), unlike the prototype, which is maintaining both in Version.
internal/manifest/l0_sublevels.go, line 104 at r1 (raw file):
// A key interval of the form [start, end). The end is not represented here since it
// is implicit in the start of the next interval. The last interval is an exception
// but we don't need to ever lookup the end of that interval. The set of intervals
I don't think we even need that last interval -- it will have 0 files. That will get rid of this confusing comment.
internal/manifest/l0_sublevels.go, line 115 at r1 (raw file):
// sublevels in this interval than files undergoing compaction, which may
// allow for this interval to participate in another compaction, but for now
// we eschew such complexity).
as far as I know, this seqnum 0 case only occurs in RocksDB, so perhaps this comment is not relevant, since ongoing compactions are irrelevant for compatibility.
internal/manifest/l0_sublevels.go, line 131 at r1 (raw file):
// to true. Note that this is a pessimistic filter: the file that widened
// [internalFilesMinIntervalIndex, internalFilesMaxIntervalIndex] may be at a high
// sub-level and may not need to be included in the compaction.
I think this comment could be rephrased and one can include an example like the one I used in the walkthrough. The issue here is that we stop once we have a viable candidate (we really can't afford to consider all intervals as candidates, at every picking request), so we want to first consider candidates further away from ongoing compactions, since they are more likely to give us a deeper compaction.
internal/manifest/l0_sublevels.go, line 179 at r1 (raw file):
	_ = log.Output(2, fmt.Sprintf(format, args...))
	os.Exit(1)
}
I just copy-pasted this logging stuff quickly in the prototype -- there may be a more principled solution.
internal/manifest/l0_sublevels.go, line 187 at r1 (raw file):
// sublevels having newer keys as opposed to lower indexed levels.
//
// There is no limit to the number of sublevels that can exist in L0 at any time
nit: comma after "time".
internal/manifest/l0_sublevels.go, line 209 at r1 (raw file):
flushSplitUserKeys      [][]byte
flushSplitFraction      []float64
flushSplitIntervalIndex []int
I was using flushSplitFraction and flushSplitIntervalIndex to print to log, to figure out the root cause of wide files. They didn't indicate anything out of the ordinary, so I'd suggest removing them.
internal/manifest/l0_sublevels.go, line 212 at r1 (raw file):
// for debugging
CompactingFilesAtCreation []*FileMetadata
This can be removed too -- it was a temporary debugging aid.
internal/manifest/l0_sublevels.go, line 469 at r1 (raw file):
// Only for temporary debugging in the absence of proper tests.
//
// TODO(itsbilal): Remove this after integration is complete.
I suspect we'll want to run with checking turned on in production -- it is better to fail before a compaction corrupts the LSM invariants.
internal/manifest/l0_sublevels.go, line 585 at r1 (raw file):
// UpdateStateForStartedCompaction updates internal L0SubLevels state for a
// recently started compaction. isBase specifies if this is a base compaction;
// if false, this is assumed to be an L0 compaction. The specified compaction
s/L0/intra-L0/
internal/manifest/l0_sublevels.go, line 701 at r1 (raw file):
// first interval we can construct a compaction for is the compaction that is
// started. There can be multiple heuristics in choosing the ordering of the
// intervals -- the code uses one heuristic, but experimentation is probably
... uses one heuristic that worked well for a large TPCC import, but more experimentation is probably needed.
internal/manifest/l0_sublevels.go, line 721 at r1 (raw file):
// shape. But this non-overlapping sequence number is already relaxed in
// RocksDB -- sstables are primarily ordered by their largest sequence
do we need a TODO to confirm this assertion?
internal/manifest/l0_sublevels.go, line 892 at r1 (raw file):
// optional activity so when it fails we can fallback to the last
// successful candidate. Currently the code keeps adding until it can't
// add more, but we could optionally stop based on
the "Currently ..." comment is stale.
internal/manifest/l0_sublevels.go, line 934 at r1 (raw file):
if lastCandidate.seedIntervalStackDepthReduction >= minCompactionDepth &&
	cFiles.fileBytes > 100<<20 &&
	(float64(cFiles.fileBytes)/float64(lastCandidate.fileBytes) > 1.5 || cFiles.fileBytes > 500<<20) {
the heuristic in this line could use a comment -- I pulled the 1.5 constant out of thin air. The goal was to grow if the increase rate was low enough (which is the cost of decreasing stack depth by 1), until we hit the high threshold.
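One reading of the quoted condition can be sketched as a standalone predicate. This is a hedged illustration: the constants (100<<20, 500<<20, 1.5) are the ones in the quoted line, but the parameter names are simplified stand-ins for the candidate fields:

```go
package main

import "fmt"

// shouldStopGrowing sketches the stopping rule discussed in the review:
// growing the candidate is worthwhile while each extra unit of stack-depth
// reduction is cheap in added bytes; stop once the candidate is large and
// the byte growth rate outpaces the benefit, or once a hard size cap is hit.
func shouldStopGrowing(depthReduction, minDepth int, curBytes, lastBytes uint64) bool {
	const lowCap = 100 << 20  // 100 MiB: below this, keep growing freely
	const highCap = 500 << 20 // 500 MiB: hard cap on candidate size
	const maxGrowthRatio = 1.5
	return depthReduction >= minDepth &&
		curBytes > lowCap &&
		(float64(curBytes)/float64(lastBytes) > maxGrowthRatio || curBytes > highCap)
}

func main() {
	// Small candidate: keep growing even if the ratio is high.
	fmt.Println(shouldStopGrowing(3, 2, 50<<20, 10<<20)) // false
	// Large candidate growing slowly: keep growing.
	fmt.Println(shouldStopGrowing(3, 2, 120<<20, 110<<20)) // false
	// Large candidate growing fast: stop.
	fmt.Println(shouldStopGrowing(3, 2, 200<<20, 110<<20)) // true
}
```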
internal/manifest/l0_sublevels.go, line 1315 at r1 (raw file):
// We pick the longest sequence between firstIndex
// and lastIndex of non-compacting files -- this is represented by
// [candidateNonCompactingFirst, candidateNonCompactingLast].
The logic here is tricky and can probably use more comments. Specifically, the "longest sequence" comment is incomplete -- if we have already picked some files in this sublevel for this compaction, the sequence we use has to be the one that includes those already picked files, even if it is not the longest one. This is the reason for candidateHasAlreadyPickedFiles below.
I wouldn't be surprised if there is still a bug lurking in extendCandidateToRectangle -- it was a source of multiple bugs during my experiments.
Split off the compaction picking / generating methods into #615. Thanks for the review @sumeerbhola, I'll update both PRs with that soon.
cd82ce8 to 263dccc (force-push)
Reviewable status: 0 of 5 files reviewed, 16 unresolved discussions (waiting on @jbowens, @petermattis, and @sumeerbhola)
internal/manifest/l0_sublevels.go, line 17 at r1 (raw file):
Previously, sumeerbhola wrote…
can you change the first todo to TODO(bilal) and remove the second one.
Done.
internal/manifest/l0_sublevels.go, line 23 at r1 (raw file):
Previously, sumeerbhola wrote…
This is likely to be confusing, so before the "We don't have ..." sentence, can you add an example like the following:
Consider two files with boundaries, f1 [a, c], f2 [b, f]. The interval boundary keys are a, b, IS(c), IS(f), which makes intervals [a, b), [b, IS(c)), [IS(c), IS(f)), where f1 perfectly overlaps with the first two intervals and f2 with the second two intervals.
Done.
internal/manifest/l0_sublevels.go, line 85 at r1 (raw file):
Previously, sumeerbhola wrote…
I think Peter wanted these extra fields folded into FileMetadata, because when integrating this we'll want L0SubLevels to return [][]*FileMetadata (for integrating with the read path), unlike the prototype, which is maintaining both in Version.
Done. That doesn't involve removing this struct from Version though, right? Since we'd still need extra info here (that's not in FileMetadata) to pick compactions.
internal/manifest/l0_sublevels.go, line 104 at r1 (raw file):
Previously, sumeerbhola wrote…
I don't think we even need that last interval -- it will have 0 files. That will get rid of this confusing comment.
We still need it to know the end of the interval before it. I clarified this comment with that info.
internal/manifest/l0_sublevels.go, line 115 at r1 (raw file):
Previously, sumeerbhola wrote…
as far as I know, this seqnum 0 case only occurs in RocksDB, so perhaps this comment is not relevant, since ongoing compactions are irrelevant for compatibility.
Done.
internal/manifest/l0_sublevels.go, line 131 at r1 (raw file):
Previously, sumeerbhola wrote…
I think this comment could be rephrased and one can include an example like the one I used in the walkthrough. The issue here is that we stop once we have a viable candidate (we really can't afford to consider all intervals as candidates, at every picking request), so we want to first consider candidates further away from ongoing compactions, since they are more likely to give us a deeper compaction.
Done.
internal/manifest/l0_sublevels.go, line 179 at r1 (raw file):
Previously, sumeerbhola wrote…
I just copy-pasted this logging stuff quickly in the prototype -- there may be a more principled solution.
Done. Moved to returning errors instead, unless it's a good candidate for a panic.
internal/manifest/l0_sublevels.go, line 209 at r1 (raw file):
Previously, sumeerbhola wrote…
I was using flushSplitFraction and flushSplitIntervalIndex to print to log, to figure out the root cause of wide files. They didn't indicate anything out of the ordinary, so I'd suggest removing them.
Done.
internal/manifest/l0_sublevels.go, line 212 at r1 (raw file):
Previously, sumeerbhola wrote…
This can be removed too -- it was a temporary debugging aid.
Done.
internal/manifest/l0_sublevels.go, line 469 at r1 (raw file):
Previously, sumeerbhola wrote…
I suspect we'll want to run with checking turned on in production -- it is better to fail before a compaction corrupts the LSM invariants.
Done. Updated the TODO with this instead. (All comments after this point were addressed in #615, not here).
internal/manifest/l0_sublevels.go, line 585 at r1 (raw file):
Previously, sumeerbhola wrote…
s/L0/intra-L0/
Done.
internal/manifest/l0_sublevels.go, line 701 at r1 (raw file):
Previously, sumeerbhola wrote…
... uses one heuristic that worked well for a large TPCC import, but more experimentation is probably needed.
Done.
internal/manifest/l0_sublevels.go, line 721 at r1 (raw file):
Previously, sumeerbhola wrote…
// shape. But this non-overlapping sequence number is already relaxed in
// RocksDB -- sstables are primarily ordered by their largest sequence
do we need a TODO to confirm this assertion?
That's a separate, general task that I have planned for later on - confirm that RocksDB is able to read a split flush under this sort of setup. Don't think it warrants a TODO in Pebble.
internal/manifest/l0_sublevels.go, line 892 at r1 (raw file):
Previously, sumeerbhola wrote…
the "Currently ..." comment is stale.
Done.
internal/manifest/l0_sublevels.go, line 934 at r1 (raw file):
Previously, sumeerbhola wrote…
the heuristic in this line could use a comment -- I pulled the 1.5 constant out of thin air. The goal was to grow if the increase rate was low enough (which is the cost of decreasing stack depth by 1), until we hit the high threshold.
Done.
internal/manifest/l0_sublevels.go, line 1315 at r1 (raw file):
Previously, sumeerbhola wrote…
The logic here is tricky and can probably use more comments. Specifically, the "longest sequence" comment is incomplete -- if we have already picked some files in this sublevel for this compaction, the sequence we use has to be the one that includes those already picked files, even if it is not the longest one. This is the reason for candidateHasAlreadyPickedFiles below.
I wouldn't be surprised if there is still a bug lurking in extendCandidateToRectangle -- it was a source of multiple bugs during my experiments.
This is a general TODO for the next PR (#615) then, I guess. I'll go through this code again and add more comments, thanks for the suggestion.
Reviewable status: 0 of 5 files reviewed, 33 unresolved discussions (waiting on @itsbilal, @jbowens, @petermattis, and @sumeerbhola)
internal/manifest/l0_sublevels.go, line 1 at r2 (raw file):
package manifest
Missing copyright notice.
internal/manifest/l0_sublevels.go, line 19 at r2 (raw file):
// file overlaps perfectly with a sequence of intervals. This perfect overlap
// occurs because the union of file boundary keys is used to pick intervals.
// However the largest key in a file is not inclusive, so when it is used as
s/not inclusive/exclusive/g
internal/manifest/l0_sublevels.go, line 37 at r2 (raw file):
// being the index of the start key of the interval.
//
// In addition to helping with compaction choosing, we use interval indices
s/compaction choosing/compaction picking/g
?
internal/manifest/l0_sublevels.go, line 86 at r2 (raw file):
	}
}
return keys[0 : j+1]
You can omit the 0 here: return keys[:j+1].
internal/manifest/l0_sublevels.go, line 207 at r2 (raw file):
) (*L0SubLevels, error) {
	s := &L0SubLevels{cmp: cmp, format: formatter}
	s.filesByAge = make([]*FileMetadata, len(files))
It isn't obvious to me why you have to make a copy of files here. You don't seem to be mutating filesByAge. You are mutating the *FileDescriptor. I believe this line could be s.filesByAge = files. If it can't, please add a comment above the filesByAge field explaining why.
internal/manifest/l0_sublevels.go, line 211 at r2 (raw file):
for i := range files {
	s.filesByAge[i] = files[i]
	files[i].subLevelIndex = i
The subLevelIndex field isn't used after being set here. Is it needed?
The name of this field is a bit confusing as it isn't the sublevel, but the index of the file in the filesByAge slice.
internal/manifest/l0_sublevels.go, line 245 at r2 (raw file):
}
f.maxIntervalIndex--
interpolatedBytes := f.Size / uint64(f.maxIntervalIndex-f.minIntervalIndex+1)
This estimation is basic. Would we get a benefit from a more accurate estimation of the overlapping bytes in an sstable? This already exists in the sstable/Reader.EstimatedDiskUsage() function, though there would be a small challenge in plumbing access to this method here, and we might need to be wary of the cost of that method.
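The estimate in the quoted line spreads a file's size evenly across the intervals it spans. A one-line sketch with simplified field names (not the actual Pebble identifiers):

```go
package main

import "fmt"

// interpolatedBytes sketches the per-interval size estimate quoted above: a
// file's bytes are attributed evenly to every interval it overlaps, so a
// wide file contributes only size/width bytes to each interval.
func interpolatedBytes(fileSize uint64, minIntervalIndex, maxIntervalIndex int) uint64 {
	width := uint64(maxIntervalIndex - minIntervalIndex + 1)
	return fileSize / width
}

func main() {
	// A 1000-byte file spanning intervals [2, 6] (width 5).
	fmt.Println(interpolatedBytes(1000, 2, 6)) // 200
	// A narrow file spanning a single interval keeps all its bytes there.
	fmt.Println(interpolatedBytes(1000, 4, 4)) // 1000
}
```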
internal/manifest/l0_sublevels.go, line 249 at r2 (raw file):
subLevel := 0
if f.Compacting {
	if !f.IsBaseCompacting && !f.IsIntraL0Compacting {
If f.Compacting is true, f.IsBaseCompacting == !f.IsIntraL0Compacting. Perhaps you can get rid of one of Is{Base,IntraL0}Compacting to eliminate the possibility of this erroneous state from ever occurring.
internal/manifest/l0_sublevels.go, line 275 at r2 (raw file):
	interval.topOfStackNonCompactingFileCount++
}
interval.fileBytes += interpolatedBytes
Perhaps s/fileBytes/estimatedBytes to remind ourselves that this is an estimate.
internal/manifest/l0_sublevels.go, line 313 at r2 (raw file):
}
if cumulativeBytes > flushSplitMaxBytes &&
	(len(s.flushSplitUserKeys) == 0 ||
		!bytes.Equal(interval.startKey.key.UserKey, s.flushSplitUserKeys[len(s.flushSplitUserKeys)-1])) {
I believe the bytes.Equal comparison here is done to ensure that we're never adding a duplicate key to flushSplitUserKeys. Could we achieve the same by ensuring that the orderedIntervals[i].startKey.key.UserKey values do not have duplicates?
internal/manifest/l0_sublevels.go, line 329 at r2 (raw file):
	len(s.filesByAge), len(s.subLevels), len(s.orderedIntervals), len(s.flushSplitUserKeys))
for i := range s.flushSplitUserKeys {
	fmt.Fprintf(&buf, "%s,",s.flushSplitUserKeys[i])
I think you want to use s.format for formatting the user keys.
internal/manifest/l0_sublevels_test.go, line 1 at r2 (raw file):
package manifest
Missing copyright comment.
internal/manifest/l0_sublevels_test.go, line 183 at r2 (raw file):
func BenchmarkL0SubLevelsInit(b *testing.B) {
	v, err := readManifest("testdata/MANIFEST_import")
I suppose this is the primary reason for adding testdata/MANIFEST_import. That is a hefty file to be adding for benchmark purposes. Could we instead construct a MANIFEST with similar properties?
internal/manifest/version.go, line 67 at r2 (raw file):
// For L0 files only. Protected by DB.mu. Used to generate L0 sublevels and
// pick L0 compactions.
IsBaseCompacting bool
Does this field need to be exported?
These fields need some comments. How does IsBaseCompacting differ from Compacting? Ditto for IsIntraL0Compacting?
internal/manifest/testdata/l0_sublevels, line 10 at r2 (raw file):
0009:f.SET.10-i.SET.10
0010:f.SET.11-g.SET.11
L6
What is the point of defining L6 in these tests? As far as I can tell it is never used. Am I missing something?
internal/manifest/testdata/l0_sublevels, line 45 at r2 (raw file):
----
file count: 5, sublevels: 4, intervals: 9, flush keys: 4
split:f,g,h,p,
Nit: add a space after split: and there shouldn't be a trailing comma.
It is mildly confusing to me to have flush keys on the previous line and then split here. I think it might be mildly clearer to use the Go slice format:
split: [f, g, h, p]
If we want to indicate the cardinality, we could do something like: split(4): [f, g, h, p].
internal/manifest/testdata/l0_sublevels, line 51 at r2 (raw file):
0.0: file count: 2, bytes: 512, width (mean, max): 1, 2, interval range: [3, 7]
compacting file count: 0, base compacting intervals:
Sublevel 0: 000006:f#4,1-g#5,1, 000011:n#8,1-p#10,1
There are two inconsistencies here. Above we display the sublevel as 0.3, 0.2, etc. Here it is named Sublevel 0. I think we should use the 0.<sublevel> format in both places.
The second inconsistency is the order in which the sublevels are displayed.
Should the details of the files in each sublevel be interleaved with the overview above?
Formatting all of the file details in a single line per sublevel will be unreadable if there are a large number of files. Combining all of these points, how about something like:
0.3: file count: 1, bytes: 256, width (mean, max): 4, 4, interval range: [0, 3]
000010:f#11,1-g#11,1
0.2: file count: 1, bytes: 256, width (mean, max): 7, 7, interval range: [1, 7]
000009:f#10,1-p#10,1
0.1: file count: 1, bytes: 256, width (mean, max): 4, 4, interval range: [2, 5]
000005:f#6,1-h#9,1
0.0: file count: 2, bytes: 512, width (mean, max): 1, 2, interval range: [3, 7]
000006:f#4,1-g#5,1
000011:n#8,1-p#10,1
Reviewable status: 0 of 5 files reviewed, 33 unresolved discussions (waiting on @itsbilal, @jbowens, @petermattis, and @sumeerbhola)
internal/manifest/l0_sublevels.go, line 207 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
It isn't obvious to me why you have to make a copy of files here. You don't seem to be mutating filesByAge. You are mutating the *FileDescriptor. I believe this line could be s.filesByAge = files. If it can't, please add a comment above the filesByAge field explaining why.
filesByAge used to be []fileMeta, which is why the prototype code needed to copy.
internal/manifest/l0_sublevels.go, line 211 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
The subLevelIndex field isn't used after being set here. Is it needed?
The name of this field is a bit confusing as it isn't the sublevel, but the index of the file in the filesByAge slice.
It used to be called fileMeta.index. It is extensively used in the compaction picking logic.
internal/manifest/l0_sublevels.go, line 245 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
This estimation is basic. Would we get a benefit from a more accurate estimation of the overlapping bytes in an sstable? This already exists in the sstable/Reader.EstimatedDiskUsage() function, though there would be a small challenge in plumbing access to this method here, and we might need to be wary of the cost of that method.
Maybe a TODO? I think we would need to keep an incremental byte estimates data-structure and only adjust for added and removed files.
internal/manifest/l0_sublevels.go, line 313 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
I believe the bytes.Equal comparison here is done to ensure that we're never adding a duplicate key to flushSplitUserKeys. Could we achieve the same by ensuring that the orderedIntervals[i].startKey.key.UserKey values do not have duplicates?
IMO, that would be messier since we'd change the invariant that a file exactly overlaps with a sequence of intervals. We've contained special handling of userKeys to the split keys logic.
Reviewable status: 0 of 5 files reviewed, 33 unresolved discussions (waiting on @itsbilal, @jbowens, @petermattis, and @sumeerbhola)
internal/manifest/l0_sublevels.go, line 211 at r2 (raw file):
Previously, sumeerbhola wrote…
It used to be called fileMeta.index. It is extensively used in the compaction picking logic.
Would l0Index be a better name?
internal/manifest/l0_sublevels.go, line 245 at r2 (raw file):
Previously, sumeerbhola wrote…
Maybe a TODO? I think we would need to keep an incremental byte estimates data-structure and only adjust for added and removed files.
A TODO is good. I didn't mean to suggest changing this here.
internal/manifest/l0_sublevels.go, line 313 at r2 (raw file):
Previously, sumeerbhola wrote…
IMO, that would be messier since we'd change the invariant that a file exactly overlaps with a sequence of intervals. We've contained special handling of userKeys to the split keys logic.
Ack. Ok.
Reviewable status: 0 of 5 files reviewed, 33 unresolved discussions (waiting on @itsbilal, @jbowens, @petermattis, and @sumeerbhola)
internal/manifest/l0_sublevels.go, line 245 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
A TODO is good. I didn't mean to suggest changing this here.
This interpolation worked ok because the ingested sstables were usually narrow. But it has been in the back of my mind that L0SubLevels (and the compaction picker changes) should also be able to handle a burst in write throughput coming through normal flushes. Say we have accumulated 10 files in L0 that all overlap with each other, and want the 11th file being flushed to be properly split -- the interpolated estimate from the first 10 files is going to be wildly wrong if they all pair-wise overlap with each other. So we will probably require a better estimate to handle this case.
263dccc to 47f48c9 (force-push)
Thanks for the detailed review!
Reviewable status: 0 of 5 files reviewed, 32 unresolved discussions (waiting on @jbowens, @petermattis, and @sumeerbhola)
internal/manifest/l0_sublevels.go, line 1 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Missing copyright notice.
Done.
internal/manifest/l0_sublevels.go, line 19 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
s/not inclusive/exclusive/g
Done.
internal/manifest/l0_sublevels.go, line 37 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
s/compaction choosing/compaction picking/g
?
Done.
internal/manifest/l0_sublevels.go, line 86 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
You can omit the 0 here: return keys[:j+1].
Done.
internal/manifest/l0_sublevels.go, line 207 at r2 (raw file):
Previously, sumeerbhola wrote…
filesByAge used to be []fileMeta, which is why the prototype code needed to copy.
Ah right, good catch. It's an artifact from a past iteration.
internal/manifest/l0_sublevels.go, line 211 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Would l0Index be a better name?
l0Index makes more sense, and has more meaning outside of this data structure. Done.
internal/manifest/l0_sublevels.go, line 245 at r2 (raw file):
Previously, sumeerbhola wrote…
This interpolation worked ok because the ingested sstables were usually narrow. But it has been in the back of my mind that L0SubLevels (and the compaction picker changes) should also be able to handle a burst in write throughput coming through normal flushes. Say we have accumulated 10 files in L0 that all overlap with each other, and want the 11th file being flushed to be properly split -- the interpolated estimate from the first 10 files is going to be wildly wrong if they all pair-wise overlap with each other. So we will probably require a better estimate to handle this case.
Added a TODO. I agree that this estimation is simple and overlooks cases like the one Sumeer mentioned.
internal/manifest/l0_sublevels.go, line 249 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
If f.Compacting is true, f.IsBaseCompacting == !f.IsIntraL0Compacting. Perhaps you can get rid of one of Is{Base,IntraL0}Compacting to eliminate the possibility of this erroneous state from ever occurring.
Done.
internal/manifest/l0_sublevels.go, line 275 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Perhaps s/fileBytes/estimatedBytes to remind ourselves that this is an estimate.
Done.
internal/manifest/l0_sublevels.go, line 329 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
I think you want to use s.format for formatting the user keys.
Done.
internal/manifest/l0_sublevels_test.go, line 1 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Missing copyright comment.
Done.
internal/manifest/l0_sublevels_test.go, line 183 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
I suppose this is the primary reason for adding testdata/MANIFEST_import. That is a hefty file to be adding for benchmark purposes. Could we instead construct a MANIFEST with similar properties?
There's another test that the next PR adds, that relies on being able to pick base and intra-L0 compactions repeatedly from this large file. Agree that it is a pretty hefty file though. Maybe I could write a method to dynamically generate a smaller file for this benchmark?
internal/manifest/version.go, line 67 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Does this field need to be exported?
These fields need some comments. How does IsBaseCompacting differ from Compacting? Ditto for IsIntraL0Compacting?
It needs to be exported since it'll be used in addInProgressCompaction/removeInProgressCompaction in compaction.go. Added a comment for IsIntraL0Compacting, and removed IsBaseCompacting altogether.
internal/manifest/testdata/l0_sublevels, line 10 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
What is the point of defining L6 in these tests? As far as I can tell it is never used. Am I missing something?
It's used in the next PR, since the compaction picking logic relies on knowing the state of Lbase files.
internal/manifest/testdata/l0_sublevels, line 45 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Nit: add a space after split: and there shouldn't be a trailing comma.
It is mildly confusing to me to have flush keys on the previous line and then split here. I think it might be mildly clearer to use the Go slice format:
split: [f, g, h, p]
If we want to indicate the cardinality, we could do something like: split(4): [f, g, h, p].
Done.
internal/manifest/testdata/l0_sublevels, line 51 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
There are two inconsistencies here. Above we display the sublevel as 0.3, 0.2, etc. Here it is named Sublevel 0. I think we should use the 0.<sublevel> format in both places.
The second inconsistency is the order in which the sublevels are displayed.
Should the details of the files in each sublevel be interleaved with the overview above?
Formatting all of the file details in a single line per sublevel will be unreadable if there are a large number of files. Combining all of these points, how about something like:
0.3: file count: 1, bytes: 256, width (mean, max): 4, 4, interval range: [0, 3]
 000010:f#11,1-g#11,1
0.2: file count: 1, bytes: 256, width (mean, max): 7, 7, interval range: [1, 7]
 000009:f#10,1-p#10,1
0.1: file count: 1, bytes: 256, width (mean, max): 4, 4, interval range: [2, 5]
 000005:f#6,1-h#9,1
0.0: file count: 2, bytes: 512, width (mean, max): 1, 2, interval range: [3, 7]
 000006:f#4,1-g#5,1
 000011:n#8,1-p#10,1
That's a pretty good formatting suggestion. Implemented it as-is. The old one had the duplicated sublevel output since String was supposed to be less verbose than what we wanted here, but I fixed that by adding a describe(verbose bool) method that is called here instead. Thanks!
47f48c9 to ad67b0e (force-push)
ad67b0e to a95224c (force-push)
Reviewable status: 0 of 5 files reviewed, 18 unresolved discussions (waiting on @itsbilal, @jbowens, and @sumeerbhola)
internal/manifest/l0_sublevels.go, line 19 at r2 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
Done.
I'm confused reading this comment. FileMetadata.Largest is inclusive. Is the intention to make the end key of an interval exclusive? Why is this important? This might be called out somewhere, but I'm not seeing it right now.
internal/manifest/l0_sublevels.go, line 29 at r3 (raw file):
// comparisons in the following manner:
// - intervalKey{k, false} < intervalKey{k, true}
// - k1 < k2 => intervalKey{k1, _} < intervalKey{k2, _}.
This reminds me of the construction performed in compaction.go to determine the largest key. Not quite the same, though.
Is it important for these intervals to be on internal keys? Given atomic compaction boundary constraints, if a user key straddles two adjacent tables we have to compact both at the same time, right? If that is the case, perhaps we can use MakeSearchKey(f.Smallest.UserKey) and MakeInternalKey(f.Largest.UserKey, 0, 0).
If it is important to keep these intervals on internal keys, it would be great to spell out why.
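The two ordering rules quoted above can be written out as a comparator. This is an illustrative sketch over string user keys with a hypothetical function name; the real code compares internal keys via the configured comparator:

```go
package main

import "fmt"

// intervalKey mirrors the structure quoted above: a key plus a flag
// marking ImmediateSuccessor(key).
type intervalKey struct {
	key         string
	isSuccessor bool
}

// intervalKeyCompare implements the quoted ordering rules:
//   intervalKey{k, false} < intervalKey{k, true}
//   k1 < k2 => intervalKey{k1, _} < intervalKey{k2, _}
func intervalKeyCompare(a, b intervalKey) int {
	if a.key != b.key {
		if a.key < b.key {
			return -1
		}
		return 1
	}
	switch {
	case a.isSuccessor == b.isSuccessor:
		return 0
	case b.isSuccessor: // a is the plain key, b is its successor
		return -1
	default:
		return 1
	}
}

func main() {
	fmt.Println(intervalKeyCompare(intervalKey{"e", false}, intervalKey{"e", true})) // -1
	fmt.Println(intervalKeyCompare(intervalKey{"e", true}, intervalKey{"g", false})) // -1
	fmt.Println(intervalKeyCompare(intervalKey{"g", true}, intervalKey{"g", true}))  // 0
}
```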
internal/manifest/l0_sublevels_test.go, line 183 at r2 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
There's another test that the next PR adds, that relies on being able to pick base and intra-L0 compactions repeatedly from this large file. Agree that it is a pretty hefty file though. Maybe I could write a method to dynamically generate a smaller file for this benchmark?
I think this is fine for now. We can revisit in the future.
Reviewable status: 0 of 5 files reviewed, 19 unresolved discussions (waiting on @itsbilal, @jbowens, @petermattis, and @sumeerbhola)
internal/manifest/l0_sublevels.go, line 30 at r2 (raw file):
// keys produced would be intervalKey{a, false}, intervalKey{b, false},
// intervalKey{e, true} and intervalKey{g, true}, resulting in intervals
// [a, b], [b, (e, true)], [(e, true), (g, true)]. The first file overlaps
this example needs some correction. The file bounds need to be inclusive at the end. And the interval bounds need to be exclusive at the end. And can you include something like [e, f] to show e as both (e, false) and (e, true).
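The interval-boundary construction being corrected here can be sketched as follows. This is a hedged stand-in for the sort-and-dedup step in l0_sublevels.go: each file with inclusive bounds [smallest, largest] contributes the keys {smallest, false} and {largest, true}, and duplicates are removed after sorting. The real code uses the user Comparer and the FileMetadata type.

```go
package main

import (
	"bytes"
	"sort"
)

// Illustrative stand-in for the PR's intervalKey type.
type intervalKey struct {
	key       []byte
	isLargest bool
}

func intervalKeyLess(a, b intervalKey) bool {
	if c := bytes.Compare(a.key, b.key); c != 0 {
		return c < 0
	}
	return !a.isLargest && b.isLargest
}

// collectIntervalKeys gathers the sorted, deduped interval boundary
// keys for a set of inclusive file bounds [smallest, largest].
func collectIntervalKeys(files [][2][]byte) []intervalKey {
	keys := make([]intervalKey, 0, 2*len(files))
	for _, f := range files {
		keys = append(keys, intervalKey{f[0], false}, intervalKey{f[1], true})
	}
	sort.Slice(keys, func(i, j int) bool { return intervalKeyLess(keys[i], keys[j]) })
	out := keys[:0]
	for i := range keys {
		// After sorting, duplicates are adjacent: skip a key that is
		// not strictly greater than the last emitted one.
		if i > 0 && !intervalKeyLess(out[len(out)-1], keys[i]) {
			continue
		}
		out = append(out, keys[i])
	}
	return out
}
```

For files [a, b] and [b, e] this yields {a,false}, {b,false}, {b,true}, {e,true}, i.e. consecutive, gap-free intervals whose endpoints distinguish the inclusive and exclusive sides of the same user key.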
internal/manifest/l0_sublevels.go, line 29 at r3 (raw file):
Previously, petermattis (Peter Mattis) wrote…
This reminds me of the construction performed in compaction.go to determine the largest key. Not quite the same, though. Is it important for these intervals to be on internal keys? Given atomic compaction boundary constraints, if a user key straddles two adjacent tables we have to compact both at the same time, right? If that is the case, perhaps we can use MakeSearchKey(f.Smallest.UserKey) and MakeInternalKey(f.Largest.UserKey, 0, 0). If it is important to keep these intervals on internal keys, it would be great to spell out why.
I don't think it is necessary for the intervals to be on internal keys. Regarding atomic compaction boundaries: compaction.expandInputs is a no-op for level 0, and we happen to get the atomic compaction boundaries for level 0 because Version.Overlaps uses user keys, but I don't think atomic compaction boundaries are necessary for L0. I think they are only necessary for levels where we have split a range tombstone across output sstables, which is not the case with L0 (even after the future flush changes outputting multiple sstables). That said, there doesn't seem to be any harm in the stronger invariant unless we have severe hotspots on individual keys.
Reviewable status: 0 of 5 files reviewed, 19 unresolved discussions (waiting on @itsbilal, @jbowens, and @sumeerbhola)
internal/manifest/l0_sublevels.go, line 29 at r3 (raw file):
I think they are only necessary for levels where we have split a range tombstone across output sstables, which is not the case with L0 (even after the future flush changes outputting multiple sstables).
Are we going to restrict sstable splits during L0 flushing to never split within a range tombstone? Perhaps I'm misunderstanding your statement here.
That said, there doesn't seem to be any harm in the stronger invariant unless we have severe hotspots on individual keys.
My preference is to stay with the same invariant in each level unless we find a strong reason not to. Sounds like we're not seeing a strong reason here.
Reviewable status: 0 of 5 files reviewed, 19 unresolved discussions (waiting on @itsbilal, @jbowens, @petermattis, and @sumeerbhola)
internal/manifest/l0_sublevels.go, line 19 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
I'm confused reading this comment. FileMetadata.Largest is inclusive. Is the intention to make the end key of an interval exclusive? Why is this important? This might be called out somewhere, but I'm not seeing it right now.
The comment was incorrect, that's probably what tripped you here. The largest key in a file's metadata is inclusive. The exclusive end keys in intervals allow for intervals with no gaps in between.
internal/manifest/l0_sublevels.go, line 30 at r2 (raw file):
Previously, sumeerbhola wrote…
this example needs some correction. The file bounds need to be inclusive at the end. And the interval bounds need to be exclusive at the end. And can you include something like [e, f] to show e as both (e, false) and (e, true).
Done. Good catch, also corrected the "file bounds are exclusive" comment above.
internal/manifest/l0_sublevels.go, line 29 at r3 (raw file):
Previously, petermattis (Peter Mattis) wrote…
I think they are only necessary for levels where we have split a range tombstone across output sstables, which is not the case with L0 (even after the future flush changes outputting multiple sstables).
Are we going to restrict sstable splits during L0 flushing to never split within a range tombstone? Perhaps I'm misunderstanding your statement here.
That said, there doesn't seem to be any harm in the stronger invariant unless we have severe hotspots on individual keys.
My preference is to stay with the same invariant in each level unless we find a strong reason not to. Sounds like we're not seeing a strong reason here.
I changed it to user keys. Uncovered a bug on Friday in the iterator PR that was occurring due to sublevels being determined by interval keys. Consider this scenario:
0.1:
00007:[b.SET.6-j.SET.8]
0.0:
00009:[a.SET.10-b.SET.10]
00003:[e.SET.5-j.SET.7]
In this case, b.SET.10 sorts before b.SET.6 due to the higher trailer, so SSTs 00009 and 00007 have no overlap according to our (old) sublevel logic. But the existence of 00003 pushes 00007 higher up in the sublevels, and now b.SET.6 appears to shadow b.SET.10. Also added a test with this example.
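The trap in this scenario comes from the internal-key ordering, which the following self-contained sketch illustrates: user keys sort ascending, but sequence numbers sort descending, so b at seqnum 10 sorts before b at seqnum 6. Intervals built on internal keys therefore treat 00009's end (b#10) as strictly before 00007's start (b#6), even though both files contain user key b. (The real comparison in pebble also folds the key kind into a packed trailer; seqnum alone is enough for this illustration.)

```go
package main

import "bytes"

// cmpInternal sketches pebble's internal key ordering: ascending user
// key, then descending sequence number, so newer versions of the same
// user key sort first.
func cmpInternal(aKey []byte, aSeq uint64, bKey []byte, bSeq uint64) int {
	if c := bytes.Compare(aKey, bKey); c != 0 {
		return c
	}
	switch {
	case aSeq > bSeq:
		return -1
	case aSeq < bSeq:
		return 1
	}
	return 0
}
```

Because b#10 < b#6 under this ordering, interval construction on internal keys saw no overlap between the two files; switching the intervals to user keys makes the overlap at b visible again.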
Force-pushed from a95224c to 69a8fb6
Reviewable status: 0 of 5 files reviewed, 19 unresolved discussions (waiting on @itsbilal, @jbowens, and @sumeerbhola)
internal/manifest/l0_sublevels.go, line 29 at r3 (raw file):
Are we going to restrict sstable splits during L0 flushing to never split within a range tombstone? Perhaps I'm misunderstanding your statement here.
I didn't say it correctly. I meant to say that we should probably not "improperly" truncate tombstones (as defined in https://github.com/cockroachdb/pebble/blob/master/docs/range_deletions.md#improperly-truncated-range-deletes -- if reference needed by @jbowens).
The prototype code does split range tombstones, but a user key does not straddle two output sstables produced by the same flush, so they are properly truncated (https://github.com/cockroachdb/pebble/pull/563/files#diff-ebe3c6610b799c65b7e2b444dc20510a for the Fragmenter change, and the caller in compaction.go, line 1549 in 57e44da: c.rangeDelFrag.FlushToNoStraddling(key.UserKey)).
Reviewable status: 0 of 5 files reviewed, 20 unresolved discussions (waiting on @itsbilal, @jbowens, and @sumeerbhola)
internal/manifest/l0_sublevels.go, line 29 at r3 (raw file):
I didn't say it correctly. I meant to say that we should probably not "improperly" truncate tombstones (as defined in https://github.com/cockroachdb/pebble/blob/master/docs/range_deletions.md#improperly-truncated-range-deletes -- if reference needed by @jbowens).
Got it. I haven't reviewed the compaction picking part of this prototype, so I was missing FlushToNoStraddling.
internal/manifest/l0_sublevels.go, line 48 at r4 (raw file):
// numbers and so avoid expensive byte slice comparisons.
type intervalKey struct {
	key []byte
Now that key is a user key, we could replace isLargest by making the type of key a base.InternalKey and using a defined trailer to indicate isLargest. The advantage of doing this is that we get the comparison routine "for free", along with formatting routines. I probably wouldn't simply replace intervalKey with base.InternalKey everywhere as I feel that might be more confusing. Concretely, I'm proposing:
type intervalKey struct {
	// key.Trailer != 0 indicates !isLargest
	// key.Trailer == 0 indicates isLargest
	key base.InternalKey
}
Reviewable status: 0 of 5 files reviewed, 20 unresolved discussions (waiting on @itsbilal, @jbowens, @petermattis, and @sumeerbhola)
internal/manifest/l0_sublevels.go, line 48 at r4 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Now that key is a user key, we could replace isLargest by making the type of key a base.InternalKey and using a defined trailer to indicate isLargest. The advantage of doing this is that we get the comparison routine "for free", along with formatting routines. I probably wouldn't simply replace intervalKey with base.InternalKey everywhere as I feel that might be more confusing. Concretely, I'm proposing:
type intervalKey struct {
	// key.Trailer != 0 indicates !isLargest
	// key.Trailer == 0 indicates isLargest
	key base.InternalKey
}
I'm less sold on reusing the internal key comparison routine, because it would clutter sortAndDedup with a lot of != 0 and == 0 logic. It feels cleaner to be able to define a comparator that does just what we want it to do here, and then our dedup logic can just rely on our own comparator's equality. That said, I don't feel strongly either way; happy to implement that too if you feel like that'd be better.
A few small additional comments, but I think this is good to merge.
Reviewable status: 0 of 5 files reviewed, 24 unresolved discussions (waiting on @itsbilal, @jbowens, and @sumeerbhola)
internal/manifest/l0_sublevels.go, line 48 at r4 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
I'm less sold on reusing the internal key comparison routine, because it would clutter sortAndDedup with a lot of != 0 and == 0 logic. It feels cleaner to be able to define a comparator that does just what we want it to do here, and then our dedup logic can just rely on our own comparator's equality. That said, I don't feel strongly either way; happy to implement that too if you feel like that'd be better.
I don't have a strong preference. I'm ok with leaving this as-is for now. If the alternative approach becomes more promising for some reason, we can change this in the future.
internal/manifest/l0_sublevels.go, line 120 at r4 (raw file):
// True if another interval that has a file extending into this interval is
// undergoing a compaction into Lbase. In other words, this bool is true
// if any interval in [internalFilesMinIntervalIndex,
Should this be s/internalFiles/files/g?
internal/manifest/l0_sublevels.go, line 362 at r4 (raw file):
}
fmt.Fprintf(&buf, "0.%d: file count: %d, bytes: %d, width (mean, max): %d, %d, interval range: [%d, %d]\n",
	i, len(s.Files[i]), totalBytes, sumIntervals/len(s.Files[i]), maxIntervals, s.Files[i][0].minIntervalIndex,
Nit: should the mean be formatted with %0.1f?
internal/manifest/testdata/l0_sublevels, line 9 at r4 (raw file):
file count: 3, sublevels: 3, intervals: 5
flush split keys(3): [b, e, j]
0.2: file count: 1, bytes: 256, width (mean, max): 2, 2, interval range: [0, 1]
file count is simply the number of files listed below, right? I could see leaving it out. Ditto with the summary file count above and the sublevel count. Doesn't have to happen in this PR.
What is the benefit of displaying the mean and max interval width? Is this to look for cases where there are wide intervals which are "bad"?
internal/manifest/testdata/l0_sublevels, line 10 at r4 (raw file):
flush split keys(3): [b, e, j]
0.2: file count: 1, bytes: 256, width (mean, max): 2, 2, interval range: [0, 1]
000009:a#10,1-b#10,1
Should we include the width and interval range for each file? If there is a file which covers a large number of intervals, this would highlight that.
Reviewable status: 0 of 5 files reviewed, 24 unresolved discussions (waiting on @itsbilal, @jbowens, @petermattis, and @sumeerbhola)
internal/manifest/testdata/l0_sublevels, line 10 at r4 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Should we include the width and interval range for each file? If there is a file which covers a large number of intervals, this would highlight that.
IMO the verbose option that prints every file is going to spam the log when we have large numbers of files in L0 -- we had > 2000 files in L0 during the large import. The prototype only printed the wide files, if any, with each sublevel. There were usually none, and the maximum I saw was 4 wide files at any point in time in L0.
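The width statistics discussed in this thread can be sketched as a small helper. This is an illustrative stand-in, not the PR's code: the field names minIntervalIndex/maxIntervalIndex are assumptions modeled on the describe() output, and the real implementation works on the sublevel's file metadata wrapper.

```go
package main

// fileMeta is a minimal stand-in for the per-file interval bookkeeping
// in l0_sublevels.go; each file spans a contiguous range of intervals.
type fileMeta struct {
	minIntervalIndex int
	maxIntervalIndex int
}

// sublevelWidthStats computes the mean and max interval width across a
// sublevel's files, i.e. the "width (mean, max)" numbers in the
// summary lines being discussed.
func sublevelWidthStats(files []fileMeta) (mean float64, max int) {
	if len(files) == 0 {
		return 0, 0
	}
	sum := 0
	for _, f := range files {
		w := f.maxIntervalIndex - f.minIntervalIndex + 1
		sum += w
		if w > max {
			max = w
		}
	}
	return float64(sum) / float64(len(files)), max
}
```

A large max relative to the mean is the signal Peter and Sumeer are after: it flags a sublevel containing an unusually wide file without having to print every file.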
Force-pushed from 69a8fb6 to a128988
TFTR! Addressed most of your comments.
Reviewable status: 0 of 5 files reviewed, 23 unresolved discussions (waiting on @itsbilal, @jbowens, @petermattis, and @sumeerbhola)
internal/manifest/l0_sublevels.go, line 120 at r4 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Should this be s/internalFiles/files/g?
Good catch. Done.
internal/manifest/l0_sublevels.go, line 362 at r4 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Nit: should the mean be formatted with %0.1f?
Done.
internal/manifest/testdata/l0_sublevels, line 9 at r4 (raw file):
Previously, petermattis (Peter Mattis) wrote…
file count is simply the number of files listed below, right? I could see leaving it out. Ditto with the summary file count above and the sublevel count. Doesn't have to happen in this PR. What is the benefit of displaying the mean and max interval width? Is this to look for cases where there are wide intervals which are "bad"?
File counts are useful for when verbose = false and we don't print out all the files. The mean / max widths help narrow down sublevels that have the wider files we'd need to investigate more.
I'll merge this, barring any further objections. Also, it turns out the GitHub issues robbed the latest commit of a Travis build, so I'll do a no-change force push.
Force-pushed from a128988 to f712f8b
Merging cockroachdb#614 and cockroachdb#635 in quick succession broke the master build. This change updates L0SubLevels with the type rename.
This change introduces L0SubLevels, a data structure to encapsulate
all information to compute L0 Sublevels, flush split keys, and
base / intra-L0 compactions. This data structure will be generated
every time there's a file addition/deletion in L0, the integration
bits will come in future PRs.
The methods to pick base / intra-L0 compactions will come in a
separate PR.
Captures some of the largest code pieces of #563, and effectively
replaces a part of that PR. Sumeer wrote the bulk of this code as
part of his prototype, so thanks Sumeer for his work.
Most of these functions are dead code at the moment, only invoked
by the provided datadriven and non-datadriven tests.