db: investigate alignment of key bounds in compactions #2156

sumeerbhola · 2022-12-01T15:43:10Z

Compactions from one non-overlapping level to another have a shape constraint in that the higher level's key span cannot be wider than the lower level. In compactions out of L0, there can be multiple sub-levels involved and the same key span constraint has to apply to all the successive (sub)levels from low to high, e.g.

    L0.2     e--------o
    L0.1    d-----------q
    L0.0  c----------------r
    L2   b------------------t

We are also planning to implement multi-level compactions across non-overlapping levels #136

There is an efficiency loss with this constraint in that if we are rewriting (r,t] and [b,c) in L2, we have wasted that rewrite cost by leaving keys in those key spans at the higher levels.

Once we have virtual sstables (db: split physical sstables into virtual sstables at ingest time #1683) we can expand the higher levels to exactly match the lowest level's key span and virtualize the sstables at those higher levels that are partially consumed. This can increase two costs (a) (slightly) higher space amplification, (b) extra comparisons when iterating over a virtual sstable.
[added 12/6/22] Aside from virtual ssts, we can write ssts during compactions to have perfect alignment with the grandparent level. See http://rocksdb.org/blog/2022/10/31/align-compaction-output-file.html that was brought up in a comment below.

We should first investigate how much data we are leaving behind, that overlaps with the lowest level's key span, for common workloads (ycsb/kv0/tpcc). It is possible that the current grandparent based splitting logic is resulting in fairly aligned files.

The text was updated successfully, but these errors were encountered:

jbowens · 2022-12-01T15:56:15Z

It is possible that the grandparent based splitting logic is resulting in aligned files.

Relatedly, I think there might be gains to be had by tuning the grandparent splitting. The grandparent splitting uses 10 times the output level's target file size as the max overlap:

// maxGrandparentOverlapBytes is the maximum bytes of overlap with level+1
// before we stop building a single file in a level-1 to level compaction.
func maxGrandparentOverlapBytes(opts *Options, level int) uint64 {
	return uint64(10 * opts.Level(level).TargetFileSize)
}

This seems like this would only result in a L_i file being aligned with a L_i+1 ~every five L_i+1 files, given the doubling of the target file size. It's not clear to me what negative consequence there is to splitting more aggressively to align with grandparents? We don't want too small of files, but maybe a heuristic that takes into account the current output file's size could prevent that.

sumeerbhola · 2022-12-01T22:02:20Z

L_i+1 files being narrower in the key span by 5x seems desirable. When L_i is full, and we pick one file from that level to compact down, it is good that the next lower level has narrower files since then we don't have to expand the compaction bounds by much. I may be missing something.

jbowens · 2022-12-05T17:30:49Z

That makes sense. I didn't have a well articulated idea, just a sense that only considering the grandparent bounds in this way is leaving an opportunity to avoid unnecessary overlap with the grandparent. I haven't given the RocksDB blog post that @irfansharif posted a thorough read, but it seems like a similar concept.

jbowens · 2023-01-20T19:38:32Z

A prototype shows a ~12% w-amp improvement on a small kv0 workload:

                         │ old-kvshort.txt │         new-kvshort-3.txt         │
                         │      wamp       │    wamp     vs base               │
Replay/kv0short/WriteAmp       5.587 ± 12%   4.891 ± 4%  -12.44% (p=0.002 n=6)

and fewer compactions:

                                 │ old-kvshort.txt │         new-kvshort-3.txt          │
                                 │   compactions   │ compactions  vs base               │
Replay/kv0short/CompactionCounts       452.0 ± 28%    393.5 ± 9%  -12.94% (p=0.002 n=6)

                                 │ old-kvshort.txt │         new-kvshort-3.txt          │
                                 │     default     │   default    vs base               │
Replay/kv0short/CompactionCounts       451.5 ± 28%   393.5 ± 10%  -12.85% (p=0.002 n=6)

Introduce a new type `frontiers`, designed to monitor several different user key frontiers during a compaction. When a user key is encountered that equals or exceeds the configured frontier, the code that specified the frontier is notified and given an opportunity to set a new frontier. Internally, `frontiers` uses a heap (code largely copied from the merging iterator's heap) to avoid N key comparisons for every key. This commit refactors the `limitFuncSplitter` type to make use of `frontiers`. The `limitFuncSplitter` type is used to split flushes to L0 flush split keys, and to split both flushes and compactions to avoid excessive overlap with grandparent files. This change is motivated by cockroachdb#2156, which will introduce an additional compaction-output splitter that must perform key comparisons against the next key to decide when to split a compaction. Additionally, the `frontiers` type may also be useful for other uses, such as applying key-space-dependent logic during a compaction (eg, compaction-time GC, disaggregated storage locality policies, or keyspan-restricted snapshots cockroachdb#1810).

Introduce a new type `frontiers`, designed to monitor several different user key frontiers during a compaction. When a user key is encountered that equals or exceeds the configured frontier, the code that specified the frontier is notified and given an opportunity to set a new frontier. Internally, `frontiers` uses a heap (code largely copied from the merging iterator's heap) to avoid N key comparisons for every key. This commit refactors the `limitFuncSplitter` type to make use of `frontiers`. The `limitFuncSplitter` type is used to split flushes to L0 flush split keys, and to split both flushes and compactions to avoid excessive overlap with grandparent files. This change is motivated by #2156, which will introduce an additional compaction-output splitter that must perform key comparisons against the next key to decide when to split a compaction. Additionally, the `frontiers` type may also be useful for other uses, such as applying key-space-dependent logic during a compaction (eg, compaction-time GC, disaggregated storage locality policies, or keyspan-restricted snapshots #1810).

blathers-crl bot added A-storage T-storage labels Dec 1, 2022

sumeerbhola changed the title ~~db: investigate virtual sstables to align key bounds in compactions~~ db: investigate alignment of key bounds in compactions Dec 6, 2022

jbowens self-assigned this Jan 18, 2023

jbowens mentioned this issue Jan 20, 2023

db: refactor compaction splitting to reduce key comparisons #2259

Merged

jbowens mentioned this issue Feb 9, 2023

db: implement relaxed target file size splitting heuristic #2316

Merged

jbowens closed this as completed in #2316 Feb 15, 2023

sumeerbhola mentioned this issue Jul 19, 2023

db: potentially trade write amp for space amp during compactions through vsstables #2752

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

db: investigate alignment of key bounds in compactions #2156

db: investigate alignment of key bounds in compactions #2156

sumeerbhola commented Dec 1, 2022 •

edited

Loading

jbowens commented Dec 1, 2022

sumeerbhola commented Dec 1, 2022

jbowens commented Dec 5, 2022

jbowens commented Jan 20, 2023

db: investigate alignment of key bounds in compactions #2156

db: investigate alignment of key bounds in compactions #2156

Comments

sumeerbhola commented Dec 1, 2022 • edited Loading

jbowens commented Dec 1, 2022

sumeerbhola commented Dec 1, 2022

jbowens commented Dec 5, 2022

jbowens commented Jan 20, 2023

sumeerbhola commented Dec 1, 2022 •

edited

Loading