
perf: Introduce sublevels in L0 #609

Closed
itsbilal opened this issue Apr 3, 2020 · 9 comments

itsbilal commented Apr 3, 2020

This issue falls under the general umbrella of compaction improvements suggested in #552.

Motivation

The LSM tree is made up of multiple levels, and SSTables within a level are not allowed to overlap one another. This invariant is relaxed for the topmost level (L0), where SSTables are allowed to overlap in key ranges (and as of #591, seqnum ranges as well). The side effect is that all SSTables in L0 must be merged at read time, so each L0 SSTable contributes one unit of read amplification.

If L0 accumulates a lot of SSTables, and these SSTables span particularly wide key ranges, they end up overlapping with many other L0 SSTables as well as with many SSTables in LBase (the next non-empty level "below" L0). This results in very wide, slow compactions, as all overlapping L0 and LBase SSTables must be compacted together, and it reduces the number of concurrent L0 -> LBase compactions.

These situations are observed during large IMPORT operations, where L0 SSTables are created faster than they can be drained by L0 -> LBase compactions (and subsequent compactions). This increases read amplification and latency across the board until the LSM tree returns to its more "natural" shape of having ~90% of its bytes in the bottommost level.

Here is a simple example of an LSM tree with multiple overlapping L0 SSTables:

L0  a..............r
      b..............y
              m........z
    a......f

L2    b.....g m..q r...z

L3  ...
...

Assume that newer L0 SSTables are shown above older ones.

This issue introduces a design that has already been prototyped and experimented with in #563. If L0 is organized into sublevels, where the regular level invariant holds within each sublevel (no key range overlaps), and where "newer" sublevels (defined as sublevels with higher indexes) shadow keys in older sublevels, we can reduce read amplification by using a levelIter for each sublevel. The above example could then be written as:

level sublevel  sstables
L0    2         a..............r
      1           b..............y
      0         a......f  m........z

L2                b.....g m..q r...z

L3  
...

This has reduced read amplification by 1, as a..f and m..z can occupy the same sublevel. L0's contribution to read amplification can therefore be defined as the maximum height of the stack of overlapping L0 SSTables (arranged in sublevels) at any given key. This will always be less than or equal to the number of sublevels.
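To make that definition concrete, here is a minimal sketch (assumed code, not Pebble's; it uses plain string user keys with inclusive bounds, whereas Pebble compares InternalKeys through a Comparer) that computes this maximum overlap depth with an endpoint sweep:

```go
package main

import (
	"fmt"
	"sort"
)

// sstable holds inclusive user key bounds (a simplification; Pebble
// uses InternalKey bounds and a configurable Comparer).
type sstable struct {
	smallest, largest string
}

// maxStackDepth returns the maximum number of overlapping L0 sstables
// at any user key, i.e. L0's contribution to read amplification.
func maxStackDepth(files []sstable) int {
	type event struct {
		key  string
		open bool
	}
	events := make([]event, 0, 2*len(files))
	for _, f := range files {
		events = append(events, event{f.smallest, true})
		// The largest bound is inclusive, so close just past it.
		events = append(events, event{f.largest + "\x00", false})
	}
	sort.Slice(events, func(i, j int) bool {
		if events[i].key != events[j].key {
			return events[i].key < events[j].key
		}
		return events[i].open && !events[j].open // opens sort before closes
	})
	depth, maxDepth := 0, 0
	for _, e := range events {
		if e.open {
			depth++
			if depth > maxDepth {
				maxDepth = depth
			}
		} else {
			depth--
		}
	}
	return maxDepth
}

func main() {
	// The example above: a..r, b..y, m..z, a..f.
	files := []sstable{{"a", "r"}, {"b", "y"}, {"m", "z"}, {"a", "f"}}
	fmt.Println(maxStackDepth(files)) // 3: equal to the number of sublevels here
}
```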

However, we still have an issue with wide SSTables and compactions. If the SST in sublevel 1 is chosen for compaction, we inevitably end up pulling all of L0 and LBase into the same compaction. To keep compactions smaller and improve their concurrency, we introduce flush split keys: at flush time, each SSTable is partitioned into multiple SSTables at certain user keys known as flush split keys. How these flush split keys can be picked is covered under Implementation design decisions.

In the above case, let's assume m and r are flush split keys. This results in the following LSM tree:


level sublevel  sstables
L0    2         a........l m..q rr
      1           b......l m..q r...y
      0         a......f   m..q r....z

L2                b.....g  m..q r....z

L3  
...

Such an LSM tree would have the same read amplification as before, while also allowing multiple smaller L0 -> LBase compactions to proceed in parallel. In particular, 3 parallel L0 -> LBase compactions can proceed in this example: all the L0 and L2 SSTables in the key range a..l, all the ones in m..q, and all the ones in r..z.

This vertical, rectangular shape of compaction is what we want to find, since it maximizes concurrency and effectiveness. Effectiveness (or compaction score) can be defined as the number of files reduced; the "tallest" key interval would therefore be the most effective one to compact. Consider this example:

level sublevel  sstables
L0    4                    m.o
      3                     n.q
      2         a........l m..q rr
      1           b......l m..q r...y
      0         a......f   m..q r....z

L2                b.....g  m..q r....z

L3  
...

Here, compacting all SSTs in m..q, starting from the m..o SST in sublevel 4 and proceeding downwards, would be the most effective choice while remaining narrow, since it reduces stack depth. This compaction could single-handedly eliminate sublevels 3 and 4, assuming it's the only operation taking place.

Implementation design decisions

Sublevels can be deduced from a set of L0 SSTs by iterating through them from oldest to newest and, for each file, finding the highest sublevel assigned to any overlapping L0 SST, then assigning the file that sublevel plus one (or sublevel 0 if nothing overlaps it).
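As an illustration, here is a quadratic but straightforward sketch of this assignment (hypothetical code, again over string user keys; the actual prototype avoids these repeated key comparisons via the interval indexes described next):

```go
package main

import "fmt"

type sstable struct {
	smallest, largest string // inclusive user key bounds
}

func overlaps(a, b sstable) bool {
	return a.smallest <= b.largest && b.smallest <= a.largest
}

// assignSublevels takes L0 files ordered oldest to newest and assigns
// each file one more than the highest sublevel of any older overlapping
// file, or sublevel 0 if nothing overlaps it.
func assignSublevels(filesOldestFirst []sstable) []int {
	sublevels := make([]int, len(filesOldestFirst))
	for i, f := range filesOldestFirst {
		sub := 0
		for j := 0; j < i; j++ {
			if overlaps(f, filesOldestFirst[j]) && sublevels[j]+1 > sub {
				sub = sublevels[j] + 1
			}
		}
		sublevels[i] = sub
	}
	return sublevels
}

func main() {
	// Oldest to newest, matching the motivating example:
	// a..f, m..z, b..y, a..r.
	files := []sstable{{"a", "f"}, {"m", "z"}, {"b", "y"}, {"a", "r"}}
	fmt.Println(assignSublevels(files)) // [0 0 1 2]
}
```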

To reduce the number of key comparisons, an ordered list of interval keys could be produced, as done in #563. These would be the union set of all SSTable boundary keys. Once this list is produced, all subsequent operations around SSTable boundary comparisons can work with index comparisons with respect to the ordered list, instead of performing more expensive key comparisons.
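A rough sketch of building that ordered list and the per-file indexes (the names and types here are illustrative, not the prototype's):

```go
package main

import (
	"fmt"
	"sort"
)

type sstable struct {
	smallest, largest string
}

// intervalIndexes returns the sorted, de-duplicated union of all file
// boundary keys, plus each file's start/end positions in that list, so
// that later passes can compare small integers instead of user keys.
func intervalIndexes(files []sstable) (keys []string, start, end []int) {
	seen := make(map[string]struct{})
	for _, f := range files {
		seen[f.smallest] = struct{}{}
		seen[f.largest] = struct{}{}
	}
	for k := range seen {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	index := make(map[string]int, len(keys))
	for i, k := range keys {
		index[k] = i
	}
	start = make([]int, len(files))
	end = make([]int, len(files))
	for i, f := range files {
		start[i], end[i] = index[f.smallest], index[f.largest]
	}
	return keys, start, end
}

func main() {
	files := []sstable{{"a", "r"}, {"b", "y"}, {"m", "z"}, {"a", "f"}}
	keys, start, end := intervalIndexes(files)
	fmt.Println(keys)       // [a b f m r y z]
	fmt.Println(start, end) // [0 1 3 0] [4 5 6 2]
}
```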

Compactions can be chosen by picking seed files out of tall intervals (intervals in which many SSTables overlap). For base compactions, we'd iterate downward (towards sublevel 0), picking all overlapping SSTables along the way; for intra-L0 compactions, we'd proceed in the other direction (towards the highest-indexed sublevel). These compactions can optionally be expanded in the other direction (higher-indexed sublevels for base compactions, lower-indexed ones for intra-L0) by progressively adding SSTables in that interval until we can't add one due to concurrent compactions.
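Below is a much-simplified sketch of the base compaction case, using the same toy string key bounds: it seeds on a key interval, sweeps sublevels from the top down, and widens the key range as files are added, so no older overlapping file is left behind in L0. Seed file selection heuristics, ongoing-compaction checks, and the expansion into LBase are all omitted:

```go
package main

import "fmt"

type sstable struct {
	smallest, largest string
	sublevel          int
}

// pickBaseCompaction seeds a base compaction on the key interval
// [seedLo, seedHi], sweeps sublevels from highest to lowest, and
// collects overlapping files, widening the range as it goes so that
// older overlapping files are always pulled in.
func pickBaseCompaction(files []sstable, seedLo, seedHi string) []sstable {
	maxSub := 0
	for _, f := range files {
		if f.sublevel > maxSub {
			maxSub = f.sublevel
		}
	}
	lo, hi := seedLo, seedHi
	var picked []sstable
	for sub := maxSub; sub >= 0; sub-- {
		for _, f := range files {
			if f.sublevel == sub && f.smallest <= hi && lo <= f.largest {
				picked = append(picked, f)
				if f.smallest < lo {
					lo = f.smallest
				}
				if f.largest > hi {
					hi = f.largest
				}
			}
		}
	}
	return picked
}

func main() {
	// The tall m..q column from the example above.
	files := []sstable{
		{"m", "o", 4}, {"n", "q", 3},
		{"a", "l", 2}, {"m", "q", 2}, {"r", "r", 2},
		{"b", "l", 1}, {"m", "q", 1}, {"r", "y", 1},
		{"a", "f", 0}, {"m", "q", 0}, {"r", "z", 0},
	}
	for _, f := range pickBaseCompaction(files, "m", "q") {
		fmt.Printf("sublevel %d: %s..%s\n", f.sublevel, f.smallest, f.largest)
	}
	// Prints the five m..q files, from sublevel 4 down to sublevel 0.
}
```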

Flush split keys could be chosen by iterating through interval keys in order, from lowest to highest, keeping track of the bytes in all files observed since the last flush split key (with an estimated byte count for any files we're only partially through), and publishing a split key whenever that running byte count exceeds a specified threshold.
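A minimal sketch of that scan, assuming a sorted slice of interval keys, per-interval byte estimates, and an arbitrary threshold (all names here are illustrative):

```go
package main

import "fmt"

// flushSplitKeys walks interval keys in order, accumulating the
// estimated bytes of the interval to each key's left, and emits a
// split key whenever the running total exceeds the threshold.
// intervalBytes[i] estimates the bytes between intervalKeys[i] and
// intervalKeys[i+1], so it has one fewer element than intervalKeys.
func flushSplitKeys(intervalKeys []string, intervalBytes []int64, threshold int64) []string {
	var splits []string
	var cum int64
	for i, k := range intervalKeys {
		if i > 0 {
			cum += intervalBytes[i-1]
		}
		if cum >= threshold {
			splits = append(splits, k)
			cum = 0 // reset the running count at each published split key
		}
	}
	return splits
}

func main() {
	keys := []string{"a", "f", "m", "r", "z"}
	bytes := []int64{40, 30, 60, 50} // estimated bytes per interval
	fmt.Println(flushSplitKeys(keys, bytes, 64)) // [m z]
}
```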

Currently the prototype in #563 recreates the entire sublevel data structure after any addition or deletion to the L0 tables. There's room for performance improvement by leveraging the map of added/deleted files in a BulkVersionEdit to modify the sublevel data structure instead of recreating it from scratch. These updates would be required:

  1. Adding/deleting from the set of ordered keys, then going around and updating all indexes in the data structure to the new indexes

  2. Updating all booleans denoting ongoing compactions, based on any new info about ongoing compactions.

  3. Recalculating all flush split keys. It would be more efficient to add and remove flush split keys on a one-off basis instead of recalculating them all from scratch; keeping past split keys stable also results in fewer SSTables straddling old split boundaries.

Remaining challenges

  • Integration with the rest of Pebble
  • Demonstrating a performance improvement under a practical heavy import workload
  • Manual testing
  • Unit testing
itsbilal self-assigned this Apr 3, 2020

itsbilal commented Apr 3, 2020

The writeup, outside of the motivation section, is incomplete. I want to expand the implementation section a lot more based on code reading and observations, but this is a start.


sumeerbhola commented Apr 4, 2020

Currently the prototype in #563 recreates the entire sublevel data structure after any addition or deletion to the L0 tables. There's room for performance improvement by leveraging the map of added/deleted files in a BulkVersionEdit to modify the sublevel data structure instead of recreating it from scratch.

I suggest not bothering with an L0SubLevels that can handle addition/removal of files, since it will add significant complexity -- the dense numbering of the intervals is critical to how the code is written. As noted in https://github.com/sumeerbhola/pebble/blob/sublevel/internal/manifest/l0_sublevels_test.go#L144-L148, with 609 files in L0 it took < 1ms to initialize. The TPCC import reached > 2000 files in L0, so we would of course need to benchmark with even higher numbers.
The main costs in initialization are (a) the sorting of the intervalKeys and (b) the allocation of slices. If needed:

  • (b) can be reduced with sync.Pool.
  • (a) can be reduced by maintaining a ref-counted sorted set of intervalKeys (ref-counted because more than one file may use the same endpoint): when files are added or removed, adjust this set incrementally while keeping it sorted. Assignment of the dense numbers is then a single pass through the current set. The key comparisons for finding the start and end interval of each file can also be eliminated using this data structure. (A sketch follows below.)
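Here is a rough sketch of such a ref-counted sorted set, assuming string keys (illustrative only; a production version might prefer a balanced tree to slice insertion for very large L0s):

```go
package main

import (
	"fmt"
	"sort"
)

// intervalKeySet keeps interval keys in sorted order, ref-counted so a
// key is only dropped when no remaining file uses it as an endpoint.
type intervalKeySet struct {
	keys []string
	refs map[string]int
}

func newIntervalKeySet() *intervalKeySet {
	return &intervalKeySet{refs: make(map[string]int)}
}

func (s *intervalKeySet) add(k string) {
	s.refs[k]++
	if s.refs[k] == 1 {
		// First reference: insert k at its sorted position.
		i := sort.SearchStrings(s.keys, k)
		s.keys = append(s.keys, "")
		copy(s.keys[i+1:], s.keys[i:])
		s.keys[i] = k
	}
}

func (s *intervalKeySet) remove(k string) {
	s.refs[k]--
	if s.refs[k] == 0 {
		delete(s.refs, k)
		i := sort.SearchStrings(s.keys, k)
		s.keys = append(s.keys[:i], s.keys[i+1:]...)
	}
}

func main() {
	s := newIntervalKeySet()
	s.add("m")
	s.add("a")
	s.add("m") // a second file sharing endpoint m
	s.add("z")
	s.remove("m")       // still referenced by one file
	fmt.Println(s.keys) // [a m z]
	s.remove("m")
	fmt.Println(s.keys) // [a z]
}
```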

Demonstrating a performance improvement under a practical heavy import workload

Do we have a list of additional workloads to try, beyond the 10 node TPCC import that I ran with? It would be easy to run them with the existing prototype.


itsbilal commented Apr 6, 2020

@sumeerbhola It would be possible to build a mapping of old interval indices to new interval indices, and then go around the data structure and update all intervals, right? That way you still maintain the density, and the rest of the algorithm (after the translation) still works as you'd expect.

The reason why I want to preserve the instance of the data structure is to prevent our view of "flush split keys" from changing too significantly every time there's a new revision. Correct me if my mental model is incorrect, but I can see a small "shift" (in any direction) in flush split keys being pretty detrimental, to the point of negating the purpose of having those keys in the first place; flushes after the shift would produce SSTs that bridge the gaps induced by the older split keys, greatly increasing the surface area of any compaction.

It would be preferable to add or remove keys on a one-off basis to maintain the target sizes, instead of recalculating it all from scratch without any preserved state of the past world.

petermattis commented:

The reason why I want to preserve the instance of the data structure is to prevent our view of "flush split keys" from changing too significantly every time there's a new revision. Correct me if my mental model is incorrect, but I can see a small "shift" (in any direction) in flush split keys being pretty detrimental, to the point of negating the purpose of having those keys in the first place; flushes after the shift would produce SSTs that bridge the gaps induced by the older split keys, greatly increasing the surface area of any compaction.

My mental model of how the flush split keys evolve is hazy. I can definitely see them shifting after a flush or L0 compaction, but I'm not picturing how significant that shifting will be. I think it is worthwhile to gather some evidence first as to whether this is a real problem or not. It is also possible that trying to incrementally maintain the flush split keys causes problems itself.

It would be preferable to add or remove keys on a one-off basis to maintain the target sizes, instead of recalculating it all from scratch without any preserved state of the past world.

You will have to recalculate the state from scratch when opening a DB.

Incremental maintenance of a data structure can be a lot more challenging than rebuilding it from scratch whenever there is a change in the underlying data; consider a balanced tree vs a sorted array. There be dragons in incremental maintenance, and since we'll need to be able to rebuild from scratch anyway, my preference would be to start with the rebuild-from-scratch approach, get it working and rock solid, and then add incremental maintenance if it proves necessary.


sumeerbhola commented Apr 7, 2020

It would be possible to build a mapping of old interval indices to new interval indices, and then go around the data structure and update all intervals, right? That way you still maintain the density, and the rest of the algorithm (after the translation) still works as you'd expect.

Yes. I was just giving that as an example -- there is more: e.g. each fileInterval has a subLevelAndFileList that will shift as the sublevels of files change (some files will fall down to lower sublevels). There is also the possibility of seqnum 0 files being added that overlap with existing files, which will change the sublevel structure from the bottom. The special case of adding a file with a higher sequence number than all current files, say due to a flush, may be doable, but I worry that even that will introduce (unnecessary) complexity.

Regarding stability of "flush split keys", I may be misunderstanding what you are suggesting, but an incremental data structure does nothing there, since it behaves as if it was constructed from scratch. Also, stability is easier to achieve by just using the same split keys until, say, N% of the files in L0 have changed -- i.e., keep a copy of the split keys somewhere. But can you try to construct an example where all files have non-coincident end points, to see what bad things happen when split keys keep shifting slightly? I am unable to see the problem with that (the biasing towards the triangular shape in compaction picking should prevent issues). Also, there is a tradeoff between stability and responding rapidly to a change in key distribution: a cold part of the interval space may suddenly start receiving a lot of data (we see that with ingestion).


itsbilal commented Apr 7, 2020

Yeah, the complexity will increase pretty substantially for every bit that'll need updating. I'll can the translation idea for now. Thanks for elaborating on the challenges with this approach, really appreciated. I'll also go ahead and update the issue description.

Regarding flush split key stability, I'll try to benchmark a scenario where this is problematic. The issue I saw with the current approach is that flush split keys are calculated in a left-to-right iteration with no incorporation of past split keys. We could go the route of preserving past split keys unless there's a significant change, and rebuild the rest of the data structure at every manifest revision anyway.

Either way this discussion doesn't block my work anytime soon. I'm going to polish up the current approach, but this is food for thought for later on.

Here's a simplified scenario I was thinking of. Assume minCompactionDepth = 3.

level sublevel  sstables
L0    3         a.....g h...n p......y
      2           c...g
      1         a........l m..p  rr
      0           b......l m...q r...y

L2                b.....g  m...q r....z

L3  
...

For simplicity, we can assume each sublevel depicted corresponds to one flush. (In practice, sstables h..n and p..y will go into sublevel 2, not 3, but that doesn't make a difference here.)

Note that the first two flushes had split keys m and r, so they give us the nice rectangular shape in sublevels 0 and 1. Then a flush comes around with keys only in c..g, introducing the successor of g (which I'll simplify to h) into our interval key space and making it a split key candidate for future flushes. As it turns out, the interpolated bytes before h are high enough to make it a flush split key, and m is no longer a split key since the cumulative byte count was reset at h, letting the flush at sublevel 3 overlap the boundary at m in lower sublevels. A similar pattern repeats around r.

Now, it's increasingly difficult to pick a base compaction of stack height 3 without incorporating most of L0 into it, negating the main benefit of flush split keys.

In a hypothetical scenario where you do account for past split keys, the algorithm would be smart enough to just introduce g/h without changing anything else, with the realization that h..m would have an estimated byte count lower than the target, but that's okay for now. h..m could potentially be merged with the next interval if it has too few bytes, and that's still okay since it's a one-off interval merge as opposed to a completely shifted view of flush split keys.

What do you think? Let me know if this is not worth worrying about. I'll try to see if it comes up in practice at all. Again, this discussion isn't a high priority at all, but it's something that popped up in my mind.

sumeerbhola commented:

with the realization that h..m would have an estimated byte count lower than the target, but that's okay for now

The trouble with such heuristics is that it gets tricky to tune things like the "okay for now" -- it can very easily get worse.

More importantly, this example is an L0 that can be represented as a "thin rectangle" -- the number of files is not significantly greater than the number of sublevels. And it is an easy "thin rectangle" since there are few files -- picking up 20 L0 files in an L0 compaction is still a fast enough L0=>Lbase compaction. My experimental observation is that sublevels, and the associated compaction algorithms, cause a backlog in L0 to get shaped into a "fat rectangle", say 20 sublevels and 1000 files, so finding compaction concurrency despite shifting split points is not typically a problem.

itsbilal commented:

I realized I never really linked PRs here, so here are some of the larger ones that are a part of this change:
#614 (merged)
#670 (merged)
#675 (in-progress)

Merging all of the above and compiling Cockroach against it with this patch (enabling sublevel compactions, flush splits, and sublevel-driven PreIngestDelay) lets us finish a TPCC stock table import in 1h15m - 1h20m consistently, with no issues. The same import takes ~6h with RocksDB on 20.1.


itsbilal commented Aug 7, 2020

#614, #670, and #675 have been merged, and sublevel compactions have been turned on in Cockroach as of cockroachdb/cockroach#50371.
