
db: split physical sstables into virtual sstables at ingest time #1683

Closed
sumeerbhola opened this issue May 4, 2022 · 8 comments

Labels: A-write-amp (potential to reduce write amplification)


sumeerbhola commented May 4, 2022

This issue was originally about the concept of virtual sstables in general, as well as their ingestion-time uses. Currently, this issue is about the creation of virtual sstables (implemented in #2288, #2352, and other PRs) from physical sstables at ingest time, when doing so is beneficial for slotting ingested sstables into lower levels. The original description of the issue, along with the detailed technical design, is retained below the horizontal line.

See #2336 for the former meta-issue describing different parts of this project.


Virtual sstables separate the logical key bounds of an sstable from the physical key bounds of the underlying sstable. They allow multiple virtual ssts to share the same underlying physical sst. Virtual ssts are part of the disaggregated shared storage design cockroachdb/cockroach#70419 since they allow us to have ssts that conform to CockroachDB range bounds in lower levels of the LSM.
However, they can also be beneficial in the current Pebble setup, which does not share ssts across Pebble instances: in a customer issue https://github.com/cockroachlabs/support/issues/1558 we noticed that a significant number of ssts were being ingested into L0, which caused Pebble overload. These ssts were due to CockroachDB snapshots (see cockroachdb/cockroach#80589 and cockroachdb/cockroach#80607). A reproduction revealed that the snapshot ssts had no data overlap that would prevent them from being ingested into L6; all of the overlap was due to file key bounds.

By introducing virtual ssts we can improve the level assignment for ingestion by:

  • Find the lowest level with data overlap, say i.
  • Ingest into i-1. If there is a file-boundary overlap, split the existing sst into 2 virtual ssts, leaving a gap between them that the ingested sst can fit into. The VersionEdit for the ingest would also include this split.
  • At most one sst is split per ingested sst. We can reduce the risk of accumulating many small virtual ssts by:
    • For CockroachDB's snapshot ingestion, there is one large sst (up to 512MB) and many tiny ones. We can choose to apply this splitting logic only to the large sst. It is ok for the tiny ssts to be ingested into L0.
    • Split only if the ingested sst is at least half the size of the sst being split. So if we have a smaller ingested sst, we will pick a higher level to split at (where the ssts are smaller). The lifetime of virtual ssts at a higher level is shorter, so there is lower risk of littering the LSM with long-lived small virtual ssts.
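The level-assignment steps above can be sketched as follows. This is a minimal, illustrative model, not Pebble's actual ingest code; `levelInfo` and `pickIngestTarget` are hypothetical names, and per-level overlap state is precomputed rather than derived from iterators:

```go
package main

import "fmt"

// levelInfo is illustrative state for one LSM level: whether the
// ingested sst has data overlap there, and the size of any
// file-boundary-overlapping sst (0 if none).
type levelInfo struct {
	dataOverlap  bool
	boundarySize int64
}

// pickIngestTarget sketches the proposed heuristic. Scanning from the
// top of the LSM down, the first level with data overlap caps the
// ingest at the level above it. At the target, a boundary-overlapping
// file is split only if the ingested sst is at least half its size;
// otherwise we retry one level higher, where files are smaller.
func pickIngestTarget(levels []levelInfo, ingestSize int64) (target int, split bool) {
	target = len(levels) - 1 // no data overlap anywhere: aim for L6
	for l := 1; l < len(levels); l++ {
		if levels[l].dataOverlap {
			target = l - 1
			break
		}
	}
	for target > 0 {
		bs := levels[target].boundarySize
		if bs == 0 {
			return target, false // slots in without a split
		}
		if 2*ingestSize >= bs {
			return target, true // half-size rule: split the existing sst
		}
		target--
	}
	return 0, false // fall back to L0
}

func main() {
	// Data overlap at L5, boundary overlap with a 10 MiB file at L4.
	levels := make([]levelInfo, 7)
	levels[5].dataOverlap = true
	levels[4].boundarySize = 10 << 20
	fmt.Println(pickIngestTarget(levels, 6<<20)) // 6 MiB ingest: large enough to split at L4
	fmt.Println(pickIngestTarget(levels, 2<<20)) // 2 MiB ingest: too small, moves up to L3
}
```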

Effect on DB state data-structures (Version, FileMetadata, VersionEdit ...)

  • VersionEdit uses FileNum (uint64) as a key. We could potentially have a pair of ints (FileNum, VirtualSSTNum) instead. Given that we only have 55 bits for key seqnums, using 64 bits for FileNums (even though it includes WALs etc.) is not necessary. We could use 56 bits for the physical sst and reserve 8 bits for splitting into virtual ssts. Each split would reserve half of the split bits for future splits (like numbering the nodes in a binary tree): 128-255 for one child and 0-127 for the other child and so on. This does mean that some virtual ssts can no longer be split, because we have run out of bits for the two child ssts. This limitation seems acceptable. FileMetadata would keep FileNum and the number of bits already used when this virtual sst was generated.
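The binary-tree numbering of the split bits can be sketched as below. This is a hypothetical encoding, not code from Pebble; `virtualNum`, `split`, and `encoded` are illustrative names for the 56+8 bit scheme described above:

```go
package main

import "fmt"

const splitBits = 8 // low bits reserved for splitting; the high 56 bits identify the physical sst

// virtualNum numbers a virtual sst like a node in a binary tree over
// the 8 split bits. bitsUsed tracks how many bits this sst's lineage
// has already consumed.
type virtualNum struct {
	physical uint64 // 56-bit physical sst number
	splitID  uint8  // position within the split tree
	bitsUsed uint8  // split bits consumed so far
}

// split carves off the next split bit: one child keeps the bit clear
// (0-127 on the first split), the other sets it (128-255). Once all 8
// bits are consumed, the sst can no longer be split.
func (v virtualNum) split() (lo, hi virtualNum, ok bool) {
	if v.bitsUsed == splitBits {
		return v, v, false
	}
	bit := uint8(1) << (splitBits - 1 - v.bitsUsed)
	lo = virtualNum{v.physical, v.splitID, v.bitsUsed + 1}
	hi = virtualNum{v.physical, v.splitID | bit, v.bitsUsed + 1}
	return lo, hi, true
}

// encoded packs the pair back into a single 64-bit VersionEdit key.
func (v virtualNum) encoded() uint64 { return v.physical<<splitBits | uint64(v.splitID) }

func main() {
	root := virtualNum{physical: 7}
	lo, hi, _ := root.split()
	fmt.Println(lo.splitID, hi.splitID, hi.encoded())
}
```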

  • Virtual ssts would be permitted only in levels L1-L6. There is some benefit in having virtual ssts in L0 since it allows for reducing the sublevels, but we will forego that benefit (it avoids having to do anything about SmallestSeqNum, as discussed below).

  • btree comparisons for the FileMetadata in each level use the smallest key, or the seqnum for L0. We are not using virtual ssts in L0, so the smallest seqnum does not need to change. The smallest key is dealt with below.

  • FileMetadata ref counting: FileMetadata.refs will need to become a pointer so that the same reference count can be shared across all virtual ssts backed by the same physical sst.
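A minimal sketch of that shared ref count, under the assumption that each FileMetadata points at a shared backing struct (`fileBacking` is an illustrative name, not Pebble's API):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// fileBacking holds state shared by every virtual sst carved from the
// same physical file.
type fileBacking struct {
	refs atomic.Int32
}

// fileMetadata points at the shared backing rather than owning a
// count, so ref/unref on any virtual sst updates the same counter.
type fileMetadata struct {
	backing *fileBacking
}

func (m *fileMetadata) ref()         { m.backing.refs.Add(1) }
func (m *fileMetadata) unref() int32 { return m.backing.refs.Add(-1) }

func main() {
	b := &fileBacking{}
	v1 := &fileMetadata{backing: b}
	v2 := &fileMetadata{backing: b} // second virtual sst, same physical file
	v1.ref()
	v2.ref()
	v1.unref()
	// The physical file is deletable only once the shared count is 0.
	fmt.Println(b.refs.Load())
}
```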

  • FileMetadata stats: There is Size and TableStats. These are used in metrics and in compaction heuristics (e.g. NumEntries, NumDeletions, PointDeletionsBytesEstimate, RangeDeletionsBytesEstimate). These don't need to be very accurate and we can do linear interpolation to split them based on the block count of the original sstable and the block numbers corresponding to the start and end of the virtual sst.
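The linear interpolation of stats by block position can be sketched as follows (`interpolateStat` is an illustrative name; block indices stand in for the block count bookkeeping described above):

```go
package main

import "fmt"

// interpolateStat apportions a whole-table statistic (Size,
// NumEntries, etc.) to a virtual sst covering data blocks
// [startBlock, endBlock) of a physical table with totalBlocks blocks.
func interpolateStat(stat, startBlock, endBlock, totalBlocks uint64) uint64 {
	if totalBlocks == 0 {
		return 0
	}
	return stat * (endBlock - startBlock) / totalBlocks
}

func main() {
	// A virtual sst spanning 5 of the physical sst's 10 blocks is
	// credited with half of the statistic.
	fmt.Println(interpolateStat(1000, 2, 7, 10))
}
```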

  • FileMetadata spans:

    • SmallestSeqNum, LargestSeqNum: We can keep these unchanged (i.e., those of the physical sst). They are only useful for L0, where we don't have virtual ssts.
    • {Smallest,Largest}x{_, PointKey, RangeKey}, HasPointKeys, HasRangeKeys: Since we already have a positioned iterator in overlapWithIterator to look for data overlap, we can precisely compute these for each virtual sst.
  • Iterators: Since the FileMetadata spans represent the virtual sst, the iterator tree is mostly unaffected; only the leaf sst iterator needs tweaking. The sst iterator would do some additional filtering based on Smallest and Largest. This filtering can be optimized away if the virtual sst is identical to the physical sst, and for blocks that are fully contained in the virtual sst (as we do now for iterator bounds).
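The leaf-level bounds filtering can be sketched as below. Here the "physical iterator" is just a sorted slice of keys, and `virtualIter` is an illustrative name; a real implementation would skip the comparisons when the virtual and physical bounds coincide, or when a block lies entirely within bounds:

```go
package main

import "fmt"

// virtualIter filters a physical sst's keys down to the virtual
// bounds [lower, upper], inclusive.
type virtualIter struct {
	keys         []string // the physical sst's keys, sorted
	lower, upper string   // virtual sst bounds
	i            int
}

// first positions at the smallest key >= lower.
func (it *virtualIter) first() (string, bool) {
	for it.i = 0; it.i < len(it.keys); it.i++ {
		if it.keys[it.i] >= it.lower {
			break
		}
	}
	return it.cur()
}

func (it *virtualIter) next() (string, bool) {
	it.i++
	return it.cur()
}

// cur returns the current key, stopping once it exceeds upper.
func (it *virtualIter) cur() (string, bool) {
	if it.i < len(it.keys) && it.keys[it.i] <= it.upper {
		return it.keys[it.i], true
	}
	return "", false
}

func main() {
	// Physical sst holds a-f; the virtual sst is bounded to [b, d].
	it := &virtualIter{keys: []string{"a", "b", "c", "d", "e", "f"}, lower: "b", upper: "d"}
	for k, ok := it.first(); ok; k, ok = it.next() {
		fmt.Print(k, " ")
	}
	fmt.Println()
}
```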

Next steps:

  • Experiment to see if this would help for other ingest scenarios: e.g. index backfill.
  • Prototype.
@sumeerbhola
Collaborator Author

Regarding the experiment for other ingest scenarios: I ran the large import from https://docs.google.com/document/d/1fLDOfwVTWH6Ty5pmIpKuMOT-H2a-4-HK6I3Rk76rVlA/edit#heading=h.kgzoyzc5gcqj
Everything is ingesting into L0 because of data overlap with L0 or Lbase, so virtual ssts will not help there.

I220505 13:33:24.625235 291 kv/kvserver/replica_raftstorage.go:1102 ⋮ [n1,s1,r328/1:‹/Table/104/1/"v{ElASP…-hElRM…}›,raft] 2760  Ingest (‹apply›): bytes:1.2 MiB, l0:1.2 MiB, details: (b:1.2 MiB il:0 dol:0 bl:4) 
I220505 13:33:34.738311 261 kv/kvserver/replica_raftstorage.go:1102 ⋮ [n1,s1,r281/5:‹/Table/104/1/"{tYiPQ…-uAVzX…}›,raft] 2858  Ingest (‹apply›): bytes:1.1 MiB, l0:1.1 MiB, details: (b:1.1 MiB il:0 dol:4 bl:4) 

@jbowens
Collaborator

jbowens commented May 6, 2022

FileMetadata stats: There is Size and TableStats. These are used in metrics and in compaction heuristics (e.g. NumEntries, NumDeletions, PointDeletionsBytesEstimate, RangeDeletionsBytesEstimate). These don't need to be very accurate and we can do linear interpolation to split them based on the block count of the original sstable and the block numbers corresponding to the start and end of the virtual sst.

I think it would also be okay to let the table stats asynchronously recompute the TableStats using the narrower bounds, which might naturally fall out of these new 'virtual sstables' being included in the ingest version edit passed to updateTableStatsLocked. This doesn't address the Size though.

I still wonder about compaction heuristics and these 'thin' virtual sstables. Specifically, if a file is split many times by many ingested sstables, each of these virtual sstables will be tiny. Under the min-overlapping-ratio heuristic, small sstables are less likely than others to be compacted downwards, since they require more I/O to compact relative to the amount of data they're able to clear from the higher level. They're more likely to be compacted into.

Looks like we'll need to keep around keyspan.Truncate after all for filtering out rangedels/rangekeys outside virtual sstable bounds.

@bananabrick
Contributor

At most one sst is split per ingested sst.

We discussed this point, and there are cases where we might have to split more than one sst. I'm going to run the reproductions from cockroachdb/cockroach#80589 (comment) and determine what % of the ssts can be ingested with at most one split of the virtual sst.

@jbowens
Collaborator

jbowens commented Sep 13, 2022

We discussed this point, and there are cases where we might have to split more than one sst

What are these cases?

@bananabrick
Contributor

@jbowens

Let's say we're ingesting an sstable s1 with keys a,c,e,g, and there's an sstable s2 in L5 with keys b,d,f,h, so there is no data overlap. I believe there's no way to do exactly one split here and slide s1 into L5.

Does that make sense? I might be forgetting about some invariant.

@jbowens
Collaborator

jbowens commented Sep 14, 2022

Yeah, that makes sense. Unfortunately, we don't have a way of cheaply determining that there's no data overlap in that case. See this comment:

pebble/ingest.go

Lines 494 to 501 in 3bbd428

// It is worth noting that the check for data overlap is only approximate. In
// the previous example, the ingested table [d-e] could contain only the
// points 'd' and 'e', in which case the table would be eligible for
// considering lower levels. However, such a fine-grained check would need to
// be exhaustive (comparing points and ranges in both the ingested and
// existing tables) and such a check is prohibitively expensive. Thus Pebble treats any
// existing point that falls within the ingested table bounds as being "data
// overlap".

Maybe this could be improved with something fancy like computing the intersection of the files' bloom filters, but I'd guess that it's not worth the complexity.

I speculate that in practice ingested files are typically either a contiguous section of the keyspace with no engine keys between the bounds (e.g., a sorted import or replica snapshot application), or a dispersed set of keys such that there are many engine keys within the bounds (e.g., an import with randomly distributed keys). In the latter case, even if we had a mechanism to cheaply detect the lack of data overlap, we probably still wouldn't want to fragment into such fine virtual sstables.

@bananabrick
Contributor

bananabrick commented Sep 14, 2022

Yeah, I think the simplest case, which I'm hoping is the most frequent, is that the file boundaries of the ingested sstable s1 overlap no keys of some sstable s2, but do overlap with the file boundaries of s2.

In this case, I believe it's always possible to split the sstable s2 into 2, and slide s1 in between.

I believe this case also implies that the file boundaries of sstable s1 must fit INSIDE the file boundaries of sstable s2. If s1 doesn't fit inside s2, then the file boundaries of s1 are guaranteed to overlap some data in s2.

So the case we can deal with easily looks like this, where the dashes are the keys:

        |- - - s1 -----|
 |- -           s2               - - - --|
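That containment-plus-gap check can be sketched as below. `canSplitAround` is an illustrative name, and plain string keys stand in for Pebble's InternalKey comparisons; it returns true exactly when s1's bounds fall strictly inside s2's bounds with no s2 key in between:

```go
package main

import (
	"fmt"
	"sort"
)

// canSplitAround reports whether the ingested sst s1, with bounds
// [lo, hi], can slot into s2's level with a single split of s2:
// s1 must fit strictly inside s2's file boundaries, and no key of s2
// may lie within [lo, hi].
func canSplitAround(s2Keys []string, lo, hi string) bool {
	if len(s2Keys) == 0 {
		return false
	}
	if lo <= s2Keys[0] || hi >= s2Keys[len(s2Keys)-1] {
		return false // s1 does not fit strictly inside s2
	}
	// The first s2 key >= lo must be > hi, i.e. no data overlap.
	i := sort.SearchStrings(s2Keys, lo)
	return s2Keys[i] > hi
}

func main() {
	s2 := []string{"a", "d", "e", "z"}                            // existing sst's keys
	fmt.Println(canSplitAround(s2, "f", "g"))                     // fits in the gap between e and z
	fmt.Println(canSplitAround(s2, "c", "f"))                     // data overlap with d and e
	fmt.Println(canSplitAround([]string{"b", "d", "f", "h"}, "a", "g")) // the interleaved case above
}
```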

@bananabrick bananabrick self-assigned this Sep 14, 2022
bananabrick added a commit that referenced this issue Nov 22, 2022
@itsbilal itsbilal changed the title db: support virtual sstables that are narrower than the physical sstable db: split physical sstables into virtual sstables at ingest time May 23, 2023
@itsbilal itsbilal assigned itsbilal and unassigned bananabrick May 23, 2023
@itsbilal
Member

I've reworked this issue to reflect just the creation of virtual sstables at ingestion time. Other pieces of virtual sstable work are tracked in other issues (see #2336 for the former meta-issue listing them all out).

itsbilal added a commit to itsbilal/pebble that referenced this issue May 31, 2023
Currently, if we identify boundary overlap in a level
during ingest target level calculation, but no data overlap,
we are forced to find a target level above the file we saw
the overlap with (if we can't fall below it, such as if the
existing file is in L6, which happens commonly).

This change takes advantage of virtual sstables to split
existing sstables into two virtual sstables when an ingested
sstable would be able to go into the same level had the sstables
been split that way to begin with. Doing this split reduces a
lot of write-amp, as it saves us from having to compact the
newly-ingested sstable with the sstable it boundary-overlapped with.

Biggest part of cockroachdb#1683. First commit is cockroachdb#2538, which this shares
a lot of logic with (mostly just the excise function).
itsbilal added a commit to itsbilal/pebble that referenced this issue Aug 10, 2023
itsbilal added a commit to itsbilal/pebble that referenced this issue Aug 10, 2023
itsbilal added a commit to itsbilal/pebble that referenced this issue Sep 19, 2023
Currently, if we identify boundary overlap in a level
during ingest target level calculation, but no data overlap,
we are forced to find a target level above the file we saw
the overlap with (if we can't fall below it, such as if the
existing file is in L6, which happens commonly).

This change takes advantage of virtual sstables to split
existing sstables into two virtual sstables when an ingested
sstable would be able to go into the same level had the sstables
been split that way to begin with. Doing this split reduces a
lot of write-amp, as it saves us from having to compact the
newly-ingested sstable with the sstable it boundary-overlapped with.

Fixes cockroachdb#1683.
itsbilal added a commit to itsbilal/pebble that referenced this issue Sep 26, 2023