db: split physical sstables into virtual sstables at ingest time #1683
Comments
Regarding the experiment for other ingest scenarios: I ran the large import from https://docs.google.com/document/d/1fLDOfwVTWH6Ty5pmIpKuMOT-H2a-4-HK6I3Rk76rVlA/edit#heading=h.kgzoyzc5gcqj
I think it would also be okay to let the table stats recompute asynchronously. I still wonder about compaction heuristics and these 'thin' virtual sstables. Specifically, if a file is split many times by many ingested sstables, each of the resulting virtual sstables will be tiny. Under the min-overlapping-ratio heuristic, small sstables are less likely to be compacted downwards than other sstables, since they require more I/O to compact relative to the amount of data they're able to clear from the higher level. They're more likely to be compacted into. Looks like we'll need to keep around
We discussed this point, and there are cases where we might have to split more than one sst. I'm going to run the reproductions from cockroachdb/cockroach#80589 (comment) and determine what % of the ssts can be ingested with at most one split of the virtual sst.
What are these cases?
Let's say we're ingesting an sstable s1 with the keys a, c, e, g, and there's an sstable s2 in L5 with the keys b, d, f, h, so there is no data overlap. I believe there's no way to do exactly one split here and slide s1 into L5. Does that make sense? I might be forgetting about some invariant.
Yeah, that makes sense. Unfortunately, we don't have a way of cheaply determining that there's no data overlap in that case. See this comment: Lines 494 to 501 in 3bbd428
Maybe this could be improved with something fancy like computing the intersection of the files' bloom filters, but I'd guess that it's not worth the complexity. I speculate that in practice ingested files are typically either a contiguous section of the keyspace with no engine keys between the bounds (e.g., a sorted import or replica snapshot application), or a dispersed set of keys such that there are many engine keys within the bounds (e.g., an import with randomly distributed keys). In the latter case, even if we had a mechanism to cheaply detect the lack of data overlap, we probably still wouldn't want to fragment into such fine virtual sstables.
Yea, I think the simplest case, which I'm hoping is the most frequent, is that the file boundaries of the ingested sstable s1 overlap with no keys of some sstable s2, but do overlap with the file boundaries of s2. In this case, I believe it's always possible to split the sstable s2 into two, and slide s1 in between. I believe this case also implies that the file boundaries of sstable s1 must fit INSIDE the file boundaries of sstable s2. If s1 doesn't fit inside s2, then the file boundaries of s1 are guaranteed to overlap some data in s2. So, the case which we can deal with easily looks like, where the
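The fit-inside condition above can be sketched as a small predicate. This is illustrative only (the name `canSplitAndSlide` is not Pebble's API, and real Pebble compares user keys with a `Comparer` on `InternalKey` bounds rather than raw string ordering):

```go
package main

import "fmt"

// canSplitAndSlide reports whether the ingested sstable's bounds
// [ingLo, ingHi] fit strictly inside an existing file's bounds
// [fileLo, fileHi]. Per the discussion above, this is the precondition
// for splitting the existing file into exactly two virtual sstables and
// sliding the ingested file in between. Keys are plain strings here.
func canSplitAndSlide(ingLo, ingHi, fileLo, fileHi string) bool {
	return fileLo < ingLo && ingHi < fileHi
}

func main() {
	// s1 = [c, e] fits inside s2 = [b, h]: one split suffices.
	fmt.Println(canSplitAndSlide("c", "e", "b", "h")) // true
	// s1 = [a, e] pokes out of s2 = [b, h]: s1's bounds must overlap s2's data.
	fmt.Println(canSplitAndSlide("a", "e", "b", "h")) // false
}
```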
I've reworked this issue to reflect just the creation of virtual sstables at ingestion time. Other pieces of virtual sstable work are tracked in other issues (see #2336 for the former meta-issue listing them all out). |
Currently, if we identify boundary overlap in a level during ingest target level calculation, but no data overlap, we are forced to find a target level above the file we saw the overlap with (if we can't fall below it, such as if the existing file is in L6, which happens commonly). This change takes advantage of virtual sstables to split existing sstables into two virtual sstables when an ingested sstable would be able to go into the same level had the sstables been split that way to begin with. Doing this split reduces a lot of write-amp, as it saves us from having to compact the newly-ingested sstable with the sstable it boundary-overlapped with. Biggest part of cockroachdb#1683. First commit is cockroachdb#2538, which this shares a lot of logic with (mostly just the excise function).
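As a rough illustration (not Pebble's actual excise implementation), splitting an existing file around an ingested file's bounds can be sketched as follows. The real code must also tighten the resulting virtual bounds to keys actually present in the file, and handle range deletions and range keys:

```go
package main

import "fmt"

// boundsPair is a simplified stand-in for the key bounds carried by
// Pebble's FileMetadata.
type boundsPair struct {
	Smallest, Largest string
}

// exciseSketch splits an existing file with bounds file around an
// ingested file's bounds ing, returning the virtual sstables that remain
// (at most two: the part below ing and the part above it). This assumes
// the caller already verified there is no data overlap; real Pebble would
// also shrink each side to the nearest key actually present in the file.
func exciseSketch(file, ing boundsPair) []boundsPair {
	var out []boundsPair
	if file.Smallest < ing.Smallest {
		out = append(out, boundsPair{file.Smallest, ing.Smallest})
	}
	if ing.Largest < file.Largest {
		out = append(out, boundsPair{ing.Largest, file.Largest})
	}
	return out
}

func main() {
	// Existing L6 file [b, h]; ingesting [c, e] leaves two virtual halves,
	// and the ingested sstable can slide into L6 between them.
	fmt.Println(exciseSketch(boundsPair{"b", "h"}, boundsPair{"c", "e"}))
}
```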
Currently, if we identify boundary overlap in a level during ingest target level calculation, but no data overlap, we are forced to find a target level above the file we saw the overlap with (if we can't fall below it, such as if the existing file is in L6, which happens commonly). This change takes advantage of virtual sstables to split existing sstables into two virtual sstables when an ingested sstable would be able to go into the same level had the sstables been split that way to begin with. Doing this split reduces a lot of write-amp, as it saves us from having to compact the newly-ingested sstable with the sstable it boundary-overlapped with. Fixes cockroachdb#1683.
This issue was originally about the concept of virtual sstables in general, as well as the ingestion-time uses of them. Currently, this issue is about the creation of virtual sstables (which were implemented in #2288, #2352, and other PRs) from physical sstables at ingest time, when it is beneficial to do so to slot ingested sstables into lower levels. The original description of the issue, as well as the detailed technical description, is retained below the horizontal line.
See #2336 for the former meta-issue describing different parts of this project.
Virtual sstables separate the logical key bounds of an sstable from the physical key bounds of the underlying sstable. They allow multiple virtual ssts to share the same underlying physical sst. Virtual ssts are part of the disaggregated shared storage design cockroachdb/cockroach#70419 since they allow us to have ssts that conform to CockroachDB range bounds in lower levels of the LSM.
However, they can also be beneficial in the current Pebble setup, which does not share ssts across Pebble instances: in a customer issue (https://github.com/cockroachlabs/support/issues/1558) we noticed that a significant number of ssts were being ingested into L0, which caused Pebble overload. These ssts were due to CockroachDB snapshots (see cockroachdb/cockroach#80589 and cockroachdb/cockroach#80607). A reproduction revealed that the ingested snapshot ssts had no data overlap that would prevent them from being ingested into L6; all the overlap occurred due to file key bounds.
By introducing virtual ssts we can improve the level assignment for ingestion by:
Effect on DB state data-structures (Version, FileMetadata, VersionEdit ...)
VersionEdit uses FileNum (uint64) as a key. We could potentially have a pair of ints (FileNum, VirtualSSTNum) instead. Given that we only have 55 bits for key seqnums, using 64 bits for FileNums (even though it includes WALs etc.) is not necessary. We could use 56 bits for the physical sst and reserve 8 bits for splitting into virtual ssts. Each split would reserve half of the split bits for future splits (like numbering the nodes in a binary tree): 128-255 for one child and 0-127 for the other child and so on. This does mean that some virtual ssts can no longer be split, because we have run out of bits for the two child ssts. This limitation seems acceptable. FileMetadata would keep FileNum and the number of bits already used when this virtual sst was generated.
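The binary-tree numbering described above can be sketched as follows (illustrative only, not an adopted encoding): each split consumes one of the 8 reserved bits, giving one child the next unused bit set and the other child that bit clear.

```go
package main

import "fmt"

const splitBits = 8 // bits reserved in FileNum for virtual-sst numbering

// split derives the two children of a virtual sst numbered virtNum that
// was itself created using usedBits of the reserved bits. One child keeps
// virtNum (top unused bit clear), the other sets that bit, mirroring the
// 0-127 / 128-255 partition described above. ok is false once all bits
// are consumed and this virtual sst can no longer be split.
func split(virtNum uint8, usedBits int) (child0, child1 uint8, ok bool) {
	if usedBits >= splitBits {
		return 0, 0, false
	}
	bit := uint8(1) << (splitBits - 1 - usedBits)
	return virtNum, virtNum | bit, true
}

func main() {
	// First split of a physical sst: children 0 (range 0-127) and 128 (range 128-255).
	a, b, _ := split(0, 0)
	fmt.Println(a, b) // 0 128
	// Splitting child 128 again: children 128 and 192.
	c, d, _ := split(128, 1)
	fmt.Println(c, d) // 128 192
}
```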
Virtual ssts would be permitted only in levels L1-L6. There is some benefit in having virtual ssts in L0 since it allows for reducing the sublevels, but we will forego that benefit (it avoids having to do anything about SmallestSeqNum, as discussed below).
btree comparisons for the FileMetadata in each level use the smallest key (or the seqnum, for L0). We are not using virtual ssts in L0, so the smallest seqnum does not need to be changed. The smallest key will be dealt with below.
FileMetadata ref counting: FileMetadata.refs will need to be a pointer to refs so that the same refs can be shared across all virtual ssts with the same physical sst.
FileMetadata stats: There is Size and TableStats. These are used in metrics and in compaction heuristics (e.g. NumEntries, NumDeletions, PointDeletionsBytesEstimate, RangeDeletionsBytesEstimate). These don't need to be very accurate and we can do linear interpolation to split them based on the block count of the original sstable and the block numbers corresponding to the start and end of the virtual sst.
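The linear interpolation could look roughly like this (a sketch with assumed names, not Pebble's actual stat-extrapolation code), scaling each stat by the fraction of the physical sstable's blocks covered by the virtual sst:

```go
package main

import "fmt"

// interpolateStat scales a physical sstable's stat (e.g. Size or
// NumEntries) down to a virtual sst covering blocks
// [startBlock, endBlock) out of totalBlocks. These estimates only feed
// metrics and compaction heuristics, so approximation is acceptable.
func interpolateStat(stat, startBlock, endBlock, totalBlocks uint64) uint64 {
	if totalBlocks == 0 {
		return 0
	}
	return stat * (endBlock - startBlock) / totalBlocks
}

func main() {
	// A virtual sst spanning 25 of the 100 blocks of a 64 MiB physical file
	// is estimated at a quarter of its size.
	fmt.Println(interpolateStat(64<<20, 0, 25, 100)) // 16777216 (16 MiB)
}
```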
FileMetadata spans:
Iterators: Since the FileMetadata spans represent the virtual sst, the iterator tree is mostly unaffected. Only the leaf sst iterator needs tweaking: it would do some additional filtering based on Smallest and Largest. This filtering can be optimized away when the virtual sst is the same as the physical sst, and for blocks which are fully contained in the virtual sst (like we do now for iterator bounds).
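A toy version of that leaf-level filtering (names here are assumptions; Pebble's real sstable iterator works over blocks and InternalKeys, and skips the comparison in the optimized cases noted above):

```go
package main

import "fmt"

// virtualIter filters a physical sstable's keys down to a virtual sst's
// [smallest, largest] bounds. A real implementation would seek at the
// block level and elide the bounds check for blocks fully contained in
// the virtual sst, or when the virtual sst equals the physical one.
type virtualIter struct {
	physical          []string // stand-in for the physical sst's point keys
	smallest, largest string   // virtual sst bounds, inclusive
}

func (it *virtualIter) keys() []string {
	var out []string
	for _, k := range it.physical {
		if k >= it.smallest && k <= it.largest {
			out = append(out, k)
		}
	}
	return out
}

func main() {
	// A virtual sst [c, e] backed by a physical sst holding a, c, e, g.
	it := &virtualIter{physical: []string{"a", "c", "e", "g"}, smallest: "c", largest: "e"}
	fmt.Println(it.keys()) // [c e]
}
```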
Next steps: