
compaction: consider splitting heuristics to aid sequential ingests #1671

Closed
jbowens opened this issue Apr 27, 2022 · 3 comments

Comments

@jbowens (Collaborator) commented Apr 27, 2022

During a sequential import, CockroachDB ingests sstables with successively increasing key ranges. As compactions move the imported keys down the LSM, nothing prevents them from writing those keys into the same output sstables as keys beyond the import's keyspace. Sstables spanning the import keyspace this way force future ingested sstables into L0, since an ingest can only land at a level shallower than all existing overlapping data. One solution is an explicit guard (#517).
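To make the mechanism concrete, here's a minimal sketch (simplified Go; `fileMetadata`, `overlaps`, and `ingestTargetLevel` are hypothetical stand-ins, not Pebble's real ingestion code) of how overlapping sstables constrain an ingest's target level:

```go
package main

import "fmt"

type fileMetadata struct {
	smallest, largest string // user keys; Pebble's real metadata uses InternalKeys
}

// overlaps reports whether the key ranges [aLo, aHi] and [bLo, bHi] intersect.
func overlaps(aLo, aHi, bLo, bHi string) bool {
	return aLo <= bHi && bLo <= aHi
}

// ingestTargetLevel returns the deepest level an ingested sstable spanning
// [lo, hi] can be placed into: one level above the shallowest level that
// contains an overlapping file, since the ingest's keys are newest and must
// sit above any overlapping data.
func ingestTargetLevel(levels [7][]fileMetadata, lo, hi string) int {
	for level := 0; level < len(levels); level++ {
		for _, f := range levels[level] {
			if overlaps(lo, hi, f.smallest, f.largest) {
				if level == 0 {
					return 0
				}
				return level - 1
			}
		}
	}
	return len(levels) - 1 // no overlap anywhere: sink to the bottom
}

func main() {
	var levels [7][]fileMetadata
	// A recent compaction output in L0 spans past the import's frontier "k"...
	levels[0] = []fileMetadata{{smallest: "j", largest: "z"}}
	// ...so the next ingested sstable ["k", "l"] is forced into L0.
	fmt.Println(ingestTargetLevel(levels, "k", "l")) // prints 0
}
```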

I've also been wanting to experiment with a compaction output–splitting heuristic based on the distribution of sequence numbers in the compaction inputs. Ingests create large, dense swaths of keys with the same sequence number. We could adjust the splitting heuristic to watch for a streak of N keys sharing a sequence number, and split outputs early when the streak ends, as long as the current output is not too small. This would encourage splits at the right-hand side of ingested sstables, which then leave a gap in sstable boundaries for future sstables ingested as part of the same sequential import.
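As a hedged sketch of what that splitter might look like (names and thresholds are hypothetical, not Pebble's actual compaction code):

```go
package main

import "fmt"

// Tunables for the hypothetical heuristic (values are illustrative).
const (
	minStreakLen   = 1000    // N: same-seqnum keys needed before a streak "counts"
	minOutputBytes = 4 << 20 // don't split an output that is still too small
)

// seqnumStreakSplitter tracks runs of keys sharing a sequence number, the
// signature of a previously ingested sstable flowing through a compaction.
type seqnumStreakSplitter struct {
	prevSeqNum uint64
	streakLen  int
}

// shouldSplit is consulted once per key added to the current compaction
// output. It returns true when a long same-seqnum streak has just ended
// (i.e. we're at the right-hand edge of an ingested sstable's keys) and
// the current output file is large enough to close.
func (s *seqnumStreakSplitter) shouldSplit(seqNum uint64, outputBytes uint64) bool {
	streakJustEnded := s.streakLen >= minStreakLen && seqNum != s.prevSeqNum
	if seqNum == s.prevSeqNum {
		s.streakLen++
	} else {
		s.prevSeqNum = seqNum
		s.streakLen = 1
	}
	return streakJustEnded && outputBytes >= minOutputBytes
}

func main() {
	s := &seqnumStreakSplitter{}
	// Simulate 1500 keys from an ingested sstable (seqnum 42), then an
	// older key (seqnum 7): the splitter fires at the streak boundary.
	for i := 0; i < 1500; i++ {
		s.shouldSplit(42, uint64(i)*100)
	}
	fmt.Println(s.shouldSplit(7, 8<<20)) // prints true
}
```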

Once keys reach L6, their sequence numbers are zeroed, so this would have little effect there beyond encouraging splits between keys whose sequence numbers are preserved by open snapshots and keys whose sequence numbers are not.

@sumeerbhola (Collaborator) commented

If we did the virtual sstable thing, we would only need to worry about actual data overlap, so the problem described here would be solved too, yes?

@jbowens (Collaborator, Author) commented Apr 28, 2022

Yeah, this ingest problem would also be solved. The virtual sstable approach has some cost, e.g., potential space amplification from thin virtual tables backed by wide physical sstables. I'm not sure how much of a problem that will be in practice, but sstable-splitting heuristics, if effective, seem like they could help circumvent it.
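For illustration, a toy calculation of that space amplification risk (purely hypothetical accounting, not Pebble's): a thin virtual table keeps its entire wide backing file alive on disk:

```go
package main

import "fmt"

// backing pairs a physical sstable's on-disk size with the bytes still
// referenced by live virtual tables built on top of it.
type backing struct {
	fileSize    uint64 // bytes of the physical sstable on disk
	virtualSize uint64 // bytes referenced by live virtual tables
}

// spaceAmp returns on-disk bytes divided by logically live bytes.
func spaceAmp(backings []backing) float64 {
	var disk, live uint64
	for _, b := range backings {
		disk += b.fileSize
		live += b.virtualSize
	}
	return float64(disk) / float64(live)
}

func main() {
	// One 64 MiB physical file referenced by a single 1 MiB virtual table
	// yields 64x space amplification for that backing file.
	fmt.Println(spaceAmp([]backing{{fileSize: 64 << 20, virtualSize: 1 << 20}}))
}
```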

@jbowens (Collaborator, Author) commented May 31, 2023

Going to close this out since we're shipping virtual sstables in 23.2. If space amplification from virtual sstables turns out to be a problem in practice, it may be worth re-examining this then.

@jbowens closed this as not planned on May 31, 2023
@jbowens added this to Storage on Jun 4, 2024
@jbowens moved this to Done in Storage on Jun 4, 2024