perf: support user-defined "guards" for sstable boundaries #517
Most operations involve a local and a global key, right? Every raft command updates LeaseAppliedState. (This doesn't change anything about the problem or your suggested solution.)
Yes. How much a change here would affect normal operation is a different question. Sstables below L0 are partitioned, and I would expect that partitioning to reduce the impact of wide-spanning tables. Partitioning L0 at the local/global boundary would allow concurrent L0->Lbase compactions in situations where they are currently blocked. This is something @sumeerbhola has mentioned recently. I think this is a fruitful area for exploration, but I'm tempering my hopes.
Add an `Options.TablePartitioner` hook which allows the user to specify a required partition between two user keys. `TablePartitioner` is called during flush and compaction before outputting a new key to an sstable which already contains at least one key. TODO: One complication from table partitioning is that it can create L0 tables which overlap in seqnum space. In order to support a partitioned L0, we'd have to relax the invariant checks in `manifest.CheckOrdering`. Doing so will make Pebble incompatible with RocksDB. In order for partitioning to not naively increase read amplification, we'll want to provide some sort of partitioned view of the sstables in `Version.Files[0]`. `mergingIter` will then need to be made aware of the partitioning. We may want to adjust the compaction-picking heuristics to not expand compaction inputs across the partition boundary. See #517
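A minimal sketch of the proposed hook's shape (the `TablePartitioner` name and its call sites come from the description above; the exact signature is an assumption):

```go
package pebble // hypothetical placement; this sketches the proposed API only

// TablePartitioner is the hook proposed above: it would be invoked
// during flush and compaction, just before writing nextUserKey to an
// output sstable that already contains prevUserKey (and possibly more
// keys). Returning true forces the current sstable to be finished so
// that nextUserKey starts a new one.
type TablePartitioner func(prevUserKey, nextUserKey []byte) bool
```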
Would it make sense to define guards at range boundaries, or are those too ephemeral? The problem of excessively wide sstables also applies to lower levels. Here's the LSM of a 'normal' node:
A little less than half of the data was ingested directly into L6. Here's the LSM of the node with higher write I/O:
Only 25 G were ingested into L6, forcing the rest of the ingestions into L5 and L4. More than 12 hours after the IMPORT finished, this node is still compacting to reshape the LSM.
Maybe user-defined guards aren't a solution for that scenario. We could try to dynamically detect these low-level sstables that are very broad relative to the distribution of keys in higher levels and split them up, anticipating that doing so will allow cheap move compactions.
The last time I saw something similar happen, the problem was an sstable containing an overly wide range tombstone ingested into L6 that was preventing future ingestions into L6. See cockroachdb/cockroach#44048.
I confirmed that there are range tombstones in L6, although I'm not sure how to confirm their breadth.
You can use …
Stepping back for a moment to the original motivation for this issue, now that we split flushes based on both L0 FlushSplitKeys and Lbase (grandparent) limits, I wonder why we need explicit guards. We control width well inside Pebble, at least in theory, but ingested sstables are outside our control. Should we provide split points to the ingestion logic to ensure it does not produce wide sstables?
I'm not at all sure explicit guards are necessary.
CRDB does split ingested sstables at the local/global keyspace boundary. That is hardcoded, though. It is almost like we want an abstraction on top of …
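For illustration, such an abstraction might look like the following sketch, which generalizes the hardcoded local/global split into arbitrary caller-supplied split points (all names here are hypothetical; this is not CRDB or Pebble code):

```go
package ingestsplit

import "bytes"

// splitKeysForIngest partitions a sorted slice of keys into per-sstable
// groups, closing a group each time the keys cross one of the sorted
// split points. CRDB's current behavior corresponds to a single split
// point at the local/global boundary.
func splitKeysForIngest(keys, splitPoints [][]byte) [][][]byte {
	var groups [][][]byte
	var cur [][]byte
	i := 0
	for _, k := range keys {
		// Close the current group each time we cross a split point.
		for i < len(splitPoints) && bytes.Compare(splitPoints[i], k) <= 0 {
			if len(cur) > 0 {
				groups = append(groups, cur)
				cur = nil
			}
			i++
		}
		cur = append(cur, k)
	}
	if len(cur) > 0 {
		groups = append(groups, cur)
	}
	return groups
}
```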
This is a good point. @jbowens did add logic recently to perform elision compactions to recompact L6 files that contain tombstones, so I think the mark-for-compaction bit is essentially done. It is possible we should also be prioritizing compactions of L6 files containing tombstones, because normally they would be lower priority than other compactions and thus get starved.
Good point. I dumped the manifest on that problematic node; L6 has a lot of range tombstones interspersed with live data.
When a range rebalances away, we take a snapshot for sending and write a range tombstone, right? Maybe we're too eager to compact these range tombstones that are incapable of reclaiming disk space because of open snapshots. The 'compensated sizes' for these sstables are weighted as if they'll drop everything, but in reality we'll just rewrite the sstables and possibly prevent cheaper delete-only compactions.

If we disincentivized these ineffectual compactions, we could cheaply drop some of the L6 sstables with delete-only compactions, but we'd still be left with the sstable containing the range tombstone. If it no longer overlaps with L6, it might be move-compacted into L6 with the broad range tombstone that blocks future ingestion into L6. Like @petermattis said, the L6 elision-only compactions are low priority right now and wouldn't trigger during an IMPORT where there's lots of automatic compaction.

Maybe this isn't a huge problem though, because if the bulk of the data was already dropped from L6, the min-overlapping-ratio heuristic will prioritize compacting into the smaller L6 range tombstone sstable as soon as there is any significant amount of overlapping data in L5. When that happens, the broad tombstone would be elided.
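For reference, the compensated-size idea mentioned above amounts to something like this sketch (parameter names and the exact formula are assumptions, not Pebble's actual implementation):

```go
package heuristics

// compensatedSize inflates a file's raw size by an estimate of the
// bytes its tombstones would reclaim, making tombstone-heavy files
// attractive to compact. As discussed above, this over-weights
// tombstones that cannot actually drop anything yet because open
// snapshots pin the underlying data.
func compensatedSize(fileSize, rangeDelReclaimEstimate, pointDelReclaimEstimate uint64) uint64 {
	return fileSize + rangeDelReclaimEstimate + pointDelReclaimEstimate
}
```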
When a range rebalances away, the leaseholder replica creates a Pebble snapshot and sends the data to the new replica using the snapshot. Only after the data has been sent is one of the existing replicas considered for deletion. I don't think the scenario you describe would happen exactly as described. But because Pebble snapshots are global, we could be seeing something like it due to snapshots being sent for other ranges on the same node.
I wonder if guards at SQL table boundaries might help with concurrent pre-ordered IMPORT performance. If there are no overlapping sstables in higher levels, ingested sstables slot directly into L6. I think this is typical behavior for the …

As I understand it, during an ordered IMPORT, most of the data ends up being ingested. But sometimes chunks of imported data are written through normal writes if the produced sstable is smaller than …

In this scenario, I think SQL table boundary guards would allow most ingested sstables to slot directly into L6. If the source data for a table is split into multiple files, you can still have normal writes that end up overlapping ingested sstables, but maybe in that world it's better to drop the …

Here's a code comment regarding that …
We have marked this issue as stale because it has been inactive for …
I think it's worth keeping open for now. I can imagine a few specific uses that might benefit from explicit guards but be challenging to handle with a generalized approach like #2366. |
When flushing and compacting sstables, Pebble needs to decide how to break up the resulting output into sstables. We currently use two signals: sstable size, and the boundaries of sstables in the "grandparent" level (i.e. the level below the output level). These signals are reasonable for randomly distributed writes, but can cause problems for skewed writes. Consider the following LSM structure:
L0 contains a single sstable spanning `[a,z]`. L1 contains 9 sstables. A compaction from L0 to L1 will need to include all of the L1 sstables. It is possible that L0 only contains 2 keys: `a` and `z`. This problem is easy to construct if `a` and `z` are written to the memtable near each other in time and then flushed, as flushing currently always produces a single sstable.

This situation appears to arise in practice within CRDB due to the presence of "local" vs "global" keys. Most operations involve only "global" keys. When a "local" key is operated on, it will end up generating an L0 sstable that spans a large swath of the global key space.
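For context, the two existing splitting signals described above combine roughly as in this sketch (illustrative names and logic, not Pebble's actual code):

```go
package compaction

// shouldFinishOutput sketches the current output-splitting decision:
// finish the current output sstable once it reaches the target size,
// or once it overlaps enough bytes in the grandparent level that a
// future compaction of this file would be expensive. Note that neither
// signal helps with the skewed-write case above, where a tiny L0
// sstable can still span nearly the entire key space.
func shouldFinishOutput(outputBytes, grandparentOverlapBytes, targetFileSize, maxGrandparentOverlapBytes uint64) bool {
	return outputBytes >= targetFileSize ||
		grandparentOverlapBytes >= maxGrandparentOverlapBytes
}
```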
The idea behind "guards" is to allow the user (i.e. CockroachDB) some control over sstable boundaries. CockroachDB would define guards at the boundary between the local and global keyspace. It may also define guards at the boundary between SQL tables. Such guards would ensure that we're segregating sstables along the guard boundaries so that an L0 sstable can't cover a huge fraction of the key space.
Note that "guards" would almost certainly be specified via a callback and not upfront. That is, we'd want to allow the user to specify a callback which can return true if there should be an sstable boundary between two user keys.
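A sketch of what that callback could look like, together with a CockroachDB-flavored guard at the local/global boundary (the `GuardFunc` name, `localMax`, and the boundary value are assumptions; `localMax` stands in for CRDB's `keys.LocalMax`):

```go
package pebble // hypothetical placement; sketching the callback shape only

import "bytes"

// GuardFunc follows the description above: it returns true if there
// should be an sstable boundary between adjacent user keys a and b.
type GuardFunc func(a, b []byte) bool

// localMax is an assumed stand-in for CRDB's keys.LocalMax: keys below
// it are "local", keys at or above it are "global".
var localMax = []byte{0x02}

// localGlobalGuard forces a boundary whenever two adjacent keys
// straddle the local/global split, so a single sstable can never span
// both keyspaces.
func localGlobalGuard(a, b []byte) bool {
	isLocalA := bytes.Compare(a, localMax) < 0
	isLocalB := bytes.Compare(b, localMax) < 0
	return isLocalA != isLocalB
}
```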
Jira issue: PEBBLE-178