db: incorporate range tombstones into compaction heuristics #319
Incorporating an adjustment factor based on range tombstones in the sstable size looks straightforward now that we have the groundwork in place.

The bigger task here is that Pebble does not currently use the sstable size in compaction heuristics. Currently we use a heuristic similar to RocksDB's.

I don't have a fully fledged idea of what should be done here. I think we want to add support for a compensated size calculation which takes both point deletions and range deletions into consideration. We then want to incorporate the compensated size into the compaction heuristic. It might be sufficient to mimic the RocksDB approach.

To ensure old sstables eventually get compacted, RocksDB introduced a TTL mechanism which ensures sstables get compacted on some cadence. While there might be independent utility in adding that mechanism (which CRDB has never enabled), I suspect we could also adjust the compensated size by the age of an sstable.
I'm going to dump some notes here as I get context. RocksDB's compensated size only adjusts the raw file size when the number of point deletions exceeds the number of other entries in the file. It's gated on this condition to avoid distorting the shape of the LSM through over-prioritizing compaction of deletions (rocksdb/db/version_set.cc#L2206-L2218). I also noticed RocksDB has a related facility in this area.
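For reference, here's that gated adjustment transliterated to Go. The shape and constants are recalled from RocksDB's version_set.cc, so treat this as approximate rather than a faithful port:

```go
// compensatedFileSize mirrors RocksDB's gated compensation: the raw file
// size is only inflated when point deletions make up at least half of the
// file's entries, to avoid distorting the shape of the LSM.
func compensatedFileSize(fileSize, numEntries, numDeletions, avgValueSize uint64) uint64 {
	// Extra weight RocksDB applies to deletions (kDeletionWeightOnCompaction,
	// recalled as 2; verify against version_set.cc).
	const deletionWeight = 2
	if numDeletions*2 < numEntries {
		return fileSize // gate: deletions are a minority, no compensation
	}
	return fileSize + (numDeletions*2-numEntries)*avgValueSize*deletionWeight
}
```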
Do we already have facilities to simulate workloads? When incorporating range deletion tombstones into our own 'compensated size,' there's a long axis of accuracy that we can move along. It'd be helpful to run workloads against iterations of different heuristics to evaluate them. If we don't already have something, this might be a good place to start work.
There are lots of comments in the RocksDB code about "the shape of the LSM". I suspect there is a bunch of knowledge that isn't captured in those comments. On the surface, it seems strange to have a hard barrier between using the compensated size adjustment and not. My instinct would be to have a gradual switchover. Perhaps the compensated size adjustment could be attenuated by the fraction of tombstones in the sstable.
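A minimal sketch of what that gradual switchover might look like, replacing the hard gate with a scale factor (all names here are hypothetical, not RocksDB's or Pebble's):

```go
// attenuatedCompensation scales the deletion compensation by the fraction
// of entries that are deletions, instead of gating on a hard threshold. A
// file that is mostly deletions gets nearly full compensation; a file with
// few deletions gets almost none.
func attenuatedCompensation(fileSize, numEntries, numDeletions, avgValueSize uint64) uint64 {
	if numEntries == 0 {
		return fileSize
	}
	deletionFraction := float64(numDeletions) / float64(numEntries)
	compensation := float64(numDeletions) * float64(avgValueSize) * deletionFraction
	return fileSize + uint64(compensation)
}
```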
We have the benchmark stuff in cmd/pebble.
We're looking to integrate sstable size into compaction heuristics, and relatedly, incorporate range tombstones to better prioritize reclamation of disk space. Our goals for a new heuristic are to:
My current plan is to mimic the RocksDB compensated size calculation.
The compensated size approximation has a lot of room for trading between accuracy and calculation cost. The most accurate estimate of the covered space for a range deletion would sum across all levels beneath the range deletion's sstable, using something like GetApproximateSizes on every overlapping file.

We could calculate this covered space/compensated size once when an sstable is created, but I'm unsure what to do with it between process starts. Can we persist it on disk in the table properties without RocksDB complaining? Would it be feasible to recalculate it in the background every time the database is opened? It doesn't seem like it, because when it's intertwined with the compaction picker, the first compaction would be blocked on an expensive traversal of the database. Alternatively, we could use the uncompensated file size until the compensated sizes have all been calculated. I think mimicking RocksDB is a reasonable starting point.
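To make the expensive end of that spectrum concrete, here's a rough sketch. The helpers `overlaps` and `estimateDiskUsage` are hypothetical stand-ins (the latter for GetApproximateSizes-style block interpolation):

```go
const numLevels = 7

type fileMetadata struct{ /* bounds, size, etc. elided */ }

// estimateDiskUsage approximates the bytes within [start, end) in this file
// by interpolating over its data blocks. Placeholder body.
func (f *fileMetadata) estimateDiskUsage(start, end []byte) uint64 { return 0 }

type version struct{ /* per-level file lists elided */ }

// overlaps returns the files in the given level overlapping [start, end).
// Placeholder body.
func (v *version) overlaps(level int, start, end []byte) []*fileMetadata { return nil }

// rangeDeletionCoveredSize sums the approximate bytes a range tombstone
// [start, end) covers across every level beneath its own sstable.
func rangeDeletionCoveredSize(v *version, tombstoneLevel int, start, end []byte) uint64 {
	var covered uint64
	for level := tombstoneLevel + 1; level < numLevels; level++ {
		for _, f := range v.overlaps(level, start, end) {
			covered += f.estimateDiskUsage(start, end)
		}
	}
	return covered
}
```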
RocksDB has a facility to compute in-memory stats about sstables at startup. I can't recall exactly the purpose or the name of this facility, but I know that it exists. I also can't recall if it does this calculation in the background or the foreground. Let me know if you need a further pointer.

We can definitely add additional table properties without RocksDB complaining. Something to think about in this area is that the table properties are immutable once the sstable is written.

There is some rationale for compacting range tombstones out of existence as soon as possible, as the presence of a range tombstone, even one that covers no data, can affect other operations such as ingestions. See cockroachdb/cockroach#44048 for an example of such badness. Or perhaps this is an argument that we compensate the sstable size for range tombstones by all of the data covered by the range, not just the data residing lower in the LSM.
SGTM
I believe what I was remembering was that startup stats facility.
It seems like the intention is to initially compact somewhat arbitrarily, favoring an arbitrary set of files in the lowest levels. As files are flushed and replaced through compaction, the proportion of files w/ compensated sizes increases.
Yeah, I noticed that too. I'm confused about the terminology here. Usually, L0 is considered the highest level while L6 is the lowest level. But I think the reverse is meant in this context.
I grabbed a couple of stores post-roachtest to evaluate strategies for calculating compensated sizes (still treating range and point deletions equally). I wrote a little program to calculate compensated sizes using a few strategies and output them per sstable: tpccbench.csv, clearrange.csv. The clearrange data is not particularly interesting just yet, since its deletions are almost exclusively range deletions sitting alone in their own tiny sstables, but I think it'll be a useful dataset when we start to handle range deletions distinctly from point deletions.
All four of the remaining strategies use a gradual switchover, scaled by the fraction of entries that are deletions, like you suggested:
Each strategy is followed by a column showing the average value size calculation for that strategy. Some takeaways that I had:
Overall, the arbitrary-overlapping-sstable and the own-file strategies look pretty reasonable, and like they do succeed in capturing the variation in value size. Always using an arbitrary overlapping sstable is simpler but may require an additional IO operation if stats for an overlapping sstable are not already in memory.
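The per-file math behind these strategies is the same; what varies is which file's properties feed it. A sketch, with `tableProps` standing in for the relevant sstable property fields:

```go
// tableProps is a stand-in for the sstable properties the strategies need.
type tableProps struct {
	RawValueSize uint64 // total uncompressed value bytes in the file
	NumEntries   uint64
	NumDeletions uint64
}

// avgValueSize estimates the size of the values a tombstone shadows, using
// the non-deletion entries of whichever file the strategy selects: the
// tombstone's own file, or an arbitrary overlapping file in a lower level.
func avgValueSize(p tableProps) uint64 {
	nonDeletions := p.NumEntries - p.NumDeletions
	if nonDeletions == 0 {
		return 0
	}
	return p.RawValueSize / nonDeletions
}
```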
TIL: Github will pretty-print csv files.

The wall of numbers is a bit intimidating, even when pretty-printed. If we take "All Overlap" as the ground truth, it would be interesting to compare the error of "Arbitrary Overlap" and "Own file". I see at least one case where they diverge significantly.

You could also exactly compute ground truth for analysis purposes by looking up every deletion tombstone in the overlapping sstables in lower levels and determining the value sizes.

The compensated-size metric is a known quantity since RocksDB is using it, but we should keep an open mind about other approaches. We'd like to prioritize compactions that reclaim disk space. The more disk space that is reclaimed the better. Tombstones are one indication of disk space being reclaimed (at the very least, if we compact the tombstones to the bottom level we'll reclaim the disk space for the tombstone itself). Another indication is records that are being overwritten, though we have no easy way to determine that. I'm just brainstorming here, and don't have any concrete ideas for alternatives.
Continuing with just brainstorming, I wonder how predictive the keyspace of an entry is of the likelihood it overwrites values down the line. For example, some SQL tables might be high churn and update-heavy, while others might only see inserts and deletes. With the 'Arbitrary overlap' strategy up above, we're looking at the overlapping lower-level tables to get signal on the higher-level tables' keys. Could we do the same with the likelihood of sets/merges overwriting previous values? Could we track the number of entries overwritten by an sstable's entries over their lifetime and stick it in a table property? Then if we want to try to get signal on the overwrites for compaction picking, we consult the overlapping lower-level files' properties?

Now that I'm thinking about this though, it doesn't seem desirable to compact down high-churn tables? It reclaims space in the short-term but would increase write amplification. If we wanted to ensure we reclaim space from sets & merges in a timely manner but keep write amplification down, I guess we'd want to tend towards compacting tables w/ high likelihood of overwrites but infrequent updates.
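A hand-wavy sketch of the property idea. The collector shape here is modeled loosely on RocksDB-style table property collectors, not Pebble's actual API, and the hard part (feeding `dropped` from the compaction iterator) is elided:

```go
import "strconv"

// droppedKeysCollector accumulates the number of lower-level entries this
// table's entries shadowed and dropped during the compaction that built it.
type droppedKeysCollector struct {
	dropped uint64 // incremented by the compaction iterator (elided)
}

func (c *droppedKeysCollector) Name() string { return "dropped-keys" }

// Finish writes the accumulated count into the table's user properties so
// future compaction picking can consult it.
func (c *droppedKeysCollector) Finish(userProps map[string]string) error {
	userProps[c.Name()] = strconv.FormatUint(c.dropped, 10)
	return nil
}
```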
That's an interesting idea. I don't see any fundamental reason it couldn't be done. A small nit is that records are never "overwritten". User keys are overwritten, but records are dropped. I think I'd keep a property on the number of dropped records, though that wouldn't capture records that are not dropped due to snapshots. Regardless, I think something could be done here. And I think your idea is good that we can use the existence of updated user keys at lower levels to inform compaction decisions at higher levels.
There are three types of amplification we want to minimize: write, read, and space. High-churn tables will see high space amplification if we're not compacting that key range. They could also see high read amplification. Write amplification when we're randomly updating existing keys is usually not that bad. The intuition here is that used space in the DB isn't growing, so compactions can generally just stay between a few levels. The harder case is randomly inserting new keys. The used space in the DB grows and compactions have to utilize all of the levels, while making sure read amplification doesn't get out of control.
I computed the 'ground truth' average values for deleted keys in the test store from above. To accommodate zero-value actual averages, I calculated 'relative percent differences' rather than percent errors. Relative percent difference varies from -2.00 to 2.00, with zero being no difference. Here's the wall of per-SST numbers: tpccbench.csv.

The 'Present' column tracks the number of times a deleted key was actually present beneath the tombstone, only counting SET and MERGE entries. A large number of the tombstones did not actually shadow a set or merge entry with the same key, probably because the keys were already dropped in higher-level compactions. Predictably, the lower the SST, the less likely its tombstones were to shadow entries. It seems like this would cause RocksDB to overcompensate low-level file sizes.

I was a little surprised by files like 023223 whose tombstones did shadow a non-negligible number of entries (330) but still had a ground truth average value size of 0. Maybe it's the result of interleaved secondary indexes? It also made me wonder if point tombstones in Cockroach are more likely for keys representing secondary indexes, because any update to an indexed column would require deleting the old index key and setting the new key.
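For clarity, the relative percent difference used above, under its common definition; unlike percent error it stays bounded within [-2, 2] even when the actual average is zero:

```go
import "math"

// relativePercentDifference computes 2*(a-b)/(|a|+|b|), which is 0 when the
// two values agree and saturates at ±2 when one of them is 0.
func relativePercentDifference(estimate, actual float64) float64 {
	denom := math.Abs(estimate) + math.Abs(actual)
	if denom == 0 {
		return 0 // both zero: no difference
	}
	return 2 * (estimate - actual) / denom
}
```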
Something else worth thinking about is how the RocksDB picker folds everything into a singular per-level score.

It seems like there could be instances where the picker decides not to pursue a compaction because the various inputs into the singular score balance out. For example, what if a level doesn't contain very many bytes but contains a range tombstone that deletes a broad swath of the level below? Comparing the compensated size ratios, the picker might decide to not pursue a compaction at all, leaving significant disk space unreclaimed. Maybe it's okay because the target ratios between levels create a tight-enough bound?
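A simplified picture of the concern, with illustrative names (not Pebble's actual picker code): each level is reduced to one number, so a large compensated adjustment can be masked by the level otherwise being under target.

```go
// levelScore reduces a level to a single number; the picker compacts the
// level with the highest score above some threshold (typically 1.0).
func levelScore(compensatedLevelSize, targetLevelSize uint64) float64 {
	return float64(compensatedLevelSize) / float64(targetLevelSize)
}

// A level with few bytes but a broad range tombstone may still score < 1,
// so no compaction is picked and the covered space below stays unreclaimed.
```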
I thought we didn't do interleaved indexes in TPCC by default. Did that change?
I double checked: interleaving is not enabled by default in our TPCC workload.
This is a good point. I don't have confidence that the last sentence is true. For point tombstones, we also have to worry about the performance impact they have on reads. Iterating through a large swath of point tombstones can be slow.

Adding another consideration: leaving significant disk space unreclaimed is primarily a problem if we're running low on disk space. Reclaiming quickly tends to match user expectations. For example, I might drop a table and expect the disk space to be reclaimed quickly so that I can import a large amount of data.
I am shocked by how many deletions are not actually deleting anything. Can you double check this result? Does your measurement only look at the sstables in the level immediately below, or in all lower levels?
Yeah, I'll double check. I at least intended it to look in all lower levels, but there could be bugs.
If it turns out your code is correct, it would be very interesting to understand where these "useless" deletions are coming from. In CRDB, in order to keep MVCC stats up to date, I'm pretty sure we only issue deletions after having first verified there is a key present. (This is inefficient, but that's a different issue.)
Here are some example keys that exist but have zero-length values:
I don't know much about the key format. Is it possible to infer the significance of these from the keys themselves, or would I need metadata about these tables? Maybe the indexes aren't interleaved but just happen to fall into the same sstable since they're adjacent in the keyspace?
(various related things from the discussion above) Adding to the brainstorming:
Can you elaborate on this? The keys exist, but the value is zero length? Or a deletion tombstone exists in a higher level sstable but no key exists in a lower level sstable?
The CRDB keys have the structure /Table/<table-id>/<index-id>/<indexed-columns>.

We'd need to know the table ID, which I think can change from import to import for TPCC. Index ID 1 indicates these are primary keys. We might be able to take a guess at the table given the number of indexed columns and their values. I'm not seeing any tables where I'd expect the value to be zero-length.
I did have a bug that was causing some keys to mistakenly be recorded with a value length of 0. I just fixed it and updated the gist. I still see large numbers of deletion tombstones that exist in higher-level sstables w/ no corresponding key in lower sstables. The lower the level of the tombstone, the less likely it is to find the key in an even lower level. I think that's consistent with the theory that these tombstones have already dropped the keys they're intended to delete in an earlier compaction.

I tried summing the average value size estimate times the number of tombstones whose keys were actually present below to get data in aggregate:

Total value sizes of all shadowed keys: 68 M
Thanks @sumeerbhola, that helped clarify my thinking a lot.
This is interesting. When we're doing a compaction, we could compute a compensated size for deletion tombstones that includes the size of the data that has already been dropped. Though if we expect ~1 record per deletion tombstone, incorporating the compensated size for already dropped data doesn't make sense.
The LSM max level count doesn't strictly limit read amplification, though. Compaction picking and compaction heuristics play a big role in how large L0 is, which can have a large impact on read amplification. The size of L0 can also affect space amplification, especially in scenarios where L0->Lbase compaction is starving Lbase->Lbase+1 compaction (#203).
Yeah, maybe we actually want to reduce the compensated size by the size of already-dropped data. If we estimate data dropped over the lifetime of a tombstone based on overlapping low-level file statistics, we could subtract out a file's own already-dropped statistic from the estimate. Then lower-leveled files would be compensated less, taking into account keys they already dropped as the entries descended the LSM.
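In sketch form, with both inputs being the hypothetical aggregates discussed above (an estimate from overlapping lower-level statistics, and the file's own dropped-keys property):

```go
// adjustedCompensation subtracts what a file's tombstones have already
// dropped on their way down the LSM from the estimate of what they cover,
// so lower-level files are compensated less.
func adjustedCompensation(estimatedCoveredBytes, alreadyDroppedBytes uint64) uint64 {
	if alreadyDroppedBytes >= estimatedCoveredBytes {
		return 0 // tombstones have likely already done their work
	}
	return estimatedCoveredBytes - alreadyDroppedBytes
}
```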
Some random notes on a few options for evaluating compaction heuristics. Ideally, we could evaluate heuristics on:
Trace / replay whole workloads

We could add facilities for recording / tracing storage-level operations to a log for replaying later. We could collect these logs from stores running various representative Cockroach workloads, and replay them against Pebble DBs with various heuristics.

The benefit to this approach is the comprehensiveness of the workload. Capturing realistic read queries would allow us to evaluate read amplification directly at the Pebble layer. It could include relative timing of operations, which would allow us to ensure that we run workloads at a reasonable pacing. It would also ensure that if a compaction heuristic requires tracking data (like dropped keys) even in the memtable, it has an opportunity to do so. Because it captures a comprehensive view of Pebble activity, it might be generally useful for debugging or testing.

The cost of this approach's comprehensive collection is that collecting all that data is expensive in both time and space. Writing all operations to a second, adjacent log would slow down the workload being traced, although it's not clear to me just how significantly. Unlike the WAL, this log can be flushed asynchronously. It's also expensive in terms of space, since it includes all operations, including reads and writes that might never make it to L0 (keys dropped while in the memtable). Here's a rough sketch of what it might look like in code: 284fa521

Replay deleted files

We could use a custom cleaner that archives deleted files instead of removing them.

This approach requires less space since it doesn't capture reads, and would have a lot less impact on the process running the original workload. It still captures a representative write workload, which will allow us to evaluate write and space amplification. It's also less invasive in Pebble's codebase. Since no information on representative reads is collected, we would need to rely on roachtests for evaluating read amplification. If we collect L0 files, there's no opportunity for heuristics to capture additional statistics (like dropped keys) in the memtable, which might be useful for making compaction decisions later.

Generate synthetic workloads

We could build off (or build something analogous to) the existing benchmark tooling.

This approach avoids any large space requirements since workloads are generated. Because writes go through the normal write path, it gives heuristics an opportunity to populate L0 sstables with any statistics they might need for compaction picking (eg, dropped keys).

The workloads we generate will likely be dissimilar to real-world Cockroach workloads. We'd probably need to focus more on specific pathological cases, and rely on roachtests for evaluating heuristics in the context of Cockroach.
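As a concrete strawman for the tracing approach, a trace record might carry something like the following; the fields and names are guesses, not what the linked sketch does:

```go
// traceOp is one recorded storage-level operation in the trace log.
type traceOp struct {
	Seq       uint64 // order in which the op started
	Kind      uint8  // set, merge, delete, delete-range, get, iter, ...
	Start     []byte // key, or start of a range/iteration bound
	End       []byte // delete-range / iteration bounds only
	Value     []byte // writes only
	WallNanos int64  // relative timing, for pacing the replay
}
```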
With the sublevels stuff now landing, this should be either the number of sublevels in L0, or the "read amplification" of L0 (the maximum number of populated sublevels across L0 intervals). See the sublevels work for details.
Sstables contain a file creation time, so a custom cleaner could use that to recover the original ordering of the archived files.

A challenge for any of these approaches is how to pace the replay or workload. Simply doing the operations at the same rate they originally arrived is probably not sufficient. Consider if you're trying to replay on a significantly slower machine. Or we make large improvements to compaction, so that it gives the appearance that we're replaying on a significantly faster machine.
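A sketch of the archiving idea. Pebble invokes a cleanup hook when it would otherwise delete an obsolete file; the one-method shape below is a simplified stand-in for that hook, not the real signature:

```go
import (
	"path/filepath"

	"github.com/cockroachdb/pebble/vfs"
)

// archiveCleaner moves obsolete sstables and WAL segments aside instead of
// deleting them, so the write workload can be replayed later against other
// compaction heuristics.
type archiveCleaner struct {
	archiveDir string
}

func (c archiveCleaner) Clean(fs vfs.FS, path string) error {
	return fs.Rename(path, filepath.Join(c.archiveDir, filepath.Base(path)))
}
```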
I recall seeing an academic paper about recording info from real workloads in order to power synthetic workloads. Ah, here we go: https://arxiv.org/pdf/1912.07172.pdf. On the other hand, perhaps focusing on pathological cases is ok.
When I was looking at the tracing approach, I was imagining that we could limit concurrency rather than rate. The tracer would write a record to the trace whenever an op started or ended. During replay, we could limit our progress through the trace by keeping the number of in-flight ops within the number that were in flight at that point in the trace. I'm not sure how that'd work with the ingestion approach though.
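A sketch of that pacing scheme, assuming each traced op records how many ops were in flight when it originally started (the `tracedOp` type and its field are hypothetical):

```go
import "sync"

// tracedOp is a hypothetical trace record; Inflight is the number of ops
// that were running when this op started in the original trace.
type tracedOp struct {
	Inflight int
}

// replay never lets more ops run concurrently than the trace observed at
// the corresponding point.
func replay(ops []tracedOp, run func(tracedOp)) {
	var wg sync.WaitGroup
	done := make(chan struct{}, len(ops))
	inflight := 0
	for _, op := range ops {
		// Wait for completions until we're back within the traced concurrency.
		for inflight > op.Inflight {
			<-done
			inflight--
		}
		inflight++
		wg.Add(1)
		go func(op tracedOp) {
			defer wg.Done()
			run(op)
			done <- struct{}{}
		}(op)
	}
	wg.Wait()
}
```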
I'm not sure if that is going to work, but I'm also not sure it isn't. I think coming up with a prototype would be useful. And if the prototype doesn't show us anything interesting, we have to be willing to throw it away and start afresh.
Thinking about this more, point and range tombstones also don't fit perfectly into the write amplification assumption. As a contrived example, you might have an sstable a consisting of just a single range tombstone that deletes an entire sstable b in the next level. The limited update model assumes that compacting a and b will suffer write amplification linear in their sizes, but in actuality both files just need to be dropped. They're breaking the lower bound, so it's not a problem like the space amplification is. But if they don't fit perfectly in either dimension of the limited update model's space/write tradeoff, maybe it makes sense to handle them outside of it entirely.
Mentioned in #48 (comment):
@ajkr and I both recall discussing this in the past and feel that something akin to the "compensated size" adjustment RocksDB performs for point deletions should be done for range tombstones. For a tombstone at Ln, we could estimate the number of bytes that tombstone covers in Ln+1 using something like RocksDB's GetApproximateSizes. This would be cheapish and would only need to be done when an sstable is created.