
db: incorporate range tombstones into compaction heuristics #319

Closed

petermattis opened this issue Sep 30, 2019 · 30 comments

@petermattis
Collaborator

Mentioned in #48 (comment):

Note that CRDB also know marks any sstable containing a range deletion tombstone as requiring compaction, which will tend to result in range deletion tombstones quickly compacting down to the bottom of the LSM and thus freeing up space. Incorporating better heuristics around range deletion tombstones into compaction is worth investigating.

@ajkr and I both recall discussing this in the past and feel that something akin to the "compensated size" adjustment RocksDB performs for point deletions should be done for range tombstones. For a tombstone at Ln, we could estimate the number of bytes that tombstone covers in Ln+1 using something like RocksDB's GetApproximateSizes. This would be cheapish and would only need to be done when an sstable is created.
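
A minimal sketch of that idea, with hypothetical names (estimateSizeInNextLevel stands in for a GetApproximateSizes/EstimateDiskUsage-style calculation restricted to Ln+1):

    // Hypothetical sketch: when an sstable at Ln is written, estimate how many
    // bytes its range tombstones cover in Ln+1 and fold that into a
    // "compensated size" for the compaction picker to use.
    type rangeTombstone struct {
        start, end []byte
    }

    func compensatedSize(fileSize uint64, tombstones []rangeTombstone,
        estimateSizeInNextLevel func(start, end []byte) uint64) uint64 {
        size := fileSize
        for _, t := range tombstones {
            // Compacting this sstable should reclaim roughly the bytes the
            // tombstone covers in the level below.
            size += estimateSizeInNextLevel(t.start, t.end)
        }
        return size
    }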

@petermattis
Collaborator Author

Incorporating an adjustment factor based on range tombstones in the sstable size looks straightforward now that we have DB.EstimatedDiskUsage. The small wrinkle is that we'd want to find the estimated disk usage underneath the sstable in question, while EstimatedDiskUsage estimates the disk usage for a range across all levels.

The bigger task here is that Pebble does not currently use the sstable size in compaction heuristics. Currently we use a heuristic similar to RocksDB's kOldestSmallestSeqFirst heuristic, which chooses the sstable containing the smallest seqnum in the level to compact. This heuristic is beneficial in that it guarantees that, given continued load on a system, every sstable in a level will eventually get compacted. The downside is the lack of more nuanced prioritization. CockroachDB configures RocksDB to use the kMinOverlappingRatio heuristic, which chooses the sstable with the minimum overlapping ratio with the next level. That heuristic also uses a "compensated size" which adjusts the size of an sstable based on the number of deletions it contains.

I don't have a fully fledged idea of what should be done here. I think we want to add support for a compensated size calculation which takes both point deletions and range deletions into consideration. We then want to incorporate the compensated size in the compaction heuristic. It might be sufficient to mimic the kMinOverlappingRatio heuristic to start with. The kMinOverlappingRatio prioritizes compactions for which the size of the sstables in the input level is large in comparison to the size of the overlapping sstables in the output level. By incorporating range tombstones into a compensated size of the input sstables, we prioritize compacting those sstables. Note that if we don't incorporate range tombstones in the compensated size (as is currently done by RocksDB), we deprioritize compactions involving range tombstones because their overlap ratio tends to be large.
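
To make the shape concrete, here's an illustrative sketch (not Pebble's actual picker code) of a kMinOverlappingRatio-style choice using a compensated size for the input file; a smaller ratio means less overlap relative to the (compensated) input, i.e. lower write amplification and/or more reclaimable space:

    import "math"

    type candidate struct {
        compensatedSize uint64 // file size adjusted for point/range deletions
        overlappingSize uint64 // bytes overlapped in the next level
    }

    // pickMinOverlappingRatio returns the index of the candidate with the
    // smallest overlap-to-compensated-size ratio, or -1 if there are none.
    func pickMinOverlappingRatio(files []candidate) int {
        best, bestRatio := -1, math.MaxFloat64
        for i, f := range files {
            if f.compensatedSize == 0 {
                continue
            }
            ratio := float64(f.overlappingSize) / float64(f.compensatedSize)
            if ratio < bestRatio {
                best, bestRatio = i, ratio
            }
        }
        return best
    }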

To ensure old sstables eventually get compacted, RocksDB introduced a TTL mechanism which ensures sstables get compacted on some cadence. While there might be independent utility in adding that mechanism (which CRDB has never enabled), I suspect we could also adjust the compensated size by the age of an sstable.

@jbowens
Collaborator

jbowens commented Apr 13, 2020

I'm going to dump some notes here as I get context.

RocksDB's compensated size is only compensated when the number of point deletions exceeds the number of other entries in the file. It's gated on this condition to avoid distorting the shape of the LSM through over-prioritizing compaction of deletions (rocksdb/db/version_set.cc#L2206-L2218).
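
Paraphrased (not a verbatim port), the adjustment is roughly:

    // Rough shape of RocksDB's point-deletion compensation: it only kicks in
    // when deletions outnumber the other entries in the file, and it adds
    // (#deletions - #non-deletions) * average value size * 2.
    func rocksDBCompensatedSize(fileSize, numDeletions, numOther, avgValueSize uint64) uint64 {
        if numDeletions <= numOther {
            return fileSize
        }
        return fileSize + (numDeletions-numOther)*avgValueSize*2
    }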

I also noticed RocksDB has a files_size_error_margin for their ApproximateSize calculation. It permits skipping the inspection of the files at the edges of a range if those boundary files' sizes are less than an error margin of the sum of the completely contained files.

    // TODO(peter): Select the file within the level to compact. See the
    // kMinOverlappingRatio heuristic in RocksDB which chooses the file with the
    // minimum overlapping ratio with the next level. This minimizes write
    // amplification. We also want to compute a "compensated size" which adjusts
    // the size of a table based on the number of deletions it contains.
    //
    // We want to minimize write amplification, but also ensure that deletes
    // are propagated to the bottom level in a timely fashion so as to reclaim
    // disk space. A table's smallest sequence number provides a measure of its
    // age. The ratio of overlapping-bytes / table-size gives an indication of
    // write amplification (a smaller ratio is preferable).
    //
    // Simulate various workloads:
    // - Uniform random write
    // - Uniform random write+delete
    // - Skewed random write
    // - Skewed random write+delete
    // - Sequential write
    // - Sequential write+delete (queue)
Do we already have facilities to simulate workloads? When incorporating range deletion tombstones into our own 'compensated size,' there's a long axis of accuracy that we can move along. It'd be helpful to run workloads against iterations of different heuristics to evaluate. If we don't already have something, this might be a good place to start work.

@petermattis
Collaborator Author

RocksDB's compensated size is only compensated when the number of point deletions exceeds the number of other entries in the file. It's gated on this condition to avoid distorting the shape of the LSM through over-prioritizing compaction of deletions (rocksdb/db/version_set.cc#L2206-L2218).

There are lots of comments in the RocksDB code about "the shape of the LSM". I suspect there is a bunch of knowledge that isn't captured in those comments. On the surface, it seems strange to have a hard barrier between using the compensated size adjustment and not. My instinct would be to have a gradual switchover. Perhaps the compensated size adjustment could be attenuated by the fraction of tombstones in the sstable.
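
One way to read that suggestion, as a rough sketch (the exact scaling is up for grabs):

    // Gradual switchover: scale the compensation by the fraction of entries
    // that are deletions instead of gating on deletions > non-deletions.
    func gradualCompensatedSize(fileSize, numDeletions, numEntries, avgValueSize uint64) uint64 {
        if numEntries == 0 {
            return fileSize
        }
        deletionFraction := float64(numDeletions) / float64(numEntries)
        adjustment := deletionFraction * float64(numDeletions) * float64(avgValueSize) * 2
        return fileSize + uint64(adjustment)
    }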

Do we already have facilities to simulate workloads? When incorporating range deletion tombstones into our own 'compensated size,' there's a long axis of accuracy that we can move along. It'd be helpful to run workloads against iterations of different heuristics to evaluate. If we don't already have something, this might be a good place to start work.

We have the benchmark stuff in ./cmd/pebble. I was imagining extending that when the TODO comment above was written.

@jbowens
Collaborator

jbowens commented Apr 14, 2020

We're looking to integrate sstable size into compaction heuristics, and relatedly, incorporate range tombstones to better prioritize reclamation of disk space.

Our goals for a new heuristic are to:

  • reduce write amplification by prioritizing files with a lower overlapping ratio with the next level, similar to RocksDB's kMinOverlappingRatio.
  • reclaim disk space in a timely manner by prioritizing compaction of deletions, especially range deletions, which would otherwise be implicitly deprioritized by a pure reimplementation of kMinOverlappingRatio.
  • compact older sstables eventually by prioritizing older sstables.

My current plan is to mimic the kMinOverlappingRatio heuristic but add two things:

  1. adjust the compensated size calculation to also approximate the covered space of all the range deletions contained in the file
  2. scale the resulting ratio according to the age of the file.

The compensated size approximation has a lot of room for trading between accuracy and calculation cost.

The most accurate estimate of the covered space for a range deletion would sum across all levels beneath the range deletion's sstable, using *sstable.Reader's EstimatedDiskUsage method to sum the length of all data blocks overlapping with the range. Since no new entries can be inserted beneath the sstable, the covered space for an sstable's range deletions is roughly static for the lifetime of the sstable. It can only shrink as compactions among lower levels collapse entries.
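
A sketch of that summation, using stand-in types rather than Pebble's internal metadata (the estimate callback stands in for an EstimatedDiskUsage-style per-table calculation):

    import "bytes"

    type tableMeta struct {
        smallest, largest []byte // user key bounds of the table
    }

    // coveredBytesBelow sums the estimated bytes overlapping [start, end) in
    // every level beneath fileLevel.
    func coveredBytesBelow(levels [][]tableMeta, fileLevel int, start, end []byte,
        estimate func(t tableMeta, start, end []byte) uint64) uint64 {
        var total uint64
        for l := fileLevel + 1; l < len(levels); l++ {
            for _, t := range levels[l] {
                // Rough overlap check; real code would use the user key comparer.
                if bytes.Compare(t.smallest, end) < 0 && bytes.Compare(t.largest, start) >= 0 {
                    total += estimate(t, start, end)
                }
            }
        }
        return total
    }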

We could calculate this covered space/compensated size once when an sstable is created, but I'm unsure what to do with it between process starts. Can we persist it on disk in the table properties without RocksDB complaining? Would it be feasible to recalculate it in the background every time the database is opened? It doesn't seem like it, because when it's intertwined with compaction picker the first compaction would be blocked on an expensive traversal of the database. Alternatively, we could use the uncompensated file size until the compensated sizes have all been calculated.

I think mimicking kMinOverlappingRatio is a good place to start, and we can adapt it as our compensated size starts to account for range deletions.

@petermattis
Collaborator Author

We could calculate this covered space/compensated size once when an sstable is created, but I'm unsure what to do with it between process starts. Can we persist it on disk in the table properties without RocksDB complaining? Would it be feasible to recalculate it in the background every time the database is opened? It doesn't seem like it, because when it's intertwined with compaction picker the first compaction would be blocked on an expensive traversal of the database. Alternatively, we could use the uncompensated file size until the compensated sizes have all been calculated.

RocksDB has a facility to compute in-memory stats about sstables at startup. I can't recall exactly the purpose or the name of this facility, but I know that it exists. I also can't recall if it does this calculation in the background or the foreground. Let me know if you need a further pointer.

We can definitely add additional table properties without RocksDB complaining. Something to think about in this area is that the EstimatedDiskUsage underneath a range tombstone can change as compactions occur beneath the range tombstone. Does this matter? I'm honestly not sure.

There is some rationale for compacting range tombstones out of existence as soon as possible, since the presence of a range tombstone, even one that covers no data, can affect other operations such as ingestions. See cockroachdb/cockroach#44048 for an example of such badness. Or perhaps this is an argument that we compensate the sstable size for range tombstones by all of the data covered by the range, not just the data residing lower in the LSM.

I think mimicking kMinOverlappingRatio is a good place to start, and we can adapt it as our compensated size starts to account for range deletions.

SGTM

@petermattis
Collaborator Author

RocksDB has a facility to compute in-memory stats about sstables at startup. I can't recall exactly the purpose or the name of this facility, but I know that it exists. I also can't recall if it does this calculation in the background or the foreground. Let me know if you need a further pointer.

I believe what I was remembering was VersionSet::UpdateAccumulatedStats. This populates several fields of FileMetadata from table properties. AFAICT, it does this incrementally every time a new version is created.

@jbowens
Collaborator

jbowens commented Apr 14, 2020

VersionSet::UpdateAccumulatedStats restricts itself to only updating 20 lower-level files' stats, and then updates compensated sizes across all files. AFAICT, all of the files that exist at startup except the 20 chosen in the first call to UpdateAccumulatedStats have compensated sizes set exactly to their ordinary file size. Once the compensated size is computed for a file it's never recomputed, and a nonzero compensated size blocks ever loading the property metadata fields.

It seems like the intention is to initially compact somewhat arbitrarily, favoring picking an arbitrary set of files in the lowest levels. As files are flushed and replaced through compaction, the proportion of files w/ compensated sizes increases.

    // The reason for choosing files from lower-level instead of higher-level
    // is that such design is able to propagate the initialization from
    // lower-level to higher-level:  When the num_deletions of lower-level
    // files are updated, it will make the lower-level files have accurate
    // compensated_file_size, making lower-level to higher-level compaction
    // will be triggered, which creates higher-level files whose num_deletions
    // will be updated here.

@petermattis
Collaborator Author

AFAICT, all of the files that exist at startup except the 20 chosen in the first call to UpdateAccumulatedStats have compensated sizes set exactly to their ordinary file size. Once the compensated size is computed for a file it's never recomputed, and a nonzero compensated size blocks ever loading the property metadata fields.

Yeah, I noticed that too. I'm confused about the terminology here. Usually, L0 is considered the highest level while L6 is the lowest level. But I think the reverse is meant in this context.

@jbowens
Collaborator

jbowens commented Apr 23, 2020

I grabbed a couple of stores post-roachtest to evaluate strategies for calculating compensated sizes (still treating range and point deletions equally). I wrote a little program to calculate compensated sizes using a few strategies and output them per sstable: tpccbench.csv, clearrange.csv. The clearrange data is not particularly interesting just yet since its deletions are almost exclusively range deletions sitting in their own tiny sstables, but I think it'll be a useful dataset when we start to handle range deletions distinctly from point deletions.

  • The 'RocksDB' column shows the compensated size as RocksDB calculates it, only compensating if deletes exceed other entries and adding: (#deletions - #nondeletions) × accumulated average value size × 2.

All four of the remaining strategies use a gradual switchover, scaled by the fraction of entries that are deletions like you suggested:

  • The 'Global' column shows the compensated size using incrementally accumulated statistics across all sstables for average values (like RocksDB uses, but gradually scaled w/o a discrete switchover).

  • The 'Arbitrary Overlap' column shows the compensated size calculated with an average value size taken from an arbitrary, overlapping lower-level sstable.

  • The 'All Overlap' column shows the compensated size calculated with an average value size taken from all the overlapping sstables in the lowest level that contains overlapping sstables. This is mostly useful as a way of estimating the accuracy of the 'Arbitrary Overlap' and 'Own file' strategies.

  • The 'Own file' column shows the compensated size calculated with the average value size within the file whose compensated size is being calculated. It falls back to an arbitrary overlapping sstable if the table only contains deletions. All of the sstables with deletions in the clearrange test store contain only deletions, but none of the sstables in the tpccbench test store do.

Each strategy is followed by a column showing the average value size calculation for that strategy.
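
For concreteness, here's roughly how the average value size falls out of RocksDB-style table properties; tableProps is a stand-in struct for illustration, not Pebble's sstable.Properties:

    type tableProps struct {
        RawValueSize uint64 // total bytes of all values in the table
        NumEntries   uint64
        NumDeletions uint64
    }

    // avgValueSize returns the mean value size among non-deletion entries,
    // reporting ok=false when the table contains only deletions.
    func avgValueSize(p tableProps) (avg uint64, ok bool) {
        nonDeletions := p.NumEntries - p.NumDeletions
        if nonDeletions == 0 {
            return 0, false
        }
        return p.RawValueSize / nonDeletions, true
    }

    // ownFileAvg is the 'Own file' strategy: use the file's own average value
    // size, falling back to an arbitrary overlapping lower-level table when
    // the file contains only deletions.
    func ownFileAvg(own, arbitraryOverlap tableProps) uint64 {
        if avg, ok := avgValueSize(own); ok {
            return avg
        }
        avg, _ := avgValueSize(arbitraryOverlap)
        return avg
    }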

Some takeaways that I had:

  • The distribution of value sizes is pretty wide in practice, so the global accumulated average that RocksDB calculates is frequently off by large factors (2-25x).
  • The average value in an arbitrary overlapping sstable is usually very close to the average across all overlapping sstables.

Overall, the arbitrary-overlapping-sstable and the own-file strategies look pretty reasonable, and like they do succeed in capturing the variation in value size. Always using an arbitrary overlapping sstable is simpler but may require an additional IO operation if stats for an overlapping sstable are not already in-memory.

@petermattis
Collaborator Author

Overall, the arbitrary-overlapping-sstable and the own-file strategies look pretty reasonable, and like they do succeed in capturing the variation in value size. Always using an arbitrary overlapping sstable is simpler but may require an additional IO operation if stats for an overlapping sstable are not already in-memory.

TIL: Github will pretty-print csv files. The wall of numbers is a bit intimidating, even when pretty-printed. If we take "All Overlap" as the ground truth, it would be interesting to compare the error of "Arbitrary Overlap" and "Own file". I see at least one case (023264) where "Arbitrary Overlap" is significantly different from "All Overlap" and "Own file".

You could also exactly compute ground truth for analysis purposes by looking up every deletion tombstone in the overlapping sstables in lower levels and determining the value sizes.

The compensated-size metric is a known quantity since RocksDB is using it, but we should keep an open mind about other approaches. We'd like to prioritize compactions that reclaim disk space. The more disk space that is reclaimed the better. Tombstones are one indication of disk space being reclaimed (at the very least, if we compact the tombstones to the bottom level we'll reclaim the disk space for the tombstone itself). Another indication is records that are being overwritten, though we have no easy way to determine that. I'm just brainstorming here, and don't have any concrete ideas for alternatives.

@jbowens
Collaborator

jbowens commented Apr 23, 2020

Continuing with just brainstorming, I wonder how predictive the keyspace of an entry is of the likelihood that it overwrites values down the line. For example, some SQL tables might be high churn and update-heavy, while others might only see inserts and deletes. With the 'Arbitrary overlap' up above, we're looking at the overlapping lower-level tables to get signal on the higher-level tables' keys. Could we do the same with the likelihood of sets/merges overwriting previous values?

Could we track the number of entries overwritten by an sstable's entries over their lifetime and stick it in an overwritten_entries sstable property? Maybe compaction could count the number of overwritten values as it builds a new table, add it to the sum of all input files' overwritten_entries and write it to the output file's properties?

Then if we want to try to get signal on the overwrites for compaction picking, we consult the overlapping lower-level files' properties?
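
A hypothetical sketch of that bookkeeping (overwritten_entries and the counter names here are invented for illustration):

    // During a compaction, count entries dropped because a newer entry for the
    // same user key shadows them, and carry the inputs' accumulated counts
    // forward into the output table's properties.
    type overwriteCounter struct {
        inputOverwritten uint64 // sum of overwritten_entries across input tables
        droppedThisRun   uint64 // entries dropped by this compaction due to shadowing
    }

    func (c *overwriteCounter) onEntryDropped(shadowedByNewerEntry bool) {
        if shadowedByNewerEntry {
            c.droppedThisRun++
        }
    }

    // outputProperty is the value that would be written to the output
    // sstable's overwritten_entries user property.
    func (c *overwriteCounter) outputProperty() uint64 {
        return c.inputOverwritten + c.droppedThisRun
    }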

Now that I'm thinking about this though, it doesn't seem desirable to compact down high-churn tables? It reclaims space in the short-term but would increase write amplification. If we wanted to ensure we reclaim space from sets & merges timely but keep write amplification down, I guess we'd want to tend towards compacting tables w/ high likelihood of overwrites but infrequent updates.

@petermattis
Collaborator Author

Could we track the number of entries overwritten by an sstable's entries over their lifetime and stick it in a overwritten_entries sstable property? Maybe compaction could count the number of overwritten values as it builds a new table, add it to the sum of all input file's overwritten_entries and write it to the output file's properties?

Then if we want to try to get signal on the overwrites for compaction picking, we consult the overlapping lower-level files' properties?

That's an interesting idea. I don't see any fundamental reason it couldn't be done. A small nit is that records are never "overwritten". User keys are overwritten, but I think I'd keep a property on the number of dropped records, though that wouldn't capture records that are not dropped due to snapshots. Regardless, I think something could be done here. And I think your idea is good: we can use the existence of updated user keys at lower levels to inform compaction decisions at higher levels.

Now that I'm thinking about this though, it doesn't seem desirable to compact down high-churn tables? It reclaims space in the short-term but would increase write amplification. If we wanted to ensure we reclaim space from sets & merges timely but keep write amplification down, I guess we'd want to tend towards compacting tables w/ high likelihood of overwrites but infrequent updates.

There are three types of amplification we want to minimize: write, read, and space. High-churn tables will see high space amplification if we're not compacting that key range. They could also see high read amplification. Write amplification when we're randomly updating existing keys is usually not that bad. The intuition here is that used space in the DB isn't growing, so compactions can generally just stay between a few levels. The harder case is randomly inserting new keys. The used space in the DB grows and compactions have to utilize all of the levels, while making sure read amplification doesn't get out of control.

@jbowens
Collaborator

jbowens commented Apr 27, 2020

I computed the 'ground truth' average values for deleted keys in the test store from above. To accommodate zero-value actual averages, I calculated 'relative percent differences' rather than percent errors. Relative percent difference varies from -2.00 to 2.00, with zero being no difference.
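
For reference, relative percent difference here means the usual symmetric form (consistent with the ±2.00 range above):

    import "math"

    // relativePercentDifference stays within [-2, 2] and is defined even when
    // one of the two values is zero.
    func relativePercentDifference(estimate, actual float64) float64 {
        denom := math.Abs(estimate) + math.Abs(actual)
        if denom == 0 {
            return 0 // both values are zero: no difference
        }
        return 2 * (estimate - actual) / denom
    }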

Here's the wall of per-SST numbers: tpccbench.csv. The 'Present' column tracks the number of times a deleted key was actually present beneath the tombstone, only counting SET and MERGE entries.

A large number of the tombstones did not actually shadow a set or merge entry with the same key, probably because the keys were already dropped in higher-level compactions. Predictably, the lower the SST, the less likely its tombstones were to shadow entries. It seems like this would cause RocksDB to overcompensate low-level file sizes.

I was a little surprised by files like 023223 whose tombstones did shadow a non-negligible number of entries (330) but still had a ground truth average value size of 0. Maybe it's the result of interleaved secondary indexes? It also made me wonder if point tombstones in Cockroach are more likely for keys representing secondary indexes because any update to an indexed column would require deleting the old index key and setting the new key.

@jbowens
Collaborator

jbowens commented Apr 27, 2020

Something else that the rocksdb deletion_ratio_compaction_trigger PR has me thinking about is whether or not it's beneficial to lump various signals into a single compaction heuristic. We've talked about inflating compensated size for range deletions and scaling according to sstable age.

It seems like there could be instances where the picker decides not to pursue a compaction because the various inputs into the singular score balance out. For example, what if a level doesn't contain very many bytes but contains a range tombstone that deletes a broad swath of the level below? Comparing the compensated size ratios, the picker might decide to not pursue a compaction at all, leaving significant disk space unreclaimed. Maybe it's okay because the target ratios between levels creates a tight-enough bound?

@petermattis
Collaborator Author

I was a little surprised by files like 023223 whose tombstones did shadow a non-negligible number of entries (330) but still had a ground truth average value size of 0. Maybe it's the result of interleaved secondary indexes? It also made me wonder if point tombstones in Cockroach are more likely for keys representing secondary indexes because any update to an indexed column would require deleting the old index key and setting the new key.

I thought we didn't do interleaved indexes in TPCC by default. Did that change?

@petermattis
Collaborator Author

I thought we didn't do interleaved indexes in TPCC by default. Did that change?

I double checked: interleaving is not enabled by default in our TPCC workload.

It seems like there could be instances where the picker decides not to pursue a compaction because the various inputs into the singular score balance out. For example, what if a level doesn't contain very many bytes but contains a range tombstone that deletes a broad swath of the level below? Comparing the compensated size ratios, the picker might decide to not pursue a compaction at all, leaving significant disk space unreclaimed. Maybe it's okay because the target ratios between levels creates a tight-enough bound?

This is a good point. I don't have confidence that the last sentence is true. For point tombstones, we also have to worry about the performance impact they have on reads. Iterating through a large swath of point tombstones can be slow. Adding another consideration: leaving significant disk space unreclaimed is primarily a problem if we're running low on disk space. Reclaiming quickly tends to match user expectations. For example, I might drop a table and expect the disk space to be reclaimed quickly so that I can import a large amount of data.

@petermattis
Collaborator Author

Here's the wall of per-SST numbers: tpccbench.csv. The 'Present' column tracks the number of times a deleted key was actually present beneath the tombstone, only counting SET and MERGE entries.

I am shocked by how many deletions are not actually deleting anything. Can you double check this result? Does your measurement only look at the sstables in the level immediately below, or in all lower levels?

@jbowens
Collaborator

jbowens commented Apr 28, 2020

I am shocked by how many deletions are not actually deleting anything. Can you double check this result? Does your measurement only look at the sstables in the level immediately below, or in all lower levels?

Yeah, I'll double check. I at least intended it to look in all lower levels, but there could be bugs.

@petermattis
Collaborator Author

Yeah, I'll double check. I at least intended it to look in all lower levels, but there could be bugs.

If it turns out your code is correct, it would be very interesting to understand where these "useless" deletions are coming from. In CRDB, in order to keep MVCC stats up to date, I'm pretty sure we only issue deletions after having first verified there is a key present. (This is inefficient, but that's a different issue).

@jbowens
Collaborator

jbowens commented Apr 28, 2020

I double checked: interleaving is not enabled by default in our TPCC workload.

Here are some example keys that exist but have zero-length values:

  • /Table/61/1/674/57889/0/0.000000000,0
  • /Table/61/1/674/57889/0/1587573721.235130005,2
  • /Table/59/1/312/6/1210/0/1587575695.345387708,2

I don't know much about the key format. Is it possible to infer the significance of these from the keys themselves, or would I need metadata about these tables? Maybe the indexes aren't interleaved but just happen to fall into the same sstable since they're adjacent in the keyspace?

@sumeerbhola
Collaborator

The more disk space that is reclaimed the better. ... Another indication is records that are being overwritten, though we have no easy way to determine that. ... Now that I'm thinking about this though, it doesn't seem desirable to compact down high-churn tables? It reclaims space in the short-term but would increase write amplification. If we wanted to ensure we reclaim space from sets & merges timely but keep write amplification down, I guess we'd want to tend towards compacting tables w/ high likelihood of overwrites but infrequent updates. ... There are three types of amplification we want to minimize: write, read, and space. High-churn tables will see high space amplification if we're not compacting that key range. They could also see high read amplification.... It seems like there could be instances where the picker decides not to pursue a compaction because the various inputs into the singular score balance out. For example, what if a level doesn't contain very many bytes but contains a range tombstone that deletes a broad swath of the level below? Comparing the compensated size ratios, the picker might decide to not pursue a compaction at all, leaving significant disk space unreclaimed. Maybe it's okay because the target ratios between levels creates a tight-enough bound?... This is a good point. I don't have confidence that the last sentence is true. For point tombstones, we also have to worry about the performance impact they have on reads. Iterating through a large swath of point tombstones can be slow.

(various related things from the discussion above)

Adding to the brainstorming:

  • the LSM max level count, and target bytes per level (10x reduction from the level below), already constrain the read and space amplification respectively, under a limited model of updates. In this setting, L6-bytes < live bytes < 1.1 * L6-bytes (the lower bound represents all the keys in the levels above L6 being updates, and the upper bound represents all the keys being new inserts; with a 10x ratio the levels above L6 sum to roughly 0.11 * L6-bytes, hence the 1.1 factor). Compaction picking gives us control over the write amplification.
  • the limited update model is SET where each SET that updates a user key has the same value size as a previous SET for that user key. MERGE also works if one assumes it falls into one of two extremes: merge of two values will have a byte size equal to the sum (behaves like a SET that did an insert) or a byte size equal to each (behaves like a SET that did an update).
  • the problem we started with is that point and range tombstones don't fall into this clean model, because their byte size breaks the space amplification assumption. Adjusting for that fixes the space-amplification assumption. I don't think we need to worry about individual files that don't get compacted even though they have large tombstones if their level is below the target size -- i.e., I am agreeing with what Jackson said above "Maybe it's okay because the target ratios between levels creates a tight-enough bound?"
  • The last thing that Peter raised, "Iterating through a large swath of point tombstones can be slow," is potentially another break in the model related to read amplification: it is defined using a get of a user key -- we don't need to worry about swaths of point tombstones for that, since we can stop at the first tombstone. But for range scans it does matter, so if we leave more garbage in one part of the key space than another, our worst-case behavior will be worse than average. I am not sure how we can address this without potentially blowing up write amplification -- maybe something like read-directed compactions, i.e., don't worry about uneven distribution of garbage in the key space if it is not disproportionately affecting reads.

@petermattis
Collaborator Author

Here are some example keys that exist but have zero-length values:

Can you elaborate on this? The keys exist, but the value is zero length? Or a deletion tombstone exists in a higher level sstable but no key exists in a lower level sstable?

I don't know much about the key format. Is it possible to infer the significance of these from the keys themselves or would I need metadata about these tables. Maybe the indexes aren't interleaved but just happen to fall into the same sstable since they're adjacent in the keyspace?

The CRDB keys have the structure: /Table/<tableID>/<indexID>/<indexed-columns>/<mvcc-timestamp>

We'd need to know the table ID, which I think can change from import to import for TPCC. Index ID 1 indicates these are primary keys. We might be able to take a guess at the tables given the number of indexed columns and their values. I'm not seeing any tables where I'd expect the value to be zero-length.

@jbowens
Collaborator

jbowens commented Apr 28, 2020

I did have a bug that was causing some keys to mistakenly be recorded with a value length of 0. I just fixed it and updated the gist.

I still see large numbers of deletion tombstones that exist in higher level sstables w/ no corresponding key in lower sstables. The lower the level of the tombstone, the less likely it finds the key in an even lower level. I think that's consistent with the theory that these tombstones have already dropped the keys they're intended to delete in an earlier compaction.

I tried summing the average value size estimate times the number of tombstones whose keys were actually present below to get data in aggregate:

Total value sizes of all shadowed keys: 68 M
rocksdb: 22 M
arbitrary_overlap: 59 M
own_size: 54 M

the LSM max level count, and target bytes per level (10x reduction from the level below) already constrain the read and space amplification respectively, under a limited model of updates. In this setting, L6-bytes < live bytes < 1.1 * L6-bytes (the lower bound represents all the keys in the levels above L6 representing updates, and the upper bound represents all the keys being new inserts). Compaction picking gives us control over the write amplification.

Thanks @sumeerbhola, that helped clarify my thinking a lot.

@petermattis
Collaborator Author

I still see large numbers of deletion tombstones that exist in higher level sstables w/ no corresponding key in lower sstables. The lower the level of the tombstone, the less likely it finds the key in an even lower level. I think that's consistent with the theory that these tombstones have already dropped the keys they're intended to delete in an earlier compaction.

This is interesting. When we're doing a compaction, we could compute a compensated size for deletion tombstones that includes the size of the data that has already been dropped. Though if we expect ~1 record per deletion tombstone, incorporating the compensated size for already dropped data doesn't make sense.

the LSM max level count, and target bytes per level (10x reduction from the level below) already constrain the read and space amplification respectively, under a limited model of updates. In this setting, L6-bytes < live bytes < 1.1 * L6-bytes (the lower bound represents all the keys in the levels above L6 representing updates, and the upper bound represents all the keys being new inserts). Compaction picking gives us control over the write amplification.

Thanks @sumeerbhola, that helped clarify my thinking a lot.

The LSM max level count doesn't strictly limit read amplification, though. Compaction picking and compaction heuristics play a big role in how large L0 is which can have a large impact on read amplification. The size of L0 can also affect space amplification, especially in the scenarios where L0->Lbase compaction is starving Lbase->Lbase+1 compaction (#203).

@jbowens
Collaborator

jbowens commented Apr 29, 2020

This is interesting. When we're doing a compaction, we could compute a compensated size for deletion tombstones that includes the size of the data that has already been dropped. Though if we expect ~1 record per deletion tombstone, incorporating the compensated size for already dropped data doesn't make sense.

Yeah, maybe we actually want to reduce the compensated size by the size of already-dropped data. If we estimate data dropped over the lifetime of a tombstone based on overlapping low-level file statistics, we could subtract out a file's own already-dropped statistic from the estimate. Then lower-leveled files would be compensated less, taking into account keys they already dropped as the entries descended the LSM.

@jbowens
Collaborator

jbowens commented Apr 29, 2020

Some random notes on a few options for evaluating compaction heuristics:

Ideally, we could evaluate heuristics on:

  • Write amplification
    • We have existing data in Metrics.
  • Space amplification
    • At the end of a test, we could manually Compact(_, _) the entire key space and measure the change in the sum of all sstable sizes (see the sketch after this list).
    • Measuring the size of the database's active files at consistent points in the same workload can give us a relative measure of space amplification (eg, this heuristic amplified by x additional bytes), but not an amplification factor.
  • Read amplification
    • Point read amplification
      • How many files are in L0?
    • Range read amplification
      • The presence of tombstones may increase this.
      • We can measure per-operation latency if we include reads as a part of workload. Otherwise, we need to rely on roachtests.
      • We could measure block cache retrievals, but that would include reads from queries and compactions.
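
Here's a rough sketch of the end-of-test space measurement mentioned in the list above; the Compact and Metrics signatures are approximate and may not match the current API exactly:

    import "github.com/cockroachdb/pebble"

    // measureSpaceAmp compacts the whole key space and compares on-disk size
    // before and after; the "after" size approximates live data.
    func measureSpaceAmp(d *pebble.DB, start, end []byte) (float64, error) {
        before := totalLevelSize(d.Metrics())
        if err := d.Compact(start, end); err != nil {
            return 0, err
        }
        after := totalLevelSize(d.Metrics())
        return float64(before) / float64(after), nil
    }

    func totalLevelSize(m *pebble.Metrics) uint64 {
        var sum uint64
        for _, l := range m.Levels {
            sum += uint64(l.Size)
        }
        return sum
    }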

Trace / replay whole workloads

We could add facilities for recording / tracing storage-level operations to a log for replaying later. We could collect these logs from stores running various representative Cockroach workloads, and replay them against Pebble DBs with various heuristics.

The benefit of this approach is the comprehensiveness of the workload. Capturing realistic read queries would allow us to evaluate read amplification directly at the Pebble layer. It could include the relative timing of operations, which would allow us to run workloads at a reasonable pace. It would also ensure that if a compaction heuristic requires tracking data (like dropped keys) even in the memtable, there is an opportunity to do so. Because it captures a comprehensive view of Pebble activity, it might be generally useful for debugging or testing.

The cost of this approach's comprehensive collection is that collecting all that data is expensive both in time and space. Writing all operations to a second, adjacent log would slow down the workload being traced, although it's not clear to me just how significantly. Unlike the WAL, this log can be flushed asynchronously. It's also expensive in terms of space, since it includes all operations, including reads and writes that might never make it to L0 (keys dropped while in the memtable).

Here's a rough sketch of what it might look like in code: 284fa521

Replay deleted files

We could use a pebble.Cleaner implementation to capture obsolete files from a store running various representative Cockroach workloads. We could capture either log files or L0 files. We might be able to reconstruct pacing information through also installing a custom EventListener to listen for flushes and garner when files were introduced.

This approach requires less space since it doesn't capture reads and would have a lot less impact on the process running the original workload. It still captures a representative write workload, which will allow us to evaluate write and space amplification. It's also less invasive in Pebble's codebase.

Since no information on representative reads is collected, we would need to rely on roachtests for evaluating read amplification. If we collect L0 files, there's no opportunity for heuristics to capture additional statistics (like dropped keys) in the memtable, which might be useful for making compaction decisions later.
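
A hedged sketch of the Cleaner-based capture; the Cleaner interface shape (including whether FileType is exported from the pebble package) and the vfs helpers are from memory and may not match exactly:

    import (
        "github.com/cockroachdb/pebble"
        "github.com/cockroachdb/pebble/vfs"
    )

    // capturingCleaner copies obsolete files into a side directory for later
    // replay instead of just deleting them.
    type capturingCleaner struct {
        captureDir string
    }

    func (c capturingCleaner) Clean(fs vfs.FS, fileType pebble.FileType, path string) error {
        dst := fs.PathJoin(c.captureDir, fs.PathBase(path))
        if err := vfs.Copy(fs, path, dst); err != nil {
            return err
        }
        return fs.Remove(path)
    }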

Generate synthetic workloads

We could build off the pebble bench command (or build something analogous to it) to generate more workloads with new distributions.

This approach avoids any large space requirements since workloads are generated. Because writes go through the normal write path, it gives heuristics an opportunity to populate L0 sstables with any statistics it might need for compaction picking (eg, dropped keys).

The workloads we generate will likely be dissimilar to real-world Cockroach workloads. We'd probably need to focus more on specific pathological cases, and rely on roachtests for evaluating heuristics in the context of Cockroach.

@petermattis
Collaborator Author

Point read amplification
How many files are in L0?

With the sublevels stuff now landing, this should be either the number of sublevels in L0, or the "read amplification" of L0 (the maximum number of populated sublevels across L0 intervals). See L0Sublevels.ReadAmplification().

We could use a pebble.Cleaner implementation to capture obsolete files from a store running various representative Cockroach workloads. We could capture either log files or L0 files. We might be able to reconstruct pacing information through also installing a custom EventListener to listen for flushes and garner when files were introduced.

Sstables contain a file creation time, so a custom EventListener might not be needed. We also capture sstable creation time in the MANIFEST.

A challenge for any of these approaches is how to pace the replay or workload. Simply doing the operations at the same rate they originally arrived is probably not sufficient. Consider if you're trying to replay on a significantly slower machine. Or we make large improvements to compaction so that it gives the appearance that we're replaying on a significantly faster machine.

The workloads we generate will likely be dissimilar to real-world Cockroach workloads. We'd probably need to focus more on specific pathological cases, and rely on roachtests for evaluating heuristics in the context of Cockroach.

I recall seeing an academic paper about recording info from real workloads in order to power synthetic workloads. Ah, here we go: https://arxiv.org/pdf/1912.07172.pdf.

On the other hand, perhaps focusing on pathological cases is ok.

@jbowens
Collaborator

jbowens commented Apr 29, 2020

A challenge for any of these approaches is how to pace the replay or workload. Simply doing the operations at the same rate they originally arrived is probably not sufficient. Consider if you're trying to replay on a significantly slower machine. Or we make large improvements to compaction so that it gives the appearance that we're replaying on a significantly faster machine.

When I was looking at the tracing approach, I was imagining that we could limit concurrency rather than rate. The tracer would write a record to the trace whenever an op started or ended. During replay, we could limit our progress through the trace by keeping the number of concurrent ops within the number of concurrent ops at that point in the trace.

I'm not sure how that'd work with the ingestion approach though.
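
Roughly what I have in mind, as a sketch (run and wait abstract over actually issuing and waiting on replayed ops):

    // traceEvent marks the start or end of an operation in the recorded trace.
    type traceEvent struct {
        opStart bool
    }

    // replay paces itself by never allowing more in-flight ops than the trace
    // had in flight at the same point.
    func replay(events []traceEvent, run func(), wait func()) {
        tracedInFlight, replayInFlight := 0, 0
        for _, ev := range events {
            if !ev.opStart {
                tracedInFlight--
                continue
            }
            tracedInFlight++
            // Wait for some of our own ops to finish until we're allowed to
            // start another one.
            for replayInFlight >= tracedInFlight {
                wait()
                replayInFlight--
            }
            run()
            replayInFlight++
        }
    }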

@petermattis
Collaborator Author

When I was looking at the tracing approach, I was imagining that we could limit concurrency rather than rate. The tracer would write a record to the trace whenever an op started or ended. During replay, we could limit our progress through the trace by keeping the number of concurrent ops within the number of concurrent ops at that point in the trace.

I'm not sure if that is going to work, but I'm also not sure it isn't. I think coming up with a prototype would be useful. And if the prototype doesn't show us anything interesting, we have to be willing to throw it away and start afresh.

@jbowens
Collaborator

jbowens commented May 6, 2020

the problem we started with is that point and range tombstones don't fall into this clean model, because their byte size breaks the space amplification assumption. Adjusting for that fixes the space-amplification assumption. I don't think we need to worry about individual files that don't get compacted even though they have large tombstones if their level is below the target size -- i.e., I am agreeing with what Jackson said above "Maybe it's okay because the target ratios between levels creates a tight-enough bound?"

Thinking about this more, point and range tombstones also don't fit perfectly into the write amplification assumption. As a contrived example, you might have an sstable a consisting of just a single range tombstone that deletes an entire sstable b in the next level. The limited update model assumes that compacting a and b will suffer write amplification linear in their sizes, but in actuality both files just need to be dropped.

They're breaking the lower bound, so it's not a problem the way the space amplification is. But if they don't fit perfectly in either dimension of the limited update model's space/write tradeoff, maybe it makes sense to handle them outside of it entirely.

@jbowens jbowens closed this as completed Jun 25, 2020