perf: read-based compaction heuristic #29
The read-based compaction heuristic works by maintaining a per-sstable counter of the number of allowed seeks on the sstable before it will be compacted. When an sstable is first created, this counter is populated as:
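In LevelDB (see `version_set.cc`), this counter is derived from the file size: roughly one allowed seek per 16KB, with a floor of 100. A Go transcription of that calculation:

```go
package main

import "fmt"

// allowedSeeks mirrors LevelDB's heuristic: one seek is "paid for" by
// compacting roughly 16KB, so a file earns one allowed seek per 16KB of
// size, with a floor of 100 so tiny files aren't compacted too eagerly.
func allowedSeeks(fileSize uint64) int {
	seeks := int(fileSize / 16384)
	if seeks < 100 {
		seeks = 100
	}
	return seeks
}

func main() {
	fmt.Println(allowedSeeks(1 << 20))  // 1MB file: the floor kicks in -> 100
	fmt.Println(allowedSeeks(64 << 20)) // 64MB file -> 4096
}
```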
During iteration, LevelDB randomly samples the keys with a period of 1MB and updates the seek stats whenever there are 2 or more overlapping files for a particular key. Only the
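That byte-count-based sampling can be sketched as follows (the type and method names here are illustrative, not Pebble's or LevelDB's actual iterator API):

```go
package main

import "fmt"

// readSampler fires roughly once per samplePeriod bytes read, mimicking
// LevelDB's approximately-1MB sampling interval during iteration.
type readSampler struct {
	samplePeriod int64 // e.g. 1 << 20 for ~1MB
	bytesUntil   int64 // bytes remaining until the next sample
}

// maybeSample is called with the size of each key/value returned by the
// iterator; it reports whether this read should update the seek stats.
func (s *readSampler) maybeSample(entryBytes int64) bool {
	s.bytesUntil -= entryBytes
	if s.bytesUntil <= 0 {
		s.bytesUntil += s.samplePeriod
		return true
	}
	return false
}

func main() {
	s := &readSampler{samplePeriod: 1 << 20, bytesUntil: 1 << 20}
	samples := 0
	for i := 0; i < 10000; i++ { // 10,000 entries of 1KB: ~10MB read
		if s.maybeSample(1024) {
			samples++
		}
	}
	fmt.Println(samples) // one sample per full MB read
}
```

LevelDB randomizes the first sample point so samples aren't aligned across iterators; the fixed starting budget above is a simplification.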
By re-introduce, are you implying that RocksDB doesn't use this one (if so, any idea why not)? Or was this once present in Pebble and removed for some reason?
It was in LevelDB but dropped from RocksDB. I do not know why it was originally dropped. As a random guess, that comment reads as if it's all about disks and assumes reads/writes are dominated by seek cost, which isn't true anymore: nowadays writes wear out your flash and reads don't, so there's no clear crossover point where compacting becomes the clearly better choice. Periodically we considered reimplementing it but couldn't think of use cases that would benefit besides read-only benchmarks, so it was never prioritized.
@tbg To echo what @ajkr said, RocksDB dropped this heuristic from LevelDB. While Pebble was based on a Go port of LevelDB, it never had the heuristic (the port had not yet implemented it). I'd definitely want a read-based compaction heuristic to be motivated by a workload which would benefit from it. My guess is that a read-heavy workload would. Perhaps I'm wrong about that.
Results of some early YCSB-A with the changes from #968:
bench options:
As discussed in storage morning sync, I am going to try running YCSB-C and YCSB-E with a large set of data. When I ran YCSB-C initially, I used
@aadityasondhi It would be good to include the LSM metrics so we can see how many levels there are during these runs. Also, are you measuring how often the read-triggered compactions are being performed? I can't tell from these numbers whether they are happening at all. YCSB-A is 50% updates, 50% reads; I'd expect YCSB-C (100% reads) to provide the clearest signal of whether there is an improvement here. One thing that will be interesting is to track the ops/sec of YCSB-C over time, as the expectation is that ops/sec should increase as the compactions are performed. I usually dump raw numbers into a Google sheet for one-off analysis like this.
Running YCSB-C with larger sets of data uncovered a bug in the read compaction logic where there were some level mismatches causing the stats to be off. After fixing that, here are some of the results: Read compaction branch:
master branch:
benchstat comparison:
There seems to be a slight improvement in r-amp but a regression in other places. I am currently running a longer test with more data to hopefully produce more conclusive results. Will update this comment with the results from that.
That is quite a significant perf hit. Looking only at the final ops/sec value doesn't give a full picture. I'll reiterate that it will be useful to dump the per-second ops/sec output to a Google sheet and graph it for the full run. It is curious how many fewer sstables there are in your read-triggered branch vs the master branch runs: 467 vs 1635. It would be worthwhile to verify that the sampling you've added to iterators isn't having a negative impact on read performance. One way to do this is to leave the sampling in place, but to disable the read-triggered compaction suggestions.
Here is a Google sheet with the results of ops/sec over the 30min bench period: https://docs.google.com/spreadsheets/d/1UiYY68nEBOh12KDOOuiq2IoxLht5K5CUE3uTso4kI4E I was also suspecting the same thing regarding the sampling. I will do a quick test to see if reduced and/or no sampling improves the read performance. I am also setting up a GCE worker for the tests so that any local machine inconsistencies don't affect the results.
How long were the runs above? I'm assuming this is for the read-triggered compaction branch. You'll also want a run for the master branch as comparison.
Oh, definitely do this. Or use
They were 5 minutes; I am currently running the master branch for 30 minutes and will add the results to the sheet.
Update on progress/findings since the previous comments (for more visibility): I was running into a few issues trying to run the benchmark using
Initially, I suspected something was wrong with the way I was building the binary and/or setting up the tests. Tried a few different things with no luck. Jackson helped me debug this using ssh into one of the cluster nodes. Found this error message:
I ran the benchmark with the same data using the master build and it worked fine. Seems like a bug in the changes from #968. This error comes from: https://github.com/aadityasondhi/pebble/blob/read-triggered-compactions/compaction.go#L1131. I am going to add some debug statements to find what is causing this and why it was not happening when running the same benchmark locally earlier.
I was able to run some more benchmarks using roachtests with the following settings:
The results, however, are a little strange. I ran the same tests twice to make sure I was not making any errors in configuring them. master branch vs read compaction:
As suggested by @petermattis, I also ran another set of tests with sampling turned on but read compaction turned off to isolate any performance regression due to just the sampling portion:
Based on this, it does seem that the sampling itself causes some slowdown in ops/sec on a read-only workload (C). But performing read triggered compactions is slowing down the ops/sec in a mixed workload (E). We might need to find a balance between the sampling rate and threshold for triggering read compaction to minimize the ops/sec performance regressions. end state of master branch:
end state of read compaction branch:
I am not sure if my test duration is too low or if other factors are coming into play here. The increase in

Edit: https://docs.google.com/spreadsheets/d/1UiYY68nEBOh12KDOOuiq2IoxLht5K5CUE3uTso4kI4E/edit#gid=239494140
Do you mind adding the output of
I should have checked this earlier, but this shows a clearer regression due to sampling alone.
When you ran with just the sampling, did you avoid creating the

Do you know how many

I left some suggestions on #968, some of which might have an impact on performance. It's probably also a good idea to look at heap and memory profiles from a run. If your roachtest binary is recently built (and you have a very recently updated master), the
This would be highly desirable, given how many bytes are stuck in higher levels (and their high scores) in the ycsb/E read-compaction LSM
Thanks for the feedback @jbowens and @sumeerbhola. I have made some changes based on recommendations from @jbowens in #968, and can already see an improvement in ops/sec, r-amp, and w-amp on small-scale local tests. I am now running the roachtests on AWS machines to confirm. I will update here once I have the results.
Yeah I removed the part where the
This is very helpful, I will take it into account on my next few runs to find more of the allocation-related performance hits.
Re-ran the same tests as before after making some of the changes:
These changes led to a more expected result. master branch vs read compaction:
master branch vs sampling only:
end state of master branch:
end state of read compaction branch:
Chart comparing ops/sec over time: https://docs.google.com/spreadsheets/d/1UiYY68nEBOh12KDOOuiq2IoxLht5K5CUE3uTso4kI4E/edit#gid=239494140 There is an ops/sec regression in sampling only, but it is made up for once read compactions start to get triggered, as seen in the chart. As expected in workload C (read-only), the read compactions help reduce the r-amp, and over time the ops/sec also increases. However, in a more mixed workload there is still a performance regression, as seen in workload E. Overall, it seems to work as initially expected. I am going to start looking for more optimizations using memory profiles. I will also look into testing different sampling periods to help with the reduced ops/sec in mixed workloads.
Update: ran tests after the changes in 19c1850. There seems to be a big improvement in the performance regression due to sampling. Seems like most of the overhead introduced by sampling was due to the use of

master branch vs read compaction:
master branch vs sampling only branch:
master branch:
read compaction branch:
There is some extra w-amp in workload E, which seems odd since read compaction is the last type of compaction considered by
Results from all YCSB workloads. This suggests that most of the sampling-based perf drops have been solved by the latest change.

master branch vs read compaction branch
master branch vs sampling only
The read compaction comparison suggests that it may be useful to optimize the compaction trigger thresholds a little, possibly by reducing the frequency. This could help with the perf drop in workload B. In terms of the slight perf drop due to sampling, the
Updated test results with the latest commit using
Sampling seems to cause minimal overhead with this new change. The result to note is the ops/sec regression in workloads A and B for the read compactions. I will adjust the read compaction parameters to reduce their frequency, which might help with the perf there.
Yeah, the sampling overhead looks non-existent with the latest change. If there is significant write traffic, perhaps any queued read compactions should be dropped. Can that be done in such a way that if the write traffic stops and reads continue we'll eventually queue the active sstables for read compactions again?
As it is currently designed, read compactions are queued when the files reach 0 |
The other possibility (and a motivation to increase the default AllowedSeeks) is that read compactions could be getting scheduled in the short durations after a score-based compaction, or at times when all score-based candidates are already compacting. Then, while a read compaction runs, we're unable to do more score-based compactions until it finishes. It might be worth tracking the sizes of read-based compactions; if they're too large, that would also explain the ycsb/{A,B} slowness.
I see the problem now. There are times during a workload like A where read compactions do occur when there are no score-based compactions possible, either because those have already occurred and we fail early, or because there is a small window where there truly isn't a suitable score-based compaction. A possible solution I can think of is to track the reason why a score-based compaction was not possible. If it is because the scores on the levels are low enough to not need a compaction, we can try read compacting. But if it is because a compaction was already in progress for the picked file, we should try another score-based compaction rather than a read compaction.
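That decision could be structured roughly as follows (the names and result enum are hypothetical, not the actual compaction-picker code):

```go
package main

import "fmt"

// pickResult is a hypothetical reason code for why a score-based
// compaction was or was not picked.
type pickResult int

const (
	picked pickResult = iota
	noLevelNeedsCompaction     // all level scores are below the threshold
	candidateAlreadyCompacting // a candidate exists but is busy compacting
)

// maybeReadCompact falls back to a read-triggered compaction only when the
// LSM genuinely doesn't need a score-based compaction, not when one is
// merely blocked by an in-progress compaction.
func maybeReadCompact(r pickResult) bool {
	return r == noLevelNeedsCompaction
}

func main() {
	fmt.Println(maybeReadCompact(noLevelNeedsCompaction))     // true
	fmt.Println(maybeReadCompact(candidateAlreadyCompacting)) // false
}
```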
I did try reducing the threshold by 50%, which doubles the

Results from latest changes in
Another area to explore: don't do a read-based compaction if a flush is in progress. Perhaps even better would be to notice if a flush is likely to happen soon, though I don't have an immediate suggestion for how to accomplish that. |
Did a quick implementation of this to see how much of an impact it would make: f8fdd69. Looks like it is significant enough to make it worthwhile to explore the second suggestion as well, anticipating flushes and pausing read compactions. I will see if I can come up with something using the memtable sizes.
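The gating discussed here might look like this (the field names and the 90% "flush likely soon" threshold are illustrative assumptions, not the committed implementation):

```go
package main

import "fmt"

// shouldReadCompact sketches the proposed gating: skip read-triggered
// compactions while a flush is running, and also when the memtable is
// nearly full, on the theory that a flush is imminent and will soon need
// the compaction bandwidth.
func shouldReadCompact(flushing bool, memtableBytes, memtableSize uint64) bool {
	if flushing {
		return false
	}
	// Hypothetical threshold: treat a >90%-full memtable as "flush soon".
	return memtableBytes*10 < memtableSize*9
}

func main() {
	fmt.Println(shouldReadCompact(true, 0, 64<<20))       // false: flush running
	fmt.Println(shouldReadCompact(false, 60<<20, 64<<20)) // false: memtable ~94% full
	fmt.Println(shouldReadCompact(false, 10<<20, 64<<20)) // true
}
```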
Previously, our compactions were triggered on write based heuristics. This change introduces compactions based on high read activity to improve read performance in read-heavy workloads. It is inspired from LevelDB's read based compaction heuristic which is outlined in the following issue: cockroachdb#29. These compactions are triggered using an `AllowedSeeks` parameter on each file's `FileMetadata`. Reads are sampled based on a rate defined in `iterator.MaybeSampleRead()`. If a read is sampled, `AllowedSeeks` is decremented on the file in the top-most level containing the key. Once `AllowedSeeks` reaches 0, a compaction for the key range is scheduled. Read triggered compactions are only considered if no other compaction is possible at the time. This helps prioritize score based compactions to maintain a healthy LSM shape. The results of this change in benchmarks are outlined in the github issue: cockroachdb#29.
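The mechanism in that commit message can be sketched as follows (illustrative stand-ins for Pebble's actual `FileMetadata` and compaction queue):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// fileMetadata is a stand-in for the real FileMetadata: AllowedSeeks is
// decremented on each sampled read of the file.
type fileMetadata struct {
	name         string
	allowedSeeks int64
}

// sampleRead records one sampled read against the top-most file containing
// the key; it returns true exactly once, when the counter crosses zero,
// so the file is queued for a read-triggered compaction only once.
func sampleRead(f *fileMetadata) bool {
	return atomic.AddInt64(&f.allowedSeeks, -1) == 0
}

func main() {
	f := &fileMetadata{name: "000123.sst", allowedSeeks: 3}
	var queue []*fileMetadata
	for i := 0; i < 5; i++ { // more sampled reads than allowed seeks
		if sampleRead(f) {
			queue = append(queue, f)
		}
	}
	fmt.Println(len(queue)) // queued exactly once
}
```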
Results after the most recent PR:

1024: master vs read compactions
64: master vs read compactions
After the most recent changes, the benchmarks are aligning with what we would have expected to see from this project. 1024 master vs read compactions
64 master vs read compactions
Compared to the last benchmark, this helps remove all of the regression in the small value size tests. We see noticeable improvements across all read-heavy workloads.
This change enables read based compactions by default and sets the read sampling threshold. This is based on the findings in cockroachdb#29.
Re-introduce a read-based compaction heuristic. LevelDB had a heuristic to trigger a compaction of a table if that table was being read frequently and there were tables from multiple levels involved in the read. This essentially reduced read-amplification in read heavy workloads. A read-based compaction is only selected if a size-based compaction is not needed.
In addition to reducing read-amplification, a read compaction would also squash deletion tombstones. So we'd get a scan performance benefit in not needing to skip the deletion tombstones.