
Change The Way Level Target And Compaction Score Are Calculated #10057

Closed
wants to merge 5 commits

Conversation

siying
Contributor

@siying siying commented May 25, 2022

Summary:
The current level targets for dynamic leveling have a problem: the target level sizes change dramatically after an L0->L1 compaction. When there are many L0 bytes, lower-level compactions are delayed, but they resume after the L0->L1 compaction finishes, so the expected write-amplification benefits might not be realized. The proposal here is to stop adjusting the level target sizes and instead adjust the score for each level to prioritize the levels that most need compaction.
Basic idea:
(1) The target level size isn't adjusted, but the score is. The reasoning is that with parallel compactions, holding compactions back might not be desirable, but we would like compactions to be scheduled from the level that needs them most. For example, if we have an extra-large L2, we would like all compactions to be scheduled as L2->L3 compactions rather than L4->L5. This gets complicated when a large L0->L1 compaction is going on: should we compact L2->L3 or L4->L5? So the proposal for that is:
(2) The score is calculated as actual level size / (target size + estimated bytes coming down from upper levels). The reasoning is that if a large amount of pending L0/L1 bytes is coming down, compacting L2->L3 might be more expensive: once the L0 bytes are compacted down to L2, the actual L2->L3 fanout would change dramatically. On the other hand, by the time those bytes reach L5, the impact on the L5->L6 fanout is much smaller. So when calculating the score, we adjust it by adding the estimated downward bytes to the target level size.
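To illustrate the scoring idea, here is a minimal sketch in C++ (not the actual RocksDB implementation; the function name and the way incoming bytes are estimated are simplified assumptions):

```cpp
#include <cstdint>
#include <vector>

// Minimal sketch of the proposed per-level compaction score:
//   score = actual level size / (target size + estimated bytes coming down).
// level_bytes[i]  : actual size of level i
// target_bytes[i] : unadjusted target size of level i
// For L0 the caller can pass target_bytes[0] = 0 so that all L0 bytes count
// as bytes that will eventually come down.
std::vector<double> ComputeScores(const std::vector<uint64_t>& level_bytes,
                                  const std::vector<uint64_t>& target_bytes) {
  std::vector<double> scores(level_bytes.size(), 0.0);
  uint64_t estimated_incoming = 0;  // bytes expected to arrive from upper levels
  for (size_t i = 1; i < level_bytes.size(); ++i) {
    // Bytes above the upper level's target are assumed to compact down eventually.
    if (level_bytes[i - 1] > target_bytes[i - 1]) {
      estimated_incoming += level_bytes[i - 1] - target_bytes[i - 1];
    }
    scores[i] = static_cast<double>(level_bytes[i]) /
                static_cast<double>(target_bytes[i] + estimated_incoming);
  }
  return scores;
}
```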

Test Plan:
Repurpose the VersionStorageInfoTest.MaxBytesForLevelDynamicWithLargeL0_* tests to cover this scenario.

@siying
Contributor Author

siying commented May 25, 2022

#9423 is one symptom for the problem.

@ajkr
Contributor

ajkr commented May 26, 2022

My understanding is this bundles two changes, which is fine assuming they're both desirable, but it would be helpful if that were explicitly stated (or corrected):

(1) Switching target level size adjustment to score adjustment in order to stabilize the calculation of pending compaction bytes.
(2) Changing the prioritization of level to compact.

In terms of desirability, (1) seems clearly desirable. It does make us stall earlier compared to before in certain scenarios, but we have been reasonably successful in having customers increase/disable the stalling limits as needed, and could probably increase defaults too, so this is fine with me.

The new heuristic (2) is more difficult. I need to study it more closely tomorrow but it certainly appears to have the advantage that it can be "always-on", unlike the level multiplier smoothing we had before.

@siying
Contributor Author

siying commented May 26, 2022

(1) Switching target level size adjustment to score adjustment in order to stabilize the calculation of pending compaction bytes.
To clarify, it's not just about the calculation of pending compaction bytes. Once we have adjusted level targets, some levels no longer qualify for compaction. For example, consider the following sizes per level:

L0: 5GB
L1: 200 MB (unadjusted target 100MB)
L2: 2 GB (unadjusted target 1GB)
L3: 15 GB (unadjusted target 10GB)
L4: 100 GB (unadjusted target 100GB)

With unadjusted level sizes, all levels would qualify for compaction, so some L2->L3 and L3->L4 compactions would happen while L0->L1 is in progress. However, with adjusted level sizing, the targets would look like this:

L0: 5GB
L1: 200 MB (adjusted target 5 GB)
L2: 2 GB (adjusted target 13.6 GB)
L3: 15 GB (adjusted target 36.8 GB)
L4: 100 GB (adjusted target 100GB)

so only the L0->L1 compaction will be going on and all other levels' compactions will be on hold. With this change, L3->L4 will also happen if there are free compaction slots.
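For reference, the adjusted targets above follow from the dynamic sizing rule: the L1 target is raised to the L0 size (5 GB), L4 keeps its 100 GB target, and the levels in between are spaced by the geometric multiplier (100/5)^(1/3) ≈ 2.71, giving roughly 13.6 GB and 36.8 GB. A minimal sketch of that arithmetic (plain geometric interpolation, not the exact RocksDB code):

```cpp
#include <cmath>
#include <cstdio>

// Sketch of how the adjusted targets in the example above are derived:
// the base target is raised to the L0 size and intermediate levels are
// spaced by a geometric multiplier up to the last level's target.
int main() {
  const double base_target_gb = 5.0;    // adjusted L1 target = L0 size
  const double last_target_gb = 100.0;  // L4 target stays at 100 GB
  const int steps = 3;                  // L1 -> L4 spans 3 multiplier steps
  const double multiplier =
      std::pow(last_target_gb / base_target_gb, 1.0 / steps);  // ~2.71

  double target_gb = base_target_gb;
  for (int level = 1; level <= 4; ++level) {
    std::printf("L%d adjusted target ~= %.1f GB\n", level, target_gb);
    target_gb *= multiplier;  // prints ~5.0, 13.6, 36.8, 100.0
  }
  return 0;
}
```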

@ajkr
Contributor

ajkr commented Jun 7, 2022

This seems like a good thing to try. I would guess it helps space-amp during write bursts and doesn't hurt write-amp (?). Some experimental data would be helpful.

Some notes:

  • Agreed this can increase parallelism
    • Reverting the adaptive level sizing alone might increase parallelism similarly
  • This may go well with "Use Env::IO_MID for the L0->L0 && L0->L1 compaction" (#9999), since the parallel compactions it introduces in lower levels seem less urgent than ongoing intra-L0 and L0->Lbase compactions
  • I wonder if the estimated upper bytes coming down in the denominator encourages hourglass LSM shapes. It seems levels like Lbase, Lbase+1 will be deprioritized the most.

@mdcallag
Contributor

mdcallag commented Jun 7, 2022

@siying how will this impact when stalls occur? Does it mean the stall conditions won't be adjusted?

My other question is: if we have a feature targeted at handling write bursts, is it worth the additional complexity of trying to distinguish between bursts of writes and a steady state of high write rates? Because #9423 is caused by a steady state of high write rates.

@siying
Contributor Author

siying commented Jun 7, 2022

@siying how will this impact when stalls occur? Does it mean the stall conditions won't be adjusted?
Stall condition would be the same as without adjustable level targets.

My other question is: if we have a feature targeted at handling write bursts, is it worth the additional complexity of trying to distinguish between bursts of writes and a steady state of high write rates? Because #9423 is caused by a steady state of high write rates.
I hope this proposal helps with bursts of writes a little. If a burst of writes comes in through L0, upper-level compactions take lower priority relative to lower levels than they did previously.

@mdcallag
Contributor

mdcallag commented Jun 8, 2022

One more suggestion from Manos that I think is interesting...

Have we considered removing write stalls, keeping write slowdowns, but making the slowdown time a function of the severity of the write overload? The goal is to dynamically adjust the write slowdown to figure out what it needs to be to make ingest match outgest (outgest == how fast RocksDB can reduce compaction debt).

@mdcallag
Contributor

mdcallag commented Jun 8, 2022

With a b-tree the behavior is close to "pay as you go" for writes. When the buffer pool is full of dirty pages, a new RMW must do some writeback before it reads the to-be-modified block into the buffer pool, because it must evict a dirty page before doing the read. This limits the worst-case write stall, ignoring other perf problems with checkpoint.

But an LSM decouples the write (debt creation) from compaction (debt repayment). Write slowdowns are a way to couple them, but from memory the current write stall uses a fixed wait (maybe 1 millisecond). We can estimate the cost of debt repayment as X = compaction-seconds / ingest-bytes and then make the slowdown ~= X * bytes-to-be-written. This debt-repayment estimate assumes that compaction is fully sequential, which is a worst-case assumption since some of the repayment is concurrent.

From recent benchmarks I have done, the value of X is approximately 0.1 microseconds per byte of ingest. One example is:

  • ingest = 318.4 GB
  • compaction wall clock seconds = 31901

This was measured via db_bench --benchmarks=overwrite,waitforcompaction

I know there is a limit on how short a wait we can implement if a thread is to sleep, although I don't know what that is. Short waits could be implemented by spinning on a CPU but that has bad side effects.
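To make the arithmetic above concrete, here is a minimal sketch (not RocksDB code) of the proposed delay calculation using the numbers from this comment; the 1 MB write size is a hypothetical example:

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <thread>

// Sketch of the proposed dynamic write slowdown:
//   X = compaction-seconds / ingest-bytes (measured from a past run),
//   delay for a write of N bytes ~= X * N.
// The constants come from the example above:
// 31901 compaction seconds for 318.4 GB of ingest => X ~= 0.1 us/byte.
int main() {
  const double compaction_secs = 31901.0;
  const double ingest_bytes = 318.4 * 1e9;
  const double x_secs_per_byte = compaction_secs / ingest_bytes;  // ~1e-7 s/byte

  const uint64_t write_bytes = 1 << 20;  // hypothetical 1 MB write batch
  const double delay_secs = x_secs_per_byte * write_bytes;        // ~0.1 s

  std::printf("X = %.3f us/byte, delay = %.1f ms\n", x_secs_per_byte * 1e6,
              delay_secs * 1e3);
  // A write path could then sleep for the computed delay before admitting
  // the write (subject to a minimum practical sleep granularity).
  std::this_thread::sleep_for(std::chrono::duration<double>(delay_secs));
  return 0;
}
```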

@ajkr
Contributor

ajkr commented Jun 8, 2022

One more suggestion from Manos that I think is interesting...

Have we considered removing write stalls, keeping write slowdowns, but making the slowdown time a function of the severity of the write overload? The goal is to dynamically adjust the write slowdown to figure out what it needs to be to make ingest match outgest (outgest == how fast RocksDB can reduce compaction debt).

Does write slowdown before reaching a hard limit ever help write latencies in an open system? See Section 3.2 of https://arxiv.org/abs/1906.09667 for explanation on the limitations of using closed-loop benchmarks. I can see scenarios where write slowdowns hurt write latencies in an open system (cases where the workload could be handled without breaching the limits, but gets slowed down - thus building up a backlog - because the workload brought the DB near its limits) but have yet to see a scenario where it helps.

@siying
Contributor Author

siying commented Jun 8, 2022

One more suggestion from Manos that I think is interesting...

Have we considered removing write stalls, keeping write slowdowns, but making the slowdown time a function of the severity of the write overload? The goal is to dynamically adjust the write slowdown to figure out what it needs to be to make ingest match outgest (outgest == how fast RocksDB can reduce compaction debt).

"Making the slowdown time a function of the severity of the write overload" is already partially done. The more L0 files there are, the lower the write rate we set. It has not been extended to estimated compaction debt and might not work well enough with L0->L0 compaction.

@mdcallag
Contributor

mdcallag commented Jun 13, 2022

I tested 3 binaries using a96a4a2 as the base. Tests were repeated for IO-bound (database larger than RAM) and cached (database cached by RocksDB) workloads. The test is benchmark.sh run the way I run it.

The binaries are:

  • pre - a96a4a2 as-is
  • post - a96a4a2 with Siying's RFC
  • nonadaptive - a96a4a2 with intra-L0 and dynamic target resizing disabled

First I will show throughput over time during overwrite, which runs at the end of the benchmark. The nonadaptive binary has little variance; the pre and post binaries have a lot.

This is for cached.

Inserts_second, IO-bound

This is for IO-bound

Inserts_second, IO-bound

From the benchmark summary for cached:

  • at the top are the results for fillseq where nonadaptive had the best insert rate, although not by much (980k, 960k, 1003k) for (pre, post, nonadaptive).
  • at the bottom are the results for overwrite where ...
    • nonadaptive has the best insert rate, then pre, then post (317k, 243k, 107k per second)
    • worst-case write stalls were similar
    • stall percentages were 49.6%, 76.2%, 33.5% for pre, post and nonadaptive

From the benchmark summary for IO-bound:

  • at the top are the results for fillseq where nonadaptive had the worst insert rate (343k, 329k, 213k) for (pre, post, nonadaptive). The nonadaptive binary only did trivial moves while pre/post did some regular compaction. Worst-case write stalls were much worse for pre/post.
  • at the bottom are the results for overwrite where ...
    • nonadaptive has the best insert rate, then pre, then post (129k, 127k, 96k per second)
    • worst-case write stalls were ~351, ~121, 1.6 seconds for pre, post, nonadaptive
    • stall percentages were 58.5%, 71.8%, 42.9% for pre, post and nonadaptive

Write stall counters are here.

  • for cached, nonadaptive gets more level0_slowdown
  • for IO-bound, nonadaptive gets more pending_compaction_bytes slowdown, while pre & post get more pending_compaction_bytes stops and more level0_slowdown

@ajkr
Contributor

ajkr commented Jun 13, 2022

First I will show throughput over time during overwrite, which runs at the end of the benchmark. The nonadaptive binary has little variance; the pre and post binaries have a lot.

Each curve in these graphs is using a different workload. That's the problem with closed-loop benchmarks that I alluded to earlier: "See Section 3.2 of https://arxiv.org/abs/1906.09667 for explanation on the limitations of using closed-loop benchmarks". The graphs give no indication of whether the "pre" or "post" binaries could handle the workload that was sent to the "nonadaptive" binary with acceptable write latencies.

@siying
Contributor Author

siying commented Jun 13, 2022

  • fillseq

To clarify, nonadaptive removes not only adaptive level sizing but also L0->L0 compaction, right?

@mdcallag
Contributor

@ajkr All binaries get the same workload. The workload is to send writes to RocksDB faster than compaction can handle. The goal is to see how well or how poorly RocksDB handles it.

@siying
Contributor Author

siying commented Jun 13, 2022

First I will show throughput over time during overwrite, which runs at the end of the benchmark. The nonadaptive binary has little variance; the pre and post binaries have a lot.

Each curve in these graphs is using a different workload. That's the problem with closed-loop benchmarks that I alluded to earlier: "See Section 3.2 of https://arxiv.org/abs/1906.09667 for explanation on the limitations of using closed-loop benchmarks". The graphs give no indication of whether the "pre" or "post" binaries could handle the workload that was sent to the "nonadaptive" binary with acceptable write latencies.

Reading Section 3.2 of the paper you referred to, I think I get your point that a benchmark that writes as fast as it can isn't a good indication of the sustainable write throughput without stalling. I don't think @mdcallag is claiming his benchmark measures the write throughput without stalling. The question is: do you think it is valuable to measure the stalling when users write as fast as they can? The fact that most users probably won't write to the DB in this style doesn't necessarily mean it isn't a valid use case to measure.

@ajkr
Contributor

ajkr commented Jun 13, 2022

The question is: do you think it is valuable to measure the stalling when users write as fast as they can? The fact that most users probably won't write to the DB in this style doesn't necessarily mean it isn't a valid use case to measure.

Yes, I just want to be clear about the limitations and relevance to production so we don't overfit the system to this kind of benchmark. One example is that we force a slowdown when N-1 memtables are full and the memtable limit is >= 3, even though that should reduce peak sustainable throughput. Other ideas I've heard recently, like replacing stops with slowdowns, also sound harmful to peak sustainable throughput since they will necessarily slow down writes before any limit has been breached.

@ajkr All binaries get the same workload. The workload is to send writes to RocksDB faster than compaction can handle.

For me, the same workload means the same requests are sent at the same time. That can't be the case here because the inserts/second graph shows "pre" and "post" sometimes get higher QPS than "nonadaptive", and at other times get lower QPS. I believe that's because the workload is dictated by the binary (i.e., a RocksDB slowdown slows down the workload). So different binaries will produce different workloads.

@siying
Contributor Author

siying commented Jun 13, 2022

The question is: do you think it is valuable to measure the stalling when users write as fast as they can? The fact that most users probably won't write to the DB in this style doesn't necessarily mean it isn't a valid use case to measure.

Yes, I just want to be clear about the limitations and relevance to production so we don't overfit the system to this kind of benchmark. One example is that we force a slowdown when N-1 memtables are full and the memtable limit is >= 3, even though that should reduce peak sustainable throughput. Other ideas I've heard recently, like replacing stops with slowdowns, also sound harmful to peak sustainable throughput since they will necessarily slow down writes before any limit has been breached.

There are several metrics:

  1. Sustainable write throughput without stalling/stopping
  2. Sustainable write throughput with stalling/stopping
  3. Longest single stall time while the DB is written to at an unbounded write rate

I believe @mdcallag tried to measure 2 and 3 and claimed that non-adaptive is the best for these two metrics. Your point is that 1 is not measured. There is indeed a question of how we should trade off 1, 2 and 3 when they conflict, but it's still not clear to me that they conflict with the current implementation; I doubt that is the case. Indeed, 1 is very hard to measure, and now we are deadlocked and won't be able to make progress.

(My question about whether fillseq is a good benchmark to measure this PR is totally orthogonal to this).

@ajkr
Contributor

ajkr commented Jun 14, 2022

but it's still not clear to me that they conflict with the current implementation; I doubt that is the case. Indeed, 1 is very hard to measure, and now we are deadlocked and won't be able to make progress.

I don't know what idea we're talking about being blocked. For this PR, it is fine with me; I don't see a problem if it helps write-amp or some other metric. For other ideas mentioned, like disabling intra-L0 or replacing stops with slowdowns, I suspect they'll make things worse for 1, so I don't see those as progress right now.

@mdcallag
Contributor

There are several metrics:

  1. Sustainable write throughput without stalling/stopping
  2. Sustainable write throughput with stalling/stopping
  3. Longest single stall time while the DB is written to at an unbounded write rate

I believe @mdcallag tried to measure 2 and 3 and claimed that non-adaptive is the best for these two metrics. Your point is that 1 is not measured. There is indeed a question of how we should trade off 1, 2 and 3 when they conflict, but it's still not clear to me that they conflict with the current implementation; I doubt that is the case. Indeed, 1 is very hard to measure, and now we are deadlocked and won't be able to make progress.

I have no doubt that users encounter this. I assume that in most cases it isn't intentional. The goal is a DBMS that behaves better when overloaded. I encountered this with InnoDB and WiredTiger (usually via the insert benchmark). Worst-case write stalls with WiredTiger used to exceed 10 minutes; in recent versions that is reduced to less than 10 seconds. For both engines it took a while to fix, as the problem is complicated.

I didn't encounter this with Postgres, but mostly because they worked on the problem for many years before I started to use it.

My point is that behaving well when overloaded is a feature and something worth having in RocksDB.

WRT benchmarks that find the peak throughput for a DBMS while respecting an SLA -- that would be great to add to RocksDB and is even on my TODO list, just not high-pri given other things I work on. YCSB supports that; db_bench does not (today).

@mdcallag
Contributor

mdcallag commented Jun 20, 2022

This wraps up my work on perf tests for this PR.

I repeated the overwrite benchmark using 1, 2, 4, 8, 16 and 32 client threads where writes were rate limited to 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95 and 100 MB/second. The server has 40 CPUs and 80 HW threads (HT was enabled). I used three binaries labeled "pre", "post" and "nonadaptive", where "pre" is upstream RocksDB, "post" is RocksDB with this PR and "nonadaptive" is RocksDB with intra-L0 and dynamic level resizing disabled.

Graphs are provided for:

  • ops_sec - writes/second
  • p99, p99.9, p99.99 - response time at that percentile
  • pmax - max response time
  • stallpct - stall percentage reported by compaction IO statistics

Summary (by QoS I mean variance):

  • Throughput is harder to characterize
    • at 1 & 2 client threads nonadaptive and post have the best throughput (to be fair post is slightly better than nonadaptive)
    • at 4+ client threads nonadaptive and post have the best throughput up to ~50MB/s write rate. After that throughput for post degrades and nonadaptive has the best throughput
  • QoS with the pre and post binaries is much worse with 4+ client threads than with 1 or 2
  • QoS at 4+ threads degrades more for the post binary than the pre binary
  • nonadaptive does better at p99 and pmax but worse at p99.9 and p99.99 vs pre and post. By better/worse here I mean absolute values. The results for nonadaptive (mostly) have less variance.
  • peak throughput improves with more client threads (perhaps this is the benefit from batching)
  • the stallpct graphs are the hardest to characterize
    • at 1, 2 & 4 threads: nonadaptive has no stalls
    • at 1 and 2 threads: pre has stalls at 40+ MB/s, post at 95+ MB/s
    • at 4 threads: pre still has stalls at 40+ MB/s, post at 65+ MB/s but the slope for post is more vertical
    • at 8, 16 & 32 threads: nonadaptive has stalls at 65+ MB/s, pre at 40+ MB/s, post at 60+ MB/s, the slope for post is more vertical

Graphs to follow in separate posts.

@mdcallag
Contributor

These have writes/second (ops_sec)

1 client thread
ops_sec nt1

2 client threads
ops_sec nt2

4 client threads
ops_sec nt4

8 client threads
ops_sec nt8

16 client threads
ops_sec nt16

32 client threads
ops_sec nt32

@mdcallag
Contributor

p99 response time in microseconds

1 client thread
p99 nt1

2 client threads
p99 nt2

4 client threads
p99 nt4

8 client threads
p99 nt8

16 client threads
p99 nt16

32 client threads
p99 nt32

@mdcallag
Contributor

p99.9 percentile response time in microseconds

1 client thread
p99 9 nt1

2 client threads
p99 9 nt2

4 client threads
p99 9 nt4

8 client threads
p99 9 nt8

16 client threads
p99 9 nt16

32 client threads
p99 9 nt32

@mdcallag
Contributor

p99.99 percentile response time

1 client thread
p99 99 nt1

2 client threads
p99 99 nt2

4 client threads
p99 99 nt4

8 client threads
p99 99 nt8

16 client threads
p99 99 nt16

32 client theads
p99 99 nt32

@mdcallag
Contributor

Max response time in microseconds

1 client thread
pmax nt1

2 client threads
pmax nt2

4 client threads
pmax nt4

8 client threads
pmax nt8

16 client threads
pmax nt16

32 client threads
pmax nt32

@mdcallag
Contributor

Stall percentage as reported by compaction IO statistics

1 client thread
stallpct nt1

2 client threads
stallpct nt2

4 client threads
stallpct nt4

8 client threads
stallpct nt8

16 client threads
stallpct nt16

32 client threads
stallpct nt32

@siying
Contributor Author

siying commented Jun 21, 2022

@mdcallag thanks for helping with benchmarking. It looks like, for the area this PR is targeting (max stalling), the PR is only slightly better. There might be something wrong with the previous assumption; let me investigate.

@mdcallag
Contributor

mdcallag commented Jul 5, 2022

This has graphs for 4 binaries:

  • pre - same as above, unchanged upstream as of 8f59c41
  • post - 8f59c41 with this diff (PR 10057) applied
  • post2 - post and then apply 6115254
  • nonadapt - 8f59c41 and then disable intra-L0 and dynamic level target resizing

Throughput (operations/second = ops_sec)

1 client thread
ops_sec nt1

16 client threads
ops_sec nt16

32 client threads
ops_sec nt32

@mdcallag
Contributor

mdcallag commented Jul 5, 2022

Write amplification from overwrite with a wait for compaction to finish

1 client thread
w_amp nt1

16 client threads
w_amp nt16

32 client threads
w_amp nt32

@mdcallag
Contributor

mdcallag commented Jul 5, 2022

Write stall percentage

1 client thread
stallpct nt1

16 client threads
stallpct nt16

32 client threads
stallpct nt32

@mdcallag
Contributor

mdcallag commented Jul 5, 2022

p99 response time

1 client thread
p99 nt1

16 client threads
p99 nt16

32 client threads
p99 nt32

@mdcallag
Contributor

mdcallag commented Jul 5, 2022

p99.9 response time

1 client thread
p99 9 nt1

16 client threads
p99 9 nt16

32 client threads
p99 9 nt32

@mdcallag
Contributor

mdcallag commented Jul 5, 2022

p99.99 response time

1 client thread
p99 99 nt1

16 client threads
p99 99 nt16

32 client threads
p99 99 nt32

@mdcallag
Contributor

mdcallag commented Jul 5, 2022

max response time

1 client thread
pmax nt1

16 client threads
pmax nt16

32 client threads
pmax nt32

@siying
Contributor Author

siying commented Jul 5, 2022

max response time

1 client thread pmax nt1

16 client threads pmax nt16

32 client threads pmax nt32

Thanks for all the help rerunning the benchmarks.
The result is quite different from the previous run. In the previous run, "post" still showed some second-level latency, and in this result it is gone. Any idea why?

@mdcallag
Contributor

mdcallag commented Jul 6, 2022

Revisiting what I posted ~2 weeks ago, some of those graphs are bogus. I am not sure why. The graphs from today look good when I compare them with the data in the text files.

@mdcallag
Contributor

mdcallag commented Jul 6, 2022

Graphs for throughput vs time at 1-second intervals using 32 client threads.

First, for a 50 MB/s write rate where stalls aren't a problem
g nt32 wmbps50 pre

g nt32 wmbps50 post

g nt32 wmbps50 post2

g nt32 wmbps50 nonadapt

@mdcallag
Contributor

mdcallag commented Jul 6, 2022

And then for a 60 MB/s write rate where stalls are an issue
g nt32 wmbps60 pre

g nt32 wmbps60 post

g nt32 wmbps60 post2

g nt32 wmbps60 nonadapt

@siying
Contributor Author

siying commented Jul 6, 2022

And then for a 60 MB/s write rate where stalls are an issue g nt32 wmbps60 pre

g nt32 wmbps60 post

g nt32 wmbps60 post2

g nt32 wmbps60 nonadapt

Just to clarify, is it also using delayed_write_rate = 8MB? If that is the case, then it's not surprising that throughput quickly drops to about 1/8 of the sustained rate and sometimes dips further. 8MB is about 1/7.5 of the fixed 60MB rate, so any time a slowdown condition triggers, throughput drops to that level and might go further down. If we zoom into the 8MB/s base range, the graphs show that "post2" is doing significantly better than "post": "post2" rarely goes more than one order of magnitude below 8MB/s, while "post" often goes much lower.

@mdcallag
Contributor

mdcallag commented Jul 7, 2022

The benchmark scripts don't set delayed_write_rate so the default, 8MB, is used. Confirmed by looking at LOG.

@siying
Contributor Author

siying commented Jul 7, 2022

The benchmark scripts don't set delayed_write_rate so the default, 8MB, is used. Confirmed by looking at LOG.
Indeed, db_bench has a default of 8MB: https://github.com/facebook/rocksdb/blob/main/tools/db_bench_tool.cc#L1343-L1345, which differs from RocksDB's default of 16MB (or the rate-limit bytes of options.rate_limiter if one is specified). It's not necessarily right or wrong, but it could explain the 8-fold drop when stalling happens.
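For anyone reproducing this, the slowdown rate can be set explicitly rather than relying on either default. db_bench exposes it as --delayed_write_rate, and in code it can be set on Options; a minimal sketch (option and flag names are real, the value and path are illustrative):

```cpp
#include "rocksdb/db.h"
#include "rocksdb/options.h"

// Sketch: set delayed_write_rate explicitly instead of relying on the
// db_bench default (8MB/s) or the RocksDB default (16MB/s).
int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  options.delayed_write_rate = 16 << 20;  // 16 MB/s while writes are being delayed

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/delayed_write_demo", &db);
  if (s.ok()) {
    delete db;
  }
  return s.ok() ? 0 : 1;
}
```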

siying added a commit to siying/rocksdb that referenced this pull request Aug 10, 2022
Summary:
facebook#10057 caused a regression bug: since the base level size is no longer adjusted based on L0 size, the L0 score might become very large. This makes compaction heavily favor L0->L1 compaction over L1->L2 compaction and, in some cases, causes data to get stuck in L1 without being moved down. We fix this by calculating the L0 score the same way as L1's, so that L0->L1 is favored if L0 is larger than L1.

Test Plan: run db_bench against data on tmpfs and verify that the behavior of data getting stuck in L1 goes away.
facebook-github-bot pushed a commit that referenced this pull request Aug 12, 2022
Summary:
#10057 caused a regression bug: since the base level size is no longer adjusted based on L0 size, the L0 score might become very large. This makes compaction heavily favor L0->L1 compaction over L1->L2 compaction and, in some cases, causes data to get stuck in L1 without being moved down. We fix this by calculating the L0 score as size(L0)/size(L1) in the case where L0 is large.

Pull Request resolved: #10518

Test Plan: run db_bench against data on tmpfs and verify that the behavior of data getting stuck in L1 goes away.

Reviewed By: ajkr

Differential Revision: D38603145

fbshipit-source-id: 4949e52dc28b54aacfe08417c6e6cc7e40a27225
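As an illustration of the fix described in this commit, here is a minimal sketch (not the actual patch; the function name and the file-count term are simplified assumptions) of scoring L0 relative to L1 when L0 is large:

```cpp
#include <algorithm>
#include <cstdint>

// Sketch of the L0 score fix described above (not the actual patch):
// when L0 holds a lot of data, score it by size(L0)/size(L1) so that
// L0->L1 compaction is strongly favored only while L0 is larger than L1.
double L0Score(uint64_t l0_bytes, uint64_t l1_bytes, uint64_t l0_file_count,
               uint64_t l0_file_num_compaction_trigger) {
  // File-count-based score, as before.
  double file_score = static_cast<double>(l0_file_count) /
                      static_cast<double>(l0_file_num_compaction_trigger);
  // Size-based score relative to L1 instead of a possibly tiny base target.
  double size_score = static_cast<double>(l0_bytes) /
                      static_cast<double>(std::max<uint64_t>(l1_bytes, 1));
  return std::max(file_score, size_score);
}
```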
@dongdongwcpp

The wiki should be updated since the target level size is no longer adjusted? @siying

facebook-github-bot pushed a commit that referenced this pull request Jun 16, 2023
…11525)

Summary:
After #11321 and #11340 (both included in RocksDB v8.2), migration from `level_compaction_dynamic_level_bytes=false` to `level_compaction_dynamic_level_bytes=true` is handled automatically by RocksDB and requires no manual compaction from the user. This makes the option true by default, as it has several advantages: 1. a better space-amplification guarantee (a more stable LSM shape); 2. compaction is more adaptive to write traffic; 3. automatic draining of unneeded levels. The wiki is updated with more detail: https://github.com/facebook/rocksdb/wiki/Leveled-Compaction#option-level_compaction_dynamic_level_bytes-and-levels-target-size.

The PR mostly contains fixes for unit tests, as they assumed `level_compaction_dynamic_level_bytes=false`. The most notable changes are commits f742be3 and b1928e4, which override the default option in DBTestBase to still set `level_compaction_dynamic_level_bytes=false` by default. This helps reduce the changes needed for unit tests. I think this default-option override in unit tests is okay, since the behavior of `level_compaction_dynamic_level_bytes=true` is tested by explicitly setting the option. Also, `level_compaction_dynamic_level_bytes=false` may be more desirable in unit tests, as it makes it easier to create a desired LSM shape.

The comment for option `level_compaction_dynamic_level_bytes` is updated to reflect this change and the change made in #10057.

Pull Request resolved: #11525

Test Plan: `make -j32 J=32 check` several times to try to catch flaky tests due to this option change.

Reviewed By: ajkr

Differential Revision: D46654256

Pulled By: cbi42

fbshipit-source-id: 6b5827dae124f6f1fdc8cca2ac6f6fcd878830e1
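For completeness, the option discussed in this commit can still be set explicitly per DB; a minimal sketch (the value shown is illustrative, e.g. to keep the old behavior in a test):

```cpp
#include "rocksdb/options.h"

// Sketch: override the new default of level_compaction_dynamic_level_bytes.
rocksdb::Options MakeOptions() {
  rocksdb::Options options;
  options.level_compaction_dynamic_level_bytes = false;  // old behavior
  return options;
}
```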
tabokie pushed a commit to tabokie/rocksdb that referenced this pull request Aug 14, 2023
…book#10057)

Summary:
The current level targets for dynamic leveling have a problem: the target level sizes change dramatically after an L0->L1 compaction. When there are many L0 bytes, lower-level compactions are delayed, but they resume after the L0->L1 compaction finishes, so the expected write-amplification benefits might not be realized. The proposal here is to stop adjusting the level target sizes and instead adjust the score for each level to prioritize the levels that most need compaction.
Basic idea:
(1) The target level size isn't adjusted, but the score is. The reasoning is that with parallel compactions, holding compactions back might not be desirable, but we would like compactions to be scheduled from the level that needs them most. For example, if we have an extra-large L2, we would like all compactions to be scheduled as L2->L3 compactions rather than L4->L5. This gets complicated when a large L0->L1 compaction is going on: should we compact L2->L3 or L4->L5? So the proposal for that is:
(2) The score is calculated as actual level size / (target size + estimated bytes coming down from upper levels). The reasoning is that if a large amount of pending L0/L1 bytes is coming down, compacting L2->L3 might be more expensive: once the L0 bytes are compacted down to L2, the actual L2->L3 fanout would change dramatically. On the other hand, by the time those bytes reach L5, the impact on the L5->L6 fanout is much smaller. So when calculating the score, we adjust it by adding the estimated downward bytes to the target level size.

Pull Request resolved: facebook#10057

Test Plan: Repurpose the VersionStorageInfoTest.MaxBytesForLevelDynamicWithLargeL0_* tests to cover this scenario.

Reviewed By: ajkr

Differential Revision: D37539742

fbshipit-source-id: 9c154cbfe92023f918cf5d80875d8776ad4831a4
Signed-off-by: tabokie <xy.tao@outlook.com>
tabokie pushed a commit to tabokie/rocksdb that referenced this pull request Aug 14, 2023
Summary:
facebook#10057 caused a regression bug: since the base level size is no longer adjusted based on L0 size, the L0 score might become very large. This makes compaction heavily favor L0->L1 compaction over L1->L2 compaction and, in some cases, causes data to get stuck in L1 without being moved down. We fix this by calculating the L0 score as size(L0)/size(L1) in the case where L0 is large.

Pull Request resolved: facebook#10518

Test Plan: run db_bench against data on tmpfs and verify that the behavior of data getting stuck in L1 goes away.

Reviewed By: ajkr

Differential Revision: D38603145

fbshipit-source-id: 4949e52dc28b54aacfe08417c6e6cc7e40a27225
Signed-off-by: tabokie <xy.tao@outlook.com>