
Change The Way Level Target And Compaction Score Are Calculated #10057

Closed
wants to merge 5 commits

Conversation

siying
Contributor

@siying siying commented May 25, 2022

Summary:
The current level targets for dynamic leveling have a problem: the target level sizes change dramatically after an L0->L1 compaction. When there are many L0 bytes, lower-level compactions are delayed, but they resume after the L0->L1 compaction finishes, so the expected write-amplification benefits might not be realized. The proposal here is to stop adjusting the level target sizes and instead adjust the score for each level to prioritize the levels that most need compaction.
Basic idea:
(1) The target level size isn't adjusted, but the score is. The reasoning is that with parallel compactions, holding compactions back might not be desirable, but we would like compactions to be scheduled from the level that needs them most. For example, if we have an extra-large L2, we would like all compactions to be scheduled as L2->L3 compactions rather than L4->L5. This gets complicated when a large L0->L1 compaction is going on: should we compact L2->L3 or L4->L5? So the proposal for that is:
(2) The score is calculated as actual level size / (target size + estimated bytes coming down from upper levels). The reasoning is that if a large amount of pending L0/L1 bytes is coming down, compacting L2->L3 might be more expensive: once the L0 bytes are compacted down to L2, the actual L2->L3 fanout would change dramatically. On the other hand, by the time those bytes reach L5, the impact on the L5->L6 fanout is much smaller. So when calculating the score, we adjust it by adding the estimated downward bytes to the target level size.
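To illustrate the scoring idea, here is a minimal sketch in C++ (not the actual RocksDB implementation; the function name and the way incoming bytes are estimated are simplified assumptions):

```cpp
#include <cstdint>
#include <vector>

// Minimal sketch of the proposed per-level compaction score:
//   score = actual level size / (target size + estimated bytes coming down).
// level_bytes[i]  : actual size of level i
// target_bytes[i] : unadjusted target size of level i
// For L0 the caller can pass target_bytes[0] = 0 so that all L0 bytes count
// as bytes that will eventually come down.
std::vector<double> ComputeScores(const std::vector<uint64_t>& level_bytes,
                                  const std::vector<uint64_t>& target_bytes) {
  std::vector<double> scores(level_bytes.size(), 0.0);
  uint64_t estimated_incoming = 0;  // bytes expected to arrive from upper levels
  for (size_t i = 1; i < level_bytes.size(); ++i) {
    // Bytes above the upper level's target are assumed to compact down eventually.
    if (level_bytes[i - 1] > target_bytes[i - 1]) {
      estimated_incoming += level_bytes[i - 1] - target_bytes[i - 1];
    }
    scores[i] = static_cast<double>(level_bytes[i]) /
                static_cast<double>(target_bytes[i] + estimated_incoming);
  }
  return scores;
}
```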

Test Plan:
Repurpose the VersionStorageInfoTest.MaxBytesForLevelDynamicWithLargeL0_* tests to cover this scenario.

@siying
Contributor Author

siying commented May 25, 2022

#9423 is one symptom for the problem.

@ajkr
Contributor

ajkr commented May 26, 2022

My understanding is this bundles two changes, which is fine assuming they're both desirable, but it would be helpful if that were explicitly stated (or corrected):

(1) Switching target level size adjustment to score adjustment in order to stabilize the calculation of pending compaction bytes.
(2) Changing the prioritization of level to compact.

In terms of desirability, (1) seems clearly desirable. It does make us stall earlier compared to before in certain scenarios, but we have been reasonably successful in having customers increase/disable the stalling limits as needed, and could probably increase defaults too, so this is fine with me.

The new heuristic (2) is more difficult. I need to study it more closely tomorrow but it certainly appears to have the advantage that it can be "always-on", unlike the level multiplier smoothing we had before.

@siying
Contributor Author

siying commented May 26, 2022

(1) Switching target level size adjustment to score adjustment in order to stabilize the calculation of pending compaction bytes.
To clarify, it's not just about the calculation of pending compaction bytes. Once we have adjusted level targets, some levels no longer qualify for compaction. For example, consider the following sizes per level:

L0: 5GB
L1: 200 MB (unadjusted target 100MB)
L2: 2 GB (unadjusted target 1GB)
L3: 15 GB (unadjusted target 10GB)
L4: 100 GB (unadjusted target 100GB)

With unadjusted level sizes, all levels would qualify for compaction, so some L2->L3 and L3->L4 compactions would happen while L0->L1 is in progress. However, with adjusted level sizing, the targets would look like this:

L0: 5GB
L1: 200 MB (adjusted target 5 GB)
L2: 2 GB (adjusted target 13.6 GB)
L3: 15 GB (adjusted target 36.8 GB)
L4: 100 GB (adjusted target 100GB)

so only the L0->L1 compaction will be going on and all other levels' compactions will be on hold. With this change, L3->L4 will also happen if there are free compaction slots.
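For reference, the adjusted targets above follow from the dynamic sizing rule: the L1 target is raised to the L0 size (5 GB), L4 keeps its 100 GB target, and the levels in between are spaced by the geometric multiplier (100/5)^(1/3) ≈ 2.71, giving roughly 13.6 GB and 36.8 GB. A minimal sketch of that arithmetic (plain geometric interpolation, not the exact RocksDB code):

```cpp
#include <cmath>
#include <cstdio>

// Sketch of how the adjusted targets in the example above are derived:
// the base target is raised to the L0 size and intermediate levels are
// spaced by a geometric multiplier up to the last level's target.
int main() {
  const double base_target_gb = 5.0;    // adjusted L1 target = L0 size
  const double last_target_gb = 100.0;  // L4 target stays at 100 GB
  const int steps = 3;                  // L1 -> L4 spans 3 multiplier steps
  const double multiplier =
      std::pow(last_target_gb / base_target_gb, 1.0 / steps);  // ~2.71

  double target_gb = base_target_gb;
  for (int level = 1; level <= 4; ++level) {
    std::printf("L%d adjusted target ~= %.1f GB\n", level, target_gb);
    target_gb *= multiplier;  // prints ~5.0, 13.6, 36.8, 100.0
  }
  return 0;
}
```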

@ajkr
Contributor

ajkr commented Jun 7, 2022

This seems like a good thing to try. I would guess it helps space-amp during write bursts and doesn't hurt write-amp (?). Some experimental data would be helpful.

Some notes:

  • Agreed this can increase parallelism
    • Reverting the adaptive level sizing alone might increase parallelism similarly
  • This may go well with "Use Env::IO_MID for the L0->L0 && L0->L1 compaction" (#9999), since the parallel compactions it introduces in lower levels seem less urgent than ongoing intra-L0 and L0->Lbase compactions
  • I wonder if the estimated upper bytes coming down in the denominator encourages hourglass LSM shapes. It seems levels like Lbase, Lbase+1 will be deprioritized the most.

@mdcallag
Contributor

mdcallag commented Jun 7, 2022

@siying how will this impact when stalls occur? Does it mean the stall conditions won't be adjusted?

My other question is: if we have a feature targeted at handling write bursts, is it worth the additional complexity of trying to distinguish between bursts of writes and a steady state of high write rates? Because #9423 is caused by a steady state of high write rates.

@siying
Contributor Author

siying commented Jun 7, 2022

@siying how will this impact when stalls occur? Does it mean the stall conditions won't be adjusted?
Stall condition would be the same as without adjustable level targets.

My other question is: if we have a feature targeted at handling write bursts, is it worth the additional complexity of trying to distinguish between bursts of writes and a steady state of high write rates? Because #9423 is caused by a steady state of high write rates.
I hope this proposal helps with bursts of writes a little. If a burst of writes comes in through L0, upper-level compactions take lower priority relative to lower levels than they did previously.

@mdcallag
Contributor

mdcallag commented Jun 8, 2022

One more suggestion from Manos that I think is interesting...

Have we considered removing write stalls, keeping write slowdowns, but making the slowdown time a function of the severity of the write overload? The goal is to dynamically adjust the write slowdown to figure out what it needs to be to make ingest match outgest (outgest == how fast RocksDB can reduce compaction debt).

@mdcallag
Contributor

mdcallag commented Jun 8, 2022

With a b-tree the behavior is close to "pay as you go" for writes. When the buffer pool is full of dirty pages, a new RMW must do some writeback before it reads the to-be-modified block into the buffer pool, because it must evict a dirty page before doing the read. This limits the worst-case write stall, ignoring other perf problems with checkpoint.

But an LSM decouples the write (debt creation) from compaction (debt repayment). Write slowdowns are a way to couple them, but from memory the current write stall uses a fixed wait (maybe 1 millisecond). We can estimate the cost of debt repayment as X = compaction-seconds / ingest-bytes and then make the slowdown ~= X * bytes-to-be-written. This debt-repayment estimate assumes that compaction is fully sequential, which is a worst-case assumption since some of the repayment is concurrent.

From recent benchmarks I have done, the value of X is approximately 0.1 microseconds per byte of ingest. One example is:

  • ingest = 318.4 GB
  • compaction wall clock seconds = 31901

This was measured via db_bench --benchmarks=overwrite,waitforcompaction

I know there is a limit on how short a wait we can implement if a thread is to sleep, although I don't know what that is. Short waits could be implemented by spinning on a CPU but that has bad side effects.
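To make the arithmetic above concrete, here is a minimal sketch (not RocksDB code) of the proposed delay calculation using the numbers from this comment; the 1 MB write size is a hypothetical example:

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <thread>

// Sketch of the proposed dynamic write slowdown:
//   X = compaction-seconds / ingest-bytes (measured from a past run),
//   delay for a write of N bytes ~= X * N.
// The constants come from the example above:
// 31901 compaction seconds for 318.4 GB of ingest => X ~= 0.1 us/byte.
int main() {
  const double compaction_secs = 31901.0;
  const double ingest_bytes = 318.4 * 1e9;
  const double x_secs_per_byte = compaction_secs / ingest_bytes;  // ~1e-7 s/byte

  const uint64_t write_bytes = 1 << 20;  // hypothetical 1 MB write batch
  const double delay_secs = x_secs_per_byte * write_bytes;        // ~0.1 s

  std::printf("X = %.3f us/byte, delay = %.1f ms\n", x_secs_per_byte * 1e6,
              delay_secs * 1e3);
  // A write path could then sleep for the computed delay before admitting
  // the write (subject to a minimum practical sleep granularity).
  std::this_thread::sleep_for(std::chrono::duration<double>(delay_secs));
  return 0;
}
```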

@ajkr
Contributor

ajkr commented Jun 8, 2022

One more suggestion from Manos that I think is interesting...

Have we considered removing write stalls, keeping write slowdowns, but making the slowdown time a function of the severity of the write overload? The goal is to dynamically adjust the write slowdown to figure out what it needs to be to make ingest match outgest (outgest == how fast RocksDB can reduce compaction debt).

Does write slowdown before reaching a hard limit ever help write latencies in an open system? See Section 3.2 of https://arxiv.org/abs/1906.09667 for explanation on the limitations of using closed-loop benchmarks. I can see scenarios where write slowdowns hurt write latencies in an open system (cases where the workload could be handled without breaching the limits, but gets slowed down - thus building up a backlog - because the workload brought the DB near its limits) but have yet to see a scenario where it helps.

@siying
Contributor Author

siying commented Jun 8, 2022

One more suggestion from Manos that I think is interesting...

Have we considered removing write stalls, keeping write slowdowns, but making the slowdown time a function of the severity of the write overload? The goal is to dynamically adjust the write slowdown to figure out what it needs to be to make ingest match outgest (outgest == how fast RocksDB can reduce compaction debt).

"Making the slowdown time a function of the severity of the write overload" is already partially done. The more L0 files there are, the lower the write rate we set. It has not been extended to estimated compaction debt and might not work well enough with L0->L0 compaction.

@mdcallag
Contributor

mdcallag commented Jun 13, 2022

I tested 3 binaries using a96a4a2 as the base. Tests were repeated for IO-bound (database larger than RAM) and cached (database cached by RocksDB) workloads. The test is benchmark.sh run the way I run it.

The binaries are:

  • pre - a96a4a2 as-is
  • post - a96a4a2 with Siying's RFC
  • nonadaptive - a96a4a2 with intra-L0 and dynamic target resizing disabled

First I will show throughput over time during overwrite, which runs at the end of the benchmark. The nonadaptive binary has little variance; the pre and post binaries have a lot.

This is for cached.

Inserts_second, IO-bound

This is for IO-bound

Inserts_second, IO-bound

From the benchmark summary for cached:

  • at the top are the results for fillseq where nonadaptive had the best insert rate, although not by much (980k, 960k, 1003k) for (pre, post, nonadaptive).
  • at the bottom are the results for overwrite where ...
    • nonadaptive has the best insert rate, then pre, then post (317k, 243k, 107k per second)
    • worst-case write stalls were similar
    • stall percentages were 49.6%, 76.2%, 33.5% for pre, post and nonadaptive

From the benchmark summary for IO-bound:

  • at the top are the results for fillseq where nonadaptive had the worst insert rate (343k, 329k, 213k) for (pre, post, nonadaptive). The nonadaptive binary only did trivial moves while pre/post did some regular compaction. Worst-case write stalls were much worse for pre/post.
  • at the bottom are the results for overwrite where ...
    • nonadaptive has the best insert rate, then pre, then post (129k, 127k, 96k per second)
    • worst-case write stalls were ~351, ~121, 1.6 seconds for pre, post, nonadaptive
    • stall percentages were 58.5%, 71.8%, 42.9% for pre, post and nonadaptive

Write stall counters are here.

  • for cached, nonadaptive gets more level0_slowdown
  • for IO-bound, nonadaptive gets more pending_compaction_bytes slowdown, while pre & post get more pending_compaction_bytes stops and more level0_slowdown

@ajkr
Contributor

ajkr commented Jun 13, 2022

First I will show throughput over time during overwrite, which runs at the end of the benchmark. The nonadaptive binary has little variance; the pre and post binaries have a lot.

Each curve in these graphs is using a different workload. That's the problem with closed-loop benchmarks that I alluded to earlier: "See Section 3.2 of https://arxiv.org/abs/1906.09667 for explanation on the limitations of using closed-loop benchmarks". The graphs give no indication of whether the "pre" or "post" binaries could handle the workload that was sent to the "nonadaptive" binary with acceptable write latencies.

@siying
Contributor Author

siying commented Jun 13, 2022

  • fillseq

To clarify, nonadaptive removes not only adaptive level sizing but also L0->L0 compaction, right?

@mdcallag
Contributor

@ajkr All binaries get the same workload. The workload is to send writes to RocksDB faster than compaction can handle. The goal is to see how well or how poorly RocksDB handles it.

@siying
Contributor Author

siying commented Jun 13, 2022

First I will show throughput over time during overwrite, which runs at the end of the benchmark. The nonadaptive binary has little variance; the pre and post binaries have a lot.

Each curve in these graphs is using a different workload. That's the problem with closed-loop benchmarks that I alluded to earlier: "See Section 3.2 of https://arxiv.org/abs/1906.09667 for explanation on the limitations of using closed-loop benchmarks". The graphs give no indication of whether the "pre" or "post" binaries could handle the workload that was sent to the "nonadaptive" binary with acceptable write latencies.

Reading Section 3.2 of the paper you referred to, I think I get your point that a benchmark that writes as fast as it can isn't a good indication of the sustainable write throughput without stalling. I don't think @mdcallag is claiming his benchmark measures the write throughput without stalling. The question is: do you think it is valuable to measure the stalling when users write as fast as they can? The fact that most users probably won't write to the DB in this style doesn't necessarily mean it isn't a valid use case to measure.

@ajkr
Contributor

ajkr commented Jun 13, 2022

The question is: do you think it is valuable to measure the stalling when users write as fast as they can? The fact that most users probably won't write to the DB in this style doesn't necessarily mean it isn't a valid use case to measure.

Yes, I just want to be clear about the limitations and relevance to production so we don't overfit the system to this kind of benchmark. One example is that we force a slowdown when N-1 memtables are full and the memtable limit is >= 3, even though that should reduce peak sustainable throughput. Other ideas I've heard recently, like replacing stops with slowdowns, also sound harmful to peak sustainable throughput since they will necessarily slow down writes before any limit has been breached.

@ajkr All binaries get the same workload. The workload is to send writes to RocksDB faster than compaction can handle.

For me, the same workload means the same requests are sent at the same time. That can't be the case here because the inserts/second graph shows "pre" and "post" sometimes get higher QPS than "nonadaptive", and at other times get lower QPS. I believe that's because the workload is dictated by the binary (i.e., a RocksDB slowdown slows down the workload). So different binaries will produce different workloads.

@siying
Contributor Author

siying commented Jun 13, 2022

The question is: do you think it is valuable to measure the stalling when users write as fast as they can? The fact that most users probably won't write to the DB in this style doesn't necessarily mean it isn't a valid use case to measure.

Yes, I just want to be clear about the limitations and relevance to production so we don't overfit the system to this kind of benchmark. One example is that we force a slowdown when N-1 memtables are full and the memtable limit is >= 3, even though that should reduce peak sustainable throughput. Other ideas I've heard recently, like replacing stops with slowdowns, also sound harmful to peak sustainable throughput since they will necessarily slow down writes before any limit has been breached.

There are several metrics:

  1. Sustainable write throughput without stalling/stopping
  2. Sustainable write throughput with stalling/stopping
  3. Longest single stall time while the DB is written to at an unbounded write rate

I believe @mdcallag tried to measure 2 and 3 and claimed that non-adaptive is the best for these two metrics. Your point is that 1 is not measured. There is indeed a question of how we should trade off 1, 2 and 3 when they conflict, but it's still not clear to me that they conflict with the current implementation; I doubt that is the case. Indeed, 1 is very hard to measure, and now we are deadlocked and won't be able to make progress.

(My question about whether fillseq is a good benchmark to measure this PR is totally orthogonal to this).

@ajkr
Contributor

ajkr commented Jun 14, 2022

but it's still not clear to me that they conflict with the current implementation; I doubt that is the case. Indeed, 1 is very hard to measure, and now we are deadlocked and won't be able to make progress.

I don't know what idea we're talking about being blocked. For this PR, it is fine with me; I don't see a problem if it helps write-amp or some other metric. For other ideas mentioned, like disabling intra-L0 or replacing stops with slowdowns, I suspect they'll make things worse for 1, so I don't see those as progress right now.

@mdcallag
Contributor

There are several metrics:

  1. Sustainable write throughput without stalling/stopping
  2. Sustainable write throughput with stalling/stopping
  3. Longest single stall time while the DB is written to at an unbounded write rate

I believe @mdcallag tried to measure 2 and 3 and claimed that non-adaptive is the best for these two metrics. Your point is that 1 is not measured. There is indeed a question of how we should trade off 1, 2 and 3 when they conflict, but it's still not clear to me that they conflict with the current implementation; I doubt that is the case. Indeed, 1 is very hard to measure, and now we are deadlocked and won't be able to make progress.

I have no doubt that users encounter this. I assume that in most cases it isn't intentional. The goal is a DBMS that behaves better when overloaded. I encountered this with InnoDB and WiredTiger (usually via the insert benchmark). Worst-case write stalls with WiredTiger used to exceed 10 minutes; in recent versions that is reduced to less than 10 seconds. For both engines it took a while to fix, as the problem is complicated.

I didn't encounter this with Postgres, but mostly because they worked on the problem for many years before I started to use it.

My point is that behaving well when overloaded is a feature and something worth having in RocksDB.

WRT benchmarks that find the peak throughput for a DBMS while respecting an SLA -- that would be great to add to RocksDB and is even on my TODO list, just not high-pri given other things I work on. YCSB supports that; db_bench does not (today).

@mdcallag
Contributor

mdcallag commented Jun 20, 2022

This wraps up my work on perf tests for this PR.

I repeated the overwrite benchmark using 1, 2, 4, 8, 16 and 32 client threads where writes were rate limited to 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95 and 100 MB/second. The server has 40 CPUs and 80 HW threads (HT was enabled). I used three binaries labeled "pre", "post" and "nonadaptive", where "pre" is upstream RocksDB, "post" is RocksDB with this PR and "nonadaptive" is RocksDB with intra-L0 and dynamic level resizing disabled.

Graphs are provided for:

  • ops_sec - writes/second
  • p99, p99.9, p99.99 - response time at that percentile
  • pmax - max response time
  • stallpct - stall percentage reported by compaction IO statistics

Summary (by QoS I mean variance):

  • Throughput is harder to characterize
    • at 1 & 2 client threads nonadaptive and post have the best throughput (to be fair post is slightly better than nonadaptive)
    • at 4+ client threads nonadaptive and post have the best throughput up to ~50MB/s write rate. After that throughput for post degrades and nonadaptive has the best throughput
  • QoS with the pre and post binaries is much worse with 4+ client threads than with 1 or 2
  • QoS at 4+ threads degrades more for the post binary than the pre binary
  • nonadaptive does better at p99 and pmax but worse at p99.9 and p99.99 vs pre and post. By better/worse here I mean absolute values. The results for nonadaptive (mostly) have less variance.
  • peak throughput improves with more client threads (perhaps this is the benefit from batching)
  • the stallpct graphs are the hardest to characterize
    • at 1, 2 & 4 threads: nonadaptive has no stalls
    • at 1 and 2 threads: pre has stalls at 40+ MB/s, post at 95+ MB/s
    • at 4 threads: pre still has stalls at 40+ MB/s, post at 65+ MB/s but the slope for post is more vertical
    • at 8, 16 & 32 threads: nonadaptive has stalls at 65+ MB/s, pre at 40+ MB/s, post at 60+ MB/s, the slope for post is more vertical

Graphs to follow in separate posts.

@mdcallag
Contributor

These have writes/second (ops_sec)

1 client thread
ops_sec nt1

2 client threads
ops_sec nt2

4 client threads
ops_sec nt4

8 client threads
ops_sec nt8

16 client threads
ops_sec nt16

32 client threads
ops_sec nt32

@mdcallag
Contributor

p99 response time in microseconds

1 client thread
p99 nt1

2 client threads
p99 nt2

4 client threads
p99 nt4

8 client threads
p99 nt8

16 client threads
p99 nt16

32 client threads
p99 nt32

@mdcallag
Contributor

p99.9 percentile response time in microseconds

1 client thread
p99 9 nt1

2 client threads
p99 9 nt2

4 client threads
p99 9 nt4

8 client threads
p99 9 nt8

16 client threads
p99 9 nt16

32 client threads
p99 9 nt32

@mdcallag
Contributor

p99.99 percentile response time

1 client thread
p99 99 nt1

2 client threads
p99 99 nt2

4 client threads
p99 99 nt4

8 client threads
p99 99 nt8

16 client threads
p99 99 nt16

32 client theads
p99 99 nt32

@mdcallag
Contributor

Max response time in microseconds

1 client thread
pmax nt1

2 client threads
pmax nt2

4 client threads
pmax nt4

8 client threads
pmax nt8

16 client threads
pmax nt16

32 client threads
pmax nt32

@mdcallag
Contributor

Stall percentage as reported by compaction IO statistics

1 client thread
stallpct nt1

2 client threads
stallpct nt2

4 client threads
stallpct nt4

8 client threads
stallpct nt8

16 client threads
stallpct nt16

32 client threads
stallpct nt32

@siying
Contributor Author

siying commented Jun 21, 2022

@mdcallag thanks for helping with benchmarking. It looks like, for the area this PR is targeting (max stalling), the PR is only slightly better. There might be something wrong with the previous assumption; let me investigate.

@mdcallag
Contributor

mdcallag commented Jul 5, 2022

This has graphs for 4 binaries:

  • pre - same as above, unchanged upstream as of 8f59c41
  • post - 8f59c41 with this diff (PR 10057) applied
  • post2 - post and then apply 6115254
  • nonadapt - 8f59c41 and then disable intra-L0 and dynamic level target resizing

Throughput (operations/second = ops_sec)

1 client thread
ops_sec nt1

16 client threads
ops_sec nt16

32 client threads
ops_sec nt32

@mdcallag
Contributor

mdcallag commented Jul 5, 2022

Write amplification from overwrite with a wait for compaction to finish

1 client thread
w_amp nt1

16 client threads
w_amp nt16

32 client threads
w_amp nt32

@mdcallag
Contributor

mdcallag commented Jul 5, 2022

Write stall percentage

1 client thread
stallpct nt1

16 client threads
stallpct nt16

32 client threads
stallpct nt32

@mdcallag
Contributor

mdcallag commented Jul 5, 2022

p99 response time

1 client thread
p99 nt1

16 client threads
p99 nt16

32 client threads
p99 nt32

@mdcallag
Contributor

mdcallag commented Jul 5, 2022

p99.9 response time

1 client thread
p99 9 nt1

16 client threads
p99 9 nt16

32 client threads
p99 9 nt32

@mdcallag
Contributor

mdcallag commented Jul 5, 2022

p99.99 response time

1 client thread
p99 99 nt1

16 client threads
p99 99 nt16

32 client threads
p99 99 nt32

@mdcallag
Contributor

mdcallag commented Jul 5, 2022

max response time

1 client thread
pmax nt1

16 client threads
pmax nt16

32 client threads
pmax nt32

@siying
Contributor Author

siying commented Jul 5, 2022

max response time

1 client thread pmax nt1

16 client threads pmax nt16

32 client threads pmax nt32

Thanks for all the help rerunning the benchmarks.
The result is quite different from the previous run. In the previous run, "post" still showed some second-level latency, and in this result it is gone. Any idea why?

@mdcallag
Contributor

mdcallag commented Jul 6, 2022

Revisiting what I posted ~2 weeks ago, some of those graphs are bogus. I am not sure why. The graphs from today look good when I compare them with the data in the text files.

@mdcallag
Contributor

mdcallag commented Jul 6, 2022

Graphs for throughput vs time at 1-second intervals using 32 client threads.

First, for a 50 MB/s write rate where stalls aren't a problem
g nt32 wmbps50 pre

g nt32 wmbps50 post

g nt32 wmbps50 post2

g nt32 wmbps50 nonadapt

@mdcallag
Contributor

mdcallag commented Jul 6, 2022

And then for a 60 MB/s write rate where stalls are an issue
g nt32 wmbps60 pre

g nt32 wmbps60 post

g nt32 wmbps60 post2

g nt32 wmbps60 nonadapt

@siying
Contributor Author

siying commented Jul 6, 2022

And then for a 60 MB/s write rate where stalls are an issue g nt32 wmbps60 pre

g nt32 wmbps60 post

g nt32 wmbps60 post2

g nt32 wmbps60 nonadapt

Just to clarify, is it also using delayed_write_rate = 8MB? If that is the case, then it's not surprising that throughput quickly drops to about 1/8 of the sustained rate and sometimes dips further. 8MB is about 1/7.5 of the fixed 60MB rate, so any time a slowdown condition triggers, throughput drops to that level and might go further down. If we zoom into the 8MB/s base range, the graphs show that "post2" is doing significantly better than "post": "post2" rarely goes more than one order of magnitude below 8MB/s, while "post" often goes much lower.

@mdcallag
Contributor

mdcallag commented Jul 7, 2022

The benchmark scripts don't set delayed_write_rate so the default, 8MB, is used. Confirmed by looking at LOG.

@siying
Contributor Author

siying commented Jul 7, 2022

The benchmark scripts don't set delayed_write_rate so the default, 8MB, is used. Confirmed by looking at LOG.
Indeed, db_bench has a default of 8MB: https://github.com/facebook/rocksdb/blob/main/tools/db_bench_tool.cc#L1343-L1345, which differs from RocksDB's default of 16MB (or the rate-limit bytes of options.rate_limiter if one is specified). It's not necessarily right or wrong, but it could explain the 8-fold drop when stalling happens.
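For anyone reproducing this, the slowdown rate can be set explicitly rather than relying on either default. db_bench exposes it as --delayed_write_rate, and in code it can be set on Options; a minimal sketch (option and flag names are real, the value and path are illustrative):

```cpp
#include "rocksdb/db.h"
#include "rocksdb/options.h"

// Sketch: set delayed_write_rate explicitly instead of relying on the
// db_bench default (8MB/s) or the RocksDB default (16MB/s).
int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  options.delayed_write_rate = 16 << 20;  // 16 MB/s while writes are being delayed

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/delayed_write_demo", &db);
  if (s.ok()) {
    delete db;
  }
  return s.ok() ? 0 : 1;
}
```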

siying added a commit to siying/rocksdb that referenced this pull request Aug 10, 2022
Summary:
facebook#10057 caused a regression bug: since the base level size is no longer adjusted based on L0 size, the L0 score might become very large. This makes compaction heavily favor L0->L1 compaction over L1->L2 compaction and, in some cases, causes data to get stuck in L1 without being moved down. We fix this by calculating the L0 score the same way as L1's, so that L0->L1 is favored if L0 is larger than L1.

Test Plan: run db_bench against data on tmpfs and verify that the behavior of data getting stuck in L1 goes away.
facebook-github-bot pushed a commit that referenced this pull request Aug 12, 2022
Summary:
#10057 caused a regression bug: since the base level size is no longer adjusted based on L0 size, the L0 score might become very large. This makes compaction heavily favor L0->L1 compaction over L1->L2 compaction and, in some cases, causes data to get stuck in L1 without being moved down. We fix this by calculating the L0 score as size(L0)/size(L1) in the case where L0 is large.

Pull Request resolved: #10518

Test Plan: run db_bench against data on tmpfs and verify that the behavior of data getting stuck in L1 goes away.

Reviewed By: ajkr

Differential Revision: D38603145

fbshipit-source-id: 4949e52dc28b54aacfe08417c6e6cc7e40a27225
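As an illustration of the fix described in this commit, here is a minimal sketch (not the actual patch; the function name and the file-count term are simplified assumptions) of scoring L0 relative to L1 when L0 is large:

```cpp
#include <algorithm>
#include <cstdint>

// Sketch of the L0 score fix described above (not the actual patch):
// when L0 holds a lot of data, score it by size(L0)/size(L1) so that
// L0->L1 compaction is strongly favored only while L0 is larger than L1.
double L0Score(uint64_t l0_bytes, uint64_t l1_bytes, uint64_t l0_file_count,
               uint64_t l0_file_num_compaction_trigger) {
  // File-count-based score, as before.
  double file_score = static_cast<double>(l0_file_count) /
                      static_cast<double>(l0_file_num_compaction_trigger);
  // Size-based score relative to L1 instead of a possibly tiny base target.
  double size_score = static_cast<double>(l0_bytes) /
                      static_cast<double>(std::max<uint64_t>(l1_bytes, 1));
  return std::max(file_score, size_score);
}
```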
@dongdongwcpp

The wiki should be updated since the target level size is no longer adjusted? @siying

facebook-github-bot pushed a commit that referenced this pull request Jun 16, 2023
…11525)

Summary:
After #11321 and #11340 (both included in RocksDB v8.2), migration from `level_compaction_dynamic_level_bytes=false` to `level_compaction_dynamic_level_bytes=true` is handled automatically by RocksDB and requires no manual compaction from the user. This makes the option true by default, as it has several advantages: 1. a better space-amplification guarantee (a more stable LSM shape); 2. compaction is more adaptive to write traffic; 3. automatic draining of unneeded levels. The wiki is updated with more detail: https://github.com/facebook/rocksdb/wiki/Leveled-Compaction#option-level_compaction_dynamic_level_bytes-and-levels-target-size.

The PR mostly contains fixes for unit tests, as they assumed `level_compaction_dynamic_level_bytes=false`. The most notable changes are commits f742be3 and b1928e4, which override the default option in DBTestBase to still set `level_compaction_dynamic_level_bytes=false` by default. This helps reduce the changes needed for unit tests. I think this default-option override in unit tests is okay, since the behavior of `level_compaction_dynamic_level_bytes=true` is tested by explicitly setting the option. Also, `level_compaction_dynamic_level_bytes=false` may be more desirable in unit tests, as it makes it easier to create a desired LSM shape.

The comment for option `level_compaction_dynamic_level_bytes` is updated to reflect this change and the change made in #10057.

Pull Request resolved: #11525

Test Plan: `make -j32 J=32 check` several times to try to catch flaky tests due to this option change.

Reviewed By: ajkr

Differential Revision: D46654256

Pulled By: cbi42

fbshipit-source-id: 6b5827dae124f6f1fdc8cca2ac6f6fcd878830e1
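For completeness, the option discussed in this commit can still be set explicitly per DB; a minimal sketch (the value shown is illustrative, e.g. to keep the old behavior in a test):

```cpp
#include "rocksdb/options.h"

// Sketch: override the new default of level_compaction_dynamic_level_bytes.
rocksdb::Options MakeOptions() {
  rocksdb::Options options;
  options.level_compaction_dynamic_level_bytes = false;  // old behavior
  return options;
}
```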
tabokie pushed a commit to tabokie/rocksdb that referenced this pull request Aug 14, 2023
…book#10057)

Summary:
The current level targets for dynamic leveling have a problem: the target level sizes change dramatically after an L0->L1 compaction. When there are many L0 bytes, lower-level compactions are delayed, but they resume after the L0->L1 compaction finishes, so the expected write-amplification benefits might not be realized. The proposal here is to stop adjusting the level target sizes and instead adjust the score for each level to prioritize the levels that most need compaction.
Basic idea:
(1) The target level size isn't adjusted, but the score is. The reasoning is that with parallel compactions, holding compactions back might not be desirable, but we would like compactions to be scheduled from the level that needs them most. For example, if we have an extra-large L2, we would like all compactions to be scheduled as L2->L3 compactions rather than L4->L5. This gets complicated when a large L0->L1 compaction is going on: should we compact L2->L3 or L4->L5? So the proposal for that is:
(2) The score is calculated as actual level size / (target size + estimated bytes coming down from upper levels). The reasoning is that if a large amount of pending L0/L1 bytes is coming down, compacting L2->L3 might be more expensive: once the L0 bytes are compacted down to L2, the actual L2->L3 fanout would change dramatically. On the other hand, by the time those bytes reach L5, the impact on the L5->L6 fanout is much smaller. So when calculating the score, we adjust it by adding the estimated downward bytes to the target level size.

Pull Request resolved: facebook#10057

Test Plan: Repurpose the VersionStorageInfoTest.MaxBytesForLevelDynamicWithLargeL0_* tests to cover this scenario.

Reviewed By: ajkr

Differential Revision: D37539742

fbshipit-source-id: 9c154cbfe92023f918cf5d80875d8776ad4831a4
Signed-off-by: tabokie <xy.tao@outlook.com>
tabokie pushed a commit to tabokie/rocksdb that referenced this pull request Aug 14, 2023
Summary:
facebook#10057 caused a regression bug: since the base level size is no longer adjusted based on L0 size, the L0 score might become very large. This makes compaction heavily favor L0->L1 compaction over L1->L2 compaction and, in some cases, causes data to get stuck in L1 without being moved down. We fix this by calculating the L0 score as size(L0)/size(L1) in the case where L0 is large.

Pull Request resolved: facebook#10518

Test Plan: run db_bench against data on tmpfs and verify that the behavior of data getting stuck in L1 goes away.

Reviewed By: ajkr

Differential Revision: D38603145

fbshipit-source-id: 4949e52dc28b54aacfe08417c6e6cc7e40a27225
Signed-off-by: tabokie <xy.tao@outlook.com>