
perf: L0->Lbase compactions not keeping up with flushing #203

Closed
petermattis opened this issue Aug 2, 2019 · 3 comments

petermattis (Collaborator) commented Aug 2, 2019

On a c5d.4xlarge instance, the pebble sync workload shows good write performance, but also a problematic behavior: L0->Lbase compactions are not keeping up with flushing, leading to an ever-growing number of files in L0.

~ ./pebble sync -c 100 -d 1m -w /mnt/data1/bench --batch 100 -v
...
level__files____size___score______in__ingest____move____read___write___w-amp
  WAL      4    92 M       -   5.8 G       -       -       -   5.8 G     1.0
    0    111   4.6 G   55.50   5.7 G     0 B     0 B     0 B   6.2 G     1.1
    1      0     0 B    0.00     0 B     0 B     0 B     0 B     0 B     0.0
    2      0     0 B    0.00     0 B     0 B     0 B     0 B     0 B     0.0
    3      0     0 B    0.00     0 B     0 B     0 B     0 B     0 B     0.0
    4      0     0 B    0.00     0 B     0 B     0 B     0 B     0 B     0.0
    5    374   1.4 G   23.20   1.5 G     0 B     0 B   2.8 G   2.8 G     1.9
    6     34   144 M    1.00   144 M     0 B     0 B   214 M   214 M     1.5
total    519   6.2 G    0.00   5.8 G     0 B     0 B   3.0 G    15 G     2.6

(I tweaked the Pebble options to set the L0 stop-writes threshold to 1000.)
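
For concreteness, that tweak is just an options change along these lines (a sketch rather than the exact benchmark code; the import path and field name are from the public pebble Options API and may differ slightly by repo vintage):

package main

import (
    "log"

    "github.com/petermattis/pebble"
)

func main() {
    // Sketch: raise the L0 stop-writes threshold so writes are not stalled
    // while L0 accumulates files. The value mirrors the tweak described above.
    opts := &pebble.Options{
        L0StopWritesThreshold: 1000,
    }
    db, err := pebble.Open("/mnt/data1/bench", opts)
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()
}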

What is happening is that Pebble sees the large number of L0 sstables and decides to compact them into Lbase (L5 in this case). The workload generates uniformly random keys, so the L0 sstables overlap all of Lbase. That means an L0->Lbase compaction will have 111+374 == 485 input sstables, totaling 6 GB. That compaction necessarily takes a long time, and while it is proceeding, further L0 tables build up. When the L0->Lbase compaction finishes, there are enough L0 tables to require another L0->Lbase compaction. The real casualty here is the L5->L6 compactions, which are starved out.
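
To make the overlap argument concrete, here is a toy sketch (not the actual compaction picker) of how the Lbase inputs get selected. With uniformly random keys, every L0 table spans essentially the whole keyspace, so the overlap check pulls in every Lbase file:

package main

import "fmt"

// fileMeta is a stand-in for an sstable's key range (illustration only).
type fileMeta struct {
    smallest, largest string
}

// overlappingInputs returns the Lbase files whose key ranges overlap
// [smallest, largest]. When an L0 table covers nearly the whole keyspace,
// this returns all of Lbase, and the compaction ends up with
// len(L0)+len(Lbase) inputs (111+374 == 485 above).
func overlappingInputs(lbase []fileMeta, smallest, largest string) []fileMeta {
    var out []fileMeta
    for _, f := range lbase {
        if f.largest >= smallest && f.smallest <= largest {
            out = append(out, f)
        }
    }
    return out
}

func main() {
    lbase := []fileMeta{{"a", "f"}, {"f", "m"}, {"m", "z"}}
    fmt.Println(len(overlappingInputs(lbase, "b", "y"))) // 3: every Lbase file is an input
}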

RocksDB somehow avoids this egregiously bad behavior, though I'm not quite sure how yet. It seems to be a combination of L0->L0 compactions and concurrent compactions. If I disable L0->L0 compactions, RocksDB sees the same behavior as Pebble. If I disable concurrent compactions, RocksDB sees the same behavior as Pebble. I suspect it is also related to the lower write throughput I see from RocksDB on this workload. An interesting side effect of L0->L0 compactions is that they lower the number of files in L0, which lowers the L0 compaction score. Perhaps that is what allows Lbase->Lbase+1 compactions to be scheduled.
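
As a rough model of why L0->L0 compactions lower the score: the L0 compaction score is driven primarily by file count relative to the compaction trigger, roughly as sketched below. The real formulas are more involved, so treat this as an approximation (it does happen to match the 55.50 above with a trigger of 2):

package main

import "fmt"

// Rough model only: merging many small L0 files into one large L0 file drops
// the score even though the bytes in L0 are unchanged.
func l0Score(numL0Files, l0CompactionTrigger int) float64 {
    return float64(numL0Files) / float64(l0CompactionTrigger)
}

func main() {
    fmt.Println(l0Score(111, 2)) // 55.5, the Pebble score in the table above
    fmt.Println(l0Score(16, 2))  // 8: far fewer, larger files -> much lower score
}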

A limitation both Pebble and RocksDB currently suffer from is that an L0->Lbase compaction locks out a concurrent Lbase->Lbase+1 compaction. This is mentioned in https://github.com/petermattis/pebble/issues/136.
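
The exclusion rule is roughly the following (a simplified sketch, not the actual Pebble or RocksDB logic):

package main

import "fmt"

// A level cannot participate in a new compaction while any of its files are
// already part of a running compaction. Because an L0->Lbase compaction marks
// the Lbase files as compacting, an Lbase->Lbase+1 compaction cannot start
// until it finishes.
type level struct {
    compactingFiles int
}

func canCompact(input, output level) bool {
    return input.compactingFiles == 0 && output.compactingFiles == 0
}

func main() {
    l5 := level{compactingFiles: 374} // all of L5 is an input to the running L0->L5 compaction
    l6 := level{}
    fmt.Println(canCompact(l5, l6)) // false: L5->L6 is locked out
}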

petermattis self-assigned this Aug 2, 2019
petermattis (Collaborator, Author) commented Aug 2, 2019

Ah, I think I understand. RocksDB is not escaping from this trap. The L0->L0 compactions were just obscuring the issue. (Concurrent compactions are needed for L0->L0 compactions to take place).

Here is what RocksDB looks like after a longer run:

~ ./pebble sync -c 100 -d 5m -w /mnt/data1/bench --batch 100 -v --rocksdb
...
level__files____size___score______in__ingest____move____read___write___w-amp
  WAL      0     0 B       -     0 B       -       -       -    17 G     0.0
    0     16   5.6 G    7.40    17 G     0 B     0 B    26 G    43 G     2.5
    1      0     0 B    0.00     0 B     0 B     0 B     0 B     0 B     0.0
    2      0     0 B    0.00     0 B     0 B     0 B     0 B     0 B     0.0
    3      0     0 B    0.00     0 B     0 B     0 B     0 B     0 B     0.0
    4      0     0 B    0.00     0 B     0 B     0 B     0 B     0 B     0.0
    5   2539    10 G    0.00    11 G     0 B     0 B    32 G    32 G     2.9
    6    222   888 M    0.00   922 M     0 B     0 B   1.7 G   1.7 G     1.9
total   2777    17 G    0.00    17 G     0 B     0 B    60 G    94 G     5.5

Only 16 tables in L0. Yay! Except notice that the 16 tables consume 5.6 GB of disk space. Each table is huge. Also notice that L5 is much larger than L6. Debug logs show that the last L5->L6 compaction happened 51s into the 5m run. After that, all RocksDB was doing was L0->L0 and L0->L5 compactions.

On the bright side, I now understand what is happening in both Pebble and RocksDB. I need to think more about what can be done. #136 is a possibility: allow compactions to draw from more than 2 levels, and also allow a flush to generate more than 1 sstable so that the L0 tables do not cover all of Lbase.
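
As a sketch of the flush-splitting half of that idea (illustrative only; a real implementation would have to handle sizes, boundaries, and seqnums much more carefully):

package main

import "fmt"

// splitFlush cuts the sorted keys of a flush at Lbase file boundaries so that
// each resulting L0 table overlaps only a slice of Lbase. A later L0->Lbase
// compaction can then pick a narrow key range instead of all of Lbase.
func splitFlush(keys []string, lbaseBoundaries []string) [][]string {
    var tables [][]string
    var cur []string
    i := 0
    for _, k := range keys {
        for i < len(lbaseBoundaries) && k > lbaseBoundaries[i] {
            if len(cur) > 0 {
                tables = append(tables, cur)
                cur = nil
            }
            i++
        }
        cur = append(cur, k)
    }
    if len(cur) > 0 {
        tables = append(tables, cur)
    }
    return tables
}

func main() {
    keys := []string{"b", "e", "h", "q", "x"} // sorted keys in the memtable
    boundaries := []string{"f", "m"}          // largest keys of the first two Lbase files
    fmt.Println(splitFlush(keys, boundaries)) // [[b e] [h] [q x]]
}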

@ajkr do you have any additional thoughts?

petermattis (Collaborator, Author) commented

It is odd that the RocksDB heuristics do not add another level at some point. On a whim, I disabled the "dynamic_level_bytes" option, which produced:

level__files____size___score______in__ingest____move____read___write___w-amp
  WAL      0     0 B       -     0 B       -       -       -    12 G     0.0
    0     11   1.8 G    4.10    12 G     0 B     0 B    12 G    24 G     2.0
    1    273   1.1 G    0.00    10 G     0 B     0 B    17 G    17 G     1.7
    2    298   1.4 G    2.20   5.8 G     0 B   3.4 G    13 G    13 G     2.3
    3    588   7.6 G    1.20   6.7 G     0 B   1.1 G    28 G    28 G     4.1
    4      0     0 B    0.00     0 B     0 B     0 B     0 B     0 B     0.0
    5      0     0 B    0.00     0 B     0 B     0 B     0 B     0 B     0.0
    6      0     0 B    0.00     0 B     0 B     0 B     0 B     0 B     0.0
total   1170    12 G    0.00    12 G     0 B   4.5 G    70 G    94 G     7.7

The structure of this LSM looks more sane than the one above, yet L0 is still too large in terms of both size and file count. The overall throughput was also significantly lower in this run. It's interesting how the compaction heuristics get stuck in corners they can't seem to break out of.

mwang1026 commented

Triaging with @jbowens; we vote to close.
