
perf: L0->Lbase compactions not keeping up with flushing #203

Closed
petermattis opened this issue Aug 2, 2019 · 3 comments

petermattis (Collaborator) commented Aug 2, 2019

On a c5d.4xlarge instance, the pebble sync workload shows good write performance, but also a problematic behavior: L0->Lbase compactions are not keeping up with flushing, leading to an ever-growing number of files in L0.

~ ./pebble sync -c 100 -d 1m -w /mnt/data1/bench --batch 100 -v
...
level__files____size___score______in__ingest____move____read___write___w-amp
  WAL      4    92 M       -   5.8 G       -       -       -   5.8 G     1.0
    0    111   4.6 G   55.50   5.7 G     0 B     0 B     0 B   6.2 G     1.1
    1      0     0 B    0.00     0 B     0 B     0 B     0 B     0 B     0.0
    2      0     0 B    0.00     0 B     0 B     0 B     0 B     0 B     0.0
    3      0     0 B    0.00     0 B     0 B     0 B     0 B     0 B     0.0
    4      0     0 B    0.00     0 B     0 B     0 B     0 B     0 B     0.0
    5    374   1.4 G   23.20   1.5 G     0 B     0 B   2.8 G   2.8 G     1.9
    6     34   144 M    1.00   144 M     0 B     0 B   214 M   214 M     1.5
total    519   6.2 G    0.00   5.8 G     0 B     0 B   3.0 G    15 G     2.6

(I tweaked the Pebble options to set the L0 stop-writes threshold to 1000.)
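
For concreteness, that tweak is just an options change along these lines (a sketch rather than the exact benchmark code; the import path and field name are from the public pebble Options API and may differ slightly by repo vintage):

package main

import (
    "log"

    "github.com/petermattis/pebble"
)

func main() {
    // Sketch: raise the L0 stop-writes threshold so writes are not stalled
    // while L0 accumulates files. The value mirrors the tweak described above.
    opts := &pebble.Options{
        L0StopWritesThreshold: 1000,
    }
    db, err := pebble.Open("/mnt/data1/bench", opts)
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()
}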

What is happening is that Pebble sees the large number of L0 sstables and decides to compact them into Lbase (L5 in this case). The workload generates uniformly random keys, so the L0 sstables overlap all of Lbase. That means an L0->Lbase compaction will have 111+374 == 485 input sstables, totaling 6 GB. That compaction necessarily takes a long time, and while it is proceeding, further L0 tables build up. When the L0->Lbase compaction finishes, there are enough L0 tables to require another L0->Lbase compaction. The real casualty here is the L5->L6 compactions, which are starved out.
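
To make the overlap argument concrete, here is a toy sketch (not the actual compaction picker) of how the Lbase inputs get selected. With uniformly random keys, every L0 table spans essentially the whole keyspace, so the overlap check pulls in every Lbase file:

package main

import "fmt"

// fileMeta is a stand-in for an sstable's key range (illustration only).
type fileMeta struct {
    smallest, largest string
}

// overlappingInputs returns the Lbase files whose key ranges overlap
// [smallest, largest]. When an L0 table covers nearly the whole keyspace,
// this returns all of Lbase, and the compaction ends up with
// len(L0)+len(Lbase) inputs (111+374 == 485 above).
func overlappingInputs(lbase []fileMeta, smallest, largest string) []fileMeta {
    var out []fileMeta
    for _, f := range lbase {
        if f.largest >= smallest && f.smallest <= largest {
            out = append(out, f)
        }
    }
    return out
}

func main() {
    lbase := []fileMeta{{"a", "f"}, {"f", "m"}, {"m", "z"}}
    fmt.Println(len(overlappingInputs(lbase, "b", "y"))) // 3: every Lbase file is an input
}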

RocksDB somehow avoids this egregiously bad behavior, though I'm not quite sure how yet. It seems to be a combination of L0->L0 compactions and concurrent compactions. If I disable L0->L0 compactions, RocksDB sees the same behavior as Pebble. If I disable concurrent compactions, RocksDB sees the same behavior as Pebble. I suspect it is also related to the lower write throughput I see from RocksDB on this workload. An interesting side effect of L0->L0 compactions is that they lower the number of files in L0, which lowers the L0 compaction score. Perhaps that is what allows Lbase->Lbase+1 compactions to be scheduled.
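
As a rough model of why L0->L0 compactions lower the score: the L0 compaction score is driven primarily by file count relative to the compaction trigger, roughly as sketched below. The real formulas are more involved, so treat this as an approximation (it does happen to match the 55.50 above with a trigger of 2):

package main

import "fmt"

// Rough model only: merging many small L0 files into one large L0 file drops
// the score even though the bytes in L0 are unchanged.
func l0Score(numL0Files, l0CompactionTrigger int) float64 {
    return float64(numL0Files) / float64(l0CompactionTrigger)
}

func main() {
    fmt.Println(l0Score(111, 2)) // 55.5, the Pebble score in the table above
    fmt.Println(l0Score(16, 2))  // 8: far fewer, larger files -> much lower score
}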

A limitation both Pebble and RocksDB currently suffer from is that an L0->Lbase compaction locks out a concurrent Lbase->Lbase+1 compaction. This is mentioned in https://github.com/petermattis/pebble/issues/136.
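
The exclusion rule is roughly the following (a simplified sketch, not the actual Pebble or RocksDB logic):

package main

import "fmt"

// A level cannot participate in a new compaction while any of its files are
// already part of a running compaction. Because an L0->Lbase compaction marks
// the Lbase files as compacting, an Lbase->Lbase+1 compaction cannot start
// until it finishes.
type level struct {
    compactingFiles int
}

func canCompact(input, output level) bool {
    return input.compactingFiles == 0 && output.compactingFiles == 0
}

func main() {
    l5 := level{compactingFiles: 374} // all of L5 is an input to the running L0->L5 compaction
    l6 := level{}
    fmt.Println(canCompact(l5, l6)) // false: L5->L6 is locked out
}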

petermattis self-assigned this Aug 2, 2019
petermattis (Collaborator, Author) commented Aug 2, 2019

Ah, I think I understand. RocksDB is not escaping from this trap. The L0->L0 compactions were just obscuring the issue. (Concurrent compactions are needed for L0->L0 compactions to take place).

Here is what RocksDB looks like after a longer run:

~ ./pebble sync -c 100 -d 5m -w /mnt/data1/bench --batch 100 -v --rocksdb
...
level__files____size___score______in__ingest____move____read___write___w-amp
  WAL      0     0 B       -     0 B       -       -       -    17 G     0.0
    0     16   5.6 G    7.40    17 G     0 B     0 B    26 G    43 G     2.5
    1      0     0 B    0.00     0 B     0 B     0 B     0 B     0 B     0.0
    2      0     0 B    0.00     0 B     0 B     0 B     0 B     0 B     0.0
    3      0     0 B    0.00     0 B     0 B     0 B     0 B     0 B     0.0
    4      0     0 B    0.00     0 B     0 B     0 B     0 B     0 B     0.0
    5   2539    10 G    0.00    11 G     0 B     0 B    32 G    32 G     2.9
    6    222   888 M    0.00   922 M     0 B     0 B   1.7 G   1.7 G     1.9
total   2777    17 G    0.00    17 G     0 B     0 B    60 G    94 G     5.5

Only 16 tables in L0. Yay! Except notice that the 16 tables consume 5.6 GB of disk space. Each table is huge. Also notice that L5 is much larger than L6. Debug logs show that the last L5->L6 compaction happened 51s into the 5m run. After that, all RocksDB was doing was L0->L0 and L0->L5 compactions.

On the bright side, I now understand what is happening in both Pebble and RocksDB. I need to think more about what can be done. #136 is a possibility: allow compactions to draw from more than 2 levels, and also allow a flush to generate more than 1 sstable so that the L0 tables do not cover all of Lbase.
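
As a sketch of the flush-splitting half of that idea (illustrative only; a real implementation would have to handle sizes, boundaries, and seqnums much more carefully):

package main

import "fmt"

// splitFlush cuts the sorted keys of a flush at Lbase file boundaries so that
// each resulting L0 table overlaps only a slice of Lbase. A later L0->Lbase
// compaction can then pick a narrow key range instead of all of Lbase.
func splitFlush(keys []string, lbaseBoundaries []string) [][]string {
    var tables [][]string
    var cur []string
    i := 0
    for _, k := range keys {
        for i < len(lbaseBoundaries) && k > lbaseBoundaries[i] {
            if len(cur) > 0 {
                tables = append(tables, cur)
                cur = nil
            }
            i++
        }
        cur = append(cur, k)
    }
    if len(cur) > 0 {
        tables = append(tables, cur)
    }
    return tables
}

func main() {
    keys := []string{"b", "e", "h", "q", "x"} // sorted keys in the memtable
    boundaries := []string{"f", "m"}          // largest keys of the first two Lbase files
    fmt.Println(splitFlush(keys, boundaries)) // [[b e] [h] [q x]]
}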

@ajkr do you have any additional thoughts?

petermattis (Collaborator, Author) commented

It is odd that the RocksDB heuristics do not add another level at some point. On a whim, I disabled the "dynamic_level_bytes" option, which produced:

level__files____size___score______in__ingest____move____read___write___w-amp
  WAL      0     0 B       -     0 B       -       -       -    12 G     0.0
    0     11   1.8 G    4.10    12 G     0 B     0 B    12 G    24 G     2.0
    1    273   1.1 G    0.00    10 G     0 B     0 B    17 G    17 G     1.7
    2    298   1.4 G    2.20   5.8 G     0 B   3.4 G    13 G    13 G     2.3
    3    588   7.6 G    1.20   6.7 G     0 B   1.1 G    28 G    28 G     4.1
    4      0     0 B    0.00     0 B     0 B     0 B     0 B     0 B     0.0
    5      0     0 B    0.00     0 B     0 B     0 B     0 B     0 B     0.0
    6      0     0 B    0.00     0 B     0 B     0 B     0 B     0 B     0.0
total   1170    12 G    0.00    12 G     0 B   4.5 G    70 G    94 G     7.7

The structure of this LSM looks more sane than the one above, yet L0 is still too large in terms of both size and file count. The overall throughput was also significantly lower in this run. It's interesting how the compaction heuristics get stuck in corners they can't seem to break out of.

mwang1026 commented

Triaging with @jbowens; we vote to close.
