perf: read-based compaction heuristic #29

Closed
petermattis opened this issue Dec 20, 2018 · 31 comments · Fixed by #1009

@petermattis
Collaborator

Re-introduce a read-based compaction heuristic. LevelDB had a heuristic to trigger a compaction of a table if that table was being read frequently and tables from multiple levels were involved in the read. This essentially reduced read-amplification in read-heavy workloads. A read-based compaction is only selected if a size-based compaction is not needed.

In addition to reducing read-amplification, a read compaction would also squash deletion tombstones, so we'd get a scan performance benefit from not needing to skip over them.

@petermattis
Collaborator Author

The read-based compaction heuristic works by maintaining a per-sstable counter of the number of allowed seeks on the sstable before it will be compacted. When an sstable is first created, this counter is populated as:

  // We arrange to automatically compact this file after
  // a certain number of seeks.  Let's assume:
  //   (1) One seek costs 10ms
  //   (2) Writing or reading 1MB costs 10ms (100MB/s)
  //   (3) A compaction of 1MB does 25MB of IO:
  //         1MB read from this level
  //         10-12MB read from next level (boundaries may be misaligned)
  //         10-12MB written to next level
  // This implies that 25 seeks cost the same as the compaction
  // of 1MB of data.  I.e., one seek costs approximately the
  // same as the compaction of 40KB of data.  We are a little
  // conservative and allow approximately one seek for every 16KB
  // of data before triggering a compaction.
  f->allowed_seeks = (f->file_size / 16384);
  if (f->allowed_seeks < 100) f->allowed_seeks = 100;

During iteration, LevelDB randomly samples the keys with a period of 1MB and updates the seek stats whenever there are 2 or more overlapping files for a particular key. Only the allowed_seeks field for the newest table is decremented, which seems reasonable as that is likely the smallest sstable and thus the quickest one to compact.
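
For illustration, here is a minimal Go sketch of that mechanism. The names (tableFile, sampleRead, etc.) are hypothetical and mirror the LevelDB logic quoted above, not Pebble's eventual implementation:

  package readsampling

  // bytesPerSample approximates LevelDB's behavior of sampling
  // roughly one key per 1MB of data read.
  const bytesPerSample = 1 << 20

  type tableFile struct {
      size         uint64
      allowedSeeks int64
  }

  func newTableFile(size uint64) *tableFile {
      f := &tableFile{size: size}
      // One allowed seek per 16KB of data, floored at 100, per the
      // LevelDB comment above.
      f.allowedSeeks = int64(size / 16384)
      if f.allowedSeeks < 100 {
          f.allowedSeeks = 100
      }
      return f
  }

  // sampleRead is invoked for a sampled key. overlapping holds the
  // files containing the key, ordered newest first. When two or more
  // files overlap, the newest file (likely the smallest, and thus the
  // cheapest to compact) is charged a seek; once its budget is
  // exhausted it is returned as a read-compaction candidate.
  func sampleRead(overlapping []*tableFile) *tableFile {
      if len(overlapping) < 2 {
          return nil
      }
      top := overlapping[0]
      top.allowedSeeks--
      if top.allowedSeeks <= 0 {
          return top
      }
      return nil
  }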

@tbg
Member

tbg commented May 21, 2019

By re-introduce, are you implying that RocksDB doesn't use this one (if so, any idea why not)? Or was this once present in pebble and was removed for some reason?

@ajkr
Contributor

ajkr commented May 21, 2019

It was in LevelDB but dropped from RocksDB. I do not know why it was originally dropped. As a random guess, that comment looks like it's all about disks and assumes reads and writes are dominated by seek cost, which isn't true anymore: nowadays writes wear out your flash and reads don't, so there's no clear crossover point where compacting becomes the clearly better choice. Periodically we considered reimplementing it but couldn't think of use cases that would benefit besides read-only benchmarks, so it was never prioritized.

@petermattis
Collaborator Author

@tbg To echo what @ajkr said, RocksDB dropped this heuristic from LevelDB. While Pebble was based on a Go port of LevelDB, it never had the heuristic (the port had not yet implemented it).

I'd definitely want a read-based compaction heuristic to be motivated by a workload which would benefit from it. My guess is that a read-heavy workload would. Perhaps I'm wrong about that.

@aadityasondhi
Contributor

aadityasondhi commented Nov 2, 2020

Results of some early YCSB-A runs with the changes from #968:

name                old ops/sec  new ops/sec  delta
ycsb/A/values=1000     158 ± 0%     151 ± 0%   ~     (p=1.000 n=1+1)

name                old read     new read     delta
ycsb/A/values=1000    191M ± 0%    397M ± 0%   ~     (p=1.000 n=1+1)

name                old write    new write    delta
ycsb/A/values=1000    846M ± 0%   1049M ± 0%   ~     (p=1.000 n=1+1)

name                old r-amp    new r-amp    delta
ycsb/A/values=1000    7.45 ± 0%    7.46 ± 0%   ~     (p=1.000 n=1+1)

name                old w-amp    new w-amp    delta
ycsb/A/values=1000    2.51 ± 0%    3.13 ± 0%   ~     (p=1.000 n=1+1)

bench options:

duration: 300s
initial-keys: 300000 
concurrency: 4
workload: A

As discussed in storage morning sync, I am going to try running YCSB-C and YCSB-E with a large set of data. When I ran YCSB-C initially, I used initial-keys: 10000 and it was not triggering any read compactions.

@petermattis
Collaborator Author

@aadityasondhi It would be good to include the LSM metrics so we can see how many levels there are during these runs. Also, are you measuring how often the read-triggered compactions are being performed? I can't tell from these numbers whether they are being performed at all. YCSB-A is 50% updates, 50% reads. I'd expect YCSB-C (100% reads) to provide the clearest signal of whether there is an improvement here. One thing that will be interesting is to track the ops/sec of YCSB-C over time, as the expectation is that ops/sec should increase as the compactions are performed. I usually dump raw numbers into a Google sheet for one-off analysis like this.

@aadityasondhi
Contributor

Running YCSB-C with larger sets of data uncovered a bug in the read compaction logic where level mismatches were causing the stats to be off. After fixing that, here are some of the results:

Read compaction branch:

Benchmarkycsb/C/values=1000 34761674  115871.3 ops/sec  531188250 read  1094678068 write  5.72 r-amp  3.52 w-amp

__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp
    WAL         1    57 M       -   297 M       -       -       -       -   297 M       -       -       -     1.0
      0         0     0 B    0.00   240 M     0 B       0     0 B       0   244 M   4.2 K     0 B       0     1.0
      1         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      2         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      3         1   7.0 K    0.01   2.0 M     0 B       0   185 M   3.1 K   1.8 M      15   2.0 M       1     0.9
      4        25   4.0 M    0.66   6.4 M     0 B       0   238 M   3.9 K    13 M      92    13 M       1     2.1
      5        95    30 M    0.82    25 M     0 B       0   215 M   3.5 K    83 M     314    84 M       1     3.3
      6       346   207 M       -   161 M     0 B       0    49 M     751   404 M   1.1 K   408 M       1     2.5
  total       467   240 M       -   297 M     0 B       0   687 M    11 K   1.0 G   5.8 K   507 M       4     3.5
  flush         2
compact     12481     0 B             0 B  (size == estimated-debt, in = in-progress-bytes)
 memtbl         1    64 M
zmemtbl         0     0 B
   ztbl        20   5.3 M
 bcache     9.7 K   246 M  100.0%  (score == hit-rate)
 tcache       487   293 K  100.0%  (score == hit-rate)
 titers         0
 filter         -       -    0.0%  (score == utility)

master branch:

Benchmarkycsb/C/values=1000 70481011  234936.6 ops/sec  213661966 read  778248398 write  6.16 r-amp  2.50 w-amp

__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp
    WAL         1    57 M       -   297 M       -       -       -       -   297 M       -       -       -     1.0
      0         0     0 B    0.00   240 M     0 B       0     0 B       0   244 M   4.2 K     0 B       0     1.0
      1         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      2         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      3         8   925 K    0.90   3.1 M     0 B       0   184 M   3.3 K   3.1 M      25   3.1 M       1     1.0
      4        77   6.0 M    0.99   5.3 M     0 B       0   238 M   4.1 K   8.3 M      59   8.3 M       1     1.6
      5       698    36 M    1.00     0 B     0 B       0   237 M   4.1 K     0 B       0     0 B       1     0.0
      6       852   198 M       -   145 M     0 B       0    56 M     852   190 M     724   192 M       1     1.3
  total      1635   242 M       -   297 M     0 B       0   716 M    12 K   742 M   5.0 K   204 M       4     2.5
  flush         2
compact     13061     0 B             0 B  (size == estimated-debt, in = in-progress-bytes)
 memtbl         1    64 M
zmemtbl         0     0 B
   ztbl         0     0 B
 bcache      14 K   241 M  100.0%  (score == hit-rate)
 tcache     1.6 K   984 K  100.0%  (score == hit-rate)
 titers        12
 filter         -       -    0.0%  (score == utility)

benchstat comparison:

name                old ops/sec  new ops/sec  delta
ycsb/C/values=1000    235k ± 0%    116k ± 0%   ~     (p=1.000 n=1+1)

name                old read     new read     delta
ycsb/C/values=1000    214M ± 0%    531M ± 0%   ~     (p=1.000 n=1+1)

name                old write    new write    delta
ycsb/C/values=1000    778M ± 0%   1095M ± 0%   ~     (p=1.000 n=1+1)

name                old r-amp    new r-amp    delta
ycsb/C/values=1000    6.16 ± 0%    5.72 ± 0%   ~     (p=1.000 n=1+1)

name                old w-amp    new w-amp    delta
ycsb/C/values=1000    2.50 ± 0%    3.52 ± 0%   ~     (p=1.000 n=1+1)

There seems to be a slight improvement in r-amp but a regression in other places. I am currently running a longer test with more data to hopefully produce more conclusive results; I will update this comment with them.

@petermattis
Collaborator Author

That is quite a significant perf hit. Looking only at the final ops/sec value doesn't give a full picture. I'll reiterate that it will be useful to dump the per-second ops/sec output to a Google sheet and graph it for the full run.

It is curious how many fewer sstables there are in your read-triggered branch vs the master branch runs: 467 vs 1635.

It would be worthwhile to verify that the sampling you've added to iterators isn't having a negative impact on read performance. One way to do this is to leave the sampling in place, but to disable the read-triggered compaction suggestions.

@aadityasondhi
Contributor

Here is a google sheet with the results of ops/sec over the 30min bench period: https://docs.google.com/spreadsheets/d/1UiYY68nEBOh12KDOOuiq2IoxLht5K5CUE3uTso4kI4E

I was also suspecting the same thing regarding the sampling. I will do a quick test to see if reduced and/or no sampling improves the read performance. I am also setting up a GCE worker for the tests so that any local machine inconsistencies don't affect the results.

@petermattis
Collaborator Author

Here is a google sheet with the results of ops/sec over the 30min bench period: https://docs.google.com/spreadsheets/d/1UiYY68nEBOh12KDOOuiq2IoxLht5K5CUE3uTso4kI4E

How long were the runs above? I'm assuming this is for the read-triggered compaction branch. You'll also want a run for the master branch as comparison.

I am also setting up a gce worker to do the tests so that any local machine inconsistencies don't affect the results.

Oh, definitely do this. Or use roachprod to create machines to test on. Benchmarking on your laptop is very tricky to do right, and such results should always be viewed with suspicion. I'd personally use roachprod for this purpose. You can build the pebble Linux binary locally using build/builder.sh.

@aadityasondhi
Contributor

How long were the runs above? I'm assuming this is for the read-triggered compaction branch. You'll also want a run for the master branch as comparison.

They were 5 minutes each; I am currently running the master branch for 30 minutes and will add the results to the sheet.

@aadityasondhi
Contributor

Update on progress/findings since the previous comments (for more visibility):

I was running into a few issues trying to run the benchmark using roachprod and roachtest with the local changes to pebble. I kept getting the following error message:

(1) /Users/aadityas/go/src/github.com/cockroachdb/cockroach/bin/roachprod run aadityas-1604520288-01-n5cpu16:1-5 -- rm -fr $(dirname {store-dir})/bench && tar xPf $(dirname {store-dir})/data.tar && ./pebble bench ycsb $(dirname {store-dir})/bench --workload=A --concurrency=256 --values=64 --keys=zipf --initial-keys=0 --prepopulated-keys=10000000 --cache=4294967296 --duration=10m0s > ycsb.log 2>&1 returned
  | stderr:
  | Error: COMMAND_PROBLEM: exit status 1
  | (1) COMMAND_PROBLEM
  | Wraps: (2) Node 1. Command with error:
  |   | ```
  |   | rm -fr $(dirname {store-dir})/bench && tar xPf $(dirname {store-dir})/data.tar && ./pebble bench ycsb $(dirname {store-dir})/bench --workload=A --concurrency=256 --values=64 --keys=zipf --initial-keys=0 --prepopulated-keys=10000000 --cache=4294967296 --duration=10m0s > ycsb.log 2>&1
  |   | ```
  | Wraps: (3) exit status 1
  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
  |
  | stdout:
  | aadityas-1604520288-01-n5cpu16: rm -fr $(dirname {store-dir..........
  |    1:
  | COMMAND_PROBLEM: exit status 1
  |    2:
  | COMMAND_PROBLEM: exit status 1
  |    3:
  | COMMAND_PROBLEM: exit status 1
  |    4:
  | COMMAND_PROBLEM: exit status 1
  |    5:
  | COMMAND_PROBLEM: exit status 1

Initially, I suspected something was wrong with the way I was building the binary and/or setting up the tests. I tried a few different things with no luck. Jackson helped me debug this by sshing into one of the cluster nodes, where we found this error message:

L4->L5: 001103 not being compacted

I ran the benchmark with the same data using the master build and it worked fine, so this seems to be a bug in the changes from #968. This error comes from: https://github.com/aadityasondhi/pebble/blob/read-triggered-compactions/compaction.go#L1131.

I am going to add some debug statements to find what is causing this and why it was not happening when running the same benchmark locally earlier.

@aadityasondhi
Contributor

aadityasondhi commented Nov 6, 2020

I was able to run some more benchmarks using roachtests with the following settings:

test: pebble/ycsb/size=1024
cloud: aws
duration: 30m
workloads: C, E
nodes: 5
prepopulated-keys: 10000000
concurrency: 256

The results, however, are a little strange. I ran the same tests twice to make sure I was not making any errors in configuring them.

master branch vs read compaction:

benchstat master.txt read.txt
name                old ops/sec  new ops/sec  delta
ycsb/C/values=1024    619k ± 6%    322k ±22%  -47.90%  (p=0.008 n=5+5)
ycsb/E/values=1024   54.8k ± 8%   11.5k ±18%  -79.00%  (p=0.008 n=5+5)

name                old read     new read     delta
ycsb/C/values=1024   44.2G ± 4%   57.3G ± 1%  +29.60%  (p=0.008 n=5+5)
ycsb/E/values=1024    111G ± 6%     26G ±18%  -76.68%  (p=0.008 n=5+5)

name                old write    new write    delta
ycsb/C/values=1024   44.2G ± 4%   57.3G ± 1%  +29.61%  (p=0.008 n=5+5)
ycsb/E/values=1024    122G ± 6%     28G ±17%  -76.90%  (p=0.008 n=5+5)

name                old r-amp    new r-amp    delta
ycsb/C/values=1024    4.31 ± 0%    4.89 ± 4%  +13.41%  (p=0.008 n=5+5)
ycsb/E/values=1024    10.0 ± 4%    18.2 ±11%  +82.32%  (p=0.008 n=5+5)

name                old w-amp    new w-amp    delta
ycsb/C/values=1024    0.00         0.00          ~     (all equal)
ycsb/E/values=1024    22.8 ± 2%    25.0 ± 6%   +9.87%  (p=0.008 n=5+5)

As suggested by @petermattis, I also ran another set of tests with sampling turned on but read compaction turned off to isolate any performance regression due to just the sampling portion:

 benchstat read.txt sampling-with-no-compaction.txt
name                old ops/sec  new ops/sec  delta
ycsb/C/values=1024    322k ±22%    262k ±13%      ~     (p=0.095 n=5+5)
ycsb/E/values=1024   11.5k ±18%   25.4k ±23%  +120.95%  (p=0.008 n=5+5)

name                old read     new read     delta
ycsb/C/values=1024   57.3G ± 1%   45.5G ± 3%   -20.52%  (p=0.008 n=5+5)
ycsb/E/values=1024   25.9G ±18%   72.2G ±10%  +178.35%  (p=0.008 n=5+5)

name                old write    new write    delta
ycsb/C/values=1024   57.3G ± 1%   45.5G ± 3%   -20.52%  (p=0.008 n=5+5)
ycsb/E/values=1024   28.1G ±17%   77.1G ±11%  +173.95%  (p=0.008 n=5+5)

name                old r-amp    new r-amp    delta
ycsb/C/values=1024    4.89 ± 4%    4.91 ± 2%      ~     (p=0.841 n=5+5)
ycsb/E/values=1024    18.2 ±11%    16.1 ±15%      ~     (p=0.151 n=5+5)

name                old w-amp    new w-amp    delta
ycsb/C/values=1024    0.00         0.00           ~     (all equal)
ycsb/E/values=1024    25.0 ± 6%    31.3 ±11%   +25.25%  (p=0.008 n=5+5)

Based on this, it does seem that the sampling itself causes some slowdown in ops/sec on a read-only workload (C), but performing read-triggered compactions slows down ops/sec in a mixed workload (E). We might need to find a balance between the sampling rate and the threshold for triggering read compactions to minimize these regressions.

end state of master branch:

Benchmarkycsb/C/values=1024 1114854891  619359.0 ops/sec  43555147618 read  43537271453 write  4.30 r-amp  0.00 w-amp

__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp
    WAL         1     0 B       -     0 B       -       -       -       -     0 B       -       -       -     0.0
      0         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      1         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      2         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      3        19    62 M    0.97   3.6 G     0 B       0     0 B       0   6.9 G   1.8 K   6.9 G       1     1.9
      4        71   333 M    1.00   2.5 G     0 B       0   1.9 G     477   5.8 G     860   5.8 G       1     2.3
      5       176   1.7 G    0.99   4.3 G     0 B       0   948 M     200    10 G     879    10 G       1     2.4
      6       309   7.8 G       -   5.3 G     0 B       0     0 B       0    17 G     712    17 G       1     3.3
  total       575   9.9 G       -     0 B     0 B       0   2.8 G     677    40 G   4.2 K    41 G       4     0.0
  flush         0
compact      2146     0 B           144 M  (size == estimated-debt, in = in-progress-bytes)
 memtbl         1   256 K
zmemtbl         0     0 B
   ztbl         0     0 B
 bcache     210 K   4.0 G   97.4%  (score == hit-rate)
 tcache       575   346 K  100.0%  (score == hit-rate)
 titers       552
 filter         -       -    0.0%  (score == utility)

Benchmarkycsb/E/values=1024 95245449  52914.0 ops/sec  107536846858 read  117679213825 write  9.89 r-amp  22.78 w-amp

__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp
    WAL         1    53 M       -   4.8 G       -       -       -       -   4.8 G       -       -       -     1.0
      0        12    24 M    0.93   4.8 G     0 B       0     0 B       0   4.7 G   3.2 K     0 B       1     1.0
      1         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      2         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      3        21    72 M    1.00   8.3 G     0 B       0   6.0 M       3    15 G   4.1 K    15 G       1     1.8
      4        59   402 M    1.07   6.7 G     0 B       0   2.3 G     630    20 G   3.1 K    20 G       1     3.0
      5       172   2.2 G    1.00   8.7 G     0 B       0   1.3 G     276    29 G   2.4 K    29 G       1     3.4
      6       454    12 G       -   9.5 G     0 B       0     0 B       0    36 G   1.5 K    36 G       1     3.8
  total       718    14 G       -   4.8 G     0 B       0   3.6 G     909   110 G    14 K   100 G       5    22.8
  flush        77
compact      4885   908 M           163 M  (size == estimated-debt, in = in-progress-bytes)
 memtbl         1    64 M
zmemtbl         0     0 B
   ztbl         0     0 B
 bcache     257 K   3.9 G   91.0%  (score == hit-rate)
 tcache       718   432 K  100.0%  (score == hit-rate)
 titers      1092
 filter         -       -    0.0%  (score == utility)

end state of read compaction branch:

Benchmarkycsb/C/values=1024 630542930  350299.6 ops/sec  56664784685 read  56642231361 write  4.72 r-amp  0.00 w-amp

__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp
    WAL         1     0 B       -     0 B       -       -       -       -     0 B       -       -       -     0.0
      0         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      1         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      2         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      3         8    29 M    0.45   3.7 G     0 B       0     0 B       0   6.8 G   1.8 K   6.8 G       1     1.8
      4        15    93 M    0.28   2.9 G     0 B       0   1.7 G     426   8.1 G   1.2 K   8.1 G       1     2.8
      5        56   676 M    0.39   4.9 G     0 B       0   872 M     178    14 G   1.1 K    14 G       1     2.9
      6       357   9.1 G       -   6.8 G     0 B       0     0 B       0    24 G     984    24 G       1     3.5
  total       436   9.9 G       -     0 B     0 B       0   2.5 G     604    53 G   5.1 K    53 G       4     0.0
  flush         0
compact      2345     0 B           205 M  (size == estimated-debt, in = in-progress-bytes)
 memtbl         1   256 K
zmemtbl         0     0 B
   ztbl         0     0 B
 bcache     213 K   4.0 G   98.0%  (score == hit-rate)
 tcache       436   262 K  100.0%  (score == hit-rate)
 titers       671
 filter         -       -    0.0%  (score == utility)

Benchmarkycsb/E/values=1024 18809706  10449.7 ops/sec  21894373100 read  23879267215 write  16.73 r-amp  23.43 w-amp

__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp
    WAL         1    22 M       -   962 M       -       -       -       -   972 M       -       -       -     1.0
      0      1355   384 M    0.08   950 M     0 B       0     0 B       0   934 M   2.6 K     0 B      10     1.0
      1         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      2         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      3      1094   4.3 G   41.12   4.3 G     0 B       0     0 B       0    14 G   3.5 K    14 G       1     3.2
      4       109   602 M    1.80   480 M     0 B       0    96 M      24   1.0 G     148   1.0 G       1     2.1
      5       192   1.8 G    0.97   1.0 G     0 B       0   221 M      48   2.2 G     191   2.2 G       1     2.2
      6       188   3.8 G       -   1.3 G     0 B       0     0 B       0   3.5 G     154   3.5 G       1     2.7
  total      2938    11 G       -   972 M     0 B       0   317 M      72    22 G   6.6 K    20 G      14    23.4
  flush        13
compact       403    24 G           1.4 G  (size == estimated-debt, in = in-progress-bytes)
 memtbl         1    64 M
zmemtbl         0     0 B
   ztbl       284   800 M
 bcache     222 K   3.9 G   93.4%  (score == hit-rate)
 tcache     3.0 K   1.8 M  100.0%  (score == hit-rate)
 titers      3103
 filter         -       -    0.0%  (score == utility)

I am not sure if my test duration is too low or if other factors are coming into play here. The increase in r-amp seems to contradict the previous tests on my local machine as well. It is important to note that the number of compactions is significantly lower on the read compaction branch for workload E. Seems like the read compaction logic is interfering with the regular score-based compactions. I am considering trying an approach where read-triggered compactions are only checked for if there are no other compactions possible.

Edit: see https://docs.google.com/spreadsheets/d/1UiYY68nEBOh12KDOOuiq2IoxLht5K5CUE3uTso4kI4E/edit#gid=239494140 for charts comparing ops/sec over time for each of the tests I ran.

@jbowens
Collaborator

jbowens commented Nov 6, 2020

Do you mind adding the output of benchstat master.txt sampling-with-no-compaction.txt?

@aadityasondhi
Contributor

benchstat master.txt sampling-with-no-compaction.txt
name                old ops/sec  new ops/sec  delta
ycsb/C/values=1024    619k ± 6%    262k ±13%  -57.64%  (p=0.008 n=5+5)
ycsb/E/values=1024   54.8k ± 8%   25.4k ±23%  -53.59%  (p=0.008 n=5+5)

name                old read     new read     delta
ycsb/C/values=1024   44.2G ± 4%   45.5G ± 3%     ~     (p=0.151 n=5+5)
ycsb/E/values=1024    111G ± 6%     72G ±10%  -35.10%  (p=0.008 n=5+5)

name                old write    new write    delta
ycsb/C/values=1024   44.2G ± 4%   45.5G ± 3%     ~     (p=0.151 n=5+5)
ycsb/E/values=1024    122G ± 6%     77G ±11%  -36.72%  (p=0.008 n=5+5)

name                old r-amp    new r-amp    delta
ycsb/C/values=1024    4.31 ± 0%    4.91 ± 2%  +13.83%  (p=0.008 n=5+5)
ycsb/E/values=1024    10.0 ± 4%    16.1 ±15%  +61.31%  (p=0.008 n=5+5)

name                old w-amp    new w-amp    delta
ycsb/C/values=1024    0.00         0.00          ~     (all equal)
ycsb/E/values=1024    22.8 ± 2%    31.3 ±11%  +37.61%  (p=0.008 n=5+5)

I should have checked this earlier; it shows a clearer regression due to sampling alone.

@jbowens
Collaborator

jbowens commented Nov 6, 2020

When you ran with just the sampling, did you avoid creating the readCompactions or did you accumulate them but not run them? If they accumulated, I recommend avoiding accumulating them altogether.

Do you know how many readCompactions the sampling typically creates? Maybe it's being too aggressive and we should increase the threshold?

I left some suggestions on #968, some of which might have an impact on performance. It's probably also a good idea to look at CPU and heap profiles from a run. If your roachtest binary is recently built (and you have a very recently updated master), the pebble roachtest will now download the profiles into the artifacts directory along with the results. There should be a series of tar files that, when expanded, contain 10-second CPU profiles and a heap profile. I'd be happy to pair on examining them if you're interested.

@sumeerbhola
Copy link
Collaborator

Seems like the read compaction logic is interfering with the regular score-based compactions. I am considering trying an approach where read-triggered compactions are only checked for if there are no other compactions possible.

This would be highly desirable, given how many bytes are stuck in higher levels (and their high scores) in the ycsb/E read-compaction LSM:

__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp
    WAL         1    22 M       -   962 M       -       -       -       -   972 M       -       -       -     1.0
      0      1355   384 M    0.08   950 M     0 B       0     0 B       0   934 M   2.6 K     0 B      10     1.0
      1         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      2         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      3      1094   4.3 G   41.12   4.3 G     0 B       0     0 B       0    14 G   3.5 K    14 G       1     3.2
      4       109   602 M    1.80   480 M     0 B       0    96 M      24   1.0 G     148   1.0 G       1     2.1
      5       192   1.8 G    0.97   1.0 G     0 B       0   221 M      48   2.2 G     191   2.2 G       1     2.2
      6       188   3.8 G       -   1.3 G     0 B       0     0 B       0   3.5 G     154   3.5 G       1     2.7
  total      2938    11 G       -   972 M     0 B       0   317 M      72    22 G   6.6 K    20 G      14    23.4

@aadityasondhi
Contributor

aadityasondhi commented Nov 6, 2020

Thanks for the feedback @jbowens and @sumeerbhola. I have made some changes based on recommendations from @jbowens in #968 and can already see an improvement in ops/sec, r-amp, and w-amp in small-scale local tests. I am now running the roachtests on AWS machines to confirm. I will update here once I have the results.

When you ran with just the sampling, did you avoid creating the readCompactions or did you accumulate them but not run them? If they accumulated, I recommend avoiding accumulating them altogether.

Yeah, I removed the part where the readCompactions were being constructed, to avoid that. I will still re-run the tests with the allocation-related fixes you suggested in the PR.

If your roachtest binary is recently built (and you have a very recently updated master), the pebble roachtest will now download the profiles into the artifacts directory along with the results. There should be a series of tar files that, when expanded, contain 10-second CPU profiles and a heap profile.

This is very helpful; I will take it into account on my next few runs to find more of the allocation-related performance hits.

@aadityasondhi
Contributor

Re-ran the same tests as before after making the following changes:

  1. de-prioritized read compactions to only be considered if no other compactions are possible (see the sketch after this list)
  2. optimized some memory allocations inside maybeSampleRead() in iterator.go
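
A rough Go sketch of what (1) amounts to in the compaction picker (the types and method names here are illustrative, not the actual code in #968):

  package compactionpick

  type compaction struct{ startLevel, outputLevel int }

  type picker struct {
      scoreBased    []*compaction // candidates from level size/score
      readTriggered []*compaction // candidates queued by read sampling
      readRunning   bool          // at most one read compaction at a time
  }

  // pickAuto always prefers score-based compactions; a read-triggered
  // compaction is only returned when no score-based candidate exists
  // and no other read compaction is in flight.
  func (p *picker) pickAuto() *compaction {
      if len(p.scoreBased) > 0 {
          c := p.scoreBased[0]
          p.scoreBased = p.scoreBased[1:]
          return c
      }
      if p.readRunning || len(p.readTriggered) == 0 {
          return nil
      }
      c := p.readTriggered[0]
      p.readTriggered = p.readTriggered[1:]
      p.readRunning = true
      return c
  }

The intended effect is that read-triggered compactions only consume compaction slots the picker would otherwise leave idle.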

These changes led to results more in line with expectations.

master branch vs read compaction:

benchstat master.txt read.txt
name                old ops/sec  new ops/sec  delta
ycsb/C/values=1024    619k ± 9%    522k ±29%     ~     (p=0.151 n=5+5)
ycsb/E/values=1024   55.3k ± 9%   28.8k ± 8%  -47.98%  (p=0.008 n=5+5)

name                old read     new read     delta
ycsb/C/values=1024   45.4G ± 6%   59.6G ± 2%  +31.24%  (p=0.008 n=5+5)
ycsb/E/values=1024    113G ± 5%     87G ± 9%  -23.13%  (p=0.008 n=5+5)

name                old write    new write    delta
ycsb/C/values=1024   45.4G ± 6%   59.6G ± 2%  +31.25%  (p=0.008 n=5+5)
ycsb/E/values=1024    123G ± 5%     92G ± 9%  -25.28%  (p=0.008 n=5+5)

name                old r-amp    new r-amp    delta
ycsb/C/values=1024    4.31 ± 1%    2.19 ±72%  -49.26%  (p=0.008 n=5+5)
ycsb/E/values=1024    10.1 ± 3%    14.8 ± 5%  +46.90%  (p=0.008 n=5+5)

name                old w-amp    new w-amp    delta
ycsb/C/values=1024    0.00         0.00          ~     (all equal)
ycsb/E/values=1024    22.8 ± 3%    32.7 ± 1%  +43.42%  (p=0.008 n=5+5)

master branch vs sampling only:

benchstat master.txt sampling-no-compaction.txt
name                old ops/sec  new ops/sec  delta
ycsb/C/values=1024    619k ± 9%    307k ±16%  -50.46%  (p=0.008 n=5+5)
ycsb/E/values=1024   55.3k ± 9%   31.6k ±12%  -42.86%  (p=0.008 n=5+5)

name                old read     new read     delta
ycsb/C/values=1024   45.4G ± 6%   43.6G ± 5%     ~     (p=0.095 n=5+5)
ycsb/E/values=1024    113G ± 5%     78G ± 7%  -30.51%  (p=0.008 n=5+5)

name                old write    new write    delta
ycsb/C/values=1024   45.4G ± 6%   43.6G ± 5%     ~     (p=0.095 n=5+5)
ycsb/E/values=1024    123G ± 5%     84G ± 7%  -31.59%  (p=0.008 n=5+5)

name                old r-amp    new r-amp    delta
ycsb/C/values=1024    4.31 ± 1%    4.72 ± 3%   +9.51%  (p=0.008 n=5+5)
ycsb/E/values=1024    10.1 ± 3%    13.5 ± 8%  +34.55%  (p=0.008 n=5+5)

name                old w-amp    new w-amp    delta
ycsb/C/values=1024    0.00         0.00          ~     (all equal)
ycsb/E/values=1024    22.8 ± 3%    27.4 ± 5%  +19.97%  (p=0.008 n=5+5)

end state of master branch:

Benchmarkycsb/C/values=1024 1153003871  640557.7 ops/sec  47954528218 read  47935780146 write  4.32 r-amp  0.00 w-amp

__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp
    WAL         1     0 B       -     0 B       -       -       -       -     0 B       -       -       -     0.0
      0         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      1         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      2         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      3        21    63 M    0.99   3.6 G     0 B       0   1.4 M       2   8.0 G   2.1 K   8.0 G       1     2.3
      4        58   330 M    0.99   3.2 G     0 B       0   1.2 G     321   7.9 G   1.2 K   7.9 G       1     2.5
      5       175   1.7 G    0.99   4.5 G     0 B       0   870 M     180    11 G     915    11 G       1     2.4
      6       303   7.8 G       -   5.3 G     0 B       0     0 B       0    18 G     729    18 G       1     3.3
  total       557   9.9 G       -     0 B     0 B       0   2.1 G     503    45 G   4.9 K    45 G       4     0.0
  flush         0
compact      2224     0 B           144 M  (size == estimated-debt, in = in-progress-bytes)
 memtbl         1   256 K
zmemtbl         0     0 B
   ztbl         0     0 B
 bcache     211 K   4.0 G   97.4%  (score == hit-rate)
 tcache       557   335 K  100.0%  (score == hit-rate)
 titers       570
 filter         -       -    0.0%  (score == utility)

Benchmarkycsb/E/values=1024 103920282  57733.3 ops/sec  116077826496 read  127142840189 write  9.90 r-amp  22.57 w-amp

__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp
    WAL         1    57 M       -   5.2 G       -       -       -       -   5.2 G       -       -       -     1.0
      0        18    36 M    0.97   5.2 G     0 B       0     0 B       0   5.1 G   3.5 K     0 B       1     1.0
      1         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      2         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      3        18    66 M    1.04   8.6 G     0 B       0   6.0 M       3    15 G   4.2 K    15 G       1     1.8
      4        60   382 M    1.00   7.6 G     0 B       0   1.9 G     530    23 G   3.6 K    23 G       1     3.1
      5       173   2.3 G    1.00   9.5 G     0 B       0   900 M     207    32 G   2.6 K    32 G       1     3.4
      6       460    12 G       -   9.8 G     0 B       0     0 B       0    37 G   1.5 K    37 G       1     3.8
  total       729    15 G       -   5.2 G     0 B       0   2.7 G     740   118 G    15 K   108 G       5    22.6
  flush        84
compact      5105   806 M           162 M  (size == estimated-debt, in = in-progress-bytes)
 memtbl         1    64 M
zmemtbl         0     0 B
   ztbl         0     0 B
 bcache     261 K   3.9 G   90.7%  (score == hit-rate)
 tcache       729   438 K  100.0%  (score == hit-rate)
 titers       907
 filter         -       -    0.0%  (score == utility)

end state of read compaction branch:

Benchmarkycsb/C/values=1024 664171307  368983.4 ops/sec  60329042533 read  60305716531 write  3.76 r-amp  0.00 w-amp

__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp
    WAL         1     0 B       -     0 B       -       -       -       -     0 B       -       -       -     0.0
      0         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      1         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      2         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      3         0     0 B    0.00   3.6 G     0 B       0   5.4 M       8   6.9 G   1.8 K   6.9 G       0     1.9
      4         1   4.0 M    0.01   2.7 G     0 B       0   1.6 G     407   7.1 G   1.1 K   7.2 G       1     2.6
      5         1    13 M    0.01   4.6 G     0 B       0   1.0 G     213    12 G     968    12 G       1     2.6
      6       365   9.9 G       -   7.4 G     0 B       0     0 B       0    30 G   1.2 K    30 G       1     4.1
  total       367   9.9 G       -     0 B     0 B       0   2.6 G     628    56 G   5.0 K    56 G       3     0.0
  flush         0
compact      2366     0 B           205 M  (size == estimated-debt, in = in-progress-bytes)
 memtbl         1   256 K
zmemtbl         0     0 B
   ztbl         0     0 B
 bcache     214 K   4.0 G   97.7%  (score == hit-rate)
 tcache       367   221 K  100.0%  (score == hit-rate)
 titers       585
 filter         -       -    0.0%  (score == utility)


Benchmarkycsb/E/values=1024 51525747  28625.4 ops/sec  86421011573 read  91922532293 write  15.45 r-amp  32.88 w-amp

__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp
    WAL         1   7.9 M       -   2.6 G       -       -       -       -   2.6 G       -       -       -     1.0
      0        39    72 M    1.68   2.6 G     0 B       0     0 B       0   2.5 G   1.8 K     0 B       2     1.0
      1         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      2         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      3        25    76 M    1.95   5.8 G     0 B       0   202 M     101    12 G   3.1 K    12 G       1     2.0
      4        41   220 M    0.61   4.7 G     0 B       0   2.0 G     533    14 G   2.1 K    14 G       1     2.8
      5       153   1.9 G    0.91   6.9 G     0 B       0   911 M     203    24 G   2.0 K    24 G       1     3.5
      6       374    10 G       -   7.6 G     0 B       0     0 B       0    31 G   1.2 K    31 G       1     4.1
  total       632    12 G       -   2.6 G     0 B       0   3.0 G     837    86 G    10 K    80 G       6    32.9
  flush        42
compact      3596   597 M           579 M  (size == estimated-debt, in = in-progress-bytes)
 memtbl         1    64 M
zmemtbl         0     0 B
   ztbl         0     0 B
 bcache     236 K   3.9 G   93.7%  (score == hit-rate)
 tcache       632   380 K  100.0%  (score == hit-rate)
 titers      1246
 filter         -       -    0.0%  (score == utility)

Chart comparing ops/sec over time: https://docs.google.com/spreadsheets/d/1UiYY68nEBOh12KDOOuiq2IoxLht5K5CUE3uTso4kI4E/edit#gid=239494140

There is an ops/sec regression with sampling only, but it is made up once read compactions start to get triggered, as seen in the chart. As expected, in the read-only workload C the read compactions help reduce r-amp, and over time ops/sec also increases. However, there is still a performance regression in a more mixed workload, as seen in workload E. Overall, it seems to work as initially expected.

I am going to start looking for more optimizations using memory profiles. I will also look into testing different sampling periods to help with the reduced ops/sec in mixed workloads.

@aadityasondhi
Contributor

Update: ran tests after the changes in 19c1850.

There seems to be a big improvement in the performance regression due to sampling. It seems most of the overhead introduced by sampling was due to the use of version.Overlaps(), as it performs multiple seeks to generate a file list for the level.

master branch vs read compaction:

benchstat master.txt read.txt
name                old ops/sec  new ops/sec  delta
ycsb/C/values=1024    605k ± 5%   1110k ±11%  +83.42%  (p=0.008 n=5+5)
ycsb/E/values=1024   52.9k ± 8%   57.6k ± 1%   +8.82%  (p=0.016 n=5+4)

name                old read     new read     delta
ycsb/C/values=1024   44.0G ± 5%   58.0G ± 3%  +31.72%  (p=0.008 n=5+5)
ycsb/E/values=1024    109G ± 5%    153G ± 1%  +40.71%  (p=0.016 n=5+4)

name                old write    new write    delta
ycsb/C/values=1024   44.0G ± 5%   58.0G ± 3%  +31.73%  (p=0.008 n=5+5)
ycsb/E/values=1024    119G ± 5%    165G ± 1%  +38.01%  (p=0.016 n=5+4)

name                old r-amp    new r-amp    delta
ycsb/C/values=1024    4.30 ± 1%    2.21 ± 1%  -48.54%  (p=0.016 n=5+4)
ycsb/E/values=1024    10.1 ± 3%    10.3 ± 3%     ~     (p=0.310 n=5+5)

name                old w-amp    new w-amp    delta
ycsb/C/values=1024    0.00         0.00          ~     (all equal)
ycsb/E/values=1024    23.1 ± 3%    29.4 ± 2%  +27.55%  (p=0.008 n=5+5)

master branch vs sampling only branch:

benchstat master.txt sampling-no-compaction.txt
name                old ops/sec  new ops/sec  delta
ycsb/C/values=1024    605k ± 5%    552k ± 7%  -8.81%  (p=0.016 n=5+5)
ycsb/E/values=1024   52.9k ± 8%   52.4k ±10%    ~     (p=0.548 n=5+5)

name                old read     new read     delta
ycsb/C/values=1024   44.0G ± 5%   45.5G ± 3%    ~     (p=0.151 n=5+5)
ycsb/E/values=1024    109G ± 5%    109G ± 7%    ~     (p=0.421 n=5+5)

name                old write    new write    delta
ycsb/C/values=1024   44.0G ± 5%   45.5G ± 3%    ~     (p=0.151 n=5+5)
ycsb/E/values=1024    119G ± 5%    119G ± 7%    ~     (p=0.421 n=5+5)

name                old r-amp    new r-amp    delta
ycsb/C/values=1024    4.30 ± 1%    4.34 ± 1%    ~     (p=0.119 n=5+5)
ycsb/E/values=1024    10.1 ± 3%    10.2 ± 5%    ~     (p=0.841 n=5+5)

name                old w-amp    new w-amp    delta
ycsb/C/values=1024    0.00         0.00         ~     (all equal)
ycsb/E/values=1024    23.1 ± 3%    23.2 ± 3%    ~     (p=0.802 n=5+5)

master branch:

Benchmarkycsb/C/values=1024 1086422926  603568.2 ops/sec  44053988592 read  44036518080 write  4.31 r-amp  0.00 w-amp

__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp
    WAL         1     0 B       -     0 B       -       -       -       -     0 B       -       -       -     0.0
      0         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      1         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      2         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      3        21    60 M    0.94   3.6 G     0 B       0   827 K       1   7.2 G   1.8 K   7.2 G       1     2.0
      4        79   333 M    1.00   2.4 G     0 B       0   1.9 G     484   6.4 G     950   6.4 G       1     2.6
      5       195   1.7 G    1.00   4.1 G     0 B       0   1.1 G     235   9.9 G     830   9.9 G       1     2.4
      6       320   7.8 G       -   5.3 G     0 B       0     0 B       0    18 G     744    18 G       1     3.3
  total       615   9.9 G       -     0 B     0 B       0   3.0 G     720    41 G   4.4 K    41 G       4     0.0
  flush         0
compact      2219     0 B            82 M  (size == estimated-debt, in = in-progress-bytes)
 memtbl         1   256 K
zmemtbl         0     0 B
   ztbl         0     0 B
 bcache     210 K   4.0 G   97.4%  (score == hit-rate)
 tcache       615   370 K  100.0%  (score == hit-rate)
 titers       496
 filter         -       -    0.0%  (score == utility)

Benchmarkycsb/E/values=1024 93760980  52089.2 ops/sec  108267080606 read  118268617684 write  10.27 r-amp  23.24 w-amp

__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp
    WAL         1    43 M       -   4.7 G       -       -       -       -   4.7 G       -       -       -     1.0
      0        14    28 M    0.89   4.7 G     0 B       0     0 B       0   4.6 G   3.2 K     0 B       1     1.0
      1         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      2         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      3        21    72 M    1.08   8.1 G     0 B       0   4.0 M       2    16 G   4.2 K    16 G       1     1.9
      4        65   403 M    1.04   6.9 G     0 B       0   1.9 G     546    21 G   3.2 K    21 G       1     3.0
      5       165   2.2 G    1.11   8.6 G     0 B       0   1.1 G     257    29 G   2.3 K    29 G       1     3.4
      6       460    12 G       -   9.3 G     0 B       0     0 B       0    35 G   1.4 K    35 G       1     3.8
  total       725    14 G       -   4.7 G     0 B       0   3.0 G     805   110 G    14 K   101 G       5    23.2
  flush        76
compact      4722  1016 M           116 M  (size == estimated-debt, in = in-progress-bytes)
 memtbl         1    64 M
zmemtbl         0     0 B
   ztbl         3    80 M
 bcache     257 K   3.9 G   91.1%  (score == hit-rate)
 tcache       728   438 K  100.0%  (score == hit-rate)
 titers      1098
 filter         -       -    0.0%  (score == utility)

read compaction branch:

Benchmarkycsb/C/values=1024 2207282063  1226262.2 ops/sec  58393988086 read  58371079815 write  2.20 r-amp  0.00 w-amp

__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp
    WAL         1     0 B       -     0 B       -       -       -       -     0 B       -       -       -     0.0
      0         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      1         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      2         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      3         0     0 B    0.00   3.6 G     0 B       0   2.0 M       3   6.6 G   1.7 K   6.6 G       0     1.8
      4         0     0 B    0.00   2.6 G     0 B       0   1.6 G     418   6.5 G     980   6.5 G       0     2.6
      5         2    20 M    0.01   4.6 G     0 B       0   941 M     194    12 G   1.0 K    12 G       1     2.7
      6       334   9.9 G       -   7.4 G     0 B       0     0 B       0    29 G   1.1 K    29 G       1     3.9
  total       336   9.9 G       -     0 B     0 B       0   2.5 G     615    54 G   4.8 K    54 G       2     0.0
  flush         0
compact      2228     0 B           144 M  (size == estimated-debt, in = in-progress-bytes)
 memtbl         1   256 K
zmemtbl         0     0 B
   ztbl         0     0 B
 bcache     215 K   4.0 G   93.9%  (score == hit-rate)
 tcache       336   202 K  100.0%  (score == hit-rate)
 titers       188
 filter         -       -    0.0%  (score == utility)

Benchmarkycsb/E/values=1024 102899544  57166.3 ops/sec  153403417506 read  164410366650 write  10.16 r-amp  29.48 w-amp

__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp
    WAL         1   3.1 M       -   5.1 G       -       -       -       -   5.2 G       -       -       -     1.0
      0        45    82 M    2.20   5.2 G     0 B       0     0 B       0   5.1 G   3.2 K     0 B       2     1.0
      1         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      2         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      3        16    58 M    0.91   8.3 G     0 B       0   296 M     150    14 G   4.0 K    14 G       1     1.7
      4        31   156 M    0.41   6.9 G     0 B       0   2.2 G     607    19 G   3.1 K    19 G       1     2.7
      5       133   1.6 G    0.71   9.4 G     0 B       0  1016 M     213    46 G   3.7 K    46 G       1     4.9
      6       446    13 G       -    11 G     0 B       0     0 B       0    64 G   2.3 K    64 G       1     6.1
  total       671    15 G       -   5.2 G     0 B       0   3.5 G     970   153 G    16 K   143 G       6    29.5
  flush        84
compact      4754   301 M           666 M  (size == estimated-debt, in = in-progress-bytes)
 memtbl         1    64 M
zmemtbl         0     0 B
   ztbl         0     0 B
 bcache     263 K   3.9 G   91.3%  (score == hit-rate)
 tcache       671   404 K  100.0%  (score == hit-rate)
 titers      1271
 filter         -       -    0.0%  (score == utility)

There is some extra w-amp in workload E, which seems odd since read compaction is the last type of compaction considered by compactionPicker and is limited to one read compaction at a time (66cf3ab#diff-c839f7b8d42ad3e5a36ad9bfac1b3101227a8261fd3b2a1a9556e56b5d5b34d3R1522). I will run more workloads (possibly all of YCSB) to see if I can find a pattern that helps narrow down what is causing it.

@aadityasondhi
Contributor

aadityasondhi commented Nov 12, 2020

Results from all YCSB workloads. This suggests that most of the sampling-based perf drops have been solved by the latest change.

master branch vs read compaction branch:

benchstat master.txt read.txt
name                old ops/sec  new ops/sec  delta
ycsb/A/values=1024   92.4k ± 1%   83.7k ± 1%   -9.41%  (p=0.008 n=5+5)
ycsb/B/values=1024    418k ± 4%    347k ±12%  -16.92%  (p=0.008 n=5+5)
ycsb/C/values=1024    627k ± 4%   1191k ±15%  +89.80%  (p=0.008 n=5+5)
ycsb/D/values=1024    105k ±11%     91k ±31%     ~     (p=0.151 n=5+5)
ycsb/E/values=1024   55.3k ± 7%   55.0k ± 7%     ~     (p=0.548 n=5+5)

name                old read     new read     delta
ycsb/A/values=1024    498G ± 1%    509G ± 1%   +2.12%  (p=0.016 n=5+5)
ycsb/B/values=1024    295G ± 3%    518G ±11%  +75.57%  (p=0.008 n=5+5)
ycsb/C/values=1024   44.6G ± 2%   57.9G ± 3%  +30.00%  (p=0.008 n=5+5)
ycsb/D/values=1024    187G ±10%    232G ±21%     ~     (p=0.056 n=5+5)
ycsb/E/values=1024    113G ± 5%    148G ± 5%  +30.95%  (p=0.008 n=5+5)

name                old write    new write    delta
ycsb/A/values=1024    591G ± 1%    593G ± 1%     ~     (p=0.548 n=5+5)
ycsb/B/values=1024    338G ± 3%    552G ±11%  +63.24%  (p=0.008 n=5+5)
ycsb/C/values=1024   44.5G ± 2%   57.9G ± 3%  +30.01%  (p=0.008 n=5+5)
ycsb/D/values=1024    207G ±10%    250G ±21%     ~     (p=0.056 n=5+5)
ycsb/E/values=1024    123G ± 5%    158G ± 5%  +28.22%  (p=0.008 n=5+5)

name                old r-amp    new r-amp    delta
ycsb/A/values=1024    11.1 ± 4%    10.7 ± 7%     ~     (p=0.421 n=5+5)
ycsb/B/values=1024    6.43 ± 1%    7.62 ± 1%  +18.51%  (p=0.008 n=5+5)
ycsb/C/values=1024    4.29 ± 0%    1.21 ± 2%  -71.88%  (p=0.016 n=5+4)
ycsb/D/values=1024    8.46 ± 2%    9.09 ± 6%   +7.45%  (p=0.016 n=5+5)
ycsb/E/values=1024    10.0 ± 3%    10.3 ± 4%     ~     (p=0.103 n=5+5)

name                old w-amp    new w-amp    delta
ycsb/A/values=1024    6.54 ± 1%    7.24 ± 2%  +10.70%  (p=0.008 n=5+5)
ycsb/B/values=1024    8.29 ± 1%   16.30 ± 2%  +96.67%  (p=0.008 n=5+5)
ycsb/C/values=1024    0.00         0.00          ~     (all equal)
ycsb/D/values=1024    20.3 ± 2%    28.5 ±12%  +40.32%  (p=0.008 n=5+5)
ycsb/E/values=1024    22.8 ± 2%    29.5 ± 6%  +29.10%  (p=0.008 n=5+5)

master branch vs sampling only:

benchstat master.txt sampling-no-compaction.txt
name                old ops/sec  new ops/sec  delta
ycsb/A/values=1024   92.4k ± 1%   92.3k ± 1%     ~     (p=0.841 n=5+5)
ycsb/B/values=1024    418k ± 4%    387k ±11%     ~     (p=0.056 n=5+5)
ycsb/C/values=1024    627k ± 4%    560k ±10%  -10.80%  (p=0.008 n=5+5)
ycsb/D/values=1024    105k ±11%     85k ± 7%  -18.58%  (p=0.008 n=5+5)
ycsb/E/values=1024   55.3k ± 7%   54.5k ± 7%     ~     (p=0.690 n=5+5)

name                old read     new read     delta
ycsb/A/values=1024    498G ± 1%    498G ± 1%     ~     (p=1.000 n=5+5)
ycsb/B/values=1024    295G ± 3%    278G ±10%     ~     (p=0.151 n=5+5)
ycsb/C/values=1024   44.6G ± 2%   44.7G ± 6%     ~     (p=0.841 n=5+5)
ycsb/D/values=1024    187G ±10%    158G ± 7%  -15.48%  (p=0.008 n=5+5)
ycsb/E/values=1024    113G ± 5%    111G ± 5%     ~     (p=0.421 n=5+5)

name                old write    new write    delta
ycsb/A/values=1024    591G ± 1%    591G ± 1%     ~     (p=1.000 n=5+5)
ycsb/B/values=1024    338G ± 3%    318G ±10%     ~     (p=0.095 n=5+5)
ycsb/C/values=1024   44.5G ± 2%   44.6G ± 6%     ~     (p=0.841 n=5+5)
ycsb/D/values=1024    207G ±10%    175G ± 7%  -15.78%  (p=0.008 n=5+5)
ycsb/E/values=1024    123G ± 5%    122G ± 5%     ~     (p=0.421 n=5+5)

name                old r-amp    new r-amp    delta
ycsb/A/values=1024    11.1 ± 4%    10.8 ± 3%     ~     (p=0.222 n=5+5)
ycsb/B/values=1024    6.43 ± 1%    6.43 ± 1%     ~     (p=1.000 n=5+5)
ycsb/C/values=1024    4.29 ± 0%    4.32 ± 1%     ~     (p=0.143 n=5+5)
ycsb/D/values=1024    8.46 ± 2%    8.90 ± 1%   +5.25%  (p=0.008 n=5+5)
ycsb/E/values=1024    10.0 ± 3%     9.9 ± 5%     ~     (p=1.000 n=5+5)

name                old w-amp    new w-amp    delta
ycsb/A/values=1024    6.54 ± 1%    6.55 ± 0%     ~     (p=1.000 n=5+5)
ycsb/B/values=1024    8.29 ± 1%    8.43 ± 2%   +1.76%  (p=0.040 n=5+5)
ycsb/C/values=1024    0.00         0.00          ~     (all equal)
ycsb/D/values=1024    20.3 ± 2%    21.0 ± 1%   +3.39%  (p=0.008 n=5+5)
ycsb/E/values=1024    22.8 ± 2%    22.9 ± 3%     ~     (p=0.889 n=5+5)

The read compaction comparison suggests that it may be useful to tune the compaction trigger thresholds a little, possibly by reducing the frequency of read compactions. This could help with the perf drop in workload B. As for the slight perf drop due to sampling, the mergingIter approach should help with that.
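
For intuition, here is a hypothetical Go sketch of the difference between sampling that re-searches the level metadata for every sampled key (the version.Overlaps()-style path) and sampling that reuses the tables a merging iterator is already positioned on. Types and function names are illustrative only:

  package readsampling

  // fileBounds is a stand-in for per-table key bounds in the version
  // metadata.
  type fileBounds struct{ smallest, largest string }

  // overlapsViaVersion mimics the old path: each sample walks every
  // level's file list to find tables containing the key, an extra
  // metadata traversal (a B-tree/binary search in the real code) per
  // sample.
  func overlapsViaVersion(levels [][]fileBounds, key string) int {
      count := 0
      for _, files := range levels {
          for _, f := range files {
              if f.smallest <= key && key <= f.largest {
                  count++
                  break
              }
          }
      }
      return count
  }

  // overlapsViaMergingIter mimics the new path: a read already has a
  // merging iterator positioned at the key, so sampling only inspects
  // the table each level's iterator currently sits on.
  func overlapsViaMergingIter(currentFiles []*fileBounds, key string) int {
      count := 0
      for _, f := range currentFiles {
          if f != nil && f.smallest <= key && key <= f.largest {
              count++
          }
      }
      return count
  }

The second variant does no extra tree traversal per sample, which is why it should remove most of the remaining sampling overhead.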

@aadityasondhi
Contributor

Updated test results with the latest commit using mergingIter to avoid the B-tree traversal in sampling:

benchstat master.txt read.txt
name                old ops/sec  new ops/sec  delta
ycsb/A/values=1024   92.0k ± 0%   84.1k ± 2%    -8.53%  (p=0.008 n=5+5)
ycsb/B/values=1024    405k ± 4%    351k ±15%   -13.28%  (p=0.008 n=5+5)
ycsb/C/values=1024    607k ± 3%   1141k ±20%   +87.95%  (p=0.008 n=5+5)
ycsb/D/values=1024   98.9k ± 8%   93.3k ±20%      ~     (p=0.421 n=5+5)
ycsb/E/values=1024   52.5k ± 8%   54.3k ±13%      ~     (p=0.222 n=5+5)

name                old read     new read     delta
ycsb/A/values=1024    495G ± 0%    508G ± 1%    +2.54%  (p=0.008 n=5+5)
ycsb/B/values=1024    290G ± 5%    506G ±17%   +74.95%  (p=0.008 n=5+5)
ycsb/C/values=1024   45.8G ± 5%  315.1G ± 6%  +588.43%  (p=0.016 n=5+4)
ycsb/D/values=1024    178G ± 7%    242G ±11%   +35.79%  (p=0.008 n=5+5)
ycsb/E/values=1024    110G ± 8%    145G ±10%   +32.55%  (p=0.008 n=5+5)

name                old write    new write    delta
ycsb/A/values=1024    588G ± 0%    593G ± 1%      ~     (p=0.056 n=5+5)
ycsb/B/values=1024    332G ± 4%    541G ±17%   +63.25%  (p=0.008 n=5+5)
ycsb/C/values=1024   45.8G ± 5%  315.1G ± 6%  +588.67%  (p=0.016 n=5+4)
ycsb/D/values=1024    197G ± 7%    260G ±11%   +31.80%  (p=0.008 n=5+5)
ycsb/E/values=1024    120G ± 8%    156G ±10%   +30.11%  (p=0.008 n=5+5)

name                old r-amp    new r-amp    delta
ycsb/A/values=1024    11.4 ± 3%    11.0 ± 8%      ~     (p=0.151 n=5+5)
ycsb/B/values=1024    6.42 ± 0%    7.17 ± 1%   +11.65%  (p=0.008 n=5+5)
ycsb/C/values=1024    4.32 ± 0%    2.25 ± 3%   -47.75%  (p=0.016 n=5+4)
ycsb/D/values=1024    8.69 ± 3%    8.84 ± 3%      ~     (p=0.222 n=5+5)
ycsb/E/values=1024    10.4 ± 2%    10.1 ± 9%      ~     (p=0.222 n=5+5)

name                old w-amp    new w-amp    delta
ycsb/A/values=1024    6.54 ± 0%    7.22 ± 1%   +10.40%  (p=0.008 n=5+5)
ycsb/B/values=1024    8.38 ± 0%   15.77 ± 3%   +88.21%  (p=0.008 n=5+5)
ycsb/C/values=1024    0.00         0.00           ~     (all equal)
ycsb/D/values=1024    20.5 ± 3%    28.8 ±10%   +40.77%  (p=0.008 n=5+5)
ycsb/E/values=1024    23.4 ± 1%    29.4 ± 5%   +25.87%  (p=0.008 n=5+5)

master branch vs sampling only:

benchstat master.txt sampling-no-compaction.txt
name                old ops/sec  new ops/sec  delta
ycsb/A/values=1024   92.0k ± 0%   92.8k ± 1%  +0.88%  (p=0.008 n=5+5)
ycsb/B/values=1024    405k ± 4%    403k ± 8%    ~     (p=1.000 n=5+5)
ycsb/C/values=1024    607k ± 3%    601k ± 8%    ~     (p=0.841 n=5+5)
ycsb/D/values=1024   98.9k ± 8%   99.4k ±19%    ~     (p=1.000 n=5+5)
ycsb/E/values=1024   52.5k ± 8%   55.8k ±12%    ~     (p=0.222 n=5+5)

name                old read     new read     delta
ycsb/A/values=1024    495G ± 0%    500G ± 1%  +1.02%  (p=0.008 n=5+5)
ycsb/B/values=1024    290G ± 5%    287G ± 8%    ~     (p=1.000 n=5+5)
ycsb/C/values=1024   45.8G ± 5%   43.2G ± 4%    ~     (p=0.056 n=5+5)
ycsb/D/values=1024    178G ± 7%    176G ±15%    ~     (p=1.000 n=5+5)
ycsb/E/values=1024    110G ± 8%    114G ± 2%    ~     (p=0.190 n=5+4)

name                old write    new write    delta
ycsb/A/values=1024    588G ± 0%    594G ± 1%  +0.99%  (p=0.008 n=5+5)
ycsb/B/values=1024    332G ± 4%    329G ± 8%    ~     (p=1.000 n=5+5)
ycsb/C/values=1024   45.8G ± 5%   43.1G ± 4%    ~     (p=0.056 n=5+5)
ycsb/D/values=1024    197G ± 7%    195G ±16%    ~     (p=1.000 n=5+5)
ycsb/E/values=1024    120G ± 8%    126G ± 1%    ~     (p=0.190 n=5+4)

name                old r-amp    new r-amp    delta
ycsb/A/values=1024    11.4 ± 3%    10.6 ± 3%  -7.02%  (p=0.008 n=5+5)
ycsb/B/values=1024    6.42 ± 0%    6.38 ± 1%    ~     (p=0.087 n=5+5)
ycsb/C/values=1024    4.32 ± 0%    4.29 ± 0%  -0.70%  (p=0.016 n=5+5)
ycsb/D/values=1024    8.69 ± 3%    8.68 ± 7%    ~     (p=0.690 n=5+5)
ycsb/E/values=1024    10.4 ± 2%     9.7 ± 3%  -6.08%  (p=0.008 n=5+5)

name                old w-amp    new w-amp    delta
ycsb/A/values=1024    6.54 ± 0%    6.55 ± 0%    ~     (p=0.381 n=5+5)
ycsb/B/values=1024    8.38 ± 0%    8.36 ± 1%    ~     (p=0.603 n=5+5)
ycsb/C/values=1024    0.00         0.00         ~     (all equal)
ycsb/D/values=1024    20.5 ± 3%    20.2 ± 4%    ~     (p=0.841 n=5+5)
ycsb/E/values=1024    23.4 ± 1%    22.5 ± 2%  -3.82%  (p=0.008 n=5+5)

Sampling seems to cause minimal overhead with this new change. The result to note is the ops/sec regression in workloads A and B when read compactions are enabled. I will adjust the read compaction parameters to reduce their frequency, which might help with the perf there.

@petermattis
Collaborator Author

Yeah, the sampling overhead looks non-existent with the latest change.

If there is significant write traffic, perhaps any queued read compactions should be dropped. Can that be done in such a way that, if the write traffic stops and reads continue, we'll eventually queue the active sstables for read compactions again?

@aadityasondhi
Contributor

As it is currently designed, read compactions are queued when a file's AllowedSeeks reaches 0, but they are only executed if no other compaction is possible at the time, and they are limited to one compaction at a time. My assumption was that if we see significant write traffic, it would start triggering our score-based compactions, which are write-dependent, and automatically stop executing read compactions until there are no more score-based compactions possible. This assumption is based on the code in tryReadTriggeredCompaction: https://github.com/cockroachdb/pebble/pull/968/files#diff-c839f7b8d42ad3e5a36ad9bfac1b3101227a8261fd3b2a1a9556e56b5d5b34d3R1511. A sketch of that gating follows.
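To make the gating concrete, here is a minimal Go sketch of the intended ordering. All names here (picker, pickScoreBased, readCompaction) are invented for illustration; the real logic lives in the compaction picker changes in #968.

// compaction stands in for Pebble's internal compaction struct.
type compaction struct {
	readTriggered bool
}

// readCompaction records a key range whose file exhausted its AllowedSeeks.
type readCompaction struct {
	level      int
	start, end []byte
}

// picker stands in for the compaction picker; queue is appended to when a
// file's AllowedSeeks reaches 0.
type picker struct {
	queue []readCompaction
}

// pick prefers score-based (write-driven) compactions. A queued read
// compaction is considered only when no score-based candidate exists, and
// at most one read-triggered compaction runs at a time.
func (p *picker) pick(pickScoreBased func() *compaction, readCompactionRunning bool) *compaction {
	if c := pickScoreBased(); c != nil {
		return c // score-based work always wins
	}
	if readCompactionRunning || len(p.queue) == 0 {
		return nil
	}
	p.queue = p.queue[1:] // pop the oldest queued read compaction
	return &compaction{readTriggered: true}
}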

@petermattis
Collaborator Author

My assumption was that if we see significant write traffic, it would start triggering our score-based compactions, which are write-dependent, and automatically stop executing read compactions until there are no more score-based compactions possible.

ycsb/A is 50% updates. I would think that is sufficient write traffic to not perform any read-triggered compactions. Are read-triggered compactions being performed on the ycsb/A workload? If so, why?

@itsbilal
Member

The other possibility (and a motivation to increase the default AllowedSeeks) is that read compactions could be getting scheduled in the short windows after a score-based compaction completes, or at times when all score-based candidates are already compacting. Then, while a read compaction runs, we're unable to do more score-based compactions until it finishes.

It might be worth tracking the sizes of read-based compactions; if they're too large, that would also explain the ycsb/{A,B} slowness.

@aadityasondhi
Contributor

ycsb/A is 50% updates. I would think that is sufficient write traffic to not perform any read-triggered compactions. Are read-triggered compactions being performed on the ycsb/A workload? If so, why?

I see the problem now. There are times during a workload like A when read compactions do occur because no score-based compaction is possible, either because those have already occurred and we fail early, or because there is a small window in which there truly isn't a suitable score-based compaction.

A possible solution I can think of is to track the reason a score-based compaction was not possible. If it is because the level scores are low enough that no compaction is needed, we can try a read compaction. But if it is because a compaction was already in progress for the picked file, we should try another score-based compaction rather than a read compaction. See the sketch below this paragraph.
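A hypothetical sketch of that bookkeeping (none of these names exist in Pebble today):

// noCompactionReason records why the score-based picker returned nothing.
type noCompactionReason int

const (
	// scoresTooLow: every level's score is below the compaction threshold,
	// so the LSM is healthy and a read compaction is safe to attempt.
	scoresTooLow noCompactionReason = iota
	// candidatesBusy: a suitable candidate exists but its files are already
	// part of an in-progress compaction; retry score-based picking instead.
	candidatesBusy
)

// shouldTryReadCompaction falls back to a read compaction only when the LSM
// genuinely has no score-based work, not when that work is merely blocked.
func shouldTryReadCompaction(reason noCompactionReason, queuedReadCompactions int) bool {
	return reason == scoresTooLow && queuedReadCompactions > 0
}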

and a motivation to increase the default AllowedSeeks

I did try reducing the threshold by 50% in my latest commit, which doubles AllowedSeeks. The results show only a marginal improvement, which leads me to believe that @petermattis's point above is what I should focus on. I could try implementing the approach I suggested and see if that helps.

Results from the latest changes to AllowedSeeks:

benchstat master.txt read.txt
name                old ops/sec  new ops/sec  delta
ycsb/A/values=1024   92.2k ± 0%   84.9k ± 1%   -7.97%  (p=0.008 n=5+5)
ycsb/B/values=1024    411k ± 4%    370k ± 2%   -9.96%  (p=0.008 n=5+5)

name                old read     new read     delta
ycsb/A/values=1024    497G ± 1%    507G ± 1%   +2.03%  (p=0.008 n=5+5)
ycsb/B/values=1024    292G ± 4%    529G ± 1%  +80.92%  (p=0.008 n=5+5)

name                old write    new write    delta
ycsb/A/values=1024    590G ± 1%    593G ± 1%     ~     (p=0.222 n=5+5)
ycsb/B/values=1024    335G ± 4%    566G ± 1%  +68.89%  (p=0.008 n=5+5)

name                old r-amp    new r-amp    delta
ycsb/A/values=1024    11.1 ± 3%    11.2 ± 4%     ~     (p=0.310 n=5+5)
ycsb/B/values=1024    6.41 ± 1%    7.05 ± 2%   +9.89%  (p=0.008 n=5+5)

name                old w-amp    new w-amp    delta
ycsb/A/values=1024    6.53 ± 0%    7.15 ± 1%   +9.52%  (p=0.008 n=5+5)
ycsb/B/values=1024    8.32 ± 1%   15.67 ± 2%  +88.20%  (p=0.008 n=5+5)

@petermattis
Collaborator Author

Another area to explore: don't do a read-based compaction if a flush is in progress. Perhaps even better would be to notice if a flush is likely to happen soon, though I don't have an immediate suggestion for how to accomplish that.

@aadityasondhi
Contributor

Another area to explore: don't do a read-based compaction if a flush is in progress. Perhaps even better would be to notice if a flush is likely to happen soon, though I don't have an immediate suggestion for how to accomplish that.

Did a quick implementation of this to see how much of an impact it would make: f8fdd69. The impact looks significant enough to make it worthwhile to also explore the second suggestion, anticipating flushes and pausing read compaction. I will see if I can come up with something using the memtable sizes; a sketch of the idea follows, ahead of the numbers.
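For reference, a rough sketch of the combined check, with hypothetical inputs; the flush check mirrors f8fdd69, while the memtable-fullness fraction is an arbitrary illustration, not a tuned value:

// shouldDeferReadCompaction reports whether a read compaction should be
// skipped because a flush is running or appears imminent.
func shouldDeferReadCompaction(flushing bool, memtableBytes, memtableCap uint64) bool {
	if flushing {
		// A flush is in progress; its output may create score-based work.
		return true
	}
	// Anticipate an imminent flush once the mutable memtable is more than
	// half full (the one-half threshold is illustrative only).
	return memtableBytes > memtableCap/2
}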

benchstat v4-mergeiter/run-4/master.txt v4-mergeiter/run-5/read.txt
name                old ops/sec  new ops/sec  delta
ycsb/A/values=1024   92.2k ± 0%   89.7k ± 2%   -2.70%  (p=0.008 n=5+5)
ycsb/B/values=1024    411k ± 4%    377k ± 3%   -8.29%  (p=0.008 n=5+5)

name                old read     new read     delta
ycsb/A/values=1024    497G ± 1%    497G ± 1%     ~     (p=0.690 n=5+5)
ycsb/B/values=1024    292G ± 4%    522G ± 1%  +78.59%  (p=0.008 n=5+5)

name                old write    new write    delta
ycsb/A/values=1024    590G ± 1%    588G ± 1%     ~     (p=0.421 n=5+5)
ycsb/B/values=1024    335G ± 4%    560G ± 2%  +67.00%  (p=0.008 n=5+5)

name                old r-amp    new r-amp    delta
ycsb/A/values=1024    11.1 ± 3%    11.3 ± 3%     ~     (p=0.167 n=5+5)
ycsb/B/values=1024    6.41 ± 1%    7.01 ± 2%   +9.26%  (p=0.008 n=5+5)

name                old w-amp    new w-amp    delta
ycsb/A/values=1024    6.53 ± 0%    6.70 ± 2%   +2.63%  (p=0.008 n=5+5)
ycsb/B/values=1024    8.32 ± 1%   15.21 ± 2%  +82.72%  (p=0.008 n=5+5)

aadityasondhi added a commit to aadityasondhi/pebble that referenced this issue Nov 23, 2020
Previously, our compactions were triggered by write-based heuristics.
This change introduces compactions based on high read activity to improve
read performance in read-heavy workloads. It is inspired by LevelDB's
read-based compaction heuristic, which is outlined in the following issue:
cockroachdb#29.

These compactions are triggered using an `AllowedSeeks` parameter on
each file's `FileMetadata`. Reads are sampled at a rate defined in
`iterator.MaybeSampleRead()`. If a read is sampled, `AllowedSeeks` is
decremented on the file in the top-most level containing the key. Once
`AllowedSeeks` reaches 0, a compaction for the key range is scheduled.

Read-triggered compactions are only considered if no other compaction is
possible at the time. This helps prioritize score-based compactions to
maintain a healthy LSM shape.

The results of this change in benchmarks are outlined in the GitHub
issue: cockroachdb#29.
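As a reading aid, here is a minimal sketch of the sampling path the commit message describes; the types and the scheduling callback are illustrative stand-ins, not Pebble's actual iterator code:

import "sync/atomic"

// fileMetadata stands in for the per-sstable metadata carrying the seek budget.
type fileMetadata struct {
	AllowedSeeks      int64
	level             int
	smallest, largest []byte
}

// sampledRead is called for a read that the iterator chose to sample and that
// found the key covered by files in more than one level; topFile is the file
// in the top-most level containing the key.
func sampledRead(topFile *fileMetadata, schedule func(level int, start, end []byte)) {
	// Decrement the seek budget; once it hits 0, queue a read-triggered
	// compaction covering this file's key range.
	if atomic.AddInt64(&topFile.AllowedSeeks, -1) == 0 {
		schedule(topFile.level, topFile.smallest, topFile.largest)
	}
}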
@aadityasondhi
Contributor

Results after the most recent PR:

For values=1024, the results are promising: read compactions are helping performance on workload C. For values=64, there is a perf regression in workload A. I will try tweaking the sampling rate to see if that helps.

1024 master vs read compactions

benchstat master.txt read-32C.txt
name                old ops/sec  new ops/sec   delta
ycsb/A/values=1024   92.3k ± 1%    90.4k ± 1%    -1.99%  (p=0.008 n=5+5)
ycsb/B/values=1024    409k ±12%     386k ± 1%      ~     (p=0.190 n=5+4)
ycsb/C/values=1024    640k ± 2%    1256k ±29%   +96.17%  (p=0.008 n=5+5)
ycsb/D/values=1024    116k ±10%     115k ± 9%      ~     (p=0.690 n=5+5)
ycsb/E/values=1024   73.5k ± 3%    71.6k ± 7%      ~     (p=0.222 n=5+5)

name                old read     new read      delta
ycsb/A/values=1024    497G ± 0%     499G ± 1%      ~     (p=0.151 n=5+5)
ycsb/B/values=1024    284G ±20%     529G ± 0%   +86.63%  (p=0.016 n=5+4)
ycsb/C/values=1024   44.8G ± 3%  286.5G ±110%  +539.36%  (p=0.008 n=5+5)
ycsb/D/values=1024    203G ± 8%     281G ± 6%   +38.25%  (p=0.008 n=5+5)
ycsb/E/values=1024    138G ± 3%     197G ± 5%   +42.06%  (p=0.008 n=5+5)

name                old write    new write     delta
ycsb/A/values=1024    590G ± 0%     590G ± 1%      ~     (p=0.690 n=5+5)
ycsb/B/values=1024    326G ±19%     567G ± 0%   +74.04%  (p=0.016 n=5+4)
ycsb/C/values=1024   44.8G ± 3%  286.4G ±110%  +539.57%  (p=0.008 n=5+5)
ycsb/D/values=1024    226G ± 8%     303G ± 7%   +34.35%  (p=0.008 n=5+5)
ycsb/E/values=1024    153G ± 3%     211G ± 5%   +37.92%  (p=0.008 n=5+5)

name                old r-amp    new r-amp     delta
ycsb/A/values=1024    10.8 ± 3%     11.0 ±12%      ~     (p=0.690 n=5+5)
ycsb/B/values=1024    6.38 ± 1%     7.22 ± 1%   +13.15%  (p=0.016 n=4+5)
ycsb/C/values=1024    4.28 ± 0%     1.81 ±77%   -57.60%  (p=0.008 n=5+5)
ycsb/D/values=1024    8.07 ± 2%     8.31 ± 3%    +2.95%  (p=0.032 n=5+5)
ycsb/E/values=1024    8.64 ± 2%     9.16 ± 3%    +6.00%  (p=0.008 n=5+5)

name                old w-amp    new w-amp     delta
ycsb/A/values=1024    6.55 ± 1%     6.68 ± 1%    +1.98%  (p=0.008 n=5+5)
ycsb/B/values=1024    8.15 ± 7%    15.04 ± 0%   +84.45%  (p=0.016 n=5+4)
ycsb/C/values=1024    0.00          0.00           ~     (all equal)
ycsb/D/values=1024    19.9 ± 1%     27.0 ± 3%   +35.83%  (p=0.008 n=5+5)
ycsb/E/values=1024    21.3 ± 1%     30.1 ± 3%   +41.54%  (p=0.008 n=5+5)

64 master vs read compactions

benchstat master-64.txt read-32C-64.txt
name              old ops/sec  new ops/sec   delta
ycsb/A/values=64    787k ± 4%     570k ± 4%     -27.53%  (p=0.008 n=5+5)
ycsb/B/values=64    977k ± 8%    1141k ± 2%     +16.69%  (p=0.008 n=5+5)
ycsb/C/values=64   1.38M ± 8%    1.25M ±21%        ~     (p=0.151 n=5+5)
ycsb/D/values=64    645k ± 3%     618k ± 2%      -4.10%  (p=0.032 n=5+5)
ycsb/E/values=64    192k ± 5%     226k ± 2%     +18.05%  (p=0.008 n=5+5)

name              old read     new read      delta
ycsb/A/values=64    195G ± 2%     454G ± 1%    +132.89%  (p=0.008 n=5+5)
ycsb/B/values=64   35.5G ± 7%   345.8G ± 2%    +873.12%  (p=0.008 n=5+5)
ycsb/C/values=64   1.52G ± 6%  393.96G ± 4%  +25836.80%  (p=0.008 n=5+5)
ycsb/D/values=64   74.8G ± 4%   350.1G ± 2%    +368.26%  (p=0.008 n=5+5)
ycsb/E/values=64   19.4G ± 4%   161.4G ± 5%    +730.87%  (p=0.008 n=5+5)

name              old write    new write     delta
ycsb/A/values=64    283G ± 2%     518G ± 2%     +83.02%  (p=0.008 n=5+5)
ycsb/B/values=64   46.6G ± 7%   358.4G ± 2%    +668.50%  (p=0.008 n=5+5)
ycsb/C/values=64   1.47G ± 6%  393.89G ± 4%  +26640.54%  (p=0.008 n=5+5)
ycsb/D/values=64   86.8G ± 4%   361.6G ± 2%    +316.67%  (p=0.008 n=5+5)
ycsb/E/values=64   23.0G ± 4%   165.5G ± 4%    +620.53%  (p=0.008 n=5+5)

name              old r-amp    new r-amp     delta
ycsb/A/values=64    6.52 ± 3%     4.99 ± 1%     -23.40%  (p=0.008 n=5+5)
ycsb/B/values=64    4.18 ± 0%     3.22 ± 1%     -22.80%  (p=0.008 n=5+5)
ycsb/C/values=64    3.03 ± 0%     2.90 ±30%        ~     (p=0.683 n=5+5)
ycsb/D/values=64    4.47 ± 0%     4.78 ± 2%      +6.93%  (p=0.008 n=5+5)
ycsb/E/values=64    4.12 ± 0%     2.64 ± 2%     -35.76%  (p=0.008 n=5+5)

name              old w-amp    new w-amp     delta
ycsb/A/values=64    3.23 ± 2%     8.17 ± 4%    +152.75%  (p=0.008 n=5+5)
ycsb/B/values=64    4.29 ± 0%    28.29 ± 3%    +559.35%  (p=0.016 n=4+5)
ycsb/C/values=64    0.00          0.00             ~     (all equal)
ycsb/D/values=64    12.1 ± 1%     52.7 ± 2%    +334.58%  (p=0.008 n=5+5)
ycsb/E/values=64    10.8 ± 1%     65.8 ± 6%    +510.61%  (p=0.008 n=5+5)

aadityasondhi added a commit to aadityasondhi/pebble that referenced this issue Nov 25, 2020
aadityasondhi added a commit to aadityasondhi/pebble that referenced this issue Nov 26, 2020
aadityasondhi added a commit to aadityasondhi/pebble that referenced this issue Dec 10, 2020
aadityasondhi added a commit to aadityasondhi/pebble that referenced this issue Dec 13, 2020
@aadityasondhi
Contributor

After the most recent changes, the benchmarks align with what we expected to see from this project.

1024 master vs read compactions

benchstat master.txt read.txt
name                old ops/sec  new ops/sec  delta
ycsb/A/values=1024   92.2k ± 1%   92.3k ± 1%      ~     (p=1.000 n=5+5)
ycsb/B/values=1024    404k ± 8%    419k ± 5%      ~     (p=0.548 n=5+5)
ycsb/C/values=1024    605k ± 8%   1415k ± 5%  +133.93%  (p=0.008 n=5+5)
ycsb/D/values=1024    113k ±16%    108k ±17%      ~     (p=0.690 n=5+5)
ycsb/E/values=1024   57.1k ±28%   71.0k ±16%      ~     (p=0.056 n=5+5)

name                old read     new read     delta
ycsb/A/values=1024    501G ± 0%    502G ± 1%      ~     (p=0.421 n=5+5)
ycsb/B/values=1024    288G ± 6%    301G ± 3%      ~     (p=0.095 n=5+5)
ycsb/C/values=1024   44.1G ± 1%   61.0G ± 4%   +38.20%  (p=0.008 n=5+5)
ycsb/D/values=1024    196G ±12%    193G ±16%      ~     (p=0.841 n=5+5)
ycsb/E/values=1024    116G ±20%    135G ±12%      ~     (p=0.056 n=5+5)

name                old write    new write    delta
ycsb/A/values=1024    594G ± 0%    595G ± 1%      ~     (p=0.421 n=5+5)
ycsb/B/values=1024    330G ± 7%    344G ± 3%      ~     (p=0.095 n=5+5)
ycsb/C/values=1024   44.1G ± 1%   61.0G ± 4%   +38.21%  (p=0.008 n=5+5)
ycsb/D/values=1024    218G ±13%    214G ±16%      ~     (p=0.841 n=5+5)
ycsb/E/values=1024    127G ±20%    149G ±12%      ~     (p=0.056 n=5+5)

name                old r-amp    new r-amp    delta
ycsb/A/values=1024    10.2 ± 3%    10.4 ± 2%      ~     (p=0.222 n=5+5)
ycsb/B/values=1024    6.36 ± 1%    6.31 ± 1%      ~     (p=0.310 n=5+5)
ycsb/C/values=1024    4.28 ± 1%    1.24 ± 0%   -71.00%  (p=0.016 n=5+4)
ycsb/D/values=1024    8.05 ± 3%    8.20 ± 4%      ~     (p=0.460 n=5+5)
ycsb/E/values=1024    9.33 ± 8%    8.76 ± 5%      ~     (p=0.421 n=5+5)

name                old w-amp    new w-amp    delta
ycsb/A/values=1024    6.57 ± 0%    6.59 ± 1%      ~     (p=0.357 n=5+5)
ycsb/B/values=1024    8.36 ± 1%    8.40 ± 2%      ~     (p=0.524 n=5+5)
ycsb/C/values=1024    0.00         0.00           ~     (all equal)
ycsb/D/values=1024    19.8 ± 3%    20.3 ± 1%      ~     (p=0.095 n=5+5)
ycsb/E/values=1024    23.0 ± 7%    21.2 ± 0%    -7.76%  (p=0.016 n=5+4)

64 master vs read compactions

benchstat master-64.txt read-64.txt
name              old ops/sec  new ops/sec  delta
ycsb/A/values=64    795k ± 6%    789k ± 2%      ~     (p=1.000 n=5+4)
ycsb/B/values=64    981k ±11%   1178k ± 2%   +20.14%  (p=0.016 n=5+4)
ycsb/C/values=64   1.39M ± 8%   2.76M ± 4%   +97.99%  (p=0.008 n=5+5)
ycsb/D/values=64    644k ± 7%    650k ±14%      ~     (p=0.548 n=5+5)
ycsb/E/values=64    192k ± 7%    230k ± 2%   +20.20%  (p=0.008 n=5+5)

name              old read     new read     delta
ycsb/A/values=64    199G ± 2%    196G ± 2%      ~     (p=0.286 n=5+4)
ycsb/B/values=64   35.6G ±11%  182.8G ± 2%  +413.82%  (p=0.016 n=5+4)
ycsb/C/values=64   1.42G ±19%   3.08G ± 5%  +116.59%  (p=0.008 n=5+5)
ycsb/D/values=64   74.9G ± 8%  138.3G ± 1%   +84.57%  (p=0.016 n=5+4)
ycsb/E/values=64   19.1G ± 9%   82.7G ± 3%  +333.09%  (p=0.008 n=5+5)

name              old write    new write    delta
ycsb/A/values=64    287G ± 3%    284G ± 2%      ~     (p=0.730 n=5+4)
ycsb/B/values=64   46.7G ±11%  195.9G ± 1%  +319.21%  (p=0.016 n=5+4)
ycsb/C/values=64   1.38G ±19%   2.99G ± 5%  +116.60%  (p=0.008 n=5+5)
ycsb/D/values=64   86.9G ± 8%  150.8G ± 1%   +73.38%  (p=0.016 n=5+4)
ycsb/E/values=64   22.6G ± 9%   86.9G ± 2%  +283.91%  (p=0.008 n=5+5)

name              old r-amp    new r-amp    delta
ycsb/A/values=64    6.43 ± 6%    6.35 ±10%      ~     (p=0.452 n=5+5)
ycsb/B/values=64    4.18 ± 0%    3.53 ± 1%   -15.61%  (p=0.008 n=5+5)
ycsb/C/values=64    3.03 ± 0%    1.07 ± 2%   -64.78%  (p=0.008 n=5+5)
ycsb/D/values=64    4.47 ± 1%    4.17 ± 0%    -6.76%  (p=0.008 n=5+5)
ycsb/E/values=64    4.12 ± 0%    2.58 ± 6%   -37.22%  (p=0.008 n=5+5)

name              old w-amp    new w-amp    delta
ycsb/A/values=64    3.26 ± 4%    3.29 ± 5%      ~     (p=0.452 n=5+5)
ycsb/B/values=64    4.29 ± 1%   14.86 ± 3%  +246.80%  (p=0.008 n=5+5)
ycsb/C/values=64    0.00         0.00           ~     (all equal)
ycsb/D/values=64    12.1 ± 1%    19.9 ± 6%   +63.98%  (p=0.008 n=5+5)
ycsb/E/values=64    10.6 ± 2%    34.0 ± 1%  +219.53%  (p=0.008 n=5+5)

Compared to the last benchmark run, this removes all of the regression in the small-value-size tests, and we see noticeable improvements across all read-heavy workloads.

aadityasondhi added a commit to aadityasondhi/pebble that referenced this issue Dec 17, 2020
aadityasondhi added a commit to aadityasondhi/pebble that referenced this issue Dec 21, 2020
Fixes cockroachdb#29.
aadityasondhi added a commit to aadityasondhi/pebble that referenced this issue Dec 22, 2020
This change enables read-based compactions by default and sets the read
sampling threshold. This is based on the findings in
cockroachdb#29.
aadityasondhi added a commit to aadityasondhi/pebble that referenced this issue Dec 22, 2020
aadityasondhi added a commit that referenced this issue Dec 22, 2020