vfs: open sstables with POSIX_FADV_RANDOM #198

Closed
ajkr opened this issue Jul 31, 2019 · 11 comments

@ajkr
Contributor

ajkr commented Jul 31, 2019

Linux supports a syscall, fadvise64() (typically accessed via libc's posix_fadvise()), that lets an application give the kernel a hint about the access pattern it will follow when reading from a file descriptor. If no hint is given, the default is POSIX_FADV_NORMAL, which enables device-dependent readahead. On my system the readahead is 128KB (the command below reports it in units of 512-byte sectors):

$ sudo blockdev --getra /dev/nvme0n1
256

This causes a problem for Pebble on short to medium range scans. Since data blocks are only 32KB (and smaller when compressed), 128KB of readahead makes us do I/O for data blocks that are never needed by the scan: a single random 32KB block read can pull in roughly four times the data that is actually required. In an I/O-bound use case, this shows up as decreased user-visible throughput.

To repro, let's use the following setup.

  • c5d.4xlarge instance
  • mlock 26GB, then there's ~4GB available memory
  • 1GB block cache
  • 64 byte values, 16GB database
  • workload: 1000 key scans, uniform distribution, 16 concurrent scanners

We can generate the DB as follows:

$ ./pebble ycsb /mnt/data1/bench_16G/ --wipe --workload insert=100 --values 64 --initial-keys 0 --keys uniform --concurrency 64 --batch 64 --num-ops $[(16 << 30) / 64 / 64] --duration 0s --wait-compactions

Or we can use my pre-generated DB to save a bit of time:

$ aws s3 cp s3://andrewk-artifacts/bench_val64_db16G.tar /mnt/data1/ --region us-east-2
$ mkdir -p /mnt/data1/bench_16G && tar -xf /mnt/data1/bench_val64_db16G.tar -C /mnt/data1/bench_16G

The purpose of mlock()ing memory is to artificially shrink the available memory so we can test the scenario where the DB is larger than memory. This saves us the time of prepopulating an enormous DB.

To mlock 26GB, we can compile this simple program (https://gist.github.com/ajkr/f7501a1177647e9a45cb636002c76e39) and run it as root:

$ gcc ./mlock.c -o ./mlock
$ chmod +x ./mlock
$ sudo ./mlock 26 $[1024*1024*1024] &
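
The linked gist is a small C program. Purely as an illustration of the same idea, here is a rough Go sketch (assuming golang.org/x/sys/unix; the argument handling is simplified relative to the gist, and running it still requires root or a raised RLIMIT_MEMLOCK):

package main

import (
    "log"
    "os"
    "strconv"
    "time"

    "golang.org/x/sys/unix"
)

// Keep references to the locked buffers so the GC never reclaims them.
var locked [][]byte

func main() {
    // e.g. "sudo ./mlock 26" pins 26 GiB of memory in 1 GiB chunks.
    if len(os.Args) < 2 {
        log.Fatal("usage: mlock <GiB>")
    }
    gib, err := strconv.Atoi(os.Args[1])
    if err != nil {
        log.Fatal(err)
    }
    for i := 0; i < gib; i++ {
        buf := make([]byte, 1<<30)
        // mlock(2) faults the pages in and pins them in RAM, shrinking
        // what is left over for the page cache.
        if err := unix.Mlock(buf); err != nil {
            log.Fatalf("mlock: %v", err)
        }
        locked = append(locked, buf)
    }
    // Stay alive so the memory remains locked while the benchmark runs.
    for {
        time.Sleep(time.Hour)
    }
}

Either way, the effect is the same: only ~4GB of memory is left for the page cache, so most sstable reads have to hit the device.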

The benchmark commands are below. Make sure you've built pebble with -tags rocksdb for the second command to work:

$ ./pebble ycsb /mnt/data1/bench_16G --workload scan=100 --concurrency 16 --prepopulated-keys $[16*1024*1024*1024/64] --initial-keys 0 --keys uniform --duration 2m --cache $[1*1024*1024*1024] --scans 1000
$ ./pebble ycsb /mnt/data1/bench_16G --workload scan=100 --concurrency 16 --prepopulated-keys $[16*1024*1024*1024/64] --initial-keys 0 --keys uniform --duration 2m --cache $[1*1024*1024*1024] --scans 1000 --rocksdb

Results:

engine    ops/sec  p50(ms)  p99(ms)
pebble     2508.9      6.6     12.1
rocksdb    6345.6      2.6      4.5

It's somewhat informative to look at the resource usage of those benchmark commands. For Pebble, measuring resource usage with /usr/bin/time shows:

93.30user 162.88system 2:00.73elapsed 212%CPU (0avgtext+0avgdata 2417616maxresident)k
186368744inputs+608outputs (282major+505178minor)pagefaults 0swaps

While with RocksDB we see:

286.87user 134.34system 2:00.68elapsed 349%CPU (0avgtext+0avgdata 2517884maxresident)k
186770144inputs+504outputs (6186major+640093minor)pagefaults 0swaps

This indicates the engines do the same amount of read I/O, while RocksDB achieves much higher user-visible throughput than Pebble. Peter's experiments confirmed this is indeed due to the different readahead settings. We should use POSIX_FADV_RANDOM (i.e., hint the kernel to do zero readahead) to be competitive on short or medium scans.
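
As a sketch of what issuing that hint looks like from Go (this is not Pebble's actual code; it assumes golang.org/x/sys/unix, and the helper name and path are made up):

package main

import (
    "log"
    "os"

    "golang.org/x/sys/unix"
)

// openRandomAccess is a hypothetical helper: it opens a file for reading and
// advises the kernel to expect random access, which disables readahead on
// that descriptor.
func openRandomAccess(path string) (*os.File, error) {
    f, err := os.Open(path)
    if err != nil {
        return nil, err
    }
    // Offset 0 and length 0 apply the advice to the entire file.
    if err := unix.Fadvise(int(f.Fd()), 0, 0, unix.FADV_RANDOM); err != nil {
        f.Close()
        return nil, err
    }
    return f, nil
}

func main() {
    // Illustrative path only.
    f, err := openRandomAccess("/mnt/data1/bench_16G/000123.sst")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()
}

With FADV_RANDOM set, each block read costs roughly the requested size in device I/O rather than a full 128KB readahead window.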

@ajkr ajkr self-assigned this Jul 31, 2019
@ajkr
Contributor Author

ajkr commented Jul 31, 2019

@itsbilal this task could give you some exposure to our low-level performance work. Feel free to claim it from me if you're interested.

@petermattis
Collaborator

That's interesting. Pebble is doing the same amount of I/O but using less CPU and achieving lower scan ops/sec. To speculate: maybe it's I/O bound in both cases, but the useful I/Os are diluted in Pebble's case because we didn't set POSIX_FADV_RANDOM on the file descriptors, so we're getting the default kernel readahead.

The differences in user and system time are both quite dramatic. It would be a fairly simple experiment to mark all files as POSIX_FADV_RANDOM and see if that fixes the discrepancy. Of course, it could be something else. I wonder what the CPU and blocking profiles show.

@ajkr
Copy link
Contributor Author

ajkr commented Jul 31, 2019

Talked with Bilal. He'll try out the POSIX_FADV_RANDOM experiment like we described, though not necessarily in the next few weeks. I think this is a good intro to performance work since the gap is so large that it shouldn't be caused by something super subtle.

I will improve the issue description to be more instructive. It was originally just a dump of my thoughts/commands so as not to forget, under the assumption I'd be working on it.

@petermattis
Collaborator

You've dropped this bone in front of me. My will power is not strong enough to wait days to pick it up, let alone weeks.

I was able to replicate your numbers above. Adding in a hack to fadvise(..., POSIX_FADV_RANDOM) immediately after opening any file results in Pebble achieving:

____optype__elapsed_____ops(total)___ops/sec(cum)__keys/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
  scan_100   120.0s         757011         6308.4      6308373.9      2.5      2.4      3.4      8.4   3623.9

The higher p99 latencies all seemed to occur at startup, after which p99 settled down to around 3.7ms, but that startup regime skewed the overall numbers. I saw similar behavior from RocksDB: high initial p99 latencies which then leveled off at around 3.7ms. Here are the numbers from my RocksDB run:

____optype__elapsed_____ops(total)___ops/sec(cum)__keys/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
  scan_100   120.1s         757810         6308.8      6308828.3      2.5      2.6      3.4      5.0     71.3

POSIX_FADV_NORMAL results in the default readahead specified by the block device, while POSIX_FADV_RANDOM causes no additional readahead; reads are performed at whatever size the application requests. The nvme1n1 device shows:

ubuntu@ip-10-12-40-64:~$ cat /sys/block/nvme1n1/bdi/read_ahead_kb
128

Given that we're using 32kb blocks in this database, I'm surprised the perf diff wasn't higher than 2.5x.

I also tested out different scan sizes, and with the POSIX_FADV_RANDOM hack, Pebble and RocksDB appear to have equivalent performance for scan sizes of 1, 10, and 100.

@itsbilal Sorry for stealing your thunder. I'll leave implementing real support for POSIX_FADV_RANDOM in your hands.

@petermattis petermattis assigned ajkr and unassigned ajkr Jul 31, 2019
@petermattis petermattis changed the title slowdown compared to rocksdb when DB is larger than memory vfs: open sstables with POSIX_FADV_RANDOM Jul 31, 2019
@ajkr
Contributor Author

ajkr commented Jul 31, 2019

You've dropped this bone in front of me. My will power is not strong enough to wait days to pick it up, let alone weeks.

Ha, I know, opportunities for small changes with a 2x performance improvement don't show up often! I was also thinking of experimenting out of curiosity.

@ajkr
Contributor Author

ajkr commented Jul 31, 2019

Given that we're using 32kb blocks in this database, I'm surprised the perf diff wasn't higher than 2.5x.

Since we're scanning 1000 keys probably some of the readahead is actually useful particularly at the bottom level.

petermattis added a commit that referenced this issue Aug 1, 2019
This is a hack because it sets POSIX_FADV_RANDOM on all opened files,
not just sstables. Is that a problem? We rarely read the WAL or
MANIFEST.

See #198
@petermattis
Collaborator

petermattis@2720e5b contains my hack. See also the fadvise-hack branch.

@itsbilal
Member

itsbilal commented Aug 1, 2019

@petermattis no worries! I overestimated the amount of work on this at first, but it makes total sense to do the hack for now. I assume the proper fix is to only set POSIX_FADV_RANDOM on sstables, which I can do soon enough.

@petermattis
Collaborator

I assume the proper fix is to only set POSIX_FADV_RANDOM on sstables, which I can do soon enough.

Yep. That will involve plumbing a new interface on either vfs.File or vfs.VFS.Open(). Sstables are opened in one place: https://github.com/petermattis/pebble/blob/master/table_cache.go#L335.

The trickiest part is figuring out whether there's some way to test this so that it isn't accidentally regressed.
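
One possible shape for that plumbing, purely as a sketch (the fdGetter interface and helper name below are hypothetical, not Pebble's actual API; unix.Fadvise is Linux-only, so a real version would live behind build tags):

package vfs // sketch only

import "golang.org/x/sys/unix"

// fdGetter is an optional interface a File implementation can satisfy to
// expose its underlying descriptor (os.File already has such a method).
type fdGetter interface {
    Fd() uintptr
}

// RandomReadsHint advises the kernel that f will be read at random offsets,
// disabling readahead on its descriptor. It is a no-op for Files that do not
// expose a descriptor (e.g. in-memory files used in tests).
func RandomReadsHint(f File) {
    if fg, ok := f.(fdGetter); ok {
        _ = unix.Fadvise(int(fg.Fd()), 0, 0, unix.FADV_RANDOM)
    }
}

The table cache could then call such a hint right after opening an sstable in the spot linked above.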

@ajkr
Contributor Author

ajkr commented Aug 7, 2019

I tried rewriting the description in a more informative way. Let me know if you have any questions.

itsbilal added a commit to itsbilal/pebble that referenced this issue Aug 8, 2019
This change calls fadvise with FADV_RANDOM on sstable file descriptors,
to ensure that readahead is disabled. This reduces wasted I/Os when
reading from sstables, since sstable reads especially for short to
medium range scans do not read large enough contiguous blocks to be
able to take advantage of readahead. Instead, readahead ends up reducing
user-visible I/O performance.

See cockroachdb#198 .
@ajkr ajkr removed their assignment Aug 8, 2019
itsbilal added a commit that referenced this issue Aug 12, 2019
This change calls fadvise with FADV_RANDOM on sstable file descriptors,
to ensure that readahead is disabled. This reduces wasted I/Os when
reading from sstables, since sstable reads especially for short to
medium range scans do not read large enough contiguous blocks to be
able to take advantage of readahead. Instead, readahead ends up reducing
user-visible I/O performance.

See #198 .
@itsbilal
Member

Fix landed in master (#215). Closing.
