vfs: open sstables with POSIX_FADV_RANDOM #198

Closed
ajkr opened this issue Jul 31, 2019 · 11 comments

@ajkr
Contributor

ajkr commented Jul 31, 2019

Linux supports a syscall, fadvise64() (typically accessed via libc's posix_fadvise()), that lets an application give the kernel a hint about the access pattern it will follow when reading from a file descriptor. If no hint is given, the default is POSIX_FADV_NORMAL, which enables device-dependent readahead. On my system the readahead is 128KB (the command below reports it in units of 512-byte sectors):

$ sudo blockdev --getra /dev/nvme0n1
256

This causes a problem for Pebble on short to medium range scans. Since data blocks are only 32KB (and smaller when compressed), 128KB of readahead makes us do I/O for data blocks that are never needed by the scan: a single random 32KB block read can pull in roughly four times the data that is actually required. In an I/O-bound use case, this shows up as decreased user-visible throughput.

To repro, let's use the following setup.

  • c5d.4xlarge instance
  • mlock 26GB, then there's ~4GB available memory
  • 1GB block cache
  • 64 byte values, 16GB database
  • workload: 1000 key scans, uniform distribution, 16 concurrent scanners

We can generate the DB as follows:

$ ./pebble ycsb /mnt/data1/bench_16G/ --wipe --workload insert=100 --values 64 --initial-keys 0 --keys uniform --concurrency 64 --batch 64 --num-ops $[(16 << 30) / 64 / 64] --duration 0s --wait-compactions

Or we can use my pre-generated DB to save a bit of time:

$ aws s3 cp s3://andrewk-artifacts/bench_val64_db16G.tar /mnt/data1/ --region us-east-2
$ mkdir -p /mnt/data1/bench_16G && tar -xf /mnt/data1/bench_val64_db16G.tar -C /mnt/data1/bench_16G

The purpose of mlock()ing memory is to artificially shrink the available memory so we can test the scenario where the DB is larger than memory. This saves us the time of prepopulating an enormous DB.

To mlock 26GB, we can compile this simple program (https://gist.github.com/ajkr/f7501a1177647e9a45cb636002c76e39) and run it as root:

$ gcc ./mlock.c -o ./mlock
$ chmod +x ./mlock
$ sudo ./mlock 26 $[1024*1024*1024] &
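
The linked gist is a small C program. Purely as an illustration of the same idea, here is a rough Go sketch (assuming golang.org/x/sys/unix; the argument handling is simplified relative to the gist, and running it still requires root or a raised RLIMIT_MEMLOCK):

package main

import (
    "log"
    "os"
    "strconv"
    "time"

    "golang.org/x/sys/unix"
)

// Keep references to the locked buffers so the GC never reclaims them.
var locked [][]byte

func main() {
    // e.g. "sudo ./mlock 26" pins 26 GiB of memory in 1 GiB chunks.
    if len(os.Args) < 2 {
        log.Fatal("usage: mlock <GiB>")
    }
    gib, err := strconv.Atoi(os.Args[1])
    if err != nil {
        log.Fatal(err)
    }
    for i := 0; i < gib; i++ {
        buf := make([]byte, 1<<30)
        // mlock(2) faults the pages in and pins them in RAM, shrinking
        // what is left over for the page cache.
        if err := unix.Mlock(buf); err != nil {
            log.Fatalf("mlock: %v", err)
        }
        locked = append(locked, buf)
    }
    // Stay alive so the memory remains locked while the benchmark runs.
    for {
        time.Sleep(time.Hour)
    }
}

Either way, the effect is the same: only ~4GB of memory is left for the page cache, so most sstable reads have to hit the device.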

The benchmark commands are below. Make sure you've built pebble with -tags rocksdb for the second command to work:

$ ./pebble ycsb /mnt/data1/bench_16G --workload scan=100 --concurrency 16 --prepopulated-keys $[16*1024*1024*1024/64] --initial-keys 0 --keys uniform --duration 2m --cache $[1*1024*1024*1024] --scans 1000
$ ./pebble ycsb /mnt/data1/bench_16G --workload scan=100 --concurrency 16 --prepopulated-keys $[16*1024*1024*1024/64] --initial-keys 0 --keys uniform --duration 2m --cache $[1*1024*1024*1024] --scans 1000 --rocksdb

Results:

engine    ops/sec  p50(ms)  p99(ms)
pebble     2508.9      6.6     12.1
rocksdb    6345.6      2.6      4.5

It's somewhat informative to look at the resource usage of those benchmark commands. For Pebble, measuring resource usage with /usr/bin/time shows:

93.30user 162.88system 2:00.73elapsed 212%CPU (0avgtext+0avgdata 2417616maxresident)k
186368744inputs+608outputs (282major+505178minor)pagefaults 0swaps

While with RocksDB we see:

286.87user 134.34system 2:00.68elapsed 349%CPU (0avgtext+0avgdata 2517884maxresident)k
186770144inputs+504outputs (6186major+640093minor)pagefaults 0swaps

This indicates the engines do the same amount of read I/O, while RocksDB achieves much higher user-visible throughput than Pebble. Peter's experiments confirmed this is indeed due to the different readahead settings. We should use POSIX_FADV_RANDOM (i.e., hint the kernel to do zero readahead) to be competitive on short or medium scans.
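
As a sketch of what issuing that hint looks like from Go (this is not Pebble's actual code; it assumes golang.org/x/sys/unix, and the helper name and path are made up):

package main

import (
    "log"
    "os"

    "golang.org/x/sys/unix"
)

// openRandomAccess is a hypothetical helper: it opens a file for reading and
// advises the kernel to expect random access, which disables readahead on
// that descriptor.
func openRandomAccess(path string) (*os.File, error) {
    f, err := os.Open(path)
    if err != nil {
        return nil, err
    }
    // Offset 0 and length 0 apply the advice to the entire file.
    if err := unix.Fadvise(int(f.Fd()), 0, 0, unix.FADV_RANDOM); err != nil {
        f.Close()
        return nil, err
    }
    return f, nil
}

func main() {
    // Illustrative path only.
    f, err := openRandomAccess("/mnt/data1/bench_16G/000123.sst")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()
}

With FADV_RANDOM set, each block read costs roughly the requested size in device I/O rather than a full 128KB readahead window.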

@ajkr ajkr self-assigned this Jul 31, 2019
@ajkr
Contributor Author

ajkr commented Jul 31, 2019

@itsbilal this task could give you some exposure to our low-level performance work. Feel free to claim it from me if you're interested.

@petermattis
Collaborator

That's interesting. Pebble is doing the same amount of I/O but using less CPU and achieving lower scan ops/sec. To speculate: maybe it's I/O bound in both cases, but the useful I/Os are diluted in Pebble's case because we didn't set POSIX_FADV_RANDOM on the file descriptors, so we're getting the default kernel readahead.

The differences in user and system time are both quite dramatic. It would be a fairly simple experiment to mark all files as POSIX_FADV_RANDOM and see if that fixes the discrepancy. Of course, it could be something else. I wonder what the CPU and blocking profiles show.

@ajkr
Copy link
Contributor Author

ajkr commented Jul 31, 2019

Talked with Bilal. He'll try out the POSIX_FADV_RANDOM experiment like we described, though not necessarily in the next few weeks. I think this is a good intro to performance work since the gap is so large that it shouldn't be caused by something super subtle.

I will improve the issue description to be more instructive. It was originally just a dump of my thoughts/commands so as not to forget, under the assumption I'd be working on it.

@petermattis
Collaborator

You've dropped this bone in front of me. My will power is not strong enough to wait days to pick it up, let alone weeks.

I was able to replicate your numbers above. Adding in a hack to fadvise(..., POSIX_FADV_RANDOM) immediately after opening any file results in Pebble achieving:

____optype__elapsed_____ops(total)___ops/sec(cum)__keys/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
  scan_100   120.0s         757011         6308.4      6308373.9      2.5      2.4      3.4      8.4   3623.9

The higher p99 latencies all seemed to occur at startup, after which p99 settled down to around 3.7ms, but that startup regime skewed the overall numbers. I saw similar behavior from RocksDB: high initial p99 latencies which then leveled off at around 3.7ms. Here are the numbers from my RocksDB run:

____optype__elapsed_____ops(total)___ops/sec(cum)__keys/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
  scan_100   120.1s         757810         6308.8      6308828.3      2.5      2.6      3.4      5.0     71.3

POSIX_FADV_NORMAL results in the default readahead specified by the block device, while POSIX_FADV_RANDOM causes no additional readahead; reads are performed at whatever size the application requests. The nvme1n1 device shows:

ubuntu@ip-10-12-40-64:~$ cat /sys/block/nvme1n1/bdi/read_ahead_kb
128

Given that we're using 32kb blocks in this database, I'm surprised the perf diff wasn't higher than 2.5x.

I also tested out different scan sizes, and with the POSIX_FADV_RANDOM hack, Pebble and RocksDB appear to have equivalent performance for scan sizes of 1, 10, and 100.

@itsbilal Sorry for stealing your thunder. I'll leave implementing real support for POSIX_FADV_RANDOM in your hands.

@petermattis petermattis assigned ajkr and unassigned ajkr Jul 31, 2019
@petermattis petermattis changed the title slowdown compared to rocksdb when DB is larger than memory vfs: open sstables with POSIX_FADV_RANDOM Jul 31, 2019
@ajkr
Contributor Author

ajkr commented Jul 31, 2019

You've dropped this bone in front of me. My will power is not strong enough to wait days to pick it up, let alone weeks.

Ha, I know, opportunities for small changes with a 2x performance improvement don't show up often! I was also thinking of experimenting out of curiosity.

@ajkr
Contributor Author

ajkr commented Jul 31, 2019

Given that we're using 32kb blocks in this database, I'm surprised the perf diff wasn't higher than 2.5x.

Since we're scanning 1000 keys probably some of the readahead is actually useful particularly at the bottom level.

petermattis added a commit that referenced this issue Aug 1, 2019
This is a hack because it sets POSIX_FADV_RANDOM on all opened files,
not just sstables. Is that a problem? We rarely read the WAL or
MANIFEST.

See #198
@petermattis
Collaborator

petermattis@2720e5b contains my hack. See also the fadvise-hack branch.

@itsbilal
Member

itsbilal commented Aug 1, 2019

@petermattis no worries! I overestimated the amount of work on this at first, but it makes total sense to do the hack for now. I assume the proper fix is to only set POSIX_FADV_RANDOM on sstables, which I can do soon enough.

@petermattis
Collaborator

I assume the proper fix is to only set POSIX_FADV_RANDOM on sstables, which I can do soon enough.

Yep. That will involve plumbing a new interface on either vfs.File or vfs.VFS.Open(). Sstables are opened in one place: https://github.com/petermattis/pebble/blob/master/table_cache.go#L335.

The trickiest part is figuring out whether there's some way to test this so that it isn't accidentally regressed.
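
One possible shape for that plumbing, purely as a sketch (the fdGetter interface and helper name below are hypothetical, not Pebble's actual API; unix.Fadvise is Linux-only, so a real version would live behind build tags):

package vfs // sketch only

import "golang.org/x/sys/unix"

// fdGetter is an optional interface a File implementation can satisfy to
// expose its underlying descriptor (os.File already has such a method).
type fdGetter interface {
    Fd() uintptr
}

// RandomReadsHint advises the kernel that f will be read at random offsets,
// disabling readahead on its descriptor. It is a no-op for Files that do not
// expose a descriptor (e.g. in-memory files used in tests).
func RandomReadsHint(f File) {
    if fg, ok := f.(fdGetter); ok {
        _ = unix.Fadvise(int(fg.Fd()), 0, 0, unix.FADV_RANDOM)
    }
}

The table cache could then call such a hint right after opening an sstable in the spot linked above.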

@ajkr
Contributor Author

ajkr commented Aug 7, 2019

I tried rewriting the description in a more informative way. Let me know if you have any questions.

itsbilal added a commit to itsbilal/pebble that referenced this issue Aug 8, 2019
This change calls fadvise with FADV_RANDOM on sstable file descriptors,
to ensure that readahead is disabled. This reduces wasted I/Os when
reading from sstables, since sstable reads especially for short to
medium range scans do not read large enough contiguous blocks to be
able to take advantage of readahead. Instead, readahead ends up reducing
user-visible I/O performance.

See cockroachdb#198 .
@ajkr ajkr removed their assignment Aug 8, 2019
itsbilal added a commit that referenced this issue Aug 12, 2019
This change calls fadvise with FADV_RANDOM on sstable file descriptors,
to ensure that readahead is disabled. This reduces wasted I/Os when
reading from sstables, since sstable reads especially for short to
medium range scans do not read large enough contiguous blocks to be
able to take advantage of readahead. Instead, readahead ends up reducing
user-visible I/O performance.

See #198 .
@itsbilal
Member

Fix landed in master (#215). Closing.
