vfs: open sstables with POSIX_FADV_RANDOM #198
Comments
@itsbilal this task could give you some exposure to our low-level performance work. Feel free to claim it from me if you're interested.
The differences in user and system time are both quite dramatic. It would be a fairly simple experiment to mark all files as `POSIX_FADV_RANDOM`.
Talked with Bilal. He'll try out the experiment. I will improve the issue description to be more instructive; it was originally just a dump of my thoughts/commands so as not to forget, under the assumption I'd be working on it.
You've dropped this bone in front of me. My willpower is not strong enough to wait days to pick it up, let alone weeks. I was able to replicate your numbers above. Adding in a hack to set `POSIX_FADV_RANDOM` on all opened files produced a dramatic improvement, roughly 2.5x on this workload, though p99 latencies initially looked high.

The higher p99 latencies seemed to occur all at startup, after which the p99 latencies settled down to around 3.7ms, but that startup regime skewed the overall numbers. I saw similar behavior from RocksDB: high initial p99 latencies which then leveled off at around 3.7ms. Here are the numbers from my RocksDB run:
Given that we're using 32KB blocks in this database, I'm surprised the perf diff wasn't higher than 2.5x. I also tested out different scan sizes. @itsbilal Sorry for stealing your thunder. I'll leave implementing real support for `POSIX_FADV_RANDOM` to you.
Ha, I know, opportunities for small changes with a 2x performance improvement do not show up often! I was also thinking of experimenting out of curiosity.
Since we're scanning 1000 keys, probably some of the readahead is actually useful, particularly at the bottom level.
This is a hack because it sets POSIX_FADV_RANDOM on all opened files, not just sstables. Is that a problem? We rarely read the WAL or MANIFEST. See #198
petermattis@2720e5b contains my hack. See also the `fadvise-hack` branch.
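For reference, here's a minimal sketch of what such a hack amounts to in Go (this is not the actual commit; the helper name and path below are made up for illustration). On Linux, `golang.org/x/sys/unix` exposes `Fadvise`:

```go
package main

import (
	"log"
	"os"

	"golang.org/x/sys/unix"
)

// openWithRandomHint opens a file and hints the kernel that reads will be
// random, which disables readahead for this descriptor on Linux.
func openWithRandomHint(path string) (*os.File, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	// Offset/length of 0 applies the advice to the whole file.
	if err := unix.Fadvise(int(f.Fd()), 0, 0, unix.FADV_RANDOM); err != nil {
		f.Close()
		return nil, err
	}
	return f, nil
}

func main() {
	// Hypothetical sstable path; the hack applied this to every opened file.
	f, err := openWithRandomHint("/path/to/000123.sst")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
}
```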
@petermattis no worries! I overestimated the amount of work on this at first, but it makes total sense to do the hack for now. I assume the proper fix is to only set `POSIX_FADV_RANDOM` on sstables?
Yep. That will involve plumbing a new interface through the `vfs` layer. Trickiest part is to see if there is some way to test this so that it isn't accidentally regressed.
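As a rough illustration of that plumbing (the names here are hypothetical, not Pebble's actual `vfs` API), the idea is an open-time option that only the sstable-reading path passes, so the WAL and MANIFEST keep the default readahead behavior:

```go
package vfsx

import (
	"os"

	"golang.org/x/sys/unix"
)

// OpenOption tweaks a freshly opened file descriptor.
type OpenOption interface {
	Apply(f *os.File) error
}

// RandomReads hints the kernel that reads will be random, disabling readahead.
type RandomReads struct{}

func (RandomReads) Apply(f *os.File) error {
	return unix.Fadvise(int(f.Fd()), 0, 0, unix.FADV_RANDOM)
}

// Open opens path and applies any per-caller options; an sstable reader would
// pass RandomReads{}, while other callers would pass nothing.
func Open(path string, opts ...OpenOption) (*os.File, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	for _, opt := range opts {
		if err := opt.Apply(f); err != nil {
			f.Close()
			return nil, err
		}
	}
	return f, nil
}
```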
I tried rewriting the description in a more informative way. LMK if you have any questions.
This change calls fadvise with `FADV_RANDOM` on sstable file descriptors, to ensure that readahead is disabled. This reduces wasted I/Os when reading from sstables, since sstable reads, especially for short to medium range scans, do not read large enough contiguous blocks to be able to take advantage of readahead. Instead, readahead ends up reducing user-visible I/O performance. See #198.
Fix landed in master (#215). Closing.
Linux supports a syscall, `fadvise64()` (typically accessed via libc's `posix_fadvise()`), for the application to give hints to the kernel about the access pattern it'll follow when reading from a file descriptor. If not specified, the default hint is `POSIX_FADV_NORMAL`. This hint enables device-dependent readahead. On my system it is 128KB (the units for the below command are 512-byte sectors):

This causes a problem for Pebble on short to medium range scans. Since data blocks are only 32KB (and smaller when compressed), doing 128KB readahead causes us to do I/O for data blocks that end up not being needed for the scan. In an I/O-bound use case, this results in decreased user-visible throughput.
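As a hedged illustration (the device name below is an assumption; substitute the device backing your data directory), the per-device readahead setting can also be read from sysfs, where it's reported in KB; `blockdev --getra` reports the same setting in 512-byte sectors (256 sectors = 128KB):

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	// Assumption: "sda" is the device backing the data directory.
	dev := "sda"
	b, err := os.ReadFile("/sys/class/block/" + dev + "/queue/read_ahead_kb")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// A value of 128 corresponds to the 128KB readahead described above.
	fmt.Printf("%s readahead: %s KB\n", dev, strings.TrimSpace(string(b)))
}
```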
To repro, let's use the following setup.
We can generate the DB as follows:
Or we can use my pre-generated DB to save a bit of time:
The purpose of `mlock()`ing memory is to artificially make the available memory smaller so we can test the scenario of DB size larger than memory. This saves us time in not having to prepopulate an enormous DB.

To mlock 26GB, we can compile this simple program (https://gist.github.com/ajkr/f7501a1177647e9a45cb636002c76e39) and run it as root:
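The gist is a small C program; a rough Go sketch of the same idea looks like the following (the 26GiB default matches the setup above, and it needs to run as root or with a raised `RLIMIT_MEMLOCK`):

```go
package main

import (
	"flag"
	"log"
	"os"
	"os/signal"

	"golang.org/x/sys/unix"
)

func main() {
	gib := flag.Int("gib", 26, "amount of memory to lock, in GiB")
	flag.Parse()

	// Mlock faults the pages in and pins them so the kernel can't reclaim
	// them, shrinking the memory available to the page cache.
	buf := make([]byte, *gib<<30)
	if err := unix.Mlock(buf); err != nil {
		log.Fatalf("mlock: %v", err)
	}
	log.Printf("locked %d GiB; Ctrl-C to release", *gib)

	// Hold the memory until interrupted; it's released when the process exits.
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, os.Interrupt)
	<-sig
}
```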
Then the benchmark commands are below. Make sure you've built `pebble` with `-tags rocksdb` for the second command to work:

Results:
It's somewhat informative to look at the resource usage of those benchmark commands. For Pebble, measuring resource usage with `/usr/bin/time` shows:

While with RocksDB we see:
This indicates the engines do the same amount of read I/O, while RocksDB achieves much higher user-visible throughput than Pebble. Peter's experiments confirmed this is indeed due to the different readahead settings. We should use `POSIX_FADV_RANDOM` (i.e., hint the kernel to do zero readahead) to be competitive on short to medium scans.