Avoid file cache thrashing on Linux with mmapfs by using madvise()? #27748
Comments
I think this should be opened as a Lucene issue?
Lucene already has a directory for this, but it requires native code etc. You can go and use it; adding a custom directory to ES with a plugin is pretty straightforward. Since Java 10 will add the ability to use O_DIRECT on streams / channels, I think I'd want to wait for that and add it as an optional thing we can use if you run on Java 10 (which will come early next year, I assume). I hope this makes sense @micoq. I will close this for now; we can still reopen if needed. Please feel free to continue the discussion here.
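For reference, the Java 10 feature mentioned above is the O_DIRECT open option (com.sun.nio.file.ExtendedOpenOption.DIRECT). A minimal sketch of using it, assuming a 4096-byte filesystem block size for the alignment that O_DIRECT requires:

```java
import com.sun.nio.file.ExtendedOpenOption;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class DirectReadSketch {
    public static void main(String[] args) throws IOException {
        final int block = 4096; // assumed filesystem block size; O_DIRECT needs aligned, block-sized I/O
        try (FileChannel ch = FileChannel.open(Paths.get(args[0]),
                StandardOpenOption.READ, ExtendedOpenOption.DIRECT)) {
            // Over-allocate so an aligned slice of at least one block can be carved out.
            ByteBuffer buf = ByteBuffer.allocateDirect(2 * block).alignedSlice(block);
            buf.limit(block);            // read exactly one aligned block
            int n = ch.read(buf, 0);     // this read bypasses the page cache
            System.out.println("read " + n + " bytes with O_DIRECT");
        }
    }
}
```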
Actually, I didn't test that. In my case, the bottleneck was mostly in the read operations while searching documents. I wanted to know why mmapfs performed badly on large setups compared to niofs (since mmapfs is theoretically better: it avoids extra copies into a user-space buffer).
That is truly interesting. It almost seems that with madvise the OS is a bit more diligent about mapping memory into the cache. This is all pure guessing, but in default mode mmap will initiate quite a bit of read-ahead, while with MADV_RANDOM it does not. This is quite an interesting place to do some research. I am convinced we won't ship any native code in core by default, but there might be room for a plugin here.
@micoq can you tell how much memory your machine has and how much of it you are giving to Elasticsearch when you run these tests?
@s1monw Sure. Elasticsearch is configured with a heap of 24 GB, and compressed pointers are enabled.
I didn't do the test with NIO, but if you map a big file and touch a single byte every 2 MB, Linux will load the entire 2 MB chunk from the disk. Here is a quick and dirty example to illustrate this (drop the caches before executing it):
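A minimal sketch of such a test: map the file in 1 GB windows, touch one byte every 2 MB, then check how much of the file ended up in the page cache (for example with vmtouch or free):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileChannel.MapMode;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class TouchEvery2MB {
    private static final long STEP = 2L << 20;   // touch one byte every 2 MB
    private static final long CHUNK = 1L << 30;  // map the file in 1 GB windows (ByteBuffer size limit)

    public static void main(String[] args) throws IOException {
        try (FileChannel ch = FileChannel.open(Paths.get(args[0]), StandardOpenOption.READ)) {
            long size = ch.size();
            long sum = 0;
            for (long base = 0; base < size; base += CHUNK) {
                int len = (int) Math.min(CHUNK, size - base);
                MappedByteBuffer buf = ch.map(MapMode.READ_ONLY, base, len);
                for (long off = 0; off < len; off += STEP) {
                    sum += buf.get((int) off);   // one byte per 2 MB window
                }
            }
            // Print the sum so the JIT cannot drop the reads; then inspect the page
            // cache (vmtouch <file>, or free) to see how much of the file was loaded.
            System.out.println("checksum: " + sum);
        }
    }
}
```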
Note that on OpenJDK 8 and Linux, if you preload the file with
@micoq maybe you can use |
@childe Thank you, but unfortunately this doesn't have any effect (I tried different values; the default was 8192 blocks in my case). The bottleneck came from the memory management readahead, which is different from the block device readahead. The memory management readahead is only used for mapped files, while the block device readahead applies to all read operations on storage devices, whether the file is accessed through mmap() or regular read() calls. In fact, the 2 MB readahead limit I observed in my tests seems to be hard-coded in the kernel.
Pinging @elastic/es-core-infra |
Hello, I've just uploaded a plugin for Elasticsearch which implements memory mapping with madvise() and direct I/O for merges. It's available here: https://github.com/micoq/native-unix-store
Hello, micoq
Hello @azuresky11, Can you provide some details about the configuration, the queries and the dataset you used to run these tests?
Are the results the same if you enable the plugin but disable mmap in the test index? Finally, it's possible the queries you made read data more sequentially than randomly, so the madvise(MADV_RANDOM) hint would make little difference.
@micoq
It's interesting. So you could try direct I/O for merges, but it's not necessarily better in terms of speed (it only saves the filesystem cache). If I'm not mistaken, the
@micoq
@azuresky11 Anyway, by default
Hello @micoq
Hello @azuresky11, Basically, if your queries do a lot of random accesses on cold data, "nativeunixfs" will load far less data from the storage, so you will see a lower read throughput on the devices. As you can see in my first post, on a single test the average time is not necessarily better than with "mmapfs". However, "nativeunixfs" loads ~1.5 GB of data against ~7.4 GB with "mmapfs" for the same results, so 5.9 GB of cache is wasted by the requests. In some cases, the saved filesystem cache can be used to serve other requests and finally improve the global response time (in my case, I use Elasticsearch for logging, with constant indexing and many parallel requests). In your case, maybe "mmapfs" will be sufficient (and you will not have to manage the native libraries). If you have no performance drops, you can keep it. You can also try the new "hybridfs" store included in the latest version of Elasticsearch (6.7.0). Just a little (m)advice for your tests: between each test, don't forget to drop the caches with echo 3 > /proc/sys/vm/drop_caches.
Thank you @micoq
@s1monw OK, I found why. However, it is not that simple!

To begin with, the 2 MB size hard-coded in the kernel is not the readahead size but the maximum amount of data the kernel can load at once from the storage while performing a readahead operation. Unfortunately, the readahead size of the Elasticsearch data partition was also 2 MB on all my testing machines.

Some systems have a small readahead by default (usually 128 KB), but on large installations with hardware RAID you can have a large readahead (4 MB). In Lucene, some files are read with many "jumps" or "holes". A more graphical example:
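As a rough illustration, a minimal sketch that reads a small chunk every few megabytes through a plain FileChannel (the niofs path); how much of each "hole" the kernel fills in depends on the block device readahead in /sys/block/<dev>/queue/read_ahead_kb. The 4 MB gap and 16 KB read size here are arbitrary choices:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class SparseReadDemo {
    public static void main(String[] args) throws IOException {
        final long gap = 4L << 20;                              // hole between reads: 4 MB (arbitrary)
        final ByteBuffer buf = ByteBuffer.allocate(16 * 1024);  // 16 KB per read (arbitrary)
        try (FileChannel ch = FileChannel.open(Paths.get(args[0]), StandardOpenOption.READ)) {
            long size = ch.size();
            long sum = 0;
            for (long pos = 0; pos < size; pos += gap) {
                buf.clear();
                int n = ch.read(buf, pos);                      // positional read, no mmap involved
                if (n > 0) {
                    sum += buf.get(0);
                }
            }
            // Compare the page cache usage after running this with different
            // values in /sys/block/<dev>/queue/read_ahead_kb.
            System.out.println("checksum: " + sum);
        }
    }
}
```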
This happens with regular reads as well. For some reason, the readahead is always maximal on mapped memory (mmap).
Now, this will not eliminate the cache consumption while merging, and it's not always the best choice to use a small readahead (or MADV_RANDOM).
@micoq We analyzed the same problem with
With https://issues.apache.org/jira/browse/LUCENE-8982 introducing a pure Java based
Is there any progress on this plan?
See apache/lucene#13196, which is going into Lucene 9.11. It is used with Java 21 or later.
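That support builds on the Java foreign function API. For illustration only (assumptions, not Lucene's actual code), a rough sketch of calling posix_madvise on a mapped file with the Java 22 java.lang.foreign API; the POSIX_MADV_RANDOM value of 1 is the usual Linux/glibc constant and should be verified on the target system:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MadviseViaPanama {
    private static final int POSIX_MADV_RANDOM = 1; // usual Linux/glibc value; verify on your system

    public static void main(String[] args) throws Throwable {
        Linker linker = Linker.nativeLinker();
        MethodHandle posixMadvise = linker.downcallHandle(
                linker.defaultLookup().find("posix_madvise").orElseThrow(),
                FunctionDescriptor.of(ValueLayout.JAVA_INT,
                        ValueLayout.ADDRESS, ValueLayout.JAVA_LONG, ValueLayout.JAVA_INT));

        try (FileChannel ch = FileChannel.open(Path.of(args[0]), StandardOpenOption.READ);
             Arena arena = Arena.ofConfined()) {
            // Map the whole file as a MemorySegment and hand it to posix_madvise.
            MemorySegment seg = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size(), arena);
            int rc = (int) posixMadvise.invokeExact(seg, seg.byteSize(), POSIX_MADV_RANDOM);
            if (rc != 0) {
                throw new IllegalStateException("posix_madvise failed: " + rc);
            }
            // Random accesses on seg should no longer trigger aggressive readahead.
            System.out.println(seg.get(ValueLayout.JAVA_BYTE, 0));
        }
    }
}
```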
With recent Java 21/22 changes around Project Panama this is no longer needed, as you can now reach madvise() for a mapped segment from pure Java code. The problem is more
Thanks @uschindler for closing the loop. I'm closing this issue in favor of the Lucene and JDK issues that you shared above.
With mmapfs, search queries load more data than necessary into the page cache. By default, every memory mapping done with mmap() (and FileChannel.map() in Java) on Linux is expected to be read (almost) sequentially. However, when a search request is executed, the inverted index and other data structures seem to be read randomly, so the system loads extra memory pages before and after each needed page. This results in a lot of I/O from the storage to warm up the file cache. In addition, the cache fills up with unnecessary data and can evict the hot pages more quickly, slowing down the next requests.

The problem is more visible with big indices (~1 TB in our case).
To avoid this, Linux provides the madvise() syscall to change the prefetching behavior of memory maps. You can tell the system to avoid loading surrounding pages by calling it with the flag MADV_RANDOM. Unfortunately, Java doesn't use this syscall. Lucene provides a native library to do this, org.apache.lucene.store.NativePosixUtil, but it doesn't seem to be used.

To illustrate this, I made some tests on read-only indices (~60 GB) with a batch of search requests (bool requests on some fields with size=0, just the document count). Each index has been optimized with _forcemerge.

Each column represents a single test and results are in ms:
The caches were dropped before each test (echo 3 > /proc/sys/vm/drop_caches).
The query cache and the request cache have been disabled.
The storage is made of spinning disks.
Elasticsearch version: 5.5.0.
OS: RHEL 7.3
You can see mmapfs is consuming more cache and I/O than niofs.

In the madv column, I patched Lucene (MMapDirectory) to execute madvise(MADV_RANDOM) on each mapped file. This further improves the file cache and I/O consumption. In addition, searches are faster on warmed data.

To do this, I just added a single line in MMapDirectory.java:
libNativePosixUtil.so
(with Lucene sources):And finally, starts Elasticsearch with
-Djava.library.path=/.../lucene-6.6.0/build/native/NativePosixUtil.so
injvm.options
.I didn't know if this solution can be applied in all cases and I didn't test all the cases (replication, merging, other queries...) but it could explain why
mmapfs
badly perform on large setups for searching. Some users reported a similar behavior like here, here or here.I didn't know if there is a similar problem on Windows since it's memory management is different.