Avoid file cache thrashing on Linux with mmapfs by using madvise()? #27748
Comments
I think this should be opened as a Lucene issue?
Lucene already has a directory for this, but it requires native code etc. You can go and use it; adding a custom directory to ES with a plugin is pretty straightforward. Since Java 10 will add the ability to use O_DIRECT on streams / channels, I think I'd want to wait for that and add it as an optional thing we can use if you run on Java 10 (which will come early next year, I assume). I hope this makes sense @micoq. I will close this for now; we can still reopen if needed. Please feel free to continue the discussion here.
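For reference, the Java 10 feature mentioned above is the O_DIRECT open option (com.sun.nio.file.ExtendedOpenOption.DIRECT). A minimal sketch of using it, assuming a 4096-byte filesystem block size for the alignment that O_DIRECT requires:

```java
import com.sun.nio.file.ExtendedOpenOption;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class DirectReadSketch {
    public static void main(String[] args) throws IOException {
        final int block = 4096; // assumed filesystem block size; O_DIRECT needs aligned, block-sized I/O
        try (FileChannel ch = FileChannel.open(Paths.get(args[0]),
                StandardOpenOption.READ, ExtendedOpenOption.DIRECT)) {
            // Over-allocate so an aligned slice of at least one block can be carved out.
            ByteBuffer buf = ByteBuffer.allocateDirect(2 * block).alignedSlice(block);
            buf.limit(block);            // read exactly one aligned block
            int n = ch.read(buf, 0);     // this read bypasses the page cache
            System.out.println("read " + n + " bytes with O_DIRECT");
        }
    }
}
```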
Actually, I didn't test that. In my case, the bottleneck was mostly in the read operations while searching documents. I wanted to know why mmapfs performed badly on large setups compared to niofs (since mmapfs is theoretically better: it avoids extra copies into a user-space buffer).
That is truly interesting. It almost seems that with madvise the OS is a bit more diligent about mapping memory into the cache. This is all pure guessing, but in default mode mmap will initiate quite a bit of read-ahead, while with MADV_RANDOM it does not. This is quite an interesting place to do some research. I am convinced we won't ship any native code in core by default, but there might be room for a plugin here.
@micoq can you tell how much memory your machine has and how much of it you are giving to Elasticsearch when you run these tests?
@s1monw Sure. Elasticsearch is configured with a heap of 24 GB, and compressed pointers are enabled.
I didn't do the test with NIO, but if you map a big file and touch a single byte every 2 MB, Linux will load the entire 2 MB chunk from the disk. Here is a quick and dirty example to illustrate this (drop the caches before executing it):
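A minimal sketch of such a test: map the file in 1 GB windows, touch one byte every 2 MB, then check how much of the file ended up in the page cache (for example with vmtouch or free):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileChannel.MapMode;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class TouchEvery2MB {
    private static final long STEP = 2L << 20;   // touch one byte every 2 MB
    private static final long CHUNK = 1L << 30;  // map the file in 1 GB windows (ByteBuffer size limit)

    public static void main(String[] args) throws IOException {
        try (FileChannel ch = FileChannel.open(Paths.get(args[0]), StandardOpenOption.READ)) {
            long size = ch.size();
            long sum = 0;
            for (long base = 0; base < size; base += CHUNK) {
                int len = (int) Math.min(CHUNK, size - base);
                MappedByteBuffer buf = ch.map(MapMode.READ_ONLY, base, len);
                for (long off = 0; off < len; off += STEP) {
                    sum += buf.get((int) off);   // one byte per 2 MB window
                }
            }
            // Print the sum so the JIT cannot drop the reads; then inspect the page
            // cache (vmtouch <file>, or free) to see how much of the file was loaded.
            System.out.println("checksum: " + sum);
        }
    }
}
```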
Note that on OpenJDK 8 and Linux, if you preload the file with
@micoq maybe you can use |
@childe Thank you, but unfortunately this doesn't have any effect (I tried different values; the default was 8192 blocks in my case). The bottleneck came from the memory management readahead, which is different from the block device readahead. The memory management readahead is only used for mapped files, while the block device readahead applies to all read operations on storage devices, whether the file is accessed through mmap() or regular read() calls. In fact, the 2 MB readahead limit I observed in my tests seems to be hard-coded in the kernel.
Pinging @elastic/es-core-infra |
Hello, I've just uploaded a plugin for Elasticsearch which implements memory mapping with madvise() and direct I/O for merges. It's available here: https://github.com/micoq/native-unix-store
Hello, micoq
Hello @azuresky11, Can you provide some details about the configuration, the queries and the dataset you used to run these tests?
Are the results the same if you enable the plugin but disable mmap in the test index? Finally, it's possible the queries you made read data more sequentially than randomly, so the madvise(MADV_RANDOM) hint would make little difference.
@micoq
It's interesting. So you could try direct I/O for merges, but it's not necessarily better in terms of speed (it only saves the filesystem cache). If I'm not mistaken, the
@micoq
@azuresky11 Anyway, by default
Hello @micoq
Hello @azuresky11, Basically, if your queries do a lot of random accesses on cold data, "nativeunixfs" will load far less data from the storage, so you will see a lower read throughput on the devices. As you can see in my first post, on a single test the average time is not necessarily better than with "mmapfs". However, "nativeunixfs" loads ~1.5 GB of data against ~7.4 GB with "mmapfs" for the same results, so 5.9 GB of cache is wasted by the requests. In some cases, the saved filesystem cache can be used to serve other requests and finally improve the global response time (in my case, I use Elasticsearch for logging, with constant indexing and many parallel requests). In your case, maybe "mmapfs" will be sufficient (and you will not have to manage the native libraries). If you have no performance drops, you can keep it. You can also try the new "hybridfs" store included in the latest version of Elasticsearch (6.7.0). Just a little (m)advice for your tests: between each test, don't forget to drop the caches with echo 3 > /proc/sys/vm/drop_caches.
Thank you @micoq
@s1monw OK, I found why. However, it is not that simple!

To begin with, the 2 MB size hard-coded in the kernel is not the readahead size but the maximum amount of data the kernel can load at once from the storage while performing a readahead operation. Unfortunately, the readahead size of the Elasticsearch data partition was also 2 MB on all my testing machines.

Some systems have a small readahead by default (usually 128 KB), but on large installations with hardware RAID you can have a large readahead (4 MB). In Lucene, some files are read with many "jumps" or "holes". A more graphical example:
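As a rough illustration, a minimal sketch that reads a small chunk every few megabytes through a plain FileChannel (the niofs path); how much of each "hole" the kernel fills in depends on the block device readahead in /sys/block/<dev>/queue/read_ahead_kb. The 4 MB gap and 16 KB read size here are arbitrary choices:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class SparseReadDemo {
    public static void main(String[] args) throws IOException {
        final long gap = 4L << 20;                              // hole between reads: 4 MB (arbitrary)
        final ByteBuffer buf = ByteBuffer.allocate(16 * 1024);  // 16 KB per read (arbitrary)
        try (FileChannel ch = FileChannel.open(Paths.get(args[0]), StandardOpenOption.READ)) {
            long size = ch.size();
            long sum = 0;
            for (long pos = 0; pos < size; pos += gap) {
                buf.clear();
                int n = ch.read(buf, pos);                      // positional read, no mmap involved
                if (n > 0) {
                    sum += buf.get(0);
                }
            }
            // Compare the page cache usage after running this with different
            // values in /sys/block/<dev>/queue/read_ahead_kb.
            System.out.println("checksum: " + sum);
        }
    }
}
```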
This happens with regular reads as well. For some reason, the readahead is always maximal on mapped memory (mmap).
Now, this will not eliminate the cache consumption while merging, and it's not always the best choice to use a small readahead (or MADV_RANDOM).
@micoq We analyzed the same problem with
With https://issues.apache.org/jira/browse/LUCENE-8982 introducing a pure Java based
Is there any progress on this plan?
See apache/lucene#13196, which is going into Lucene 9.11. It is used with Java 21 or later.
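That support builds on the Java foreign function API. For illustration only (assumptions, not Lucene's actual code), a rough sketch of calling posix_madvise on a mapped file with the Java 22 java.lang.foreign API; the POSIX_MADV_RANDOM value of 1 is the usual Linux/glibc constant and should be verified on the target system:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MadviseViaPanama {
    private static final int POSIX_MADV_RANDOM = 1; // usual Linux/glibc value; verify on your system

    public static void main(String[] args) throws Throwable {
        Linker linker = Linker.nativeLinker();
        MethodHandle posixMadvise = linker.downcallHandle(
                linker.defaultLookup().find("posix_madvise").orElseThrow(),
                FunctionDescriptor.of(ValueLayout.JAVA_INT,
                        ValueLayout.ADDRESS, ValueLayout.JAVA_LONG, ValueLayout.JAVA_INT));

        try (FileChannel ch = FileChannel.open(Path.of(args[0]), StandardOpenOption.READ);
             Arena arena = Arena.ofConfined()) {
            // Map the whole file as a MemorySegment and hand it to posix_madvise.
            MemorySegment seg = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size(), arena);
            int rc = (int) posixMadvise.invokeExact(seg, seg.byteSize(), POSIX_MADV_RANDOM);
            if (rc != 0) {
                throw new IllegalStateException("posix_madvise failed: " + rc);
            }
            // Random accesses on seg should no longer trigger aggressive readahead.
            System.out.println(seg.get(ValueLayout.JAVA_BYTE, 0));
        }
    }
}
```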
With recent Java 21/22 changes around Project Panama this is no longer needed, as you can now reach madvise() for a mapped segment from pure Java code. The problem is more
Thanks @uschindler for closing the loop. I'm closing this issue in favor of the Lucene and JDK issues that you shared above.
With mmapfs, search queries load more data than necessary into the page cache. By default, every memory mapping done with mmap() (and FileChannel.map() in Java) on Linux is expected to be read (almost) sequentially. However, when a search request is executed, the inverted index and other data structures seem to be read randomly, so the system loads extra memory pages before and after each needed page. This results in a lot of I/O from the storage to warm up the file cache. In addition, the cache fills up with unnecessary data and can evict the hot pages more quickly, slowing down the next requests.

The problem is more visible with big indices (~1 TB in our case).
To avoid this, Linux provides the madvise() syscall to change the prefetching behavior of memory maps. You can tell the system to avoid loading surrounding pages by calling it with the flag MADV_RANDOM. Unfortunately, Java doesn't use this syscall. Lucene provides a native library to do this, org.apache.lucene.store.NativePosixUtil, but it doesn't seem to be used.

To illustrate this, I made some tests on read-only indices (~60 GB) with a batch of search requests (bool requests on some fields with size=0, just the document count). Each index has been optimized with _forcemerge.

Each column represents a single test and results are in ms:
The caches were dropped before each test (echo 3 > /proc/sys/vm/drop_caches).
The query cache and the request cache have been disabled.
The storage is made of spinning disks.
Elasticsearch version: 5.5.0.
OS: RHEL 7.3
You can see mmapfs is consuming more cache and I/O than niofs.

In the madv column, I patched Lucene (MMapDirectory) to execute madvise(MADV_RANDOM) on each mapped file. This further improves the file cache and I/O consumption. In addition, searches are faster on warmed data.

To do this, I just added a single line in MMapDirectory.java:
libNativePosixUtil.so
(with Lucene sources):And finally, starts Elasticsearch with
-Djava.library.path=/.../lucene-6.6.0/build/native/NativePosixUtil.so
injvm.options
.I didn't know if this solution can be applied in all cases and I didn't test all the cases (replication, merging, other queries...) but it could explain why
mmapfs
badly perform on large setups for searching. Some users reported a similar behavior like here, here or here.I didn't know if there is a similar problem on Windows since it's memory management is different.