
Madvise jnr #109

Merged
dstuebe merged 7 commits into master from madvise_jnr on Dec 18, 2019

Conversation

@dstuebe dstuebe commented Nov 18, 2019

Rewrite of #108 using JNR-FFI. Much simpler, with easy integration (Gradle, JitPack) and better unit testing.

Changes:

  • Add a NativeIO madvise system call wrapper using JNR-FFI
  • Advise page tables and headers as 'will need'
  • Add an option to advise blob pages as 'random'
  • Use mapped pages for writing in virtual page files

Need to add local and production validation tests.
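
For context, a minimal sketch of what a JNR-FFI madvise binding looks like (the interface name, constant values, and usage here are illustrative; the actual NativeIO wrapper may differ):

import jnr.ffi.LibraryLoader;
import jnr.ffi.types.size_t;

// Illustrative JNR-FFI binding to libc; the real NativeIO wrapper may differ.
public interface LibC {
    // Linux advice values from <sys/mman.h>
    int MADV_NORMAL = 0, MADV_RANDOM = 1, MADV_SEQUENTIAL = 2, MADV_WILLNEED = 3, MADV_DONTNEED = 4;

    int madvise(@size_t long address, @size_t long length, int advice);
    int getpagesize();
}

// Load the binding and read the system page size:
LibC libc = LibraryLoader.create(LibC.class).load("c");
long pageSize = libc.getpagesize();
// The native address of a MappedByteBuffer still has to be obtained separately
// (e.g. from the JDK's internal Buffer address field) before it can be advised.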

Previous notes from JNI implementation

Local benchmarking shows little impact on IO performance from the use of madvise, which is good since the impact could have been negative. We still need to test under memory pressure at high IO in a production environment; ideally we would develop a more rigorous local test that applies memory pressure.

The write-to-mapped change is an unexpected benefit. It needs careful testing in production since we have gone back and forth on this as a best practice, but the latest iteration looks like a clear win for mapped IO: the benchmark time is cut in half!

Based on https://github.com/apache/lucene-solr/blob/master/lucene/misc/src/java/org/apache/lucene/store/NativePosixUtil.cpp#L288


codecov-io commented Nov 19, 2019

Codecov Report

Merging #109 into master will decrease coverage by 0.53%.
The diff coverage is 76.19%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master     #109      +/-   ##
============================================
- Coverage     89.76%   89.23%   -0.54%     
- Complexity      723      730       +7     
============================================
  Files            56       57       +1     
  Lines          2688     2740      +52     
  Branches        177      180       +3     
============================================
+ Hits           2413     2445      +32     
- Misses          191      206      +15     
- Partials         84       89       +5
Impacted Files Coverage Δ Complexity Δ
...va/com/upserve/uppend/blobs/VirtualPageFileIO.java 83.33% <ø> (ø) 20 <0> (ø) ⬇️
...c/main/java/com/upserve/uppend/blobs/FilePage.java 100% <ø> (ø) 5 <0> (ø) ⬇️
.../java/com/upserve/uppend/cli/CommandBenchmark.java 96.15% <100%> (+0.15%) 6 <0> (ø) ⬇️
.../java/com/upserve/uppend/AppendStorePartition.java 87.82% <100%> (+0.1%) 20 <0> (ø) ⬇️
...main/java/com/upserve/uppend/FileStoreBuilder.java 89.33% <20%> (-10.67%) 34 <3> (+1)
...ava/com/upserve/uppend/AppendOnlyStoreBuilder.java 86.2% <55.55%> (-13.8%) 15 <4> (+3)
src/main/java/com/upserve/uppend/BlockedLongs.java 81.43% <63.63%> (-1.49%) 52 <1> (-3)
...c/main/java/com/upserve/uppend/blobs/NativeIO.java 86.95% <86.95%> (ø) 4 <4> (?)
...java/com/upserve/uppend/blobs/VirtualPageFile.java 82.27% <96.29%> (-0.11%) 62 <13> (+2)
... and 1 more

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a5cbe7f...98b53ec.

@dstuebe dstuebe left a comment

Left a comment on a discrepancy in calculating the address and size.
Please review that carefully.

}

static long alignedSize(long address, int capacity) {
long end = address + capacity;
Contributor Author

Lucene had this as long end = alignedAddress + capacity
This means that the pages you advise for may not cover the entire capacity of the buffer.
https://github.com/apache/lucene-solr/blob/master/lucene/misc/src/java/org/apache/lucene/store/NativePosixUtil.cpp#L305-L308

At this point I am not inclined to license this file under the Lucene Apache license, but I am open to suggestions.


I wouldn't worry about the license. IANAL, but it doesn't look like there's a single line copied verbatim.

I'm skeptical about their approach in general. It makes a lot of assumptions about alignment that don't seem necessary and could break in the future if the underlying kernel implementation changes.

@nicholasnassar nicholasnassar left a comment

Looks good overall.

I'm inclined to make the page alignment code more conservative.

We need to benchmark on the actual Gastronomer hardware before we merge.

}

static long alignedAddress(long address) {
return address & (- pageSize);


The return value of this function will always be <= address. This means that for an unaligned buffer, madvise applies to some memory outside of the buffer. It shouldn't crash because memory is allocated in pageSize increments, but it will affect the management of memory we don't intend it to affect. We might want to consider a more conservative version of this that returns the next aligned address.

return (address + pageSize - 1) & (~(pageSize - 1));
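
To make the difference concrete, a small worked example assuming a 4096-byte page size:

long pageSize = 4096;
long address = 0x10001230L;  // an unaligned buffer start

long roundedDown = address & (-pageSize);                       // 0x10001000: pages before the buffer get advised too
long roundedUp = (address + pageSize - 1) & (~(pageSize - 1));  // 0x10002000: first page boundary at or after the start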


static long alignedSize(long address, int capacity) {
long end = address + capacity;
end = (end + pageSize - 1) & (-pageSize);


I don't understand why the size needs to be page aligned. The Linux man page for madvise requires that the address is page aligned, but doesn't put any restrictions on the size. Just to be conservative, we should probably drop this line so that the madvise applies to the contents of the buffer exactly.

@dstuebe dstuebe Nov 19, 2019

Buffer:     |------------|
Pages: |------|------|------|

I am actually inclined to be aggressive and assert that all three pages should be advised.
We could also make this a boolean flag - aggressive mode!

I think a good solution might be to enforce that the file headers, which are WILLNEED, and the file content, which may be RANDOM, are aligned at 4096. I think this is probably already true, but it would be easy to enforce.


-1 on the feature flag. This isn't something we want to tune. I would rather add a comment that the policy is to apply madvise to memory outside of unaligned buffers and move on.


Just thought of this:

Buffers:    |------A-----|-----B------|
Pages: |------|------|------|------|------|

What if A is WILLNEED and B is RANDOM or vice-versa?


dstuebe commented Nov 19, 2019

@nicholasnassar Linux IO is tightly coupled to the system page size through the page cache and the block layer.
The byte-range access we are used to in read and write is not real; everything is a page-sized block.

There are even libraries that expose O_DIRECT in Java, but then you definitely need to write whole pages.

Here is some additional context that I found helpful before we started this work.
https://github.com/smacke/jaydio
http://blog.mikemccandless.com/2010/06/lucene-and-fadvisemadvise.html
elastic/elasticsearch#27748

So now that we actually have access to the system page size and (I) have a better understanding of how that actually works in VM/IO, let's allow the Uppend objects to be concerned with system page size and alignment.

One alternative that might be worth considering is allowing Uppend lower-level access: explicitly madvise on a file position and size, rather than passing a buffer. We would need to pass something to get the correct offset in VM for the start of the file, though.
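
A rough sketch of what that lower-level call could look like, reusing the libc binding and pageSize from the sketch in the description; the region address and file offset parameters are hypothetical, stand-ins for whatever we would actually pass:

// Hypothetical helper: advise a byte range of a file, given where its mapped region starts in VM.
static void madvise(long regionAddress, long regionFileOffset, long filePosition, long size, int advice) {
    long address = regionAddress + (filePosition - regionFileOffset); // VM address of the file position
    long aligned = address & (-pageSize);                             // round down to a page boundary
    libc.madvise(aligned, size + (address - aligned), advice);        // grow the length so the last byte stays covered
}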


nicholasnassar commented Nov 20, 2019

I see why it's safe to madvise the entire pages at the start and end of unaligned buffers for our case. It's unsafe in the general case. Anything, with any access pattern, could be in those areas. If we used MADV_REMOVE, it could cause real problems.

Rather than providing a version of madvise that accepts a MappedBuffer and will advise beyond the bounds, maybe we should extend MappedBuffer with a WillNeedMappedBuffer and a RandomAccessMappedBuffer so we can put all of that Gastronomer-specific logic in the class.

We should enforce that when headers and content are both found in a page, it gets marked as MADV_WILLNEED. Either that or we should enforce alignment of headers and content like you suggested above, but I don't think it hurts if there's some content that's never paged out because it's on the same page as a header.
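
A minimal sketch of the wrapper idea (the class name, constructor shape, and the NativeIO.madvise helper it calls are all assumptions here, not the existing API):

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Hypothetical: a page that is advised as RANDOM once, when it is mapped.
class RandomAccessPage {
    final MappedByteBuffer buffer;

    RandomAccessPage(FileChannel channel, long position, int size) throws IOException {
        buffer = channel.map(FileChannel.MapMode.READ_WRITE, position, size);
        NativeIO.madvise(buffer, NativeIO.MADV_RANDOM); // assumed helper; the actual NativeIO signature may differ
    }
}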

Arrays.stream(pages)
        .parallel()
        .filter(Objects::nonNull)
        .forEach(MappedByteBuffer::force);
Contributor

Why are you no longer flushing pages inside this method called flush?

Contributor Author

Because flush guarantees durability (written to disk), but I now understand that durability is not required for multiple processes to share state on the same machine through the page cache.

Comment on lines 497 to 500
if (buffer == null) {
    synchronized (pageTables) {
        buffer = pageTables[pageNumber];
        if (buffer == null) {
Contributor

These lines are confusing.

Contributor Author

Agreed - suggestions on how to improve?
You have to check the state of the array again after you enter the synchronized block.
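
For reference, the shape of the double-checked pattern this snippet follows (map(pageNumber) is a hypothetical stand-in for whatever actually maps the page):

MappedByteBuffer buffer = pageTables[pageNumber];    // unsynchronized fast path
if (buffer == null) {
    synchronized (pageTables) {
        buffer = pageTables[pageNumber];              // re-check: another thread may have mapped it already
        if (buffer == null) {
            buffer = map(pageNumber);                 // hypothetical: map the page and cache it
            pageTables[pageNumber] = buffer;
        }
    }
}
return buffer;

A one-line comment at the re-check may be all that is needed to make the intent clear.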

@jeffrey-a-meunier jeffrey-a-meunier left a comment

+1'd but I did leave some questions/comments


dstuebe commented Nov 20, 2019

@nicholasnassar I have explicitly forced alignment in 98b53ec, something I have sought to do since the early days of the project, but now I have the understanding and the page size to do so.

With that change, we can:

  • go with the existing implementation: call madvise on buffers and check the alignment anyway
  • use the existing API but rip out the alignment code and explicitly assume alignment
  • modify the madvise API to take an explicit address and size and expose these directly in the application code

I looked at MappedByteBuffer; it is not easy to extend because it is abstract. How would that work? Would you also extend FileChannel.map to take an advise argument, or wrap the result? I am not sure I see the advantage of this approach yet.


dstuebe commented Dec 18, 2019

Benchmark results:

current master branch:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.69    0.00   68.74    4.15    0.00   26.42
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
md0               0.00     0.00 161300.80 3975.20 14919.81    16.46   185.08     0.00    0.00    0.00    0.00   0.00   0.00
2019-12-18 17:00:30,837 INFO [Timer-0:{}] com.upserve.uppend.cli.benchmark.Benchmark Read:    0.62mb/s  602.00r/s; Write    0.55mb/s 1112.40a/s; Mem 12156.96mb free 16372.00mb total
2019-12-18 17:00:30,983 INFO [Timer-0:{}] com.upserve.uppend.cli.benchmark.Benchmark LookupDataMetrics: Deltas{flush( 402.79 ms/,  67.00 keys/,     1 #), keyLookups(12.020%new,   6463 #exist,    883 #new), searchCache(89.685%hit,  75699 #),  findKey(68386.24 us/,  7346 #), }; Absolute{lookupKeys(2639.45 avg keys/,  2856 max keys/,     86489622 #)};
2019-12-18 17:00:31,201 INFO [Timer-0:{}] com.upserve.uppend.cli.benchmark.Benchmark BlockedLongMetrics: Deltas{blocks( 1134 #), appends(40190.07 us/,   4950 #), reads(22835.63 us/,    2.13 vals/,   2688 #), readLast(   0.00 us/,      0 #)}; Absolute{blocks(745615.13 avg, 747851 max, 95438736 #), appends(1691271.45 avg,    1696665 max,      216482746 #), valsPerBlock(    2.27 avg)};
2019-12-18 17:00:31,315 INFO [Timer-0:{}] com.upserve.uppend.cli.benchmark.Benchmark BlobStoreMetrics: Deltas{pagesAllocated(   40), appends(15412.98 us/,  517.29 bytes/,   4981 #), reads(70163.92 us/,  509.68 b/,   5734 #)}; Absolute{pageCount(13430.31 avg/, 13485 max/,   1719080 #)};
2019-12-18 17:00:31,433 INFO [Timer-0:{}] com.upserve.uppend.cli.benchmark.Benchmark LongBlobStoreMetrics: Deltas{pagesAllocated(    1), appends(1745.16 us/,   21.00 bytes/,    137 #), reads(11864.90 us/,   21.00 b/,   9958 #), writeLongs(  0.00 us/,      0 #), readLongs(5917.60 us/,   6672 #)}; Absolute{pageCount(3588.83 avg/,  3599 max/,    459370 #)};
2019-12-18 17:00:31,534 INFO [Timer-0:{}] com.upserve.uppend.cli.benchmark.Benchmark MutableBlobStoreMetrics: Deltas{pagesAllocated(    0), writes(98263.58 us/, 10618.00 bytes/,      2 #), reads(  0.00 us/,    0.00 b/,      0 #)}; Absolute{pageCount( 768.00 avg/,   768 max/,     98304 #)};

Madvise branch:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          43.51    0.00   19.65    0.98    0.00   35.86
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
md0               0.00     0.00 177995.80 15426.40   908.87   581.17    15.78     0.00    0.00    0.00    0.00   0.00   0.00
2019-12-18 16:58:32,013 INFO [Timer-0:{}] com.upserve.uppend.cli.benchmark.Benchmark Read:   67.85mb/s 53793.20r/s; Write  191.40mb/s 392197.20a/s; Mem 8657.82mb free 16384.00mb total
2019-12-18 16:58:32,017 INFO [Timer-0:{}] com.upserve.uppend.cli.benchmark.Benchmark LookupDataMetrics: Deltas{flush(  15.42 ms/, 186.07 keys/,   558 #), keyLookups(7.772%new, 1974501 #exist, 166392 #new), searchCache(94.204%hit, 22041468 #),  findKey(  54.35 us/, 2140893 #), }; Absolute{lookupKeys(2700.59 avg keys/,  3024 max keys/,     88493058 #)};
2019-12-18 16:58:32,023 INFO [Timer-0:{}] com.upserve.uppend.cli.benchmark.Benchmark BlockedLongMetrics: Deltas{blocks(430220 #), appends( 111.88 us/, 1960137 #), reads(  63.66 us/,    2.59 vals/, 269269 #), readLast(   0.00 us/,      0 #)}; Absolute{blocks(821249.23 avg, 823644 max, 105119901 #), appends(2036258.84 avg,    2042401 max,      260641132 #), valsPerBlock(    2.48 avg)};
2019-12-18 16:58:32,024 INFO [Timer-0:{}] com.upserve.uppend.cli.benchmark.Benchmark BlobStoreMetrics: Deltas{pagesAllocated(15405), appends( 70.33 us/,  515.74 bytes/, 1950932 #), reads(773.44 us/,  514.14 b/, 694740 #)}; Absolute{pageCount(16144.60 avg/, 16210 max/,   2066509 #)};
2019-12-18 16:58:32,025 INFO [Timer-0:{}] com.upserve.uppend.cli.benchmark.Benchmark LongBlobStoreMetrics: Deltas{pagesAllocated(  548), appends( 20.99 us/,   21.00 bytes/, 103466 #), reads( 25.96 us/,   21.00 b/, 1819317 #), writeLongs(  0.00 us/,      0 #), readLongs( 13.93 us/, 1960823 #)}; Absolute{pageCount(3674.92 avg/,  3696 max/,    470390 #)};
2019-12-18 16:58:32,041 INFO [Timer-0:{}] com.upserve.uppend.cli.benchmark.Benchmark MutableBlobStoreMetrics: Deltas{pagesAllocated(    0), writes(7379.28 us/, 11344.39 bytes/,    557 #), reads(  0.00 us/,    0.00 b/,      0 #)}; Absolute{pageCount( 768.00 avg/,   768 max/,     98304 #)}

Hardware:
AWS i3.metal
Linux 5054adf56376 4.15.0-1044-aws #46-Ubuntu SMP Thu Jul 4 13:38:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
In a Docker container, with a 32GB memory limit to create pressure on the page cache
$ java --version
java 10.0.1 2018-04-17
Java(TM) SE Runtime Environment 18.3 (build 10.0.1+10)
Java HotSpot(TM) 64-Bit Server VM 18.3 (build 10.0.1+10, mixed mode)

Run with:

  • madvise
    export JAR=/path/to/jars/uppend-all-0.2.5-7-g98b53ec.dirty.jar
    export TEST_PATH=/path/to/benchmark/madvise.uppend

  • master
    export JAR=/path/to/jars/uppend-all-0.2.5.dirty.jar
    export TEST_PATH=/path/to/benchmark/master.uppend

  • mode
    export MODE=write
    export MODE=read
    export MODE=readwrite

trap "kill 0" EXIT

java -Xmx16g -jar $JAR benchmark -b large -c wide -m $MODE -s large $TEST_PATH & BENCHMARK_PID=$!

iostat -c -d 5 -x -p md0 -m & IOSTAT_PID=$!

wait $BENCHMARK_PID
kill $IOSTAT_PID

./run_test.sh 2>&1 | tee $TEST_PATH.$MODE.log

Procedure:

Run in write mode for each Jar (master/madvise) - run to completion
Run in read mode for each Jar (stopped after a few minutes when rates appear stable)
Run in readwrite mode to demonstrate improvement

Observations

The write mode in the madvise branch demonstrated some glitchy behavior, with write speeds dipping for short periods. Further analysis and improvements are needed there, but the improvement in read and readwrite rates is fantastic.

@dstuebe dstuebe merged commit bdc272a into master Dec 18, 2019
@dstuebe dstuebe deleted the madvise_jnr branch December 18, 2019 18:38