
Madvise jnr #109

Merged
dstuebe merged 7 commits into master from madvise_jnr on Dec 18, 2019

Conversation

@dstuebe dstuebe commented Nov 18, 2019

Rewrite of #108 using JNR-FFI. Much simpler, with easy integration (Gradle, JitPack) and better unit testing.

Changes:

  • Add a NativeIO madvise system call wrapper using JNR-FFI
  • Advise page tables and headers as 'will need'
  • Add an option to advise blob pages as 'random'
  • Use mapped pages for writing in virtual page files

Need to add local and production validation tests.
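
For context, a minimal sketch of what a JNR-FFI madvise binding looks like (the interface name, constant values, and usage here are illustrative; the actual NativeIO wrapper may differ):

import jnr.ffi.LibraryLoader;
import jnr.ffi.types.size_t;

// Illustrative JNR-FFI binding to libc; the real NativeIO wrapper may differ.
public interface LibC {
    // Linux advice values from <sys/mman.h>
    int MADV_NORMAL = 0, MADV_RANDOM = 1, MADV_SEQUENTIAL = 2, MADV_WILLNEED = 3, MADV_DONTNEED = 4;

    int madvise(@size_t long address, @size_t long length, int advice);
    int getpagesize();
}

// Load the binding and read the system page size:
LibC libc = LibraryLoader.create(LibC.class).load("c");
long pageSize = libc.getpagesize();
// The native address of a MappedByteBuffer still has to be obtained separately
// (e.g. from the JDK's internal Buffer address field) before it can be advised.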

Previous notes from JNI implementation

Local benchmarking shows little impact on IO performance from the use of madvise, which is good since the impact could have been negative. We still need to test under memory pressure at high IO in a production environment; ideally we would develop a more rigorous local test that applies memory pressure.

The write-to-mapped change is an unexpected benefit. It needs careful testing in production since we have gone back and forth on this as a best practice, but the latest iteration looks like a clear win for mapped IO: the benchmark time is cut in half!

Based on https://github.com/apache/lucene-solr/blob/master/lucene/misc/src/java/org/apache/lucene/store/NativePosixUtil.cpp#L288


codecov-io commented Nov 19, 2019

Codecov Report

Merging #109 into master will decrease coverage by 0.53%.
The diff coverage is 76.19%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master     #109      +/-   ##
============================================
- Coverage     89.76%   89.23%   -0.54%     
- Complexity      723      730       +7     
============================================
  Files            56       57       +1     
  Lines          2688     2740      +52     
  Branches        177      180       +3     
============================================
+ Hits           2413     2445      +32     
- Misses          191      206      +15     
- Partials         84       89       +5
Impacted Files Coverage Δ Complexity Δ
...va/com/upserve/uppend/blobs/VirtualPageFileIO.java 83.33% <ø> (ø) 20 <0> (ø) ⬇️
...c/main/java/com/upserve/uppend/blobs/FilePage.java 100% <ø> (ø) 5 <0> (ø) ⬇️
.../java/com/upserve/uppend/cli/CommandBenchmark.java 96.15% <100%> (+0.15%) 6 <0> (ø) ⬇️
.../java/com/upserve/uppend/AppendStorePartition.java 87.82% <100%> (+0.1%) 20 <0> (ø) ⬇️
...main/java/com/upserve/uppend/FileStoreBuilder.java 89.33% <20%> (-10.67%) 34 <3> (+1)
...ava/com/upserve/uppend/AppendOnlyStoreBuilder.java 86.2% <55.55%> (-13.8%) 15 <4> (+3)
src/main/java/com/upserve/uppend/BlockedLongs.java 81.43% <63.63%> (-1.49%) 52 <1> (-3)
...c/main/java/com/upserve/uppend/blobs/NativeIO.java 86.95% <86.95%> (ø) 4 <4> (?)
...java/com/upserve/uppend/blobs/VirtualPageFile.java 82.27% <96.29%> (-0.11%) 62 <13> (+2)
... and 1 more

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a5cbe7f...98b53ec.

@dstuebe dstuebe left a comment

Left a comment on a discrepancy in calculating the address and size.
Please review that carefully.

}

static long alignedSize(long address, int capacity) {
long end = address + capacity;
Contributor Author

Lucene had this as long end = alignedAddress + capacity
This means that the pages you advise for may not cover the entire capacity of the buffer.
https://github.com/apache/lucene-solr/blob/master/lucene/misc/src/java/org/apache/lucene/store/NativePosixUtil.cpp#L305-L308

At this point I am not inclined to license this file under the Lucene Apache license, but I am open to suggestions.


I wouldn't worry about the license. IANAL, but it doesn't look like there's a single line copied verbatim.

I'm skeptical about their approach in general. It makes a lot of assumptions about alignment that don't seem necessary and could break in the future if the underlying kernel implementation changes.

@nicholasnassar nicholasnassar left a comment

Looks good overall.

I'm inclined to make the page alignment code more conservative.

We need to benchmark on the actual Gastronomer hardware before we merge.

}

static long alignedAddress(long address) {
return address & (- pageSize);


The return value of this function will always be <= address. This means that for an unaligned buffer, madvise applies to some memory outside of the buffer. It shouldn't crash because memory is allocated in pageSize increments, but it will affect the management of memory we don't intend it to affect. We might want to consider a more conservative version of this that returns the next aligned address.

return (address + pageSize - 1) & (~(pageSize - 1));
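
To make the difference concrete, a small worked example assuming a 4096-byte page size:

long pageSize = 4096;
long address = 0x10001230L;  // an unaligned buffer start

long roundedDown = address & (-pageSize);                       // 0x10001000: pages before the buffer get advised too
long roundedUp = (address + pageSize - 1) & (~(pageSize - 1));  // 0x10002000: first page boundary at or after the start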


static long alignedSize(long address, int capacity) {
long end = address + capacity;
end = (end + pageSize - 1) & (-pageSize);


I don't understand why the size needs to be page aligned. The Linux man page for madvise requires that the address is page aligned, but doesn't put any restrictions on the size. Just to be conservative, we should probably drop this line so that the madvise applies to the contents of the buffer exactly.

@dstuebe dstuebe Nov 19, 2019

Buffer:     |------------|
Pages: |------|------|------|

I am actually inclined to be aggressive and assert that all three pages should be advised.
We could also make this a boolean flag - aggressive mode!

I think a good solution might be to enforce that the file headers, which are WILLNEED, and the file content, which may be RANDOM, are aligned at 4096. I think this is probably already true, but it would be easy to enforce.


-1 on the feature flag. This isn't something we want to tune. I would rather add a comment that the policy is to apply madvise to memory outside of unaligned buffers and move on.


Just thought of this:

Buffers:    |------A-----|-----B------|
Pages: |------|------|------|------|------|

What if A is WILLNEED and B is RANDOM or vice-versa?


dstuebe commented Nov 19, 2019

@nicholasnassar Linux IO is tightly coupled to the system page size through the page cache and the block layer.
The byte-range access we are used to in read and write is not real; everything is a page-sized block.

There are even libraries that expose O_DIRECT in Java, but then you definitely need to write whole pages.

Here is some additional context that I found helpful before we started this work.
https://github.com/smacke/jaydio
http://blog.mikemccandless.com/2010/06/lucene-and-fadvisemadvise.html
elastic/elasticsearch#27748

So now that we actually have access to the system page size and (I) have a better understanding of how that actually works in VM/IO, let's allow the Uppend objects to be concerned with system page size and alignment.

One alternative that might be worth considering is allowing Uppend lower-level access: explicitly madvise on a file position and size, rather than passing a buffer. We would need to pass something to get the correct offset in VM for the start of the file, though.
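
A rough sketch of what that lower-level call could look like, reusing the libc binding and pageSize from the sketch in the description; the region address and file offset parameters are hypothetical, stand-ins for whatever we would actually pass:

// Hypothetical helper: advise a byte range of a file, given where its mapped region starts in VM.
static void madvise(long regionAddress, long regionFileOffset, long filePosition, long size, int advice) {
    long address = regionAddress + (filePosition - regionFileOffset); // VM address of the file position
    long aligned = address & (-pageSize);                             // round down to a page boundary
    libc.madvise(aligned, size + (address - aligned), advice);        // grow the length so the last byte stays covered
}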


nicholasnassar commented Nov 20, 2019

I see why it's safe to madvise the entire pages at the start and end of unaligned buffers for our case. It's unsafe in the general case. Anything, with any access pattern, could be in those areas. If we used MADV_REMOVE, it could cause real problems.

Rather than providing a version of madvise that accepts a MappedBuffer and will advise beyond the bounds, maybe we should extend MappedBuffer with a WillNeedMappedBuffer and a RandomAccessMappedBuffer so we can put all of that Gastronomer-specific logic in the class.

We should enforce that when headers and content are both found in a page, it gets marked as MADV_WILLNEED. Either that or we should enforce alignment of headers and content like you suggested above, but I don't think it hurts if there's some content that's never paged out because it's on the same page as a header.
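
A minimal sketch of the wrapper idea (the class name, constructor shape, and the NativeIO.madvise helper it calls are all assumptions here, not the existing API):

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Hypothetical: a page that is advised as RANDOM once, when it is mapped.
class RandomAccessPage {
    final MappedByteBuffer buffer;

    RandomAccessPage(FileChannel channel, long position, int size) throws IOException {
        buffer = channel.map(FileChannel.MapMode.READ_WRITE, position, size);
        NativeIO.madvise(buffer, NativeIO.MADV_RANDOM); // assumed helper; the actual NativeIO signature may differ
    }
}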

Arrays.stream(pages)
        .parallel()
        .filter(Objects::nonNull)
        .forEach(MappedByteBuffer::force);
Contributor

Why are you no longer flushing pages inside this method called flush?

Contributor Author

Because flush guarantees durability (written to disk), but I now understand that durability is not required for multiple processes to share state on the same machine through the page cache.

Comment on lines 497 to 500
if (buffer == null) {
    synchronized (pageTables) {
        buffer = pageTables[pageNumber];
        if (buffer == null) {
Contributor

These lines are confusing.

Contributor Author

Agreed - suggestions on how to improve?
You have to check the state of the array again after you enter the synchronized block.
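
For reference, the shape of the double-checked pattern this snippet follows (map(pageNumber) is a hypothetical stand-in for whatever actually maps the page):

MappedByteBuffer buffer = pageTables[pageNumber];    // unsynchronized fast path
if (buffer == null) {
    synchronized (pageTables) {
        buffer = pageTables[pageNumber];              // re-check: another thread may have mapped it already
        if (buffer == null) {
            buffer = map(pageNumber);                 // hypothetical: map the page and cache it
            pageTables[pageNumber] = buffer;
        }
    }
}
return buffer;

A one-line comment at the re-check may be all that is needed to make the intent clear.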

@jeffrey-a-meunier jeffrey-a-meunier left a comment

+1'd but I did leave some questions/comments


dstuebe commented Nov 20, 2019

@nicholasnassar I have explicitly forced alignment in 98b53ec, something I have sought to do since the early days of the project, but now I have the understanding and the page size to do so.

With that change, we can:

  • go with the existing implementation: call madvise on buffers and check the alignment anyway
  • use the existing API but rip out the alignment code and explicitly assume alignment
  • modify the madvise API to take an explicit address and size and expose these directly in the application code

I looked at MappedByteBuffer; it is not easy to extend because it is abstract. How would that work? Would you also extend FileChannel.map to take an advise argument, or wrap the result? I am not sure I see the advantage of this approach yet.


dstuebe commented Dec 18, 2019

Benchmark results:

current master branch:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.69    0.00   68.74    4.15    0.00   26.42
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
md0               0.00     0.00 161300.80 3975.20 14919.81    16.46   185.08     0.00    0.00    0.00    0.00   0.00   0.00
2019-12-18 17:00:30,837 INFO [Timer-0:{}] com.upserve.uppend.cli.benchmark.Benchmark Read:    0.62mb/s  602.00r/s; Write    0.55mb/s 1112.40a/s; Mem 12156.96mb free 16372.00mb total
2019-12-18 17:00:30,983 INFO [Timer-0:{}] com.upserve.uppend.cli.benchmark.Benchmark LookupDataMetrics: Deltas{flush( 402.79 ms/,  67.00 keys/,     1 #), keyLookups(12.020%new,   6463 #exist,    883 #new), searchCache(89.685%hit,  75699 #),  findKey(68386.24 us/,  7346 #), }; Absolute{lookupKeys(2639.45 avg keys/,  2856 max keys/,     86489622 #)};
2019-12-18 17:00:31,201 INFO [Timer-0:{}] com.upserve.uppend.cli.benchmark.Benchmark BlockedLongMetrics: Deltas{blocks( 1134 #), appends(40190.07 us/,   4950 #), reads(22835.63 us/,    2.13 vals/,   2688 #), readLast(   0.00 us/,      0 #)}; Absolute{blocks(745615.13 avg, 747851 max, 95438736 #), appends(1691271.45 avg,    1696665 max,      216482746 #), valsPerBlock(    2.27 avg)};
2019-12-18 17:00:31,315 INFO [Timer-0:{}] com.upserve.uppend.cli.benchmark.Benchmark BlobStoreMetrics: Deltas{pagesAllocated(   40), appends(15412.98 us/,  517.29 bytes/,   4981 #), reads(70163.92 us/,  509.68 b/,   5734 #)}; Absolute{pageCount(13430.31 avg/, 13485 max/,   1719080 #)};
2019-12-18 17:00:31,433 INFO [Timer-0:{}] com.upserve.uppend.cli.benchmark.Benchmark LongBlobStoreMetrics: Deltas{pagesAllocated(    1), appends(1745.16 us/,   21.00 bytes/,    137 #), reads(11864.90 us/,   21.00 b/,   9958 #), writeLongs(  0.00 us/,      0 #), readLongs(5917.60 us/,   6672 #)}; Absolute{pageCount(3588.83 avg/,  3599 max/,    459370 #)};
2019-12-18 17:00:31,534 INFO [Timer-0:{}] com.upserve.uppend.cli.benchmark.Benchmark MutableBlobStoreMetrics: Deltas{pagesAllocated(    0), writes(98263.58 us/, 10618.00 bytes/,      2 #), reads(  0.00 us/,    0.00 b/,      0 #)}; Absolute{pageCount( 768.00 avg/,   768 max/,     98304 #)};

Madvise branch:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          43.51    0.00   19.65    0.98    0.00   35.86
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
md0               0.00     0.00 177995.80 15426.40   908.87   581.17    15.78     0.00    0.00    0.00    0.00   0.00   0.00
2019-12-18 16:58:32,013 INFO [Timer-0:{}] com.upserve.uppend.cli.benchmark.Benchmark Read:   67.85mb/s 53793.20r/s; Write  191.40mb/s 392197.20a/s; Mem 8657.82mb free 16384.00mb total
2019-12-18 16:58:32,017 INFO [Timer-0:{}] com.upserve.uppend.cli.benchmark.Benchmark LookupDataMetrics: Deltas{flush(  15.42 ms/, 186.07 keys/,   558 #), keyLookups(7.772%new, 1974501 #exist, 166392 #new), searchCache(94.204%hit, 22041468 #),  findKey(  54.35 us/, 2140893 #), }; Absolute{lookupKeys(2700.59 avg keys/,  3024 max keys/,     88493058 #)};
2019-12-18 16:58:32,023 INFO [Timer-0:{}] com.upserve.uppend.cli.benchmark.Benchmark BlockedLongMetrics: Deltas{blocks(430220 #), appends( 111.88 us/, 1960137 #), reads(  63.66 us/,    2.59 vals/, 269269 #), readLast(   0.00 us/,      0 #)}; Absolute{blocks(821249.23 avg, 823644 max, 105119901 #), appends(2036258.84 avg,    2042401 max,      260641132 #), valsPerBlock(    2.48 avg)};
2019-12-18 16:58:32,024 INFO [Timer-0:{}] com.upserve.uppend.cli.benchmark.Benchmark BlobStoreMetrics: Deltas{pagesAllocated(15405), appends( 70.33 us/,  515.74 bytes/, 1950932 #), reads(773.44 us/,  514.14 b/, 694740 #)}; Absolute{pageCount(16144.60 avg/, 16210 max/,   2066509 #)};
2019-12-18 16:58:32,025 INFO [Timer-0:{}] com.upserve.uppend.cli.benchmark.Benchmark LongBlobStoreMetrics: Deltas{pagesAllocated(  548), appends( 20.99 us/,   21.00 bytes/, 103466 #), reads( 25.96 us/,   21.00 b/, 1819317 #), writeLongs(  0.00 us/,      0 #), readLongs( 13.93 us/, 1960823 #)}; Absolute{pageCount(3674.92 avg/,  3696 max/,    470390 #)};
2019-12-18 16:58:32,041 INFO [Timer-0:{}] com.upserve.uppend.cli.benchmark.Benchmark MutableBlobStoreMetrics: Deltas{pagesAllocated(    0), writes(7379.28 us/, 11344.39 bytes/,    557 #), reads(  0.00 us/,    0.00 b/,      0 #)}; Absolute{pageCount( 768.00 avg/,   768 max/,     98304 #)}

Hardware:
AWS i3.metal
Linux 5054adf56376 4.15.0-1044-aws #46-Ubuntu SMP Thu Jul 4 13:38:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
In a Docker container, with a 32GB memory limit to create pressure on the page cache
$ java --version
java 10.0.1 2018-04-17
Java(TM) SE Runtime Environment 18.3 (build 10.0.1+10)
Java HotSpot(TM) 64-Bit Server VM 18.3 (build 10.0.1+10, mixed mode)

Run with:

  • madvise
    export JAR=/path/to/jars/uppend-all-0.2.5-7-g98b53ec.dirty.jar
    export TEST_PATH=/path/to/benchmark/madvise.uppend

  • master
    export JAR=/path/to/jars/uppend-all-0.2.5.dirty.jar
    export TEST_PATH=/path/to/benchmark/master.uppend

  • mode
    export MODE=write
    export MODE=read
    export MODE=readwrite

trap "kill 0" EXIT

java -Xmx16g -jar $JAR benchmark -b large -c wide -m $MODE -s large $TEST_PATH & BENCHMARK_PID=$!

iostat -c -d 5 -x -p md0 -m & IOSTAT_PID=$!

wait $BENCHMARK_PID
kill $IOSTAT_PID

./run_test.sh 2>&1 | tee $TEST_PATH.$MODE.log

Procedure:

Run in write mode for each Jar (master/madvise) - run to completion
Run in read mode for each Jar (stopped after a few minutes when rates appear stable)
Run in readwrite mode to demonstrate improvement

Observations

The write mode in the madvise branch demonstrated some glitchy behavior, with write speeds dipping for short periods. Further analysis and improvements are needed there, but the improvement in read and readwrite rates is fantastic.

@dstuebe dstuebe merged commit bdc272a into master Dec 18, 2019
@dstuebe dstuebe deleted the madvise_jnr branch December 18, 2019 18:38