-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Madvise jnr #109
Madvise jnr #109
Conversation
Codecov Report
@@ Coverage Diff @@
## master #109 +/- ##
============================================
- Coverage 89.76% 89.23% -0.54%
- Complexity 723 730 +7
============================================
Files 56 57 +1
Lines 2688 2740 +52
Branches 177 180 +3
============================================
+ Hits 2413 2445 +32
- Misses 191 206 +15
- Partials 84 89 +5
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left comment on discrepancy in calculating the address and size.
Please review that carefully.
} | ||
|
||
static long alignedSize(long address, int capacity) { | ||
long end = address + capacity; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lucene had this as long end = alignedAddress + capacity
This means that the pages you advise for may not cover the entire capacity of the buffer.
https://github.com/apache/lucene-solr/blob/master/lucene/misc/src/java/org/apache/lucene/store/NativePosixUtil.cpp#L305-L308
At this point I am not inclined to license this file under the lucene apache license, but open suggestions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't worry about the license. IANAL, but It doesn't look like there's a single line copied verbatim.
I'm skeptical about their approach in general. It makes a lot of assumptions about alignment that don't seem necessary and could break in the future if the underlying kernel implementation changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall.
I'm inclined to make the page alignment code more conservative.
We need to benchmark on the actual gastronomer hardware before we merge.
} | ||
|
||
static long alignedAddress(long address) { | ||
return address & (- pageSize); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The return value of this function will always be <= address. This means that for an unaligned buffer, madvise
applies to some memory outside of the buffer. It shouldn't crash because memory is allocated in pageSize increments, but it will affect the management of memory we don't intend it to affect. We might want to consider a more conservative version of this that returns the next aligned address.
return (address + pageSize -1) & (~(pageSize-1));
|
||
static long alignedSize(long address, int capacity) { | ||
long end = address + capacity; | ||
end = (end + pageSize - 1) & (-pageSize); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why the size needs to be page aligned. The Linux man page for madvise
requires that the address is page aligned, but doesn't put any restrictions on the size. Just to be conservative, we should probably drop this line so that the madvise
applies to the contents of the buffer exactly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Buffer: |------------|
Pages: |------|------|------|
I am actually inclined to be aggressive and assert that all three pages should be advised.
We could also make this a boolean flag - aggressive mode!
I think a good solution might be to enforce that the file headers which are WILLNEED and the file content which may be RANDOM are aligned at 4096. I think this is probably already true, but would be easy to enforce.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-1 on the feature flag. This isn't something we want to tune. I would rather add a comment that the policy is to apply madvise
to memory outside of unaligned buffers and move on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just thought of this:
Buffers: |------A-----|-----B------|
Pages: |------|------|------|------|------|
What if A is WILLNEED and B is RANDOM or vice-versa?
} | ||
|
||
static long alignedSize(long address, int capacity) { | ||
long end = address + capacity; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't worry about the license. IANAL, but It doesn't look like there's a single line copied verbatim.
I'm skeptical about their approach in general. It makes a lot of assumptions about alignment that don't seem necessary and could break in the future if the underlying kernel implementation changes.
…o system page size
@nicholasnassar Linux IO is tightly coupled to the system page size through the PageCache and the Block Layer. There are even libraries that expose Here is some additional context that I found helpful before we started this work. So now that we actually have access to the system page size and (I) have a better understanding of how that actually works in VM/IO, lets allow the Uppend objects to be concerned with system page size and alignment. One alternative that we might be worth considering is allowing Uppend to have lower level access - to explicitly madvise on a file position and size, rather than passing a buffer. We would need to pass something to get the correct offset in VM for the start of the file though? |
I see why it's safe to Rather than providing a version of We should enforce that when headers and content are both found in a page, it gets marked as MADV_WILLNEED. Either that or we should enforce alignment of headers and content like you suggested above, but I don't think it hurts if there's some content that's never paged out because it's on the same page as a header. |
Arrays.stream(pages) | ||
.parallel() | ||
.filter(Objects::nonNull) | ||
.forEach(MappedByteBuffer::force); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are you no longer flushing pages inside this method called flush
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because flush guarantees durability - written to disk, but I now understand that is not required for multiple processes to share state in the same machine through the page cache.
if (buffer == null) { | ||
synchronized (pageTables) { | ||
buffer = pageTables[pageNumber]; | ||
if (buffer == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These lines are confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed - suggestions on how to improve?
You have to check the state of the array again after you enter the synchronized block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1'd but I did leave some questions/comments
@nicholasnassar I have explicitly forced alignment in 98b53ec, something I have sought to do since early days of the project, but now have the understanding and the page size to do so. With that change, we can:
I looked at MappedByteBuffer, it is not easy to extend because it is abstract. How would that work? Would you also extend FileChannel.map to take an advise argument or wrap the result? I am not sure I see the advantage of this approach yet? |
Benchmark results:current master branch:
Madvise branch:
##Hardware: Run with:
trap "kill 0" EXIT
java -Xmx16g -jar $JAR benchmark -b large -c wide -m $MODE -s large $TEST_PATH & BENCHMARK_PID=$!
iostat -c -d 5 -x -p md0 -m & IOSTAT_PID=$!
wait $BENCHMARK_PID
kill $IOSTAT_PID ./run_test.sh | 2>&1 tee $TEST_PATH.$MODE.log Procedure:Run in write mode for each Jar (master/madvise) - run to completion ObservationsThe write mode in the madvise branch demonstrated some glitchy behavior with write speeds dipping for short periods. Further analysis and improvements are needed there, but the read rate and readwrite rate improvement is fantastic. |
Rewrite of #108 using JNR-FFI. Much simpler with easy integration (Gradle, Jitpack) and better unit testing.
Changes:
Need to add local and production validation tests.
Previous notes from JNI implementation
Local benchmarking shows little impact on IO performance from use of madvise - this is good as impact might have been negative. Need to test under memory pressure at high IO in production environment. It would be ideal to develop a more rigorous local test with memory pressure.
The write to mapped change is an unexpected benefit. Need to test carefully in production as we have gone back and forth on this as a best practice. It looks like the latest iteration is a clear win for mapped IO as the benchmark time is cut in half!
Based on https://github.com/apache/lucene-solr/blob/master/lucene/misc/src/java/org/apache/lucene/store/NativePosixUtil.cpp#L288