8252500: ZGC on aarch64: Unable to allocate heap for certain Linux kernel configurations #40
Conversation
…rnel configurations
👋 Welcome back cgo! A progress list of the required criteria for merging this PR into
@mychris The following label will be automatically applied to this pull request: When this pull request is ready to be reviewed, an RFR email will be sent to the corresponding mailing list. If you would like to change these labels, use the
Webrevs
/label add hotspot-gc
@mychris
/label remove hotspot
@mychris
/cc hotspot-gc
@mychris The
#ifdef ASSERT
  fatal("Received %s while probing the address space for the highest valid bit", os::errno_name(errno));
#else // ASSERT
  log_warning(gc)("Received %s while probing the address space for the highest valid bit", os::errno_name(errno));
If we change this line to log_warning_p, then we get this error message in our hs_err files.
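That is, something along these lines (a sketch of the suggested one-word change; the log_warning_p variant does appear later in this thread):

  log_warning_p(gc)("Received %s while probing the address space for the highest valid bit", os::errno_name(errno));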
// Default value if probing is not implemented for a certain platform: 128TB
#define DEFAULT_MAX_ADDRESS_BIT 47
// Minimum value returned, if probing fails: 64GB
#define MINIMUM_MAX_ADDRESS_BIT 36
I think these could/should be typed static constants instead of defines.
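A hedged sketch of what that could look like (the exact spelling is an assumption, not taken from the PR):

// Default value if probing is not implemented for a certain platform: 128TB
static const size_t DEFAULT_MAX_ADDRESS_BIT = 47;
// Minimum value returned, if probing fails: 64GB
static const size_t MINIMUM_MAX_ADDRESS_BIT = 36;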
  for (int i = DEFAULT_MAX_ADDRESS_BIT; i > MINIMUM_MAX_ADDRESS_BIT && max_address_bit == 0; --i) {
    const uintptr_t base_addr = ((uintptr_t) 1U) << i;
    if (msync((void*)base_addr, page_size, MS_ASYNC) == 0) {
      // msync succeeded, the address is valid, and maybe even already mapped.
      max_address_bit = i;
      break;
    }
Is there an off-by-one error here? Taking i == 47 as an example: this means that you test base_address == '10000000 00000000 00000000 00000000 00000000 00000000' (in bits). That is, you test that the address range with the 48th bit set (128T-256T) is usable. However, when 47 is then returned to the caller, it is interpreted as if the 64T-128T range is usable.
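Spelling the question out as arithmetic (nothing here beyond what the comment above already states):

// base_addr = ((uintptr_t) 1U) << 47 == 0x0000800000000000 == 128 TiB,
// so the probe tests an address with bit 47 set, i.e. the 128T-256T range.
// Returning 47, however, is read by the caller as "47 address bits are
// usable", which only covers addresses below 128T (the 64T-128T range).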
Good catch. I verified your assumption by doing the following:
I returned the result of probe_valid_max_address_bit() + 1 from ZPlatformAddressOffsetBits(). I then executed a VM on a platform with only 3 page table levels, which has less than DEFAULT_MAX_ADDRESS_BIT bits available in the address space. The heap allocation succeeded, which means the current patch has an off-by-one error.
However, I would fix this in ZPlatformAddressOffsetBits(), because I interpret the result of probe_valid_max_address_bit() as the highest bit index of an address which can be allocated. This means that mmap'ing 1 << probe_valid_max_address_bit() should succeed (if the page is not already mapped). I think ZPlatformAddressOffsetBits() should add 1 to the result of probe_valid_max_address_bit().
Sorry, the explanation of my verification was wrong. I did the following:
size_t ZPlatformAddressOffsetBits() {
  const static size_t valid_max_address_bit = probe_valid_max_address_bit();
  const size_t max_address_offset_bits = valid_max_address_bit - 3;
  return max_address_offset_bits + 1;
}
and allocation of the Java heap succeeded.
    if ((uintptr_t) result_addr == base_addr) {
      // address is valid
      max_address_bit = i;
    }
    if (result_addr != MAP_FAILED) {
      munmap(result_addr, page_size);
    }
  }
If you swap the order of these if-statements, you could add a break after line 172. This way the exit of the loop would look like the one you have after msync. You would also be able to get rid of the '&& max_address_bit == 0' check in the for statement, as sketched below.
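A hedged sketch of the reordering being suggested (assembled from the snippet above, not the author's exact follow-up commit):

    // Since msync failed with ENOMEM, the page might not be mapped.
    // Try to map it, to see if the address is valid.
    void* const result_addr = mmap((void*) base_addr, page_size, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0);
    if (result_addr != MAP_FAILED) {
      munmap(result_addr, page_size);
    }
    if ((uintptr_t) result_addr == base_addr) {
      // address is valid, exit the loop like in the msync case above
      max_address_bit = i;
      break;
    }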
  uintptr_t high_addr = ((uintptr_t) 1U) << DEFAULT_MAX_ADDRESS_BIT;
  void* result_addr = mmap((void*) high_addr, page_size, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0);
  if (result_addr != MAP_FAILED) {
    max_address_bit = BitsPerSize_t - count_leading_zeros((size_t) result_addr) - 1;
It's not obvious to me that the '- 1' is correct here.
Let's assume the probing failed (for whatever reason) and we are actually able to allocate at 1U << DEFAULT_MAX_ADDRESS_BIT:
00000000_00000000_10000000_00000000_00000000_00000000_00000000_00000000 = 1U << DEFAULT_MAX_ADDRESS_BIT
mmap succeeds and returns exactly that address. Now the line would evaluate to the following:
64 - 16 - 1 = 47 <=> DEFAULT_MAX_ADDRESS_BIT
I think this looks OK now that you moved the +1 to ZPlatformAddressOffsetBits().
  size_t max_address_bit = 0;
  const size_t page_size = os::vm_page_size();
  for (int i = DEFAULT_MAX_ADDRESS_BIT; i > MINIMUM_MAX_ADDRESS_BIT && max_address_bit == 0; --i) {
There's a bit of a mismatch between the type of 'i' and 'max_address_bit'. See also 'BitsPerSize_t - count_leading_zeros((size_t) result_addr) - 1;'. I think that changing max_address_bit to be an int, and getting rid of the size_t's here will fix that.
I did it the other way around, and made 'i' of type size_t, since 'i' is the only int variable left and all the others are of type size_t.
Changed the log_warning to a log_warning_p, so it is reported in the hs_err file. munmap the result of mmap directly after the mmap call, to be able to break out of the loop instead of checking the max_address_bit == 0 precondition. Fixed an off-by-one mistake.
Thanks for your review, Stefan. I integrated your suggestions and will execute JTreg again before integration, since it takes very long on my test devices and we might need to make more adjustments to this PR.
#include "gc/z/zGlobals.hpp" | ||
#include "runtime/globals.hpp" | ||
#include "utilities/globalDefinitions.hpp" | ||
#include "utilities/powerOfTwo.hpp" | ||
|
||
#include <sys/mman.h> |
Should this also be guarded by #ifdef LINUX?
Two small nits, then I think this is good from my point of view. It would be good to get an extra pair of eyes to go through and verify that the bit calculation parts are correct.
      munmap(result_addr, page_size);
    }
  }
  log_info_p(gc, init)("Probing address space for the highest valid bit: %zu", max_address_bit);
I mentioned this during our pre-review, but I see that it wasn't updated: %zu should be using SIZE_FORMAT.
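A hedged sketch of the requested change, using HotSpot's SIZE_FORMAT macro for size_t values (exact final line not quoted in this thread):

  log_info_p(gc, init)("Probing address space for the highest valid bit: " SIZE_FORMAT, max_address_bit);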
#ifdef ASSERT
  fatal("Received %s while probing the address space for the highest valid bit", os::errno_name(errno));
#else // ASSERT
  log_warning_p(gc)("Received %s while probing the address space for the highest valid bit", os::errno_name(errno));
Should explicitly include runtime/os.hpp.
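That is, adding the following near the other includes (a sketch of the suggestion, not the exact final diff):

#include "runtime/os.hpp"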
Thanks for your in-depth review, Stefan. Do you know if I can mark a PR as requiring more than one reviewer, or do I just wait?
I couldn't find a way to add Reviewers. Maybe pinging them? @pliden @fisk @stooart-mon
static size_t probe_valid_max_address_bit() {
#ifdef LINUX
  size_t max_address_bit = 0;
  const size_t page_size = os::vm_page_size();
  for (int i = DEFAULT_MAX_ADDRESS_BIT; i > MINIMUM_MAX_ADDRESS_BIT && max_address_bit == 0; --i) {
  for (size_t i = DEFAULT_MAX_ADDRESS_BIT; i > MINIMUM_MAX_ADDRESS_BIT; --i) {
Should the loop condition be MINIMUM_MAX_ADDRESS_BIT inclusive instead of exclusive? It seems weird that it is exclusive, when we later check 'if max_address_bit < MINIMUM_MAX_ADDRESS_BIT return MINIMUM_MAX_ADDRESS_BIT', which seems like exactly what would happen if the loop had advanced to MINIMUM_MAX_ADDRESS_BIT (and then finished).
Not sure how to comment a few lines below, where you conditionally return MINIMUM_MAX_ADDRESS_BIT, in this new cool GUI. Anyway, you could return MIN2(MINIMUM_MAX_ADDRESS_BIT, max_address_bit) to make the intention more clear.
We are not processing MINIMUM_MAX_ADDRESS_BIT in the loop, because this is the absolute minimum that should be returned. So, regardless of the probing result, we are going to return that value. Therefore, it is a waste of time to actually perform the probe.
The check max_address_bit < MINIMUM_MAX_ADDRESS_BIT is done because, after the probing (if it failed and we didn't find any valid page), right after the for loop we do a single mmap and use the returned address to calculate the max_address_bit. So max_address_bit could then be lower than MINIMUM_MAX_ADDRESS_BIT.
For the return part, you are right, we could simplify it by using MAX2, since we are interested in the highest bit, and MINIMUM_MAX_ADDRESS_BIT is the absolute minimum we want to return.
I will push an update for the return part.
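A minimal sketch of the simplified return being discussed (MAX2 rather than MIN2, since MINIMUM_MAX_ADDRESS_BIT is a floor on the result; the cast is an assumption needed if the minimum stays a plain define):

  // Never return less than the platform minimum, even if probing came up short.
  return MAX2(max_address_bit, (size_t) MINIMUM_MAX_ADDRESS_BIT);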
I see. Incremental looks good.
…if to make the intention more clear.
@mychris This change now passes all automated pre-integration checks. In addition to the automated checks, the change must also fulfill all project specific requirements. After integration, the commit message will be:
Since the source branch of this PR was last updated there have been 12 commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid automatic rebasing, please merge. As you do not have Committer status in this project, an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@stefank, @fisk, @pliden) but any other Committer may sponsor as well. ➡️ To flag this PR as ready for integration with the above commit message, type
fatal("Received %s while probing the address space for the highest valid bit", os::errno_name(errno)); | ||
#else // ASSERT | ||
log_warning_p(gc)("Received %s while probing the address space for the highest valid bit", os::errno_name(errno)); |
Please turn the two %s into '%s', otherwise the output will look strange.
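For illustration, the two messages would then read along these lines (a sketch of the requested quoting, not the exact commit):

  fatal("Received '%s' while probing the address space for the highest valid bit", os::errno_name(errno));
  log_warning_p(gc)("Received '%s' while probing the address space for the highest valid bit", os::errno_name(errno));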
    }
    // Since msync failed with ENOMEM, the page might not be mapped.
    // Try to map it, to see if the address is valid.
    void* result_addr = mmap((void*) base_addr, page_size, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0);
Please make this "void* const result_addr = ...".
  uintptr_t high_addr = ((uintptr_t) 1U) << DEFAULT_MAX_ADDRESS_BIT;
  void* result_addr = mmap((void*) high_addr, page_size, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0);
Same here, please make these two locals const.
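That is, something along these lines (a sketch of the requested const-ness, not the exact final diff):

  const uintptr_t high_addr = ((uintptr_t) 1U) << DEFAULT_MAX_ADDRESS_BIT;
  void* const result_addr = mmap((void*) high_addr, page_size, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0);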
Looks good!
I executed JTreg again last night with both kernel configurations. The last commit was obviously not included, but I think it is trivial enough to not require another test run.
/integrate
/sponsor
@pliden @mychris Since your change was applied there have been 12 commits pushed to the
Your commit was automatically rebased without conflicts. Pushed as commit 73ba3ae. 💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.
The patch introduces a new function to probe for the highest valid bit in the virtual address space for userspace programs on Linux.
I guarded the whole implementation to only probe on Linux; other platforms remain unaffected. Possibly, it would be nicer to move the probing code into an OS+ARCH specific source file. But since this is only a single function, I thought it would be better to put it right next to the caller and guard it with an #ifdef LINUX.
The probing mechanism uses a combination of msync + mmap, to first check if the address is valid using msync (if msync succeeds, the address was valid). If msync fails, mmap is used to check if msync failed because the memory wasn't mapped, or if it failed because the address is invalid. Due to some undefined behavior (documented in the msync man page), I also use a single mmap at the end, if the msync approach failed before.
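For reference, here is a condensed sketch of the whole probing function, assembled from the code snippets quoted in the review comments above. The exact control flow (in particular the handling of errno values other than ENOMEM) is my reconstruction under those snippets' assumptions, not the integrated change:

static size_t probe_valid_max_address_bit() {
#ifdef LINUX
  size_t max_address_bit = 0;
  const size_t page_size = os::vm_page_size();
  for (size_t i = DEFAULT_MAX_ADDRESS_BIT; i > MINIMUM_MAX_ADDRESS_BIT; --i) {
    const uintptr_t base_addr = ((uintptr_t) 1U) << i;
    if (msync((void*)base_addr, page_size, MS_ASYNC) == 0) {
      // msync succeeded, the address is valid, and maybe even already mapped.
      max_address_bit = i;
      break;
    }
    if (errno != ENOMEM) {
      // Unexpected error: give up on msync probing and fall through to the
      // single-mmap fallback below.
      log_warning_p(gc)("Received '%s' while probing the address space for the highest valid bit", os::errno_name(errno));
      break;
    }
    // Since msync failed with ENOMEM, the page might not be mapped.
    // Try to map it, to see if the address is valid.
    void* const result_addr = mmap((void*) base_addr, page_size, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0);
    if (result_addr != MAP_FAILED) {
      munmap(result_addr, page_size);
    }
    if ((uintptr_t) result_addr == base_addr) {
      // mmap returned exactly the requested address, so it is valid.
      max_address_bit = i;
      break;
    }
  }
  if (max_address_bit == 0) {
    // Probing did not find a valid address; do a single hinted mmap and take
    // the highest set bit of whatever address the kernel returns.
    const uintptr_t high_addr = ((uintptr_t) 1U) << DEFAULT_MAX_ADDRESS_BIT;
    void* const result_addr = mmap((void*) high_addr, page_size, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0);
    if (result_addr != MAP_FAILED) {
      max_address_bit = BitsPerSize_t - count_leading_zeros((size_t) result_addr) - 1;
      munmap(result_addr, page_size);
    }
  }
  log_info_p(gc, init)("Probing address space for the highest valid bit: " SIZE_FORMAT, max_address_bit);
  return MAX2(max_address_bit, (size_t) MINIMUM_MAX_ADDRESS_BIT);
#else // LINUX
  return DEFAULT_MAX_ADDRESS_BIT;
#endif // LINUX
}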
I tested msync with different combinations of mappings, and also with sbrk, and it always succeeded, or failed with ENOMEM. I never got back any other error code.
The specified minimum value has been chosen "randomly". The JVM terminates (unable to allocate heap) if this minimum value is smaller than the requested Java heap size, so it might be better to make the minimum dependent on MaxHeapSize rather than a compile-time constant? I didn't want to make the minimum too big, since for aarch64 on Linux the documented minimum would be 38 (see [1]).
I avoided MAP_FIXED_NOREPLACE, because according to the man page, it has been added in Linux 4.17. There are still plenty of stable kernel versions around which will not have that feature, which means we need to implement a workaround for it. Some of my test devices also have a kernel version lower than that.
I executed the HotSpot tier1 JTreg tests on two different aarch64 devices. One with 4KB pages and 3 page levels and the other with 4KB pages and 4 page levels. Tests passed on both devices.
[1] https://www.kernel.org/doc/Documentation/arm64/memory.txt
Download
$ git fetch https://git.openjdk.java.net/jdk pull/40/head:pull/40
$ git checkout pull/40