Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8252500: ZGC on aarch64: Unable to allocate heap for certain Linux kernel configurations #40

Closed
wants to merge 6 commits into from

Conversation

mychris
Copy link
Contributor

@mychris mychris commented Sep 7, 2020

The patch introduces a new function to probe for the highest valid bit in the virtual address space for userspace programs on Linux.

I guarded the whole implementation to only probe on Linux, other platforms will remain unaffected. Possibly, it would be nicer to move the probing code into an OS+ARCH specific source file. But since this is only a single function, I thought it would be better to put it right next to the caller and guard it with an #ifdef LINUX.

The probing mechanism uses a combination of msync + mmap, to first check if the address is valid using msync (if msync succeeds, the address was valid). If msync fails, mmap is used to check if msync failed because the memory wasn't mapped, or if it failed because the address is invalid. Due to some undefined behavior (documented in the msync man page), I also use a single mmap at the end, if the msync approach failed before.

I tested msync with different combinations of mappings, and also with sbrk, and it always succeeded, or failed with ENOMEM. I never got back any other error code.

The specified minimum value has been chosen "randomly". The JVM terminates (unable to allocate heap), if this minimum value is smaller than the requested Java Heap size, so it might be better to make the minimum dependent on the MaxHeapSize and not a compile time constant? I didn't want to make the minimum too big, since for aarch64 on Linux, the documented minimum would be 38 (see [1]).

I avoided MAP_FIXED_NOREPLACE, because according to the man page, it has been added in Linux 4.17. There are still plenty of stable kernel versions around which will not have that feature, which means we need to implement a workaround for it. Some of my test devices also have a kernel version lower than that.

I executed the HotSpot tier1 JTreg tests on two different aarch64 devices. One with 4KB pages and 3 page levels and the other with 4KB pages and 4 page levels. Tests passed on both devices.

[1] https://www.kernel.org/doc/Documentation/arm64/memory.txt


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

  • JDK-8252500: ZGC on aarch64: Unable to allocate heap for certain Linux kernel configurations

Reviewers

Download

$ git fetch https://git.openjdk.java.net/jdk pull/40/head:pull/40
$ git checkout pull/40

@bridgekeeper
Copy link

bridgekeeper bot commented Sep 7, 2020

👋 Welcome back cgo! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr Pull request is ready for review label Sep 7, 2020
@openjdk
Copy link

openjdk bot commented Sep 7, 2020

@mychris The following label will be automatically applied to this pull request: hotspot.

When this pull request is ready to be reviewed, an RFR email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label (add|remove) "label" command.

@openjdk openjdk bot added the hotspot hotspot-dev@openjdk.org label Sep 7, 2020
@mlbridge
Copy link

mlbridge bot commented Sep 7, 2020

Webrevs

@mychris
Copy link
Contributor Author

mychris commented Sep 7, 2020

/label add hotspot-gc

@openjdk openjdk bot added the hotspot-gc hotspot-gc-dev@openjdk.org label Sep 7, 2020
@openjdk
Copy link

openjdk bot commented Sep 7, 2020

@mychris
The hotspot-gc label was successfully added.

@mychris
Copy link
Contributor Author

mychris commented Sep 7, 2020

/label remove hotspot

@openjdk openjdk bot removed the hotspot hotspot-dev@openjdk.org label Sep 7, 2020
@openjdk
Copy link

openjdk bot commented Sep 7, 2020

@mychris
The hotspot label was successfully removed.

@mychris
Copy link
Contributor Author

mychris commented Sep 7, 2020

/cc hotspot-gc

@openjdk
Copy link

openjdk bot commented Sep 7, 2020

@mychris The hotspot-gc label was already applied.

#ifdef ASSERT
fatal("Received %s while probing the address space for the highest valid bit", os::errno_name(errno));
#else // ASSERT
log_warning(gc)("Received %s while probing the address space for the highest valid bit", os::errno_name(errno));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we change this line to log_warning_p, then we get this error message in our hs_err files.

Comment on lines 141 to 144
// Default value if probing is not implemented for a certain platform: 128TB
#define DEFAULT_MAX_ADDRESS_BIT 47
// Minimum value returned, if probing fails: 64GB
#define MINIMUM_MAX_ADDRESS_BIT 36
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could/should be typed static constants instead of defines.

Comment on lines 150 to 156
for (int i = DEFAULT_MAX_ADDRESS_BIT; i > MINIMUM_MAX_ADDRESS_BIT && max_address_bit == 0; --i) {
const uintptr_t base_addr = ((uintptr_t) 1U) << i;
if (msync((void*)base_addr, page_size, MS_ASYNC) == 0) {
// msync suceeded, the address is valid, and maybe even already mapped.
max_address_bit = i;
break;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a one-off error here? Taking i == 47 as an example. This means that you test the base_address == '10000000 00000000 00000000 00000000 00000000 00000000' (in bits). That is, you test that the address range with the 48th bit set (128T-256T) is usable. However, when 47 then is returned to the caller, it interprets it as if the 64T-128T range is usable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I verified your assumption by doing the following:
I returned the result of probe_valid_max_address_bit() + 1 from ZPlatformAddressOffsetBits(). I then executed a VM on a platform with only 3 pagetable levels, which has less than DEFAULT_MAX_ADDRESS_BIT bits available in the address space. The heap allocation succeeded, which means the current patch has a one off error.

However, I would fix this in ZPlatformAddressOffsetBits(). because I interpret the result of probe_valid_max_address_bit() as to be the highest bit index of an address which can be allocated. Which means that mmap the result of 1 << probe_valid_max_address_bit() should succeed (if the page is not already mapped). I think ZPlatformAddressOffsetBits() should add 1 to the result of probe_valid_max_address_bit().

Copy link
Contributor Author

@mychris mychris Sep 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, the explanation of my verification was wrong. I did the following:

size_t ZPlatformAddressOffsetBits() {
const static size_t valid_max_address_bit = probe_valid_max_address_bit();
const size_t max_address_offset_bits = valid_max_address_bit - 3;
return max_address_offset_bits + 1;
}

and allocation of the Java heap succeeded.

Comment on lines 170 to 177
if ((uintptr_t) result_addr == base_addr) {
// address is valid
max_address_bit = i;
}
if (result_addr != MAP_FAILED) {
munmap(result_addr, page_size);
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you swap the order of these if-statements you could add a break after 172.

This way the exit of the loop would look like the one you have after msync. You would also be able to get rid of the ' && max_address_bit == 0' check in the for statement.

uintptr_t high_addr = ((uintptr_t) 1U) << DEFAULT_MAX_ADDRESS_BIT;
void* result_addr = mmap((void*) high_addr, page_size, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0);
if (result_addr != MAP_FAILED) {
max_address_bit = BitsPerSize_t - count_leading_zeros((size_t) result_addr) - 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not obvious to me that the '- 1' is correct here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's assume the probing failed (for whatever reason) and we are actually able to allocate DEFAULT_MAX_ADDRESS_BIT:
00000000_00000000_10000000_00000000_00000000_00000000_00000000_00000000 = 1U << DEFAULT_MAX_ADDRESS_BIT
Mmap succeeds and returns exactly that address. Now the line would evaluate to the following:
64 - 16 - 1 = 47 <=> DEFAULT_MAX_ADDRESS_BIT

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks OK now that you moved the +1 to ZPlatformAddressOffsetBits().

Comment on lines 148 to 150
size_t max_address_bit = 0;
const size_t page_size = os::vm_page_size();
for (int i = DEFAULT_MAX_ADDRESS_BIT; i > MINIMUM_MAX_ADDRESS_BIT && max_address_bit == 0; --i) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a bit of a mismatch between the type of 'i' and 'max_address_bit'. See also 'BitsPerSize_t - count_leading_zeros((size_t) result_addr) - 1;'. I think that changing max_address_bit to be an int, and getting rid of the size_t's here will fix that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did it the other way around, and made 'i' of type size_t, since 'i' is the only int variable left and all the others are of type size_t.

Changed the log_warning to a log_warning_p, so it is reported in hs_err
file.
mummap the result of mmap directly after the mmap call to be able to
break out of the loop, instead of check the max_address_bit ==
0 precondition.
Fixed an 'off by one' mistake.
@mychris
Copy link
Contributor Author

mychris commented Sep 7, 2020

Thanks for your review, Stefan. I integrated your suggestions and will execute JTreg again, before integration, since it takes very long on my test devices and we might need to make more adjustments to this PR.

#include "gc/z/zGlobals.hpp"
#include "runtime/globals.hpp"
#include "utilities/globalDefinitions.hpp"
#include "utilities/powerOfTwo.hpp"

#include <sys/mman.h>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this also be guarded by #ifdef LINUX?

Copy link
Member

@stefank stefank left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two small nits, then I think this is good from my point-of-view. It would be good to get an extra pair of eyes to go through and verify that the bit calculation parts are correct.

munmap(result_addr, page_size);
}
}
log_info_p(gc, init)("Probing address space for the highest valid bit: %zu", max_address_bit);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mentioned this during our pre-review, but I see that it wasn't updated: %zu should be using SIZE_FORMAT.

#ifdef ASSERT
fatal("Received %s while probing the address space for the highest valid bit", os::errno_name(errno));
#else // ASSERT
log_warning_p(gc)("Received %s while probing the address space for the highest valid bit", os::errno_name(errno));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should explicitly include runtime/os.hpp.

@mychris
Copy link
Contributor Author

mychris commented Sep 7, 2020

Thanks for your in depth review Stefan. Do you know if I can mark a PR to required more than one reviewer, or do I just wait?

@stefank
Copy link
Member

stefank commented Sep 7, 2020

I couldn't find a way to add Reviewers. Maybe pinging them? @pliden @fisk @stooart-mon


static size_t probe_valid_max_address_bit() {
#ifdef LINUX
size_t max_address_bit = 0;
const size_t page_size = os::vm_page_size();
for (int i = DEFAULT_MAX_ADDRESS_BIT; i > MINIMUM_MAX_ADDRESS_BIT && max_address_bit == 0; --i) {
for (size_t i = DEFAULT_MAX_ADDRESS_BIT; i > MINIMUM_MAX_ADDRESS_BIT; --i) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the loop condition be MINIMUM_MAX_ADDRESS_BIT inclusive instead of exclusive? Seems weird that it is exclusive, when we later check if max_address_bit < MINIMUM_MAX_ADDRESS_BIT return MINIMUM_MAX_ADDRESS_BIT, which seems like exactly what will happen if the loop would have advanced to MINIMUM_MAX_ADDRESS_BIT (and then finish the loop).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how to comment a few lines below where you conditionally return MINIMUM_MAX_ADDRESS_BIT in this new cool GUI. Anyway, you could use MIN2 to return MIN2(MINIMUM_MAX_ADDRESS_BIT, max_address_bit) to make the intention more clear.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are not processing MINIMUM_MAX_ADDRESS_BIT in the loop, because this is the absolute minimum that should be returned. So, regardless of the probing result, we are going to return that value. Therefor, it is a waste of time to actually perform the probe.

The check max_address_bit < MINIMUM_MAX_ADDRESS_BIT is done, because after the probing (if it failed and we didn't find any valid page), right after the for loop, we do a single mmap and use the returned address to calculate the max_address_bit. So max_address_bit could then be lower than MINIMUM_MAX_ADDRESS_BIT.

For the return part, you are right, we could simplify it by using MAX2, since we are interested in the highest bit, and MINIMUM_MAX_ADDRESS_BIT is the absolute minimum we want to return.

I will push an update for the return part.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Incremental looks good.

@openjdk
Copy link

openjdk bot commented Sep 7, 2020

@mychris This change now passes all automated pre-integration checks. In addition to the automated checks, the change must also fulfill all project specific requirements

After integration, the commit message will be:

8252500: ZGC on aarch64: Unable to allocate heap for certain Linux kernel configurations

Reviewed-by: stefank, eosterlund, pliden
  • If you would like to add a summary, use the /summary command.
  • To credit additional contributors, use the /contributor command.
  • To add additional solved issues, use the /issue command.

Since the source branch of this PR was last updated there have been 12 commits pushed to the master branch:

  • 5dd1ead: 8252767: URLConnection.setRequestProperty throws IllegalAccessError
  • 2cceeed: 8166554: Avoid compilation blocking in OverloadCompileQueueTest.java
  • 188b0bc: 8252868: Clean up unused function from G1MMUTracker
  • 891886b: 8252887: Zero VM is broken after JDK-8252661
  • 7686e87: 8250968: Symlinks attributes not preserved when using jarsigner on zip files
  • 8d6d43c: 8251193: bin/idea.sh is generating wrong folder definitions for JVMCI modules
  • 70d5cac: 8251152: ARM32: jtreg c2 Test8202414 test crash
  • e0d5b5f: 8252627: Make it safe for JFR thread to read threadObj
  • e29c3f6: 8252661: Change SafepointMechanism terminology to talk less about "blocking"
  • e0c8d44: 8252844: Update check configuration to Skara format
  • ... and 2 more: https://git.openjdk.java.net/jdk/compare/5f76deb2e064ca8a48fb8a638c23aad34bf27f9c...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid automatic rebasing, please merge master into your branch, and then specify the current head hash when integrating, like this: /integrate 5dd1eaded7c0aaea199df462b3e2e3b326e1d3b7.

As you do not have Committer status in this project, an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@stefank, @fisk, @pliden) but any other Committer may sponsor as well.

➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Sep 7, 2020
Comment on lines 164 to 166
fatal("Received %s while probing the address space for the highest valid bit", os::errno_name(errno));
#else // ASSERT
log_warning_p(gc)("Received %s while probing the address space for the highest valid bit", os::errno_name(errno));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please turn the two %s into '%s', otherwise the output will look strange.

}
// Since msync failed with ENOMEM, the page might not be mapped.
// Try to map it, to see if the address is valid.
void* result_addr = mmap((void*) base_addr, page_size, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make this "void* const result_addr = ...".

Comment on lines 184 to 185
uintptr_t high_addr = ((uintptr_t) 1U) << DEFAULT_MAX_ADDRESS_BIT;
void* result_addr = mmap((void*) high_addr, page_size, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0);
Copy link
Contributor

@pliden pliden Sep 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, please make these two locals const.

Copy link
Contributor

@pliden pliden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@mychris
Copy link
Contributor Author

mychris commented Sep 8, 2020

I executed JTreg again last night with both kernel configurations. The last commit was obviously not included, but I think it is trivial enough to not require another test run.
Thanks for all your input.

@mychris
Copy link
Contributor Author

mychris commented Sep 8, 2020

/integrate

@openjdk openjdk bot added the sponsor Pull request is ready to be sponsored label Sep 8, 2020
@openjdk
Copy link

openjdk bot commented Sep 8, 2020

@mychris
Your change (at version a89cdcf) is now ready to be sponsored by a Committer.

@pliden
Copy link
Contributor

pliden commented Sep 8, 2020

/sponsor

@openjdk openjdk bot closed this Sep 8, 2020
@openjdk openjdk bot added integrated Pull request has been integrated and removed sponsor Pull request is ready to be sponsored ready Pull request is ready to be integrated rfr Pull request is ready for review labels Sep 8, 2020
@openjdk
Copy link

openjdk bot commented Sep 8, 2020

@pliden @mychris Since your change was applied there have been 12 commits pushed to the master branch:

  • 5dd1ead: 8252767: URLConnection.setRequestProperty throws IllegalAccessError
  • 2cceeed: 8166554: Avoid compilation blocking in OverloadCompileQueueTest.java
  • 188b0bc: 8252868: Clean up unused function from G1MMUTracker
  • 891886b: 8252887: Zero VM is broken after JDK-8252661
  • 7686e87: 8250968: Symlinks attributes not preserved when using jarsigner on zip files
  • 8d6d43c: 8251193: bin/idea.sh is generating wrong folder definitions for JVMCI modules
  • 70d5cac: 8251152: ARM32: jtreg c2 Test8202414 test crash
  • e0d5b5f: 8252627: Make it safe for JFR thread to read threadObj
  • e29c3f6: 8252661: Change SafepointMechanism terminology to talk less about "blocking"
  • e0c8d44: 8252844: Update check configuration to Skara format
  • ... and 2 more: https://git.openjdk.java.net/jdk/compare/5f76deb2e064ca8a48fb8a638c23aad34bf27f9c...master

Your commit was automatically rebased without conflicts.

Pushed as commit 73ba3ae.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@mychris mychris deleted the 8252500 branch September 8, 2020 09:40
e1iu pushed a commit to e1iu/jdk that referenced this pull request Mar 29, 2022
This patch optimizes the backend implementation of VectorMaskToLong for
AArch64, given a more efficient approach to mov value bits from
predicate register to general purpose register as x86 PMOVMSK[1] does,
by using BEXT[2] which is available in SVE2.

With this patch, the final code (input mask is byte type with
SPECIESE_512, generated on an SVE vector reg size of 512-bit QEMU
emulator) changes as below:

Before:

        mov     z16.b, p0/z, #1
        fmov    x0, d16
        orr     x0, x0, x0, lsr openjdk#7
        orr     x0, x0, x0, lsr openjdk#14
        orr     x0, x0, x0, lsr openjdk#28
        and     x0, x0, #0xff
        fmov    x8, v16.d[1]
        orr     x8, x8, x8, lsr openjdk#7
        orr     x8, x8, x8, lsr openjdk#14
        orr     x8, x8, x8, lsr openjdk#28
        and     x8, x8, #0xff
        orr     x0, x0, x8, lsl openjdk#8

        orr     x8, xzr, #0x2
        whilele p1.d, xzr, x8
        lastb   x8, p1, z16.d
        orr     x8, x8, x8, lsr openjdk#7
        orr     x8, x8, x8, lsr openjdk#14
        orr     x8, x8, x8, lsr openjdk#28
        and     x8, x8, #0xff
        orr     x0, x0, x8, lsl openjdk#16

        orr     x8, xzr, #0x3
        whilele p1.d, xzr, x8
        lastb   x8, p1, z16.d
        orr     x8, x8, x8, lsr openjdk#7
        orr     x8, x8, x8, lsr openjdk#14
        orr     x8, x8, x8, lsr openjdk#28
        and     x8, x8, #0xff
        orr     x0, x0, x8, lsl openjdk#24

        orr     x8, xzr, #0x4
        whilele p1.d, xzr, x8
        lastb   x8, p1, z16.d
        orr     x8, x8, x8, lsr openjdk#7
        orr     x8, x8, x8, lsr openjdk#14
        orr     x8, x8, x8, lsr openjdk#28
        and     x8, x8, #0xff
        orr     x0, x0, x8, lsl openjdk#32

        mov     x8, #0x5
        whilele p1.d, xzr, x8
        lastb   x8, p1, z16.d
        orr     x8, x8, x8, lsr openjdk#7
        orr     x8, x8, x8, lsr openjdk#14
        orr     x8, x8, x8, lsr openjdk#28
        and     x8, x8, #0xff
        orr     x0, x0, x8, lsl openjdk#40

        orr     x8, xzr, #0x6
        whilele p1.d, xzr, x8
        lastb   x8, p1, z16.d
        orr     x8, x8, x8, lsr openjdk#7
        orr     x8, x8, x8, lsr openjdk#14
        orr     x8, x8, x8, lsr openjdk#28
        and     x8, x8, #0xff
        orr     x0, x0, x8, lsl openjdk#48

        orr     x8, xzr, #0x7
        whilele p1.d, xzr, x8
        lastb   x8, p1, z16.d
        orr     x8, x8, x8, lsr openjdk#7
        orr     x8, x8, x8, lsr openjdk#14
        orr     x8, x8, x8, lsr openjdk#28
        and     x8, x8, #0xff
        orr     x0, x0, x8, lsl openjdk#56

After:

        mov     z16.b, p0/z, #1
        mov     z17.b, #1
        bext    z16.d, z16.d, z17.d
        mov     z17.d, #0
        uzp1    z16.s, z16.s, z17.s
        uzp1    z16.h, z16.h, z17.h
        uzp1    z16.b, z16.b, z17.b
        mov     x0, v16.d[0]

[1] https://www.felixcloutier.com/x86/pmovmskb
[2] https://developer.arm.com/documentation/ddi0602/2020-12/SVE-Instructions/BEXT--Gather-lower-bits-from-positions-selected-by-bitmask-

Change-Id: Ia983a20c89f76403e557ac21328f2f2e05dd08e0
e1iu pushed a commit to e1iu/jdk that referenced this pull request Apr 21, 2022
This patch optimizes the backend implementation of VectorMaskToLong for
AArch64, given a more efficient approach to mov value bits from
predicate register to general purpose register as x86 PMOVMSK[1] does,
by using BEXT[2] which is available in SVE2.

With this patch, the final code (input mask is byte type with
SPECIESE_512, generated on an SVE vector reg size of 512-bit QEMU
emulator) changes as below:

Before:

        mov     z16.b, p0/z, #1
        fmov    x0, d16
        orr     x0, x0, x0, lsr openjdk#7
        orr     x0, x0, x0, lsr openjdk#14
        orr     x0, x0, x0, lsr openjdk#28
        and     x0, x0, #0xff
        fmov    x8, v16.d[1]
        orr     x8, x8, x8, lsr openjdk#7
        orr     x8, x8, x8, lsr openjdk#14
        orr     x8, x8, x8, lsr openjdk#28
        and     x8, x8, #0xff
        orr     x0, x0, x8, lsl openjdk#8

        orr     x8, xzr, #0x2
        whilele p1.d, xzr, x8
        lastb   x8, p1, z16.d
        orr     x8, x8, x8, lsr openjdk#7
        orr     x8, x8, x8, lsr openjdk#14
        orr     x8, x8, x8, lsr openjdk#28
        and     x8, x8, #0xff
        orr     x0, x0, x8, lsl openjdk#16

        orr     x8, xzr, #0x3
        whilele p1.d, xzr, x8
        lastb   x8, p1, z16.d
        orr     x8, x8, x8, lsr openjdk#7
        orr     x8, x8, x8, lsr openjdk#14
        orr     x8, x8, x8, lsr openjdk#28
        and     x8, x8, #0xff
        orr     x0, x0, x8, lsl openjdk#24

        orr     x8, xzr, #0x4
        whilele p1.d, xzr, x8
        lastb   x8, p1, z16.d
        orr     x8, x8, x8, lsr openjdk#7
        orr     x8, x8, x8, lsr openjdk#14
        orr     x8, x8, x8, lsr openjdk#28
        and     x8, x8, #0xff
        orr     x0, x0, x8, lsl openjdk#32

        mov     x8, #0x5
        whilele p1.d, xzr, x8
        lastb   x8, p1, z16.d
        orr     x8, x8, x8, lsr openjdk#7
        orr     x8, x8, x8, lsr openjdk#14
        orr     x8, x8, x8, lsr openjdk#28
        and     x8, x8, #0xff
        orr     x0, x0, x8, lsl openjdk#40

        orr     x8, xzr, #0x6
        whilele p1.d, xzr, x8
        lastb   x8, p1, z16.d
        orr     x8, x8, x8, lsr openjdk#7
        orr     x8, x8, x8, lsr openjdk#14
        orr     x8, x8, x8, lsr openjdk#28
        and     x8, x8, #0xff
        orr     x0, x0, x8, lsl openjdk#48

        orr     x8, xzr, #0x7
        whilele p1.d, xzr, x8
        lastb   x8, p1, z16.d
        orr     x8, x8, x8, lsr openjdk#7
        orr     x8, x8, x8, lsr openjdk#14
        orr     x8, x8, x8, lsr openjdk#28
        and     x8, x8, #0xff
        orr     x0, x0, x8, lsl openjdk#56

After:

        mov     z16.b, p0/z, #1
        mov     z17.b, #1
        bext    z16.d, z16.d, z17.d
        mov     z17.d, #0
        uzp1    z16.s, z16.s, z17.s
        uzp1    z16.h, z16.h, z17.h
        uzp1    z16.b, z16.b, z17.b
        mov     x0, v16.d[0]

[1] https://www.felixcloutier.com/x86/pmovmskb
[2] https://developer.arm.com/documentation/ddi0602/2020-12/SVE-Instructions/BEXT--Gather-lower-bits-from-positions-selected-by-bitmask-

Change-Id: Ia983a20c89f76403e557ac21328f2f2e05dd08e0
fg1417 pushed a commit to fg1417/jdk that referenced this pull request Aug 17, 2022
After JDK-8283091, the loop below can be vectorized partially.
Statement 1 can be vectorized but statement 2 can't.
```
// int[] iArr; long[] lArrFld; int i1,i2;
for (i1 = 6; i1 < 227; i1++) {
  iArr[i1] += lArrFld[i1]++; // statement 1
  iArr[i1 + 1] -= (i2++); // statement 2
}
```

But we got incorrect results because the vector packs of iArr are
scheduled incorrectly like:
```
...
load_vector XMM1,[R8 + openjdk#16 + R11 << openjdk#2]
movl    RDI, [R8 + openjdk#20 + R11 << openjdk#2] # int
load_vector XMM2,[R9 + openjdk#8 + R11 << openjdk#3]
subl    RDI, R11    # int
vpaddq  XMM3,XMM2,XMM0  ! add packedL
store_vector [R9 + openjdk#8 + R11 << openjdk#3],XMM3
vector_cast_l2x  XMM2,XMM2  !
vpaddd  XMM1,XMM2,XMM1  ! add packedI
addl    RDI, openjdk#228   # int
movl    [R8 + openjdk#20 + R11 << openjdk#2], RDI # int
movl    RBX, [R8 + openjdk#24 + R11 << openjdk#2] # int
subl    RBX, R11    # int
addl    RBX, openjdk#227   # int
movl    [R8 + openjdk#24 + R11 << openjdk#2], RBX # int
...
movl    RBX, [R8 + openjdk#40 + R11 << openjdk#2] # int
subl    RBX, R11    # int
addl    RBX, openjdk#223   # int
movl    [R8 + openjdk#40 + R11 << openjdk#2], RBX # int
movl    RDI, [R8 + openjdk#44 + R11 << openjdk#2] # int
subl    RDI, R11    # int
addl    RDI, openjdk#222   # int
movl    [R8 + openjdk#44 + R11 << openjdk#2], RDI # int
store_vector [R8 + openjdk#16 + R11 << openjdk#2],XMM1
...
```
simplified as:
```
load_vector iArr in statement 1
unvectorized loads/stores in statement 2
store_vector iArr in statement 1
```
We cannot pick the memory state from the first load for LoadI pack
here, as the LoadI vector operation must load the new values in memory
after iArr writes 'iArr[i1 + 1] - (i2++)' to 'iArr[i1 + 1]'(statement 2).
We must take the memory state of the last load where we have assigned
new values ('iArr[i1 + 1] - (i2++)') to the iArr array.

In JDK-8240281, we picked the memory state of the first load. Different
from the scenario in JDK-8240281, the store, which is dependent on an
earlier load here, is in a pack to be scheduled and the LoadI pack
depends on the last_mem. As designed[2], to schedule the StoreI pack,
all memory operations in another single pack should be moved in the same
direction. We know that the store in the pack depends on one of loads in
the LoadI pack, so the LoadI pack should be scheduled before the StoreI
pack. And the LoadI pack depends on the last_mem, so the last_mem must
be scheduled before the LoadI pack and also before the store pack.
Therefore, we need to take the memory state of the last load for the
LoadI pack here.

To fix it, the pack adds additional checks while picking the memory state
of the first load. When the store locates in a pack and the load pack
relies on the last_mem, we shouldn't choose the memory state of the
first load but choose the memory state of the last load.

[1]https://github.com/openjdk/jdk/blob/0ae834105740f7cf73fe96be22e0f564ad29b18d/src/hotspot/share/opto/superword.cpp#L2380
[2]https://github.com/openjdk/jdk/blob/0ae834105740f7cf73fe96be22e0f564ad29b18d/src/hotspot/share/opto/superword.cpp#L2232

Jira: ENTLLT-5482
Change-Id: I341d10b91957b60a1b4aff8116723e54083a5fb8
CustomizedGitHooks: yes
caojoshua pushed a commit to caojoshua/jdk that referenced this pull request May 24, 2023
Co-authored-by: Xin Liu <xxinliu@amazon.com>
robehn pushed a commit to robehn/jdk that referenced this pull request Aug 15, 2023
fg1417 pushed a commit to fg1417/jdk that referenced this pull request Nov 21, 2023
…ng into ldp/stp on AArch64

Macro-assembler on aarch64 can merge adjacent loads or stores
into ldp/stp[1]. For example, it can merge:
```
str     w20, [sp, openjdk#16]
str     w10, [sp, openjdk#20]
```
into
```
stp     w20, w10, [sp, openjdk#16]
```

But C2 may generate a sequence like:
```
str     x21, [sp, openjdk#8]
str     w20, [sp, openjdk#16]
str     x19, [sp, openjdk#24] <---
str     w10, [sp, openjdk#20] <--- Before sorting
str     x11, [sp, openjdk#40]
str     w13, [sp, openjdk#48]
str     x16, [sp, openjdk#56]
```
We can't do any merging for non-adjacent loads or stores.

The patch is to sort the spilling or unspilling sequence in
the order of offset during instruction scheduling and bundling
phase. After that, we can get a new sequence:
```
str     x21, [sp, openjdk#8]
str     w20, [sp, openjdk#16]
str     w10, [sp, openjdk#20] <---
str     x19, [sp, openjdk#24] <--- After sorting
str     x11, [sp, openjdk#40]
str     w13, [sp, openjdk#48]
str     x16, [sp, openjdk#56]
```

Then macro-assembler can do ld/st merging:
```
str     x21, [sp, openjdk#8]
stp     w20, w10, [sp, openjdk#16] <--- Merged
str     x19, [sp, openjdk#24]
str     x11, [sp, openjdk#40]
str     w13, [sp, openjdk#48]
str     x16, [sp, openjdk#56]
```

To justify the patch, we run `HelloWorld.java`
```
public class HelloWorld {
    public static void main(String [] args) {
        System.out.println("Hello World!");
    }
}
```
with `java -Xcomp -XX:-TieredCompilation HelloWorld`.

Before the patch, macro-assembler can do ld/st merging for
3688 times. After the patch, the number of ld/st merging
increases to 3871 times, by ~5 %.

Tested tier1~3 on x86 and AArch64.

[1] https://github.com/openjdk/jdk/blob/a95062b39a431b4937ab6e9e73de4d2b8ea1ac49/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp#L2079
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotspot-gc hotspot-gc-dev@openjdk.org integrated Pull request has been integrated
Development

Successfully merging this pull request may close these issues.

4 participants