-
Notifications
You must be signed in to change notification settings - Fork 6.2k
8365991: AArch64: Ignore BlockZeroingLowLimit when UseBlockZeroing is false #26917
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
… false Signed-off-by: Patrick Zhang <patrick@os.amperecomputing.com>
|
👋 Welcome back qpzhang! A progress list of the required criteria for merging this PR into |
|
❗ This change is not yet ready to be integrated. |
|
@cnqpzhang The following label will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command. |
Webrevs
|
|
@cnqpzhang If you look back at the history of this code you will see that you are undoing a change that was made deliberately by @theRealAph. Your patch may improve the specific test case you have provided but at the cost of a significant and unacceptable increase in code cache use for all cases. The comment at the head of the code you have edited makes this point explicitly. The reasoning behind that comment is available in the JIRA history and associated review comments. The relevant issue is https://bugs.openjdk.org/browse/JDK-8179444 and the corresponding review thread starts with https://mail.openjdk.org/pipermail/hotspot-dev/2017-April/026742.html and continues with https://mail.openjdk.org/pipermail/hotspot-dev/2017-May/026766.html I don't recommend integrating this change. |
|
Hi @adinn, thanks for your review. I have read two related JBS:
Particularly to two My PR undoes some of the first patch (1ce2a362524), as described by #2 and #3 in the PR summary, but it is not all. Please see below, 1ce2a362524 removed the 1ce2a362524#diff-fe18bdf6585d1a0d4d510f382a568c4428334d4ad941581ecc10ec60ccafca4aL4972-L4974 6c68ce2d396#diff-0f4150a9c607ccd590bf256daa800c0276144682a92bc6bdced5e8bc1bb81f3aR4680-R4684 This looks a bit confusing when we have Without the In contrast, with the So, it appears that BlockZeroingLowLimit currently serves two purposes: as the lower limit for block zeroing, and as the threshold determining whether to call a stub or perform STP unrolling inline. Should we fix this, leave it as it is, or just add comments to explain it better? |
Signed-off-by: Patrick Zhang <patrick@os.amperecomputing.com>
|
It's difficult for anyone to predict all the possibilities of To begin with, please add this short patch, then see if any of this PR provides an advantage. |
|
Mailing list message from Andrew Haley on hotspot-dev: On 29/08/2025 12:10, Patrick Zhang wrote:
That's not going to tell you anything. The zeroing code is expanded many -- |
Signed-off-by: Patrick Zhang <patrick@os.amperecomputing.com>
Thanks for advice. Updated accordingly (commit 3 vs 2: 22e72f4) to keep the shape of the generated code as unchanged as possible. My test case with |
OK, but please edit the claims at the top of this PR to respect the new reality. In particular, please state the test cases which are improved. |
Signed-off-by: Patrick Zhang <patrick@os.amperecomputing.com>
Updated. |
|
I can't see any statistically-significant improvement. Please tell us your test results and your test conditions. |
The impact can be divided into two parts, at execution time and at code generation time respectively.
Let us focus on tests that can generate diffs, for example, I run below on Ampere Altra (Neoverse-N1), Fedora 40, Kernel 6.1. Results (Base) Patched (minor variations or slight improvements, as expected) Note that we do not include C2 tests and size > 256 as the generated code are same, no noticeable performance change.
Firstly, we test clearing 4 words (32 bytes) with low limit 8 bytes (1 words), the patch will correct the low limit to 256 bytes (32 words). Run it 20 times to see the ratios of patch vs base (lower is better): Test results, zero_words wall time (ns): Secondly, we test clearing larger memory, 16 words (128 bytes) with low limit 64 bytes (8 words). Remember to update New test results, zero_words wall time (ns): In summary, the code changes bring a slight improvement to execution time, though some of these differences may be within normal variation, and a clear reduction in wall time for the Thanks for taking the time to read this long write-up in details. |
|
@cnqpzhang This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply issue a |
Signed-off-by: Patrick Zhang <patrick@os.amperecomputing.com>
Signed-off-by: Patrick Zhang <patrick@os.amperecomputing.com>
Signed-off-by: Patrick Zhang <patrick@os.amperecomputing.com>
|
Added test/hotspot/gtest/aarch64/test_MacroAssembler_zero_words.cpp to measure the impact of different low limits and cleared word counts on the wall time of Run the test and compare the wall times. We can see that $ make run-test TEST="gtest:MacroAssemblerZeroWordsTest"
See below for the detailed run log, including the generated code sequences under various conditions: |
Hi @theRealAph , Do you have any further comments on the updates? Aside from the code changes to |
|
@cnqpzhang I don't understand why you think these tests indicate anything useful for real use cases. Do you have an actual user whose needs justify adopting this change? Let's consider what your patch and associated test achieve. Initially you tried to remove the limit on unrolling that was imposed to avoid excessive cache consumption. When it was explained why this was inappropriate you reduced the patch so that it now adjusts the threshold at which unrolling is replaced by a call to the stub. Your two test runs appear to demonstrate a performance improvement between old and new but the difference is more apparent than real. In the specific configurations you have selected your change to the unrolling threshold targets two very specific points of disparity. The new cases fully unroll while the old cases rely on a callout. Not surpisingly. this gives very different performance when you run it in a loop many times. But we already know that callouts are more expensive than inline code. The important thing to note is that this transition between unrolling vs callout happens in both old and new code, just at different size points. If you ran with other config settings and sizes you could find many cases where both versions fully unroll or both rely on a callout. So your test does not truly reflect what is going on here and your fix is really doing little more than rescaling the dials so they can go up to 11. You have provided no good evidence as to why we need to adjust the scale by which we compute the threshold between unrolling or callout. Furthermore, since this rescaling allows more unrolling to occur than in the old version you still need to justify why that is worth doing. |
|
@adinn Thank you for the good summary of the proposed code changes. You omitted a key condition in the context: I would like to reiterate that I have no objection to the functions when the The initial patch aimed to decouple these uses, but @adinn raised concerns regarding the size of code caches and potential performance side effects. I profiled a SPECjbb2015 PRESET run and presented the minor impact, while @theRealAph commented that this approach might not fully capture all the impacts in detail. Based on the subsequent advice, a compromise was proposed: we could start by resetting With regards to your last question, I would not try to justify why |
It does not. When |
zero_words does not check In contrast, the inner stub function does so. address generate_zero_blocks() { |
It doesn't need to because That is to say, if a user sets |
Yes, this is the current state we have, with the patch, and it also represents the compromise I can accept regarding Is there anything else we need to do for this PR? |
|
Hi @theRealAph and @adinn, please let me know if you have any additional comments on this PR, or advice to improve it. Thank you. |
|
Hi, The status of this PR has not been updated for a couple of weeks. I think I’ve addressed the feedback provided, but I haven’t seen any further comments or decisions. Please let me know if there’s anything else I can do to improve the patch or if it’s ready to move forward. Thanks for your time! |
| BLOCK_COMMENT("zero_words {"); | ||
| assert(ptr == r10 && cnt == r11, "mismatch in register usage"); | ||
| RuntimeAddress zero_blocks = RuntimeAddress(StubRoutines::aarch64::zero_blocks()); | ||
| assert(zero_blocks.target() != nullptr, "zero_blocks stub has not been generated"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the point of this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are duplicates of getting the address of zero_blocks() and the assertion.
The first was originally introduced by [1] and got subsequently duplicated nearby with [2]. Was there a specific reason to have one copy placed after the br(LO, around) and another before it? I tried removing one instance and tests did not report any issue.
[1] 8179444: AArch64: Put zero_words on a diet, 1ce2a36#diff-fe18bdf6585d1a0d4d510f382a568c4428334d4ad941581ecc10ec60ccafca4aR4971-R4972
[2] 8270947: AArch64: C1: use zero_words to initialize all objects
6c68ce2#diff-0f4150a9c607ccd590bf256daa800c0276144682a92bc6bdced5e8bc1bb81f3aR4625-R4626
| mov(r10, base); mov(r11, cnt); | ||
| result = zero_words(r10, r11); | ||
| } else { | ||
| #ifndef PRODUCT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this change for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason of why I swapped the if and else code block is:
1). Initially, I intended to add a check for UseBlockZeroing to determine whether to call zero_words_reg_reg. Swapping the if and else branches makes it easier to compare the behavior with and without this additional condition. Otherwise, the if-cond would be like if (!UseBlockZeroing || cnt <= (uint64_t)BlockZeroingLowLimit / BytesPerWord).
2). Later, we decided not to check UseBlockZeroing here but I still didn't roll back this change because the comments of warning There is no need to check UseBlockZeroing.. should be placed before such an if condition, instead of the old one.
| uint64_t loops = cnt/16; | ||
| // Use 16 words as the block size which is 128 bytes on 64-bit systems. | ||
| // A complete loop body will be 8 STPs unrolled there. | ||
| const int block_size = 16; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Naming this constant block_size only adds to any confusion, IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wondered why MacroAssembler::zero_words uses 16 words to do stp unrolling, while generate_zero_blocks() 8 words (const int MacroAssembler::zero_words_block_size = 8;), so defined this variable to compare 8 vs 16 but did not find obvious performance difference.
Regarding the var name block_size, could unroll or unroll_words be better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wondered why
MacroAssembler::zero_wordsuses 16 words to dostpunrolling, whilegenerate_zero_blocks()8 words (const int MacroAssembler::zero_words_block_size = 8;), so defined this variable to compare8 vs 16but did not find obvious performance difference.Regarding the var name
block_size, couldunrollorunroll_wordsbe better?
What's wrong with 16? I'm asking not from a "my teachers said always name constants" point of view, but from a reader's understanding point of view. Named constants are all well and good if the constant has some meaning, but this one is just two words. Perhaps 2 * WordSize would do.
cnqpzhang
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for review, please see my updates and replies with the new commit.
| uint64_t loops = cnt/16; | ||
| // Use 16 words as the block size which is 128 bytes on 64-bit systems. | ||
| // A complete loop body will be 8 STPs unrolled there. | ||
| const int block_size = 16; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wondered why MacroAssembler::zero_words uses 16 words to do stp unrolling, while generate_zero_blocks() 8 words (const int MacroAssembler::zero_words_block_size = 8;), so defined this variable to compare 8 vs 16 but did not find obvious performance difference.
Regarding the var name block_size, could unroll or unroll_words be better?
| BLOCK_COMMENT("zero_words {"); | ||
| assert(ptr == r10 && cnt == r11, "mismatch in register usage"); | ||
| RuntimeAddress zero_blocks = RuntimeAddress(StubRoutines::aarch64::zero_blocks()); | ||
| assert(zero_blocks.target() != nullptr, "zero_blocks stub has not been generated"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are duplicates of getting the address of zero_blocks() and the assertion.
The first was originally introduced by [1] and got subsequently duplicated nearby with [2]. Was there a specific reason to have one copy placed after the br(LO, around) and another before it? I tried removing one instance and tests did not report any issue.
[1] 8179444: AArch64: Put zero_words on a diet, 1ce2a36#diff-fe18bdf6585d1a0d4d510f382a568c4428334d4ad941581ecc10ec60ccafca4aR4971-R4972
[2] 8270947: AArch64: C1: use zero_words to initialize all objects
6c68ce2#diff-0f4150a9c607ccd590bf256daa800c0276144682a92bc6bdced5e8bc1bb81f3aR4625-R4626
| mov(r10, base); mov(r11, cnt); | ||
| result = zero_words(r10, r11); | ||
| } else { | ||
| #ifndef PRODUCT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason of why I swapped the if and else code block is:
1). Initially, I intended to add a check for UseBlockZeroing to determine whether to call zero_words_reg_reg. Swapping the if and else branches makes it easier to compare the behavior with and without this additional condition. Otherwise, the if-cond would be like if (!UseBlockZeroing || cnt <= (uint64_t)BlockZeroingLowLimit / BytesPerWord).
2). Later, we decided not to check UseBlockZeroing here but I still didn't roll back this change because the comments of warning There is no need to check UseBlockZeroing.. should be placed before such an if condition, instead of the old one.
Signed-off-by: Patrick Zhang <patrick@os.amperecomputing.com>

Issue:
In AArch64 port,
UseBlockZeroingis by default set to true andBlockZeroingLowLimitis initialized to 256. IfDC ZVAis supported,BlockZeroingLowLimitis later updated to4 * VM_Version::zva_length(). WhenUseBlockZeroingis set to false, all related conditional checks should ignoreBlockZeroingLowLimit. However, the functionMacroAssembler::zero_words(Register base, uint64_t cnt)still evaluates the lower limit and bases its code generation logic on it, which seems to be an incomplete conditional check.This PR:
BlockZeroingLowLimitto4 * VM_Version::zva_length()or 256 with a warning message if it was manually configured from the default whileUseBlockZeroingis disabled.MacroAssembler::zero_words(Register base, uint64_t cnt)andMacroAssembler::zero_words(Register ptr, Register cnt)to explain why we do not checkUseBlockZeroingin the outer part of these functions. Instead, the decision is delegated to the stub functionzero_blocks, which encapsulates the DC ZVA instructions and serves as the inner implementation ofzero_words. This approach helps better control the increase in code cache size during array or object instance initialization.test/micro/org/openjdk/bench/vm/gc/RawAllocationRate.javato better cover scenarios involving smaller arrays and objects..Tests:
vm.compiler.ClearMemory, andvm.gc.RawAllocationRate(includingarrayTestandinstanceTest) showed no obvious regression. Negative tests withjdk/bin/java -jar images/test/micro/benchmarks.jar RawAllocationRate.arrayTest_C1 -bm thrpt -gc false -wi 0 -w 30 -i 1 -r 30 -t 1 -f 1 -tu s -jvmArgs "-XX:-UseBlockZeroing -XX:BlockZeroingLowLimit=8" -p size=32demonstrated good wall times onzero_words_reg_immcalls, as expected.Progress
Issue
Reviewing
Using
gitCheckout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/26917/head:pull/26917$ git checkout pull/26917Update a local copy of the PR:
$ git checkout pull/26917$ git pull https://git.openjdk.org/jdk.git pull/26917/headUsing Skara CLI tools
Checkout this PR locally:
$ git pr checkout 26917View PR using the GUI difftool:
$ git pr show -t 26917Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/26917.diff
Using Webrev
Link to Webrev Comment