8365991: AArch64: Ignore BlockZeroingLowLimit when UseBlockZeroing is false #26917

cnqpzhang · 2025-08-24T16:23:19Z

Issue:
In AArch64 port, UseBlockZeroing is by default set to true and BlockZeroingLowLimit is initialized to 256. If DC ZVA is supported, BlockZeroingLowLimit is later updated to 4 * VM_Version::zva_length(). When UseBlockZeroing is set to false, all related conditional checks should ignore BlockZeroingLowLimit. However, the function MacroAssembler::zero_words(Register base, uint64_t cnt) still evaluates the lower limit and bases its code generation logic on it, which seems to be an incomplete conditional check.

This PR:

Reset BlockZeroingLowLimit to 4 * VM_Version::zva_length() or 256, with a warning message, if it was manually changed from its default while UseBlockZeroing is disabled.
Added necessary comments in MacroAssembler::zero_words(Register base, uint64_t cnt) and MacroAssembler::zero_words(Register ptr, Register cnt) to explain why we do not check UseBlockZeroing in the outer part of these functions. Instead, the decision is delegated to the stub function zero_blocks, which encapsulates the DC ZVA instructions and serves as the inner implementation of zero_words. This approach helps better control the increase in code cache size during array or object instance initialization.
Added more test sizes to test/micro/org/openjdk/bench/vm/gc/RawAllocationRate.java to better cover scenarios involving smaller arrays and objects.
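
For readers following along, the reset described in the first item above can be sketched as a small standalone C++ model. This is only an illustration under stated assumptions, not the HotSpot code: ZeroingFlags and normalize_low_limit are hypothetical names, and the ZVA length of 64 bytes in the usage note is just an example value.

```cpp
#include <cassert>

// Hypothetical stand-in for the relevant VM flag state (not HotSpot's real data structures).
struct ZeroingFlags {
    bool use_block_zeroing;     // models -XX:+/-UseBlockZeroing
    int  low_limit_bytes;       // models -XX:BlockZeroingLowLimit
    bool low_limit_is_default;  // true if the user did not set the limit explicitly
};

// Models the reset: when block zeroing is disabled and the user supplied a limit,
// replace that limit with 4 * zva_length if DC ZVA is usable, otherwise with 256.
// Returns true when the user's value was overridden (the real VM would print a warning).
inline bool normalize_low_limit(ZeroingFlags& f, bool zva_enabled, int zva_length_bytes) {
    // Assumed precondition: a usable DC ZVA implies a positive block length.
    assert(!zva_enabled || zva_length_bytes > 0);
    if (!f.use_block_zeroing && !f.low_limit_is_default) {
        f.low_limit_bytes = zva_enabled ? 4 * zva_length_bytes : 256;
        return true;
    }
    return false;
}
```

With -XX:-UseBlockZeroing -XX:BlockZeroingLowLimit=8 this model ends up with a limit of 256 bytes both when DC ZVA is unavailable and when it is available with an assumed 64-byte block; with -XX:+UseBlockZeroing the user's value is left alone.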

Tests:

Performance tests on the bundled JMH vm.compiler.ClearMemory, and vm.gc.RawAllocationRate (including arrayTest and instanceTest) showed no obvious regression. Negative tests with jdk/bin/java -jar images/test/micro/benchmarks.jar RawAllocationRate.arrayTest_C1 -bm thrpt -gc false -wi 0 -w 30 -i 1 -r 30 -t 1 -f 1 -tu s -jvmArgs "-XX:-UseBlockZeroing -XX:BlockZeroingLowLimit=8" -p size=32 demonstrated good wall times on zero_words_reg_imm calls, as expected.
Jtreg tier1 tests ran on Ampere Altra, AmpereOne, Graviton2 and Graviton3; tier2 ran on Altra. No new issues found. GHA Sanity Checks passed.

Progress

Change must be properly reviewed (1 review required, with at least 1 Reviewer)
Change must not contain extraneous whitespace
Commit message must refer to an issue

Issue

JDK-8365991: AArch64: Ignore BlockZeroingLowLimit when UseBlockZeroing is false (Enhancement - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/26917/head:pull/26917
$ git checkout pull/26917

Update a local copy of the PR:
$ git checkout pull/26917
$ git pull https://git.openjdk.org/jdk.git pull/26917/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 26917

View PR using the GUI difftool:
$ git pr show -t 26917

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/26917.diff

Using Webrev

Link to Webrev Comment

… false Signed-off-by: Patrick Zhang <patrick@os.amperecomputing.com>

bridgekeeper · 2025-08-24T16:23:53Z

👋 Welcome back qpzhang! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

openjdk · 2025-08-24T16:24:21Z

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

openjdk · 2025-08-24T16:24:51Z

@cnqpzhang The following label will be automatically applied to this pull request:

hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

mlbridge · 2025-08-24T16:28:17Z

Webrevs

adinn · 2025-08-26T13:26:33Z

@cnqpzhang If you look back at the history of this code you will see that you are undoing a change that was made deliberately by @theRealAph. Your patch may improve the specific test case you have provided but at the cost of a significant and unacceptable increase in code cache use for all cases.

The comment at the head of the code you have edited makes this point explicitly. The reasoning behind that comment is available in the JIRA history and associated review comments. The relevant issue is

https://bugs.openjdk.org/browse/JDK-8179444

and the corresponding review thread starts with

https://mail.openjdk.org/pipermail/hotspot-dev/2017-April/026742.html

and continues with

https://mail.openjdk.org/pipermail/hotspot-dev/2017-May/026766.html

I don't recommend integrating this change.

cnqpzhang · 2025-08-29T11:06:51Z

Hi @adinn, thanks for your review.

I have read two related JBS:

JDK-8179444, Put zero_words on a diet (May 2017), 1ce2a362524
JDK-8270947, C1: use zero_words to initialize all objects (Jul 2021), 6c68ce2d396

Regarding the two zero_words functions (reg_reg and reg_imm): the first patch (1ce2a362524) made MacroAssembler::zero_words(Register ptr, Register cnt) call the stub produced by generate_zero_blocks() and moved the if (UseBlockZeroing) condition into that stub, which gave a shorter instruction sequence for ClearArray. The second patch made MacroAssembler::zero_words(Register base, uint64_t cnt) route to the stub as well.

My PR undoes some of the first patch (1ce2a362524), as described by #2 and #3 in the PR summary, but that is not the whole story. As shown below, 1ce2a362524 removed the BlockZeroingLowLimit check when it dropped the call to block_zero. Next, 6c68ce2d396 made zero_words(Register base, uint64_t cnt) call zero_words(Register ptr, Register cnt) and then the stub function; it should have added back the UseBlockZeroing check but omitted it (intentionally?).

1ce2a362524#diff-fe18bdf6585d1a0d4d510f382a568c4428334d4ad941581ecc10ec60ccafca4aL4972-L4974

  } else if (UseBlockZeroing && cnt >= (u_int64_t)(BlockZeroingLowLimit >> LogBytesPerWord)) {
    mov(tmp, cnt);
    block_zero(base, tmp, true);

6c68ce2d396#diff-0f4150a9c607ccd590bf256daa800c0276144682a92bc6bdced5e8bc1bb81f3aR4680-R4684

void MacroAssembler::zero_words(Register base, uint64_t cnt)
{
  guarantee(zero_words_block_size < BlockZeroingLowLimit,
            "increase BlockZeroingLowLimit");
  if (cnt <= (uint64_t)BlockZeroingLowLimit / BytesPerWord) {

This looks a bit confusing: with -XX:-UseBlockZeroing, BlockZeroingLowLimit still takes effect. For example, with -XX:-UseBlockZeroing -XX:BlockZeroingLowLimit=16 and an object instance size of 32:

Without the UseBlockZeroing check (base), we have:

 ;; zero_words {
  0x0000400013b02e40:   subs  x8, x11, #0x8
  0x0000400013b02e44:   b.cc  0x0000400013b02e4c  // b.lo, b.ul, b.last
  0x0000400013b02e48:   bl  0x0000400013b02f10          ;   {runtime_call Stub::Stub Generator zero_blocks_stub}
  0x0000400013b02e4c:   tbz  w11, #2, 0x0000400013b02e58
  0x0000400013b02e50:   stp  xzr, xzr, [x10], #16
  0x0000400013b02e54:   stp  xzr, xzr, [x10], #16
  0x0000400013b02e58:   tbz  w11, #1, 0x0000400013b02e60
  0x0000400013b02e5c:   stp  xzr, xzr, [x10], #16
  0x0000400013b02e60:   tbz  w11, #0, 0x0000400013b02e68
  0x0000400013b02e64:   str  xzr, [x10]
 ;; } zero_words

In contrast, with the UseBlockZeroing check (patched), we will see:

 ;; zero_words (count = 2) {
  0x000040003415e874:   stp  xzr, xzr, [x10]
 ;; } zero_words

So, it appears that BlockZeroingLowLimit currently serves two purposes: as the lower limit for block zeroing, and as the threshold determining whether to call a stub or perform STP unrolling inline. Should we fix this, leave it as it is, or just add comments to explain it better?
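
To make the second role concrete, here is a minimal standalone C++ sketch of the inline-vs-stub decision, modeled on the cnt <= BlockZeroingLowLimit / BytesPerWord comparison quoted above. The names ZeroStrategy and choose_strategy are hypothetical, and the sketch deliberately ignores UseBlockZeroing, which is exactly the behavior being discussed.

```cpp
#include <cassert>
#include <cstdint>

// Which code shape the assembler emits for a zeroing request of known size.
enum class ZeroStrategy { InlineStores, StubCall };

constexpr std::uint64_t kBytesPerWord = 8;  // 64-bit words on AArch64

// Models the size check in zero_words(Register base, uint64_t cnt):
// small counts are unrolled inline, larger ones go through the zero_blocks stub.
// Note that UseBlockZeroing plays no part in this decision.
inline ZeroStrategy choose_strategy(std::uint64_t cnt_words, std::uint64_t low_limit_bytes) {
    // The flag's declared range starts at wordSize, so the limit covers at least one word.
    assert(low_limit_bytes >= kBytesPerWord);
    return cnt_words <= low_limit_bytes / kBytesPerWord ? ZeroStrategy::InlineStores
                                                        : ZeroStrategy::StubCall;
}
```

Under this model a 4-word request with a limit of 8 bytes goes to the stub, while the same request with the default limit of 256 bytes is unrolled inline; the limit acts as an unrolling threshold whether or not DC ZVA is in use.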

cnqpzhang · 2025-08-29T11:08:32Z

Regarding the impact on the code cache, I measured JMH vm.gc.RawAllocationRate.arrayTest and a SPECjbb2015 PRESET run. The first is not suitable for comparison because the array initialization code takes only a small portion of the overall space; with -XX:+TieredCompilation the sum of the three segmented caches showed a difference of well under 1%. SPECjbb2015, on the other hand, is a complex enough application to demonstrate the impact on the code cache, so I plotted a chart for a 20-minute run, baseline vs. patched.

The chart shows that the profiled and non-profiled nmethods use slightly more cache in the patched build than in the baseline, a tiny part of the totals of ~6MB (profiled nmethods) and ~12MB (non-profiled nmethods). These differences are also far smaller than the total reserved size, whether 32M (C1 only), 48M (with C2), or 240M (configured ergonomically by the JVM). I manually set -XX:InitialCodeCacheSize=32M -XX:ReservedCodeCacheSize=64M to keep the range manageable.

Therefore, I have a question regarding the practical impact of the code cache in this context. Specifically, is the code cache still practically a significant concern relative to the benefits gained from reduced call counts and the modest performance improvements in code generation and execution for the generated array and object initialization code?

That said, I fully understand the potential risks and concerns associated with modifying the existing logic. I am prepared to roll back the changes related to the C2 part.


theRealAph · 2025-09-01T10:11:49Z

It's difficult for anyone to predict all the possibilities of -XX command-line arguments that users might try, despite them not making any sense.

To begin with, please add this short patch, then see if any of this PR provides an advantage.


diff --git a/src/hotspot/cpu/aarch64/vm_version_aarch64.cpp b/src/hotspot/cpu/aarch64/vm_version_aarch64.cpp
index 9321dd0542e..14a584c5106 100644
--- a/src/hotspot/cpu/aarch64/vm_version_aarch64.cpp
+++ b/src/hotspot/cpu/aarch64/vm_version_aarch64.cpp
@@ -446,6 +446,11 @@ void VM_Version::initialize() {
     FLAG_SET_DEFAULT(UseBlockZeroing, false);
   }
 
+  if (!UseBlockZeroing && !FLAG_IS_DEFAULT(BlockZeroingLowLimit)) {
+    warning("BlockZeroingLowLimit has been ignored because UseBlockZeroing is disabled");
+    FLAG_SET_DEFAULT(BlockZeroingLowLimit, 4 * VM_Version::zva_length());
+  }
+
   if (VM_Version::supports_sve2()) {
     if (FLAG_IS_DEFAULT(UseSVE)) {
       FLAG_SET_DEFAULT(UseSVE, 2);

mlbridge · 2025-09-01T10:20:35Z

Mailing list message from Andrew Haley on hotspot-dev:

On 29/08/2025 12:10, Patrick Zhang wrote:

Regarding the impact to code caches, I measured JMH

That's not going to tell you anything. The zeroing code is expanded many
times during a compilation, and code cache size is limited. Every time
we needlessly expand intrinsics inline we kick user's code out.

--
Andrew Haley (he/him)
Java Platform Lead Engineer
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


cnqpzhang · 2025-09-03T10:28:26Z

To begin with, please add this short patch, then see if any of this PR provides an advantage.


diff --git a/src/hotspot/cpu/aarch64/vm_version_aarch64.cpp b/src/hotspot/cpu/aarch64/vm_version_aarch64.cpp
index 9321dd0542e..14a584c5106 100644
--- a/src/hotspot/cpu/aarch64/vm_version_aarch64.cpp
+++ b/src/hotspot/cpu/aarch64/vm_version_aarch64.cpp
@@ -446,6 +446,11 @@ void VM_Version::initialize() {
     FLAG_SET_DEFAULT(UseBlockZeroing, false);
   }
 
+  if (!UseBlockZeroing && !FLAG_IS_DEFAULT(BlockZeroingLowLimit)) {
+    warning("BlockZeroingLowLimit has been ignored because UseBlockZeroing is disabled");
+    FLAG_SET_DEFAULT(BlockZeroingLowLimit, 4 * VM_Version::zva_length());
+  }
+
   if (VM_Version::supports_sve2()) {
     if (FLAG_IS_DEFAULT(UseSVE)) {
       FLAG_SET_DEFAULT(UseSVE, 2);

Thanks for the advice. Updated accordingly (commit 3 vs 2: 22e72f4) to keep the shape of the generated code as unchanged as possible. My test case with -XX:-UseBlockZeroing -XX:BlockZeroingLowLimit=8, size=32 also works as expected. I added some comments to clarify the purpose of the if-condition inside the zero_words function and avoid future confusion. Please help review, thanks.

theRealAph · 2025-09-04T09:37:21Z

Please help review, thanks.

OK, but please edit the claims at the top of this PR to respect the new reality. In particular, please state the test cases which are improved.


cnqpzhang · 2025-09-05T07:38:35Z

OK, but please edit the claims at the top of this PR to respect the new reality. In particular, please state the test cases which are improved.

Updated.
Also pushed a new change (f23abb9) that sets BlockZeroingLowLimit to 256 if is_zva_enabled() returns false; without it, the limit would be computed from a _zva_length of 0.

theRealAph · 2025-09-05T11:22:15Z

I can't see any statistically-significant improvement. Please tell us your test results and your test conditions.

cnqpzhang · 2025-09-07T01:49:09Z

I can't see any statistically-significant improvement. Please tell us your test results and your test conditions.

The impact can be divided into two parts, at execution time and at code generation time respectively.

Execution time measured by JMH RawAllocationRate test cases
As mentioned in the initial PR summary, we do not expect significant improvement in the execution of zero_words with this PR, neither in the original version (C1 and C2) nor in the current revision (C1 only). The instruction sequences generated by both the baseline and patched versions show only minor differences under certain test conditions. Additionally, some reduction in cmp and branch instructions is insufficient to yield a significant performance benefit.

Let us focus on tests that can show differences. For example, I ran the following on Ampere Altra (Neoverse-N1), Fedora 40, Kernel 6.1.

JVM_ARGS="-XX:-UseBlockZeroing -XX:BlockZeroingLowLimit=8"
JMH_ARGS="-p size=32 -p size=48 -p size=64 -p size=80 -p size=96 -p size=128 -p size=256"
jdk/bin/java -jar images/test/micro/benchmarks.jar RawAllocationRate.instanceTest_C1 -bm thrpt -gc false -wi 2 -w 60 -i 1 -r 30 -t 1 -f 1 -tu s -jvmArgs "${JVM_ARGS}" ${JMH_ARGS} -rf csv -rff results.csv

Results (Base)

"Benchmark","Mode","Threads","Samples","Score","Score Error (99.9%)","Unit","Param: size"
"org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,7013.365157,NaN,"ops/s",32
"org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,9160.068513,NaN,"ops/s",48
"org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,10216.516550,NaN,"ops/s",64
"org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,9512.467605,NaN,"ops/s",80
"org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,7555.693378,NaN,"ops/s",96
"org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,9033.057061,NaN,"ops/s",128
"org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,5559.689404,NaN,"ops/s",256

Patched (minor variations or slight improvements, as expected)

"Benchmark","Mode","Threads","Samples","Score","Score Error (99.9%)","Unit","Param: size"
"org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,7071.799147,NaN,"ops/s",32
"org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,9250.847903,NaN,"ops/s",48
"org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,10240.947817,NaN,"ops/s",64
"org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,9757.645075,NaN,"ops/s",80
"org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,7531.211049,NaN,"ops/s",96
"org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,9045.657067,NaN,"ops/s",128
"org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,5560.328088,NaN,"ops/s",256

Note that we do not include C2 tests or sizes > 256, because the generated code is the same there and there is no noticeable performance change.

Code-gen time measured by Gtest test_MacroAssembler_zero_words.cpp
I created jdk/test/hotspot/gtest/aarch64/test_MacroAssembler_zero_words.cpp to measure the wall time of zero_words calls; however, I have not included it in this PR because it still contains some hardcoded variables.

#include "asm/assembler.hpp"
#include "asm/assembler.inline.hpp"
#include "asm/macroAssembler.hpp"
#include "unittest.hpp"
#include <chrono>

#if defined(AARCH64) && !defined(ZERO)

TEST_VM(AssemblerAArch64, zero_words_wall_time) {
    BufferBlob* b = BufferBlob::create("aarch64Test", 200000);
    CodeBuffer code(b);
    MacroAssembler _masm(&code);

    const size_t call_count = 1000;
    const size_t word_count = 4; // 32B / 8B-per-word = 4
    // const size_t word_count = 16; // 128B / 8B-per-word = 16
    uint64_t* buffer = new uint64_t[word_count];
    Register base = r10;
    uint64_t cnt = word_count;

    // Set up base register to point to buffer
    _masm.mov(base, (uintptr_t)buffer);

    auto start = std::chrono::steady_clock::now();
    for (size_t i = 0; i < call_count; ++i) {
        _masm.zero_words(base, cnt);
    }
    auto end = std::chrono::steady_clock::now();

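    auto wall_time_ns = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
    // Cast explicitly so the printf format matches the argument type on every platform.
    printf("zero_words wall time (ns): %lld\n", (long long)(wall_time_ns / (long long)call_count));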

    // Optionally verify buffer is zeroed
    for (size_t i = 0; i < word_count; ++i) {
        ASSERT_EQ(buffer[i], 0u);
    }

    delete[] buffer;
}

#endif  // AARCH64 && !ZERO

First, we test clearing 4 words (32 bytes) with a low limit of 8 bytes (1 word); the patch corrects the low limit to 256 bytes (32 words). Run it 20 times to see the ratio of patch vs. base (lower is better):

for ((i=0;i<20;i++));do
make test-only TEST="gtest:AssemblerAArch64.zero_words_wall_time" TEST_OPTS="JAVA_OPTIONS=-XX:-UseBlockZeroing -XX:BlockZeroingLowLimit=8" 2>/dev/null | grep "wall time"
done

Test results, zero_words wall time (ns):

Base    Patch   Patch vs Base
346     45      0.13
393     45      0.11
398     46      0.12
390     30      0.08
322     29      0.09
398     27      0.07
392     51      0.13
392     44      0.11
361     53      0.15
390     44      0.11
299     28      0.09
303     29      0.10
419     52      0.12
390     44      0.11
403     29      0.07
387     44      0.11
387     53      0.14
307     29      0.09
298     45      0.15
387     45      0.12

Second, we test clearing a larger region: 16 words (128 bytes) with a low limit of 64 bytes (8 words). Remember to update test_MacroAssembler_zero_words.cpp with const size_t word_count = 16; and use the command line below:

for ((i=0;i<20;i++));do
make test-only TEST="gtest:AssemblerAArch64.zero_words_wall_time" TEST_OPTS="JAVA_OPTIONS=-XX:-UseBlockZeroing -XX:BlockZeroingLowLimit=64" | grep "wall time"
done

New test results, zero_words wall time (ns):

Base    Patch   Patch vs Base
370     204     0.55
310     205     0.66
369     209     0.57
381     208     0.55
384     172     0.45
365     209     0.57
364     205     0.56
378     204     0.54
388     208     0.54
375     200     0.53
369     201     0.54
289     204     0.71
377     204     0.54
380     201     0.53
379     201     0.53
379     199     0.53
388     207     0.53
375     204     0.54
402     201     0.50
373     202     0.54
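
Since the question was about statistical significance, the 20 runs in each table could be summarized with a mean and a sample standard deviation per column. The following is a minimal C++ sketch of that arithmetic; mean and sample_stddev are illustrative helper names, not part of the gtest or of this PR.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Arithmetic mean of a non-empty series of wall-time samples.
inline double mean(const std::vector<double>& xs) {
    assert(!xs.empty());
    double sum = 0.0;
    for (double x : xs) sum += x;
    return sum / static_cast<double>(xs.size());
}

// Sample standard deviation (divides by n - 1); needs at least two samples.
inline double sample_stddev(const std::vector<double>& xs) {
    assert(xs.size() >= 2);
    const double m = mean(xs);
    double squares = 0.0;
    for (double x : xs) squares += (x - m) * (x - m);
    return std::sqrt(squares / static_cast<double>(xs.size() - 1));
}
```

Feeding the Base and Patch columns in separately gives one mean and one spread per configuration, which is easier to compare than 20 individual ratios.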

In summary, the code changes bring a slight improvement to execution time, though some of these differences may be within normal variation, and a clear reduction in wall time for the zero_words_reg_imm calls under the specific test conditions where UseBlockZeroing is false and the word count cnt > BlockZeroingLowLimit / BytesPerWord. I understand that some of the observed differences are not statistically significant, and that certain improved code-gen wall time ratios may be of limited concern. However, the primary purpose of this PR is to address the logical issue: a configured BlockZeroingLowLimit should have no effect when UseBlockZeroing is false, in contrast to its behavior when UseBlockZeroing is true.

Thanks for taking the time to read this long write-up in detail.

bridgekeeper · 2025-10-05T10:32:11Z

@cnqpzhang This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply issue a /touch or /keepalive command to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!


cnqpzhang · 2025-10-17T04:20:28Z

Added test/hotspot/gtest/aarch64/test_MacroAssembler_zero_words.cpp to measure the impact of different low limits and cleared word counts on the wall time of MacroAssembler::zero_words and compare the resulting differences.

Run the test and compare the wall times. We can see that fixing the low limit from a lower value to the default 256 improves codegen efficiency, by 11x on clear_4_words (289 vs. 25) and by 1.6x on clear_16_words (170 vs. 107).

$ make run-test TEST="gtest:MacroAssemblerZeroWordsTest"

Clear 4 words with lower limit 8, zero_words wall time (ns): 289
Clear 4 words with lower limit 256, zero_words wall time (ns): 25
Clear 16 words with lower limit 64, zero_words wall time (ns): 170
Clear 16 words with lower limit 256, zero_words wall time (ns): 107

See below for the detailed run log, including the generated code sequences under various conditions:

Test selection 'gtest:MacroAssemblerZeroWordsTest', will run:
* gtest:MacroAssemblerZeroWordsTest/server

Running test 'gtest:MacroAssemblerZeroWordsTest/server'
Note: Google Test filter = MacroAssemblerZeroWordsTest*
[==========] Running 4 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 4 tests from MacroAssemblerZeroWordsTest
[ RUN      ] MacroAssemblerZeroWordsTest.UseBZ_clear_32B_with_lowlimit_8B_vm
--------------------------------------------------------------------------------
udf     #0
  0x0000400011c56a40:   mov     x10, #0x6fb0                    // #28592
  0x0000400011c56a44:   movk    x10, #0xab05, lsl #16
  0x0000400011c56a48:   movk    x10, #0xaaaa, lsl #32
  0x0000400011c56a4c:   orr     x11, xzr, #0x4
  0x0000400011c56a50:   subs    x8, x11, #0x8
  0x0000400011c56a54:   b.cc    0x0000400011c56a5c  // b.lo, b.ul, b.last
  0x0000400011c56a58:   bl      Stub::Stub Generator zero_blocks_stub
  0x0000400011c56a5c:   tbz     w11, #2, 0x0000400011c56a68
  0x0000400011c56a60:   stp     xzr, xzr, [x10], #16
  0x0000400011c56a64:   stp     xzr, xzr, [x10], #16
  0x0000400011c56a68:   tbz     w11, #1, 0x0000400011c56a70
  0x0000400011c56a6c:   stp     xzr, xzr, [x10], #16
  0x0000400011c56a70:   tbz     w11, #0, 0x0000400011c56a78
  0x0000400011c56a74:   str     xzr, [x10]
--------------------------------------------------------------------------------

Clear 4 words with lower limit 8, zero_words wall time (ns): 289
[       OK ] MacroAssemblerZeroWordsTest.UseBZ_clear_32B_with_lowlimit_8B_vm (2 ms)
[ RUN      ] MacroAssemblerZeroWordsTest.UseBZ_clear_32B_with_lowlimit_256B_vm
--------------------------------------------------------------------------------
udf     #0
  0x0000400011c57400:   mov     x10, #0x6fb0                    // #28592
  0x0000400011c57404:   movk    x10, #0xab05, lsl #16
  0x0000400011c57408:   movk    x10, #0xaaaa, lsl #32
  0x0000400011c5740c:   stp     xzr, xzr, [x10]
  0x0000400011c57410:   stp     xzr, xzr, [x10, #16]
--------------------------------------------------------------------------------

Clear 4 words with lower limit 256, zero_words wall time (ns): 25
[       OK ] MacroAssemblerZeroWordsTest.UseBZ_clear_32B_with_lowlimit_256B_vm (0 ms)
[ RUN      ] MacroAssemblerZeroWordsTest.UseBZ_clear_128B_with_lowlimit_64B_vm
--------------------------------------------------------------------------------
udf     #0
  0x0000400011c57400:   mov     x10, #0x6fe0                    // #28640
  0x0000400011c57404:   movk    x10, #0xab05, lsl #16
  0x0000400011c57408:   movk    x10, #0xaaaa, lsl #32
  0x0000400011c5740c:   orr     x11, xzr, #0x10
  0x0000400011c57410:   subs    x8, x11, #0x8
  0x0000400011c57414:   b.cc    0x0000400011c5741c  // b.lo, b.ul, b.last
  0x0000400011c57418:   bl      Stub::Stub Generator zero_blocks_stub
  0x0000400011c5741c:   tbz     w11, #2, 0x0000400011c57428
  0x0000400011c57420:   stp     xzr, xzr, [x10], #16
  0x0000400011c57424:   stp     xzr, xzr, [x10], #16
  0x0000400011c57428:   tbz     w11, #1, 0x0000400011c57430
  0x0000400011c5742c:   stp     xzr, xzr, [x10], #16
  0x0000400011c57430:   tbz     w11, #0, 0x0000400011c57438
  0x0000400011c57434:   str     xzr, [x10]
--------------------------------------------------------------------------------

Clear 16 words with lower limit 64, zero_words wall time (ns): 170
[       OK ] MacroAssemblerZeroWordsTest.UseBZ_clear_128B_with_lowlimit_64B_vm (0 ms)
[ RUN      ] MacroAssemblerZeroWordsTest.UseBZ_clear_128B_with_lowlimit_256B_vm
--------------------------------------------------------------------------------
udf     #0
  0x0000400011c57400:   mov     x10, #0x6fe0                    // #28640
  0x0000400011c57404:   movk    x10, #0xab05, lsl #16
  0x0000400011c57408:   movk    x10, #0xaaaa, lsl #32
  0x0000400011c5740c:   stp     xzr, xzr, [x10]
  0x0000400011c57410:   stp     xzr, xzr, [x10, #16]
  0x0000400011c57414:   stp     xzr, xzr, [x10, #32]
  0x0000400011c57418:   stp     xzr, xzr, [x10, #48]
  0x0000400011c5741c:   stp     xzr, xzr, [x10, #64]
  0x0000400011c57420:   stp     xzr, xzr, [x10, #80]
  0x0000400011c57424:   stp     xzr, xzr, [x10, #96]
  0x0000400011c57428:   stp     xzr, xzr, [x10, #112]
  0x0000400011c5742c:   add     x10, x10, #0x80
--------------------------------------------------------------------------------

Clear 16 words with lower limit 256, zero_words wall time (ns): 107
[       OK ] MacroAssemblerZeroWordsTest.UseBZ_clear_128B_with_lowlimit_256B_vm (0 ms)
[----------] 4 tests from MacroAssemblerZeroWordsTest (110 ms total)

[----------] Global test environment tear-down
[==========] 4 tests from 1 test suite ran. (110 ms total)
[  PASSED  ] 4 tests.
Finished running test 'gtest:MacroAssemblerZeroWordsTest/server'
Test report is stored in build-pr/test-results/gtest_MacroAssemblerZeroWordsTest_server

==============================
Test summary
==============================
   TEST                                              TOTAL  PASS  FAIL ERROR  SKIP   
   gtest:MacroAssemblerZeroWordsTest/server              4     4     0     0     0   
==============================
TEST SUCCESS
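
For anyone reading the callout sequences in the log above, the shape "call zero_blocks if at least 8 words, then test bits 2, 1 and 0 of the count" can be modeled in plain C++ as below. This is a sketch under assumptions: zero_words_callout_model is a hypothetical name, and the loop merely stands in for the stub, whose real implementation may use DC ZVA or unrolled stores.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Zeroes cnt_words 64-bit words starting at p, following the shape of the
// callout sequence: whole 8-word blocks first, then a tail of 4, 2 and 1 words.
inline void zero_words_callout_model(std::uint64_t* p, std::size_t cnt_words) {
    constexpr std::size_t kBlockWords = 8;  // matches the "subs x8, x11, #0x8" guard

    // Stand-in for the zero_blocks stub: consume whole blocks while enough words remain.
    while (cnt_words >= kBlockWords) {
        for (std::size_t i = 0; i < kBlockWords; ++i) p[i] = 0;
        p += kBlockWords;
        cnt_words -= kBlockWords;
    }

    // Tail: fewer than 8 words remain, so bits 2..0 of the count say what is left.
    if (cnt_words & 4) {  // tbz w11, #2 skipped: two stp instructions, 4 words
        p[0] = p[1] = p[2] = p[3] = 0;
        p += 4;
    }
    if (cnt_words & 2) {  // tbz w11, #1 skipped: one stp instruction, 2 words
        p[0] = p[1] = 0;
        p += 2;
    }
    if (cnt_words & 1) {  // tbz w11, #0 skipped: one str instruction, 1 word
        p[0] = 0;
    }
}
```

For a 4-word request only the bit-2 branch of the tail does any work, which is why the callout form emits several instructions that end up skipped at run time, while the inline form for the same request is just two stp instructions.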

cnqpzhang · 2025-10-22T08:41:24Z

I can't see any statistically-significant improvement. Please tell us your test results and your test conditions.

Hi @theRealAph, do you have any further comments on the updates? Aside from the code changes to BlockZeroingLowLimit, the refinements to code-gen, the added comments, and the tests help improve code clarity and reduce potential technical debt, offering long-term value beyond an immediate gain in execution time. I would appreciate your approval of this PR. Thank you.

adinn · 2025-10-27T11:49:09Z

@cnqpzhang I don't understand why you think these tests indicate anything useful for real use cases. Do you have an actual user whose needs justify adopting this change?

Let's consider what your patch and associated test achieve. Initially you tried to remove the limit on unrolling that was imposed to avoid excessive cache consumption. When it was explained why this was inappropriate you reduced the patch so that it now adjusts the threshold at which unrolling is replaced by a call to the stub. Your two test runs appear to demonstrate a performance improvement between old and new but the difference is more apparent than real. In the specific configurations you have selected your change to the unrolling threshold targets two very specific points of disparity. The new cases fully unroll while the old cases rely on a callout. Not surprisingly, this gives very different performance when you run it in a loop many times. But we already know that callouts are more expensive than inline code.

The important thing to note is that this transition between unrolling vs callout happens in both old and new code, just at different size points. If you ran with other config settings and sizes you could find many cases where both versions fully unroll or both rely on a callout. So your test does not truly reflect what is going on here and your fix is really doing little more than rescaling the dials so they can go up to 11. You have provided no good evidence as to why we need to adjust the scale by which we compute the threshold between unrolling or callout. Furthermore, since this rescaling allows more unrolling to occur than in the old version you still need to justify why that is worth doing.

cnqpzhang · 2025-10-28T07:55:42Z

@adinn Thank you for the good summary of the proposed code changes. You omitted a key condition in the context: -XX:-UseBlockZeroing

I would like to reiterate that I have no objection to the functions when the -XX:+UseBlockZeroing option is set, everything can keep as is. My point is that BlockZeroingLowLimit serves literally/specifically as a switch to control whether DC ZVA instructions are generated for clearing instances under a specified bytes size limitation, rather than for deciding between unrolling and callout. Therefore, it should NOT affect the code-gen results any longer when -XX:-UseBlockZeroing is set, should it?

The initial patch aimed to decouple these uses, but @adinn raised concerns regarding the size of code caches and potential performance side effects. I profiled a SPECjbb2015 PRESET run and presented the minor impact, while @theRealAph commented that this approach might not fully capture all the impacts in detail. Based on the subsequent advice, a compromise was proposed: we could start by resetting BlockZeroingLowLimit to its default value when -XX:-UseBlockZeroing is configured. At this point, I faced two additional challenges: first, how to quantify the statistical improvement; and second, whether I am attempting to demonstrate the patch’s benefits based on assumptions about the -XX options users might provide (I’m not trying to predict, but the new code has begun to behave in this way). So, this is gradually going far beyond my initial purpose, and I think you might prefer to continue using BlockZeroingLowLimit in such a dual-use manner, not only for DC ZVA, but also for unrolling or callout. Perhaps the two flags should be renamed to UseDCZVA and BlockZeroingUnrollLimit respectively.

  product(bool, UseBlockZeroing, true,                                  \
          "Use DC ZVA for block zeroing")                               \
  product(intx, BlockZeroingLowLimit, 256,                              \
          "Minimum size in bytes when block zeroing will be used")      \
          range(wordSize, max_jint)                                     \

With regard to your last question, I would not try to justify why rescaling is worth doing, because that was not my intention. The original motivation was to improve code clarity around the low limit and to express the logic with less ambiguity.

theRealAph · 2025-10-28T08:17:15Z

I would like to reiterate that I have no objection to the functions when the -XX:+UseBlockZeroing option is set, everything can keep as is. My point is that BlockZeroingLowLimit serves literally/specifically as a switch to control whether DC ZVA instructions are generated for clearing instances under a specified bytes size limitation, rather than for deciding between unrolling and callout. Therefore, it should NOT affect the code-gen results any longer when -XX:-UseBlockZeroing is set, should it?

It does not. When -XX:-UseBlockZeroing is set, BlockZeroingLowLimit is ignored.

cnqpzhang · 2025-10-28T08:53:02Z

I would like to reiterate that I have no objection to the functions when the -XX:+UseBlockZeroing option is set, everything can keep as is. My point is that BlockZeroingLowLimit serves literally/specifically as a switch to control whether DC ZVA instructions are generated for clearing instances under a specified bytes size limitation, rather than for deciding between unrolling and callout. Therefore, it should NOT affect the code-gen results any longer when -XX:-UseBlockZeroing is set, should it?

It does not. When -XX:-UseBlockZeroing is set, BlockZeroingLowLimit is ignored.

zero_words does not check UseBlockZeroing, it directly compares cnt and BlockZeroingLowLimit / BytesPerWord.

https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp#L6198C1-L6204C16

address MacroAssembler::zero_words(Register base, uint64_t cnt)
{
  assert(wordSize <= BlockZeroingLowLimit,
            "increase BlockZeroingLowLimit");
  address result = nullptr;
  if (cnt <= (uint64_t)BlockZeroingLowLimit / BytesPerWord) {
#ifndef PRODUCT

In contrast, the inner stub function does so.

https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp#L669

address generate_zero_blocks() {
  Label done;
  Label base_aligned;

  Register base = r10, cnt = r11;

  __ align(CodeEntryAlignment);
  StubId stub_id = StubId::stubgen_zero_blocks_id;
  StubCodeMark mark(this, stub_id);
  address start = __ pc();

  if (UseBlockZeroing) {
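To make the two decision points concrete, here is a minimal standalone model of them. It is not the HotSpot code: `ZeroingFlags`, `outer_path` and `stub_may_use_dc_zva` are hypothetical names, and only the comparison `cnt <= BlockZeroingLowLimit / BytesPerWord` and the `if (UseBlockZeroing)` check are taken from the excerpts above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the two flags discussed in this thread.
struct ZeroingFlags {
  bool     use_block_zeroing;        // models UseBlockZeroing
  uint64_t block_zeroing_low_limit;  // models BlockZeroingLowLimit, in bytes
};

constexpr uint64_t kBytesPerWord = 8;  // 64-bit words on AArch64

enum class OuterPath { InlineUnrolled, CallOut };

// Models the outer check in zero_words(Register base, uint64_t cnt):
// only the limit is consulted, use_block_zeroing is never read here.
inline OuterPath outer_path(const ZeroingFlags& f, uint64_t cnt_words) {
  assert(kBytesPerWord <= f.block_zeroing_low_limit);
  return cnt_words <= f.block_zeroing_low_limit / kBytesPerWord
             ? OuterPath::InlineUnrolled
             : OuterPath::CallOut;
}

// Models the inner check in generate_zero_blocks(): the DC ZVA path is
// only considered when the flag is on.
inline bool stub_may_use_dc_zva(const ZeroingFlags& f) {
  return f.use_block_zeroing;
}
```

In this model, `outer_path({false, 8}, 4)` selects the call-out path even though `stub_may_use_dc_zva({false, 8})` is false, which is the situation under discussion: the limit still steers "unroll vs callout" when DC ZVA is disabled.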

theRealAph · 2025-10-28T12:08:01Z

I would like to reiterate that I have no objection to the functions when the -XX:+UseBlockZeroing option is set, everything can keep as is. My point is that BlockZeroingLowLimit serves literally/specifically as a switch to control whether DC ZVA instructions are generated for clearing instances under a specified bytes size limitation, rather than for deciding between unrolling and callout. Therefore, it should NOT affect the code-gen results any longer when -XX:-UseBlockZeroing is set, should it?

It does not. When -XX:-UseBlockZeroing is set, BlockZeroingLowLimit is ignored.

zero_words does not check UseBlockZeroing, it directly compares cnt and BlockZeroingLowLimit / BytesPerWord.

It doesn't need to because

  if (!UseBlockZeroing && !FLAG_IS_DEFAULT(BlockZeroingLowLimit)) {
    warning("BlockZeroingLowLimit has been ignored because UseBlockZeroing is disabled");
    FLAG_SET_DEFAULT(BlockZeroingLowLimit, is_zva_enabled() ? (4 * VM_Version::zva_length()) : 256);
  }

That is to say, if a user sets BlockZeroingLowLimit and -XX:-UseBlockZeroing, then the user's BlockZeroingLowLimit is, rightly, ignored.
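The effect of that check can be sketched outside the VM as follows. This is only an illustration under stated assumptions: `LimitFlag` and `reset_limit_if_ignored` are hypothetical names, the "is default" bit stands in for what `FLAG_IS_DEFAULT` reports, and the ZVA length is passed in as a plain number instead of being queried from the CPU.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of a VM flag that remembers whether the user set it.
struct LimitFlag {
  int64_t value;       // limit in bytes
  bool    is_default;  // false once the user has set it on the command line
};

// Mirrors the shape of the check quoted above: a user-supplied limit is
// discarded when block zeroing is off. Returns true when the real code
// would print its warning.
inline bool reset_limit_if_ignored(bool use_block_zeroing, bool zva_enabled,
                                   int64_t zva_length, LimitFlag& limit) {
  assert(!zva_enabled || zva_length > 0);
  if (!use_block_zeroing && !limit.is_default) {
    limit.value = zva_enabled ? 4 * zva_length : 256;
    limit.is_default = true;
    return true;
  }
  return false;
}
```

With an assumed ZVA length of 64 bytes, a user-supplied limit of 8 combined with block zeroing off is reset to 256; with block zeroing on, the user's value is left alone.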

cnqpzhang · 2025-10-29T14:59:33Z

That is to say, if a user sets BlockZeroingLowLimit and -XX:-UseBlockZeroing, then the user's BlockZeroingLowLimit is, rightly, ignored.

Yes, this is the current state with the patch, and it is the compromise I can accept: zero_words uses BlockZeroingLowLimit to decide between "unroll vs callout" without checking UseBlockZeroing. I added comments to keep others from similar confusion or misunderstanding about this code snippet.

Is there anything else we need to do for this PR?

cnqpzhang · 2025-11-05T03:48:56Z

Hi @theRealAph and @adinn, please let me know if you have any additional comments on this PR, or advice to improve it. Thank you.

cnqpzhang · 2025-11-21T06:33:21Z

Hi,

The status of this PR has not been updated for a couple of weeks. I think I’ve addressed the feedback provided, but I haven’t seen any further comments or decisions. Please let me know if there’s anything else I can do to improve the patch or if it’s ready to move forward. Thanks for your time!

theRealAph · 2025-11-21T16:15:07Z

src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp

  BLOCK_COMMENT("zero_words {");
  assert(ptr == r10 && cnt == r11, "mismatch in register usage");
-  RuntimeAddress zero_blocks = RuntimeAddress(StubRoutines::aarch64::zero_blocks());
-  assert(zero_blocks.target() != nullptr, "zero_blocks stub has not been generated");


What is the point of this change?

theRealAph · 2025-11-21T16:17:55Z

src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp

+    mov(r10, base); mov(r11, cnt);
+    result = zero_words(r10, r11);
+  } else {
 #ifndef PRODUCT


What is this change for?

theRealAph · 2025-11-21T16:22:14Z

src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp

-      uint64_t loops = cnt/16;
+    // Use 16 words as the block size which is 128 bytes on 64-bit systems.
+    // A complete loop body will be 8 STPs unrolled there.
+    const int block_size = 16;


Naming this constant block_size only adds to any confusion, IMO.

I wondered why MacroAssembler::zero_words uses 16 words to do stp unrolling, while generate_zero_blocks() 8 words (const int MacroAssembler::zero_words_block_size = 8;), so defined this variable to compare 8 vs 16 but did not find obvious performance difference.

Regarding the var name block_size, could unroll or unroll_words be better?

What's wrong with 16? I'm asking not from a "my teachers said always name constants" point of view, but from a reader's understanding point of view. Named constants are all well and good if the constant has some meaning, but this one is just two words. Perhaps 2 * WordSize would do.
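For reference, the arithmetic behind the diff comment (16 words, 128 bytes, 8 STPs per loop body) and the 8-versus-16 comparison can be written out as a small standalone sketch. The names `UnrollPlan` and `plan_unroll` are hypothetical, and the tail count is simply the leftover words: how the real code handles the tail is not shown in this thread.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t kBytesPerWord = 8;  // 64-bit words
constexpr uint64_t kWordsPerStp  = 2;  // one STP stores a pair of 64-bit registers

struct UnrollPlan {
  uint64_t loops;          // full iterations of the unrolled body
  uint64_t tail_words;     // words left over after the full iterations
  uint64_t stps_per_loop;  // STP instructions in one unrolled body
};

// Splits a word count into full unrolled iterations plus a tail.
// unroll_words defaults to 16, the literal discussed above.
inline UnrollPlan plan_unroll(uint64_t cnt_words, uint64_t unroll_words = 16) {
  assert(unroll_words > 0 && unroll_words % kWordsPerStp == 0);
  return {cnt_words / unroll_words, cnt_words % unroll_words,
          unroll_words / kWordsPerStp};
}

static_assert(16 * kBytesPerWord == 128, "16 words are 128 bytes");
static_assert(16 / kWordsPerStp == 8, "16 words take 8 STPs");
```

For a 40-word clear, `plan_unroll(40)` gives 2 loops of 8 STPs with 8 words left over, while `plan_unroll(40, 8)` gives 5 loops of 4 STPs and no tail.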

cnqpzhang

Thanks for the review. Please see my updates and replies in the new commit.

cnqpzhang · 2025-11-24T07:40:19Z

src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp

-      uint64_t loops = cnt/16;
+    // Use 16 words as the block size which is 128 bytes on 64-bit systems.
+    // A complete loop body will be 8 STPs unrolled there.
+    const int block_size = 16;


I wondered why MacroAssembler::zero_words uses 16 words for stp unrolling while generate_zero_blocks() uses 8 words (const int MacroAssembler::zero_words_block_size = 8;), so I defined this variable to compare 8 vs 16, but did not find any obvious performance difference.

Regarding the variable name block_size, would unroll or unroll_words be better?

cnqpzhang · 2025-11-24T08:46:00Z

src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp

  BLOCK_COMMENT("zero_words {");
  assert(ptr == r10 && cnt == r11, "mismatch in register usage");
-  RuntimeAddress zero_blocks = RuntimeAddress(StubRoutines::aarch64::zero_blocks());
-  assert(zero_blocks.target() != nullptr, "zero_blocks stub has not been generated");


There are duplicates of getting the address of zero_blocks() and the assertion.

The first was originally introduced by [1] and got subsequently duplicated nearby with [2]. Was there a specific reason to have one copy placed after the br(LO, around) and another before it? I tried removing one instance and tests did not report any issue.

[1] 8179444: AArch64: Put zero_words on a diet, 1ce2a36#diff-fe18bdf6585d1a0d4d510f382a568c4428334d4ad941581ecc10ec60ccafca4aR4971-R4972
[2] 8270947: AArch64: C1: use zero_words to initialize all objects
6c68ce2#diff-0f4150a9c607ccd590bf256daa800c0276144682a92bc6bdced5e8bc1bb81f3aR4625-R4626

cnqpzhang · 2025-11-24T09:13:19Z

src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp

+    mov(r10, base); mov(r11, cnt);
+    result = zero_words(r10, r11);
+  } else {
 #ifndef PRODUCT


I swapped the if and else code blocks for two reasons:
1). Initially, I intended to add a check for UseBlockZeroing to determine whether to call zero_words_reg_reg. Swapping the if and else branches makes it easier to compare the behavior with and without this additional condition. Otherwise, the condition would be if (!UseBlockZeroing || cnt <= (uint64_t)BlockZeroingLowLimit / BytesPerWord).
2). Later, we decided not to check UseBlockZeroing here, but I did not roll back this change because the warning comment "There is no need to check UseBlockZeroing.." should be placed before this if condition rather than the old one.
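The two orientations mentioned in point 1 can be compared in a small standalone sketch. It models the initially intended UseBlockZeroing check that was later dropped, not the code that was finally committed, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Original orientation with the extra check: the if-branch is the inline
// (unrolled) path, so the condition needs a negation and an OR.
inline bool takes_inline_path(bool use_block_zeroing, uint64_t cnt_words,
                              uint64_t limit_words) {
  return !use_block_zeroing || cnt_words <= limit_words;
}

// Swapped orientation: the if-branch is the call-out path, so the same
// decision reads as a plain AND.
inline bool takes_callout_path(bool use_block_zeroing, uint64_t cnt_words,
                               uint64_t limit_words) {
  return use_block_zeroing && cnt_words > limit_words;
}

// By De Morgan's law exactly one of the two predicates holds for any input.
inline void check_orientations_agree(bool use_block_zeroing, uint64_t cnt_words,
                                     uint64_t limit_words) {
  assert(takes_inline_path(use_block_zeroing, cnt_words, limit_words) !=
         takes_callout_path(use_block_zeroing, cnt_words, limit_words));
}
```

Both forms select the same path for every input; the swapped form only makes the extra condition easier to read and to compare against the version without it.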

8365991: AArch64: Ignore BlockZeroingLowLimit when UseBlockZeroing is false (98ee279)
Signed-off-by: Patrick Zhang <patrick@os.amperecomputing.com>

openjdk bot added the hotspot (hotspot-dev@openjdk.org) and rfr (Pull request is ready for review) labels Aug 24, 2025

Roll back main changes on zero_words_reg_reg and generate_zero_blocks (14c18f7)
Signed-off-by: Patrick Zhang <patrick@os.amperecomputing.com>

Reset BlockZeroingLowLimit to 4 * _zva_length (22e72f4)
Signed-off-by: Patrick Zhang <patrick@os.amperecomputing.com>

Check is_zva_enabled when resetting BlockZeroingLowLimit (f23abb9)
Signed-off-by: Patrick Zhang <patrick@os.amperecomputing.com>

amptest and others added 3 commits October 16, 2025 18:16

Benchmark MacroAssembler::zero_words calls (04a01da)
Signed-off-by: Patrick Zhang <patrick@os.amperecomputing.com>

fixed format string to %zu for size_t unsigned size type (c748d21)
Signed-off-by: Patrick Zhang <patrick@os.amperecomputing.com>

Refine the count types to pass mac and win builds (2bbc1d0)
Signed-off-by: Patrick Zhang <patrick@os.amperecomputing.com>

theRealAph reviewed Nov 21, 2025: three comments on src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp and one on src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp, all now marked outdated

cnqpzhang commented Nov 24, 2025

Improve the comments for zero_words funcs (a23ec87)
Signed-off-by: Patrick Zhang <patrick@os.amperecomputing.com>
