
[GR-35746] Decrease aligned chunk size to 512 KB. #6115

Merged: 5 commits into master on Mar 25, 2023

Conversation

graalvmbot
Collaborator

No description provided.

@oracle-contributor-agreement oracle-contributor-agreement bot added the OCA Verified All contributors have signed the Oracle Contributor Agreement. label Mar 3, 2023
@SergejIsbrecht

Did you measure any effect on GC activity? Aligned chunks are used for TLABs, right? They are not resized, so reducing the size of a TLAB would impact GC activity.

@peter-hofer
Member

We did a good amount of benchmarking for this, and overall the improvements in memory usage and image size outweigh the regressions in individual cases.

Aligned chunks are indeed used for TLABs and are not resized, but decreasing their size does not automatically mean more garbage collections will happen. It means:

  • The memory is divided into more chunks, and threads might need to get a new TLAB more often.
  • Threads with a low allocation rate hoard less memory in their TLABs.
  • We might do fewer GCs, because we decide that based on the size of all allocated chunks, not the bytes allocated in them.
  • There is less waste from chunk alignment in images (particularly small ones).
  • Pinning an object keeps fewer other objects alive (because a pinned object currently keeps the entire chunk alive).

This overall change in behavior in turn affects the decisions of the GC policy about the size of spaces and whether to do incremental or full collections, which also has a major impact.
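To make the chunk-accounting point concrete, here is a small arithmetic sketch (the heap size and class name are made up for illustration, not taken from the Native Image sources): the same amount of allocated memory divides into twice as many 512 KB chunks as 1 MB chunks, so a policy that reasons about chunk totals works at twice the granularity.

```java
public class ChunkMath {
    // Ceiling division: number of aligned chunks needed to cover heapBytes.
    static long chunksFor(long heapBytes, long chunkSize) {
        return (heapBytes + chunkSize - 1) / chunkSize;
    }

    public static void main(String[] args) {
        long allocated = 64L * 1024 * 1024; // hypothetical 64 MB of allocations
        // A policy deciding based on chunk counts sees the same memory
        // at twice the granularity with 512 KB chunks.
        System.out.println(chunksFor(allocated, 1024 * 1024)); // 64 chunks at 1 MB
        System.out.println(chunksFor(allocated, 512 * 1024));  // 128 chunks at 512 KB
    }
}
```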

@SergejIsbrecht

SergejIsbrecht commented Mar 5, 2023

@peter-hofer ,

thank you for your explanation. It was insightful.

good amount of benchmarking for this and overall

How did you measure it? I have a very large application running on aarch64 and could do some tests as well. Our allocation rate is quite high.

and that we might do fewer GCs because we decide that based on the size of all allocated chunks, not the bytes allocated in them.

I was reading https://shipilev.net/jvm/anatomy-quarks/4-tlab-allocation/ and might have concluded something incorrectly. If an allocation cannot be done in the TLAB, will it be done in a new TLAB or directly in the heap? On OpenJDK, I remember seeing slow-path allocations in sampled stacks. Therefore, bigger TLABs would increase throughput, because slow-path allocations would happen less often, at least on OpenJDK.

and that pinning an object keeps fewer other objects alive (because a pinned object currently keeps the entire chunk alive)

  • Could that be a leak cause?
  • Is there any documentation what pinned objects are?
  • Can I somehow print out statistics about the amount of active TLABs and how many pinned objects are currently held in the Java heap, just like verbose GC output?
  • If a pinned object holds a TLAB with some objects in it and I trigger a heap dump, would the objects in the heap dump be seen as "reachable" or as "not reachable" when traversing the object graph with, for example, the Eclipse Memory Analyzer (MAT)?

This overall change in behavior in turn affects the decisions of the GC policy about the size of spaces and whether to do incremental or full collections, which also has a major impact.

If I were to change the aligned chunk size to 512 KB, would it behave just like this branch, or do I need to test with this branch?

Thank you for your time and insight.

@peter-hofer
Member

How did you measure it?

We have CI infrastructure that uses mx benchmark to run the DaCapo, DaCapo-Scala, and Renaissance suites, as well as some microservice benchmarks. Feel free to run your own tests with -H:AlignedHeapChunkSize, which behaves the same as this PR.

I was reading https://shipilev.net/jvm/anatomy-quarks/4-tlab-allocation/ and might have conclude something wrong. If a allocation can not be done in TLAB, will it be done in a new TLAB or in the heap?

That article specifically describes HotSpot and the Epsilon GC. Native Image always allocates small objects in a TLAB (aligned chunk) and allocates large arrays that exceed a certain threshold in unaligned chunks (with near-exact size). The threshold is 1/8 of the aligned chunk size and has therefore become smaller as part of this change (128 KB -> 64 KB), but a larger threshold did not fare better. When a large-ish object that is still below the threshold does not fit into the current TLAB, we retire the TLAB and get a new one to allocate the object, which in the very worst case wastes nearly 12.5% of each chunk.
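The allocation rules described above can be sketched as a simplified decision function. The class, method, and return strings below are hypothetical; the real Native Image fast path is bump-pointer machine code, not a string-returning helper. Note how the worst case arises: an object just under the 64 KB threshold fails to fit, the remaining space in the 512 KB chunk is abandoned, and up to 1/8 of the chunk (12.5%) is wasted.

```java
public class AllocationPath {
    static final long ALIGNED_CHUNK_SIZE = 512 * 1024;                // new default
    static final long LARGE_ARRAY_THRESHOLD = ALIGNED_CHUNK_SIZE / 8; // 64 KB

    // Simplified sketch of which memory an allocation request ends up in.
    static String pathFor(long objectSize, long remainingInTlab) {
        if (objectSize >= LARGE_ARRAY_THRESHOLD) {
            return "unaligned chunk";   // sized near-exactly for the object
        } else if (objectSize <= remainingInTlab) {
            return "current TLAB";      // fast-path bump-pointer allocation
        } else {
            return "new TLAB";          // retire current chunk: up to ~12.5% waste
        }
    }

    public static void main(String[] args) {
        System.out.println(pathFor(128 * 1024, 200 * 1024)); // unaligned chunk
        System.out.println(pathFor(10 * 1024, 30 * 1024));   // current TLAB
        System.out.println(pathFor(60 * 1024, 30 * 1024));   // new TLAB
    }
}
```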

Indeed it would be preferable to dynamically size TLABs instead of using fixed-size aligned chunks for every workload.

Could that be a leak cause? Is there any documentation what pinned objects are?
Can I somehow print out statistics about the amount of active TLABs and how many pinned objects are ...

Yes, pinned objects can cause temporary leaks, but are commonly used just briefly to enable native code to access memory on the Java heap. See PinnedObject in the API and its usages. I don't know off the top of my head how much information we expose on pinning, but have a look at -H:VerboseGC, -H:PrintHeapShape, -H:TraceHeapChunks, etc.
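Since org.graalvm.nativeimage.PinnedObject only works at run time inside a native image, the sketch below uses an entirely hypothetical plain-JVM stand-in class that mirrors only its AutoCloseable lifecycle, to illustrate the brief pin-use-unpin pattern recommended above (pin, hand the memory to native code, unpin promptly so the chunk can be collected again):

```java
// Hypothetical stand-in for org.graalvm.nativeimage.PinnedObject; it models
// only the open/close lifecycle, not real address pinning.
final class FakePinnedObject implements AutoCloseable {
    private final Object referent;
    private boolean open = true;

    private FakePinnedObject(Object referent) { this.referent = referent; }

    static FakePinnedObject create(Object referent) {
        return new FakePinnedObject(referent);
    }

    boolean isOpen() { return open; }

    // Unpin: in a native image, the GC may again move or collect the chunk.
    @Override public void close() { open = false; }
}

public class PinningDemo {
    public static void main(String[] args) {
        byte[] buffer = new byte[256];
        // Recommended pattern: pin only for the duration of the native call.
        try (FakePinnedObject pin = FakePinnedObject.create(buffer)) {
            // While pinned, the entire aligned chunk containing 'buffer'
            // stays alive, so keeping pins open long-term leaks memory.
            System.out.println(pin.isOpen()); // true
        }
        // After close(), the chunk is eligible for collection again.
    }
}
```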

@graalvmbot graalvmbot closed this Mar 25, 2023
@graalvmbot graalvmbot merged commit b1f3a7a into master Mar 25, 2023
@graalvmbot graalvmbot deleted the ph/GR-35746-512k branch March 25, 2023 14:01
@SergejIsbrecht

@peter-hofer ,

I did some testing on aarch64 with a 512 KB and a 2 MiB AlignedHeapChunkSize, and it seems allocation-heavy workloads are not impacted, at least not in this case.

Tested on the latest 23.1 dev version (aarch64)

512k

Benchmark                                                       Mode  Cnt      Score     Error   Units
FromStringBench.fromString_baseline_matching                    avgt    5    103.122 ±   0.516   ns/op
FromStringBench.fromString_baseline_matching:·gc.alloc.rate     avgt    5        ≈ 0            MB/sec
FromStringBench.fromString_baseline_matching:·gc.count          avgt    5     87.000            counts
FromStringBench.fromString_baseline_matching:·gc.time           avgt    5    688.000                ms

FromStringBench.fromString_baseline_notMatching                 avgt    5  26926.881 ± 229.601   ns/op
FromStringBench.fromString_baseline_notMatching:·gc.alloc.rate  avgt    5        ≈ 0            MB/sec
FromStringBench.fromString_baseline_notMatching:·gc.count       avgt    5     52.000            counts
FromStringBench.fromString_baseline_notMatching:·gc.time        avgt    5    426.000                ms

FromStringBench.fromString_improved_matching                    avgt    5     16.625 ±   0.018   ns/op
FromStringBench.fromString_improved_matching:·gc.alloc.rate     avgt    5        ≈ 0            MB/sec
FromStringBench.fromString_improved_matching:·gc.count          avgt    5        ≈ 0            counts

FromStringBench.fromString_improved_notMatching                 avgt    5     16.623 ±   0.018   ns/op
FromStringBench.fromString_improved_notMatching:·gc.alloc.rate  avgt    5        ≈ 0            MB/sec
FromStringBench.fromString_improved_notMatching:·gc.count       avgt    5        ≈ 0            counts

2MiB

Benchmark                                                       Mode  Cnt      Score     Error   Units
FromStringBench.fromString_baseline_matching                    avgt    5     98.789 ±   2.175   ns/op
FromStringBench.fromString_baseline_matching:·gc.alloc.rate     avgt    5        ≈ 0            MB/sec
FromStringBench.fromString_baseline_matching:·gc.count          avgt    5     91.000            counts
FromStringBench.fromString_baseline_matching:·gc.time           avgt    5    708.000                ms

FromStringBench.fromString_baseline_notMatching                 avgt    5  27154.021 ± 243.917   ns/op
FromStringBench.fromString_baseline_notMatching:·gc.alloc.rate  avgt    5        ≈ 0            MB/sec
FromStringBench.fromString_baseline_notMatching:·gc.count       avgt    5     52.000            counts
FromStringBench.fromString_baseline_notMatching:·gc.time        avgt    5    427.000                ms

FromStringBench.fromString_improved_matching                    avgt    5     16.627 ±   0.017   ns/op
FromStringBench.fromString_improved_matching:·gc.alloc.rate     avgt    5        ≈ 0            MB/sec
FromStringBench.fromString_improved_matching:·gc.count          avgt    5        ≈ 0            counts

FromStringBench.fromString_improved_notMatching                 avgt    5     16.632 ±   0.025   ns/op
FromStringBench.fromString_improved_notMatching:·gc.alloc.rate  avgt    5        ≈ 0            MB/sec
FromStringBench.fromString_improved_notMatching:·gc.count       avgt    5        ≈ 0            counts

Note: the GC profiler does not seem to work properly with native-image under JMH.

@peter-hofer
Member

Thanks for sharing, @SergejIsbrecht. So how does that compare to the former default of 1 MB?

3 participants