
[GR-35746] Decrease aligned chunk size to 512 KB. #6115

Merged: 5 commits into master on Mar 25, 2023

Conversation

graalvmbot
Collaborator

No description provided.

@oracle-contributor-agreement oracle-contributor-agreement bot added the OCA Verified All contributors have signed the Oracle Contributor Agreement. label Mar 3, 2023
@SergejIsbrecht

Did you measure any effect on GC activity? Aligned chunks are used for TLABs, right? They are not resized, so reducing the size of a TLAB would impact GC activity.

@peter-hofer
Member

We did a good amount of benchmarking for this, and overall the improvements in memory usage and image size outweigh the regressions in individual cases.

Aligned chunks are indeed used for TLABs and are not resized, but decreasing their size does not automatically mean more garbage collections will happen. It means:

  • The memory is divided into more chunks, and threads might need to get a new TLAB more often.
  • Threads with a low allocation rate hoard less memory in their TLABs.
  • We might do fewer GCs, because we decide that based on the size of all allocated chunks, not the bytes allocated in them.
  • There is less waste from chunk alignment in images (particularly small ones).
  • Pinning an object keeps fewer other objects alive (because a pinned object currently keeps the entire chunk alive).

This overall change in behavior in turn affects the decisions of the GC policy about the size of spaces and whether to do incremental or full collections, which also has a major impact.
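To make the chunk-accounting point concrete, here is a small arithmetic sketch (the heap size and class name are made up for illustration, not taken from the Native Image sources): the same amount of allocated memory divides into twice as many 512 KB chunks as 1 MB chunks, so a policy that reasons about chunk totals works at twice the granularity.

```java
public class ChunkMath {
    // Ceiling division: number of aligned chunks needed to cover heapBytes.
    static long chunksFor(long heapBytes, long chunkSize) {
        return (heapBytes + chunkSize - 1) / chunkSize;
    }

    public static void main(String[] args) {
        long allocated = 64L * 1024 * 1024; // hypothetical 64 MB of allocations
        // A policy deciding based on chunk counts sees the same memory
        // at twice the granularity with 512 KB chunks.
        System.out.println(chunksFor(allocated, 1024 * 1024)); // 64 chunks at 1 MB
        System.out.println(chunksFor(allocated, 512 * 1024));  // 128 chunks at 512 KB
    }
}
```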

@SergejIsbrecht

SergejIsbrecht commented Mar 5, 2023

@peter-hofer ,

thank you for your explanation. It was insightful.

good amount of benchmarking for this and overall

How did you measure it? I have a very large application running on aarch64 and could do some tests as well. Our allocation rate is quite high.

and that we might do fewer GCs because we decide that based on the size of all allocated chunks, not the bytes allocated in them.

I was reading https://shipilev.net/jvm/anatomy-quarks/4-tlab-allocation/ and might have concluded something incorrectly. If an allocation cannot be done in the TLAB, will it be done in a new TLAB or directly in the heap? On OpenJDK, I remember seeing slow-path allocations in sampled stacks. Therefore, bigger TLABs would increase throughput, because slow-path allocations would happen less often, at least on OpenJDK.

and that pinning an object keeps fewer other objects alive (because a pinned object currently keeps the entire chunk alive)

  • Could that be a leak cause?
  • Is there any documentation what pinned objects are?
  • Can I somehow print out statistics about the amount of active TLABs and how many pinned objects are currently held in the Java heap, just like verbose GC output?
  • If a pinned object holds a TLAB with some objects in it and I trigger a heap dump, would the objects in the heap dump be seen as "reachable" or as "not reachable" when traversing the object graph with, for example, the Eclipse Memory Analyzer (MAT)?

This overall change in behavior in turn affects the decisions of the GC policy about the size of spaces and whether to do incremental or full collections, which also has a major impact.

If I were to change the aligned chunk size to 512 KB, would it behave just like this branch, or do I need to test with this branch?

Thank you for your time and insight.

@peter-hofer
Member

How did you measure it?

We have CI infrastructure that uses mx benchmark to run the DaCapo, DaCapo-Scala, and Renaissance suites, as well as some microservice benchmarks. Feel free to run your own tests with -H:AlignedHeapChunkSize, which behaves the same as this PR.

I was reading https://shipilev.net/jvm/anatomy-quarks/4-tlab-allocation/ and might have conclude something wrong. If a allocation can not be done in TLAB, will it be done in a new TLAB or in the heap?

That article specifically describes HotSpot and the Epsilon GC. Native Image always allocates small objects in a TLAB (aligned chunk) and allocates large arrays that exceed a certain threshold in unaligned chunks (with near-exact size). The threshold is 1/8 of the aligned chunk size and has therefore become smaller as part of this change (128 KB -> 64 KB), but a larger threshold did not fare better. When a large-ish object that is still below the threshold does not fit into the current TLAB, we retire the TLAB and get a new one to allocate the object, which in the very worst case wastes nearly 12.5% of each chunk.
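The allocation rules described above can be sketched as a simplified decision function. The class, method, and return strings below are hypothetical; the real Native Image fast path is bump-pointer machine code, not a string-returning helper. Note how the worst case arises: an object just under the 64 KB threshold fails to fit, the remaining space in the 512 KB chunk is abandoned, and up to 1/8 of the chunk (12.5%) is wasted.

```java
public class AllocationPath {
    static final long ALIGNED_CHUNK_SIZE = 512 * 1024;                // new default
    static final long LARGE_ARRAY_THRESHOLD = ALIGNED_CHUNK_SIZE / 8; // 64 KB

    // Simplified sketch of which memory an allocation request ends up in.
    static String pathFor(long objectSize, long remainingInTlab) {
        if (objectSize >= LARGE_ARRAY_THRESHOLD) {
            return "unaligned chunk";   // sized near-exactly for the object
        } else if (objectSize <= remainingInTlab) {
            return "current TLAB";      // fast-path bump-pointer allocation
        } else {
            return "new TLAB";          // retire current chunk: up to ~12.5% waste
        }
    }

    public static void main(String[] args) {
        System.out.println(pathFor(128 * 1024, 200 * 1024)); // unaligned chunk
        System.out.println(pathFor(10 * 1024, 30 * 1024));   // current TLAB
        System.out.println(pathFor(60 * 1024, 30 * 1024));   // new TLAB
    }
}
```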

Indeed it would be preferable to dynamically size TLABs instead of using fixed-size aligned chunks for every workload.

Could that be a leak cause? Is there any documentation what pinned objects are?
Can I somehow print out statistics about the amount of active TLABs and how many pinned objects are ...

Yes, pinned objects can cause temporary leaks, but are commonly used just briefly to enable native code to access memory on the Java heap. See PinnedObject in the API and its usages. I don't know off the top of my head how much information we expose on pinning, but have a look at -H:VerboseGC, -H:PrintHeapShape, -H:TraceHeapChunks, etc.
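Since org.graalvm.nativeimage.PinnedObject only works at run time inside a native image, the sketch below uses an entirely hypothetical plain-JVM stand-in class that mirrors only its AutoCloseable lifecycle, to illustrate the brief pin-use-unpin pattern recommended above (pin, hand the memory to native code, unpin promptly so the chunk can be collected again):

```java
// Hypothetical stand-in for org.graalvm.nativeimage.PinnedObject; it models
// only the open/close lifecycle, not real address pinning.
final class FakePinnedObject implements AutoCloseable {
    private final Object referent;
    private boolean open = true;

    private FakePinnedObject(Object referent) { this.referent = referent; }

    static FakePinnedObject create(Object referent) {
        return new FakePinnedObject(referent);
    }

    boolean isOpen() { return open; }

    // Unpin: in a native image, the GC may again move or collect the chunk.
    @Override public void close() { open = false; }
}

public class PinningDemo {
    public static void main(String[] args) {
        byte[] buffer = new byte[256];
        // Recommended pattern: pin only for the duration of the native call.
        try (FakePinnedObject pin = FakePinnedObject.create(buffer)) {
            // While pinned, the entire aligned chunk containing 'buffer'
            // stays alive, so keeping pins open long-term leaks memory.
            System.out.println(pin.isOpen()); // true
        }
        // After close(), the chunk is eligible for collection again.
    }
}
```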

@graalvmbot graalvmbot closed this Mar 25, 2023
@graalvmbot graalvmbot merged commit b1f3a7a into master Mar 25, 2023
@graalvmbot graalvmbot deleted the ph/GR-35746-512k branch March 25, 2023 14:01
@SergejIsbrecht

@peter-hofer ,

I did some testing on aarch64 with a 512 KB and a 2 MiB AlignedHeapChunkSize, and it seems allocation-heavy workloads are not impacted, at least not in this case.

Tested on the latest 23.1 dev version (aarch64)

512k

Benchmark                                                       Mode  Cnt      Score     Error   Units
FromStringBench.fromString_baseline_matching                    avgt    5    103.122 ±   0.516   ns/op
FromStringBench.fromString_baseline_matching:·gc.alloc.rate     avgt    5        ≈ 0            MB/sec
FromStringBench.fromString_baseline_matching:·gc.count          avgt    5     87.000            counts
FromStringBench.fromString_baseline_matching:·gc.time           avgt    5    688.000                ms

FromStringBench.fromString_baseline_notMatching                 avgt    5  26926.881 ± 229.601   ns/op
FromStringBench.fromString_baseline_notMatching:·gc.alloc.rate  avgt    5        ≈ 0            MB/sec
FromStringBench.fromString_baseline_notMatching:·gc.count       avgt    5     52.000            counts
FromStringBench.fromString_baseline_notMatching:·gc.time        avgt    5    426.000                ms

FromStringBench.fromString_improved_matching                    avgt    5     16.625 ±   0.018   ns/op
FromStringBench.fromString_improved_matching:·gc.alloc.rate     avgt    5        ≈ 0            MB/sec
FromStringBench.fromString_improved_matching:·gc.count          avgt    5        ≈ 0            counts

FromStringBench.fromString_improved_notMatching                 avgt    5     16.623 ±   0.018   ns/op
FromStringBench.fromString_improved_notMatching:·gc.alloc.rate  avgt    5        ≈ 0            MB/sec
FromStringBench.fromString_improved_notMatching:·gc.count       avgt    5        ≈ 0            counts

2MiB

Benchmark                                                       Mode  Cnt      Score     Error   Units
FromStringBench.fromString_baseline_matching                    avgt    5     98.789 ±   2.175   ns/op
FromStringBench.fromString_baseline_matching:·gc.alloc.rate     avgt    5        ≈ 0            MB/sec
FromStringBench.fromString_baseline_matching:·gc.count          avgt    5     91.000            counts
FromStringBench.fromString_baseline_matching:·gc.time           avgt    5    708.000                ms

FromStringBench.fromString_baseline_notMatching                 avgt    5  27154.021 ± 243.917   ns/op
FromStringBench.fromString_baseline_notMatching:·gc.alloc.rate  avgt    5        ≈ 0            MB/sec
FromStringBench.fromString_baseline_notMatching:·gc.count       avgt    5     52.000            counts
FromStringBench.fromString_baseline_notMatching:·gc.time        avgt    5    427.000                ms

FromStringBench.fromString_improved_matching                    avgt    5     16.627 ±   0.017   ns/op
FromStringBench.fromString_improved_matching:·gc.alloc.rate     avgt    5        ≈ 0            MB/sec
FromStringBench.fromString_improved_matching:·gc.count          avgt    5        ≈ 0            counts

FromStringBench.fromString_improved_notMatching                 avgt    5     16.632 ±   0.025   ns/op
FromStringBench.fromString_improved_notMatching:·gc.alloc.rate  avgt    5        ≈ 0            MB/sec
FromStringBench.fromString_improved_notMatching:·gc.count       avgt    5        ≈ 0            counts

Note: the GC profiler does not seem to work properly with native-image under JMH.

@peter-hofer
Member

Thanks for sharing, @SergejIsbrecht. So how does that compare to the former default of 1 MB?

3 participants