
Malloc Arena RSS increase between 2.16.11.Final and 3.x/main #36204

Closed
franz1981 opened this issue Sep 28, 2023 · 20 comments
Labels
area/core kind/enhancement New feature or request

Comments

@franz1981
Contributor

franz1981 commented Sep 28, 2023

Description

Environment:

java version "17.0.7" 2023-04-18 LTS
Java(TM) SE Runtime Environment (build 17.0.7+8-LTS-224)
Java HotSpot(TM) 64-Bit Server VM (build 17.0.7+8-LTS-224, mixed mode, sharing)

and OS:

Linux fedora 6.3.8-200.fc38.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Jun 15 02:15:40 UTC 2023 x86_64 GNU/Linux

CPU:

AMD Ryzen 9 7950X 16-Core Processor @ 4.5 GHz, using the network-latency tuned profile, with turbo boost disabled

Instructions

Run https://github.com/quarkusio/quarkus-quickstarts/tree/2.16.11.Final/config-quickstart
versus https://github.com/quarkusio/quarkus-quickstarts/tree/3.4.1/config-quickstart
(or https://github.com/quarkusio/quarkus-quickstarts/tree/development with
c2f9d46, which is the commit before a change I've sent).

Start quarkus with:

$ taskset -c 0,1,2 java -XX:+AlwaysPreTouch -Xmx40M -Xms40M -XX:+UseSerialGC -XX:TieredStopAtLevel=4 -XX:-BackgroundCompilation -Dquarkus.vertx.event-loops-pool-size=1 -jar quarkus-app/quarkus-run.jar

Using https://github.com/bric3/java-pmap-inspector on the output of pmap -X with:

$ pmap -X <pid of quarkus> | java <...>/java-pmap-inspector/JavaPmapInspector.java -

For 2.16.11.Final, reports:

Shared file mappings:
                   count=2     rssMin=32         rssMax=124016    

Memory mappings:
           UNKNOWN count=33    reserved=64400      rss=56876     
       MAPPED_FILE count=22    reserved=254564     rss=30104     
         JAVA_HEAP count=1     reserved=1047616    rss=1024      
      MALLOC_ARENA count=17    reserved=1114112    rss=9968    <--------------
         CODE_HEAP count=3     reserved=245760     rss=7680      
SHARED_MAPPED_FILE count=2     reserved=124048     rss=1808      
       JAVA_THREAD count=13    reserved=13312      rss=772       
   NON_JAVA_THREAD count=2     reserved=2064       rss=32        

stats: vsz=2865876 kib 
 rss=108264 kib

For c2f9d46, reports:

Shared file mappings:
                   count=2     rssMin=32         rssMax=124016    

Memory mappings:
           UNKNOWN count=33    reserved=63376      rss=57728     
       MAPPED_FILE count=21    reserved=254500     rss=30272     
         JAVA_HEAP count=1     reserved=1047616    rss=1088      
      MALLOC_ARENA count=20    reserved=1310720    rss=22656    <-------------- WOT?     
         CODE_HEAP count=3     reserved=245760     rss=8000      
SHARED_MAPPED_FILE count=2     reserved=124048     rss=1824      
       JAVA_THREAD count=17    reserved=17408      rss=1100      
   NON_JAVA_THREAD count=2     reserved=2064       rss=32        

stats: vsz=3065492 kib 
 rss=122700 kib

That is a ~14 MB increase in total RSS.

Few notes:

  1. why taskset and -Dquarkus.vertx.event-loops-pool-size=1?
    To exclude the excessive parallelism mentioned in https://stackoverflow.com/questions/63699573/growing-resident-size-set-in-jvm as one of the causes of increased RSS, since glibc malloc creates additional per-thread arenas under concurrency (see the example command right after these notes).
    Additionally, it should limit the number of compiler threads (C1 should use just 1, IIRC).
  2. why -XX:+AlwaysPreTouch?
    To reduce the effect that the allocation rate of each Quarkus version would otherwise have on the heap's RSS.
  3. why -XX:+UseSerialGC?
    This GC has the lowest memory footprint overall; it's a way to reduce the noise in the resulting RSS.
  4. why -XX:-BackgroundCompilation?
    To improve the reproducibility of the results.
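
(As a side note, not used in the runs above: glibc arena proliferation can also be capped explicitly with the MALLOC_ARENA_MAX environment variable, which limits the number of malloc arenas regardless of the thread count, e.g.:

$ MALLOC_ARENA_MAX=2 taskset -c 0,1,2 java -XX:+AlwaysPreTouch -Xmx40M -Xms40M -XX:+UseSerialGC -XX:TieredStopAtLevel=4 -XX:-BackgroundCompilation -Dquarkus.vertx.event-loops-pool-size=1 -jar quarkus-app/quarkus-run.jar

This would help isolate whether the extra RSS comes from more arenas or from more memory retained within each arena.)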

IMPORTANT NOTES:

  • It seems that limiting compilation to just C1, by adding -XX:TieredStopAtLevel=1, makes the RSS difference much smaller.
  • Similarly, if we use just C2 and disable tiered compilation, i.e. -XX:-TieredCompilation -XX:TieredStopAtLevel=4, the RSS difference becomes negligible (see the two command lines right below).
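
For reference, the two variants mentioned above correspond to command lines like the following (sketches derived from the base command used earlier):

$ taskset -c 0,1,2 java -XX:+AlwaysPreTouch -Xmx40M -Xms40M -XX:+UseSerialGC -XX:TieredStopAtLevel=1 -XX:-BackgroundCompilation -Dquarkus.vertx.event-loops-pool-size=1 -jar quarkus-app/quarkus-run.jar
$ taskset -c 0,1,2 java -XX:+AlwaysPreTouch -Xmx40M -Xms40M -XX:+UseSerialGC -XX:-TieredCompilation -XX:TieredStopAtLevel=4 -XX:-BackgroundCompilation -Dquarkus.vertx.event-loops-pool-size=1 -jar quarkus-app/quarkus-run.jar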
@franz1981 franz1981 added the kind/enhancement New feature or request label Sep 28, 2023
@franz1981
Contributor Author

franz1981 commented Sep 28, 2023

What's funny is that if I switch to Java 21, e.g.:

java version "21" 2023-09-19 LTS
Java(TM) SE Runtime Environment (build 21+35-LTS-2513)
Java HotSpot(TM) 64-Bit Server VM (build 21+35-LTS-2513, mixed mode, sharing)

the difference is much smaller, i.e. ~7 MB.

@franz1981
Contributor Author

franz1981 commented Sep 28, 2023

mprotect.zip

These are the two -e mprotect flamegraphs showing the per-thread stack traces where malloc had to allocate new arenas, causing the RSS increases.

The two flamegraphs show that C2 compilation causes 10x more mprotect calls than in 2.16.11.Final (due to the allocations in https://github.com/openjdk/jdk/blob/jdk-17%2B35/src/hotspot/share/memory/arena.cpp failing to reuse existing space during liveness analysis), which could explain what made the RSS increase, but not why.

Neither the compiler (JFR) events/logs nor the code cache statistics report a (dramatically) increased number of compiled methods.

@quarkus-bot

quarkus-bot bot commented Sep 29, 2023

/cc @Sanne (core), @aloubyansky (core), @gsmet (core), @radcortez (core), @stuartwdouglas (core)

@franz1981
Contributor Author

franz1981 commented Sep 29, 2023

@ashu-mehra I've performed some malloc leak profiling for 8 seconds (to be sure the Arena cleaner kicks in), following https://github.com/async-profiler/async-profiler/tree/malloc#finding-native-memory-leaks, using this command:

$ LD_PRELOAD=~/async-profiler/build/libasyncProfiler.so taskset -c 0,1,2 java -XX:+AlwaysPreTouch -Xmx40M -Xms40M -XX:+UseSerialGC -XX:TieredStopAtLevel=4 -XX:+TieredCompilation -XX:-BackgroundCompilation -Dquarkus.vertx.event-loops-pool-size=1 -agentpath:~/async-profiler/build/libasyncProfiler.so=start,event=nativemem,cstack=dwarf,file=malloc_leaks.jfr,timeout=8s -jar quarkus-app/quarkus-run.jar

$ java -cp ~/async-profiler/build/converter.jar MallocReport malloc_leaks.jfr malloc_leaks.html

and found that the C2 activity of 2.16.11.Final keeps growing its Arenas via malloc by ~15 MB without releasing the memory back to the OS (which should also happen at ChunkPool::clean), while c2f9d46 grows them by just ~1 MB.

I've modified the async-profiler malloc branch to collect malloc data together with sizes (it could be done via an eBPF/perf probe, but that would miss the Java stacks!) in order to better understand the malloc/free ratio, i.e.:

  • 2.16.11.Final: requested 114 MB while keeping 15.6 MB "alive" (never freed); 98.4 MB was returned back to the allocator via free
  • 2f9d461e983f187db344f93917df46dd403b5bb: requested 143 MB while keeping 1.1 MB "alive" (never freed); 141.9 MB was returned back to the allocator via free

The modified malloc converter is at https://gist.github.com/franz1981/496d6641272370bc6f22a17ad5db2170

@franz1981
Contributor Author

I'm attaching the leak analysis and the malloc-with-sizes data for the two branches at
malloc_leaks.zip

@tstuefe

tstuefe commented Oct 4, 2023

Hi @franz1981 ,

Some notes / questions:

The chunk pools get cleaned out every 5 seconds. An 8 second test may be too short to make sure that all memory is returned to the OS.

If you don't let the process exit (let it hang at the end), and you still see high RSS:

What does jcmd VM.native_memory say? (Need to enable NMT) Especially, are there still significantly sized hotspot arenas alive? If yes, that would indicate the VM does not release memory.

If the NMT report does not show anything interesting, but you see large glibc arenas: what happens when you run jcmd System.trim_native_heap? If that recovers some significant memory, the problem is glibc memory retention, and may be caused by different allocation behaviour of the VM. E.g. doing more things in parallel would cause higher temporary spikes in malloced memory, which glibc then retains.
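
For reference, the checks above could be run roughly like this (a sketch; <pid> stands for the Quarkus process id, and NMT must be enabled at startup via the extra flag):

$ java -XX:NativeMemoryTracking=summary <other flags as above> -jar quarkus-app/quarkus-run.jar
$ jcmd <pid> VM.native_memory summary
$ jcmd <pid> System.trim_native_heap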

@tstuefe
Copy link

tstuefe commented Oct 4, 2023

Side note, the October release for JDK 17 (JDK 21 too) will have downports of Glibc automatic heap trimming, which may help matters. Still experimental though.

@franz1981
Contributor Author

franz1981 commented Oct 5, 2023

After some investigation by @rwestrel and @tstuefe: the jdk.internal.org.objectweb.asm.Frame::execute IR nodes use much more memory, which translates into a busier compiler using more memory (and excludes the presence of leaks, yay!).
JDK 21, instead, stops at tier 3 for some reason (yet to be investigated), which explains why it does better and uses less memory.

Thanks to these insights, I've performed another profiling investigation, running -e jdk.internal.org.objectweb.asm.Frame.execute to track how many times the mentioned method is called, and I have observed:

  • java/lang/reflect/ProxyGenerator.generateClassFile is present in ~60% of the stack traces (eg checking for annotation presence)
  • java/lang/invoke/LambdaForm.compileToBytecode is present in ~40% of the stack traces (eg lambda definitions)

This is similar for both Quarkus versions, although the total number of calls is rather different:

  • 4100 calls for 2.16.11.Final
  • 5400 calls for 2f9d461e983f187db344f93917df46dd403b5bb

Another data point: if we search for io/smallrye/config within the two flamegraphs we see

  • 4100 * 58.46% = 2397 samples for 2.16.11.Final
  • 5400 * 73.85% = 3988 samples for 2f9d461e983f187db344f93917df46dd403b5bb

which means that config is the most relevant code causing the increased number of jdk.internal.org.objectweb.asm.Frame::execute calls (which still doesn't mean its IR nodes are heavier: we need confirmation from @rwestrel on this!).

The other one is org/jboss/logmanager, although in the global picture it doesn't weigh as much as config.

Attaching the two traces to help @radcortez's investigation (and maybe @dmlloyd can give some insights too):
frame_execute.zip

@geoand
Contributor

geoand commented Oct 9, 2023

After some investigation by @rwestrel and @tstuefe: the jdk.internal.org.objectweb.asm.Frame::execute IR nodes use much more memory, which translates into a busier compiler using more memory (and excludes the presence of leaks, yay!).
JDK 21, instead, stops at tier 3 for some reason (yet to be investigated), which explains why it does better and uses less memory.

Very interesting insight. May we know what steps we non-OpenJDK devs can follow to arrive at such conclusions?

@franz1981
Contributor Author

franz1981 commented Oct 9, 2023

@geoand
@tstuefe is working on openjdk/jdk#16076, which would enable correlating consumed memory with compiled methods, based on which method caused a change in the peak (JDK Arena allocator) memory usage - in short, it makes it possible to find the top method-compilation memory consumers and compare them between runs.
The best thing would be to have JFR events that complement the existing method-compilation ones, but I think we can propose such a change later on; it's already very valuable to first have the proper data collected somehow (it's additional bookkeeping which I expect comes with some cost too, hence meant mainly for troubleshooting, not normal usage).

Thanks @rwestrel and @tstuefe for the hard work!

@franz1981
Contributor Author

franz1981 commented Oct 9, 2023

To summarize this:

We could alleviate it by reducing the two types of operations which cause that method to be invoked (as reported in #36204 (comment)), i.e.:

  • java/lang/reflect/ProxyGenerator.generateClassFile present in ~60% of the stack traces (eg checking for annotation presence)
  • java/lang/invoke/LambdaForm.compileToBytecode present in ~40% of the stack traces (eg lambda definitions)

or just ignore it: JDK 21 doesn't need to use that method as much, which lowers the number of invocations and reduces the C2 work, improving both CPU usage and RSS.

@dmlloyd
Member

dmlloyd commented Oct 9, 2023

or just ignore it: JDK 21 doesn't need to use that method as much, which lowers the number of invocations and reduces the C2 work, improving both CPU usage and RSS.

Here's a caveat/explanation that we discussed in chat a bit: JDK 21 replaces many uses of this mechanism (embedded ASM) with a new one (java.lang.classfile API). For a few JDK releases (21, 22, 23 probably) we won't notice because usage will be balanced between the two. Eventually however, everything will be moved over, and we'll start seeing the same thing again, but with the new API! Furthermore it's reasonable to expect that the new API usages won't be quite as optimized at first, because it is new. So as always, interesting times are ahead.

@franz1981
Contributor Author

Thinking about it, there is something we could do here: @geoand, we know when the system is considered ready to serve requests, so we can experiment, per @tstuefe's suggestion, with issuing a jcmd System.trim_native_heap at that point to recover the increased RSS. Similarly, it would be great to issue the same command again after being idle for some time, but that is more something that whoever manages the container can perform, unless there is a way for the JVM to do it itself?

@geoand
Contributor

geoand commented Nov 18, 2023

The first suggestion we can very easily apply.

As for the second one, it would require a lot more thought

@franz1981
Contributor Author

franz1981 commented Nov 18, 2023

I will create a branch to test against in the next week, with your guidance :) I think this can be an interesting improvement; the sole unhappy thing is https://github.com/openjdk/jdk/blob/jdk-21%2B35/src/hotspot/share/memory/arena.cpp#L119, which hardcodes to 5s the releasing (returning to the allocator, not the OS!) of the memory accumulated during JIT compilation: it would be great to make the jcmd able to first forcibly trigger that AND then trim the allocator memory, which would collect back way more RSS, wdyt @tstuefe?
Now the important question is: what happens after the first N requests hit and the JIT compiler has to run again? It's likely the previously uncommitted memory will be required again, so we need to think about whether this is a benchmarketing feature or whether some specific use case can benefit from it.

@tstuefe

tstuefe commented Nov 20, 2023

Similarly, it would be great to issue the same command again after being idle for some time, but that is more something that whoever manages the container can perform, unless there is a way for the JVM to do it itself?

If you use Java 17.0.9 or Java 21.0.1, this is not needed. Just use autotrim, e.g. every five seconds:

-XX:+UnlockExperimentalVMOptions -XX:TrimNativeHeapInterval=5000

With -Xlog:trimnative you see the trims in the log.
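
Applied to the repro command from this issue, that would be roughly (a sketch, just adding the trim flags to the existing command line):

$ taskset -c 0,1,2 java -XX:+UnlockExperimentalVMOptions -XX:TrimNativeHeapInterval=5000 -Xlog:trimnative -XX:+AlwaysPreTouch -Xmx40M -Xms40M -XX:+UseSerialGC -XX:TieredStopAtLevel=4 -XX:-BackgroundCompilation -Dquarkus.vertx.event-loops-pool-size=1 -jar quarkus-app/quarkus-run.jar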

it would be great to make the jcmd able to first forcibly trigger that AND then trim the allocator memory, which would collect back way more RSS, wdyt @tstuefe?

Hmm. For the manual jcmd command this could make sense, for the autotrim not so much. The five-second delay is a necessary low-pass filter to be able to decouple compiler malloc needs from regular autotrim.

@franz1981
Contributor Author

franz1981 commented Nov 20, 2023

Just use autotrim, e.g. every five seconds:

My concern is that whatever memory (Java off-heap, JIT Arenas, GC aux structures...) is returned back via free may be required again at some point in the future, e.g.:

  • Quarkus starts up and sadly performs some reflective access as outlined at #36204 (comment), which can cause C2 to peak its JIT Arena usage and increase RSS
  • after https://github.com/openjdk/jdk/blob/jdk-21%2B35/src/hotspot/share/memory/arena.cpp#L119 the unused part of the Arenas is free'd
  • after some time autotrim kicks in, releasing that memory back to the OS
  • a bunch of HTTP requests hit the application and JIT C1 + C2 kick in: the cost of enlarging the Arenas now (i.e. via malloc) won't be the same, because the released memory has been returned to the OS for real (https://elixir.bootlin.com/glibc/glibc-2.38/source/malloc/malloc.c#L5296 MADV_DONTNEED is not immediate, but let's say it has already happened, for this example)

The key part is "won't be the same", because the released memory has been returned to the OS for real: I would like to make sure that enabling this always-on feature won't affect subsequent memory allocation operations too much, although I believe that, at least for class load/unload and JIT-related ones, it should converge quickly, while other off-heap release operations are not yet implemented (i.e. netty/netty#11845 (comment)), meaning that we shouldn't "fear" any adverse effect in the hot path.

@geoand for a PoC run we can just enable the autotrim + https://bugs.openjdk.org/browse/JDK-8204089
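
A possible PoC command line (a sketch; JDK-8204089's periodic collection and uncommit is a G1 feature, so this assumes switching from SerialGC to G1, and the interval values are just placeholders):

$ java -XX:+UseG1GC -XX:G1PeriodicGCInterval=15000 -XX:+UnlockExperimentalVMOptions -XX:TrimNativeHeapInterval=5000 -jar quarkus-app/quarkus-run.jar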

The five-second delay is a necessary low-pass filter to be able to decouple compiler malloc needs from regular autotrim.

Yeah, I can see that, although why not piggyback, while available (i.e. if autotrimming is configured), on the state of https://github.com/openjdk/jdk/blob/de51aa19d6a8cbd3b83bf469cb89da16f4b6f498/src/hotspot/share/runtime/trimNativeHeap.cpp#L42C11-L42C11?

I see that https://bugs.openjdk.org/browse/JDK-8204089 uses a different mechanism to release heap back to the OS, via https://github.com/openjdk/jdk/blob/de51aa19d6a8cbd3b83bf469cb89da16f4b6f498/src/hotspot/os/linux/os_linux.cpp#L3329, which explains why it uses its own mechanism to do it, but I really like that it is tied to "user" activity (at https://github.com/openjdk/jdk/blob/de51aa19d6a8cbd3b83bf469cb89da16f4b6f498/src/hotspot/share/gc/g1/g1PeriodicGCTask.cpp#L56 - the periodic task is the one that can enqueue collecting the heap, it seems) - but maybe a "synergy" between the two mechanisms is not possible, because a system-wide malloc trim serves different off-heap allocators which don't always have a 1:1 dependency on user activity the way "regular" heap activity does.

@tstuefe

tstuefe commented Nov 20, 2023

Just use autotrim, eg every five seconds:

My concern is that whatever memory (Java off-heap, JIT Arenas, GC aux structures...) is returned back via free may be required again at some point in the future, e.g.:

* Quarkus starts up and sadly performs some reflective access as outlined at [Malloc Arena RSS increase between 2.16.11.Final and 3.x/main #36204 (comment)](https://github.com/quarkusio/quarkus/issues/36204#issuecomment-1752989061), which can cause C2 to peak its JIT Arena usage and increase RSS

* after https://github.com/openjdk/jdk/blob/jdk-21%2B35/src/hotspot/share/memory/arena.cpp#L119 the unused part of the Arenas is `free`d

* after some time `autotrim` kicks in, releasing that memory back to the OS

* a bunch of HTTP requests hit the application and JIT C1 + C2 kick in: the cost of enlarging the `Arenas` now (i.e. via `malloc`) won't be the same, because the released memory has been returned to the OS for real (https://elixir.bootlin.com/glibc/glibc-2.38/source/malloc/malloc.c#L5296 MADV_DONTNEED is not immediate, but let's say it has already happened, for this example)

The key part is "won't be the same", because the released memory has been returned to the OS for real: I would like to make sure that enabling this always-on feature won't affect subsequent memory allocation operations too much, although I believe that, at least for class load/unload and JIT-related ones, it should converge quickly, while other off-heap release operations are not yet implemented (i.e. netty/netty#11845 (comment)), meaning that we shouldn't "fear" any adverse effect in the hot path.

There is no way to know without measuring performance.

@geoand for a PoC run we can just enable the autotrim + https://bugs.openjdk.org/browse/JDK-8204089

The five seconds delay are a necessary low-pass filter to be able to decouple compiler malloc needs from regular autotrim.

yeah, I can see that, although why not piggyback, while available (ie if autotrimming is configured), with the state of https://github.com/openjdk/jdk/blob/de51aa19d6a8cbd3b83bf469cb89da16f4b6f498/src/hotspot/share/runtime/trimNativeHeap.cpp#L42C11-L42C11?

Difficult to answer without knowing what piggy-back means in this case.

I see that https://bugs.openjdk.org/browse/JDK-8204089 uses a different mechanism to release heap back to the OS, via https://github.com/openjdk/jdk/blob/de51aa19d6a8cbd3b83bf469cb89da16f4b6f498/src/hotspot/os/linux/os_linux.cpp#L3329, which explains why it uses its own mechanism to do it, but I really like that it is tied to "user" activity (at https://github.com/openjdk/jdk/blob/de51aa19d6a8cbd3b83bf469cb89da16f4b6f498/src/hotspot/share/gc/g1/g1PeriodicGCTask.cpp#L56 - the periodic task is the one that can enqueue collecting the heap, it seems) - but maybe a synergy between the two mechanisms is not possible, because a system-wide malloc trim serves different off-heap allocators which don't always have a 1:1 dependency on user activity the way "regular" heap activity does.

I found that predicting future malloc behavior is rather difficult. As the big known C-heap-using components, we have the JIT and, to a degree, the GC, but also ObjectMonitors and myriads of other things. A large part of these is not even known - allocations from JDK and third-party JNI code and from system libraries. Their activity may roughly synchronize with G1 activity, but possibly not. E.g. ObjectMonitor allocation is tied to contention; JIT activity, obviously, to compilation.

I initially thought about hooking into (hotspot-side) malloc and implementing a timeout - no mallocs for five seconds, let's trim. That didn't work so well for multiple reasons. You usually have a low-level malloc noise, which you would have to filter out - so it would be easy to arrive at a situation where you never start trimming. And I am not sure that past behavior is a good predictor for future behavior. And this approach only works for those mallocs hotspot sees, whereas a large part of mallocs are external.

I also tried to tie trimming to the GC cycle. That worked somewhat, but not so well with GC-independent malloc users, especially from outside.

Any of the solutions sketched up above increase the complexity a lot while still leaving plenty of room for corner cases where trimming would either not work as expected or be detrimental. The trim solution we ended up with is simple and pragmatic. Trim at regular intervals, but keep out of safepoints.

(it was also a question of time; I did not have that much time to burn on this feature).

@franz1981
Contributor Author

thanks @tstuefe for the detailed answer!

@franz1981
Contributor Author

Closing this, having already found the cause of the regression.
