Malloc Arena RSS increase between 2.16.11.Final and 3.x/main #36204
What's funny is that if I switch to Java 21, the difference is much smaller, i.e. ~7 MB.
These are the 2 flame graphs. They show that C2 compilation is causing a ~10X increase, yet both compiler (JFR) events/logs and code cache statistics don't report a (dramatically) increased number of compiled methods.
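(For readers who want to reproduce this cross-check: a sketch of how compilation counts can be pulled from a JFR recording and from the code cache, assuming only the standard `jfr` and `jcmd` tools; the recording filename and `<pid>` are placeholders, not values from this issue.)

```bash
# record for 60s, then count JIT compilation events offline
java -XX:StartFlightRecording=duration=60s,filename=rec.jfr -jar app.jar
jfr print --events jdk.Compilation rec.jfr | grep -c 'jdk.Compilation'

# live code cache statistics for a running JVM
jcmd <pid> Compiler.codecache
```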
/cc @Sanne (core), @aloubyansky (core), @gsmet (core), @radcortez (core), @stuartwdouglas (core)
@ashu-mehra I've performed some malloc leak profiling for 8 seconds (to be sure the Arena's cleaner kicks in) using https://github.com/async-profiler/async-profiler/tree/malloc#finding-native-memory-leaks with this command:
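(The exact command wasn't preserved above; as a sketch, the malloc branch linked there is typically driven like this. The `-e malloc` event name is an assumption taken from that branch's README, and the duration, output path, and `<pid>` are placeholders.)

```bash
# profile malloc calls for 8 seconds and dump unfreed allocations as a flame graph
./profiler.sh -d 8 -e malloc -f /tmp/leaks.html <pid>
```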
and found that I've modified the
While the modified
I'm adding the leak analysis and the malloc-with-size data for the 2 branches at
Hi @franz1981, some notes / questions:
- The chunk pools get cleaned out every 5 seconds. An 8-second test may be too short to make sure that all memory is returned to the OS.
- If you don't finish the process (let it hang at the end) and you still see high RSS: what does
- If the NMT report does not show anything interesting, but you see large glibc arenas: what happens when you run
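(For readers following along, a sketch of the commands implied by these truncated questions, assuming a recent JDK where both NMT and the native-heap-trim jcmd are available; `<pid>` and the jar path are placeholders.)

```bash
# start the JVM with Native Memory Tracking enabled (adds some overhead)
java -XX:NativeMemoryTracking=summary -jar target/quarkus-app/quarkus-run.jar

# inspect what hotspot itself has allocated
jcmd <pid> VM.native_memory summary

# ask glibc to return freed-but-retained memory to the OS, then re-check RSS
jcmd <pid> System.trim_native_heap
```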
Side note: the October release for JDK 17 (JDK 21 too) will have downports of glibc automatic heap trimming, which may help matters. Still experimental, though.
After some investigation from @rwestrel and @tstuefe, and thanks to these insights, I've performed another profiling investigation by running
This is similar for both Quarkus versions, although the total number of calls is rather different:
Another data point: if we search
which means that
The other one is
Attaching the 2 traces to help @radcortez's investigation (and maybe @dmlloyd can give some insights too):
Very interesting insight. May we know what the steps are for us non-OpenJDK devs to arrive at such conclusions?
@geoand
To summarize this:
We could alleviate it by reducing the 2 types of operations which make that method be invoked (as reported in #36204 (comment)), i.e.:
or just ignore it. JDK 21 doesn't need to use that method as much, lowering the number of invocations and reducing the C2 work, improving both CPU usage and RSS.
Here's a caveat/explanation that we discussed in chat a bit: JDK 21 replaces many uses of this mechanism (embedded ASM) with a new one (
Thinking about it, there is something we could do here: @geoand, we know when the system is considered ready to serve requests, so we can experiment, as per the suggestion by @tstuefe, with issuing a
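(As a rough illustration of that idea, not a committed design: one could wait for readiness and then trim once. The health endpoint below assumes SmallRye Health is present, and `$PID` is a placeholder.)

```bash
# wait until the app reports ready, then trim the native heap once
until curl -sf http://localhost:8080/q/health/ready > /dev/null; do sleep 0.2; done
jcmd "$PID" System.trim_native_heap
```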
The first suggestion we can very easily apply. As for the second one, it would require a lot more thought.
I will create a branch to test against in the next week, with your guidance :) I think this can be an interesting improvement; the sole unhappy thing is here at https://github.com/openjdk/jdk/blob/jdk-21%2B35/src/hotspot/share/memory/arena.cpp#L119, which hardcodes to 5 s the releasing (returning to the allocator, not the OS!) of the memory accumulated during JIT compilation: it could be great to make the jcmd able to first forcibly trigger that release AND then trim the allocator memory, which would collect back way more RSS. wdyt @tstuefe?
If you use Java 17.0.9 or Java 21.0.1, this is not needed. Just use autotrim, e.g. every five seconds:
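(The flag itself was elided above; assuming this refers to the experimental trim-interval option from those downports, the invocation would look like the following, with the interval given in milliseconds and the jar path a placeholder.)

```bash
# auto-trim the glibc heap every five seconds (experimental in 17.0.9 / 21.0.1)
java -XX:+UnlockExperimentalVMOptions -XX:TrimNativeHeapInterval=5000 \
     -jar target/quarkus-app/quarkus-run.jar
```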
With
Hmm. For the manual jcmd command this could make sense; for the autotrim, not so much. The five-second delay is a necessary low-pass filter to be able to decouple compiler malloc needs from regular autotrim.
My concern is that whatever memory (off-heap ones from Java, JIT arenas, GC aux structures...) is returned back via
The key part is that it won't be the same, because the released memory has been returned to the OS for real: I would like to make sure that enabling this always-on feature won't affect subsequent memory allocation operations too much, although I believe that, at least for class load/unload and JIT-related ones, it should converge quickly, while other off-heap release operations are not yet implemented (i.e. netty/netty#11845 (comment)), meaning that we shouldn't "fear" any adverse effect in the hot path. @geoand for a PoC run we can just enable the autotrim + https://bugs.openjdk.org/browse/JDK-8204089
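(A possible PoC invocation combining the two, assuming the JDK-8204089 flags from JEP 346 together with the experimental trim interval; all values here are illustrative, not from this issue.)

```bash
# periodic G1 GC returns unused heap to the OS; native-heap autotrim returns malloc'd memory
java -XX:+UseG1GC -XX:G1PeriodicGCInterval=5000 \
     -XX:+UnlockExperimentalVMOptions -XX:TrimNativeHeapInterval=5000 \
     -jar target/quarkus-app/quarkus-run.jar
```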
Yeah, I can see that, although why not piggyback, while available (i.e. if autotrimming is configured), on the state of https://github.com/openjdk/jdk/blob/de51aa19d6a8cbd3b83bf469cb89da16f4b6f498/src/hotspot/share/runtime/trimNativeHeap.cpp#L42C11-L42C11? I see that https://bugs.openjdk.org/browse/JDK-8204089 uses a different mechanism to release heap back to the OS, via https://github.com/openjdk/jdk/blob/de51aa19d6a8cbd3b83bf469cb89da16f4b6f498/src/hotspot/os/linux/os_linux.cpp#L3329, which explains why it uses its own mechanism to do it. But I really like that it is tied to "user" activity (at https://github.com/openjdk/jdk/blob/de51aa19d6a8cbd3b83bf469cb89da16f4b6f498/src/hotspot/share/gc/g1/g1PeriodicGCTask.cpp#L56 the periodic task is the one which can enqueue collecting the heap, it seems). Then again, maybe a "synergy" between the 2 mechanisms is not possible, because a system-wide malloc trim serves different off-heap allocators which don't always have a 1:1 dependency on user activity the way "regular" heap activities do.
There is no way to know without measuring performance.
Difficult to answer without knowing what piggy-back means in this case.
I found that predicting future malloc behavior is rather difficult. As the big known C-heap-using components, we have JIT and, to a degree, GC, but also ObjectMonitors and myriads of other things. A large part of these is not even known: allocations from JDK and third-party JNI code and system libraries. Their activity may roughly synchronize with G1 activity, but possibly not. E.g. OM allocation is tied to contention; JIT activity, obviously, to compilation.

I initially thought about hooking into (hotspot-side) malloc and implementing a timeout: no mallocs for five seconds, let's trim. That didn't work so well, for multiple reasons. You usually have low-level malloc noise, which you would have to filter out, so it would be easy to arrive at a situation where you never start trimming. And I am not sure that past behavior is a good predictor for future behavior. And this approach only works for those mallocs hotspot sees, whereas a large part of mallocs are external.

I also tried to tie trimming to the GC cycle. That worked somewhat, but not so well with GC-independent malloc users, especially from outside.

Any of the solutions sketched above increase the complexity a lot while still leaving plenty of room for corner cases where trimming would either not work as expected or be detrimental. The trim solution we ended up with is simple and pragmatic: trim at regular intervals, but keep out of safepoints. (It was also a question of time; I did not have that much time to burn on this feature.)
Thanks @tstuefe for the detailed answer!
Closing this, having already found the cause of the regression.
Description
Environment:
and OS:
CPU:
Instructions
Run https://github.com/quarkusio/quarkus-quickstarts/tree/2.16.11.Final/config-quickstart
versus https://github.com/quarkusio/quarkus-quickstarts/tree/3.4.1/config-quickstart
(or https://github.com/quarkusio/quarkus-quickstarts/tree/development with c2f9d46, which is the commit before a change I've sent).
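(A sketch of the setup, assuming the quickstarts build with plain Maven and produce the default fast-jar layout; the tag matches the first version above.)

```bash
# build the same quickstart at each of the versions under comparison
git clone --branch 2.16.11.Final https://github.com/quarkusio/quarkus-quickstarts
cd quarkus-quickstarts/config-quickstart
mvn -q package   # produces target/quarkus-app/quarkus-run.jar
```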
Start Quarkus with:
Using https://github.com/bric3/java-pmap-inspector vs pmap -X with:
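(Sketch of the raw capture step; the inspector's own invocation follows its README and is not reproduced here. `$PID` is a placeholder.)

```bash
# snapshot the full mapping table; the footer of -X output sums each column, including Rss
pmap -X "$PID" > pmap-"$PID".txt
tail -1 pmap-"$PID".txt
```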
For 2.16.11.Final, reports:

For c2f9d46, reports:
Which is a 14 MB increase.

A few notes:
Why taskset and -Dquarkus.vertx.event-loops-pool-size=1? To exclude the excessive parallelism mentioned in https://stackoverflow.com/questions/63699573/growing-resident-size-set-in-jvm as one of the causes of increased RSS out of glibc malloc.
Additionally, it should limit the number of compiler threads (C1 should just use 1, IIRC).
Why -XX:+AlwaysPreTouch? To reduce the effects of whatever allocation rate is happening in both Quarkus versions and affecting the heap's RSS.
Why -XX:+UseSerialGC? This GC has the lowest memory footprint overall; it's a way to reduce the noise in the resulting RSS.
Why -XX:-BackgroundCompilation? To improve the reproducibility of the results.
IMPORTANT NOTES:
- With -XX:TieredStopAtLevel=1 the RSS difference is much smaller.
- -XX:-TieredCompilation -XX:TieredStopAtLevel=4
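(Putting the notes above together, a pinned, low-noise run could look like this; the jar path is assumed, and MALLOC_ARENA_MAX=1 is an extra glibc-side way to suppress arena noise, not something prescribed in this issue.)

```bash
# pin to one core, one event loop, pre-touched heap, serial GC, foreground compilation
MALLOC_ARENA_MAX=1 taskset -c 0 java \
  -XX:+AlwaysPreTouch -XX:+UseSerialGC -XX:-BackgroundCompilation \
  -Dquarkus.vertx.event-loops-pool-size=1 \
  -jar target/quarkus-app/quarkus-run.jar
```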