Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory usage increased between Quarkus 3.6 and 3.7 #38814

Closed
holly-cummins opened this issue Feb 16, 2024 · 55 comments · Fixed by #38818
Closed

Memory usage increased between Quarkus 3.6 and 3.7 #38814

holly-cummins opened this issue Feb 16, 2024 · 55 comments · Fixed by #38818
Assignees
Labels
kind/bug Something isn't working
Milestone

Comments

@holly-cummins
Copy link
Contributor

Describe the bug

I have an extension test in the quarkus-pact project that started OOMing in the ecosystem CI recently. I've worked around the issue by increasing the memory available to the test, but we may want to investigate why the memory requirements have increased.

I can reproduce just by starting the Quarkus application with constrained memory (it fails to start). The application has two pact extensions in it, so it does more classloading than a simpler application. With Quarkus 3.6, the app starts, and with 3.7, it fails to start. I can force a failure on Quarkus 3.6 by dropping the available memory down to 110mb (from 128mb), so even with Quarkus 3.6, the memory usage was fairly high - 3.7 just happens to be the release that tipped it over the edge into failure.

The ecosystem CI caught the change: quarkiverse/quarkiverse#94 (comment). This is the CI run where the problem first appeared, which might help us identify a specific change: https://github.com/quarkiverse/quarkus-pact/actions/runs/7082294368

My initial guess is that this is related to classloading, but I don't have firm evidence. It could also be test execution. If I give the applications more memory, so that it doesn't OOM, and then trigger a dump once startup is finished, the 3.6 and 3.7 dumps both use about 81mb of memory, and look similar.

Expected behavior

No response

Actual behavior

No response

How to Reproduce?

I've attached a reproducer app. Unzip it, and then run

mvn quarkus:dev -Djvm.args="-Xmx128m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=pact-dump.hprof"

For more control, another way to reproduce is

  1. Clone https://github.com/quarkiverse/quarkus-pact
  2. Build with mvn install. It should pass.
  3. Edit https://github.com/quarkiverse/quarkus-pact/blob/e00179e4f5e0ab5c3b434393edd4e80ab3551766/cross-extension-integration-tests/src/test/java/io/quarkiverse/pact/it/DevModeContractTestIT.java#L76 and change 140m to 128m, the default. (I had to override the default memory allocation from the parent test class to get the test to pass.)
  4. cd cross-extension-integration-tests && mvn clean install should now show the failure
  5. Once everything is built, you can also
    happy-everyone-all-together-processed.zip
    run the test project standalone.
cd quarkus-pact/cross-extension-integration-tests/target/test-classes/projects/happy-everyone-all-together-processed
mvn quarkus:dev -Djvm.args="-Xmx110m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=pact-dump.hprof"

Output of uname -a or ver

No response

Output of java -version

No response

Quarkus version or git rev

No response

Build tool (ie. output of mvnw --version or gradlew --version)

No response

Additional information

No response

@gsmet
Copy link
Member

gsmet commented Feb 16, 2024

If you can publish a 3.6 heap dump and a 3.7 one, it probably would help to all work on the same baseline.

@holly-cummins
Copy link
Contributor Author

holly-cummins commented Feb 16, 2024

If you can publish a 3.6 heap dump and a 3.7 one, it probably would help to all work on the same baseline.

I've got the dumps, and will attach them, but I don't think I have the 'right' dump. I had two options for collection the dumps:

  • Option 1. Manually trigger a dump. To make sure I was comparing apples to apples, I dumped when the app had completed its startup. At this point, all the classes associating with classloading are no longer live, and the steady state of 3.7 looks pretty similar to the steady state of 3.6. The startup cycle is so quick, I don't think I could reliably trigger a dump partway through startup in a way that was consistent enough to allow 3.6 and 3.7 to be compared. If I'm a bit slower for one, the heap looks different, but it's just that I was 0.2 seconds later with my typing and startup had progressed more ...
  • Option 2. Dump on OOM. To do this, I have to constrain the heap in such a way that both 3.6 and 3.7 both dump. When I do this, the heap usage for 3.6 and 3.7 is the same, and the profile of what's loaded is also pretty similar. If 3.6 needs 128m, and 3.7 needs 140m, but I only give them 110m, that means my dump 'misses' the extra 30m that causes a problem in 3.7.

Dumps are here: https://drive.google.com/file/d/1gaGHo-qHu2GImdskJMiF-dNCocabGAfG/view?usp=sharing
I've done 3.6 and 3.7, for manually triggered post-startup dumps in an unconstrained heap, and dumps triggered on OOM in a constrained heap. The constrained heap dumps are bigger.

@holly-cummins
Copy link
Contributor Author

Some of what I'm seeing in the dumps also looks similar to what Guillaume investigated in #35280, so I think this issue may be in a similar domain to that one.

@holly-cummins
Copy link
Contributor Author

See also #41156. Memory requirements went up again in Quarkus 3.12.

@gsmet
Copy link
Member

gsmet commented Jul 10, 2024

I had a look at the issue and here are my findings for main.

The first thing is that the OOM is triggered by the build process, it has nothing to do with CL leaks.

Second thing is that there are two main culprits:

  • All the ZIP filesystem we keep open in the QuarkusClassLoader's SharedOpenArchivePathTrees- the cen field is huge - we know this issue very well, there's nothing we can do about it if we want to keep these FS open.
  • The second culprit is the Maven model as built by Aether. Especially we have huge lists (2300+ elements) containing the managedDependencies for each io.quarkus dependency. The Aether DataPool is 34 MB large for a very very small project.

Screenshot from 2024-07-10 10-32-55

I wonder if we could improve that. I don't fully understand why we would need to keep a list of the full managed dependencies tree once the artifact has been resolved - but I suppose there are very few projects with that many managed dependencies.
All of this is in the maven-resolver so I think we would need to adjust things upstream.

We would probably need to discuss this with @aloubyansky .

That's for the first part of the work. Now I will have a look at a heap dump from an old version to see what could be the actual regression.

@gsmet
Copy link
Member

gsmet commented Jul 10, 2024

So from one I can see, it's just Quarkus getting generally fatter:

  • The Execution object has significantly more multis (more generated classes - and the large ones generated (Config, ArC) are a bit larger, more synthetic beans)
  • The Aether DataPool object is larger (34 vs 28) - but as I get an OOM in the middle of the build, it's not very scientific. What I can say is that we have 200 more managed dependencies in some of the io.quarkus GoodDescriptor instances (for French readers: so you have good descriptors and bad descriptors, the good descriptor sees something, it describes it, the bad descriptor sees something, it describes it but it's not the same) and that really doesn't help.

The main contributors to the main BOM (I'm assuming most of the managed dependencies are coming from there - I limited myself to the ones with 10+ entries):

    677         <groupId>io.quarkus</groupId>
    245         <groupId>io.opentelemetry.javaagent.instrumentation</groupId>
    182         <groupId>io.vertx</groupId>
    131         <groupId>io.fabric8</groupId>
    103         <groupId>io.smallrye.reactive</groupId>
     73         <groupId>io.opentelemetry.instrumentation</groupId>
     56         <groupId>org.testcontainers</groupId>
     56         <groupId>io.netty</groupId>
     43         <groupId>io.smallrye</groupId>
     42         <groupId>io.grpc</groupId>
     35         <groupId>org.wildfly.security</groupId>
     35         <groupId>io.opentelemetry</groupId>
     33         <groupId>org.jboss.resteasy</groupId>
     30         <groupId>org.mvnpm</groupId>
     30         <groupId>io.micrometer</groupId>
     23         <groupId>org.jetbrains.kotlin</groupId>
     19         <groupId>com.fasterxml.jackson.module</groupId>
     18         <groupId>org.jetbrains.kotlinx</groupId>
     18         <groupId>io.smallrye.stork</groupId>
     16         <groupId>org.infinispan</groupId>
     16         <groupId>org.hibernate.search</groupId>
     16         <groupId>com.fasterxml.jackson.datatype</groupId>
     13         <groupId>io.smallrye.common</groupId>
     12         <groupId>org.junit.platform</groupId>
     12         <groupId>org.apache.maven</groupId>
     12         <groupId>io.dekorate</groupId>
     11         <groupId>com.google.protobuf</groupId>
     10         <groupId>io.quarkus.resteasy.reactive</groupId>
     10         <groupId>io.opentelemetry.javaagent</groupId>
     10         <groupId>com.google.http-client</groupId>
     10         <groupId>com.fasterxml.jackson.dataformat</groupId>

For 3.6.1, that is the other version I tested:

    568         <groupId>io.quarkus</groupId>
    240         <groupId>io.opentelemetry.javaagent.instrumentation</groupId>
    182         <groupId>io.vertx</groupId>
    122         <groupId>io.fabric8</groupId>
     95         <groupId>io.smallrye.reactive</groupId>
     72         <groupId>io.opentelemetry.instrumentation</groupId>
     54         <groupId>io.netty</groupId>
     47         <groupId>org.testcontainers</groupId>
     41         <groupId>io.smallrye</groupId>
     40         <groupId>io.grpc</groupId>
     36         <groupId>io.opentelemetry</groupId>
     35         <groupId>org.wildfly.security</groupId>
     33         <groupId>org.jboss.resteasy</groupId>
     30         <groupId>io.micrometer</groupId>
     23         <groupId>org.jetbrains.kotlin</groupId>
     18         <groupId>org.jetbrains.kotlinx</groupId>
     17         <groupId>io.smallrye.stork</groupId>
     17         <groupId>com.fasterxml.jackson.module</groupId>
     16         <groupId>org.infinispan</groupId>
     16         <groupId>com.fasterxml.jackson.datatype</groupId>
     12         <groupId>org.junit.platform</groupId>
     12         <groupId>org.apache.maven</groupId>
     12         <groupId>io.smallrye.common</groupId>
     11         <groupId>io.dekorate</groupId>
     11         <groupId>com.google.protobuf</groupId>
     10         <groupId>io.quarkus.resteasy.reactive</groupId>
     10         <groupId>io.opentelemetry.javaagent</groupId>
     10         <groupId>com.google.http-client</groupId>
     10         <groupId>com.fasterxml.jackson.dataformat</groupId>

From a quick analysis:

  • we have 100 more Quarkus artifacts there (that's a lot, especially since we are trying to limit the number of new artifacts we introduce in core but see below for a possible explanation)
  • there is one new big contributor: org.mvnpm with 30 new entries
  • the rest is spread all over the place

I think one thing we are learning with this is that keeping relocations around for ages has a cost so we need to drop the relocations at some point.

@maxandersen
Copy link
Member

I think one thing we are learning with this is that keeping relocations around for ages has a cost so we need to drop the relocations at some point.

how big impact are they ? I would imagine fixing/improving how we deal with the full resolved set (which we shouldn't have to keep around forever) then relocations should be quite minimal/not a problem?

@gsmet
Copy link
Member

gsmet commented Jul 10, 2024

fixing/improving how we deal with the full resolved set (which we shouldn't have to keep around forever)

I don't think we keep the Aether model in memory forever but we need it at some point and we are using quite a huge amount of memory for it (at least compared to our footprint - it's 1/3 of the memory footprint when we build/start dev mode).

Now, I will discuss with Alexey if things can be improved in this area but I wouldn't be surprised if we were pushing things to the limits with our huge BOM and that it wasn't really designed/optimized for such a use case.
Maybe we could find ways to improve the Maven resolver, we will see.
I wouldn't hold my breath though.

My wild guess is that it's going to get even worse for a large project with a lot of Quarkus extensions but that's just a theory, from the heap dumps I checked.

Now I'm not advocating for us to drop relocations at a fast pace. I'm just saying that we didn't really think they have a cost and they have.

So I would rather have the following policy:

  • we make sure the deprecations/relocations get in a LTS version
  • after a few months and provided above condition is true, we can consider removing them

For instance, we could remove the Quarkus REST/Quarkus Messaging relocations introduced 4 months ago after the 3.15 release.

@maxandersen
Copy link
Member

I'm definitely curious on how/why we would have to retain all around forever - wasn't designed/intended normal build should need to retain everything. lets see what @aloubyansky sees.

@gsmet
Copy link
Member

gsmet commented Jul 10, 2024

Again, we don't keep all around forever.

But each GoodDescriptor that is about an artifacts we build in the Quarkus tree (at least, that seemed like the common denominator for the huge ones) contains the whole set of managed dependencies for our BOM - and it's in memory for the time the Aether model is kept in memory.

@maxandersen
Copy link
Member

Yes I get that - but even that shouldn't be necessary.

@holly-cummins
Copy link
Contributor Author

We should probably distinguish between a "high tide" memory usage and a "steady state" memory usage. The symptom this work item was raised for was that our high tide memory usage was going up. Unloading things like the Aether memory model would/do help with steady state memory usage, but not with the high tide usage.

Big high tide memory usage could prevent people being able to build their Quarkus apps in constrained environments, but steady state memory usage is probably the metric that gets more attention/slides/blogs, etc.

@dmlloyd
Copy link
Member

dmlloyd commented Jul 10, 2024

Not sure if this is relevant here, but concurrency (particularly concurrent startup) can affect the "high tide" in ways that can be hard to predict.

@gsmet
Copy link
Member

gsmet commented Jul 11, 2024

but concurrency (particularly concurrent startup) can affect the "high tide" in ways that can be hard to predict

From what I have seen in the heap dumps, the problem we have here has little to do with concurrency.

@gsmet gsmet changed the title Memory usage increased between Quarkus 3.6 and 3.7 (classloading?) Memory usage increased between Quarkus 3.6 and 3.7 Jul 11, 2024
@gsmet
Copy link
Member

gsmet commented Jul 11, 2024

So what makes the issue so problematic for us people with a huge dependency management is that each descriptor will have its own copy of the whole dependency management tree.

See:

https://github.com/apache/maven/blob/f2a0865c7a66d44e13c88c9ef9146cfdf7024ac0/maven-resolver-provider/src/main/java/org/apache/maven/repository/internal/ArtifactDescriptorReaderDelegate.java#L65-L70

And the convert() method:

https://github.com/apache/maven/blob/f2a0865c7a66d44e13c88c9ef9146cfdf7024ac0/maven-resolver-provider/src/main/java/org/apache/maven/repository/internal/ArtifactDescriptorReaderDelegate.java#L94-L128

Given our whole dependency management tree is 800 KB, it grows very fast.

@cstamas
Copy link
Contributor

cstamas commented Jul 11, 2024

Thanks, looked at dumps. The memory hog is in session (RepositoryCache). For start, I'd recommend to test aether.dependencyCollector.pool.descriptor=weak and see does it change memory consumption (and neglect possible slow down for during experiment).

If it does, we know we are at right track.

@gsmet
Copy link
Member

gsmet commented Jul 13, 2024

@cstamas Thanks for having a look. aether.dependencyCollector.pool.descriptor=weak seems to help (at least on that front, we have a few other culprits on our side).

I can share the dump with you. Either send me your Google account at gsmet at redhat dot com or a place to upload it for you.
Keep in mind that this dump is obtained from a OOM error when building so it's really about trying to limit the peak usage of memory.

My initial idea with the patch above was to limit the number of managed dependencies to copy to only the ones that were important for a given descriptor. Meaning that in our case, all the managed dependencies are copied to quite some descriptors that don't really need it (or at least not all of them).
But from what I could see, the descriptor dependencies don't contain the transitive ones so that won't work well.

@cstamas
Copy link
Contributor

cstamas commented Jul 13, 2024

Yes, and as i mentioned, doing this on POM (removing "unused depmgt") is not feasible as POM lacks context where it will be used, hence no idea which one is "unused". Maybe we need to deconstruct poms and cache depmgt separately, as in this case all the Q1, Q2,.... would literally share the same one instance.

@gsmet
Copy link
Member

gsmet commented Jul 13, 2024

While it's not a complete fix, I created a PR to improve the situation a bit: apache/maven-resolver#534

@cstamas
Copy link
Contributor

cstamas commented Jul 13, 2024

So, two interesting PRs (for Resolver 1.9.x):

Would be good, if someone could locally merge both, and test such built Resolver 1.9.x with Quarkus.

Re timeline: I am out next 2 weeks, and Resolver 1.9.21 was just released last week...But when am back, I could do Resolver 1.9.22 w/ memory improvements, and Maven 3.9.9 is already scheduled next (so early Aug or alike).

@cstamas
Copy link
Contributor

cstamas commented Jul 13, 2024

Good news: the Maven IT suite is OK with change above: apache/maven#1617

Bad news: interning all the List<Dependency> comes at a price: Quarkus build (mvn clean install -Dquickly) on my laptop went from 9:20 (vanilla Maven 3.9.8) to 11:10 (w/ patch above). We could maybe introduce some threshold, and intern dependency lists only if they are above some size limit? Or any other ideas?

I did not measure the most interesting question: heap usage change, but I'd bet in a six-pack of beer it is HUGE 😃

@gsmet
Copy link
Member

gsmet commented Jul 13, 2024

I will have a look next week. Thanks!

@gsmet gsmet removed this from the 3.7.4 milestone Jul 13, 2024
@gsmet
Copy link
Member

gsmet commented Jul 13, 2024

I can see how it could be a good thing memory wise. I’m a bit worried about the additional work though.

Especially since in most of the cases, we are doing the exact same thing over and over (typically the convert operation in ArtifactDescriptorReaderDelegate).

Multiplied by 2300 multiplied by the number of artifacts affected, that’s a lot of useless work.

@cstamas
Copy link
Contributor

cstamas commented Jul 13, 2024

Agreed. But ArtifactDescriptorReader is not used only during collecting, but in many other places, plus, is exposed via RepositorySystem public API as well. Also, ArtifactDecriptorResult is public API as well, so we could change that only in Resolver 2.1.0 or so...

Biggest problem is that ArtifactDescriptorResult is NOT immutable. Hence, we cannot assume that something does not mutates it (like adds new dependency). Hence, current "contract" is that for each new request it hands over new result (that is instance on it's own and freely mutable). Use of reader in collector is somewhat "specific", and collector does handle artifact descriptors as immutable (does not mutate them), so I still see some room for improvement...

@gsmet
Copy link
Member

gsmet commented Jul 16, 2024

@cstamas I created a small PR (for your 1.9 PR, the same patch should be applied to the master branch): cstamas/maven-resolver#6 .

Locally, when I ran the Maven ITS, I got:

I think the way I reduce allocations by not copying the map might help getting a speed up.

It's not scientific but I think it would be worth pushing all this combined to the Apache Maven ITs and see if you get similar results.

@gsmet
Copy link
Member

gsmet commented Jul 16, 2024

@cstamas and I can confirm the memory usage is greatly reduced. So if you end up confirming that all patches applied actually make things a bit faster, that would be a win on both sides.

@cstamas
Copy link
Contributor

cstamas commented Aug 2, 2024

Back from vacation... but DK traveling still ahead.
Am picking up this one and will hopefully be present in next Resolver (1/2) and Maven (3/4) releases that are about to happen "soon-ish".

@gsmet
Copy link
Member

gsmet commented Aug 2, 2024

@cstamas maybe you could merge the two PRs for MRESOLVER-586 and rebase yours? This patch is an easy win. That way you could easily test the addition of all of them?

apache/maven-resolver#534
apache/maven-resolver#535

@gsmet
Copy link
Member

gsmet commented Aug 2, 2024

@cstamas Glad to have you back! I'm around if you want to discuss this any further or need some additional help.

@cstamas
Copy link
Contributor

cstamas commented Aug 2, 2024

Merged, rebased, Maven ITs running apache/maven#1617

@cstamas
Copy link
Contributor

cstamas commented Aug 2, 2024

Ok, so almost all merged, and MRESOLVER-587 looks good as well:

The only question here are defaults of two new config keys: for safety, am using false/false (so basically interning does NOT kick in, behaves as today) but is user configurable. As we saw, true/true tremendously lowers memory consumption at price of increased runtime. Still, as Guillaume measured (and those changes are picked up as well), there is some hope: #38814 (comment)

Ultimate fix would be in ArtifactDescriptorReader and making descriptor immutable, but sadly that is not doable in scope of 1.x/2.0.x of Resolver, is probably to happen in 2.1.x Resolver....

WDYT? Any opinion?

@gsmet
Copy link
Member

gsmet commented Aug 2, 2024

@cstamas your proposal of interning only managed dependencies by default makes sense to me.

@cstamas
Copy link
Contributor

cstamas commented Aug 12, 2024

The "wannabe" Maven 3.9.9 has all patches merged, CI built one (w/o versioned directories) can be downloaded: https://repository.apache.org/content/repositories/snapshots/org/apache/maven/apache-maven/3.9.9-SNAPSHOT/apache-maven-3.9.9-20240812.124731-17-bin.zip

Please test!

@gsmet
Copy link
Member

gsmet commented Aug 12, 2024

Awesome @cstamas I will have a look later today with the original project causing issues.

@gsmet
Copy link
Member

gsmet commented Aug 12, 2024

@cstamas so we still have an OOM with low memory but a lot later and neither Maven or the Maven Resolver are in the way anymore. I'll go address the other issues I could spot.

@gsmet
Copy link
Member

gsmet commented Aug 12, 2024

Thanks a lot for his amazing cooperation :).

@cstamas
Copy link
Contributor

cstamas commented Aug 17, 2024

FYI Maven 3.9.9 is out

@gsmet
Copy link
Member

gsmet commented Aug 18, 2024

@manofthepeace would you be interested in preparing the PR with the upgrade? We will probably wait for the corresponding mvnd release to merge but it looks like a good idea to have it ready.

@manofthepeace
Copy link
Contributor

Sure, will do.

@gsmet
Copy link
Member

gsmet commented Aug 30, 2024

This had been quite a fight but I can confirm that with Maven 3.9.9 and main (I just merged #42525 this morning), I can consistently run the tests with 100 MB of heap.

I will close this one.

Thanks to everyone involved.

@gsmet gsmet closed this as completed Aug 30, 2024
@gsmet gsmet added this to the 3.16 - main milestone Aug 30, 2024
@holly-cummins
Copy link
Contributor Author

Hurray! I'll set a reminder to myself to reduce how much memory the tests are given in a couple of months, once this is in a release. That should allow us to spot creeping regressions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Something isn't working
Projects
None yet
6 participants