Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

java.lang.OutOfMemoryError: Java heap space occured in api-gateway of the Spring Petclinic Microservice variant #3027

Closed
matthias-huber opened this issue May 18, 2021 · 9 comments · Fixed by #3059
Assignees
Labels
bug Something isn't working

Comments

@matthias-huber
Copy link

matthias-huber commented May 18, 2021

Describe the bug
I used the Spring Pet Clinic (Microservice variant (commit 8818cc9)) as basis. Added opentelemetry agent 1.2.0 to all microservices (api-gateway, vets-,visits-,customer-service). Traces are exporter to OTel collector. Everything runs with docker-compose. I also changed from internal db to mysql setup. However, I don't think that it is important as the api-gateway doesn't use the database.

I ran a small loadtest with around ~10 req/s and the api-gateway ran out of memory => 2021-05-18 12:58:53.080 WARN 19 --- [or-http-epoll-5] i.n.c.AbstractChannelHandlerContext : An exception 'java.lang.OutOfMemoryError: Java heap space' [enable DEBUG level for full stacktrace] was thrown by a user handler's exceptionCaught() method while handling the following exception:

VisualVM shows the CPU usage of the api-gateway. It shows the typical behaviour of a (memory leak) that less and less memory can be freed after each GC:

api-gateway-cpu

I took a heap dump of the api-gateway after the OutOfMemoryError occured.

Leak Suspect shows:
petclinic_leak_suspect

Histogram shows a lot of io.opentelemetry.javaagent.shaded.instrumentation.reactor.TracingSubscriber Instances

TracingSubscriber

Merge Shortest Path to GC root (with all references) shows the leak suspect "io.opentelemetry.javaagent.bootstrap.instrumentation.context.FieldBackedProvider$ContextStore$io$netty$util$concurrent$GenericFutureListener$io$netty$util$concurrent$GenericFutureListener"

TracingSubscriber_merge_shortest_root_gc

Steps to reproduce
If necessary, I can try to provide a sample or at least the important changes to the repo (-> Slack)

What did you expect to see?
That the api-gateway doesn't ran into the OutOfMemoryError

What did you see instead?
java.lang.OutOfMemoryError: Java heap space

What version are you using?
1.2.0

Environment
Java:
openjdk version "11.0.11" 2021-04-20
OpenJDK Runtime Environment AdoptOpenJDK-11.0.11+9 (build 11.0.11+9)
OpenJDK 64-Bit Server VM AdoptOpenJDK-11.0.11+9 (build 11.0.11+9, mixed mode)

OS:
5.8.0-53-generic #60~20.04.1-Ubuntu

Additional context

@matthias-huber matthias-huber added the bug Something isn't working label May 18, 2021
@iNikem
Copy link
Contributor

iNikem commented May 18, 2021

@matthias-huber Thank you for bug report and analysis. Can you try the same load test with previous versions of agents as well, 1.1 and 1.0? It's not required, but may help us to narrow down the potential change that introduced this bug.

@laurit
Copy link
Contributor

laurit commented May 18, 2021

@iNikem probably #2851 I guess InstrumentationContext doesn't like value referencing key.

@iNikem iNikem self-assigned this May 19, 2021
@matthias-huber
Copy link
Author

@iNikem I executed the same test (10minutes ~ 10 req/s) with the 1.1.0 agent and it doesn't show the same memory leak behaviour:

Heap (second drop at ~10:09 is caused by manually performing GC)

api-gateway-cpu-1 1 0

I also checked the live objects after the GC and as expected only a few io.opentelemetry.javaagent.shaded.instrumentation.reactor.TracingSubscriber objects are alive

tracing-subscriber-1 1 0

So, definitely it was introduced by the 1.2.0 version.

@jkwatson
Copy link
Contributor

@breedx-splk does the iguanodon pick this regression up?

@breedx-splk
Copy link
Contributor

@breedx-splk does the iguanodon pick this regression up?

It doesn't seem to, no. It uses spring-petclinic-rest, which I think doesn't trigger the reactor instrumentation.

@RashmiRam
Copy link
Contributor

@trask Does it make sense to call out this issue in 1.2.0 documentation if there is no back port of this fix is available? Should be helpful for new adopters.

@trask
Copy link
Member

trask commented Jul 1, 2021

hey @RashmiRam! do you have a suggestion on where to call out this issue? do you think new adopters are downloading older versions? thx!

@RashmiRam
Copy link
Contributor

RashmiRam commented Jul 1, 2021

AFAIk, people download penultimate versions for safety. I was thinking it could be put in the read me section as itself.

@trask
Copy link
Member

trask commented Jul 1, 2021

What do you think if the README strongly recommends to review the release notes of newer versions if you are choosing not to download the latest version?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants