
Potential memory leak #1940

Open
martinvisser opened this issue Jul 10, 2024 · 3 comments
Labels
more-info-needed Request for more information from issue author

Comments

@martinvisser

We're using Cloud Foundry's Java Buildpack to deploy our applications. During the latest platform upgrade we got the new version of splunk-otel-java, version 2.5.0.
According to the release notes, version 2.5.0 contains some breaking changes which on their own are resolvable. However, because we did not know we had received this new version, all we saw was a huge increase in errors being logged by the exporter, like this:
io.opentelemetry.exporter.internal.http.HttpExporter - Failed to export spans. Server responded with HTTP status code 404. Error message: Unable to parse response body, HTTP status message:

I'd expect that on its own not to cause trouble, but it seems that this now wrongly configured service might lead to a memory leak: at the same time we deployed a new Spring Boot application, we started to see a rise in memory and CPU, eventually leading to crashes. We reverted to an older deployment, but the results stayed the same: increasing memory and CPU.

After we reconfigured the service binding to use the old values instead of the new defaults, the issues were gone.
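Roughly, reverting to the "old values" amounted to settings like these (a sketch using the standard OpenTelemetry autoconfiguration environment variables; the exact keys exposed through the Java Buildpack service binding in our setup may differ):

```sh
# Sketch: revert to the pre-2.x behaviour, i.e. only traces are exported, over gRPC.
# The actual service-binding keys in Cloud Foundry may be named differently.
OTEL_METRICS_EXPORTER=none
OTEL_LOGS_EXPORTER=none
OTEL_EXPORTER_OTLP_PROTOCOL=grpc
```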

We did not manage to reliably reproduce it and were not able to get heap dumps, because when a container crashes there's no more access to it.

@breedx-splk
Contributor

Hey @martinvisser, thanks for this report, and I'm sorry that you had a somewhat painful experience with that upgrade. I would not be surprised to hear that a misconfigured agent uses more memory, as it does take some resources to buffer telemetry, potentially do retries, and log the exceptions. Can you clarify what you meant by "eventually leading to crashes", though?

I'm not sure what we should do about a speculative "there might be a memory leak somewhere" issue. Having a reproduction or any more specifics sure would help.

@breedx-splk added the more-info-needed (Request for more information from issue author) label on Jul 19, 2024
@laurit
Collaborator

laurit commented Jul 22, 2024

@martinvisser Have a look at the breaking changes in https://github.com/signalfx/splunk-otel-java/releases/tag/v2.0.0-alpha. By default, metrics and logs are now also exported; perhaps the export fails because you don't have the pipelines for these signals configured in the collector? If you don't need them, you can disable the exporters with -Dotel.metrics.exporter=none and -Dotel.logs.exporter=none, as described in the release notes. Another change is that the default OTLP protocol has changed from grpc to http/protobuf. Consult the release notes if you need to use grpc.
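For example, passing those flags as JVM arguments could look like this (a minimal sketch; how JAVA_OPTS reaches the JVM depends on the buildpack, and the protocol property shown is the standard OpenTelemetry otel.exporter.otlp.protocol setting referenced in the release notes):

```sh
# Sketch: disable metric and log export and keep using gRPC for traces.
# Adjust to however your deployment passes JVM options to the application.
JAVA_OPTS="$JAVA_OPTS \
  -Dotel.metrics.exporter=none \
  -Dotel.logs.exporter=none \
  -Dotel.exporter.otlp.protocol=grpc"
```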

@martinvisser
Author

@laurit That's actually what I meant by using the "old values" to make it work again.

@breedx-splk We're running on Cloud Foundry, which has a feature to kill an application if, for example, its health check fails. This is considered a crash. The memory increased, which caused the GC to become slower and slower, which in the end made the health check respond too slowly, which then led to a crash.
I have no way of consistently reproducing the issue, as this happened in a complex application with no reliable path to the problem, but all instances of the application (again, we're running on Cloud Foundry) showed similar behaviour, with memory eventually increasing to values that were too high. If possible, I would certainly provide a reproducible example, but unfortunately I cannot.
So perhaps this is more of an FYI, but using the defaults should still not lead to such problems, IMHO.
