Memory leak problem with Opentelemetry Collector #29762
Comments
Are you running with some memory limit in place? It's likely you need to use GOMEMLIMIT or equivalent, and you can also run the pprof extension to capture memory usage. It looks like this bug is for the tailsampling processor, is that correct? |
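For reference, enabling the pprof extension mentioned above looks roughly like this; a minimal sketch, with the endpoint chosen to match the config posted later in this thread (GOMEMLIMIT itself is an environment variable for the Go runtime, set on the container rather than in this file):
extensions:
  pprof:
    # exposes Go's pprof HTTP endpoints; a heap profile can then be captured with
    # `go tool pprof http://<collector-host>:1777/debug/pprof/heap`
    endpoint: 0.0.0.0:1777
service:
  extensions:
  - pprof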
@atoulme |
@akiyama-naoki23-fixer can you provide us with a profile of your running Collector? You can use the pprof extension as @atoulme mentioned |
What is the memory available on the pod? |
I have not set a memory limit on the OTel collector pod, so I guess it depends on the capacity of the node, but it is more than 30 GiB. |
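As a rough illustration of the suggestion above, a memory limit and GOMEMLIMIT are usually set on the collector container itself; a sketch with placeholder values (the image tag and sizes below are examples, not taken from this deployment):
containers:
- name: otel-collector
  image: otel/opentelemetry-collector-contrib:0.91.0
  env:
  # ask the Go runtime to start reclaiming memory before the pod limit is reached
  - name: GOMEMLIMIT
    value: "3GiB"
  resources:
    requests:
      memory: 4Gi
    limits:
      memory: 4Gi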
I noticed you're using the spanmetrics connector; there was a recent merge of a memory leak fix: #28847. It was just released today: https://github.com/open-telemetry/opentelemetry-collector-contrib/releases/tag/v0.91.0 It might be worth upgrading the opentelemetry-operator once it's released with collector v0.91.0. |
I am transferring this to contrib since the current theory is that this is related to the spanmetrics connector |
Thank you very much. |
@albertteoh I have updated the spanmetrics connector, but I'm still having the issue. |
@nifrasinnovent could you share your OTEL config please? |
Thanks! There's also a known issue where exemplars were observed to use a large amount of memory and a configurable limit on exemplars was added in this PR: #29242 (not merged yet). As an experiment to narrow down the root cause, perhaps you could try temporarily setting exemplars.enabled=false to see if that resolves the issue you're seeing? |
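For anyone wanting to run the same experiment, turning exemplars off on the spanmetrics connector looks roughly like this (a sketch based on the connector's exemplars setting; merge it into your own connectors section):
connectors:
  spanmetrics:
    exemplars:
      # disable exemplar recording while investigating memory growth
      enabled: false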
let me try that |
@albertteoh yes, it was related to exemplars. The OTLP pod has not crashed due to memory for 17 hours. |
We believe 0.91.0 still leaks. We are back to running a non-contrib distribution to reduce our risk of memory leaks. |
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping |
Hey folks, this is also something I am seeing, and it happens at random times, usually after a few days. Here is my config:
exporters:
logging:
verbosity: basic
otlp/newrelic:
compression: gzip
endpoint: endpoint:4317
headers:
api-key: token
extensions:
health_check: null
pprof:
endpoint: 0.0.0.0:1777
zpages: null
processors:
batch:
send_batch_size: 10000
timeout: 10s
batch/sampled:
send_batch_size: 10000
timeout: 10s
filter/newrelic_and_otel:
error_mode: ignore
traces:
span:
- name == "TokenLinkingSubscriber.withNRToken"
memory_limiter:
check_interval: 5s
limit_mib: 3800
spike_limit_mib: 1000
resourcedetection/system:
detectors:
- env
- system
override: false
timeout: 2s
tail_sampling:
decision_wait: 60s
expected_new_traces_per_sec: 10000
num_traces: 50000000
policies:
- name: always_sample_error
status_code:
status_codes:
- ERROR
type: status_code
- and:
and_sub_policy:
- name: routes
string_attribute:
enabled_regex_matching: true
key: http.route
values:
- /health
- /(actuator|sys)/health
type: string_attribute
- name: probabilistic-policy
probabilistic:
sampling_percentage: 0.1
type: probabilistic
name: health_endpoints
type: and
- name: sample_10_percent
probabilistic:
sampling_percentage: 10
type: probabilistic
- latency:
threshold_ms: 3000
name: slow-requests
type: latency
receivers:
otlp:
protocols:
grpc: null
http: null
service:
extensions:
- zpages
- health_check
- pprof
pipelines:
logs/1:
exporters:
- otlp/newrelic
processors:
- resourcedetection/system
- batch
receivers:
- otlp
metrics/1:
exporters:
- otlp/newrelic
- logging
processors:
- resourcedetection/system
- batch
receivers:
- otlp
traces/1:
exporters:
- otlp/newrelic
- logging
processors:
- filter/newrelic_and_otel
- resourcedetection/system
- tail_sampling
- batch/sampled
receivers:
- otlp
telemetry:
metrics:
address: 0.0.0.0:8888
|
Pinging code owners for processor/tailsampling: @jpkrohling. See Adding Labels via Comments if you do not have permissions to add labels yourself. |
Thanks, everyone. |
@akiyama-naoki23-fixer Sorry to hear that, thank you for taking the time to report the issue and answer our questions in the first place. I am going to close this as wontfix since we won't be able to get more information about this specific case; if someone reading this finds themselves in a similar situation, please file a new issue, thanks! |
Describe the bug
Memory leak problem with the OpenTelemetry Collector.
Steps to reproduce
I wasn't able to reproduce this locally, but I think it may be because the collector received a huge trace with 20,000 spans.
What did you expect to see?
Expected memory usage to go up and down. However, memory usage is constantly going up.
What version did you use?
opentelemetry-operator:0.37.1
tempo-distributed:1.5.4
What config did you use?
Environment
OS: AKS Ubuntu Linux
Compiler: .NET 6.0 dotnet-autoinstrumentation