exporter/prometheusremotewrite: wal leads to oom under high load #19363
Comments
+1 on this. We are observing even more extreme behaviour: we have OTel agent pods running on nodes with just 5-7 pods, and usage still goes beyond 20GiB on some nodes. There seems to be no relation to what it is trying to scrape; it looks like an uncaught memory leak.
I'm actively looking into this and will propose a fix once I find the cause.
We tried removing the memory_limiter and introducing the memory_ballast extension instead, to avoid dropping data.
I don't know the exporter code, but there are some GitHub issues about Prometheus consuming a lot of memory while replaying the WAL. Could this be related?
@nicolastakashi Can you link them? Update: I guess it's one of these: prometheus/prometheus#6934, prometheus/prometheus#10750. It seems to be coming from here:
pprof details: it seems to be coming from this scrape_loop. I enabled pprof to see what's going on. Here is the file: profile.pb.gz. I assume I can continue by the end of this week.
@nicolastakashi: I suspect that renders any Prometheus-specific discussions less relevant to us, sadly. @frzifus:
Using a profiler confirmed the suspicion that the WAL is the culprit: one observes a clear leak with the WAL enabled and none without (click the images to view the full flamegraphs). However, as observed before, the leaking memory appears to originate from the …
There must be a very non-obvious reason why the WAL keeps a hold on that memory. Will keep digging.
This is unfortunately a duplicate of #15277. The following function deadlocks (it cannot acquire the mutex): opentelemetry-collector-contrib/exporter/prometheusremotewriteexporter/exporter.go, lines 177 to 182 at cd146d5.
And this leads to the …
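For readers less familiar with the code, here is a minimal, hypothetical Go sketch of the deadlock pattern described above. The names and structure are illustrative only, not the exporter's actual implementation: the point is that the reader blocks on a notification while still holding the mutex, so the writer can never persist data.

```go
// Illustrative sketch of the deadlock (hypothetical names, not the exporter's
// actual code): the reader acquires the mutex, then blocks waiting for a
// notification while still holding the lock. The writer can never acquire the
// mutex to persist data, so the notification never arrives, nothing is flushed,
// and queued samples pile up in memory until the kernel kills the process.
package walsketch

import "sync"

type walSketch struct {
	mtx     sync.Mutex
	notify  chan struct{} // stands in for the filesystem-notification wake-up
	entries [][]byte
}

// readFromWAL mirrors the problematic pattern: the lock is held across the wait.
func (w *walSketch) readFromWAL() [][]byte {
	w.mtx.Lock()
	defer w.mtx.Unlock()
	if len(w.entries) == 0 {
		<-w.notify // blocks forever while still holding w.mtx
	}
	batch := w.entries
	w.entries = nil
	return batch
}

// persistToWAL blocks on Lock() while readFromWAL is waiting above, so the
// notification below is never sent.
func (w *walSketch) persistToWAL(entry []byte) {
	w.mtx.Lock()
	defer w.mtx.Unlock()
	w.entries = append(w.entries, entry)
	select {
	case w.notify <- struct{}{}:
	default:
	}
}
```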
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure which component this issue relates to, please ping the code owners. Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself.
Hey, I think this needs to be resolved.
This issue has been closed as inactive because it has been stale for 120 days with no activity. |
Any updates on this issue? We are facing the same problem: with the WAL enabled, metrics are not being sent to the remote destination and the collector gets OOMKilled.
I was taking a look over #20875 and hoping to finish it.

Fixes #19363
Fixes #24399
Fixes #15277

As mentioned in #24399 (comment), I used a library to help me understand how the deadlock was happening (1st commit). It showed that `persistToWal` was trying to acquire the lock while `readPrompbFromWal` held it forever.

I changed the strategy here: instead of using fs.Notify and all the complicated logic around it, we now use a pub/sub strategy between the writer and reader goroutines. The reader goroutine, upon finding an empty WAL, now releases the lock immediately and waits for a notification from the writer, whereas previously it would hold the lock while waiting for a write that could never happen.

Signed-off-by: Arthur Silva Sens <arthursens2005@gmail.com>
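To make the change concrete, below is a minimal sketch of the release-and-wait idea the PR describes. It is not the exporter's actual code: the names are hypothetical and it uses sync.Cond as a stand-in for the PR's exact pub/sub mechanism, but it shows the reader giving up the lock while the WAL is empty and being woken by the writer.

```go
// Minimal sketch (hypothetical names) of the reader/writer signalling described
// above: the reader releases the lock while the WAL is empty and is woken by
// the writer once new data has been persisted.
package main

import (
	"fmt"
	"sync"
)

type walSketch struct {
	mtx     sync.Mutex
	hasData *sync.Cond
	entries [][]byte
}

func newWALSketch() *walSketch {
	w := &walSketch{}
	w.hasData = sync.NewCond(&w.mtx)
	return w
}

// persistToWAL appends an entry and signals a waiting reader, replacing the
// filesystem-notification based wake-up.
func (w *walSketch) persistToWAL(entry []byte) {
	w.mtx.Lock()
	defer w.mtx.Unlock()
	w.entries = append(w.entries, entry)
	w.hasData.Signal()
}

// readFromWAL no longer holds the lock while the WAL is empty: Cond.Wait
// atomically releases the mutex while blocked and re-acquires it when signalled.
func (w *walSketch) readFromWAL() [][]byte {
	w.mtx.Lock()
	defer w.mtx.Unlock()
	for len(w.entries) == 0 {
		w.hasData.Wait()
	}
	batch := w.entries
	w.entries = nil
	return batch
}

func main() {
	w := newWALSketch()
	go w.persistToWAL([]byte("sample write request"))
	fmt.Printf("read %d entries from the WAL\n", len(w.readFromWAL()))
}
```

Because the emptiness check and the writer's append happen under the same mutex, a signal cannot be lost between the check and the wait, so the reader never hangs on a WAL that already has data.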
Describe the bug
When running the prometheusremotewrite exporter in WAL-enabled mode under (very) high load (250k active series), it quickly builds up memory until the kernel OOM-kills otelcol.
Steps to reproduce
docker-compose.yml
What did you expect to see?
Otelcol having a (high) but (periodically) stable memory usage
What did you see instead?
Otelcol repeatedly builds up memory until it is oom killed by the operating system, only to repeat this exact behavior
What version did you use?
Version: Docker
otel/opentelemetry-collector-contrib:0.72.0
What config did you use?
See above
docker-compose.yml
Environment
docker info
Additional context
This only occurs when WAL mode is enabled. Other Prometheus agents (Grafana Agent, Prometheus Agent Mode) do not show this behavior on the exact same input data.