
exporter/prometheusremotewrite: wal leads to oom under high load #19363

Closed
sh0rez opened this issue Mar 6, 2023 · 18 comments · Fixed by #37630

Comments

@sh0rez
Member

sh0rez commented Mar 6, 2023

Describe the bug
When running the prometheusremotewrite exporter with the WAL enabled under (very) high load (250k active series), it quickly builds up memory until the kernel OOM-kills otelcol.

Steps to reproduce

docker-compose.yml
version: '2'
services:
  # generates 250k series (1000 metrics x 250 series each) to be scraped by otelcol
  avalanche:
    image: quay.io/prometheuscommunity/avalanche:main
    command:
      - --metric-count=1000
      - --series-count=250
      - --label-count=5
      - --series-interval=3600
      - --metric-interval=3600

  otel:
    image: otel/opentelemetry-collector
    volumes: [otel-cfg:/etc/otelcol]
    user: 0:0
    tmpfs:
      - /wal
    depends_on:
      otel-cfg:
        condition: service_completed_successfully
    mem_limit: 8G
    restart: always

  otel-cfg:
    image: alpine
    volumes: [otel-cfg:/etc/otelcol]
    command:
      - sh
      - -c
      - |
        cat - > /etc/otelcol/config.yaml << EOF
        receivers:
          prometheus:
            config:
              scrape_configs:
              - job_name: stress
                scrape_interval: 15s
                static_configs:
                  - targets:
                    - avalanche:9001
        processors:
          batch:
        exporters:
          prometheusremotewrite:
            endpoint: http://receiver:9090/api/v1/write
            wal:
              directory: /wal
        service:
          pipelines:
            metrics:
              receivers: [prometheus]
              processors: [batch]
              exporters: [prometheusremotewrite]
        EOF
  # dummy HTTP server to "receive" remote_write samples by always replying with HTTP 200
  receiver:
    image: caddy
    command: sh -c 'echo ":9090" > /tmp/Caddyfile && exec caddy run --config /tmp/Caddyfile'

  # prometheus observing resource usage of otelcol
  prometheus:
    image: prom/prometheus
    ports:
      - 9090:9090
    entrypoint: /bin/sh
    command:
      - -c
      - |
        cat - > prometheus.yml << EOF && /bin/prometheus
        global:
          scrape_interval: 5s
        scrape_configs:
          - job_name: otel
            static_configs:
              - targets:
                - otel:8888
        EOF
volumes:
  otel-cfg: {}

What did you expect to see?
Otelcol having high but (periodically) stable memory usage

What did you see instead?

Otelcol repeatedly builds up memory until it is OOM-killed by the operating system, only to restart and repeat this exact behavior

Screenshot from 2023-03-06 23-33-13

What version did you use?
Version: Docker otel/opentelemetry-collector-contrib:0.72.0

What config did you use?
See above docker-compose.yml

Environment

docker info
Client:
 Context:    default
 Debug Mode: false
 Plugins:
  compose: Docker Compose (Docker Inc., 2.13.0)

Server:
 Containers: 31
  Running: 0
  Paused: 0
  Stopped: 31
 Images: 65
 Server Version: 20.10.21
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: false
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 770bd0108c32f3fb5c73ae1264f7e503fe7b2661.m
 runc version: 
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
  cgroupns
 Kernel Version: 5.12.10-arch1-1
 Operating System: Arch Linux
 OSType: linux
 Architecture: x86_64
 CPUs: 12
 Total Memory: 15.39GiB
 Name: <omit>
 ID: <omit>
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Username: <omit>
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Additional context
This only occurs when the WAL is enabled. Other Prometheus agents (Grafana Agent, Prometheus in agent mode) do not show this behavior on the exact same input data.

@sh0rez sh0rez added the bug Something isn't working label Mar 6, 2023
@Aneurysm9 Aneurysm9 self-assigned this Mar 7, 2023
@Aneurysm9 Aneurysm9 transferred this issue from open-telemetry/opentelemetry-collector Mar 7, 2023
@datsabk
Contributor

datsabk commented Mar 7, 2023

+1 on this. We are observing even crazier behaviour: we have OTel agent pods running on nodes containing just 5-7 pods, and the usage still goes beyond 20GiB on some nodes. There seems to be no relation to what it is trying to scrape; it's literally some uncaught memory leak that's causing this.

@sh0rez
Member Author

sh0rez commented Mar 7, 2023

I'm actively looking into this and will propose a fix once I've found the cause.

@datsabk
Contributor

datsabk commented Mar 7, 2023

We tried removing the memory_limiter processor and introducing the memory_ballast extension instead, to avoid dropping data.
Does this look like the right approach to save on memory?
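
For context, here is a minimal sketch of the two knobs being compared, with placeholder sizes rather than recommendations (the field names are taken from the collector documentation as I recall it, so treat them as an assumption):

extensions:
  memory_ballast:
    size_mib: 2048             # pre-allocates heap so the GC runs less aggressively
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 6000            # refuse data above this instead of growing until OOM
    spike_limit_mib: 1000
  batch: {}
service:
  extensions: [memory_ballast]
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]

The ballast only smooths garbage-collection behaviour, while the limiter applies back-pressure; neither changes how the WAL itself buffers data.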

@nicolastakashi
Contributor

I don't know the exporter code, but there are some GitHub issues about Prometheus consuming a lot of memory when replaying its WAL; may this be related?

@frzifus
Member

frzifus commented Mar 13, 2023

I don't know the exporter code, but there are some GitHub issues about Prometheus consuming a lot of memory when replaying its WAL; may this be related?

@nicolastakashi Can you link it?

Update: I guess it's one of these? prometheus/prometheus#6934, prometheus/prometheus#10750

Seems it's coming from here (pprof excerpt):

github.com/prometheus/prometheus/scrape.(*scrapeLoop).run
github.com/prometheus/prometheus@v0.42.1-0.20230210113933-af1d9e01c7e4/scrape/scrape.go

  Total:           0     4.51GB (flat, cum) 86.96%
   1262            .     4.51GB           ???

Seems it's coming from this scrapeLoop (flame-graph screenshot omitted).

I enabled pprof to see what's going on. Here is the file profile.pb.gz; I assume I can continue by the end of this week.
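
For anyone reproducing this: a rough sketch of how the pprof extension can be enabled on the collector to capture such a profile (the endpoint shown is an assumption based on the extension's documented default):

extensions:
  pprof:
    endpoint: localhost:1777
service:
  extensions: [pprof]
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [batch]
      exporters: [prometheusremotewrite]

A heap profile can then be pulled with something like go tool pprof http://localhost:1777/debug/pprof/heap.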

@sh0rez
Member Author

sh0rez commented Mar 14, 2023

@nicolastakashi:
the otel-collector prometheusremotewriteexporter implements its own write-ahead log in https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/exporter/prometheusremotewriteexporter/wal.go, which is unrelated to any WAL implementation found in the Prometheus repositories.

I suspect that renders any Prometheus-specific discussions less relevant to us, sadly.


@frzifus:
It surprises me that it shows the scrapeLoop, as I cannot observe any OOM behavior with the WAL disabled (the default setting). However, from reading the code, the scrapeLoop should be the same regardless of the WAL setting ...

@sh0rez
Member Author

sh0rez commented Mar 16, 2023

Using a profiler confirmed that the WAL is the culprit: one observes a clear leak with the WAL enabled and none without.

Without:
Screenshot from 2023-03-16 16-14-46

With:
Screenshot from 2023-03-16 16-14-58

click images to view full flamegraph


However, as observed before, the leaking memory appears to originate from the scrapeLoop of the receiver.

There must be a very non-obvious reason why the WAL keeps a hold on that memory. Will keep digging.

@gouthamve
Member

This is a duplicate of #15277 unfortunately.

The following function deadlocks (can't get the mutex):

// Otherwise the WAL is enabled, and just persist the requests to the WAL
// and they'll be exported in another goroutine to the RemoteWrite endpoint.
if err = prwe.wal.persistToWAL(requests); err != nil {
	return consumererror.NewPermanent(err)
}
return nil

And this leads to the PushMetrics call never returning. This means the pipeline is getting backed up, which means we are retaining the samples (produced from scrape) in the pipeline. This builds up to an OOM.
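
To make that concrete, here is a stripped-down sketch of the blocking pattern (illustrative names only, not the actual exporter code): the read side holds the WAL mutex while waiting for data that only the write side can add, and the write side needs that same mutex to persist anything.

package main

import (
	"fmt"
	"sync"
	"time"
)

// wal is a toy stand-in for the exporter's WAL wrapper.
type wal struct {
	mu      sync.Mutex
	entries []string
	newData chan struct{} // stands in for the filesystem-notification wakeup
}

// reader mimics the WAL read loop: it takes the lock, finds the WAL empty,
// and then waits for a write without ever releasing the lock.
func (w *wal) reader() {
	w.mu.Lock()
	defer w.mu.Unlock()
	if len(w.entries) == 0 {
		<-w.newData // blocks forever: the writer can never get past Lock()
	}
}

// writer mimics persistToWAL: it needs the same lock to append, so it is
// stuck behind the reader, and the caller (PushMetrics) never returns.
func (w *wal) writer(entry string) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.entries = append(w.entries, entry)
	select {
	case w.newData <- struct{}{}:
	default:
	}
}

func main() {
	w := &wal{newData: make(chan struct{}, 1)}
	go w.reader()
	time.Sleep(100 * time.Millisecond) // let the reader grab the lock first

	done := make(chan struct{})
	go func() { w.writer("sample"); close(done) }()

	select {
	case <-done:
		fmt.Println("write persisted")
	case <-time.After(2 * time.Second):
		fmt.Println("deadlock: writer is stuck behind the reader holding the lock")
	}
}

With the lock held across that wait, the scrape data queued upstream keeps accumulating, which matches the sawtooth OOM pattern in the graphs above.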

@github-actions
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label May 16, 2023
@kovrus kovrus removed the Stale label May 16, 2023
@Hronom

Hronom commented Jun 1, 2023

Hey, I think this needs to be resolved.

@github-actions github-actions bot added the Stale label Aug 1, 2023
@gouthamve gouthamve removed the Stale label Aug 1, 2023
@github-actions github-actions bot added the Stale label Jan 16, 2024
This issue has been closed as inactive because it has been stale for 120 days with no activity.

@github-actions github-actions bot closed this as not planned Mar 16, 2024
@jpkrohling jpkrohling reopened this May 6, 2024
@github-actions github-actions bot added the Stale label Jul 8, 2024
@harishkumarrajasekaran

Any updates on this issue? We are facing the same: with the WAL enabled, metrics are not being sent to the remote destination and the collector gets OOMKilled.

@github-actions github-actions bot removed the Stale label Aug 15, 2024
@github-actions github-actions bot added the Stale label Oct 15, 2024
@github-actions github-actions bot closed this as not planned Dec 14, 2024
songy23 pushed a commit that referenced this issue Feb 4, 2025
I was taking a look at #20875, hoping to finish it.
Fixes #19363
Fixes #24399
Fixes #15277 

---

As mentioned in #24399 (comment), I used a library to help me understand how the deadlock was happening (1st commit). It showed that `persistToWal` was trying to acquire the lock while `readPrompbFromWal` held it forever.

I changed the strategy here: instead of using fs.Notify and all the complicated logic around it, we're just using a pub/sub strategy between the writer and reader goroutines.

The reader goroutine, on finding an empty WAL, now releases the lock immediately and waits for a notification from the writer, whereas previously it would hold the lock while waiting for a write that would never happen.

---------

Signed-off-by: Arthur Silva Sens <arthursens2005@gmail.com>
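
Below is a rough sketch of the reader/writer hand-off described in that commit, with a buffered channel standing in for whatever pub/sub mechanism the actual PR uses (names are illustrative, not the exporter's real code):

package main

import (
	"fmt"
	"sync"
	"time"
)

type wal struct {
	mu      sync.Mutex
	entries []string
	newData chan struct{} // buffered(1): one pending notification is enough
}

// read waits for data, but crucially releases the lock before blocking.
func (w *wal) read() (string, bool) {
	w.mu.Lock()
	for len(w.entries) == 0 {
		w.mu.Unlock() // key change: never hold the lock while waiting
		select {
		case <-w.newData:
		case <-time.After(2 * time.Second):
			return "", false // give up in this toy example
		}
		w.mu.Lock()
	}
	defer w.mu.Unlock()
	e := w.entries[0]
	w.entries = w.entries[1:]
	return e, true
}

// persist appends under the lock, then publishes a non-blocking notification.
func (w *wal) persist(entry string) {
	w.mu.Lock()
	w.entries = append(w.entries, entry)
	w.mu.Unlock()
	select {
	case w.newData <- struct{}{}:
	default: // a signal is already pending; the reader will see the new entry
	}
}

func main() {
	w := &wal{newData: make(chan struct{}, 1)}
	go func() {
		time.Sleep(100 * time.Millisecond)
		w.persist("sample") // the writer is never blocked behind an idle reader
	}()
	if e, ok := w.read(); ok {
		fmt.Println("read:", e)
	}
}

The important difference from the deadlock sketch further up is that the reader unlocks before it waits, so the writer can always make progress and wake it up.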
chengchuanpeng pushed a commit to chengchuanpeng/opentelemetry-collector-contrib that referenced this issue Feb 8, 2025