[exporter/prometheusremotewrite] Enabling WAL prevents metrics from being forwarded #15277

ImDevinC · 2022-10-18T22:45:16Z

What happened?

Description

When using the prometheusremotewrite exporter with the WAL enabled, no metrics are sent from the collector to the remote write destination.

Steps to Reproduce

Using the config in the config section below can reproduce this error by sending metrics to this collector. Disabling the WAL section causes all metrics to be sent properly.

Expected Result

Prometheus metrics should appear in the remote write destination.

Actual Result

No metrics were sent to the remote write destination.

Collector version

0.62.1

Environment information

Environment

AWS bottlerocket running otel/opentelemetry-collector-contrib:0.36.3 docker image

OpenTelemetry Collector configuration

exporters:
  logging:
    loglevel: info
  prometheusremotewrite:
    endpoint: http://thanos-receive-distributor:19291/api/v1/receive
    remote_write_queue:
      enabled: true
      num_consumers: 1
      queue_size: 5000
    resource_to_telemetry_conversion:
      enabled: true
    retry_on_failure:
      enabled: false
      initial_interval: 5s
      max_elapsed_time: 10s
      max_interval: 10s
    target_info:
      enabled: false
    timeout: 15s
    tls:
      insecure: true
    wal:
      buffer_size: 100
      directory: /data/prometheus/wal
      truncate_frequency: 45s
extensions:
  health_check: {}
  memory_ballast: {}
  pprof:
    endpoint: :1888
  zpages:
    endpoint: :55679
processors:
  batch: {}
  batch/metrics:
    send_batch_max_size: 500
    send_batch_size: 500
    timeout: 180s
  memory_limiter:
    check_interval: 5s
    limit_mib: 4915
    spike_limit_mib: 1536
receivers:
  jaeger:
    protocols:
      grpc:
        endpoint: 0.0.0.0:14250
      thrift_compact:
        endpoint: 0.0.0.0:6831
      thrift_http:
        endpoint: 0.0.0.0:14268
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  prometheus:
    config:
      scrape_configs:
      - job_name: opentelemetry-collector
        scrape_interval: 10s
        static_configs:
        - targets:
          - ${MY_POD_IP}:8888
  zipkin:
    endpoint: 0.0.0.0:9411
service:
  extensions:
  - health_check
  - pprof
  - zpages
  pipelines:
    logs:
      exporters:
      - logging
      processors:
      - memory_limiter
      - batch
      receivers:
      - otlp
    metrics:
      exporters:
      - prometheusremotewrite
      processors:
      - batch/metrics
      receivers:
      - otlp
    traces:
      exporters:
      - logging
      processors:
      - memory_limiter
      - batch
      receivers:
      - otlp
      - jaeger
      - zipkin
  telemetry:
    metrics:
      address: 0.0.0.0:8888

Log output

No response

Additional context

From debugging, this looks to be a deadlock between persistToWAL() and readPrompbFromWAL(), but I'm not 100% certain

The text was updated successfully, but these errors were encountered:

HudsonHumphries · 2022-10-19T18:28:28Z

+1 I am also having issues when using the WAL for the prometheusremotewrite exporter. The only way I could get it to export metrics was by setting the buffer_size to 1 and exporting 1 metric at a time is not an option

github-actions · 2022-12-19T03:30:48Z

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

ImDevinC · 2023-01-11T04:47:57Z

We have moved off of the prometheusremotewrite and it looks like there's no action on this. Closing the ticket

kovrus · 2023-01-11T10:52:10Z

@ImDevinC Can you reopen this issue? It has to be investigated and fixed anyways.

ckt114 · 2023-03-03T15:52:25Z

Any update on this? I'm seeing the same issue. As soon as I enable WAL no metric is sent out.

gouthamve · 2023-03-16T16:52:19Z

This is a deadlock. From what I can see the following is happening:

readPrompbFromWAL:

Takes mutex
Reads data
If data is found, returns

The problem is when data is not found, it watches the file:

Takes mutex
Reads data
If no data is found, watch the file for updates
Get blocked because the mutex is taken and writes can't happen.

Removing the file watcher fixes the issue.

However, it exposes another bug, we keep reading the same data and resending the requests again and again. I think the WAL implementation needs a closer look.

kumar0204 · 2023-04-03T11:42:43Z

I am working on set up as follows, has the same issue where in Opentelemetry metrics are not forwaded to Grafana via Victoriametrics when WAL configuration is enabled in Opentelemetry collector configuration. however when wal is disabled we can see metrics are seen on grafana dashboard.

flow: App-->Otel Agen--> VictoriaMetrics--> grafana
use case: I want to implement persistence of metrics in the event of any failures.
ex: in this flow if VM insert/VM is down and comes back online after some downtime Otel agent to should retry the failed metrics and post on VM and same should be seen grafana.

Please advise if any better solution available for my use case.
Please someone help to fix the issue.

frzifus · 2023-05-25T12:20:32Z

I can confirm the same. To be able to test it faster, I moved the relevant parts into a config file that works locally.

Details: Locally tested config with reported settings

---
exporters:
  logging:
    verbosity: detailed
  prometheusremotewrite:
    endpoint: http://127.0.0.1:9090/api/v1/write
    remote_write_queue:
      enabled: true
      num_consumers: 1
      queue_size: 5000
    resource_to_telemetry_conversion:
      enabled: true
    retry_on_failure:
      enabled: false
      initial_interval: 5s
      max_elapsed_time: 10s
      max_interval: 10s
    target_info:
      enabled: false
    timeout: 15s
    tls:
      insecure: true
    wal:
      buffer_size: 100
      directory: ./wal
      truncate_frequency: 45s
extensions:
  health_check: {}
  memory_ballast: {}
  pprof:
    endpoint: :1888
processors:
  batch: {}
  batch/metrics:
    send_batch_max_size: 500
    send_batch_size: 500
    timeout: 180s
  memory_limiter:
    check_interval: 5s
    limit_mib: 4915
    spike_limit_mib: 1536
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
service:
  extensions: [health_check,pprof]
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch/metrics]
      exporters: [logging,prometheusremotewrite]
  telemetry:
    metrics:
      address: 0.0.0.0:8888

Then I used telemetrygen to generate some data. The collector starts to hang and needs to be force killed.

telemetrygen metrics --otlp-insecure --duration 45s --rate 500

But using this patch #20875 from @sh0rez I start to receive metrics:

# HELP rwrecv_requests_total 
# TYPE rwrecv_requests_total counter
rwrecv_requests_total{code="200",method="GET",path="/metrics",remote="localhost"} 3
rwrecv_requests_total{code="200",method="POST",path="/api/v1/write",remote="localhost"} 29
# HELP rwrecv_samples_received_total 
# TYPE rwrecv_samples_received_total counter
rwrecv_samples_received_total{remote="localhost"} 7514

zakariais · 2023-06-15T10:10:50Z

I am working on set up as follows, has the same issue where in Opentelemetry metrics are not forwaded to Grafana via Victoriametrics when WAL configuration is enabled in Opentelemetry collector configuration. however when wal is disabled we can see metrics are seen on grafana dashboard.

flow: App-->Otel Agen--> VictoriaMetrics--> grafana
use case: I want to implement persistence of metrics in the event of any failures.
ex: in this flow if VM insert/VM is down and comes back online after some downtime Otel agent to should retry the failed metrics and post on VM and same should be seen grafana.

Please advise if any better solution available for my use case.
Please someone help to fix the issue.

@kumar0204 I'm looking to do the same thing for OTEL to retry failed in case of backend goes down, did you find something for this like persistence or anything with remote write exporter?

frzifus · 2023-06-15T14:26:19Z

@zakariais is the filestorage extension what you are looking for?

zakariais · 2023-06-15T14:27:44Z

@zakariais is the filestorage extension what you are looking for?

@frzifus does the file storage extension work with prometheus remote write exporter?
I didn't see that it works in the README of it.

kumar0204 · 2023-06-16T07:45:59Z

I am working on set up as follows, has the same issue where in Opentelemetry metrics are not forwaded to Grafana via Victoriametrics when WAL configuration is enabled in Opentelemetry collector configuration. however when wal is disabled we can see metrics are seen on grafana dashboard.
flow: App-->Otel Agen--> VictoriaMetrics--> grafana
use case: I want to implement persistence of metrics in the event of any failures.
ex: in this flow if VM insert/VM is down and comes back online after some downtime Otel agent to should retry the failed metrics and post on VM and same should be seen grafana.
Please advise if any better solution available for my use case.
Please someone help to fix the issue.

@kumar0204 I'm looking to do the same thing for OTEL to retry failed in case of backend goes down, did you find something for this like persistence or anything with remote write exporter?

I have 2 types of persistence used in our set up.
my flow is like this
Service/Application-->Otel Agent with filestorage extention used for persistence -> Otel collector /Gateway with WriteAheadLog using prometheusremotewrite for persistence --> Victoria metrics ( SRE Back end) --> Grafana

1 use case:
in the above set up metrics are stored at agent end using filestorage extention, in case if Gateway is down then metrics are replayed from otel agent side.
2nd use case:in case Victoriametrics/Promethus is down metrics are stored at WAL log, once the SRE back end is up and running metrics are replayed from gateway.
my second use case has issue with WAL, when WAL enabled metrics are not reaching grafana.
hope you understood the issue clearly.

github-actions · 2023-08-16T03:29:22Z

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

github-actions · 2023-10-16T03:30:40Z

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

github-actions · 2023-10-24T15:29:10Z

Pinging code owners for exporter/prometheusremotewrite: @Aneurysm9 @rapphil. See Adding Labels via Comments if you do not have permissions to add labels yourself.

cheskayang · 2023-11-10T02:57:27Z

i have a similar setup as @kumar0204 and running into the exact same issue with enabling wal on prometheusremotewrite

frzifus · 2023-11-16T00:29:33Z

There is actually already a fix that has to be polished: #20875

Do you want to work on that @cheskayang ?

cheskayang · 2023-12-22T20:38:07Z

@frzifus thx for letting me know! i saw you opened a pr after this comment, but it's stale, #29297

do you still plan to ship the fix?

devyanigoil · 2024-02-15T14:59:16Z

@kumar0204 Even i have a similar setup. Were you able to solve the WAL issue?

diranged · 2024-05-02T13:48:47Z

Ping ... we'd really like to see this get fixed as well... :/

sh0rez · 2024-05-02T15:08:22Z

i've reopened and rebased #20875 which will fix this

a-shoemaker · 2024-06-05T16:19:58Z

prometheusremotewrite with WAL enabled just flat out doesn't work. I've never seen it work anyway. There has been a PR out there to fix for over a year it looks like. Curious what the plan is here? Merge that, get a different fix, just remove WAL, or just leave it out there indifferently not working at all?

morytina · 2024-11-07T06:58:20Z

I am also having the same issue.
Is this issue currently being worked on for a solution?
As mentioned above, I would like to be able to use otlp's sending_queue (persistent queue) for the solution, or have the WAL option work properly.

ImDevinC added bug Something isn't working needs triage New item requiring triage labels Oct 18, 2022

ImDevinC mentioned this issue Oct 18, 2022

[exporter/prometheusremotewrite] Add timeout to prevent deadlock #15278

Closed

evan-bradley added priority:p2 Medium exporter/awsprometheusremotewrite AWS PRW exporter related issues and removed needs triage New item requiring triage labels Oct 19, 2022

github-actions bot added the Stale label Dec 19, 2022

ImDevinC closed this as completed Jan 11, 2023

Aneurysm9 reopened this Jan 11, 2023

Aneurysm9 removed the Stale label Jan 11, 2023

gouthamve mentioned this issue Mar 16, 2023

exporter/prometheusremotewrite: wal leads to oom under high load #19363

Closed

github-actions bot added the Stale label Aug 16, 2023

frzifus removed the Stale label Aug 16, 2023

github-actions bot added the Stale label Oct 16, 2023

crobert-1 added never stale Issues marked with this label will be never staled and automatically removed and removed Stale labels Oct 24, 2023

crobert-1 added exporter/prometheusremotewrite and removed exporter/awsprometheusremotewrite AWS PRW exporter related issues labels Oct 24, 2023

sh0rez mentioned this issue May 6, 2024

prometheusremotewrite: release lock when wal empty #20875

Closed

Kaivalya1997 mentioned this issue May 21, 2024

Replace the RemoteWriteQueue and WAL with the exporterhelper queue (sending_queue) in Prometheusremotewriteexporter #33137

Closed

jmichalek132 mentioned this issue Sep 3, 2024

Enabling WAL is not exporting metrics to Mimir backend using Prometheus remote write exporter #33238

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[exporter/prometheusremotewrite] Enabling WAL prevents metrics from being forwarded #15277

[exporter/prometheusremotewrite] Enabling WAL prevents metrics from being forwarded #15277

ImDevinC commented Oct 18, 2022

HudsonHumphries commented Oct 19, 2022

github-actions bot commented Dec 19, 2022

ImDevinC commented Jan 11, 2023

kovrus commented Jan 11, 2023

ckt114 commented Mar 3, 2023

gouthamve commented Mar 16, 2023

kumar0204 commented Apr 3, 2023

frzifus commented May 25, 2023

zakariais commented Jun 15, 2023

frzifus commented Jun 15, 2023

zakariais commented Jun 15, 2023

kumar0204 commented Jun 16, 2023

github-actions bot commented Aug 16, 2023

github-actions bot commented Oct 16, 2023

github-actions bot commented Oct 24, 2023

cheskayang commented Nov 10, 2023

frzifus commented Nov 16, 2023

cheskayang commented Dec 22, 2023 •

edited

Loading

devyanigoil commented Feb 15, 2024

diranged commented May 2, 2024

sh0rez commented May 2, 2024

a-shoemaker commented Jun 5, 2024

morytina commented Nov 7, 2024

[exporter/prometheusremotewrite] Enabling WAL prevents metrics from being forwarded #15277

[exporter/prometheusremotewrite] Enabling WAL prevents metrics from being forwarded #15277

Comments

ImDevinC commented Oct 18, 2022

What happened?

Description

Steps to Reproduce

Expected Result

Actual Result

Collector version

Environment information

Environment

OpenTelemetry Collector configuration

Log output

Additional context

HudsonHumphries commented Oct 19, 2022

github-actions bot commented Dec 19, 2022

ImDevinC commented Jan 11, 2023

kovrus commented Jan 11, 2023

ckt114 commented Mar 3, 2023

gouthamve commented Mar 16, 2023

kumar0204 commented Apr 3, 2023

frzifus commented May 25, 2023

zakariais commented Jun 15, 2023

frzifus commented Jun 15, 2023

zakariais commented Jun 15, 2023

kumar0204 commented Jun 16, 2023

github-actions bot commented Aug 16, 2023

github-actions bot commented Oct 16, 2023

github-actions bot commented Oct 24, 2023

cheskayang commented Nov 10, 2023

frzifus commented Nov 16, 2023

cheskayang commented Dec 22, 2023 • edited Loading

devyanigoil commented Feb 15, 2024

diranged commented May 2, 2024

sh0rez commented May 2, 2024

a-shoemaker commented Jun 5, 2024

morytina commented Nov 7, 2024

cheskayang commented Dec 22, 2023 •

edited

Loading