
[exporter/prometheusremotewrite] Enabling WAL prevents metrics from being forwarded #15277

Open
ImDevinC opened this issue Oct 18, 2022 · 23 comments
Labels
bug · exporter/prometheusremotewrite · never stale · priority:p2

Comments

@ImDevinC

What happened?

Description

When using the prometheusremotewrite exporter with the WAL enabled, no metrics are sent from the collector to the remote write destination.

Steps to Reproduce

This error can be reproduced by sending metrics to a collector running the configuration shown in the config section below. Removing the WAL section causes all metrics to be sent properly.

Expected Result

Prometheus metrics should appear in the remote write destination.

Actual Result

No metrics were sent to the remote write destination.

Collector version

0.62.1

Environment information

Environment

AWS Bottlerocket running the otel/opentelemetry-collector-contrib:0.36.3 Docker image

OpenTelemetry Collector configuration

exporters:
  logging:
    loglevel: info
  prometheusremotewrite:
    endpoint: http://thanos-receive-distributor:19291/api/v1/receive
    remote_write_queue:
      enabled: true
      num_consumers: 1
      queue_size: 5000
    resource_to_telemetry_conversion:
      enabled: true
    retry_on_failure:
      enabled: false
      initial_interval: 5s
      max_elapsed_time: 10s
      max_interval: 10s
    target_info:
      enabled: false
    timeout: 15s
    tls:
      insecure: true
    wal:
      buffer_size: 100
      directory: /data/prometheus/wal
      truncate_frequency: 45s
extensions:
  health_check: {}
  memory_ballast: {}
  pprof:
    endpoint: :1888
  zpages:
    endpoint: :55679
processors:
  batch: {}
  batch/metrics:
    send_batch_max_size: 500
    send_batch_size: 500
    timeout: 180s
  memory_limiter:
    check_interval: 5s
    limit_mib: 4915
    spike_limit_mib: 1536
receivers:
  jaeger:
    protocols:
      grpc:
        endpoint: 0.0.0.0:14250
      thrift_compact:
        endpoint: 0.0.0.0:6831
      thrift_http:
        endpoint: 0.0.0.0:14268
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  prometheus:
    config:
      scrape_configs:
      - job_name: opentelemetry-collector
        scrape_interval: 10s
        static_configs:
        - targets:
          - ${MY_POD_IP}:8888
  zipkin:
    endpoint: 0.0.0.0:9411
service:
  extensions:
  - health_check
  - pprof
  - zpages
  pipelines:
    logs:
      exporters:
      - logging
      processors:
      - memory_limiter
      - batch
      receivers:
      - otlp
    metrics:
      exporters:
      - prometheusremotewrite
      processors:
      - batch/metrics
      receivers:
      - otlp
    traces:
      exporters:
      - logging
      processors:
      - memory_limiter
      - batch
      receivers:
      - otlp
      - jaeger
      - zipkin
  telemetry:
    metrics:
      address: 0.0.0.0:8888

Log output

No response

Additional context

From debugging, this looks to be a deadlock between persistToWAL() and readPrompbFromWAL(), but I'm not 100% certain.

@ImDevinC ImDevinC added the bug and needs triage labels on Oct 18, 2022
@HudsonHumphries
Member

+1, I am also having issues when using the WAL with the prometheusremotewrite exporter. The only way I could get it to export metrics was by setting buffer_size to 1, and exporting one metric at a time is not an option.
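
For reference, that workaround corresponds to a WAL block like the sketch below (the endpoint is a placeholder); it effectively flushes one entry at a time and is far too slow for real traffic:

exporters:
  prometheusremotewrite:
    endpoint: http://example-backend:9090/api/v1/write  # placeholder
    wal:
      directory: /data/prometheus/wal
      buffer_size: 1  # workaround: a single-entry buffer; not practical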

@evan-bradley evan-bradley added the priority:p2 and exporter/awsprometheusremotewrite labels and removed the needs triage label on Oct 19, 2022
@github-actions
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

@github-actions github-actions bot added the Stale label Dec 19, 2022
@ImDevinC
Author

We have moved off of the prometheusremotewrite exporter, and it looks like there's no action on this. Closing the ticket.

@kovrus
Member

kovrus commented Jan 11, 2023

@ImDevinC Can you reopen this issue? It has to be investigated and fixed anyway.

@Aneurysm9 Aneurysm9 reopened this Jan 11, 2023
@Aneurysm9 Aneurysm9 removed the Stale label Jan 11, 2023
@ckt114

ckt114 commented Mar 3, 2023

Any update on this? I'm seeing the same issue. As soon as I enable the WAL, no metrics are sent out.

@gouthamve
Member

This is a deadlock. From what I can see, the following is happening:

readPrompbFromWAL:

  1. Takes the mutex
  2. Reads data
  3. If data is found, returns

The problem is when no data is found; in that case it watches the file:

  1. Takes the mutex
  2. Reads data
  3. If no data is found, watches the file for updates
  4. Blocks on the watch while still holding the mutex, so writes can never happen and the watch never fires

Removing the file watcher fixes the issue.


However, it exposes another bug: we keep reading the same data and resending the same requests over and over. I think the WAL implementation needs a closer look.
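
To make the sequence concrete, here is a minimal, self-contained Go sketch of that locking pattern. The type and names (wal, persist, read, newData) are hypothetical and are not the exporter's actual code; they only show why waiting for new data while still holding the mutex hangs the writer.

package main

import (
    "fmt"
    "sync"
    "time"
)

// wal is a stand-in for the exporter's WAL wrapper.
type wal struct {
    mu      sync.Mutex
    entries []string
    newData chan struct{} // stands in for the file-watcher notification
}

// persist appends an entry and signals the watcher channel. It needs the
// same mutex the reader holds, so it blocks while a read is in progress.
func (w *wal) persist(entry string) {
    w.mu.Lock()
    defer w.mu.Unlock()
    w.entries = append(w.entries, entry)
    select {
    case w.newData <- struct{}{}:
    default:
    }
}

// read takes the mutex and, if the WAL is empty, waits for the watcher to
// report new data without releasing the mutex. Because persist() can never
// acquire the mutex, the notification never arrives.
func (w *wal) read() string {
    w.mu.Lock()
    defer w.mu.Unlock()
    if len(w.entries) > 0 {
        return w.entries[0]
    }
    <-w.newData // blocks forever while holding w.mu
    return w.entries[0]
}

func main() {
    w := &wal{newData: make(chan struct{})}
    go func() {
        time.Sleep(100 * time.Millisecond)
        w.persist("sample") // stuck on w.mu.Lock()
    }()
    fmt.Println(w.read()) // never returns; the runtime reports "all goroutines are asleep - deadlock!"
}

Releasing the mutex before blocking on the watcher (or not watching at all, as above) avoids the hang.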

@kumar0204

I am working on a setup as follows and have the same issue: OpenTelemetry metrics are not forwarded to Grafana via VictoriaMetrics when the WAL configuration is enabled in the OpenTelemetry Collector configuration. However, when the WAL is disabled, metrics show up on the Grafana dashboard.

flow: App --> OTel Agent --> VictoriaMetrics --> Grafana
use case: I want to implement persistence of metrics in the event of any failures.
example: in this flow, if vminsert/VictoriaMetrics goes down and comes back online after some downtime, the OTel agent should retry the failed metrics and post them to VictoriaMetrics, and the same should then be visible in Grafana.

Please advise if a better solution is available for my use case.
Please, someone, help to fix the issue.

@frzifus
Member

frzifus commented May 25, 2023

I can confirm the same. To be able to test it faster, I moved the relevant parts into a config file that works locally.

Details: Locally tested config with reported settings
---
exporters:
  logging:
    verbosity: detailed
  prometheusremotewrite:
    endpoint: http://127.0.0.1:9090/api/v1/write
    remote_write_queue:
      enabled: true
      num_consumers: 1
      queue_size: 5000
    resource_to_telemetry_conversion:
      enabled: true
    retry_on_failure:
      enabled: false
      initial_interval: 5s
      max_elapsed_time: 10s
      max_interval: 10s
    target_info:
      enabled: false
    timeout: 15s
    tls:
      insecure: true
    wal:
      buffer_size: 100
      directory: ./wal
      truncate_frequency: 45s
extensions:
  health_check: {}
  memory_ballast: {}
  pprof:
    endpoint: :1888
processors:
  batch: {}
  batch/metrics:
    send_batch_max_size: 500
    send_batch_size: 500
    timeout: 180s
  memory_limiter:
    check_interval: 5s
    limit_mib: 4915
    spike_limit_mib: 1536
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
service:
  extensions: [health_check,pprof]
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch/metrics]
      exporters: [logging,prometheusremotewrite]
  telemetry:
    metrics:
      address: 0.0.0.0:8888

Then I used telemetrygen to generate some data. The collector starts to hang and needs to be force-killed.

telemetrygen metrics --otlp-insecure --duration 45s --rate 500

But with patch #20875 from @sh0rez applied, I start to receive metrics:

# HELP rwrecv_requests_total 
# TYPE rwrecv_requests_total counter
rwrecv_requests_total{code="200",method="GET",path="/metrics",remote="localhost"} 3
rwrecv_requests_total{code="200",method="POST",path="/api/v1/write",remote="localhost"} 29
# HELP rwrecv_samples_received_total 
# TYPE rwrecv_samples_received_total counter
rwrecv_samples_received_total{remote="localhost"} 7514

@zakariais

I am working on a setup as follows and have the same issue: OpenTelemetry metrics are not forwarded to Grafana via VictoriaMetrics when the WAL configuration is enabled in the OpenTelemetry Collector configuration. However, when the WAL is disabled, metrics show up on the Grafana dashboard.

flow: App --> OTel Agent --> VictoriaMetrics --> Grafana
use case: I want to implement persistence of metrics in the event of any failures.
example: in this flow, if vminsert/VictoriaMetrics goes down and comes back online after some downtime, the OTel agent should retry the failed metrics and post them to VictoriaMetrics, and the same should then be visible in Grafana.

Please advise if a better solution is available for my use case.
Please, someone, help to fix the issue.

@kumar0204 I'm looking to do the same thing: have the OTel Collector retry failed metrics in case the backend goes down. Did you find anything for this, like persistence or something else with the remote write exporter?

@frzifus
Member

frzifus commented Jun 15, 2023

@zakariais is the filestorage extension what you are looking for?

@zakariais

@zakariais is the filestorage extension what you are looking for?

@frzifus does the filestorage extension work with the prometheusremotewrite exporter?
I didn't see anything in the README saying that it does.

@kumar0204

I am working on a setup as follows and have the same issue: OpenTelemetry metrics are not forwarded to Grafana via VictoriaMetrics when the WAL configuration is enabled in the OpenTelemetry Collector configuration. However, when the WAL is disabled, metrics show up on the Grafana dashboard.
flow: App --> OTel Agent --> VictoriaMetrics --> Grafana
use case: I want to implement persistence of metrics in the event of any failures.
example: in this flow, if vminsert/VictoriaMetrics goes down and comes back online after some downtime, the OTel agent should retry the failed metrics and post them to VictoriaMetrics, and the same should then be visible in Grafana.
Please advise if a better solution is available for my use case.
Please, someone, help to fix the issue.

@kumar0204 I'm looking to do the same thing: have the OTel Collector retry failed metrics in case the backend goes down. Did you find anything for this, like persistence or something else with the remote write exporter?

I have two types of persistence in our setup. My flow is like this:
Service/Application --> OTel Agent (filestorage extension used for persistence) --> OTel Collector/Gateway (write-ahead log via prometheusremotewrite for persistence) --> VictoriaMetrics (SRE back end) --> Grafana

First use case: in the setup above, metrics are stored at the agent end using the filestorage extension; if the gateway is down, metrics are replayed from the OTel agent side.
Second use case: if VictoriaMetrics/Prometheus is down, metrics are stored in the WAL; once the SRE back end is up and running, metrics are replayed from the gateway.
My second use case has the issue with the WAL: when the WAL is enabled, metrics do not reach Grafana.
I hope that makes the issue clear.
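
For reference, the agent-side persistence in that flow generally looks like the sketch below: the filestorage extension backs the otlp exporter's sending_queue, so batches queued while the gateway is down survive restarts and are retried. This is only a sketch of the pattern (the endpoint and directory are placeholders), not a drop-in config:

extensions:
  file_storage:
    directory: /var/lib/otelcol/file_storage  # placeholder path
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
exporters:
  otlp:
    endpoint: otel-gateway:4317  # placeholder gateway address
    sending_queue:
      enabled: true
      storage: file_storage  # persist queued batches via the extension
    retry_on_failure:
      enabled: true
service:
  extensions: [file_storage]
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [otlp]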

@github-actions
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

@github-actions github-actions bot added the Stale label Aug 16, 2023
@frzifus frzifus removed the Stale label Aug 16, 2023
@github-actions
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

@github-actions github-actions bot added the Stale label Oct 16, 2023
@crobert-1 crobert-1 added the never stale label and removed the Stale label on Oct 24, 2023
@github-actions
Contributor

Pinging code owners for exporter/prometheusremotewrite: @Aneurysm9 @rapphil. See Adding Labels via Comments if you do not have permissions to add labels yourself.

@cheskayang
Contributor

I have a similar setup to @kumar0204 and am running into the exact same issue when enabling the WAL on prometheusremotewrite.

@frzifus
Member

frzifus commented Nov 16, 2023

There is actually already a fix that just needs to be polished: #20875

Do you want to work on that, @cheskayang?

@cheskayang
Contributor

cheskayang commented Dec 22, 2023

@frzifus thanks for letting me know! I saw you opened a PR after this comment, but it's stale: #29297

Do you still plan to ship the fix?

@devyanigoil

@kumar0204 I have a similar setup as well. Were you able to solve the WAL issue?

@diranged

diranged commented May 2, 2024

Ping ... we'd really like to see this get fixed as well... :/

@sh0rez
Member

sh0rez commented May 2, 2024

I've reopened and rebased #20875, which will fix this.

@a-shoemaker

prometheusremotewrite with the WAL enabled just flat out doesn't work; I've never seen it work, anyway. It looks like there has been a PR out there to fix it for over a year. Curious what the plan is here: merge that, write a different fix, remove the WAL option, or just leave it out there not working at all?

@morytina

morytina commented Nov 7, 2024

I am also having the same issue.
Is a solution currently being worked on?
As mentioned above, I would like to be able to use the otlp exporter's sending_queue (persistent queue) as the solution, or have the WAL option work properly.
