
[exporter/prometheus] Expired metrics were not deleted #17306

Closed
ralphgj opened this issue Dec 30, 2022 · 17 comments

Comments

@ralphgj
Contributor

ralphgj commented Dec 30, 2022

Component(s)

exporter/prometheus

What happened?

Description

Hi, I'm trying to use the spanmetrics processor and the prometheus exporter to transform spans into metrics. However, I found that some expired metrics seem to reappear when new metrics are received, and memory usage keeps rising. Is this a bug in the prometheus exporter?

Steps to Reproduce

Example:
When I posted a span to the collector, the prometheus exporter exported a metric like this:

calls_total{db_instance="N/A",db_name="name-KSDORKdOKV",db_sql_table="table-IWstkE",db_system="redis",operation="get",service_name="go-project.examples",span_kind="SPAN_KIND_CLIENT",status_code="STATUS_CODE_UNSET"} 1

After 5 seconds, the metric disappeared. When I then posted another span to the collector, the prometheus exporter exported two metrics, including the expired one:

calls_total{db_instance="N/A",db_name="name-KSDORKdOKV",db_sql_table="table-IWstkE",db_system="redis",operation="get",service_name="go-project.examples",span_kind="SPAN_KIND_CLIENT",status_code="STATUS_CODE_UNSET"} 1
calls_total{db_instance="N/A",db_name="name-QSvMJKDYso",db_sql_table="table-ZHdGvF",db_system="redis",operation="set",service_name="go-project.examples",span_kind="SPAN_KIND_CLIENT",status_code="STATUS_CODE_UNSET"} 1

Expected Result

Expired metrics will be deleted.

Actual Result

Expired metrics seem to still be stored in the in-memory cache.

Collector version

2ada50f

Environment information

Environment

OS: macOS 13.0.1
Compiler (if manually compiled): go 1.19.3

OpenTelemetry Collector configuration

receivers:
  # Dummy receiver that's never used, because a pipeline is required to have one.
  otlp/spanmetrics:
    protocols:
      grpc:
        endpoint: "localhost:12345"

  otlp:
    protocols:
      grpc:
        endpoint: "localhost:55677"

processors:
  batch:
  spanmetrics:
    metrics_exporter: otlp/spanmetrics
    latency_histogram_buckets: [10ms, 100ms]
    dimensions:
      - name: db.system
        default: N/A
      - name: db.name
        default: N/A
      - name: db.sql.table
        default: N/A
      - name: db.instance
        default: N/A
    dimensions_cache_size: 1000
    aggregation_temporality: "AGGREGATION_TEMPORALITY_CUMULATIVE"

exporters:
  logging:
    verbosity: basic

  otlp/spanmetrics:
    endpoint: "localhost:55677"
    tls:
      insecure: true

  prometheus:
    endpoint: "0.0.0.0:8889"
    metric_expiration: 5s

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [spanmetrics, batch]
      exporters: [logging]

    # The exporter name must match the metrics_exporter name.
    # The receiver is just a dummy and never used; added to pass validation requiring at least one receiver in a pipeline.
    metrics/spanmetrics:
      receivers: [otlp/spanmetrics]
      exporters: [otlp/spanmetrics]

    metrics:
      receivers: [otlp]
      exporters: [prometheus]

Log output

No response

Additional context

No response

@ralphgj ralphgj added bug Something isn't working needs triage New item requiring triage labels Dec 30, 2022
@ralphgj ralphgj changed the title Expired metrics were not deleted [exporter/prometheus] Expired metrics were not deleted Dec 30, 2022
@Aneurysm9
Member

Can you put a logging exporter on both of the metrics pipelines? I'd like to localize the issue to either the spanmetrics processor or the prometheus exporter. My first impression is that this is likely a result of the spanmetrics processor keeping all generated metrics in memory and emitting their last-known state.

@albertteoh
Contributor

I think @Aneurysm9 is right. Should the spanmetrics processor also introduce some sort of expiry on its in-memory collection of metrics, to stay consistent with the prometheus exporter?
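
To make the idea concrete, here is a minimal, hypothetical sketch of such an expiry mechanism in Go; the types, field names, and the 5-second window are illustrative only and are not the processor's real API:

// A toy model of time-based expiry for a span-metrics store: each metric key
// keeps a lastSeen timestamp, and keys that go quiet are dropped so they stop
// being emitted downstream.
package main

import (
	"fmt"
	"time"
)

type histogramEntry struct {
	count    uint64
	sum      float64
	lastSeen time.Time // refreshed whenever a span contributes to this key
}

type store struct {
	expiration time.Duration
	histograms map[string]*histogramEntry
}

// observe records a span latency for a metric key and refreshes lastSeen.
func (s *store) observe(key string, latencyMs float64) {
	e, ok := s.histograms[key]
	if !ok {
		e = &histogramEntry{}
		s.histograms[key] = e
	}
	e.count++
	e.sum += latencyMs
	e.lastSeen = time.Now()
}

// expire drops keys that have not been updated within the expiration window,
// so the downstream prometheus exporter's metric_expiration can take effect.
func (s *store) expire(now time.Time) {
	for key, e := range s.histograms {
		if now.Sub(e.lastSeen) > s.expiration {
			delete(s.histograms, key)
		}
	}
}

func main() {
	s := &store{expiration: 5 * time.Second, histograms: map[string]*histogramEntry{}}
	s.observe(`calls{operation="get"}`, 12.5)
	s.expire(time.Now().Add(10 * time.Second)) // simulate 10s of inactivity
	fmt.Println("remaining series:", len(s.histograms)) // 0
}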

@ralphgj
Contributor Author

ralphgj commented Jan 4, 2023

@Aneurysm9 sorry for the late reply. I updated the config with logging exporters and the debug log level:

receivers:
  # Dummy receiver that's never used, because a pipeline is required to have one.
  otlp/spanmetrics:
    protocols:
      grpc:
        endpoint: "localhost:12345"

  otlp:
    protocols:
      grpc:
        endpoint: "localhost:55677"

processors:
  batch:
  spanmetrics:
    metrics_exporter: otlp/spanmetrics
    latency_histogram_buckets: [10ms, 100ms]
    dimensions:
      - name: db.system
        default: N/A
      - name: db.name
        default: N/A
      - name: db.sql.table
        default: N/A
      - name: db.instance
        default: N/A
    dimensions_cache_size: 1000
    aggregation_temporality: "AGGREGATION_TEMPORALITY_CUMULATIVE"

exporters:
  logging:
    verbosity: basic
  logging/1:
    verbosity: normal
  logging/2:
    verbosity: normal

  otlp/spanmetrics:
    endpoint: "localhost:55677"
    tls:
      insecure: true

  prometheus:
    endpoint: "0.0.0.0:8889"
    metric_expiration: 5s

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [spanmetrics, batch]
      exporters: [logging]

    # The exporter name must match the metrics_exporter name.
    # The receiver is just a dummy and never used; added to pass validation requiring at least one receiver in a pipeline.
    metrics/spanmetrics:
      receivers: [otlp/spanmetrics]
      exporters: [otlp/spanmetrics, logging/1]

    metrics:
      receivers: [otlp]
      exporters: [prometheus, logging/2]
  telemetry:
    logs:
      level: debug

When I repeated the steps above, I got logs like this:

2023-01-04T17:15:03.247+0800    debug   prometheusexporter/collector.go:343     collect called  {"kind": "exporter", "data_type": "metrics", "name": "prometheus"}
2023-01-04T17:15:03.247+0800    debug   prometheusexporter/accumulator.go:279   Accumulator collect called      {"kind": "exporter", "data_type": "metrics", "name": "prometheus"}
2023-01-04T17:15:03.747+0800    debug   prometheusexporter/accumulator.go:90    accumulating metric: calls_total        {"kind": "exporter", "data_type": "metrics", "name": "prometheus"}
2023-01-04T17:15:03.747+0800    debug   prometheusexporter/accumulator.go:90    accumulating metric: latency    {"kind": "exporter", "data_type": "metrics", "name": "prometheus"}
2023-01-04T17:15:03.747+0800    info    MetricsExporter {"kind": "exporter", "data_type": "metrics", "name": "logging/2", "#metrics": 2}
2023-01-04T17:15:03.755+0800    info    TracesExporter  {"kind": "exporter", "data_type": "traces", "name": "logging", "#spans": 1}
2023-01-04T17:15:05.784+0800    debug   prometheusexporter/collector.go:343     collect called  {"kind": "exporter", "data_type": "metrics", "name": "prometheus"}
2023-01-04T17:15:05.784+0800    debug   prometheusexporter/accumulator.go:279   Accumulator collect called      {"kind": "exporter", "data_type": "metrics", "name": "prometheus"}
2023-01-04T17:15:05.784+0800    debug   prometheusexporter/collector.go:367     metric served: Desc{fqName: "calls_total", help: "", constLabels: {}, variableLabels: [service_name operation span_kind status_code db_system db_name db_sql_table db_instance]} {"kind": "exporter", "data_type": "metrics", "name": "prometheus"}
2023-01-04T17:15:05.784+0800    debug   prometheusexporter/collector.go:367     metric served: Desc{fqName: "latency", help: "", constLabels: {}, variableLabels: [service_name operation span_kind status_code db_system db_name db_sql_table db_instance]}     {"kind": "exporter", "data_type": "metrics", "name": "prometheus"}



2023-01-04T17:15:11.680+0800    debug   prometheusexporter/collector.go:343     collect called  {"kind": "exporter", "data_type": "metrics", "name": "prometheus"}
2023-01-04T17:15:11.680+0800    debug   prometheusexporter/accumulator.go:279   Accumulator collect called      {"kind": "exporter", "data_type": "metrics", "name": "prometheus"}
2023-01-04T17:15:11.680+0800    debug   prometheusexporter/accumulator.go:288   metric expired: latency {"kind": "exporter", "data_type": "metrics", "name": "prometheus"}
2023-01-04T17:15:11.680+0800    debug   prometheusexporter/accumulator.go:288   metric expired: calls_total     {"kind": "exporter", "data_type": "metrics", "name": "prometheus"}



2023-01-04T17:15:18.675+0800    debug   prometheusexporter/collector.go:343     collect called  {"kind": "exporter", "data_type": "metrics", "name": "prometheus"}
2023-01-04T17:15:18.675+0800    debug   prometheusexporter/accumulator.go:279   Accumulator collect called      {"kind": "exporter", "data_type": "metrics", "name": "prometheus"}
2023-01-04T17:15:18.746+0800    debug   prometheusexporter/accumulator.go:90    accumulating metric: calls_total        {"kind": "exporter", "data_type": "metrics", "name": "prometheus"}
2023-01-04T17:15:18.746+0800    debug   prometheusexporter/accumulator.go:90    accumulating metric: latency    {"kind": "exporter", "data_type": "metrics", "name": "prometheus"}
2023-01-04T17:15:18.746+0800    info    MetricsExporter {"kind": "exporter", "data_type": "metrics", "name": "logging/2", "#metrics": 2}
2023-01-04T17:15:18.804+0800    info    TracesExporter  {"kind": "exporter", "data_type": "traces", "name": "logging", "#spans": 1}
2023-01-04T17:15:23.098+0800    debug   prometheusexporter/collector.go:343     collect called  {"kind": "exporter", "data_type": "metrics", "name": "prometheus"}
2023-01-04T17:15:23.098+0800    debug   prometheusexporter/accumulator.go:279   Accumulator collect called      {"kind": "exporter", "data_type": "metrics", "name": "prometheus"}
2023-01-04T17:15:23.098+0800    debug   prometheusexporter/collector.go:367     metric served: Desc{fqName: "latency", help: "", constLabels: {}, variableLabels: [service_name operation span_kind status_code db_system db_name db_sql_table db_instance]}     {"kind": "exporter", "data_type": "metrics", "name": "prometheus"}
2023-01-04T17:15:23.098+0800    debug   prometheusexporter/collector.go:367     metric served: Desc{fqName: "latency", help: "", constLabels: {}, variableLabels: [service_name operation span_kind status_code db_system db_name db_sql_table db_instance]}     {"kind": "exporter", "data_type": "metrics", "name": "prometheus"}
2023-01-04T17:15:23.098+0800    debug   prometheusexporter/collector.go:367     metric served: Desc{fqName: "calls_total", help: "", constLabels: {}, variableLabels: [service_name operation span_kind status_code db_system db_name db_sql_table db_instance]} {"kind": "exporter", "data_type": "metrics", "name": "prometheus"}
2023-01-04T17:15:23.098+0800    debug   prometheusexporter/collector.go:367     metric served: Desc{fqName: "calls_total", help: "", constLabels: {}, variableLabels: [service_name operation span_kind status_code db_system db_name db_sql_table db_instance]} {"kind": "exporter", "data_type": "metrics", "name": "prometheus"}

@github-actions
Contributor

github-actions bot commented Mar 6, 2023

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

@github-actions github-actions bot added the Stale label Mar 6, 2023
@danielgblanco

danielgblanco commented Mar 8, 2023

We're also seeing this in some collectors where we run spanmetrics. We had to set the prometheus exporter's expiration to a large value, otherwise any cumulativetodelta calculations are affected, but we try to limit the cardinality of these metrics very drastically (around 100k time series per day). It does sound like a spanmetrics issue though: p.histograms is only reset for delta metrics here, and it seems that's what's being used to produce metrics on every export interval.
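
As a rough illustration of that reset-only-for-delta pattern (simplified stand-in code, not the actual processor source), the effect under cumulative temporality is that every series seen so far is re-exported on every interval:

package main

import "fmt"

type temporality int

const (
	cumulative temporality = iota
	delta
)

type processor struct {
	temporality temporality
	histograms  map[string]float64 // metric key -> aggregated state
}

// exportMetrics builds the outgoing batch from everything currently in
// memory, then resets that state only in delta mode.
func (p *processor) exportMetrics() []string {
	keys := make([]string, 0, len(p.histograms))
	for k := range p.histograms {
		keys = append(keys, k)
	}
	if p.temporality == delta {
		p.histograms = map[string]float64{} // cumulative mode never clears
	}
	return keys
}

func main() {
	p := &processor{temporality: cumulative, histograms: map[string]float64{`calls{operation="get"}`: 1}}
	fmt.Println(len(p.exportMetrics())) // 1
	fmt.Println(len(p.exportMetrics())) // still 1: nothing is ever dropped
}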

@github-actions
Contributor

github-actions bot commented Mar 9, 2023

Pinging code owners for processor/spanmetrics: @albertteoh. See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions
Contributor

github-actions bot commented Mar 9, 2023

Pinging code owners for connector/spanmetrics: @albertteoh. See Adding Labels via Comments if you do not have permissions to add labels yourself.

@atoulme atoulme removed the needs triage New item requiring triage label Mar 10, 2023
@github-actions
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label May 10, 2023
@github-actions
Contributor

github-actions bot commented Jul 9, 2023

This issue has been closed as inactive because it has been stale for 120 days with no activity.

@github-actions github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Jul 9, 2023
@mfilipe

mfilipe commented Sep 18, 2023

@albertteoh @kovrus we are seeing this problem with the spanmetrics connector and the prometheus exporter. The collectors are using spanmetrics.aggregation_temporality: "AGGREGATION_TEMPORALITY_CUMULATIVE" and prometheus.metric_expiration: 60s, but the metrics remain available in /metrics for days.

The panel below shows how /metrics grows over time and stays flat when nobody is using the system:

rate(otelcol_exporter_sent_metric_points[1m])

[Screenshot: panel of the query above, taken 2023-09-18 17:51]

@albertteoh
Contributor

Hi @mfilipe, the spanmetrics connector will continue to emit metrics at a configurable time interval (every 15s by default) even when no new spans are sent to it, so I believe this is what prevents prometheus.metric_expiration from kicking in.

There are some knobs available to influence the number of metrics the spanmetrics connector emits, namely the dimensions_cache_size and buckets config parameters.

The former ensures no more than this many metrics will be stored in memory and emitted downstream.

The latter controls the resolution of histograms, so a smaller number of buckets should reduce the number of series exported.
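
For reference, a hypothetical config excerpt showing where those two knobs live on the spanmetrics connector (the values are examples, not recommendations; check the connector README for the exact schema):

connectors:
  spanmetrics:
    # Caps how many metric keys are kept in memory and emitted downstream.
    dimensions_cache_size: 500
    histogram:
      explicit:
        # Fewer bucket boundaries means fewer exported histogram series.
        buckets: [10ms, 100ms, 1s]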

@nijave
Contributor

nijave commented Sep 22, 2023

Hi @mfilipe, the spanmetrics connector will continue to emit metrics at a configurable time interval (every 15s by default) even when no new spans are sent to it, so I believe this is what prevents prometheus.metric_expiration from kicking in.

There are some knobs available to influence the number of metrics the spanmetrics connector emits, namely the dimensions_cache_size and buckets config parameters.

The former ensures no more than this many metrics will be stored in memory and emitted downstream.

The latter controls the resolution of histograms, so a smaller number of buckets should reduce the number of series exported.

dimensions_cache_size doesn't actually control how many metric series are kept. See grafana/agent#5271.

tl;dr: processorImp.histograms isn't pruned in sync with metricKeyToDimensions.
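
A rough sketch of what pruning the two structures in sync could look like; the LRU below is a toy stand-in for the real key cache, and all names are illustrative:

package main

import (
	"container/list"
	"fmt"
)

// boundedKeys is a tiny LRU over metric keys that reports evictions so the
// caller can prune side tables (e.g. a histograms map) at the same time.
type boundedKeys struct {
	max   int
	order *list.List               // front = most recently used
	items map[string]*list.Element // key -> node in order
}

func newBoundedKeys(max int) *boundedKeys {
	return &boundedKeys{max: max, order: list.New(), items: map[string]*list.Element{}}
}

// touch marks key as used and returns the key evicted to stay within max, if any.
func (b *boundedKeys) touch(key string) (evicted string, ok bool) {
	if el, exists := b.items[key]; exists {
		b.order.MoveToFront(el)
		return "", false
	}
	b.items[key] = b.order.PushFront(key)
	if b.order.Len() <= b.max {
		return "", false
	}
	oldest := b.order.Back()
	b.order.Remove(oldest)
	k := oldest.Value.(string)
	delete(b.items, k)
	return k, true
}

func main() {
	histograms := map[string]int{} // stand-in for per-key aggregation state
	cache := newBoundedKeys(2)
	for _, key := range []string{"a", "b", "c"} {
		if old, evicted := cache.touch(key); evicted {
			delete(histograms, old) // prune aggregation state together with the key cache
		}
		histograms[key]++
	}
	fmt.Println(len(histograms)) // 2: "a" was evicted and pruned, not left behind
}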

@nijave
Contributor

nijave commented Sep 22, 2023

It appears the updated timestamp on the metric is set when it's "accumulated" (called by ConsumeMetrics here: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/spanmetricsprocessor/processor.go#L266C39-L266C39), so metric_expiration will only apply to series the spanmetricsprocessor stops sending to the metricsConsumer.

Put another way, the spanmetricsprocessor needs to expire metrics internally; otherwise the downstream prometheusexporter thinks they keep getting updated regardless of whether the value changed. The call chain:

spanmetricsprocessor.exportMetrics -> ConsumeMetrics
prometheusexporter.ConsumeMetrics -> processMetrics (for each metric)
prometheusexporter.processMetrics -> Accumulate
prometheusexporter.Accumulate (gets now as timestamp) -> addMetric
prometheusexporter.addMetric -> accumulate<T> (timestamp now gets set to updated field)
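
A simplified model of that behavior (the types and names below are illustrative, not the exporter's real ones): because accumulation stamps every incoming data point with the current time, a series that keeps being re-emitted upstream never goes a full expiration window without an update, even if its value never changes.

package main

import (
	"fmt"
	"time"
)

type accumulatedValue struct {
	value   float64
	updated time.Time // refreshed on every accumulate, even if value is unchanged
}

type accumulator struct {
	expiration time.Duration
	registered map[string]*accumulatedValue
}

// accumulate is called for every data point the exporter consumes.
func (a *accumulator) accumulate(key string, v float64, now time.Time) {
	a.registered[key] = &accumulatedValue{value: v, updated: now}
}

// collect returns only the series whose last update is within the window.
func (a *accumulator) collect(now time.Time) []string {
	var served []string
	for key, av := range a.registered {
		if now.Sub(av.updated) > a.expiration {
			delete(a.registered, key) // "metric expired"
			continue
		}
		served = append(served, key)
	}
	return served
}

func main() {
	// With a 60s expiration and upstream re-emitting the same cumulative
	// value every 15s, each accumulate refreshes "updated" and the series
	// never expires.
	a := &accumulator{expiration: 60 * time.Second, registered: map[string]*accumulatedValue{}}
	start := time.Now()
	for i := 0; i < 4; i++ {
		a.accumulate("calls_total", 1, start.Add(time.Duration(i)*15*time.Second))
	}
	fmt.Println(a.collect(start.Add(46 * time.Second))) // still served: [calls_total]
}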

@matej-g
Contributor

matej-g commented Dec 20, 2023

I believe this issue should be reopened; it seems it was closed due to inactivity, but it still persists. I also tried to reach out on Slack some time ago, but unfortunately with no luck (https://cloud-native.slack.com/archives/CJFCJHG4Q/p1700844297561809).

I'm facing the same scenario as @mfilipe, but none of the suggested solutions really satisfies my needs, as I simply want to stop exporting metrics for spans from services that no longer produce spans. I don't think configuring the cache size or buckets will help me achieve that. I'd like a mechanism where such metrics are "expired" once no new spans are being produced.

From my understanding, the approach proposed by @nijave would address this, but it seems it has not been implemented yet. Are there any objections or known blockers?

@nijave
Contributor

nijave commented Dec 20, 2023

I believe this issue should be reopened; it seems it was closed due to inactivity, but it still persists. I also tried to reach out on Slack some time ago, but unfortunately with no luck (https://cloud-native.slack.com/archives/CJFCJHG4Q/p1700844297561809).

I'm facing the same scenario as @mfilipe, but none of the suggested solutions really satisfies my needs, as I simply want to stop exporting metrics for spans from services that no longer produce spans. I don't think configuring the cache size or buckets will help me achieve that. I'd like a mechanism where such metrics are "expired" once no new spans are being produced.

From my understanding, the approach proposed by @nijave would address this, but it seems it has not been implemented yet. Are there any objections or known blockers?

You would need a feature request, assuming it's related to what I thought it was. The spanmetricsprocessor doesn't take time into account and keeps publishing metrics, subject only to dimensions_cache_size (so it always keeps the most recent entries up to that size).

However, the spanmetricsprocessor should actually respect dimensions_cache_size now (#27083). Also worth noting that the spanmetricsprocessor is deprecated and replaced by the spanmetricsconnector.

@nijave
Contributor

nijave commented Dec 20, 2023

I believe this issue should be reopened; it seems it was closed due to inactivity, but it still persists. I also tried to reach out on Slack some time ago, but unfortunately with no luck (https://cloud-native.slack.com/archives/CJFCJHG4Q/p1700844297561809).

I'm facing the same scenario as @mfilipe, but none of the suggested solutions really satisfies my needs, as I simply want to stop exporting metrics for spans from services that no longer produce spans. I don't think configuring the cache size or buckets will help me achieve that. I'd like a mechanism where such metrics are "expired" once no new spans are being produced.

From my understanding, the approach proposed by @nijave would address this, but it seems it has not been implemented yet. Are there any objections or known blockers?

Actually, I might have lied: setting aggregation_temporality to AGGREGATION_TEMPORALITY_DELTA probably does what you want, but keep in mind that it changes the metrics to gauges instead of counters (actually not sure if this is true). See https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/spanmetricsprocessor. In that mode, the series get purged every cycle: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/spanmetricsprocessor/processor.go#L256
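
For illustration, a minimal excerpt of the setting being discussed, using the same processor config shape shown earlier in this thread (with the caveats above about how delta sums are handled downstream):

processors:
  spanmetrics:
    metrics_exporter: otlp/spanmetrics
    # Delta temporality lets the processor flush its in-memory series each
    # export cycle instead of re-emitting them indefinitely.
    aggregation_temporality: "AGGREGATION_TEMPORALITY_DELTA"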

@rotscher
Contributor

rotscher commented Jan 5, 2024

We're also facing the case where metrics are not deleted, e.g. when a service is stopped and no longer produces spans. In that case the metric stops counting up, but it is still exported.
As a consequence, the target_info metric of the prometheusexporter also never stops being exported.

Everything returns (of course) to a clean state when the OpenTelemetry Collector instance is restarted.

To me this issue seems related to #21101.

Configuring aggregation_temporality to AGGREGATION_TEMPORALITY_DELTA does not resolve the issue, as the span metrics (which are of type Sum and monotonic) get dropped by the prometheus exporter, if I understand open-telemetry/opentelemetry-java#5062 correctly.

I used version 0.90.1 of the OpenTelemetry Collector.
