
Sampling race condition causes initial metric to be reported with a 0 value #31807

Closed
sergiuiacob1 opened this issue Mar 18, 2024 · 6 comments
Labels
bug (Something isn't working), closed as inactive, data:metrics (Metric related issues), exporter/datadog (Datadog components), priority:p2 (Medium), Stale, waiting for author

Comments

sergiuiacob1 commented Mar 18, 2024

Component(s)

exporter/datadog

What happened?

Description

I've set up a counter metric to track "events". The counter is pushed to a Prometheus Push Gateway and scraped by an OpenTelemetry Collector agent, which exports the data to Datadog. I've configured the exporter to report only deltas for counters:

datadog:
  metrics:
    sums:
      cumulative_monotonic_mode: to_delta
    histogram:
      counters: true
  api:
    site: datadoghq.com
    key: ${env:DATADOG_API_KEY}

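For context, here is a minimal sketch of the full Collector configuration this snippet sits in. The prometheus receiver scrape job, target address, and pipeline wiring below are illustrative placeholders rather than my exact setup:

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: pushgateway              # placeholder scrape job name
          scrape_interval: 60s               # the samples below are one minute apart
          static_configs:
            - targets: ["pushgateway:9091"]  # placeholder Push Gateway address

exporters:
  datadog:
    metrics:
      sums:
        cumulative_monotonic_mode: to_delta
      histogram:
        counters: true
    api:
      site: datadoghq.com
      key: ${env:DATADOG_API_KEY}

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [datadog]
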
I can see OTel scraping the first two values for my metric with the new set of label values:

OTEL reports first datapoint:
NumberDataPoints #2
Data point attributes:
     -> app: Str(shoreline)
     -> customer_id: Str(cust0)
     -> exported_instance: Str(cust0-proxy-backend-0)
     -> exported_job: Str(ops_backend)
     -> exported_namespace: Str(cust0-backend)
     -> name: Str(a2)
     -> namespace: Str(cust0-backend)
     -> operation: Str(create)
     -> service: Str(cust0-proxy-backend-0)
     -> status: Str(succeeded)
     -> type: Str(action)
StartTimestamp: 2024-03-15 16:35:05.515 +0000 UTC
Timestamp: 2024-03-15 16:35:05.515 +0000 UTC
Value: 1.000000

OTEL reports second datapoint:
NumberDataPoints #2
Data point attributes:
     -> app: Str(shoreline)
     -> customer_id: Str(cust0)
     -> exported_instance: Str(cust0-proxy-backend-0)
     -> exported_job: Str(ops_backend)
     -> exported_namespace: Str(cust0-backend)
     -> name: Str(a2)
     -> namespace: Str(cust0-backend)
     -> operation: Str(create)
     -> service: Str(cust0-proxy-backend-0)
     -> status: Str(succeeded)
     -> type: Str(action)
StartTimestamp: 2024-03-15 16:35:05.515 +0000 UTC
Timestamp: 2024-03-15 16:36:05.515 +0000 UTC
Value: 1.000000

Datadog receives the new time series at 18:36:05. Because the difference between the two cumulative values is 1 - 1 = 0, the initial metric value (the initial delta) is 0 as well.
The previous Datadog metric timestamp was at 18:35:05.

[Screenshot 2024-03-15 at 18:43:36]

Steps to Reproduce

  1. Create a counter
  2. Set up Datadog exporter
  3. Emit datapoints through OTel with new label values
  4. Eventually, there will be a "miss" and the delta for a new label set will be 0

Expected Result

For the first sample of a new set of label values, OTel should report the delta as that sample's value (in my case, 1).

Specifically, in the Datadog screenshot above, I should have seen the initial delta of 1 for my metric.

Actual Result

Sometimes the initial counter value is the true initial delta (the first value counted), sometimes it's 0.

Collector version

0.96.0

Environment information

Environment

OS: Ubuntu 20.04
OTel version: 0.96.0. Also reproduced on 0.71.0

OpenTelemetry Collector configuration

datadog:
  metrics:
    sums:
      cumulative_monotonic_mode: to_delta
    histogram:
      counters: true
  api:
    site: datadoghq.com
    key: ${env:DATADOG_API_KEY}


Log output

No response

Additional context

No response
sergiuiacob1 added the bug and needs triage labels on Mar 18, 2024
github-actions bot added the exporter/datadog label on Mar 18, 2024
github-actions bot commented:

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

sergiuiacob1 (Author) commented:

I easily recreated this, and I'm not so sure it's actually a race condition. I took another look at the timestamps:

First sample at 2024-03-19 14:12:07.844 +0000 UTC

NumberDataPoints #1
Data point attributes:
     -> app: Str(shoreline)
     -> customer_id: Str(cust0)
     -> exported_instance: Str(cust0-proxy-backend-0)
     -> exported_job: Str(ops_backend)
     -> exported_namespace: Str(cust0-backend)
     -> name: Str(sergiu_test_2)
     -> namespace: Str(cust0-backend)
     -> operation: Str(create)
     -> service: Str(cust0-proxy-backend-0)
     -> status: Str(succeeded)
     -> type: Str(action)
StartTimestamp: 2024-03-19 14:12:07.844 +0000 UTC
Timestamp: 2024-03-19 14:12:07.844 +0000 UTC
Value: 1.000000

Second sample at 2024-03-19 14:13:07.844 +0000 UTC:

NumberDataPoints #1
Data point attributes:
     -> app: Str(shoreline)
     -> customer_id: Str(cust0)
     -> exported_instance: Str(cust0-proxy-backend-0)
     -> exported_job: Str(ops_backend)
     -> exported_namespace: Str(cust0-backend)
     -> name: Str(sergiu_test_2)
     -> namespace: Str(cust0-backend)
     -> operation: Str(create)
     -> service: Str(cust0-proxy-backend-0)
     -> status: Str(succeeded)
     -> type: Str(action)
StartTimestamp: 2024-03-19 14:12:07.844 +0000 UTC
Timestamp: 2024-03-19 14:13:07.844 +0000 UTC
Value: 1.000000

Datadog receives the first datapoint at 14:13:05 UTC:
[Screenshot: "CRUD events" chart in Datadog]

In the case above, the first datapoint at 14:13:05 UTC should have had the value 1.

sergiuiacob1 (Author) commented:

And here is an example where things worked as expected:

First sample reported at 2024-03-19 14:11:07.844 +0000 UTC

NumberDataPoints #0
Data point attributes:
     -> app: Str(shoreline)
     -> customer_id: Str(cust0)
     -> exported_instance: Str(cust0-proxy-backend-0)
     -> exported_job: Str(ops_backend)
     -> exported_namespace: Str(cust0-backend)
     -> name: Str(sergiu_test)
     -> namespace: Str(cust0-backend)
     -> operation: Str(create)
     -> service: Str(cust0-proxy-backend-0)
     -> status: Str(succeeded)
     -> type: Str(action)
StartTimestamp: 2024-03-19 14:11:07.844 +0000 UTC
Timestamp: 2024-03-19 14:11:07.844 +0000 UTC
Value: 1.000000

Second sample reported at 2024-03-19 14:12:07.844 +0000 UTC:

NumberDataPoints #0
Data point attributes:
     -> app: Str(shoreline)
     -> customer_id: Str(cust0)
     -> exported_instance: Str(cust0-proxy-backend-0)
     -> exported_job: Str(ops_backend)
     -> exported_namespace: Str(cust0-backend)
     -> name: Str(sergiu_test)
     -> namespace: Str(cust0-backend)
     -> operation: Str(create)
     -> service: Str(cust0-proxy-backend-0)
     -> status: Str(succeeded)
     -> type: Str(action)
StartTimestamp: 2024-03-19 14:11:07.844 +0000 UTC
Timestamp: 2024-03-19 14:12:07.844 +0000 UTC
Value: 1.000000

Datadog received the first metric datapoint correctly, with a value of 1:
[Screenshot 2024-03-19 at 16:21:20]

mx-psi (Member) commented Mar 26, 2024

Hi @sergiuiacob1, can you try using the cumulative to delta processor instead of the metrics::sums::cumulative_monotonic_mode option?

This is what we currently recommend for these setups, and I want to verify if the issue persists when using this component instead of the fallback logic in the exporter.
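
A minimal sketch of what that setup might look like; the empty cumulativetodelta block applies the processor's defaults to all metrics, and the receiver and pipeline wiring below are illustrative rather than taken from this issue:

processors:
  cumulativetodelta:           # convert cumulative monotonic sums to deltas before export

exporters:
  datadog:
    api:
      site: datadoghq.com
      key: ${env:DATADOG_API_KEY}
    # metrics::sums::cumulative_monotonic_mode is left at its default here,
    # since the processor now handles the cumulative-to-delta conversion

service:
  pipelines:
    metrics:
      receivers: [prometheus]   # placeholder; use the receiver from your setup
      processors: [cumulativetodelta]
      exporters: [datadog]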

mx-psi added the waiting for author and priority:p2 labels and removed the needs triage label on Mar 26, 2024
github-actions bot commented:

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions bot added the Stale label on May 27, 2024
github-actions bot commented:

This issue has been closed as inactive because it has been stale for 120 days with no activity.

github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Jul 26, 2024