
[prometheusremotewrite] exporter complaining about temporality in OTEL pipeline #30094

Closed
ashishthakur55525 opened this issue Dec 19, 2023 · 19 comments


@ashishthakur55525

Describe the bug
We are using the OpenTelemetry SDK to send metrics to an OpenTelemetry Collector, which has two exporters: an otlp exporter that sends metrics to Honeycomb, and a prometheusremotewrite exporter that writes data to a local Prometheus running on the same EKS cluster. The problem is that we keep getting temporality errors like the one below. We worked with the dev team to set the temporality to cumulative, since that is the only temporality Prometheus accepts, and we validated that the change took effect, but we still get the error. After setting the temporality to cumulative we did briefly get those metrics in Prometheus, but in a very broken state, and then they stopped again.

2023-12-19T14:14:35.655Z error exporterhelper/queued_retry.go:401 Exporting failed. The error is not retryable. Dropping data. {"kind": "exporter", "data_type": "metrics", "name": "prometheusremotewrite", "error": "Permanent error: invalid temporality and type combination for metric "app.counter.apiStatusCode"; invalid temporality and type combination for metric "app.counter.apis"", "dropped_items": 2}
go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send
go.opentelemetry.io/collector/exporter@v0.73.0/exporterhelper/queued_retry.go:401
go.opentelemetry.io/collector/exporter/exporterhelper.(*metricsSenderWithObservability).send
go.opentelemetry.io/collector/exporter@v0.73.0/exporterhelper/metrics.go:136
go.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).start.func1
go.opentelemetry.io/collector/exporter@v0.73.0/exporterhelper/queued_retry.go:205
go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*boundedMemoryQueue).StartConsumers.func1
go.opentelemetry.io/collector/exporter@v0.73.0/exporterhelper/internal/bounded_memory_queue.go:60

Steps to reproduce
Not really sure.

What did you expect to see?
We should not see these errors and get metric in prometheus.

What did you see instead?
We got drop in metrics, got only broken metric and error is still there.

What version did you use?
Version: v0.73.0 (OpenTelemetry Collector)

What config did you use?
Config:
prometheusremotewrite:
  endpoint: 9090/api/v1/write
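
For reference, a fuller version of this setup would look roughly like the sketch below. The Honeycomb endpoint/header, the Prometheus host, and the pipeline wiring are hypothetical placeholders, not the reporter's actual config (the endpoint above appears to be missing its scheme and host):

exporters:
  otlp/honeycomb:
    endpoint: api.honeycomb.io:443               # hypothetical Honeycomb OTLP endpoint
    headers:
      x-honeycomb-team: ${env:HONEYCOMB_API_KEY} # hypothetical API key reference
  prometheusremotewrite:
    # hypothetical full URL; Prometheus must run with --web.enable-remote-write-receiver
    endpoint: http://prometheus-server.monitoring.svc.cluster.local:9090/api/v1/write

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [otlp/honeycomb, prometheusremotewrite]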

Environment
OS: Amazon Linux, EKS cluster
Compiler (if manually compiled): (e.g., "go 14.2")


@ashishthakur55525 ashishthakur55525 added the bug Something isn't working label Dec 19, 2023
@mx-psi mx-psi transferred this issue from open-telemetry/opentelemetry-collector Dec 19, 2023
@bryan-aguilar
Contributor

bryan-aguilar commented Dec 19, 2023

Can you use the debug exporter with detailed verbosity to get more information on app.counter.apiStatusCode? Also, v0.73.0 is a bit dated, could you replicate with a newer version of the collector?
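
For anyone following along, a minimal sketch of that debug setup, assuming a collector version new enough to include the debug exporter (the pipeline and receiver names are placeholders):

exporters:
  debug:
    verbosity: detailed

service:
  pipelines:
    metrics/debug:
      receivers: [otlp]   # placeholder; use the existing metrics receiver
      exporters: [debug]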

github-actions (bot)

Pinging code owners for exporter/prometheusremotewrite: @Aneurysm9 @rapphil. See Adding Labels via Comments if you do not have permissions to add labels yourself.

@crobert-1
Member

crobert-1 commented Dec 19, 2023

Note: This is potentially a duplicate of #15281

@ashishthakur55525
Author

Note: This is potentially a duplicate of #15281

But what was the resolution? I don't see a resolution there that fixed it for folks.

@ashishthakur55525
Author

Can you use the debug exporter with detailed verbosity to get more information on app.counter.apiStatusCode? Also, v0.73.0 is a bit dated, could you replicate with a newer version of the collector?

Not sure if I defined the debug exporter correctly here, but you can check below. I also upgraded to version 0.89.0; after doing that I don't see those temporality errors again, they just vanished. Then I enabled the debug exporter as you suggested, and it says the temporality is still DELTA, even though it is set to cumulative in code. Can you please guide further?

exporters:
  logging: {}
  debug:
    verbosity: detailed

and under service this is what I added:

metrics/debug:
  exporters:
    - debug
  receivers:
    - otlp/pcs-cas   # these are my otlp receivers where I will get data (metrics & trace)

After adding this I could see something in the logs (below). I don't know why it still says temporality DELTA, even though we confirmed in the service's logs that it is set to cumulative.

StartTimestamp: 2023-12-20 11:58:05.087889 +0000 UTC
Timestamp: 2023-12-20 11:58:35.087889 +0000 UTC
Value: 1
{"kind": "exporter", "data_type": "metrics", "name": "debug"}
2023-12-20T11:58:35.262Z debug memorylimiterprocessor@v0.89.0/memorylimiter.go:273 Currently used memory. {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics/pcs-cas", "cur_mem_mib": 235}
2023-12-20T11:58:35.553Z info MetricsExporter {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 3, "data points": 5}
2023-12-20T11:58:35.553Z info ResourceMetrics #0
Resource SchemaURL:
Resource attributes:
-> service.namespace: Str()
-> SERVICE_NAME: Str()
-> service.name: Str()
-> service.version: Str(PCS-23.12.1-DR-Test-4289242)
-> stack: Str(app)
-> telemetry.sdk.language: Str(java)
-> telemetry.sdk.name: Str(opentelemetry)
-> telemetry.sdk.version: Str(1.15.0)
ScopeMetrics #0
ScopeMetrics SchemaURL:
InstrumentationScope OtelBeanConfig$$EnhancerBySpringCGLIB$$7045e4df
Metric #0
Descriptor:
-> Name: app.counter.apiStatusCode
-> Description: StatusCode count for API
-> Unit:
-> DataType: Sum
-> IsMonotonic: true
-> AggregationTemporality: Delta
NumberDataPoints #0
Data point attributes:
-> HTTP_STATUS_CODE: Str(200)
-> api-identifier: Str(Health Check)
-> callingService: Str(default)
StartTimestamp: 2023-12-20 11:58:05.552448 +0000 UTC
Timestamp: 2023-12-20 11:58:35.552452 +0000 UTC
Value: 1
Exemplars:
Exemplar #0
-> Trace ID: b65f0a37589d73acb39606ea017ce96b
-> Span ID: 7d43a712802b721c
-> Timestamp: 2023-12-20 11:58:31.482782 +0000 UTC
-> Value: 1
NumberDataPoints #1
Data point attributes:
-> HTTP_STATUS_CODE: Str(200)
-> api-identifier: Str(List Cloud Accounts)
-> callingService: Str(default)
StartTimestamp: 2023-12-20 11:58:05.552448 +0000 UTC
Timestamp: 2023-12-20 11:58:35.552452 +0000 UTC
Value: 1
Exemplars:
Exemplar #0
-> Trace ID: d272ea2e688e2ade2cca4c4c091b8e20
-> Span ID: f1e8c883ec8763bd
-> Timestamp: 2023-12-20 11:58:32.227663 +0000 UTC
-> Value: 1
Metric #1
Descriptor:
-> Name: app.service.counter
-> Description: Services calling CAS APIs
-> Unit:
-> DataType: Sum
-> IsMonotonic: true
-> AggregationTemporality: Delta
NumberDataPoints #0
Data point attributes:
-> SERVICE_CALLED_CAS: Str(pcs-ui-automation+master@company.com)
-> api-identifier: Str(get-cloud-accounts)
-> requested-uri: Str(/cloud)
StartTimestamp: 2023-12-20 11:58:05.552448 +0000 UTC
Timestamp: 2023-12-20 11:58:35.552452 +0000 UTC
Value: 1
Exemplars:
Exemplar #0
-> Trace ID: d272ea2e688e2ade2cca4c4c091b8e20
-> Span ID: f1e8c883ec8763bd
-> Timestamp: 2023-12-20 11:58:32.187731 +0000 UTC
-> Value: 1
Metric #2
Descriptor:
-> Name: app.counter.apis
-> Description: Counts per API
-> Unit:
-> DataType: Sum
-> IsMonotonic: true
-> AggregationTemporality: Delta
NumberDataPoints #0
Data point attributes:
-> api-identifier: Str(get-cloud-accounts)
-> callingService: Str(default)
StartTimestamp: 2023-12-20 11:58:05.552448 +0000 UTC
Timestamp: 2023-12-20 11:58:35.552452 +0000 UTC
Value: 1
Exemplars:
Exemplar #0
-> Trace ID: d272ea2e688e2ade2cca4c4c091b8e20
-> Span ID: f1e8c883ec8763bd
-> Timestamp: 2023-12-20 11:58:32.190339 +0000 UTC
-> Value: 1
NumberDataPoints #1
Data point attributes:
-> api-identifier: Str(health-check)
-> callingService: Str(default)
StartTimestamp: 2023-12-20 11:58:05.552448 +0000 UTC
Timestamp: 2023-12-20 11:58:35.552452 +0000 UTC
Value: 1
Exemplars:
Exemplar #0
-> Trace ID: b65f0a37589d73acb39606ea017ce96b
-> Span ID: 7d43a712802b721c
-> Timestamp: 2023-12-20 11:58:31.482445 +0000 UTC
-> Value: 1
{"kind": "exporter", "data_type": "metrics", "name": "debug"}

@bryan-aguilar
Contributor

I also upgraded to version 0.89.0; after doing that I don't see those temporality errors again

So did the error resolve itself after upgrading? Are the metrics present when you query your prometheus server for them? If so, then I think it should be fair to say that something was fixed between v0.73.0 and now.

@ashishthakur55525
Author

I also upgraded to version 0.89.0; after doing that I don't see those temporality errors again

So did the error resolve itself after upgrading? Are the metrics present when you query your prometheus server for them? If so, then I think it should be fair to say that something was fixed between v0.73.0 and now.

No, actually: the error is gone, but I still cannot see those metrics in Prometheus. Also, one more thing: with the debug exporter enabled it says the temporality is DELTA, but in code we set it to CUMULATIVE. I don't know where the mismatch is. What can we do to fix this?

@ashishthakur55525
Author

@bryan-aguilar any thoughts/suggestions on the above?

@ashishthakur55525
Author

ashishthakur55525 commented Jan 9, 2024

Does anyone in this thread have any suggestions or recommendations, please?

@crobert-1
Member

The Prometheus remote write exporter does not support DELTA metrics, as stated in the README. A component has been proposed in the collector to properly handle this situation. I don't believe there's anything that can be done at this time as a workaround, other than what was proposed in the bug I linked earlier.

I'll have to defer to others though in case there's something I'm missing.
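
For context, the proposed component appears to be what later landed in collector-contrib as the deltatocumulative processor; a rough sketch of how it would sit in front of the remote write exporter, assuming a collector build that includes it (receiver name is a placeholder):

processors:
  deltatocumulative: {}   # defaults; converts delta sums to cumulative before export

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [deltatocumulative]
      exporters: [prometheusremotewrite]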

@crobert-1 crobert-1 added question Further information is requested and removed bug Something isn't working needs triage New item requiring triage labels Jan 17, 2024
@ashishthakur55525
Author

@crobert-1 which bug are you referring to? We have made changes to have cumulative metrics only, and that error is also gone, but we still don't see the metrics in Prometheus; no error, nothing. How can we make sure it's working then?

@ceastman-r7

@crobert-1 what changes did you make to have cumulative metrics only?

@crobert-1
Member

crobert-1 commented Feb 28, 2024

@crobert-1 which bug are you referring to? We have made changes to have cumulative metrics only, and that error is also gone, but we still don't see the metrics in Prometheus; no error, nothing. How can we make sure it's working then?

The bug I was referencing was in this comment above.

@crobert-1 what changes did you make to have cumulative metrics only?

I believe adapting the solution provided in this comment to your situation may work.
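
For illustration, a minimal sketch of forcing cumulative temporality on the SDK side of a Java application, assuming a reasonably recent opentelemetry-java version (the builder method shown may not exist in older SDKs such as the 1.15.0 seen in the debug output); the endpoint and class name are hypothetical, and the linked comment's exact solution may differ:

import io.opentelemetry.exporter.otlp.metrics.OtlpGrpcMetricExporter;
import io.opentelemetry.sdk.metrics.SdkMeterProvider;
import io.opentelemetry.sdk.metrics.export.AggregationTemporalitySelector;
import io.opentelemetry.sdk.metrics.export.PeriodicMetricReader;

public class CumulativeOtlpExample {
  public static void main(String[] args) {
    // When using the Java agent / autoconfigure, the equivalent knob is the environment variable
    //   OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE=cumulative

    // Programmatic setup: force CUMULATIVE temporality for every instrument kind,
    // which is what the prometheusremotewrite exporter expects.
    OtlpGrpcMetricExporter exporter =
        OtlpGrpcMetricExporter.builder()
            .setEndpoint("http://otel-collector:4317") // hypothetical collector endpoint
            .setAggregationTemporalitySelector(AggregationTemporalitySelector.alwaysCumulative())
            .build();

    SdkMeterProvider meterProvider =
        SdkMeterProvider.builder()
            .registerMetricReader(PeriodicMetricReader.create(exporter))
            .build();

    // Register meterProvider with the OpenTelemetry SDK and create counters as usual;
    // exported sums should then show AggregationTemporality: Cumulative in the debug exporter output.
  }
}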

github-actions (bot)

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Apr 29, 2024
@ashishthakur55525
Author

@open-telemetry/collector-contrib-triagers can you please help here? I am reopening this issue.

@ashishthakur55525
Author

ashishthakur55525 commented May 29, 2024

@bryan-aguilar @crobert-1 any thoughts here? I am still at the same point where we left off; the option given above did not work in my case. We are trying to send metrics from a Java application with temporality set to cumulative only.

@crobert-1
Member

Sorry @ashishthakur55525, I don't have any more information to share here.

@github-actions github-actions bot removed the Stale label May 30, 2024
github-actions (bot)

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Jul 30, 2024
github-actions (bot)

This issue has been closed as inactive because it has been stale for 120 days with no activity.

@github-actions github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Sep 28, 2024