OpenTelemetry stats reports histograms incorrectly #31016

Closed
dashpole opened this issue Nov 22, 2023 · 10 comments · Fixed by #31197

Comments

@dashpole

dashpole commented Nov 22, 2023

Title: OpenTelemetry stats reports histograms incorrectly

Description:

Sending Envoy OpenTelemetry metrics to an OpenTelemetry Collector and using the logging exporter, I observed a histogram where the Count did not match the sum of the bucket counts (see below). From the OTLP proto definition:

// bucket_counts is an optional field contains the count values of histogram
// for each bucket.
//
// The sum of the bucket_counts must equal the value in the count field.
//
// The number of elements in bucket_counts array must be by one greater than
// the number of elements in explicit_bounds array.
repeated fixed64 bucket_counts = 6;

The number of bucket_counts also appears to be the same as the number of explicit bounds, rather than one greater.

Reading through the implementation, it looks like we are using computedBuckets():

data_point->add_bucket_counts(histogram_stats.computedBuckets()[i]);

... which appears to hold, for each bucket, the cumulative count of samples below that bucket's threshold:

/**
* Returns computed bucket values during the period. The vector contains an approximation
* of samples below each quantile bucket defined in supportedBuckets(). This vector is
* guaranteed to be the same length as supportedBuckets().
*/
virtual const std::vector<uint64_t>& computedBuckets() const PURE;

computeDisjointBuckets() seems like it potentially does what we are looking for.

/**
* Returns version of computedBuckets() with disjoint buckets. This vector is
* guaranteed to be the same length as supportedBuckets().
*/
virtual std::vector<uint64_t> computeDisjointBuckets() const PURE;

Collector logging exporter output:

StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2023-11-22 00:54:12.184643877 +0000 UTC
Count: 1
Sum: 375.000000
ExplicitBounds #0: 0.500000
ExplicitBounds #1: 1.000000
ExplicitBounds #2: 5.000000
ExplicitBounds #3: 10.000000
ExplicitBounds #4: 25.000000
ExplicitBounds #5: 50.000000
ExplicitBounds #6: 100.000000
ExplicitBounds #7: 250.000000
ExplicitBounds #8: 500.000000
ExplicitBounds #9: 1000.000000
ExplicitBounds #10: 2500.000000
ExplicitBounds #11: 5000.000000
ExplicitBounds #12: 10000.000000
ExplicitBounds #13: 30000.000000
ExplicitBounds #14: 60000.000000
ExplicitBounds #15: 300000.000000
ExplicitBounds #16: 600000.000000
ExplicitBounds #17: 1800000.000000
ExplicitBounds #18: 3600000.000000
Buckets #0, Count: 0
Buckets #1, Count: 0
Buckets #2, Count: 0
Buckets #3, Count: 0
Buckets #4, Count: 0
Buckets #5, Count: 0
Buckets #6, Count: 0
Buckets #7, Count: 0
Buckets #8, Count: 1
Buckets #9, Count: 1
Buckets #10, Count: 1
Buckets #11, Count: 1
Buckets #12, Count: 1
Buckets #13, Count: 1
Buckets #14, Count: 1
Buckets #15, Count: 1
Buckets #16, Count: 1
Buckets #17, Count: 1
Buckets #18, Count: 1

The sum of the bucket counts is 11, but the count is 1.

Repro steps:
Run Envoy configured with the OpenTelemetry stats sink and send to an OpenTelemetry collector with the logging exporter, with logLevel: debug to print out the OTLP.

Note: The Envoy_collect tool gathers a tarball with debug logs, config and the following admin endpoints: /stats, /clusters and /server_info. Please note if there are privacy concerns, sanitize the data prior to sharing the tarball/pasting.

Admin and Stats Output:

Include the admin output for the following endpoints: /stats, /clusters, /routes, /server_info. For more information, refer to the admin endpoint documentation.

Note: If there are privacy concerns, sanitize the data prior to sharing.

Config:

Include the config used to configure Envoy.

Logs:

Include the access logs and the Envoy logs.

Note: If there are privacy concerns, sanitize the data prior to sharing.

Call Stack:

If the Envoy binary is crashing, a call stack is required. Please refer to the Bazel Stack trace documentation.

cc @ohadvano

@dashpole dashpole added bug triage Issue requires triage labels Nov 22, 2023
@KBaichoo KBaichoo added area/opentelemetry and removed triage Issue requires triage labels Nov 23, 2023
@bhavisubu

This is the output from the OTel Collector, which includes the metric name.

Metric #184
Descriptor:
-> Name: tls_inspector.bytes_processed
-> Description:
-> Unit:
-> DataType: Histogram
-> AggregationTemporality: Cumulative
HistogramDataPoints #0
Data point attributes:
-> hostname: Str(,region:us-east1)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2023-11-22 00:51:12.180321127 +0000 UTC
Count: 1
Sum: 375.000000
ExplicitBounds #0: 0.500000
ExplicitBounds #1: 1.000000
ExplicitBounds #2: 5.000000
ExplicitBounds #3: 10.000000
ExplicitBounds #4: 25.000000
ExplicitBounds #5: 50.000000
ExplicitBounds #6: 100.000000
ExplicitBounds #7: 250.000000
ExplicitBounds #8: 500.000000
ExplicitBounds #9: 1000.000000
ExplicitBounds #10: 2500.000000
ExplicitBounds #11: 5000.000000
ExplicitBounds #12: 10000.000000
ExplicitBounds #13: 30000.000000
ExplicitBounds #14: 60000.000000
ExplicitBounds #15: 300000.000000
ExplicitBounds #16: 600000.000000
ExplicitBounds #17: 1800000.000000
ExplicitBounds #18: 3600000.000000
Buckets #0, Count: 0
Buckets #1, Count: 0
Buckets #2, Count: 0
Buckets #3, Count: 0
Buckets #4, Count: 0
Buckets #5, Count: 0
Buckets #6, Count: 0
Buckets #7, Count: 0
Buckets #8, Count: 1
Buckets #9, Count: 1
Buckets #10, Count: 1
Buckets #11, Count: 1
Buckets #12, Count: 1
Buckets #13, Count: 1
Buckets #14, Count: 1
Buckets #15, Count: 1
Buckets #16, Count: 1
Buckets #17, Count: 1
Buckets #18, Count: 1
{"kind": "exporter", "data_type": "metrics", "name": "logging"}
2023-11-22T00:51:13.349Z error exporterhelper/queued_retry.go:364 Exporting failed. Try enabling retry_on_failure config option to retry on retryable errors {"kind": "exporter", "data_type": "metrics", "name": "googlecloud", "error": "rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: Field timeSeries[164].points[0].distributionValue had an invalid value: Distribution value has |bucket_counts| fields that sum to 19 which does not equal the |count| field value of 1.; Field timeSeries[163].points[0].distributionValue had an invalid value: Distribution value has |bucket_counts| fields that sum to 5447 which does not equal the |count| field value of 480.; Field timeSeries[167].points[0].distributionValue had an invalid value: Distribution value has |bucket_counts| fields that sum to 19 which does not equal the |count| field value of 1.\nerror details: name = Unknown desc = total_point_count:168 success_point_count:165 errors:{status:{code:3} point_count:3}"}
go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send
go.opentelemetry.io/collector/exporter@v0.84.0/exporterhelper/queued_retry.go:364
go.opentelemetry.io/collector/exporter/exporterhelper.(*metricsSenderWithObservability).send
go.opentelemetry.io/collector/exporter@v0.84.0/exporterhelper/metrics.go:191
go.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).start.func1
go.opentelemetry.io/collector/exporter@v0.84.0/exporterhelper/queued_retry.go:202
go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*boundedMemoryQueue).StartConsumers.func1
go.opentelemetry.io/collector/exporter@v0.84.0/exporterhelper/internal/bounded_memory_queue.go:47

@sundarms

sundarms commented Dec 1, 2023

@dashpole Thanks for raising this issue.
As per the OpenMetrics spec (https://github.com/OpenObservability/OpenMetrics/blob/main/specification/OpenMetrics.md#histogram-1):

# TYPE foo histogram
foo_bucket{le="0.0"} 0
foo_bucket{le="1e-05"} 0
foo_bucket{le="0.0001"} 5
foo_bucket{le="0.1"} 8
foo_bucket{le="1.0"} 10
foo_bucket{le="10.0"} 11
foo_bucket{le="100000.0"} 11
foo_bucket{le="1e+06"} 15
foo_bucket{le="1e+23"} 16
foo_bucket{le="1.1e+23"} 17
foo_bucket{le="+Inf"} 17
foo_count 17
foo_sum 324789.3
foo_created 1520430000.123

Count should be the number of observed events and Sum should be the sum of all observed values, correct?

In case of envoy output

Timestamp: 2023-11-22 00:54:12.184643877 +0000 UTC
Count: 1
Sum: 375.000000
ExplicitBounds #0: 0.500000
ExplicitBounds #1: 1.000000
ExplicitBounds #2: 5.000000
ExplicitBounds #3: 10.000000
ExplicitBounds #4: 25.000000
ExplicitBounds #5: 50.000000
ExplicitBounds #6: 100.000000
ExplicitBounds #7: 250.000000
ExplicitBounds #8: 500.000000
ExplicitBounds #9: 1000.000000
ExplicitBounds #10: 2500.000000
ExplicitBounds #11: 5000.000000
ExplicitBounds #12: 10000.000000
ExplicitBounds #13: 30000.000000
ExplicitBounds #14: 60000.000000
ExplicitBounds #15: 300000.000000
ExplicitBounds #16: 600000.000000
ExplicitBounds #17: 1800000.000000
ExplicitBounds #18: 3600000.000000
Buckets #0, Count: 0
Buckets #1, Count: 0
Buckets #2, Count: 0
Buckets #3, Count: 0
Buckets #4, Count: 0
Buckets #5, Count: 0
Buckets #6, Count: 0
Buckets #7, Count: 0
Buckets #8, Count: 1
Buckets #9, Count: 1
Buckets #10, Count: 1
Buckets #11, Count: 1
Buckets #12, Count: 1
Buckets #13, Count: 1
Buckets #14, Count: 1
Buckets #15, Count: 1
Buckets #16, Count: 1
Buckets #17, Count: 1
Buckets #18, Count: 1

I see it is giving the correct output:
Number of observed events (Count = 1)
Sum of observed values (Sum = 375.000000)

Is the exporter expecting Count to be the number of buckets? Isn't that incorrect? Did I miss anything?

@dashpole
Author

dashpole commented Dec 1, 2023

Hi @sundarms, OpenTelemetry (opentelemetry.io) differs from OpenMetrics (openmetrics.io) in this regard. OpenMetrics buckets are cumulative (count includes all observations below the threshold), as you point out, but OpenTelemetry buckets are not:

https://opentelemetry.io/docs/specs/otel/metrics/data-model/#histogram

The correct output should have been:

Timestamp: 2023-11-22 00:54:12.184643877 +0000 UTC
Count: 1
Sum: 375.000000
ExplicitBounds #0: 0.500000
ExplicitBounds #1: 1.000000
ExplicitBounds #2: 5.000000
ExplicitBounds #3: 10.000000
ExplicitBounds #4: 25.000000
ExplicitBounds #5: 50.000000
ExplicitBounds #6: 100.000000
ExplicitBounds #7: 250.000000
ExplicitBounds #8: 500.000000
ExplicitBounds #9: 1000.000000
ExplicitBounds #10: 2500.000000
ExplicitBounds #11: 5000.000000
ExplicitBounds #12: 10000.000000
ExplicitBounds #13: 30000.000000
ExplicitBounds #14: 60000.000000
ExplicitBounds #15: 300000.000000
ExplicitBounds #16: 600000.000000
ExplicitBounds #17: 1800000.000000
ExplicitBounds #18: 3600000.000000
Buckets #0, Count: 0
Buckets #1, Count: 0
Buckets #2, Count: 0
Buckets #3, Count: 0
Buckets #4, Count: 0
Buckets #5, Count: 0
Buckets #6, Count: 0
Buckets #7, Count: 0
Buckets #8, Count: 1
- Buckets #9, Count: 1
+ Buckets #9, Count: 0
- Buckets #10, Count: 1
+ Buckets #10, Count: 0
- Buckets #11, Count: 1
+ Buckets #11, Count: 0
- Buckets #12, Count: 1
+ Buckets #12, Count: 0
- Buckets #13, Count: 1
+ Buckets #13, Count: 0
- Buckets #14, Count: 1
+ Buckets #14, Count: 0
- Buckets #15, Count: 1
+ Buckets #15, Count: 0
- Buckets #16, Count: 1
+ Buckets #16, Count: 0
- Buckets #17, Count: 1
+ Buckets #17, Count: 0
- Buckets #18, Count: 1
+ Buckets #18, Count: 0

I.e. a single Count: 1 at Bucket #8. The sum of all bucket counts is then 1, which equals the Count field at the top.

See, for example, the conversion of Prometheus histogram bucket counts to OpenTelemetry histogram buckets here. It subtracts the previous bucket's cumulative count from the current bucket's count to get the OpenTelemetry bucket's disjoint count.

@sundarms

sundarms commented Dec 1, 2023

@dashpole thanks for the explanation.

@owensulei3841

Hi @ohadvano, could I ask what the estimated timeline is for resolving this issue?
Thank you

@ohadvano
Contributor

ohadvano commented Dec 5, 2023

Hi, I'll try to get to this soon. I am actually noticing two problems now. The first is the one described in this issue; the second is that the bucket_counts array needs to be one element larger than the explicit_bounds array (according to the spec). This causes a problem where, if you set the bounds to [1, 2, 3] and a histogram record of 200 is inserted, the count will be 1 and the sum will be 200, but no bucket_counts element will reflect it; all values of the array will be 0, i.e. [0, 0, 0]. I believe that in this case it should be [0, 0, 0, 1], as the last element represents the (explicit_bounds[i-1], +infinity) bucket for i == size(explicit_bounds).

Fixing the first is trivial, but I'm trying to see how to fix the second problem. I'm not sure there's an API to get the count of elements that fall outside the explicit bounds, so I might need to get this another way.

@dashpole
Author

dashpole commented Dec 5, 2023

Thanks @ohadvano!

I believe that in this case, it should be [0, 0, 0, 1] as the last element indicates the explicit_bounds[i-1], +infinity) for i == size(explicit_bounds)?

Yes, that's correct.

@jmarantz
Contributor

jmarantz commented Dec 5, 2023

@ohadvano I'm not sure what level of API you are looking at, but you can get all of the data out of the histogram structures. E.g. see the implementation of the 'detailed' histogram admin endpoint used for rendering, here:

ParentHistogramImpl::detailedlBucketsHelper(const histogram_t& histogram) {

@sundarms

sundarms commented Dec 6, 2023

@ohadvano thanks for the fix PR. Is there a specific release version/date this will be pushed out in?

@ohadvano
Contributor

ohadvano commented Dec 6, 2023

You can either wait for the next official release, which is next month, or cherry-pick the commit.
