Adapt default histogram boundaries to seconds as the new base unit #3509

fstab · 2023-05-16T10:41:16Z

Changes

There was a decision by the TC (technical committee) to use seconds as the base unit in OpenTelemetry rather than milliseconds.

As a result, we should also scale down the default histogram bucket boundaries by a factor of 1000.

Note that scaled down boundaries are already in use, see for example semantic conventions for http.server.duration.

Related issues

Semantic Conventions: default to seconds for duration units #2977 (comment)

Signed-off-by: Fabian Stäber <fabian@fstab.de>

pirgeo · 2023-05-16T10:54:17Z

I don't think we can do that, since we marked the defaults for the explicit bucket histogram as stable. We were able to change it for http.server.duration because the semantic conventions themselves are not marked stable yet.

MrAlias

These buckets are used for more than measurements of time. Changing their value is not motivated.

fstab · 2023-05-19T21:18:51Z

since we marked the defaults for the explicit bucket histogram as stable

I thought that changing default bucket boundaries is not considered a breaking change in the OTel Spec, but I might be wrong here.

are used for more than measurements of time

I'm not sure about this argument. The current defaults were explicitly defined for representing durations in milliseconds (they were copied from Prometheus' defaults for seconds and scaled by a factor or 1000).

MrAlias · 2023-05-19T22:08:59Z

The current defaults were explicitly defined for representing durations..

"Designed for" != "Used for"

pirgeo · 2023-05-22T09:32:49Z

I thought that changing default bucket boundaries is not considered a breaking change in the OTel Spec, but I might be wrong here.

I am now not sure anymore about the outcome of that discussion. My understanding was that changing the default buckets for the Histogram instrument itself is a breaking change, but changing the defaults for certain semantic conventions is not. This is because the Histogram spec is marked stable, but the semantic conventions are not. IIRC, that was our "out" in #2977 for being compatible but also not breaking the specification.

All the semconv metrics recorded with OTel should be compatible with Prometheus, as the defaults are now in seconds. The hints/advice API gives you a way to set these defaults. Instrumentations that want to be compliant with the semantic convention specification must export these metrics in seconds. For custom metrics outside the semantic conventions, users would probably want to set up "the right buckets" themselves (or use exponential histograms). They are the domain experts. My feeling is that the defaults we have today are very useful to many people in many situations, and should not be changed, even if it turns out not to be a breaking change.

jsuereth · 2023-05-25T12:32:05Z

I thought that changing default bucket boundaries is not considered a breaking change in the OTel Spec, but I might be wrong here.

Changing bucket boundaries is not considered a breaking change for instrumentation (this is also true in prometheus land). It does alter error rates, but should not alter interaction.

That said, changing unit does break downstream usage, so perhaps there was a conflation of concerns there.

fstab · 2023-05-26T09:52:25Z

Just to clarify: This is about changing the default buckets, not about changing the unit.

If a user wants to measure latencies of a custom business transaction, the current OpenTelemetry default behavior is as follows:

Unit is seconds.
Buckets are 0, 5s, 10s, 25s, 50s, 75s, 100s, 250s, 500s, 750s, 1000s, 2500s, 5000s, 7500s, 10000s.

So all observations < 5s will end up in the same bucket. This will make the default histogram useless in many cases.

I agree that you can come up with scenarios where the current buckets are useful, but I think the most common usage of histograms is to measure latencies, and the default should be reasonable buckets for measuring latencies in seconds.

pirgeo · 2023-05-30T10:25:13Z

It does alter error rates, but should not alter interaction.

Not sure I follow. Could you elaborate on this a little more? Do you mean if all of the data falls in the 10-Infinity bucket for second-aligned buckets, you could still query for "how many requests complete within <250 (previously milliseconds, but now seconds)"? The result would obviously have massive errors attached to it, but the query itself would not fail.

For someone who is using a stable metrics SDK today, recording business transaction latencies in milliseconds (since it was the default before) this would break their stuff, right? They have their alerts configured to trigger at 250ms, relying on the fact that the SDK provides consistent data. Their backend might not do automatic unit conversion or not take the unit into account at all. Then they upgrade to a later version of the metrics SDK that implements this updated spec, and all of a sudden all of their data ends up in the bucket >10 since the way the data is calculated did not change. I guess this is dependent on the backend implementation, but either the alerts now trigger all the time or never. Neither is good, and it requires the user to change code in order to upgrade the SDK, which feels like a breaking change to me - but please correct me if I am wrong here. I could also imagine the person seeing the alerts might not be the same person that did the SDK upgrade. If we don't change the defaults, users can go in in a coordinated way whenever they have the time, decide what unit makes the most sense for their use-case, switch the buckets over seconds-based buckets (if they decide that is the way to go), update how the data is collected (or, more realistically, multiply the value that they collected in milliseconds to get to seconds), and set up their alerts while they are touching the code anyway.

I feel like I am missing an important piece of the puzzle here that seems clear to everyone else 😅

For semantic conventions, we don't have this problem. We can switch over the instruments that collect semantic conventions to seconds-based buckets, and when user dashboard/alerts break, we can point to the notice saying "Don't rely on the semantic conventions yet, they were not marked as stable". These semantic conventions actually report data in seconds from now on.

fstab · 2023-05-30T10:48:01Z

Yes, the current buckets make sense for measuring latencies in milliseconds. However, OpenTelemetry decided to switch to seconds, and I think the default should not be "milliseconds for custom metrics" and "seconds for semantic conventions".

pirgeo · 2023-05-30T11:48:35Z

I get that, my point is: The Metrics SDK specification containing the boundaries is marked stable. Users are relying on the default buckets defined in the stable spec for their custom metrics. If the spec changes the defaults, SDK authors follow the specification and release a new version with updated buckets. After updating the SDK, the user's histograms can't be used for estimations anymore because everything ends up in one bucket. This happens even though everything that user used was marked stable and no breaking change was indicated. I think it is a breaking change, but @jsuereth pointed out that it is not, IIUC (?)

Finally, users might actually have consciously decided for going with milliseconds (or some other unit that aligns well with these buckets) because it is the best fit for their metric or they don't care about Prometheus because they don't use it - this change would force them to change their buckets manually to what they assumed was a good and stable default.

JamesNK · 2023-06-01T08:19:38Z

I agree with this PR. I created an issue and then noticed this PR - #3525

There are several options:

Good: Recommending seconds for durations and changing the default boundaries. Completely aligns with Prometheus.
Good: Reverting back to recommending milliseconds for durations and leaving the boundaries as they are today.
Bad: Recommending seconds for durations and leaving the default boundaries configured for milliseconds.

Unfortunately, OTel currently is in the bad state.

Are https://github.com/open-telemetry/opentelemetry-specification and https://github.com/open-telemetry/semantic-conventions different people? I think some communication is required to figure out how to get back to a good state.

jack-berg · 2023-06-01T15:27:07Z

@JamesNK / @fstab have you seen #2977? Contains a lot of context around this issue. Summary is:

Defaulting to second for durations is desirable for alignment with prometheus.
We'd like to change the default explicit bucket boundaries such that their sensible for measuring client / server response durations, but there isn't any consensus about whether this is an allowed change given that section of the metrics SDK document is stable.
The advice API was recently added which allows instrumentation to provide a recommendation to SDKs on useful default bucket boundaries.
The TC voted on this and decided to change the default duration unit to seconds. The expectation is that the advice API will be used for instrumentation recording http.[client|server].duration to specify bucket boundaries that are useful when measurements are recorded in seconds.

Unfortunately, OTel currently is in the bad state.

Its not ideal, but with the advice API it should be ok. The default bucket boundaries were only ever going to be useful for a pretty particular use case. Instrumentation can use advice to ensure resulting bucket boundaries are useful out of the box.

github-actions · 2023-06-09T03:18:06Z

This PR was marked stale due to lack of activity. It will be closed in 7 days.

github-actions · 2023-06-17T03:16:49Z

Closed as inactive. Feel free to reopen if this PR is still being worked on.

Adapt default histogram boundaries to seconds as the new base unit

9081844

Signed-off-by: Fabian Stäber <fabian@fstab.de>

fstab requested review from a team May 16, 2023 10:41

github-actions bot assigned yurishkuro May 16, 2023

MrAlias requested changes May 16, 2023

View reviewed changes

JamesNK mentioned this pull request Jun 1, 2023

Default histogram bucket bounds are too big for seconds durations #3525

Closed

github-actions bot added the Stale label Jun 9, 2023

github-actions bot closed this Jun 17, 2023

mateuszrzeszutek mentioned this pull request Jul 4, 2023

[Micrometer metrics bridge] Default bucket boundaries assume milliseconds base unit open-telemetry/opentelemetry-java-instrumentation#8841

Closed

JamesNK mentioned this pull request Aug 22, 2023

Update default values for explicit histogram bucket boundaries open-telemetry/opentelemetry-dotnet#4797

Closed

aabmass mentioned this pull request Aug 29, 2023

[metrics] Allow SDKs to change ExplicitBucketBoundaries default based on unit? #3672

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adapt default histogram boundaries to seconds as the new base unit #3509

Adapt default histogram boundaries to seconds as the new base unit #3509

fstab commented May 16, 2023

pirgeo commented May 16, 2023

MrAlias left a comment

fstab commented May 19, 2023 •

edited

Loading

MrAlias commented May 19, 2023

pirgeo commented May 22, 2023

jsuereth commented May 25, 2023

fstab commented May 26, 2023

pirgeo commented May 30, 2023

fstab commented May 30, 2023

pirgeo commented May 30, 2023

JamesNK commented Jun 1, 2023 •

edited

Loading

jack-berg commented Jun 1, 2023

github-actions bot commented Jun 9, 2023

github-actions bot commented Jun 17, 2023

Adapt default histogram boundaries to seconds as the new base unit #3509

Adapt default histogram boundaries to seconds as the new base unit #3509

Conversation

fstab commented May 16, 2023

Changes

Related issues

pirgeo commented May 16, 2023

MrAlias left a comment

Choose a reason for hiding this comment

fstab commented May 19, 2023 • edited Loading

MrAlias commented May 19, 2023

pirgeo commented May 22, 2023

jsuereth commented May 25, 2023

fstab commented May 26, 2023

pirgeo commented May 30, 2023

fstab commented May 30, 2023

pirgeo commented May 30, 2023

JamesNK commented Jun 1, 2023 • edited Loading

jack-berg commented Jun 1, 2023

github-actions bot commented Jun 9, 2023

github-actions bot commented Jun 17, 2023

fstab commented May 19, 2023 •

edited

Loading

JamesNK commented Jun 1, 2023 •

edited

Loading