
Should jvm.gc.duration histogram have any default buckets? #274

Closed
trask opened this issue Aug 18, 2023 · 15 comments · Fixed by #317
@trask
Member

trask commented Aug 18, 2023

Currently the jvm.gc.duration histogram has this bucket definition:

This metric SHOULD be specified with
ExplicitBucketBoundaries
of [] (single bucket histogram capturing count, sum, min, max).

Opening this as a tracking issue since it has come up as an open question from the semantic convention working group.
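For illustration, a zero-boundary histogram still records useful summary statistics. A minimal sketch of what such an aggregation captures (hypothetical, not the OpenTelemetry SDK's actual aggregator):

```java
// Hypothetical sketch of a histogram with ExplicitBucketBoundaries of []:
// no per-bucket counts, just the summary statistics count, sum, min, and max.
public class SingleBucketHistogram {
    long count = 0;
    double sum = 0.0;
    double min = Double.POSITIVE_INFINITY;
    double max = Double.NEGATIVE_INFINITY;

    void record(double seconds) {
        count++;
        sum += seconds;
        min = Math.min(min, seconds);
        max = Math.max(max, seconds);
    }

    public static void main(String[] args) {
        SingleBucketHistogram h = new SingleBucketHistogram();
        // Made-up sample GC durations, in seconds.
        for (double d : new double[] {0.004, 0.120, 0.030}) {
            h.record(d);
        }
        System.out.println(h.count + " " + h.sum + " " + h.min + " " + h.max);
    }
}
```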

@jack-berg
Member

My point of view:

  • Most users will just want summary metrics for this (min, max, sum, count), and a single-bucket histogram fulfills this
  • Users that care can use views to upgrade to a histogram with bucket boundaries that reflect the thresholds they care about
  • It will be hard to find a default set of bucket boundaries that is useful in all GC situations. Not impossible, but it will take some thoughtful analysis including data from real-world systems. And even after all that, we'll never get an answer that satisfies everyone.
  • If we insist on having default buckets, perhaps we keep the boundaries simple and informed by defaults for JVM GCs. For example, the G1 GC has a default of -XX:MaxGCPauseMillis=200. If we set the bucket boundaries to be [200], users can know the percentage of GCs which met the goal for desired maximum pause time.
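The arithmetic behind that last bullet can be sketched as follows. This is purely illustrative: the `fractionMeetingGoal` helper and the sample durations are made up, not part of any SDK.

```java
// With a single boundary of [0.2] (G1's default -XX:MaxGCPauseMillis=200,
// expressed in seconds), the lower bucket's count divided by the total count
// is the fraction of GCs that met the pause-time goal.
public class PauseGoal {
    static double fractionMeetingGoal(double[] durations, double goalSeconds) {
        long underGoal = 0;
        for (double d : durations) {
            if (d <= goalSeconds) underGoal++; // falls into the [0, 0.2] bucket
        }
        return (double) underGoal / durations.length;
    }

    public static void main(String[] args) {
        double[] sample = {0.05, 0.15, 0.30, 0.10, 0.45}; // made-up durations
        System.out.println(fractionMeetingGoal(sample, 0.2));
    }
}
```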

@jack-berg
Member

And more generally, how do we think about the tradeoff between metric size and value when adding more histogram buckets?

On one end of the spectrum, we could have very small buckets (i.e. one bucket per millisecond). These would produce infeasibly large payloads, but offer high density for computing percentiles. On the other end, we have histograms with zero buckets, which have the smallest payload but offer no ability to compute percentiles. Then we have the messy middle ground, where we try to choose a set of bucket boundaries that balances payload size with having buckets useful for computing percentiles.

In the case of http...duration, there was some prior art in the Prometheus bucket boundaries which made the conversation straightforward. But prior art will often not be available.

Without a general set of guidelines for making this decision, I suspect each new proposed histogram metric will repeat this conversation.
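The tradeoff can be made concrete with a toy explicit-bucket aggregator: N boundaries yield N + 1 bucket counts, so the payload grows with every boundary added, while a percentile estimate can only be resolved to a bucket's upper bound, so fewer buckets mean coarser percentiles. A sketch (the `BucketCounter` class is hypothetical, invented for illustration):

```java
import java.util.Arrays;

// Hypothetical explicit-bucket aggregator illustrating the payload/percentile
// tradeoff: N boundaries -> N + 1 bucket counts, and zero boundaries collapse
// to a single count (summary statistics only).
public class BucketCounter {
    final double[] boundaries; // ascending upper bounds, in seconds
    final long[] counts;       // boundaries.length + 1 buckets

    BucketCounter(double... boundaries) {
        this.boundaries = boundaries;
        this.counts = new long[boundaries.length + 1];
    }

    void record(double value) {
        int idx = Arrays.binarySearch(boundaries, value);
        // An exact match lands in the bucket whose upper bound it equals;
        // otherwise binarySearch returns -(insertionPoint) - 1.
        counts[idx >= 0 ? idx : -idx - 1]++;
    }

    // Coarse percentile estimate: the upper bound of the bucket where the
    // cumulative count reaches fraction p of the total. Finer buckets give
    // finer answers; the overflow bucket has no finite upper bound.
    double percentileUpperBound(double p) {
        long total = Arrays.stream(counts).sum();
        long target = (long) Math.ceil(p * total);
        long cumulative = 0;
        for (int i = 0; i < counts.length; i++) {
            cumulative += counts[i];
            if (cumulative >= target) {
                return i < boundaries.length ? boundaries[i] : Double.POSITIVE_INFINITY;
            }
        }
        return Double.POSITIVE_INFINITY;
    }
}
```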

@trask
Member Author

trask commented Aug 28, 2023

We discussed this in last week's Java SIG meeting.

One thing that we confirmed is that the GC durations (which are emitted by MemoryPoolMXBean#getUsage()) encompass the entire GC cycle, not only the "application pause" phase(s), which means they aren't related to the -XX:MaxGCPauseMillis value.

Still trying to reach out to more JVM folks who may have idea(s) for bucket boundaries here.

@trask
Member Author

trask commented Aug 28, 2023

Just reiterating what I think is the goal: if possible, to have a small number of buckets (<=3?) that would be useful to a majority of applications for identifying long GCs worth drilling into.

However, I'm wondering whether this is possible, since it's generally the long "application pauses" (e.g. what you would get from GC logs) that are useful to drill into, and I'm just not sure we get that from these GC metrics.

@kittylyst

The problem is, I don't think we can meaningfully characterize "a majority of applications" (or, for that matter, "long GC pauses") - there's just too much difference between workloads.

So I'm +1 on @jack-berg's original comment.

@trask
Member Author

trask commented Aug 30, 2023

What about 3 super generic bucket boundaries:

  • 0.1 seconds
  • 1 second
  • 10 seconds

Do we think this could be used to answer some basic questions for a reasonable set of applications, e.g. show me some long old (or young) GC events?

One possible advantage of having a couple of buckets is that it makes it more visible to users that this is a histogram they can tune further if they want.

(sorry, just trying to play out all possibilities before we make the decision to not have any buckets)

@kittylyst

I wouldn't even like to guess what percentage of application processes would be automatically killed (e.g. by k8s) if they experienced a 10s GC STW event.

My feeling is that the domain of possible workloads is just too complex for any single set of defaults to make sense.

Curse you, JVM, for being so applicable to such a wide range of possible execution parameters!

@jackshirazi

We can leave out all the low-latency applications: they know about monitoring the GC pause latencies and either do it another way or will configure the buckets as they need. The remaining applications are broadly those that need reasonable inter-service pause times (typically these need pauses to be under 25ms), those that need reasonable user-interaction pause times (pauses need to be under 250ms), and throughput applications that need to avoid a timeout (common ones are 5/10/30 seconds, usually because of proxy or comms issues at those boundaries). So for me these give slightly uneven boundaries of 0.025, 0.25, and 2.5 seconds.

Complicating this is that the young-gen times are STW, but as we've seen, the old-gen ones are not necessarily, so we'd get some high values that don't actually matter.

I'm fine with no histogram buckets. If there's a single bucket, I'd go for 250ms.

@kittylyst

Sound advice - maybe this should go into the documentation if we decide to go with a single bucket? In fact, a general writeup of the consensus might also be helpful.

@breedx-splk
Contributor

breedx-splk commented Aug 31, 2023

If there's a single bucket, I'd go for 250ms

Might be pedantic, but if there's a single boundary then technically there are 2 buckets, right? The data above the boundary and the data below it.

In any case, I appreciate the pragmatism in this discussion. I do think that @jackshirazi has slightly better numbers (and reasons for choosing them) vs. @trask's .1/1/10s. I especially think that it's important to have something in the lower tens of millis, in part because the threshold of human visual perception is around 12ms.

I'm +1 for 0.025, 0.25, and 2.5s.

@trask
Member Author

trask commented Aug 31, 2023

How important do we think limiting buckets for cost is?

This has got me wondering about using the same buckets as http durations, since it gives nice coverage of the range of interesting timings discussed above.

[ 0, 0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1, 2.5, 5, 7.5, 10 ]

@PeterF778

I don't want to be too picky, but what is the point of having a boundary of 0 if we know the values we observe cannot be negative?

@trask
Member Author

trask commented Sep 1, 2023

hm, I'm not sure, I just opened #298 to get more attention to this question

@trask
Member Author

trask commented Nov 9, 2023

Proposal: [0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10]

Reasoning:

@jack-berg
Member

I'd rather go with the original [0.01, .1, 1, 10]. It's easy to make the argument that more buckets are useful, but subjectively, speaking from my own intuition, I think that fewer buckets will suffice for most users. If we go with more buckets, more often than not, users who look closely will opt to reduce the number of buckets. In contrast, if we go with fewer buckets, more often than not, I think users will be content and stick with the default.

I wouldn't block the proposal for more buckets, but I do think less is best in this case.


7 participants