Should jvm.gc.duration histogram have any default buckets? #274
My point of view:

And more generally, how do we think about the tradeoff between the size of a metric and its value when adding more histogram buckets? On one end of the spectrum, we could have very small buckets (i.e. one bucket per millisecond). These would produce unfeasibly large payloads, but offer high density for computing percentiles. On the other end, we have histograms with zero buckets, which have the smallest payload but offer no ability to compute percentiles. Then we have the messy middle ground, where we try to choose a set of bucket boundaries that balances payload size against having buckets useful for computing percentiles. In the case of http...duration, there was some prior art in the Prometheus bucket boundaries, which made the conversation straightforward. But prior art will often not be available. Without a general set of guidelines for making this decision, I suspect each new proposed histogram metric will repeat this conversation.
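To make the payload-vs-precision tradeoff concrete, here is a minimal, hypothetical sketch (illustrative only, not taken from any SDK) of how a percentile is estimated from explicit-bucket histogram counts by linear interpolation. With only a few coarse boundaries, a p95 estimate can only be placed somewhere inside a wide bucket, so its precision is bounded by the bucket width:

```java
// Hypothetical percentile estimation from explicit-bucket histogram data.
// Boundaries define buckets (-inf, b0], (b0, b1], ..., (b[n-1], +inf);
// counts has one more entry than boundaries (the overflow bucket).
public class PercentileSketch {
    static double estimatePercentile(double[] boundaries, long[] counts, double q) {
        long total = 0;
        for (long c : counts) total += c;
        double rank = q * total;
        long cumulative = 0;
        for (int i = 0; i < counts.length; i++) {
            cumulative += counts[i];
            if (cumulative >= rank) {
                double lower = (i == 0) ? 0.0 : boundaries[i - 1];
                // crude guess for the open-ended overflow bucket
                double upper = (i < boundaries.length) ? boundaries[i] : lower * 2;
                double fraction = counts[i] == 0
                        ? 0.0
                        : (rank - (cumulative - counts[i])) / counts[i];
                return lower + fraction * (upper - lower);
            }
        }
        return boundaries[boundaries.length - 1];
    }

    public static void main(String[] args) {
        double[] boundaries = {0.025, 0.25, 2.5}; // seconds
        long[] counts = {90, 8, 2, 0};            // one more count than boundaries
        // With such coarse buckets, p95 is only known to lie in (0.025, 0.25]
        System.out.println(estimatePercentile(boundaries, counts, 0.95));
    }
}
```

More boundaries shrink the width of each bucket and tighten this estimate, which is exactly what the extra payload buys.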
We discussed in last week's Java SIG meeting. One thing that we confirmed is that the GC durations (which are emitted by MemoryPoolMXBean#getUsage()) encompass the entire GC cycle, not only the "application pause" phase(s) of the GC cycle, which means the values aren't a direct measure of application pause time. Still trying to reach out to more JVM folks who may have idea(s) for bucket boundaries here.
Just reiterating what I think is the goal: if possible, to have a small number of buckets (<= 3?) which would be useful for a majority of applications, to help identify long GCs that would make sense to drill into. However, I'm wondering whether this is possible, since it's generally the long "application pauses" (e.g. what you would get from GC logs) that are useful to drill into, and I'm just not sure we get that from these GC metrics.
The problem is, I don't think we can meaningfully characterize "a majority of applications" (or, for that matter, "long GC pauses") - there's just too much difference between workloads. So I'm +1 on @jack-berg's original comment.
What about 3 super generic bucket boundaries:
Do we think this could be used to answer some basic questions for a reasonable set of applications, e.g. "show me some long old (or young) GC events"? One possible advantage of having a couple of buckets is to make it more visible to users that this is a histogram they can tune further if they want. (Sorry, just trying to play out all the possibilities before we make the decision to not have any buckets.)
I wouldn't even like to guess what percentage of application processes would be automatically killed (e.g. by k8s) if they experienced a 10 s GC STW event. My feeling is that the domain of possible workloads is just too complex for any single set of defaults to make sense. Curse you, JVM, for being so applicable to such a wide range of possible execution parameters!
We can leave out all the low-latency applications; they know about monitoring GC pause latencies and either do it another way or will configure the buckets as they need. So the remaining applications are broadly: those that need reasonable inter-service latency pause times (typically these need pauses to be under 25 ms); those that need reasonable user-interaction pause times (pauses need to be under 250 ms); and throughput applications that need to avoid a timeout (common ones are 5/10/30 seconds, usually because of proxy or comms issues at these boundaries). So for me these give slightly uneven boundaries of 0.025, 0.25, and 2.5 seconds.

Complicating this is that the young-gen times are STW, but as we've seen, the old-gen ones are not necessarily, so we'd get some high values that don't actually matter. I'm fine with no histogram. If there's a single bucket, I'd go for 250 ms.
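The classification above can be sketched as a small, self-contained example (the sample durations are made up for illustration): with N boundaries there are N + 1 counts, the last being the overflow bucket for anything slower than the largest boundary.

```java
// Illustrative sketch: counting observations into the boundaries proposed
// above (0.025 s, 0.25 s, 2.5 s). Bucket i holds values in (b[i-1], b[i]];
// the final bucket is the overflow (> 2.5 s).
public class GcBuckets {
    static int bucketIndex(double seconds, double[] boundaries) {
        for (int i = 0; i < boundaries.length; i++) {
            if (seconds <= boundaries[i]) return i;
        }
        return boundaries.length; // overflow bucket
    }

    public static void main(String[] args) {
        double[] boundaries = {0.025, 0.25, 2.5};
        long[] counts = new long[boundaries.length + 1];
        // hypothetical GC durations in seconds
        double[] observed = {0.004, 0.018, 0.060, 0.110, 0.900, 3.2};
        for (double d : observed) counts[bucketIndex(d, boundaries)]++;
        System.out.println(java.util.Arrays.toString(counts)); // [2, 2, 1, 1]
    }
}
```

A non-empty overflow bucket is the "show me long GC events" signal discussed earlier, while the lower buckets separate inter-service-latency pauses from user-interaction-scale ones.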
Sound advice - maybe this should go into the documentation if we decide to go for a single bucket? In fact, a general writeup of the consensus would also be helpful.
Might be pedantic, but if there's a single boundary then technically there are 2 buckets, right? The data above the boundary and the data below it. In any case, I appreciate the pragmatism of this discussion. I do think that @jackshirazi has slightly better numbers (and reasons for choosing them) than @trask's 0.1/1/10 s. I especially think it's important to have something in the lower tens of millis, in part because the threshold of human visual perception is around 12 ms. I'm +1 for 0.025, 0.25, and 2.5 s.
How important do we think limiting buckets for cost is? This has got me wondering about using the same buckets as HTTP durations, since they give nice coverage of the range of interesting timings discussed above.
I don't want to be too picky, but what is the point of having a boundary of 0 if we know the values we observe cannot be negative? |
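A tiny check of this point: under the usual explicit-bucket convention, a leading 0 boundary creates a first bucket covering (-inf, 0], so for a duration metric whose values cannot be negative, only exactly-zero observations can ever land there; every positive duration falls in bucket 1 or later. Sketch (boundary values are illustrative):

```java
// Illustrative: a leading 0 boundary yields a bucket (-inf, 0] that only
// exactly-zero observations can occupy when values are non-negative.
public class ZeroBoundary {
    static int bucketIndex(double value, double[] boundaries) {
        for (int i = 0; i < boundaries.length; i++) {
            if (value <= boundaries[i]) return i;
        }
        return boundaries.length;
    }

    public static void main(String[] args) {
        double[] withZero = {0, 0.005, 0.01}; // leading 0 boundary
        System.out.println(bucketIndex(0.001, withZero)); // positive -> bucket 1
        System.out.println(bucketIndex(0.0, withZero));   // only 0 lands in bucket 0
    }
}
```

So the leading 0 mostly costs payload, with the caveat that durations rounded down to 0 (e.g. sub-millisecond events reported at millisecond resolution) would still land in it.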
Hm, I'm not sure. I just opened #298 to get more attention to this question.
Proposal:

Reasoning:
I'd rather go with the original. I wouldn't block the proposal for more buckets, but I do think less is best in this case.
Currently the jvm.gc.duration histogram has this bucket definition:

Opening this as a tracking issue since it has come up as an open question from the semantic convention working group.