-
Notifications
You must be signed in to change notification settings - Fork 895
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Referendum on Histogram format #1776
Comments
@jmacd thanks for your effort and your comparison. I just miss a discussion about the space-efficiency: One of the main reasons to use histogram-like sketches for the summarization of distributions is that they provide strict error guarantees when estimating quantiles. (This is in contrast to other sketches like KLL or t-digest). If you want to estimate a certain quantile, histograms allow you to determine the bucket which contains that quantile value. The corresponding bucket boundaries will obviously bracket that quantile and can be used for estimation. The maximum relative estimation error is therefore dependent on the maximum relative bucket width (relative distance between subsequent boundaries). Assume you want to estimate quantiles with a relative accuracy of 2.5%. This would roughly require a maximum relative bucket width of 5%, if the mid-point of the bucket is used as estimate. When using the base-2 solution, you would therefore have to choose To have the same maximum relative error when using OpenHistogram, you would have to choose |
We also think that computation cost and space efficiency should be driving the choice of the bucket index calculation, and in this regard, a base-2 solution makes more sense to us. It is also not very clear to us why a base-10 approach is more tailored for environments without floating point processors. Base-2 log-linear bucket indexes can also be computed without floating point operations (using the number of leading zeros and the trailing part of the binary representation of integers, as we use the exponent and the mantissa of the binary representation of floating-point values). Unless I'm mistaken, that is essentially what HDR histograms do. In addition, while a base 10 might be more intuitive, we do not think that end users, especially non-technical people, who see the output of such histogram-like sketches through the quantiles that they produce, or the visualizations that are built from them, should have to reason about it. Most visualizations (histograms, heatmaps, etc.) generally require bucket remapping anyway (e.g., into constant-width buckets), and this can be achieved satisfactorily using an appropriate remapping method (e.g., allocating counts based on bucket overlapping), independently of the internal bucket index calculation. Even when interfacing with the bucket index mapping is necessary, we found out that a proper bucket mapping abstraction (which properly exposes the bucket index a value belongs to, the boundaries of a bucket of a given index, etc.) allows abstracting away the internal index calculation method. |
Space Efficiency
This is a good observation. I just re-ran the math and come to the same conclusion (cf. tables below). In principle the relative accuracy is inverse-proportional to the bucket count in 1-100, irrespective of the exact bucket layout (more buckets => more accuracy). The phenomenon we are running into is that the relative size of the bucket at the low end of the linear scale is much larger than on the high end of the linear scale: (1.0 .. 1.1) has ~10% the size of the bucket boundary, whereas (9.9 ... 10) as ~1%-the size of the bucket boundary. Hence OpenHistogram needs more buckets to bound the worst-case error. OpenHistogram and Floating Point
There are two properties that make inserting data into OpenHist work well without floating-point math:
Here is a pointer to the C implementation: https://github.com/openhistogram/libcircllhist/blob/master/src/circllhist.c#L793 Base-10 and Human Readability
The most significant advantage for the end-users of using base-10 (and linear sub-binning) is that you get exact upper/lower counts for the human-readable thresholds. This is relevant for, e.g. for Latency SLOs, where you want to know which percentage of requests was faster than a human-set threshold (e.g. 100ms) over the last month. With Accuracy Tables@jmacd -- If you add those error columns to the tables in the description, I'll remove these tables here. OTEP-149
OpenHistogram
*) for the worst-case bucket with boundary |
As consideration for histogram formats is evolving here, both for new formats and existing ones, it's probably worth looking at HdrHistogram and it's established data structure and wire-form as one possible format to use. HdrHistogram has been around for roughly 10 years now, as a pure (and permissive) OSS thing, and is currently supported with community-created implementations in Java, JavaScript, Python, Go, C/C++, C#/.NET, Rust, and Erlang. It has evolved from a super-fast log-linear Java implementation to a common data structure and compact (compressed) wire format that is shared across multiple language implementations. It has support for both integer and floating point histogram forms, as well as sparse or packed in-memory representations in various implementations. The wire format is specifically decoupled from the in-memory representation, allowing end points to make local tradeoffs of e.g. speed vs memory footprint (e.g. a packed histogram and a non-packed one have the same wire format), or recording time vs iteration/quantile calculation time (packed representations are slower to record into, but faster to iterate and compute on). The wire form itself has proven to be quite compact for real-world applications, and it's BLOB is commonly shared/stored in single (Base64 or binary) string when not being actively recorded into (see example in e.g. a jHiccup log where each line represents a fully recorded histogram capturing 1 second of (~1000) recorded samples). It would be interesting to see what the footprint (in bytes in a wire form used in e.g. a database or data transmittal) looks like for the various histogram format choice alternatives, as well as what basic metrics like recording speed/cost, memory footprint, and integration time look like. |
I did a recording speed benchmark a month ago. The results show that the proposed base-2 solution (RecordingSpeedBenchmark.recordDynaHistDynamicOpenTelemetryExponentialBuckets) can be as fast as HdrHistogram.DoubleHistogram. If a static array is used (RecordingSpeedBenchmark.recordDynaHistStaticOpenTelemetryExponentialBuckets) it is even faster. As always, benchmark results need to be used with a lot of caution. @giltene, please help in improving this benchmark in case you think I should configure DoubleHistogram in a different way to have a more fair comparison. Regarding space efficiency, log-linear bucketing has an inherent space overhead and needs roughly 44% more buckets to limit the maximum relative error over a certain value range compared to an ideal bucketing as we have already discussed 6 years ago (see HdrHistogram/HdrHistogram#54). Reducing this space overhead was actually one of the main motivations to develop DynaHist (https://github.com/dynatrace-oss/dynahist). |
I’ll take a look at the benchmark. But regarding space efficiency, I think that space spent in the wire form (serialized) and in various in memory implementations (packed vs flat) would lead to very different comparisons. The wire forms of HdrHistogram, for example, takes advantage of the sparse nature of real world data to dramatically reduce footprint while still maintaining the same bounded value error characteristics. Using a combination of ZigZag, Run Length encoding, and compression, the actual BLOBs transmitted and stored are extremely compact , to the point where they can (literally) be tweeted. The same concept is used for the in-memory packed forms (see e.g. PackedDoubleHistogram). These techniques are obviously not limited to HdrHistogram, and I assume that similar packing and compression techniques can be used for other forms of log-linear and exponential (small base, as in 1.01) histograms, but the post-packing footprint differences between the base layouts will likely be insignificant, and the work on this in HdrHistogram is already done. |
Some thoughts on comparing possible encodings and formats:
|
Thank you for joining this discussion, @giltene! The HDR Histogram documentation focuses a lot on the Java API, so it's not easy to find a specification for the wire format. Because OTLP uses Google protobufs, OpenTelemetry will (I assume, speaking for myself) prefer to the interpret histogram data in a |
I've corrected the link to OTEP 149: https://github.com/open-telemetry/oteps/blob/main/text/0149-exponential-histogram.md |
@jmacd a specification of the integer histogram encoding can currently be found here as part of the .NET implementation. DoubleHistograms are not specified there, but since DoubleHistograms use integer histograms internally, they differ from the integer variant only in the cookie and header parts [an implementation can be found here, and yes, a common spec would be better]. As to interaction with protobuf and/or other storage/transmission formats, I'd expect histograms to show up a a "histogram BLOB" in any such protocol. with a specified internal format, and for there to be histogram encoders/decoders in various languages for whatever the format chosen is. There are many advantages to doing so (detached from the encompassing protocol), including a dramatic footprint impact, and portability across tools and such. |
Space costLog linear costs more than exponential at the same quantile relative error. And the larger the base, the higher the cost. As pointed out in other comments, at the relative error of a few percent, base2 log linear costs 44% more than base2 scaled exponential, base10 log linear costs 3x more than base2 scale exponential. On the client side (an app being monitored), 1kB vs. 3kB might not look like much in today's systems of Gigabytes of memory (even on a cell phone). But on the server side (the backend that stores and queries the data), with billions of metric data points, 3x memory is a lot of $$$. Cpu costWhat is most important is the value to bucket mapping, which is in the critical path of inserting a value to a histogram. Other parts, like bucket management, are largely portable across different mapping methods. For dense arrays, bucket addressing can be as simple as array indexing. Base10 log linearFor base10 (OpenHistogram) value to bucket mapping, there is a significant cost of binary to decimal conversion, because input is always binary, for integer and floating point. An example can be found at int_scale_to_hist_bucket(), where a loop of integer division is used (https://github.com/openhistogram/libcircllhist/blob/master/src/circllhist.c#L807). And in double_to_hist_bucket(), where a log10() call is used (https://github.com/openhistogram/libcircllhist/blob/master/src/circllhist.c#L831) Base2 log linearBase2 log linear is very fast, when using exponent and mantissa from the native “double” representation. An example is https://github.com/DataDog/sketches-java/blob/master/src/main/java/com/datadoghq/sketch/ddsketch/mapping/BitwiseLinearlyInterpolatedMapping.java#L60, which is essentially just a few bit wise operations. Base2 scaled exponentialBase2 scaled exponential can be made nearly as fast as base2 log linear. As shown in this example from @oertl lookup table
Here “indices” and “boundaries” are precomputed arrays of 2^scale entries each. The core computation involves a few bitwise operations, plus 3 array lookups. The only method that is faster is base2 log linear. But that costs 44% more buckets at the same relative error. Other methods in the log or log linear family won’t even get close. I had previously thought about this and came up with a similar method, where the arrays have 2^(scale+1) entries, so that each linear bucket represented by the arrays spans at most 2, instead of 3 log scale buckets, so that it is not necessary to look at boundaries[i + 1]. This further reduces cpu cost, at the expense of static array size. An example of the enhanced method (equivalent to computing "k" in the example above):
The lookup methods essentially use precomputed lookup tables to quickly map a value from a linear bucket into a log bucket. At the scale of interest, where relative error is a few percent, lookup array size is totally manageable (the arrays are static arrays shared across all instances in a process).
Log linear has long been used as a fast approximation of pure log scale buckets. But now with lookup tables, pure log scale can be computed nearly as fast as base2 log linear, yet cost 44% less memory. And it is easy to understand and implement. Everybody understands log scale, and when performance is not critical, a simple implementation can call standard log() function to avoid the lookup table logic. Use casesLet's consider the pros and cons of the options in various use cases: Histogram chartingHistogram charting in linear scaleIn this case, base10 log linear does have the advantage of bucket bounds aligning up with decimal numbers. However, at the relative error of a few percent, hundreds of buckets are needed even for a small range of 1..100. With this many buckets, individual buckets are hard to see on a curve. Unless a user is “bucket peeking” (as in pixel peeking on an image), they will hardly notice the underlying buckets’ bounds. This comes to the most significant advantage of a base10 histogram: more human friendly when a human is actually eyeballing the individual buckets. But this use case is more limited to scenarios like debugging than a common end user use case (Can you imagine a business person scrolling through hundreds of histogram buckets?). Histogram charting in log scaleIf the chart is rendered in log scale, a pure log scale histogram such as scaled base2 actually has advantage over a log linear histogram, because a pure log scale histogram’s bucket are of equal width on log scale, while log linear buckets are not. SLO measurementA typical SLO is like: 99% of requests must be processed in less than 100ms.
For method 2, base10 log linear can give the exact answer when the target ms falls on bucket bound, which is expected to be common because people set the target. But this method is less intuitive in practice. Suppose the SLI says 97% when SLO is 99%. You know 2% of requests are longer than desired, but not how much longer. And it is hard to explain to the user the precision of the system. Business people are not going to like a statement like “if the target ms is on the series of 100, 105, 110 …, 1000, 1050, 1100 …, the result percentage has zero error, otherwise, the error is variable (data dependent)”. Mathematically, “variable (data dependent)” is the accurate statement. If the requested ms falls into a bucket of 5% total population, the error is +/- 2.5% (absolute error on percentage, assuming the query returns the mid point of the bucket). If it’s in a bucket of 10% population, the error is +/- 5%. For this reason, you might as well just say there is no error guarantee. Alternatively, you can return two numbers, both with zero error, like “the percentage of requests faster than 103ms is between 91.62% and 97.34%”. But the user friendliness of such a system is not going to be good. Method 1, the percentile method is more user friendly. For example, SLI says 99% is 127ms, when SLO is 100ms. You know you are 27ms away. In this use case, it is important to give the user the upper bound of relative errors. For example, if the histogram has an error bound of 2%, the user knows that 127ms is meaningful against the 100ms target. Closing thoughtsSurvey questions are full of pitfalls. If you ask users: would you like your histogram bounds to be on decimal? They are going to say yes. But if you add: for 3x more money, many will say otherwise. And even when money is not an object, decimal comes with log-linear scale, which has its own pros and cons. Back to the beginning: Another lesson I learned from that PR is that log linear is harder to explain. You would think “linear subbuckets in log buckets” is easy to understand, but I had to explain it to people again and again, while everybody instantly understood log scale histograms. Log scale (exponential) histogram is mathematically well understood and with many implementations. As long as an implementation can set the base, it can produce scaled base2 histograms (just set base to a base from the scaled series). |
On space cost: I think once should compare the "packed form" space costs rather than the "faster flat array" forms. By "packed form", I am referring to in-memory or wire formats that take advantages of the sparse nature of the data and of the tendency to not need to the full fidelity of e.g. 64 bit counts for each element. On base2 vs base10: I see no real benefit for base 10 when a value error level is expected. The values captured by either will be the same to within the value error level (of e.g. 1%), and for every case where someone is "happy" with a record 10.0 showing up as exactly 10.0 there will be someone who is "upset" with some other number not exhibiting that quality. Since real world recorded numbers don't tend to align to precise powers of 10, the nicety of base10 is not likely to be relevant IMO. |
BTW, for details about the packed arrays I've designed to hold in-memory representations of log linear histograms, see e.g. the external API here and detailed comments about the internal structure here. While these packed arrays (useable for both concurrent and single threaded operations) were designed in the context of HdrHistogram, I believe they can be generically used for other things that need an "...array context is optimized for tracking sparsely set (as in mostly zeros) values that tend to not make use of the full 64 bit value range even when they are non-zero." As such, an exponential histogram implementation that normally maps it's internal math to a flat physical array can probably be enhanced to use the same packed array implementation to provide a virtual array of the same size, achieving nearly identical packing levels and footprints. |
One more thing to compare is the scale interval. For scaled based 2, it is regular, with base[scale + 1] = squareRoot(base[scale]), where "scale" is continuous in integer range. For base10 log linear, the "precision" series is uneven, ranging from 2x, 3x, to 5x change from one to the next. In general the series alternates between 2x and 5x to keep the 10x decimal ratio: 9, 18, 90, 180, 900, 1800 ..., in order for any two precisions to be mergeable. This gives 2 uneven stops between each 10x of precision. The resulting error series is roughly: |
@giltene Shouldn't the bucket mapping on the one hand, and the internal/wire representation and tracking of bucket counts on the other hand be two distinct problems that we can independently optimize for? Correct me if I'm wrong, but as is the case with DDSketch or DynaHist, it seems HDR histograms also involve mapping the input value to a bucket index, then updating the count at the given index. I agree that there is often more to gain by optimizing the bucket count representation rather than the index mapping itself (as we experienced in DDSketch), but in any case, it seems to me that any implementation would benefit from a non-negligible 44% space efficiency uplift thanks to a more optimal index mapping that would minimize the number of distinct indexes to keep track of. @yzhuge For exhaustiveness, even though we may not want to favor them if simplicity matters, we could also mention polynomially interpolating mappings (log-linear is a special case). They offer better space efficiency than log-linear at a slightly higher CPU cost. Their CPU cost may be higher than the logarithmic method with precomputed sub-indexes though.
I believe that any method that involves bitwise operations using the exponent and the mantissa from the binary representation of a floating-point value should also work on integer values without conversion by using the number of leading zeros (many architectures natively support computing it) and the first non-zero bits of their binary representations. |
I think there are two separate questions here bundled into the choice as presented.
Independent of performance considerations I believe in general customers would
In terms of performance the conversion from user space to histogram space happens on the end client. This may matter to some users but only really if they are recording around of a million measurements per second (at least on mainstream architectures) . As discussed above, at least for base 2, there are optimization techniques that can be used to make log about as performant as log linear in this regard. On the storage and query side I think the number of bins that have to be stored and queried will matter to end users as they will in the end have to pay for it. It is worth bearing in mind that current time series storage formats are using a couple of bits per datapoint for counters and gauges, storing 40% more bins per histogram datapoint is going to be significant. Between the two choices presented I believe customers would prefer log histograms with base-2. |
@CharlesMasson about bucket mapping being a separate consideration than internal/wire representation: “yes and no”:
Packed representations do come at a cost. But for context (in some trivial one-shot u-bench testing), the current (HdrHistogram) PackedDoubleHistogram seems to clock in at ~18M recordings/sec on a single thread on a 2.4GHz x86 cpu, on the same data test that DoubleHistogram clocks in at 90M/sec for. So a 5x hit in CPU, but still super-fast. In tests that include a single writer Recorder pattern (which adds overhead for coordinating double buffering, but is the most likely pattern to be actually used by performant recorders) this packed representation overhead ratio drops to ~2:1. So “comes with some very real overhead, but still darn fast” is how I’d describe it. And unless you actually record ~1M value per second per thread, they will probably all seem the same. With a memory footprint savings of 1-3 orders of magnitude (data set dependent, precision dependent), packed in memory representations are likely to be the ones more commonly used over time IMO, and this is probably true regardless of the index mapping scheme chosen. Assuming this (packed representations dominating over time) is true, and assuming my assertion that in packed representations of the different index mapping schemes at similar value error levels the physical footprint becomes very similar is also true, memory footprint is not going to be the main consideration in choosing the format, because the (packed, physical) memory footprint will end up being very similar for the various alternatives. |
@oertl and @pirgeo I've been looking at the prototype mentioned here: open-telemetry/oteps#149 (comment) I would like your assistance with a bit more discovery into the direct use of the IEEE floating point representation to compute a histogram bucket. Suppose I want to create a histogram aggregator that uses fixed Suppose I give you a single 64-bit floating point value, the first to be inserted into the histogram. I would like to see (in pseudo-code, at least) the algorithm for determining:
The goal is to show that this can be done using the IEEE representation plus primitives to compute the number of leading and/or trailing zeros in a value. The next step (again, asking for pseudo-code) would be to handle new values:
I feel that the above pseudo-code will demonstrate that the proposed base-2 exponential scheme is flexible enough to support both high resolution and low resolution and make use of the IEEE representation to its advantage. Would you help flesh this out? |
@open-telemetry/specs-approvers To summarize the discussion above, we have an OTEP and strong support for exponential histograms with base-2 bucketing. There is ongoing consideration into two log-linear alternatives, one base-10 (OpenHistogram) and one base-2 (HDRHistogram). We are also trying to end a debate about sparse vs dense encodings. This was discussed in the Tuesday July 6 Metrics SIG meeting and we concluded with a call to see prototypes. The DynaHist prototype (mentioned in my preceding comment, above) will be presented next Tuesday, and I've asked for pseudo-code to help demonstrate the benefit of choosing base-2 for low-resolution histograms. Our goal is to choose one "family" of histogram encodings, and community support is what matters most. We have several community supported Histograms we could choose, but the community that matters most in this decision is Prometheus. We are not ignoring the work done on sparse, exponential histograms in Prometheus here, but the communities have been mostly separate. I would like to invite @beorn7, who has been working a branch that is compatible with the base-2 exponential histogram in OTEP 149. @beorn7 would you please post a copy of the current |
Most people involved here probably already know, but just to be sure: The approach of adding sparse dynamic high-resolution histograms to Prometheus is described in a design doc, which also explains why Prometheus (with its model of stateless scrapes) cannot just use an existing histogram implementation as a drop in. (But general principles of implementing histograms still apply, of course.) The current implementation progress (which is still early PoC state) can be followed in the
Now the latter is a bit delicate: Prometheus 2.x dropped the traditional protobuf scrape format, even though I had vague plans of better histograms for Prometheus using protobuf already back then. (If you want to have a laugh listening to a grumpy old man complaining about the course of history, watch my FOSDEM talk about the subject.) For the PoC, we decided to resurrect the old protobuf format and extend it for the new sparse high-res histograms. That shall not imply that this old protobuf format is coming back. More likely, a future or extended version of OpenMetrics will accommodate the new histograms, maybe just in its protobuf version, or we will find a good way of representing them in the text-oriented format, too. Long story short: The semantics of the current PoC exposition format is the intended one, but the details are definitely not the final form. Having said that, these are the relevant new lines in
As explained in the design doc, this attempts to represent a (moderately) sparse histogram in a form that is both compact and easy to en-/decode. It is not meant to be an explicit representation that would be easy to parse for humans. So far, the experience in the prototype are quite good. I understand, however, that the trade-offs for OTel might be quite different. What I have seen so far in the OTEP looks like it will be easy to convert back and forth. Also note that it's our intention to keep Prometheus open for other bucket layouts (marked by currently unused values of the |
@yzhuge I am interested in your thoughts on #1776 (comment). It looks like the protocol buffer messages written by @beorn7 are compatible with the ones you wrote, with a new detail about the width of the zero bucket. Having worked on code that converts from explicit-boundary to exponential-boundary histograms, I support the notion that the zero bucket has finite width, being the bucket between the smallest positive and negative boundaries. A little bit of text will have to be written about how to merge histograms with different zero bucket widths. One reason for my earlier question (#1776 (comment)) is that I want to understand what the limits (on scale and bucket index) imply for the zero bucket. |
Scaled base2 exponential histograms can handle any small number (or large number) encoded by double. When number of buckets exceeds configured max, it is preferable to downscale to preserve constant relative error across the full range, rather than to "cut off" at the high end (overflow), or at the low end (underflow). A zero bucket with a width like [0, x] is effectively an underflow bucket. I prefer not to allow such buckets in OTel standards. Overflow/underflow buckets open a can of worms. They are often put in as a means to limit number of buckets in producers not supporting downscaling. In scaled producers, scaling is preferred over overflow buckets. Another reason to favor scaling over cut off is that it is hard to decide the overflow and underflow threshold. In one app, .001 may be a good underflow threshold, yet in another, 10E-9 would be. So the threshold runs counter to our goal of no/minimal config. In contrast, the scaling approach is totally automagic. To convert histograms with overflow buckets into pure log scale (no overflow buckets), we will have to twist bucket semantic. A bucket like (0, x] or [x, +Inf] can be treated as [x, x] (all values exactly at x). The changes semantic of the bucket. But quantile calculation is goofy (with unbounded error) in the overflow/underflow range anyway, whether it is computed from the original histogram or the converted one. An example of such calculation can be found at https://prometheus.io/docs/prometheus/latest/querying/functions/#histogram_quantile @beorn7 what do you think of not using underflow bucket (of course, we will always have a bucket for values exactly at 0)? |
@jmacd @yzhuge I think that kicking around the question of whether or not the notion of a zero bucket width is actually needed may be useful before settling on things. Specifically, in my previous experience, I found that auto-ranging eliminates the need for specifying either teh covered range or a zero-bucket width, allowing for a truly zero-config (other than precision, which could default to e.g. +/-1% or better) histograms. The notion basically auto-ranges (as values are recorded) e.g. a 1 : 2^55 dynamic range within the wide 2^-1023 to 2^1023 range possible through Double, it is also very performant, as the number of auto-ranging transitions is capped (to no more than e.g. 55 in a 1:2^55 range) and each transition is fairly cheap (only needs to actually touch a precision-capped set of buckets). In effect, the width of the zero bucket auto-adjusts according to the lowest non-zero value actually experienced. Auto-ranging can probably be logically separated form the choice of log-linear vs. exponential, and implemented for either. In my log-linear (powers of 2 for both log and linear) implementation in HdrHistogram, auto-ranging is achieved by logically shifting the underlying integer array value representation. Logical shifting is effectively free for all values except those recorded in the the lowest logarithmic half-bucket in a left shift operation. Power-of-two log-linear representation made the scaling (logical shifting) operation needed for auto-ranging both easy and fast, but there is probably a logical equivalent that can be used in the other representations. For an example of how auto-ranging logic works, you can look at the autoAdjustRangeForValueSlowPath logic in HdrHistogram, which uses shiftCoveredRangeToTheRight and shiftCoveredRangeToTheLeft to adjust the covered range. You can drill down from there for details, but the "interesting" specific logic for handling for the lowest logarithmic half bucket in a left shift can be found here. |
Yes, that's why you can configure it if need be. |
I am sorry to say that I don't have the capacity for a detailed discussion here about what Prometheus should or shouldn't do now. I also don't think that's the purpose of this issue. We (Prometheus folks) are merely implementing a PoC right now, based on plenty of discussions we had earlier. At some point, we just had to draw a line and start trying things out. Based on the experiences from the PoC and on whatever people might bring up in the meantime, things can be revisited extra carefully before anything is released into production with stability guarantees. But right now, we would just like to implement the agreed design and see how it works. Personally, I think a standard should be built upon something that has already proven to work well in real-world production scenarios (rather than creating a standard on the drawing board, for which implementations are created after the fact). From the Prometheus side, we aren't even there yet (when it comes to sparse high-res histograms), so I would anyway recommend to take anything we do right now with a grain of salt. I'm happy to provide pointers to what we are doing, in case it helps OTel to sketch out their path, but I cannot reboot the design process at this point. |
When we are talking about numbers in the range of E-308 (double min) or E-39 (Prometheus noise reduction threshold), we are in hair splitting land. I propose a no-cost fuzzy solution: |
I am trying to find a middle ground on this issue about the zero bucket, and I think there is one. @beorn7 wrote
@yzhuge wrote
@giltene wrote
For me the middle ground is:
In other words, we can specify that the zero bucket should be used for values that the producer considers effectively zero subject to their own threshold. @beorn7 it seems we all agree that producers should decide their zero threshold. You didn't explicitly mention why the protocol should encode that value for the consumer to see. The consumer will see the smallest value that is observed greater than the threshold, which conveys information about the tolerance anyway. Is there a great loss of information by not conveying the zero threshold? |
Thanks to @pirgeo for presenting the DynaHist algorithm mentioned above (#1776 (comment)) in today's Metrics SIG meeting. He presented a version of the same code written in Go, here. |
@giltene I am interested in finding a way that OpenTelemetry SDKs can take advantage of HDRHistogram implementations and generate a format that is compatible with the exponential sparse histogram encoding by @beorn7 shown here. I read over the HDR encoding description mentioned here and the Go implementation here. It seems that common HDRHistogram implementations can be configured to compute pure exponential histograms by selecting the appropriate precision for the maximum range. It takes a few reads through the The Java HDR implementation is more sophisticated and also supports auto-ranging. If we aim to support auto-ranging (which appears to be an extremely popular idea), then it appears reasonable to drop the linear component of the bucket index, since the auto-ranging is done using exponential scale. To me, this suggests that HDRHistogram implementations could be adapted to output to proposed format with little effort. @giltene let us know what you feel will be lost to produce the following bucket encoding (which is @beorn7's encoding without the zero bucket width)?
|
@jmacd
|
It's certainly true that the zero bucket will never contain any observations larger than lower bound of the bucket closest to zero (and mirrored correspondingly for negative observations). However, with a sparse encoding, there might be many many empty buckets in between the zero bucket and the populated bucket closest to zero (which is in fact the most common case in our PoC studies so far, if the zero bucket is used at all). So while your assumption is correct, it also overestimates the size of the zero bucket. That's probably fine in cases like DDSketch where (IIRC) expanding the zero bucket is used as a strategy to reduce the bucket count, but it hits the other end of the spectrum, with the most extreme case being that the zero bucket only takes observations of precisely zero (as also suggested here). |
Continuing on my [slightly edited/updated comment above]
One can debate the precision levels needed, and if +/-0.5% error rate is needed. I find that to be the most commonly stated level. It "feels high resolution" and meets this intuitive, seems-to-make-sense to humans level for "the value you get out will be no more than 1% away from the value you put in". Somehow people tend to accept that much better than "no more than 1.5625% away from each other"... It seem to be "a strange number" and "not as good as 1%" (their mind for not-mathematically-sound intuitive interpretation of "what decimal position will the error appear in" tend to jump to the next decimal order of magnitude, which would be 10%). Said differently, when you give people a choice of precision levels, they tend to choose numbers like 10%, 1%, 0.1%. [hence the original (and debatable) choice to state precision in terms of decimal points accuracy in the HdrHisatogram API] And when you ask what they mean by that, they generally mean that they are good dropping (or rounding) the decimal positions that are beyond the level they stated. Beyond just power of tens, there other "seem natural to humans" levels for "no more than X% apart", which tend to placate their "but the number I put in is not the same as what is reported" complaints include the "round numbers" of 2 and 5. E.g. people will often think in terms of 1,2,5,10,20,50,100,200,1000,... when looking for naturally stated exponentially increasing levels at a finer-than-10x stepping rate. This numbers seem to be easier for people to intuitively reason about and deal with (where 16, 64, or 27.1828 seem like "strange boundaries" that are "neither there nor there" to most non-bit-thinking or non-mathematicians). We can look at the number of buckets needed to provide a given "no more than X% apart" for various human0natural levels, and at the error rates provided at each: (I'm hoping/assuming my math on these is actually correct...)
As this table presumably shows, while exponential does provide a tighter maximum relative error at any bucket number (because it's maximum relative error is uniformly distributed across the buckets), the number of buckets needed to meet a typical human expectation boundary is only better at the lower resolutions (10%, 5%) but ends up being the same for no more than {2%, 1% 0.5%, 0.2%, and 0.1%} apart. [Exponential will likely win again at 0.01%, but there are few applications that go there]. Note that these bucket numbers are all in "virtual bucket count" sense. Sparse and/or packed representations (like the one being proposed above for OTEP- 149, HDRHistograms's existing wire form and in-memory packed array representations, and lots of other wire and in-0memory representations for various schemes) will mostly only "pay" for the physical non-zero-count buckets. But when the virtual bucket counts are the same, the number of non-zero buckets that result in recordings is likely to be "similar" (there will likely be differences in either direction because of the bucket boundaries not being the same). On auto-ranging and zero-bucket witdthIn any case, regardless of exponential vs. log-linear (which both have merits, and likely end up with similar footprints when exponential is restricted to whole-powers-of-two resolutions) the notions of auto-ranging (or not) and of protocol-stated-0-bucket-width (or not) are probably orthogonal (they can both be applied to log linear or to exponential). I found a need for a "lowestDiscernibleValue" in integer (values) histograms, which may seem logically equivalent to a 0 bucket width (it;'s not, it a "what does an integer unit mean" actually). But I found that in floating point (values) histogram the equivalent need disappeared when auto-ranging was introduced. In my implementations it disappears as long as you are willing to live with the limitation that a single histogram will not hold values across a dynamic range larger than e.g. 1:2^55 (for 1% precision expectations), but even that limitation can probably be removed if the underlying representation was not an integer-values histogram coupled with an integer-to-floating-point normalization factor. [the 1:2^55 range limitation at 1% precision is derived from the signed 64 bit range limitation on underlying integer values]. A way to reason about how auto-ranging eliminates the need for a zero bucket width is to consider "pure 0" as it's own special bucket (with no width). For all non-zero values within a covered dynamic range (with a value ratio within the range of no more than e.g. 2^55), the normal bucket scheme (log-linear or exponential) is used with what is presumably flat array of value counts. But 0 is special, and has it's own count (which can actually sit in the logical index 0 location in the array, as it is not used for anything else) . Auto-ranging is then a matter of shifting the logical position where the "0 index" of the array falls for all non-zero values. Auto-sizing (expanding the array range to fit a larger-than-previously-covered range) can involves moving values around in the array (in straight linear memcpy form). But since the number of times such moved might occur is limited (to the number of autosizing steps it gets to make it large enough to cover the range), these auto-sizings are not a real concern in anything other than strictly-consistent-very-low-latency-envrionments (and this can avoid auto sizing by ranging correctly from the start). |
On Jul 15, 2021, at 11:11 AM, Joshua MacDonald notifications@github.com wrote:
While these minimum values do seem "small enough to not matter" when we think in terms and units we commonly see, I don't see a necessity to impose a minimum larger than 2^-1023.... Perhaps more importantly, avoiding this limitation probably comes hand-in-hand with avoided wasted space when the entire value range being recorded is far away from the "base" we might statically choose. This os a key point about auto-ranging, which I think we should consider before nailing down the meaning of scale: In the proposed format, scale (which controls the ratio between consecutive buckets) and base (an arbitrary point in the floating point range that is assumed to fall within the covered dynamic range) are implicitly related. But there is no need for them to be, and de-coupling them has real benefit. The scale will determine the number of buckets needed to cover a given range of numbers. Various practical considerations can constrain the maximum dynamic size of a range (dynamic range) in terms of the ratio between the smallest and largets numbers represented at the same time in the same data set. But where that range (constrained or not) falls within the [2^-1023 to 2^1023] range encode-able using double precision floating point numbers does not need be limited in any way. The reason to avoid an implied base (which would dictate an implied number range) is that we don't actually know what the units of the things being put into the histograms are. E.g. some application may report response times in nanoseconds, or milliseconds, or seconds (or days, or years). and since histograms might be used for arbitrary stuff using arbitrary units (e.g. size in picowidgets or terawidgets, wavelength in angstroms or lightyears, acceleration in nanometers per second per second, or furlongs per fortnight per fortnight) there is great benefit for allowing the same dynamic range to be used regardless of unit choice. This was the original reason I build auto-ranging for DoubleHistogram: A group building a monitoring system wanted high-dynamic-range (HDR) histograms that could be used to track the value distribution of data that they did not understand or control, and did not know the meaning or units of. That reason and motivation would seem to apply directly here (open telemetry) when consider what histograms can support. Pre-determining the base [as e.g. 2^(2^-scale) in the example above] will limit the dynamic range. But auto-raging basically makes the base a function of the smallest experienced non-zero number in the data set. The base can then move to accommodate the data as it is recorded, as long as the range of values recorded in the data set does not exceed the supported dynamic range. Lets discuss this in practical terms for an exponential representation (the log-linear representation can be seen in existing code): In an exponential encoding (with a scale defining the ratios between consecutive buckets as 2^(2^-scale)) that can use an arbitrary base, supporting a dynamic range that would cover the entire 2^-1023 to 2^1023 possible value range would require (2048 * 2^scale) indexes. so e.g. at a scale of 7, a counts array that can grow to 2048 * 128 = 262144 (256K) entires can cover the entire possible dynamic value range. Auto-sizing the counts array allows the space used to be limited to the amount needed for the actual value data being recorded, and doing auto-sizing in chucks (e.g. 128 at a time) coupled with limiting the array size (to e.g. 7,040 max entires, to support a dynamic range of up to 1 : 2^55) will contain the worst case number of resize operations, along with the memcpy costs involved in that constrained number of resize operations, to the point where the implementation becomes very practical and performant. [Note that when a sparse (or packed) representation is used, the size of the array doesn't matter as much ( because it is a virtual array, and only non-zero-count entires need a physical manifestation] but it can still be good to limit the array size to contain worst case expansions on degrease data sets]. |
(There was a grave error in the output from the program quoted above, so I deleted the comment. Please disregard.) @giltene Thank you for continuing to patiently explain. I had a couple of misunderstandings about HDRHistogram's representation. I think I now understand there to be a connection between the scale=0 base-2 exponential histogram of OTEP 148 and the HDRHistogram log-linear representation. The algorithm shared by @pirgeo and @oertl appears to show how to convert between the linear portion of the HDRHistogram bucket index and the correct scale=0 base-2 exponential bucket in O(1) steps. You raised a point about the relationship between scale and base. Earlier in this (lengthy!) discussion, various participants have proposed we avoid loss of precision that results from changing the reference base between arbitrary values, which led us here. |
@beorn7 It feels like there is a consensus around the protocol you shared, except for this point about the zero bucket resolution. How would you feel if the zero bucket width were an optional part of the protocol? Users could specify that number to indicate the "smallest discernible value" if they please. When merging histograms with different smallest discernible values, what do you recommend? I would suggest we output a merged histogram with the minimum of smallest discernible values, otherwise sum their buckets. I would not make any effort to maintain the maximum smallest discernible value by collapsing some of the non-zero buckets into the zero bucket. |
Can you elaborate eon the precision loss concern? A simple way to avoid concerns around conversions or moves between arbitrary bases, and to address precision questions around e.g. merging data from two histograms that used two different arbitrary bases, is to normalize the base choices by e.g. only using bases that are powers of 2 (but to freely move between them as part of auto-ranging). If/when you do that, are there any remaining precision loss concerns? |
This is what we are trying to do by limiting the choice of base. The argument was that converting from base=2 to base=10 introduces statistical artifacts because one is not a factor of the other. |
Right. So if the base can be an arbitrary power of 2, and you keep the base <= the lowest non-zero recorded value, you end with no need for a zero bucket width. auto-ranging is then quite simple. You set the base to the power of 2 that is <= the first recorded value, and then move the base down on any “underflow” (a recorded value lower than the current base). When the base is always a power of two, and the resolution is also limited to powers of two (which it is, since scale is an integer), the is a while number if buckets between any old and new bsse level when you shift bases, Changing the base (e.g. when a recorded value “underflows” the current base) is then very cheap, and there us no “bucket splitting” issue. If there was no resizing of the underlying array storage involved, there would be no memory to move around at all (just a shift if the logical 0 position in a “circular” array. When there is resizing (which is likely in any application that doesn’t want to pre-allocate a wide enough range, because why not auto size?), some memcpy shifting of array elements is needed (but no math is needed on individual elements as they are copied). Auto sizing can happen on either end: on an underflow (which will shift the base down and add enough buckets to cover the added value range), or on overflow (no change to base, just adding buckets to cover the range). keeping the zero count “outside the array” by special-handling the actual 0 value can keep things simpler (no need to deal with its overlap with buckets). Or, as a alternative, you can “burn” (2^scale)-1 buckets (below the base) that would never get used and work the index computation math to make it naturally translate the 0 value to the logical 0 index in the array, while the non-zero values all end up in the right places (at or above the base), in order to keep the recording logic faster by reducing a conditional branch. (I do the latter, but either one would work). |
In fact, it already is optional, at least in our current PoC (with the disclaimer given above: it's just a PoC to experiment, no guarantees anything will survive into a released version). It's more by the nature of protobuf v3 that a missing zero-bucket width would be seen as a width of 0.
Again with the disclaimer that nothing like this has been implemented yet in Prometheus land, and it is all open to experimentation: I believe that merging histograms should result in the widest zero bucket among the source histograms determining the zero bucket width in the aggregated histogram, with all buckets for smaller values being merged into it. IIRC this is a crucial idea behind DDSketch (and possibly other histogram implementations with zero buckets of finite width). |
No, that was it. The appeal of fixing a family of bases to repeated squares of each was influenced by UDDSketch. It seems like the HDR histogram techniques that you developed all apply to an exponential histogram, but to literally use HDRHistogram we would all have to adopt the log-linear encoding. I was looking for help defining the pseudo-code for an auto-ranging exponential histogram here. |
I am working on a prototype for a "scaled base2 exponential histogram", as an implementation of https://github.com/open-telemetry/oteps/blob/main/text/0149-exponential-histogram.md. Basically it has two rules:
For simplicity, the prototype rounds subnormal numbers (5e-324 to 2e-308) to 0. So it is meaningful for OTel to state that the zero counter may include numbers rounded down to 0. But I lean against recording the rounding threshold, even as optional field in OTel protocol. First the question is how are you going to use it, if you have it in the protocol? Besides showing the threshold to the user in some particular UI, there is not much use. And it actually creates confusion. For example, if it is larger than the lowest bucket, what does that mean (overlapping bucket boundary?)? And since protobuf optional fields default to 0, when it is not set, it looks like the threshold is 0, but in fact it may not be (the producer neglected to set it, or the producer does not know what it is). Then we may need a value for "unknown" (use "NaN"?). Tracking the threshold opens a can of worms. For now, I think we should just treat the threshold as part of floating point processing rounding error on the producer system. The OTel protocol does not attempt to report such error margin. |
Sounds great, @yzhuge! |
"New Relic Sketch" (NrSketch) is now open source: https://github.com/newrelic-experimental/newrelic-sketch-java
|
I feel this issue has been open long enough to gather all the feedback we are going to gather. Having read through and sat with this debate for a while, I've come to the conclusion that my original premise was incorrect. OpenTelemetry might be better off in some ways for having a single Histogram in its protocol, and the vendors and the collector code base would appreciate that, but the users and the community are not better off for us forcing a single kind of Histogram to be used. There are more than one good kind of Histogram and we have definitely chosen one of them here. The base-2 exponential histogram is favored by many in this discussion for its conceptual and mathematical ease, compared with log-linear histograms. We have seen 3 prototypes of the base-2 exponential approach, written by @oertl, @beorn7, and @yzhuge. On the other hand, there are two well-known log-linear histograms with open-source implementations that are already in use for telemetry applications. There does not appear to be a dominant Histogram, neither OpenHistogram nor HDRHistogram, and both are great at what they do and reasonable choices for use in a telemetry system. I believe we should move forward with OTLP protocol support for base-2 exponential histograms, and I also think we should add support for a log-linear histogram protocol that can capture both OpenHistogram and HDRHistogram. I have opened a draft of the base-2 exponential approach here for review, copied directly from @beorn7's proposal. I will close this issue and open new issues in the OTel-proto repository to capture work needed for log-linear histogram support, assuming others agree with this direction. Speaking as a vendor (Lightstep), we are happy to accept any high-resolution histogram format included in OTLP. We see a lot of promise in the base-10 log-linear OpenHistogram approach, because standard units conversion (e.g., milli-seconds to seconds) and threshold queries (e.g., percent of latencies less than 100ms) are exact. We are not concerned with errors introduced by histogram conversion, because those errors are relatively easy to control. The real applications for this data that we have are not impacted by single-digit percent errors in individual buckets. From an SDK perspective, I believe we should return to an earlier proposal, #982, which is to accept any histogram as a viable default for OpenTelemetry SDKs. This allows OTel SDKs to release Metrics SDK support using OpenHistogram, HDRHistogram, or one of the new base-2 exponential offerings that we expect based on the prototypes we've seen. |
@giltene To be a little more concrete, what I was thinking about for log-linear protocol support goes back to @yzhuge's original proposal for a variable-base variable-sub-buckets log-linear histogram. OpenHistogram can be described with base=10 and sub_buckets=90 and perfect-subsetting variations with sub_buckets in {3, 9, 18}. HDRHistogram can be described with base=2 and sub_buckets in the powers-of-2. The basic protocol appears simple to capture. Speaking as a vendor, we are happy to contribute code that may be used in the OpenTelemetry collector to convert between any of the supported OTLP histogram formats. We already have such code in production and I will be glad to open-source it. |
@jmacd putting my HDRhistogram hat aside, I had been converging on an exponential representation "making perfect sense" for the subject matter here, and mostly looking to apply lessons learned from other implementations that can help avoid adding restrictions early, especially when it comes to configuration assumptions. E.g. I think that:
To the question of "which scale"? I actually think of exponential auto-ranging as "scale-less", since the "bucket_lower_bound" and the "base" items are actually redundant concepts. There is a lower bound and a higher bound to the the value range captured in the histogram, and both can be well below or well above some arbitrary fixed "base". The "base" it self has no real meaning, and the lower bound can be used as the "base". Or conversely that "base" can be used as the lower bound, but [when auto ranging is enabled or used] can mutate as values are recorded. For consistency (i.e. no loss of precision when merging histograms under the same representation and resolution), we need bucket boundaries to align across histograms. All that is needed for this is some agreed upon limitations/quantizations on available resolutions and lower bound boundaries. And there I tend towards using integer powers of 2. I.e. accepting the limitation that resolution is stated only with integer powers-of-2 levels, and that bucket lower bounds must be a integer powers of two (e.g. 2^-103 is a valid lower bound, and so is 2^79). So I would call the scheme "exponential powers-of-2-aligned", rather than "exponential with base=2". Keeping configuration to a bare minimum is a good and already-stated goal. The only (unavoidable?) configuration item I see as a common denominator to all "hi res" histograms is the need to specify required precision (spelled "resolution" and implemented by specifying "scale" in the discussed proposal, and spelled otherwise in the various other implementations).
When it comes to interacting with various existing histogram implementations (including various log linear, exponentials with bases or resolutions that do not align with this one, etc. etc.) we must accept some loss of precision during merging or importing data from such histograms. While individual histogram implementations can probably preserve values with no [additional] loss of precision when copying or merging histograms within the same implementation, some precision loss across unaligned implementations will be inherent due to the bucket boundaries differing. This is likely not a major concern (e.g. merging two non-same--implementation histograms which both provide +/- 0.5% relative error guarantees will have the expected behavior of increasing the relative error to "up to +/- 1%" due to the merge). To be clear, HdrHistogram is not "log-linear" to its users. It just [currently] uses an underlying log-linear implementation. HdrHistogram is HDR (High Dynamic Range) and sets precision and ranging behavior expectations through its APIs. I can easily envision evolving HdrHistogram's internal implementations and representations to some future v3.0 that would shift from log-linear to "exponential powers-of-2-aligned" with no change to any of the exposed APIs or specified semantics. Such a future library version might support importing serialized data from the old V2.0 or v1.0 formats, by either supporting the various versions of underlying formats, or by setting an expectation that importing e.g. v2.0 formatted data by a v3.0 library may involve round-off value precision losses. In fact, if this proposal ends up solidifying on an only-resolution-is-required-in-configuration and only-resolution-is-dictated-by-wire-form, and settles on the only limiting assumptions being that both resolution and bucket boundaries are stated as be integer powers of 2 and that resolution is limited to reasonable range (e.g. allowed scale values in the specification cover the range of 0...17), I think that such a v3.0 HdrHistogram evolution is quite likely, which would allow converting between opentelementy histograms and HdrHistogram to be done seamlessly and with no loss of precision, and allow users of HdrHistogram's useful features (like the double buffer'ed Recorder pattern, linear/log/percentile iterators, speed and concurrency) to easily integrate with opentelementry. Supporting the opentelemetry wire form for histograms will also be likely then (it would be the first ever dependency added in HdrHistogram). It would also provide a widely polygot base for open telemetry histogram producers and consumers (since HdrHistogram implementations currently exist for Java, Javascript, Python, .NET, Go, Rust, Erlang, and C/C++, and we can probably motivate them all to evolve to the new formats). It would be much harder to envision such a evolution if the new format was not compatible with auto-ranging and auto-scaling, or required the representation of a zero bucket width concept, or did not allow for histograms with e.g. 3 decimal points of precision (+/-0.05% error value, scale = 10). |
If I understand correctly, @giltene, you would like to see one new field that scales the values in a histogram? As far as I understand it, the HDR Histogram format uses a "Integer to double conversion ratio" as this field.
Cool! 😀 🎉 All -- I feel it is time to close this issue. Please direct attention to open-telemetry/opentelemetry-proto#322 for the current protocol review and file issues in this repository with suggestions and/or proposed improvements. Thank you! |
What are you trying to achieve?
This issue aims to achieve consensus from OpenTelemetry maintainers and approvers over a choice of histogram formats.
There has been a lengthy process to standardize an exponential histogram protocol for OpenTelemetry, culminating in OTEP 149 which arrived at a consensus on exponential histograms.
Contemporaneously, the OpenHistogram code base was re-licensed (formerly the Circonus Log-linear Histogram), making it an appealing option for both OpenTelemetry (and OpenMetrics) to use, but it is not the same as an exponential histogram. The goal of this issue is to summarize these options to allow a wider audience to participate in the selection process.
It is important that we choose only one of these options for several reasons: (1) because there is a loss of fidelity whenever we translate between incompatible histogram types, (2) because there will be a large amount of code dedicated to handling each of these types in many downstream vendor and OSS systems.
OTEP 149: Exponential histograms
OTEP 148, discussed and debated in open-telemetry/oteps#149, lists considerations around specifying an exponential histogram for OTLP. In an exponential histogram, boundaries are located at integer powers of a specific base value. If the base value is 1.1, then there are boundaries at 1.1, 1.1^2, 1.1^3, 1.1^4, and so on.
Merging exponential histograms
While this form of histogram is almost trivial in concept, there was an early concern about loss of fidelity when merging arbitrary histograms, due to residual "artifacts". This consideration leads in two directions, both of which have been explored.
The first avenue is to fix the histogram parameters completely--when there are no parameters then you can merge histograms without loss of fidelity. The second avenue is to choose a parameterization scheme that has "perfect subsetting", which is when buckets of a high-resolution histogram perfectly map into buckets of a lower-resolution histogram.
Perfect subsetting for exponential histograms
Following UDDSketch and unpublished work at Google, OTEP 149 lands on the idea of a powers-of-2 exponential histogram, one that ensures perfect subsetting.
OpenHistogram: Log-linear histograms
OpenHistogram uses a base-10 exponential scheme with 90 buckets linearly dividing each "decade", so for example there are 90 buckets between 0.1 and 1, 90 buckets between 1 and 10, 90 buckets between 10 and 100, and so on. This approach maps well to human intuition about logarithmic scale. This approach is also tailored for environments without floating point processors.
Merging OpenHistogram histograms
OpenHistogram has fixed parameterization, thus avoids loss of fidelity when merging.
Perfect subsetting for OpenHistogram-like histograms
The goal of perfect subsetting is to select parameters that support merging of high- and low-resolution histograms. OpenHistogram makes a strong case for the use of 90 buckets per decade, which is relatively high resolution, because any factor of 9 ensures integer boundaries >= 1 in a base-10 scheme.
Resolution factors that are compatible with OpenHistogram and have perfect subsetting:
Note that while OpenHistogram fixes resolution at 90, OpenTelemetry could adopt a protocol with support for other resolutions and still adopt OpenHistogram libraries for its SDKs, allowing metrics processing pipelines to automatically lower resolution for collected metrics.
Protocol considerations
Regardless of which choice is made for the options above above, there are several choices remaining.
Sparse vs. Dense encoding
A dense encoding is optimized for the case where non-empty buckets will be clustered together, making it efficient to encode a single offset and one (if exponential) or two (if log-linear) arrays of counts.
A sparse encoding is optimized for the case where non-empty buckets are not clustered together, making it efficient to encode every bucket index separately.
Sparse histograms can be converted into denser, lower-resolution histograms by perfect subsetting; bucket indexes are are sometimes compressed using delta-encoding techniques.
Zero bucket handling
The zero value must be handled specially in an exponential histogram. There is a question about how to recognize values that are close to zero, whether they "fall into" the zero bucket for example.
Converting from other histogram formats
Both the exponential and log-linear family of histogram are expected to improve the resolution-per-byte of encoded data that is achieved, relative to the explicit-boundary histogram currently included in OTLP. Metrics processors that operate on this data will require helper methods to assist in translating from other histogram formats into the exponential histogram that we choose.
To translate from another histogram format, often we use interpolation. Rules for interpolating between histogram buckets should specify how to handle the buckets that span zero and buckets with boundaries at infinity, since both are valid configurations. To interpolate from arbitrary-boundary buckets, we have to calculate bucket indexes for each boundary in the input.
Calculating bucket indexes: Exponential case
To calculate the bucket index for an arbitrary exponential base generally means calculating a logarithm of the value.
For the special case of base-2 exponential histograms, the IEEE 754 floating point representation has the data in the correct format, it can be directly extracted using bitwise operations.
To calculate bucket indexes without floating point hardware, a recursion relationship can be used.
Calculating bucket indexes: Log-linear case
OpenHistogram defines a recursion relation for calculating the bucket index.
Summary
Thank you to the experts that have guided us to this point. @yzhuge, @postwait, @oertl, @githomin, @CharlesMasson, @HeinrichHartmann, and @jdmontana.
This is now a question for the community. There are two major options presented here, a base-2 exponential histogram and a base-10 log-linear histogram. Both have technical merits.
There are also non-technical merits to consider. OpenHistogram is readily available and has already been adopted in a number of OSS systems, such as Envoy. An out-of-the-box Prometheus client uses 12 buckets with default boundaries that map exactly onto OpenHistogram boundaries, which cannot be said for the binary-exponential approach.
My personal opinion (@jmacd): I am in favor of adopting a protocol with support for OpenHistogram's histogram as long as the protocol supports the lower resolution factors listed above, 3, 9, 18, which appears to be a trivial extension to the OpenHistogram model. I am assuming this approach will be legally acceptable as far as the OpenHistogram project is concerned.
The text was updated successfully, but these errors were encountered: