-
Notifications
You must be signed in to change notification settings - Fork 164
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Integrate Exemplars with Metrics SDK #113
Conversation
6b9e4a5
to
0b942e0
Compare
Relevant to this: open-telemetry/opentelemetry-proto#81 |
This is great. I added a bunch of comments describing how I think we can leave room for statistical interpretation based on exemplars in the future. |
Another related remark: In the extreme case of supporting one exemplar per metric event, can the protocol degenerate into the raw structured event encoding that has been requested? open-telemetry/opentelemetry-specification#617 I hadn't thought about this before making comments on the document above, but I think it might be possible to combine these ideas. If the value of an aggregation can optionally include raw values in addition to another aggregation, then raw values are the exemplars. @mxiamxia ^^^ |
Hi @jmacd, If I get you right, we can store the raw event values as
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great to me.
@jmacd I don't see you comments?
Oh, 🤦, they're in the linked Google doc. |
Agreed, I was thinking the same about merging raw metric exports and this message type. What about calling the exemplar message |
If I'm understanding this correctly, in any case where we export raw measurements it would use the |
Co-authored-by: Tyler Yahn <MrAlias@users.noreply.github.com>
As I understand it, this is a known limitation of today's protocol and something we'll need to change to support nonstandard aggregations, with or without exemplars. The discussion from open-telemetry/opentelemetry-proto#81, open-telemetry/opentelemetry-proto#125, and open-telemetry/opentelemetry-proto#137 is relevant here.
I think we're missing something too, but I don't think the right solution is to overload It looks like there are a lot of questions about the implementation that need to be solved in the proto (Do we still need |
I think this is good to submit. My last comment on a potential restructuring that would avoid duplication of labels can be ignored, I think: #113 (comment) I discussed the matter with several people who agreed that in the long run, OTLP should have a way to avoid duplication and repetition of label sets in general and that we can ignore this question for now. |
Co-authored-by: Chris Kleinknecht <libc@google.com>
@bogdandrutu It's the same story with this PR. We have enough metrics approvers' approvals. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
left a small question, not a blocker.
This has enough metrics reviewer approvals. Please merge. We have to keep interns unblocked! |
|
||
## Definition | ||
|
||
Exemplars are example data points for aggregated data. They provide specific context to otherwise general aggregations. For histogram-type metrics, exemplars are points associated with each bucket in the histogram giving an example of what was aggregated into the bucket. Exemplars are augmented beyond just measurements with references to the sampled trace where the measurement was recorded and labels that were attached to the measurement. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exemplars are example data points for aggregated data.
Are they this "generic" thing, or are they "traces"? The proto schema below suggests the latter. For example, could I store "customer id" as exemplar, so that I could answer the question "which sample customer IDs have latencies in this bucket"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The RawValue representation is a generic way to represent sampled metric events. There will some SDK-specific/custom selection logic that may decide to select only exemplars that have trace context, or they can decide to focus on the distribution of customer IDs. The customer ID would be represented by a label value.
When the aggregator is a histogram:
The SDK can select samples using fixed-size uniform selection on a per-bucket basis, or it can select items probabilistically so as to produce an expected number of exemplars per bucket that is equal (the latter is likely to have better coverage in the case where there are empty buckets--this can be accomplished using Weighted Sampling and inverse-probability weights, for example).
Let's suppose you configure exemplar selection to choose 100 exemplars per bucket per collection period. If the selection is unbiased and the sample_count fields are accurate, you will be able to summarize the contribution to each bucket by up to 100 customers.
Suppose it's a Counter producing a Sum aggregation, instead of a histogram. You could use exemplars selected from the Counter to summarize the contribution to a sum by customer ID. There are lots of ways to sample, and I believe this representation will support a large number of useful approaches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Customer ID as a label value would kill most metric backends. To me that's the whole point of exemplars, to allow associating samples of high-cardinality values with metrics. I am fine if we limit this high-cardinality dimension to trace IDs for now, but I am not seeing a "generic" solution here that would support exemplars on other dimensions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the comment above had the same concern as you? Would adding correlation context as an attribute on RawValue
that can have the customer ID solve the problem?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The SDK has built-in support for aggregation so that high-cardinality labels can be eliminated before they reach most metric backends. The Sum, Histogram, or Summary that you export can be aggregated so that customer_id does not appear in the aggregation value.
Exemplars selected from the same series of events (that were summarized without customer_id) can include the customer_id, and the exemplars may be used to approximate the distribution of customer_id and other dimensions that were aggregated away in the Sum, Histogram or Summary value. One of the nice properties of the approach described here is that by limiting the number of exemplars, we limit cardinality reported in a single collection interval. For example, you could select 100 exemplars and even if there are 1000 actual customer_ids, you will collect at most 100 distinct values, and if chosen probabilistically, we can expect to recover the customer_ids that were most representative of the actual distribution (i.e., the "heavy hitters", the top of the distribution).
I want to emphasize that the API and the Protocol should not discourage the use of high-cardinality metrics. Given that I see exemplars as exactly the tool for addressing high cardinality, I'm confused by:
I am fine if we limit this high-cardinality dimension to trace IDs for now
What do you think we should restrict?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue I have with this schema is that it makes no distinction between regular label dimensions (which should survive aggregation) and the exemplar dimensions like customer-id. Only trace id is explicitly separated as exemplar dimensions. That makes it very easy for a user to shoot themselves in the foot and send an explosion of dimensions to the backend. The only way to avoid it is by carefully defining custom aggregation rules in the SDK and explicitly defining which labels should be treated as time series dimensions vs. exemplar attributes. While it minimizes the API surface, I feel that it makes the API more dangerous to use. Why not allow specifying exemplar labels explicitly from the beginning, and keep them separate from regular labels?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for clarifying. I understand the concern, now.
By the way, an earlier draft of the metrics API allowed the application writer to recommend aggregation dimensions, by the name "Recommended Keys". It was removed: open-telemetry/opentelemetry-specification#463. The reason these keys were recommended is that we do not believe the author of the code knows which labels the system or the operator wants to monitor. If we ask the developer decide which dimensions are for aggregation and which are for exemplars, we make a semantic distinction out of a performance limitation (and not a universal one, as far as I know).
There is a practical reason to support arbitrary labels and deal with them through configuration: this is the natural thing to do when creating Metric events from Spans. Span attributes simply become Metric labels. We are adding a semantic convention to cover duration measurements: open-telemetry/opentelemetry-specification#657
The span-to-metrics issue is discussed here: open-telemetry/opentelemetry-specification#381
One way to address your concern would be to set the default to aggregate over zero dimensions, so that all labels are exemplar labels unless configured otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bogdandrutu I would like your opinion on this topic. We introduced RecommendedKeys()
to address a perceived need in Prometheus, since Prometheus clients actually enforce pre-declared label keys. We discovered that the Prometheus protocol does not have any such restriction, which made it appealing to remove a feature. In the (dog)statsd world, it's common to add labels as needed. In modern terminology, we had created (proposed) Metrics Processors named "defaultkeys" that would use the developer-provided recommended keys, and named "ungrouped" that would use all the keys when exporting metrics. Removing recommended keys brought us back to a single basic metrics processor.
With this proposal, we begin to see a "Sampler API" for metrics, that is one that takes a full set of labels, applies a sampling decision (whether to select an exemplar or not) and then returns the set of labels to use for aggregation
If we have a choice between:
(1) asking the user to choose which labels are significant for aggregation and which are not
(2) making it really easy to configure which labels are used for aggregation
I would absolutely prefer the second choice--whether aggregation is configured by a dynamic configuration API, by a static configuration API, or by hard-coding a View in your main() function, any of these should be viable and relatively easy, and all of these are more appealing to me than asking the user to distinguish two kinds of label.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bogdandrutu Thanks for merging, but I think we should capture this discussion or at least address the question.
|
||
## Open questions | ||
|
||
- Exemplars usually refer to a span in a sampled trace. While using the collector to perform tail-sampling, the sampling decision may be deferred until after the metric would be exported. How do we create exemplars in this case? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we address this question from the resolved discussion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't feel that this is the place to describe a fancy SDK approach to this problem. This question leads to arbitrarily complex approaches that are also found in the discussion about tail sampling itself. How should we decide to propagate a trace-is-sampled bit in-band when making child spans during a span lifetime? It's almost the same question.
A simple approach would be to maintain a per-span sample of metric events and buffer metric data until the span ends.
Another approach would use the statistics of the spans that are being selected by the tail sampler to form an unequal probability sampling scheme. Select sample metric events that are likely to be associated with spans that match the tail-sampling decision. E.g., if tail latency is used to select exemplars, and a high correlation is observed between latency and label X, then use label X to boost sample weight on metric events. This leads to a speculative approach where you try to choose exemplars that will have interesting traces.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cnnradams Would you be willing to add an answer to this question? If head sampling, the logic for selecting trace contexts that are also being sampled is simple. If tail sampling, the logic for selecting metric samples has to be coordinated with tracing, delayed, or somehow speculative--and this decision is practically the same as deciding to what to tell your child in a span before the span is finished.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I missed this 😬
To the depth that this OTEP goes, yes, this question is answered by "either the tail sampler needs to pick traces with exemplar choices in mind, or exemplars will need to be picked without a guarantee that they will have a trace". But the actual details of this still need to be worked out as far as I'm aware. I can't really mark this as answered now that its merged, so this will have to do 🤷
Another approach would use the statistics of the spans that are being selected by the tail sampler to form an unequal probability sampling scheme.
How would you have knowledge of the spans that were chosen to be sampled when that decision was made in a different process without your input?
|
||
## Definition | ||
|
||
Exemplars are example data points for aggregated data. They provide specific context to otherwise general aggregations. For histogram-type metrics, exemplars are points associated with each bucket in the histogram giving an example of what was aggregated into the bucket. Exemplars are augmented beyond just measurements with references to the sampled trace where the measurement was recorded and labels that were attached to the measurement. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The RawValue representation is a generic way to represent sampled metric events. There will some SDK-specific/custom selection logic that may decide to select only exemplars that have trace context, or they can decide to focus on the distribution of customer IDs. The customer ID would be represented by a label value.
When the aggregator is a histogram:
The SDK can select samples using fixed-size uniform selection on a per-bucket basis, or it can select items probabilistically so as to produce an expected number of exemplars per bucket that is equal (the latter is likely to have better coverage in the case where there are empty buckets--this can be accomplished using Weighted Sampling and inverse-probability weights, for example).
Let's suppose you configure exemplar selection to choose 100 exemplars per bucket per collection period. If the selection is unbiased and the sample_count fields are accurate, you will be able to summarize the contribution to each bucket by up to 100 customers.
Suppose it's a Counter producing a Sum aggregation, instead of a histogram. You could use exemplars selected from the Counter to summarize the contribution to a sum by customer ID. There are lots of ways to sample, and I believe this representation will support a large number of useful approaches.
An exemplar is defined as: | ||
|
||
``` | ||
message RawValue { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The WIP open-telemetry/opentelemetry-proto#162 specifies that RawValue messages may be used in two ways.
- As exemplars that add additional information to SCALAR, HISTOGRAM, and SUMMARY data points.
- As raw data in its own right.
The SDK spec discusses three types of aggregation that can be represented as a scalar value: Sum, LastValue, and Exact. The exact representation would use RawValues with sample_count == 1.
|
||
## Open questions | ||
|
||
- Exemplars usually refer to a span in a sampled trace. While using the collector to perform tail-sampling, the sampling decision may be deferred until after the metric would be exported. How do we create exemplars in this case? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't feel that this is the place to describe a fancy SDK approach to this problem. This question leads to arbitrarily complex approaches that are also found in the discussion about tail sampling itself. How should we decide to propagate a trace-is-sampled bit in-band when making child spans during a span lifetime? It's almost the same question.
A simple approach would be to maintain a per-span sample of metric events and buffer metric data until the span ends.
Another approach would use the statistics of the spans that are being selected by the tail sampler to form an unequal probability sampling scheme. Select sample metric events that are likely to be associated with spans that match the tail-sampling decision. E.g., if tail latency is used to select exemplars, and a high correlation is observed between latency and label X, then use label X to boost sample weight on metric events. This leads to a speculative approach where you try to choose exemplars that will have interesting traces.
@jmacd can I merge this or you need answers/actions for your comments? |
* Exemplar OTEP * Wording fixes Co-authored-by: Tyler Yahn <MrAlias@users.noreply.github.com> * stats updates, specify parameters/output format * aggregation -> aggregator, other small changes * gauge -> lastvalue * Update text/metrics/0113-exemplars.md Co-authored-by: Chris Kleinknecht <libc@google.com> Co-authored-by: Tyler Yahn <MrAlias@users.noreply.github.com> Co-authored-by: Chris Kleinknecht <libc@google.com> Co-authored-by: Bogdan Drutu <bogdandrutu@gmail.com>
* Exemplar OTEP * Wording fixes Co-authored-by: Tyler Yahn <MrAlias@users.noreply.github.com> * stats updates, specify parameters/output format * aggregation -> aggregator, other small changes * gauge -> lastvalue * Update text/metrics/0113-exemplars.md Co-authored-by: Chris Kleinknecht <libc@google.com> Co-authored-by: Tyler Yahn <MrAlias@users.noreply.github.com> Co-authored-by: Chris Kleinknecht <libc@google.com> Co-authored-by: Bogdan Drutu <bogdandrutu@gmail.com>
* Exemplar OTEP * Wording fixes Co-authored-by: Tyler Yahn <MrAlias@users.noreply.github.com> * stats updates, specify parameters/output format * aggregation -> aggregator, other small changes * gauge -> lastvalue * Update text/metrics/0113-exemplars.md Co-authored-by: Chris Kleinknecht <libc@google.com> Co-authored-by: Tyler Yahn <MrAlias@users.noreply.github.com> Co-authored-by: Chris Kleinknecht <libc@google.com> Co-authored-by: Bogdan Drutu <bogdandrutu@gmail.com>
* Exemplar OTEP * Wording fixes Co-authored-by: Tyler Yahn <MrAlias@users.noreply.github.com> * stats updates, specify parameters/output format * aggregation -> aggregator, other small changes * gauge -> lastvalue * Update text/metrics/0113-exemplars.md Co-authored-by: Chris Kleinknecht <libc@google.com> Co-authored-by: Tyler Yahn <MrAlias@users.noreply.github.com> Co-authored-by: Chris Kleinknecht <libc@google.com> Co-authored-by: Bogdan Drutu <bogdandrutu@gmail.com>
This OTEP defines exemplars within OpenTelemetry and specifies behaviour for exemplars for the default set of aggregations: Proposal doc