From 506a3284a35d1d059db4306f0dbf09a9f1309ac4 Mon Sep 17 00:00:00 2001
From: Connor Adams
Date: Wed, 3 Jun 2020 13:49:54 -0400
Subject: [PATCH 1/6] Exemplar OTEP

---
 text/metrics/0113-exemplars.md | 94 ++++++++++++++++++++++++++++++++++
 1 file changed, 94 insertions(+)
 create mode 100644 text/metrics/0113-exemplars.md

diff --git a/text/metrics/0113-exemplars.md b/text/metrics/0113-exemplars.md
new file mode 100644
index 000000000..dc21f785c
--- /dev/null
+++ b/text/metrics/0113-exemplars.md
@@ -0,0 +1,94 @@
+# Integrate Exemplars with Metrics
+
+This OTEP adds exemplar support to aggregations defined in the Metrics SDK.
+
+## Definition
+
+Exemplars are example data points for aggregated data. They provide specific context to otherwise general aggregations. For histogram-type metrics, exemplars are points associated with each bucket in the histogram giving an example of what was aggregated into the bucket. Exemplars are augmented beyond just measurements with references to the sampled trace where the measurement was recorded and labels that were attached to the measurement.
+
+## Motivation
+
+Defining exemplar behaviour for aggregations allows OpenTelemetry to support exemplars in Google Cloud Monitoring.
+
+Exemplars provide a link between metrics and traces. Consider a user using a Histogram aggregation to track response latencies over time for a high QPS server. The histogram is composed of buckets based on the speed of the request, for example, "there were 55 requests that took 400-500 milliseconds". The user wants to troubleshoot slow requests, so they would need to find a trace where the latency was high. With exemplars, the user is able to get an exemplar trace from a high latency bucket, an exemplar trace from a low latency bucket, and compare them to figure out the reason for the high latency.
+
+Exemplars are meaningful for all aggregations where relevant traces can provide more context to the aggregation, as well as when exemplars can display specific information not otherwise shown in the aggregation (for example, the full set of labels where they otherwise might be aggregated away).
+
+## Internal details
+
+An exemplar is defined as:
+
+```
+message Exemplar {
+  // Numerical value of the measurement that was recorded. Only one of these two fields is
+  // used for the data, depending on its type
+  double double_value = 0;
+  int64 int64_value = 1;
+
+  // Exact time that the measurement was recorded
+  fixed64 time_unix_nano = 2;
+
+  // 'label:value' map of all labels that were provided by the user recording the measurement
+  repeated opentelemetry.proto.common.v1.StringKeyValue labels = 3;
+
+  // Span ID of the current trace [Optional]
+  string span_id = 4;
+
+  // Trace ID of the current trace [Optional]
+  string trace_id = 5;
+}
+```
+
+Exemplar collection should be enabled through an optional parameter, and when not enabled, there should be no collection/logic performed related to exemplars. This is to ensure that when necessary, aggregations are as high performance as possible.
+
+[#347](https://github.com/open-telemetry/opentelemetry-specification/pull/347) describes a set of standard aggregations in the metrics SDK. Here we describe how exemplars could be implemented for each aggregation.
+
+### Exemplar behaviour for standard aggregations
+
+#### HistogramAggregator
+
+Every bucket in the HistogramAggregator MUST (when enabled) maintain a list of exemplars whose values are within the boundaries of the bucket. Implementations should attempt to retain at least one exemplar per bucket, with a preference for exemplars with a sampled trace context and exemplars that were recorded later in the time period. They should also not retain an unbounded number of exemplars.
+
+#### Sketch
+
+A Sketch aggregation should maintain a list of exemplars whose values are spaced out across the distribution. There is no specific number of exemplars that should be retained (although the amount should not be unbounded), but the implementation should pick exemplars that represent as much of the distribution as possible. Preference should be given to exemplars with a sampled trace context. (Specific details not defined, see open questions.)
+
+#### Gauge
+
+Most (if not all) Gauges operate asynchronously and do not ever interact with traces. Since the value of a Gauge is the last measurement (essentially the other parts of an exemplar), exemplars are not worth implementing for Gauge.
+
+#### Exact
+
+The Exact aggregator does not aggregate measurements. If exemplars are enabled, implementations may attach a separate exemplar to each measurement in an exact aggregation including the trace context and full set of labels.
+
+Exemplars will always be retrieved from aggregations (by the exporter) as a list of Exemplar objects.
+
+## Trade-offs and mitigations
+
+Performance (in terms of memory usage and to some extent time complexity) is the main concern of implementing exemplars. However, by making recording exemplars optional, there should be minimal overhead when exemplars are not enabled.
+
+## Prior art and alternatives
+
+Exemplars are implemented in [OpenCensus](https://github.com/census-instrumentation/opencensus-specs/blob/master/stats/Exemplars.md#exemplars), but only for HistogramAggregator. This OTEP is largely a port from the OpenCensus definition of exemplars, but it also adds exemplar support to other aggregators.
+
+[Cloud monitoring API doc for exemplars](https://cloud.google.com/monitoring/api/ref_v3/rpc/google.api#google.api.Distribution.Exemplar)
+
+## Open questions
+
+- Exemplars usually refer to a span in a sampled trace. While using the collector to perform tail-sampling, the sampling decision may be deferred until after the metric would be exported. How do we create exemplars in this case?
+
+- We don’t have a strong grasp on how the sketch aggregator works in terms of implementation - so we don’t have enough information to design how exemplars should work properly.
+
+- The spec doesn't yet define a standard set of aggregations, just default aggregations for standard metric instruments. Since exemplars are always attached to particular aggregations, it's impossible to fully specify the behavior of exemplars.
+
+### Which aggregations should include exemplars?
+
+There are other aggregations that can benefit from exemplars, but they do not have well defined exemplar implementations and they are not supported by any known exporter. Should these be included in the OTEP or should they be left out?:
+
+#### Counter
+
+Exemplars give value to counter aggregations by tying metric and trace data together. When enabled, the aggregator will retain a small bounded list of exemplars at each checkpoint, containing at least the minimum and maximum value measurements whose trace context was sampled. Measurements should only be retained if there is a sampled trace context when the measurement was recorded.
+
+#### MinMaxSumCount
+
+The aggregator should maintain a list of at least two exemplars (when enabled), one near the maximum value and one near the minimum value. Preference should be given to exemplars with sampled traces, and if those are not available then the actual min and max values should be used.
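The HistogramAggregator retention rule in this first patch (at least one exemplar per bucket, a preference for sampled trace context and for later measurements, and a bounded total) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the SDK's implementation: the names `Exemplar`, `HistogramExemplars`, and `sampled` are hypothetical, each bucket is capped at a single exemplar to keep memory bounded, and bucket boundaries are treated as inclusive lower bounds.

```python
import bisect
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Exemplar:
    # Hypothetical in-memory stand-in for the proposed message.
    value: float
    time_unix_nano: int
    sampled: bool = False  # assumption: True if recorded inside a sampled trace


class HistogramExemplars:
    """Keeps at most one exemplar per bucket, so memory use stays bounded."""

    def __init__(self, boundaries: List[float]):
        self.boundaries = sorted(boundaries)
        # One slot per bucket, including the overflow bucket past the last boundary.
        self.slots: List[Optional[Exemplar]] = [None] * (len(self.boundaries) + 1)

    def record(self, candidate: Exemplar) -> None:
        # Assumption: a value equal to a boundary falls into the upper bucket.
        index = bisect.bisect_right(self.boundaries, candidate.value)
        current = self.slots[index]
        if current is None or self._prefer(candidate, current):
            self.slots[index] = candidate

    @staticmethod
    def _prefer(new: Exemplar, old: Exemplar) -> bool:
        # A sampled trace context wins first; ties go to the later measurement.
        if new.sampled != old.sampled:
            return new.sampled
        return new.time_unix_nano >= old.time_unix_nano
```

An exporter would then read the non-empty entries of `slots` at each checkpoint. A real aggregator might keep more than one exemplar per bucket; the single slot here is just the simplest bounded choice.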
From e07e155c53f311979428a7bfe18111f4e0f4958d Mon Sep 17 00:00:00 2001
From: Connor Adams
Date: Mon, 8 Jun 2020 09:31:54 -0400
Subject: [PATCH 2/6] Wording fixes

Co-authored-by: Tyler Yahn
---
 text/metrics/0113-exemplars.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/text/metrics/0113-exemplars.md b/text/metrics/0113-exemplars.md
index dc21f785c..cd9fd2464 100644
--- a/text/metrics/0113-exemplars.md
+++ b/text/metrics/0113-exemplars.md
@@ -47,13 +47,13 @@ Exemplar collection should be enabled through an optional parameter, and when no
 
 #### HistogramAggregator
 
-Every bucket in the HistogramAggregator MUST (when enabled) maintain a list of exemplars whose values are within the boundaries of the bucket. Implementations should attempt to retain at least one exemplar per bucket, with a preference for exemplars with a sampled trace context and exemplars that were recorded later in the time period. They should also not retain an unbounded number of exemplars.
+Every bucket in the HistogramAggregator MUST (when enabled) maintain a list of exemplars whose values are within the boundaries of the bucket (for buckets containing a population of one or more). Implementations SHOULD attempt to retain at least one exemplar per bucket, with a preference for exemplars with a sampled trace context and exemplars that were recorded later in the time period. They SHOULD NOT retain an unbounded number of exemplars.
 
 #### Sketch
 
-A Sketch aggregation should maintain a list of exemplars whose values are spaced out across the distribution. There is no specific number of exemplars that should be retained (although the amount should not be unbounded), but the implementation should pick exemplars that represent as much of the distribution as possible. Preference should be given to exemplars with a sampled trace context. (Specific details not defined, see open questions.)
+A Sketch aggregation SHOULD maintain a list of exemplars whose values are spaced out across the distribution. There is no specific number of exemplars that should be retained (although the amount SHOULD NOT be unbounded), but the implementation SHOULD pick exemplars that represent as much of the distribution as possible. Preference SHOULD be given to exemplars with a sampled trace context. (Specific details not defined, see open questions.)
 
-#### Gauge
+#### Last-Value
 
 Most (if not all) Gauges operate asynchronously and do not ever interact with traces. Since the value of a Gauge is the last measurement (essentially the other parts of an exemplar), exemplars are not worth implementing for Gauge.
 
@@ -87,8 +87,8 @@ There are other aggregations that can benefit from exemplars, but they do not ha
 
 #### Counter
 
-Exemplars give value to counter aggregations by tying metric and trace data together. When enabled, the aggregator will retain a small bounded list of exemplars at each checkpoint, containing at least the minimum and maximum value measurements whose trace context was sampled. Measurements should only be retained if there is a sampled trace context when the measurement was recorded.
+Exemplars give value to counter aggregations by tying metric and trace data together. When enabled, the aggregator will retain a small bounded list of exemplars at each checkpoint, containing at least the minimum and maximum value measurements whose trace context was sampled. Measurements SHOULD be retained only if there is a sampled trace context when the measurement was recorded.
 
 #### MinMaxSumCount
 
-The aggregator should maintain a list of at least two exemplars (when enabled), one near the maximum value and one near the minimum value. Preference should be given to exemplars with sampled traces, and if those are not available then the actual min and max values should be used.
+The aggregator should maintain a list of at least two exemplars (when enabled), one near the maximum value and one near the minimum value. Preference SHOULD be given to exemplars with sampled traces, and if those are not available then the actual min and max values SHOULD be used.
From 8bb0d551e24fbb46d93b742ba0d717ae58ff446d Mon Sep 17 00:00:00 2001
From: Connor Adams
Date: Tue, 9 Jun 2020 10:28:50 -0400
Subject: [PATCH 3/6] stats updates, specify parameters/output format

---
 text/metrics/0113-exemplars.md | 54 ++++++++++++++++++++--------------
 1 file changed, 32 insertions(+), 22 deletions(-)

diff --git a/text/metrics/0113-exemplars.md b/text/metrics/0113-exemplars.md
index cd9fd2464..92af01a6d 100644
--- a/text/metrics/0113-exemplars.md
+++ b/text/metrics/0113-exemplars.md
@@ -19,7 +19,7 @@ Exemplars are meaningful for all aggregations where relevant traces can provide
 An exemplar is defined as:
 
 ```
-message Exemplar {
+message RawValue {
   // Numerical value of the measurement that was recorded. Only one of these two fields is
   // used for the data, depending on its type
   double double_value = 0;
@@ -31,15 +31,19 @@ message Exemplar {
   // 'label:value' map of all labels that were provided by the user recording the measurement
   repeated opentelemetry.proto.common.v1.StringKeyValue labels = 3;
 
-  // Span ID of the current trace [Optional]
-  string span_id = 4;
+  // Span ID of the current trace
+  optional string span_id = 4;
 
-  // Trace ID of the current trace [Optional]
-  string trace_id = 5;
+  // Trace ID of the current trace
+  optional string trace_id = 5;
+
+  // When sample_count is non-zero, this exemplar has been chosen in a statistically
+  // unbiased way such that the exemplar is representative of `sample_count` individual events
+  optional double sample_count = 6;
 }
 ```
 
-Exemplar collection should be enabled through an optional parameter, and when not enabled, there should be no collection/logic performed related to exemplars. This is to ensure that when necessary, aggregations are as high performance as possible.
+Exemplar collection should be enabled through an optional parameter (disabled by default), and when not enabled, there should be no collection/logic performed related to exemplars. This is to ensure that when necessary, aggregations are as high performance as possible. Aggregations should also have a parameter to determine whether exemplars should only be collected if they are recorded during a sampled trace, or if tracing should have no effect on which exemplars are sampled. This allows aggregations to prioritize either the link between metrics and traces or the statistical significance of exemplars, when necessary.
 
 [#347](https://github.com/open-telemetry/opentelemetry-specification/pull/347) describes a set of standard aggregations in the metrics SDK. Here we describe how exemplars could be implemented for each aggregation.
 
@@ -47,11 +51,11 @@ Exemplar collection should be enabled through an optional parameter, and when no
 
 #### HistogramAggregator
 
-Every bucket in the HistogramAggregator MUST (when enabled) maintain a list of exemplars whose values are within the boundaries of the bucket (for buckets containing a population of one or more). Implementations SHOULD attempt to retain at least one exemplar per bucket, with a preference for exemplars with a sampled trace context and exemplars that were recorded later in the time period. They SHOULD NOT retain an unbounded number of exemplars.
+The HistogramAggregator MUST (when enabled) maintain a list of exemplars whose values are distributed across all buckets of the histogram (there should be one or more exemplars in every bucket that has a population of at least one sample-able measurement). Implementations SHOULD NOT retain an unbounded number of exemplars.
 
 #### Sketch
 
-A Sketch aggregation SHOULD maintain a list of exemplars whose values are spaced out across the distribution. There is no specific number of exemplars that should be retained (although the amount SHOULD NOT be unbounded), but the implementation SHOULD pick exemplars that represent as much of the distribution as possible. Preference SHOULD be given to exemplars with a sampled trace context. (Specific details not defined, see open questions.)
+A Sketch aggregation SHOULD maintain a list of exemplars whose values are spaced out across the distribution. There is no specific number of exemplars that should be retained (although the amount SHOULD NOT be unbounded), but the implementation SHOULD pick exemplars that represent as much of the distribution as possible. (Specific details not defined, see open questions.)
 
 #### Last-Value
 
@@ -59,9 +63,27 @@ Most (if not all) Gauges operate asynchronously and do not ever interact with tr
 
 #### Exact
 
-The Exact aggregator does not aggregate measurements. If exemplars are enabled, implementations may attach a separate exemplar to each measurement in an exact aggregation including the trace context and full set of labels.
+The Exact aggregation will function by maintaining a list of `RawValue`s, which contain all of the information exemplars would carry. Therefore the Exact aggregation will not need to maintain any exemplars.
+
+#### Counter
+
+Exemplars give value to counter aggregations in two ways: One, by tying metric and trace data together, and two, by providing necessary information to re-create the input distribution. When enabled, the aggregator will retain a bounded list of exemplars at each checkpoint, sampled from across the distribution of the data. Exemplars should be sampled in a statistically significant way.
+
+#### MinMaxSumCount
 
-Exemplars will always be retrieved from aggregations (by the exporter) as a list of Exemplar objects.
+Similar to Counter, MinMaxSumCount should retain a bounded list of exemplars that were sampled from across the input distribution in a statistically significant way.
+
+#### Custom Aggregations
+
+Custom aggregations MAY support exemplars by maintaining a list of exemplars that can be retrieved by exporters. Custom aggregations should select exemplars based on their usage by the connected exporter (for example, exemplars recorded for Stackdriver should only be retained if they were recorded within a sampled trace).
+
+Exemplars will always be retrieved from aggregations (by the exporter) as a list of RawValue objects. They will be communicated via a
+
+```
+optional repeated RawValue exemplars = 6
+```
+
+attribute on the `metric_descriptor` object.
 
 ## Trade-offs and mitigations
 
@@ -80,15 +102,3 @@ Exemplars are implemented in [OpenCensus](https://github.com/census-instrumentat
 - We don’t have a strong grasp on how the sketch aggregator works in terms of implementation - so we don’t have enough information to design how exemplars should work properly.
 
 - The spec doesn't yet define a standard set of aggregations, just default aggregations for standard metric instruments. Since exemplars are always attached to particular aggregations, it's impossible to fully specify the behavior of exemplars.
-
-### Which aggregations should include exemplars?
-
-There are other aggregations that can benefit from exemplars, but they do not have well defined exemplar implementations and they are not supported by any known exporter. Should these be included in the OTEP or should they be left out?:
-
-#### Counter
-
-Exemplars give value to counter aggregations by tying metric and trace data together. When enabled, the aggregator will retain a small bounded list of exemplars at each checkpoint, containing at least the minimum and maximum value measurements whose trace context was sampled. Measurements SHOULD be retained only if there is a sampled trace context when the measurement was recorded.
-
-#### MinMaxSumCount
-
-The aggregator should maintain a list of at least two exemplars (when enabled), one near the maximum value and one near the minimum value. Preference SHOULD be given to exemplars with sampled traces, and if those are not available then the actual min and max values SHOULD be used.
From ab60d380397875780c180240ceadbb49078d7e6f Mon Sep 17 00:00:00 2001
From: Connor Adams
Date: Thu, 11 Jun 2020 10:03:07 -0400
Subject: [PATCH 4/6] aggregation -> aggregator, other small changes

---
 text/metrics/0113-exemplars.md | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/text/metrics/0113-exemplars.md b/text/metrics/0113-exemplars.md
index 92af01a6d..810bb6cf7 100644
--- a/text/metrics/0113-exemplars.md
+++ b/text/metrics/0113-exemplars.md
@@ -16,7 +16,7 @@ Exemplars are meaningful for all aggregations where relevant traces can provide
 
 ## Internal details
 
-An exemplar is defined as:
+An exemplar is a `RawValue`, which is defined as:
 
 ```
 message RawValue {
@@ -43,11 +43,11 @@ message RawValue {
 }
 ```
 
-Exemplar collection should be enabled through an optional parameter (disabled by default), and when not enabled, there should be no collection/logic performed related to exemplars. This is to ensure that when necessary, aggregations are as high performance as possible. Aggregations should also have a parameter to determine whether exemplars should only be collected if they are recorded during a sampled trace, or if tracing should have no effect on which exemplars are sampled. This allows aggregations to prioritize either the link between metrics and traces or the statistical significance of exemplars, when necessary.
+Exemplar collection should be enabled through an optional parameter (disabled by default), and when not enabled, there should be no collection/logic performed related to exemplars. This is to ensure that when necessary, aggregators are as high performance as possible. Aggregators should also have a parameter to determine whether exemplars should only be collected if they are recorded during a sampled trace, or if tracing should have no effect on which exemplars are sampled. This allows aggregations to prioritize either the link between metrics and traces or the statistical significance of exemplars, when necessary.
 
-[#347](https://github.com/open-telemetry/opentelemetry-specification/pull/347) describes a set of standard aggregations in the metrics SDK. Here we describe how exemplars could be implemented for each aggregation.
+[#347](https://github.com/open-telemetry/opentelemetry-specification/pull/347) describes a set of standard aggregators in the metrics SDK. Here we describe how exemplars could be implemented for each aggregator.
 
-### Exemplar behaviour for standard aggregations
+### Exemplar behaviour for standard aggregators
 
 #### HistogramAggregator
 
@@ -55,15 +55,15 @@ The HistogramAggregator MUST (when enabled) maintain a list of exemplars whose v
 
 #### Sketch
 
-A Sketch aggregation SHOULD maintain a list of exemplars whose values are spaced out across the distribution. There is no specific number of exemplars that should be retained (although the amount SHOULD NOT be unbounded), but the implementation SHOULD pick exemplars that represent as much of the distribution as possible. (Specific details not defined, see open questions.)
+A Sketch aggregator SHOULD maintain a list of exemplars whose values are spaced out across the distribution. There is no specific number of exemplars that should be retained (although the amount SHOULD NOT be unbounded), but the implementation SHOULD pick exemplars that represent as much of the distribution as possible. (Specific details not defined, see open questions.)
 
 #### Last-Value
 
-Most (if not all) Gauges operate asynchronously and do not ever interact with traces. Since the value of a Gauge is the last measurement (essentially the other parts of an exemplar), exemplars are not worth implementing for Gauge.
+Most (if not all) Last-Value aggregators operate asynchronously and do not ever interact with context. Since the value of a Last-Value is the last measurement (essentially the other parts of an exemplar), exemplars are not worth implementing for Gauge.
 
 #### Exact
 
-The Exact aggregation will function by maintaining a list of `RawValue`s, which contain all of the information exemplars would carry. Therefore the Exact aggregation will not need to maintain any exemplars.
+The Exact aggregator will function by maintaining a list of `RawValue`s, which contain all of the information exemplars would carry. Therefore the Exact aggregator will not need to maintain any exemplars.
 
 #### Counter
 
@@ -73,9 +73,9 @@ Exemplars give value to counter aggregations in two ways: One, by tying metric a
 
 Similar to Counter, MinMaxSumCount should retain a bounded list of exemplars that were sampled from across the input distribution in a statistically significant way.
 
-#### Custom Aggregations
+#### Custom Aggregators
 
-Custom aggregations MAY support exemplars by maintaining a list of exemplars that can be retrieved by exporters. Custom aggregations should select exemplars based on their usage by the connected exporter (for example, exemplars recorded for Stackdriver should only be retained if they were recorded within a sampled trace).
+Custom aggregators MAY support exemplars by maintaining a list of exemplars that can be retrieved by exporters. Custom aggregators should select exemplars based on their usage by the connected exporter (for example, exemplars recorded for Stackdriver should only be retained if they were recorded within a sampled trace).
 
 Exemplars will always be retrieved from aggregations (by the exporter) as a list of RawValue objects. They will be communicated via a
 
@@ -83,7 +83,7 @@ Exemplars will always be retrieved from aggregations (by the exporter) as a list
 optional repeated RawValue exemplars = 6
 ```
 
-attribute on the `metric_descriptor` object.
+attribute on the `Metric` object.
 
 ## Trade-offs and mitigations
 
From ec82dc49a8f552597c88891e52b1baa2cbe81f10 Mon Sep 17 00:00:00 2001
From: Connor Adams
Date: Thu, 11 Jun 2020 15:38:47 -0400
Subject: [PATCH 5/6] gauge -> lastvalue

---
 text/metrics/0113-exemplars.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/text/metrics/0113-exemplars.md b/text/metrics/0113-exemplars.md
index 810bb6cf7..8b1426d98 100644
--- a/text/metrics/0113-exemplars.md
+++ b/text/metrics/0113-exemplars.md
@@ -32,10 +32,10 @@ message RawValue {
   repeated opentelemetry.proto.common.v1.StringKeyValue labels = 3;
 
   // Span ID of the current trace
-  optional string span_id = 4;
+  optional bytes span_id = 4;
 
   // Trace ID of the current trace
-  optional string trace_id = 5;
+  optional bytes trace_id = 5;
 
   // When sample_count is non-zero, this exemplar has been chosen in a statistically
   // unbiased way such that the exemplar is representative of `sample_count` individual events
@@ -59,7 +59,7 @@ A Sketch aggregator SHOULD maintain a list of exemplars whose values are spaced
 
 #### Last-Value
 
-Most (if not all) Last-Value aggregators operate asynchronously and do not ever interact with context. Since the value of a Last-Value is the last measurement (essentially the other parts of an exemplar), exemplars are not worth implementing for Gauge.
+Most (if not all) Last-Value aggregators operate asynchronously and do not ever interact with context. Since the value of a Last-Value is the last measurement (essentially the other parts of an exemplar), exemplars are not worth implementing for Last-Value.
 
 #### Exact
 
From 24c4c1e89b7f1a261eb10d1bcc7dd9d40102fb1d Mon Sep 17 00:00:00 2001
From: Connor Adams
Date: Thu, 18 Jun 2020 16:20:21 -0400
Subject: [PATCH 6/6] Update text/metrics/0113-exemplars.md

Co-authored-by: Chris Kleinknecht
---
 text/metrics/0113-exemplars.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/text/metrics/0113-exemplars.md b/text/metrics/0113-exemplars.md
index 8b1426d98..d0a4be102 100644
--- a/text/metrics/0113-exemplars.md
+++ b/text/metrics/0113-exemplars.md
@@ -75,7 +75,7 @@ Similar to Counter, MinMaxSumCount should retain a bounded list of exemplars tha
 
 #### Custom Aggregators
 
-Custom aggregators MAY support exemplars by maintaining a list of exemplars that can be retrieved by exporters. Custom aggregators should select exemplars based on their usage by the connected exporter (for example, exemplars recorded for Stackdriver should only be retained if they were recorded within a sampled trace).
+Custom aggregators MAY support exemplars by maintaining a list of exemplars that can be retrieved by exporters. Custom aggregators should select exemplars based on their usage by the connected exporter (for example, exemplars recorded for Google Cloud Monitoring should only be retained if they were recorded within a sampled trace).
 
 Exemplars will always be retrieved from aggregations (by the exporter) as a list of RawValue objects. They will be communicated via a