open-telemetry · kalyanaj · Jun 5, 2024
diff --git a/text/trace/0235-sampling-threshold-in-trace-state.md b/text/trace/0235-sampling-threshold-in-trace-state.md
@@ -1,46 +1,52 @@
 # Sampling Threshold Propagation in TraceState
 
+## Abstract
+
+Sampling is an important lever to reduce the costs associated with collecting and processing telemetry data. It enables you to choose a representative set of items from an overall population.
+
+There are two key aspects for sampling of tracing data. The first is that sampling decisions can be made independently for *each* span in a trace. The second is that sampling decisions can be made at multiple points in the telemetry pipeline. For example, the sampling decision for a span at span creation time could have been to **keep** that span, while the downstream sampling decision for the *same* span at a later stage (say in an external process in the data collection pipeline) could be to **drop** it.
+
+For each of the above aspects, we want sampling decisions to be made in a **consistent** manner so that we can effectively reason about a trace. This OTEP describes an approach called **Consistent Probability Sampling** for achieving such decisions. It relies on a common random value (R) and a rejection threshold (T) derived from each participant's sampling rate, and it describes how these values should be propagated and how participants should use them to make sampling decisions.
+
+This mechanism enables a new set of samplers (known as Consistent Probability Samplers) that let trace participants choose their own sampling rates while still achieving consistent sampling decisions. This OTEP ensures that such samplers interoperate with existing samplers that do not use consistent probability sampling.
+
 ## Motivation
 
-Sampling is a broad topic; here it refers to the independent decisions made at points in a distributed tracing system of whether to collect a span or not. Multiple sampling decisions can be made before a span is finally consumed. When sampling is to be performed at multiple points in the process, the only way to reason about it effectively is to make sure that the sampling decisions are **consistent**.
-In this context, consistency means that a positive sampling decision made for a particular span with probability p1 implies a positive sampling decision for any span belonging to the same trace, if it is made with probability p2 >= p1.
+Customers want to express arbitrary sampling probabilities such as 1%, 10%, and 75%. However, the existing experimental [specification for probability sampling using TraceState](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/tracestate-probability-sampling.md) optimizes for power-of-two probabilities and supports other probabilities only through interpolation between powers of two. This approach is unnecessarily restrictive. Hence, we need an updated mechanism that supports specifying any sampling probability.
+
+Further, there is a need for consistent sampling in the collection path (outside of the head-based sampling paths). To achieve consistent sampling decisions, the previous experimental spec required a custom source of randomness (`r-value`). However, for such downstream sampling decisions, it can be expensive to read this custom value from the tracestate of every span. This proposal instead uses the inherent randomness in the traceID as a less expensive solution. One caveat is that the new randomness flag introduced in the W3C TraceContext Level 2 specification can be reset by trace participants until they move to that Level 2 specification. Hence, a sampler still needs to check the tracestate and confirm that the custom random value is absent before relying on the traceID as the source of randomness.
 
 ## Explanation
+Let's start with the definition for a consistent sampling decision. Consistency means that a positive sampling decision made for a particular span with probability p1 implies a positive sampling decision for any span belonging to the same trace if it is made with probability p2 >= p1.
+
+This proposal introduces a new value with the key `th` as an alternative to the `p` value in the previous specification. The `p` value is limited to powers of two, while the `th` value in this proposal supports a large range of values.
 
-The existing, experimental [specification for probability sampling using TraceState](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/tracestate-probability-sampling.md) is limited to powers-of-two probabilities, and is designed to work without making assumptions about TraceID randomness.
-This system can only achieve non-power-of-two sampling using interpolation between powers of two, which is unnecessarily restrictive.
-In existing sampling systems, sampling probabilities like 1%, 10%, and 75% are common, and it should be possible to express these without interpolation.
-There is also a need for consistent sampling in the collection path (outside of the head-sampling paths) and using inherent randomness in the traceID is a less-expensive solution than referencing a custom `r-value` from the tracestate in every span.
-This proposal introduces a new value with the key `th` as an alternative to the `p` value in the previous specification.
-The `p` value is limited to powers of two, while the `th` value in this proposal supports a large range of values.
-This proposal allows for the continued expression of randomness using `r-value` as specified there using the key `r`.
-To distinguish the cases, this proposal uses the key `rv`.
+This proposal allows for the continued expression of randomness using `r-value` as specified there using the key `r`. To distinguish the cases, this proposal uses the key `rv`.
 
-In the general case, in order to make consistent sampling decisions across the entire path of the trace, two values MUST be present in the `SpanContext`:
+In the general case, in order to make consistent sampling decisions for the two aspects described above, two values MUST be present in the `SpanContext`:
 
 1. A _random_ (or pseudo-random) 56-bit value, called `R` below.
-2. A 56-bit _rejection threshold_ (or just "threshold") as expressed in the TraceState, called `T` below. `T` represents the maximum threshold that was applied in all previous consistent sampling stages. If the current sampling stage applies a greater-valued threshold than any stage before, it MUST update (increase) the threshold correspondingly.
+2. A 56-bit _rejection threshold_ (or just "threshold") as expressed in the TraceState, called `T` below. `T` represents the maximum threshold that was applied in all previous consistent sampling stages. If the current sampling stage applies a greater threshold value than any stage before, it MUST update (increase) the threshold correspondingly.
 
-One way to think about _rejection threshold_ is that is the number of spans that would be discarded out of 2^56 considered spans. This means that spans where `R >= T` will be sampled.
+One way to think about _rejection threshold_ is that it is the number of spans that would be discarded out of 2^56 considered spans. This means that spans where `R >= T` will be kept.
 
-Here is an example involving three participants `A`, `B`, and `C`:
+Here is an example involving three participating operations `A`, `B`, and `C`:
 
 `A` -> `B` -> `C`
 
-where -> indicates a parent -> child relationship.
+where -> indicates a parent to child relationship.
 
 `A` uses consistent probability sampling with a sampling probability of 0.25 (this corresponds to a rejection probability of .75).
 `B` uses consistent probability sampling with a sampling probability of 0.5.
 `C` uses a parent-based sampler.
 
-When `A` samples a span, its outgoing traceparent will have the 'sampled' flag SET and the 'th' in its outgoing tracestate will be set to `0xc0_0000_0000_0000`.
-When `A` does not sample a span, its outgoing traceparent will have the 'sampled' flag UNSET but the 'th' in its outgoing tracestate will still be set to `0xc0_0000_0000_0000`.
-When B samples a span, its outgoing traceparent will have the 'sampled' flag SET and the 'th' in its outgoing tracestate will be set to `0x80_0000_0000_0000`.
+When the sampling decision for `A` is to *keep* the span, its outgoing traceparent will have the 'sampled' flag SET and the 'th' in its outgoing tracestate will be set to `0xc0_0000_0000_0000`.
+When the sampling decision for `A` is to *drop* the span, its outgoing traceparent will have the 'sampled' flag UNSET but the 'th' in its outgoing tracestate will still be set to `0xc0_0000_0000_0000`.
+When the sampling decision for `B` is to *keep* the span, its outgoing traceparent will have the 'sampled' flag SET and the 'th' in its outgoing tracestate will be set to `0x80_0000_0000_0000`.
 `C`, being a parent-based sampler, samples a span purely based on its parent (`B` in this case) and uses the sampled flag to make the decision. Its outgoing 'th' value continues to reflect what it received from `B` (`0x80_0000_0000_0000`), which is useful for understanding its adjusted count.
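The threshold values in this example can be reproduced with a few lines of arithmetic. The following is a minimal sketch, assuming that a rejection threshold is obtained by scaling the rejection probability (1 minus the sampling probability) to the 56-bit range; the function name `threshold_from_probability` and the rounding and clamping choices are illustrative assumptions, not part of this proposal.

```python
MAX_THRESHOLD = 1 << 56  # 2**56, one past the largest 56-bit value


def threshold_from_probability(probability: float) -> int:
    """Map a sampling probability in (0, 1] to a 56-bit rejection threshold.

    Hypothetical helper: the rounding behavior here is an assumption.
    """
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    threshold = round((1.0 - probability) * MAX_THRESHOLD)
    # Clamp so that very small probabilities still fit in 56 bits.
    return min(threshold, MAX_THRESHOLD - 1)


print(f"A: 0x{threshold_from_probability(0.25):014x}")  # A: 0xc0000000000000
print(f"B: 0x{threshold_from_probability(0.5):014x}")   # B: 0x80000000000000
```

A sampling probability of 1.0 maps to a threshold of 0 under this sketch, which matches the Always Sample case.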
 
 This design requires that as a given span progresses along its collection path, `th` is non-decreasing (and, in particular, must be increased at stages that apply lower sampling probabilities).
-It does not, however, restrict a span's initial `th` in any way (e.g., relating it to that of its parent, if it has one).
-It is acceptable for B to have a lesser initial `th` than A has. It would not be ok if some later-stage sampler decreased A's `th`.
+It does not, however, restrict a span's initial `th` in any way. If a parent-based consistent sampler is used, a span's initial `th` would be the same as its parent's `th` value, else it would be a new value based on the sampling rate chosen for that span. In other words, the sampling rate for each operation can be chosen independently, and this would map to having different `th` values for different spans. But for any particular span, it is not acceptable for a downstream sampler to *decrease* the `th` value in its context.
 
 The system has the following invariant:
 
@@ -51,27 +57,29 @@ The sampling decision is propagated with the following algorithm:
 * If the `th` key is not specified, this implies that non-probabilistic sampling may be taking place.
 * Else derive `T` by parsing the `th` key as a hex value as described below.
 * If `T` is 0, Always Sample.
-* Compare the 56 bits of `T` with the 56 bits of `R`. If `T > R`, then do not sample.
+* Compare the 56 bits of `T` with the 56 bits of `R`. If `R >= T`, the sampling decision is *keep*; otherwise, it is *drop*.
 
 The `R` value MUST be derived as follows:
 
 * If the key `rv` is present in the Tracestate header, then `R = rv`.
-* Else if the Random Trace ID Flag is `true` in the traceparent header, then `R` is the lowest-order 56 bits of the trace-id.
-* Else `R` MUST be generated as a random value in the range `[0, (2**56)-1]` and added to the Tracestate header with key `rv`.
+* Else `R` is the lowest-order 56 bits of the trace-id.
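The derivation of `R` and the `R >= T` comparison can be sketched as follows. This is an illustrative sketch only: the function names `derive_r` and `keep` are hypothetical, the trace-id is assumed to be available as a 32-character hex string, and `rv` is assumed to have already been extracted from the tracestate (or to be `None` when the key is absent).

```python
from typing import Optional

HEX_DIGITS = set("0123456789abcdef")
MASK_56 = (1 << 56) - 1  # selects the lowest-order 56 bits


def derive_r(trace_id_hex: str, rv: Optional[str] = None) -> int:
    """Return R: the `rv` value if present, else the low 56 bits of the trace-id."""
    if rv is not None:
        # `rv` must be exactly 14 lowercase hex digits.
        if len(rv) != 14 or not set(rv) <= HEX_DIGITS:
            raise ValueError("rv must be exactly 14 hex digits")
        return int(rv, 16)
    return int(trace_id_hex, 16) & MASK_56


def keep(r: int, t: int) -> bool:
    """Consistent sampling decision: keep the span when R >= T."""
    return r >= t


# Example trace-id chosen for illustration.
r = derive_r("4bf92f3577b34da6a3ce929d0e0e4736")
print(hex(r))                          # 0xce929d0e0e4736
print(keep(r, 0xc0_0000_0000_0000))    # True
print(keep(r, 0xd0_0000_0000_0000))    # False
```

When an `rv` value is supplied, it takes precedence over the trace-id, so the same trace-id can yield a different decision.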
+
+At the root span, the `R` value must be generated as follows:
+
+* If the new random flag in the `traceparent` is set, then no action is required. In this case, the tracestate header will not have the `rv` key, and the lowest-order 56 bits of the trace-id will be used as the source of randomness. For more info on this flag, see [the W3C trace context specification](https://w3c.github.io/trace-context/#trace-id).
+* If not, `R` MUST be generated as a random value in the range `[0, (2**56)-1]` and added to the Tracestate header with key `rv`.
 
-The preferred way to propagate the `R` value is as the lowest 56 bits of the trace-id.
-If these bits are in fact random, the `random` trace-flag SHOULD be set as specified in [the W3C trace context specification](https://w3c.github.io/trace-context/#trace-id).
-There are circumstances where trace-id randomness is inadequate (for example, sampling a group of traces together); in these cases, an `rv` value is required.
+Although less common, there are circumstances where trace-id randomness is inadequate (for example, when sampling a group of traces together); in these cases, an `rv` value is required.
 
 The value of the `rv` and `th` keys MUST be expressed as up to 14 hexadecimal digits from the set `[0-9a-f]`. For `th` keys only, trailing zeros (but not leading zeros) may be omitted. `rv` keys MUST always be exactly 14 hex digits.
 
 Examples:
 
 - `th` value is missing: non-probabilistic sampling may be taking place.
-- `th=4` -- equivalent to `th=40000000000000`, which is a 25% rejection threshold, corresponding to a 75% sampling probability.
-- `th=c` -- equivalent to `th=c0000000000000`, which is a rejection threshold of 75%, corresponding to a sampling probability of 25%.
-- `th=08` -- equivalent to `th=08000000000000`, which is a rejection threshold of 3.125%, corresponding to a sampling probability of 96.875%.
-- `th=0` -- equivalent to `th=00000000000000`, which is a 0% rejection threshold, which means Always Sample.
+- `th=0` -- equivalent to `th=00000000000000`, which is a 0% rejection threshold, corresponding to 100% sampling probability (Always Sample).
+- `th=08` -- equivalent to `th=08000000000000`, which is a rejection threshold of 3.125%, corresponding to 96.875% sampling probability.
+- `th=4` -- equivalent to `th=40000000000000`, which is a 25% rejection threshold, corresponding to 75% sampling probability.
+- `th=c` -- equivalent to `th=c0000000000000`, which is a rejection threshold of 75%, corresponding to 25% sampling probability.
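The examples above can be checked with a small encoder and decoder for the `th` value. This is a sketch under the stated encoding rules (up to 14 lowercase hex digits, trailing zeros optional); the names `parse_th`, `encode_th`, and `sampling_probability` are hypothetical and not defined by this proposal.

```python
HEX_DIGITS = set("0123456789abcdef")
MAX_THRESHOLD = 1 << 56


def parse_th(th: str) -> int:
    """Decode a `th` string into a 56-bit threshold, restoring trailing zeros."""
    if not 1 <= len(th) <= 14 or not set(th) <= HEX_DIGITS:
        raise ValueError("th must be 1 to 14 lowercase hex digits")
    return int(th.ljust(14, "0"), 16)


def encode_th(threshold: int) -> str:
    """Encode a threshold, dropping trailing (but not leading) zeros."""
    if not 0 <= threshold < MAX_THRESHOLD:
        raise ValueError("threshold must fit in 56 bits")
    # An all-zero threshold strips to the empty string, so fall back to "0".
    return f"{threshold:014x}".rstrip("0") or "0"


def sampling_probability(th: str) -> float:
    """Sampling probability implied by a `th` string (1 minus the rejection fraction)."""
    return 1.0 - parse_th(th) / MAX_THRESHOLD


for th in ("0", "08", "4", "c"):
    print(th, f"{parse_th(th):014x}", sampling_probability(th))
```

Padding on the right rather than the left is what makes `th=4` mean `40000000000000` and not `00000000000004`.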
 
 The `T` value MUST be derived as follows:
 
@@ -82,21 +90,21 @@ Sampling Decisions MUST be propagated by setting the value of the `th` key in th
 
 ## Initializing and updating T and R values
 
-There are two categories of sampler:
+There are two categories of samplers:
 
 - **Head samplers:** Implementations of [`Sampler`](https://github.com/open-telemetry/opentelemetry-specification/blob/v1.29.0/specification/trace/sdk.md#sampler), called by a `Tracer` during span creation.
-- **Downstream samplers:** Any component that, given an ended Span, decides whether to drop or forward ("sample") it on to the next component in the system. Also known as "collection-path samplers" or "sampling processors". _Tail samplers_ are a special class of downstream samplers that buffer the spans in a trace and select a sampling probability for the trace as a whole using data from any span in the buffered trace.
+- **Downstream samplers:** Any component that, given an ended Span, decides whether to *drop* or *keep* it by forwarding it to the next component in the system. This category is also known as "collection path samplers" or "sampling processors". _Tail samplers_ are a special class of downstream samplers that buffer spans of a trace and make a sampling decision for the trace as a whole using data from any span in the buffered trace.
 
 This section defines behavior for each kind of sampler.
 
 ### Head samplers
 
-A head sampler is responsible for computing the `rv` and `th` values in a new span's initial [`TraceState`](https://github.com/open-telemetry/opentelemetry-specification/blob/v1.29.0/specification/trace/api.md#tracestate). Notable inputs to that computation include the parent span's trace state (if a parent span exists) and the new span's trace ID.
+A head sampler is responsible for computing the `rv` and `th` values in a new span's initial [`TraceState`](https://github.com/open-telemetry/opentelemetry-specification/blob/v1.29.0/specification/trace/api.md#tracestate). The main inputs to that computation include the parent span's trace state (if a parent span exists), the new span's trace ID, and possibly the trace flags (to know if the trace ID has been generated in a random manner).
 
-First, a consistent `Sampler` decides which sampling probability to use. The sampler MAY select any value of T. If a valid `SpanContext` is provided in the call to `ShouldSample` (indicating that the span being created will be a child span),
+First, a consistent probability `Sampler` may choose its own sampling rate. The higher the chosen sampling rate, the lower the rejection threshold (T). It MAY select any value of T. If a valid `SpanContext` is provided in the call to `ShouldSample` (indicating that the span being created will be a child span),
 
-- Choosing a T greater than the parent span's is expected to result in partial traces (the parent may be sampled but its child, the current span, dropped).
-- Choosing a T less than or equal to the parent span is expected to result in complete traces (this is definition of consistent probability sampling).
+- Choosing a T greater than the parent span's T can result in partial traces. The parent span may be *kept* while its child, the current span, is *dropped* because of its lower sampling rate. Conversely, whenever the child span is *kept*, the parent span is *kept* as well (meeting our consistent sampling goals), since the parent's sampling rate is greater than the child's.
+- Similarly, choosing a T less than or equal to the parent span's T can also result in partial traces. The parent span may be *dropped* while its child, the current span, is *kept* because of its higher sampling rate. Conversely, whenever the parent span is *kept*, the child span is *kept* as well (meeting our consistent sampling goals), since the child's sampling rate is greater than or equal to the parent's.
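The relationship between parent and child thresholds described in the two cases above can be illustrated with a small simulation. This is a sketch only: the thresholds, the sample size, and the helper name `decisions` are assumptions chosen for illustration, and both spans are assumed to share the same `R` because they belong to the same trace.

```python
import random

T_PARENT = 0x80_0000_0000_0000  # parent keeps 50% of traces
T_CHILD = 0xc0_0000_0000_0000   # child keeps 25% of traces (greater T)


def decisions(r: int, t_parent: int, t_child: int) -> tuple[bool, bool]:
    """Return (parent_kept, child_kept) for one trace with shared randomness R."""
    return r >= t_parent, r >= t_child


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [decisions(rng.getrandbits(56), T_PARENT, T_CHILD) for _ in range(10_000)]

# Consistency: a kept child always has a kept parent.
child_without_parent = sum(1 for parent, child in samples if child and not parent)
# Partial traces: the parent is kept but the child is dropped.
parent_without_child = sum(1 for parent, child in samples if parent and not child)

print("child kept, parent dropped:", child_without_parent)
print("parent kept, child dropped:", parent_without_child)
```

Swapping the two thresholds reverses the roles: the child is then kept whenever the parent is, and partial traces arise from dropped parents instead.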
 
 For the output TraceState,
 
@@ -169,7 +177,7 @@ This proposal is the result of long negotiations on the Sampling SIG over what i
 
 ## Prior art and alternatives
 
-The existing specification for `r-value` and `p-value` attempted to solve this problem, but were limited to powers of 2, which is inadequate.
+The existing specification for `r-value` and `p-value` attempted to solve this problem, but was limited to powers of 2, which is inadequate.
 
 ## Open questions
 
@@ -180,5 +188,5 @@ We also know that some implementations prefer to use a sampling probability (in
 ## Future possibilities
 
 This permits sampling systems to propagate consistent sampling information downstream where it can be compensated for.
-For example, this will enable the tail-sampling processor in the OTel Collector to propagate its sampling decisions to backends in a standard way.
+For example, this will enable the tail-sampling processor in the OTel Collector to propagate its sampling decisions to backend systems in a standard way.
 This permits backend systems to use the effective sampling probability in data presentations.