open-telemetry · jmacd · Sep 29, 2021 · Jul 23, 2021
diff --git a/text/trace/0168-sampling-propagation.md b/text/trace/0168-sampling-propagation.md
@@ -0,0 +1,317 @@
+# Propagate head trace sampling probability
+
+Use the W3C trace context to convey consistent head trace sampling probability.
+
+## Motivation
+
+The head trace sampling probability is the probability that was used
+at the start of a trace context to decide whether the W3C `sampled`
+flag is set.  That flag in turn determines whether child contexts
+will be sampled by a `ParentBased` Sampler.  It is useful to know the
+head trace sampling probability associated with a context in order to
+build span-to-metrics pipelines when the built-in `ParentBased`
+Sampler is used.  Further motivation for supporting span-to-metrics
+pipelines is presented in [OTEP
+170](https://github.com/open-telemetry/oteps/pull/170).
+
+A consistent trace sampling decision is one that can be carried out at
+any node in a trace, which supports collecting partial traces.
+OpenTelemetry specifies a built-in `TraceIDRatioBased` Sampler that
+aims to accomplish this goal but was left incomplete (see a
+[TODO](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/sdk.md#traceidratiobased) 
+in the specification).
+
+We propose to propagate the necessary information alongside the [W3C
+sampled flag](https://www.w3.org/TR/trace-context/#sampled-flag) using
+`tracestate` with an `otel` vendor tag, which will require
+(separately) [specifying how the OpenTelemetry project uses
+`tracestate` itself](https://github.com/open-telemetry/opentelemetry-specification/pull/1852).
+
+## Explanation
+
+Two pieces of information are needed to convey consistent head trace
+sampling probability:
+
+1. The head trace sampling probability.
+2. A source of consistent sampling decisions.
+
+This proposal uses 6 bits of information for each of these and does
+not depend on built-in TraceID randomness, which is not sufficiently
+specified for probability sampling at this time.  This proposal closely 
+follows [research by Otmar Ertl](https://arxiv.org/pdf/2107.07703.pdf).
+
+### Probability value
+
+To limit the cost of this extension and for statistical reasons
+documented below, we propose to limit head trace sampling probability
+to powers of two.  This limits the available head trace sampling
+probabilities to 1/2, 1/4, 1/8, and so on.  We can compactly encode
+these probabilities as small integer values using the base-2 logarithm
+of the adjusted count (i.e., inverse probability).
+
+For example, the probability value 2 corresponds with 1-in-4 sampling,
+and the probability value 10 corresponds with 1-in-1024 sampling.  Using
+six bits of information we can convey sampling rates as small as
+2**-61.  The value 62 is reserved to mean sampling with probability 0,
+which conveys an adjusted count of 0 for the associated context.
+
+When propagated the probability value will be interpreted as shown in
+the following table, which uses an offset of +1:
+
+| Probability Value | Head Probability |
+| ----- | ----------- |
+| 0 | Unknown |
+| 1 | 1 |
+| 2 | 1/2 |
+| 3 | 1/4 |
+| ... | ... |
+| N | 2**(-N+1) |
+| ... | ... |
+| 61 | 2**-60 |
+| 62 | 2**-61 |
+| 63 | 0 |
+
+As [described in OTEP
+170](https://github.com/open-telemetry/oteps/pull/170), Span data
+would encode the probability value described here offset by +1 when
+the adjusted count is known, and would encode 0 when the adjusted
+count is unknown.
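
The table above can be read as a small decoding function. The following is a minimal sketch, not part of the proposal: the name `headProbability` is hypothetical, and the mapping assumes the +1 offset exactly as tabulated (0 is unknown, 63 is zero probability, and any other value `N` means `2**(-N+1)`).

```go
package main

import (
	"fmt"
	"math"
)

// headProbability interprets an encoded probability value per the table:
// 0 is unknown, 63 is probability zero, and N in 1..62 is 2**(-N+1).
// known is false for 0 and for values outside the 6-bit range.
func headProbability(value int) (prob float64, known bool) {
	switch {
	case value <= 0 || value > 63:
		return 0, false
	case value == 63:
		return 0, true
	default:
		// math.Ldexp(1, e) computes 1 * 2**e exactly.
		return math.Ldexp(1, 1-value), true
	}
}

func main() {
	for _, v := range []int{0, 1, 2, 3, 62, 63} {
		p, known := headProbability(v)
		fmt.Println(v, p, known)
	}
}
```

Because every representable probability is a power of two, the result is exact in a `float64` and no rounding is involved.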
+
+### Randomness value
+
+With head trace sampling probabilities limited to powers of two, the
+amount of randomness needed per trace context is limited.  A
+consistent sampling decision is accomplished by propagating a specific
+random variable denoted `R`.  The random variable is described by a
+discrete geometric distribution having shape parameter `1/2`, listed
+below:
+
+| `R` Value | Selection Probability |
+| ---------------- | --------------------- |
+| 0 | 1/2 |
+| 1 | 1/4 |
+| 2 | 1/8 |
+| 3 | 1/16 |
+| ... | ... |
+| 0 <= `R` <= 61 | 2**-(`R`+1) |
+| ... | ... |
+| 60 | 2**-61 |
+| 61 | 2**-62 |
+| 62 | 2**-62 |
+| 63 | 0 |
+
+Such a random variable `R` can be generated using the following
+pseudocode.  Note there is a tiny probability that the code has to
+reject the calculated result and start over, since the value 62 is
+defined to have adjusted count 0, not 2**62.
+
+```golang
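+// nextRandomness returns a geometrically distributed value R.
+// nextRandomBit is assumed to return one uniformly random bit.
+func nextRandomness() int {
+  // Repeat until a valid result is produced.
+  for {
+    R := 0
+    // Count the false bits seen before the first true bit.
+    for !nextRandomBit() {
+      R++
+    }
+    // The expected value of R is 1.
+    if R < 63 {
+      return R
+    }
+    // Reject, try again.
+  }
+}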
+```
+
+This can be computed from a stream of random bits as the number of
+leading zeros using efficient instructions on modern computer
+architectures.
+
+For example, the value 3 means there were three leading zeros and
+corresponds with being sampled at probabilities 1-in-1 through 1-in-8
+but not at probabilities 1-in-16 and smaller.
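
The leading-zeros formulation can be sketched as follows. This is an illustration under stated assumptions rather than the proposal's own code: the name `randomnessFromBits` is hypothetical, and it assumes the random stream is the top 62 bits of a 64-bit word, which yields values 0 through 62 with the selection probabilities listed in the `R` table above and needs no rejection loop.

```go
package main

import (
	"fmt"
	"math/bits"
	"math/rand"
)

// randomnessFromBits treats the top 62 bits of x as the random stream and
// returns the number of leading zeros among them, a value in 0..62.
func randomnessFromBits(x uint64) int {
	// Setting the two low bits caps the count at 62 without a branch.
	return bits.LeadingZeros64(x | 3)
}

func main() {
	// math/rand is used only for illustration; the choice of random
	// source is an assumption, not something the proposal specifies.
	fmt.Println(randomnessFromBits(rand.Uint64()))
}
```

For instance, the input `1 << 60` has three leading zeros, so it yields the value 3 discussed above.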
+
+### Proposed `tracestate` syntax
+
+The consistent sampling decision and head trace sampling probability
+will be propagated using four bytes of base16 content, as follows:
+
+```
+tracestate: otel=p:PP;r:RR
+```
+
+where `PP` is the probability value and `RR` is the random value, each
+encoded as two bytes of base16.  These values are omitted when they are
+unknown.
+
+This proposal should be taken as a recommendation and will be modified
+to [match whatever format OpenTelemetry specifies for its
+`tracestate`](https://github.com/open-telemetry/opentelemetry-specification/pull/1852).
+The choice of base16 encoding is therefore just a recommendation,
+chosen because `traceparent` uses base16 encoding.
+
+### Examples
+
+The following `tracestate` value:
+
+```
+tracestate: otel=r:0a;p:03
+```
+
+translates to
+
+```
+base16(probability) = 03 // 1-in-8 head probability
+base16(randomness) = 0a // qualifies for 1-in-1024 sampling or greater
+```
+
+Any `TraceIDRatioBased` Sampler configured with probability 2**-10 or
+greater will enable sampling this trace, whereas any
+`TraceIDRatioBased` Sampler configured with probability 2**-11 or less
+will stop sampling this trace.  The W3C `sampled` flag is set to true
+when the probability value is less than or equal to the randomness
+value.
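
Parsing the recommended syntax and applying the rule above could look like the sketch below. The type and function names (`otelState`, `parseOtelValue`, `isSampled`) are hypothetical, and the input is assumed to be the value of the `otel` entry only, already extracted from the `tracestate` header.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// otelState holds the parsed fields; HasP and HasR are false when the
// corresponding field was omitted (meaning the value is unknown).
type otelState struct {
	P, R       uint64
	HasP, HasR bool
}

// parseOtelValue parses the value of the otel entry, e.g. "r:0a;p:03".
// Fields may appear in either order; unrecognized keys are ignored.
func parseOtelValue(value string) (otelState, error) {
	var st otelState
	for _, field := range strings.Split(value, ";") {
		kv := strings.SplitN(field, ":", 2)
		if len(kv) != 2 {
			return st, fmt.Errorf("malformed field %q", field)
		}
		n, err := strconv.ParseUint(kv[1], 16, 8)
		if err != nil {
			return st, fmt.Errorf("field %q: %w", field, err)
		}
		switch kv[0] {
		case "p":
			st.P, st.HasP = n, true
		case "r":
			st.R, st.HasR = n, true
		}
	}
	return st, nil
}

// isSampled applies the rule stated above: the probability value must be
// less than or equal to the randomness value, and both must be known.
func isSampled(st otelState) bool {
	return st.HasP && st.HasR && st.P <= st.R
}

func main() {
	st, err := parseOtelValue("r:0a;p:03")
	fmt.Println(st, err, isSampled(st))
}
```

For the example value `r:0a;p:03` this parses a probability value of 3 and a randomness value of 10, so the comparison `3 <= 10` holds and the context counts as sampled.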
+
+## Internal details
+
+The reasoning behind restricting the set of sampling rates is that it:
+
+- Lowers the cost of propagating head sampling probability
+- Limits the number of random bits required
+- Avoids floating-point to integer rounding errors
+- Makes math involving partial traces tractable.
+
+[An algorithm for making statistical inference from partially-sampled
+traces has been published](https://arxiv.org/pdf/2107.07703.pdf) that
+explains how to work with a limited number of power-of-2 sampling rates.
+
+### Behavior of the `TraceIDRatioBased` Sampler
+
+The Sampler must be configured with a power-of-two probability
+`2**-S` except for the special case of zero probability, which is handled
+specially.
+
+If the context is a new root, the initial `tracestate` must be created
+using geometrically-distributed random value `R` (as described above,
+with maximum value 61) and the initial head probability value `S`.  If
+the head probability is zero use `S=63`, the specified value for zero
+probability.
+
+If the context is not a new root, output a new `tracestate` with the
+same `R` value as the parent context, and this Sampler's value of `S`
+for the outgoing context's probability value (i.e., as the value for
+`P`).
+
+In both cases, set the `sampled` bit if `S<=R` and `S<63`.
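
The three rules above can be summarized in a short sketch. It follows the text as written, including the value 63 for zero probability; the names `traceState` and `shouldSample` and the injected `newR` function are hypothetical and do not reflect the OTel-Go `Sampler` API.

```go
package main

import "fmt"

// traceState holds the two propagated values.
type traceState struct {
	P int // probability value for the outgoing context
	R int // randomness value, fixed for the whole trace
}

// zeroProbability is the value the text above specifies for zero probability.
const zeroProbability = 63

// shouldSample applies the rules for a Sampler configured with exponent s.
// parent is nil for a new root; newR supplies a fresh randomness value and
// is only called for roots.
func shouldSample(s int, parent *traceState, newR func() int) (traceState, bool) {
	var r int
	if parent == nil {
		r = newR() // new root: draw R once
	} else {
		r = parent.R // child: reuse the parent's R
	}
	out := traceState{P: s, R: r}
	return out, s <= r && s < zeroProbability
}

func main() {
	root, sampled := shouldSample(3, nil, func() int { return 10 })
	fmt.Println(root, sampled)

	child, sampled := shouldSample(11, &root, nil)
	fmt.Println(child, sampled)
}
```

Because `R` is inherited unchanged, every Sampler along the trace compares its own `S` against the same random value, which is what makes the decisions consistent.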
+
+### Behavior of the `ParentBased` sampler
+
+The `ParentBased` sampler is unmodified by this proposal.  It honors
+the W3C `sampled` flag and copies the incoming `tracestate` keys to
+the child context.
+
+### Behavior of the `AlwaysOn` Sampler
+
+The `AlwaysOn` Sampler behaves the same as `TraceIDRatioBased` with `P=1` (i.e., `S=0`).
+
+### Behavior of the `AlwaysOff` Sampler
+
+The `AlwaysOff` Sampler behaves the same as `TraceIDRatioBased` with `P=0` (i.e., `S=62`).
+
+## Worked example
+
+The behavior of these tables can be verified by hand using a smaller
+example.  The following table shows how these equations work where
+`R`, `P`, and `S` are limited to 3 bits.
+
+Values of `P`, which have the same encoded value and interpretation as
+for the proposed `log_head_adjusted_count` field of OTEP 170, would be
+interpreted as follows:
+
+| `P` value | Adjusted count |
+| -----     | -----          |
+| 0         | Unknown        |
+| 1         | 1              |
+| 2         | 2              |
+| 3         | 4              |
+| 4         | 8              |
+| 5         | 16             |
+| 6         | 32             |
+| 7         | 0              |
+
+Note there are only 6 non-zero, non-unknown values for the adjusted
+count. Thus there are six defined values of `R` and `S`.  The
+following table shows `R` and the corresponding selection probability,
+along with the calculated adjusted count for each `S`:
+
+| `R` value | `R` selection probability | `S=0` | `S=1` | `S=2` | `S=3` | `S=4` | `S=5` |
+| --        | --                        | --    | --    | --    | --    | --    | --    |
+| 0         | 1/2                       | 1     | 0     | 0     | 0     | 0     | 0     |
+| 1         | 1/4                       | 1     | 2     | 0     | 0     | 0     | 0     |
+| 2         | 1/8                       | 1     | 2     | 4     | 0     | 0     | 0     |
+| 3         | 1/16                      | 1     | 2     | 4     | 8     | 0     | 0     |
+| 4         | 1/32                      | 1     | 2     | 4     | 8     | 16    | 0     |
+| 5         | 1/32                      | 1     | 2     | 4     | 8     | 16    | 32    |
+
+Notice that the sum of `R` selection probability times adjusted count
+in each of the `S=*` columns equals 1.  For example, in the `S=4`
+column we have `0*1/2 + 0*1/4 + 0*1/8 + 0*1/16 + 16*1/32 + 16*1/32 =
+16/32 + 16/32 = 1`.  In the `S=2` column we have `0*1/2 + 0*1/4 +
+4*1/8 + 4*1/16 + 4*1/32 + 4*1/32 = 4/8 + 4/16 + 4/32 + 4/32 = 1/2 +
+1/4 + 1/8 + 1/8 = 1`.  We conclude that when `R` is chosen with the
+given probabilities, any choice of `S` produces one expected span.
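
The hand calculation can also be checked mechanically. The sketch below is an illustration with hypothetical function names; it assumes the 3-bit setting of this section, with `R` in 0 through 5, the last two `R` values sharing probability 1/32, and an adjusted count of `2**S` whenever `S <= R`.

```go
package main

import "fmt"

// maxR is the largest R value in the 3-bit example.
const maxR = 5

// selectionProbability returns the probability of drawing r: 2**-(r+1)
// for r < maxR, with the final value sharing the probability 2**-maxR.
func selectionProbability(r int) float64 {
	if r == maxR {
		return 1.0 / float64(int(1)<<maxR)
	}
	return 1.0 / float64(int(1)<<(r+1))
}

// adjustedCount returns 2**s when the span is sampled (s <= r), else 0.
func adjustedCount(s, r int) float64 {
	if s <= r {
		return float64(int(1) << s)
	}
	return 0
}

// expectedCount sums probability times adjusted count over all R values.
func expectedCount(s int) float64 {
	sum := 0.0
	for r := 0; r <= maxR; r++ {
		sum += selectionProbability(r) * adjustedCount(s, r)
	}
	return sum
}

func main() {
	for s := 0; s <= maxR; s++ {
		fmt.Printf("S=%d expected adjusted count = %v\n", s, expectedCount(s))
	}
}
```

All terms are powers of two, so the floating-point sums are exact and each column can be compared against 1 directly.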
+
+## Prototype
+
+[This proposal has been prototyped in the OTel-Go
+SDK.](https://github.com/open-telemetry/opentelemetry-go/pull/2177) No
+changes in the OTel-Go Tracing SDK's `Sampler` or `tracestate` APIs
+were needed.
+
+## Trade-offs and mitigations
+
+### Not using TraceID randomness
+
+It would be possible, if TraceID were specified to have at least 62
+uniform random bits, to compute the randomness value described above
+as the number of leading zeros among those 62 random bits.
+
+That approach would require modifying the W3C traceparent specification,
+therefore we do not propose to use bits of the TraceID.
+
+[This issue has been filed with the W3C trace context group.](https://github.com/w3c/trace-context/issues/463)
+
+### Not using TraceID hashing
+
+It would be possible to make a consistent sampling decision by hashing
+the TraceID, but we feel such an approach is not sufficient for making
+unbiased sampling decisions.  It is seen as a relatively difficult
+task to define and specify a good enough hashing function, much less
+to have it implemented in multiple languages.
+
+Hashing is also computationally expensive. This proposal uses extra
+data to avoid the computational cost of hashing TraceIDs.
+
+### Restriction to power-of-two 
+
+Restricting head sampling rates to powers of two does not limit tail
+Samplers from using arbitrary probabilities.  The companion [OTEP
+170](https://github.com/open-telemetry/oteps/pull/170) has discussed
+the use of a `sampler.adjusted_count` attribute that would not be
+limited to power-of-two values.  Discussion about how to represent the
+effective adjusted count for tail-sampled Spans belongs in [OTEP
+170](https://github.com/open-telemetry/oteps/pull/170), not this OTEP.
+
+Restricting head sampling rates to powers of two does not limit
+Samplers from using arbitrary effective probabilities over a period of
+time.  For example, choosing 1/2 sampling half of the time and 1/4
+sampling half of the time leads to an effective sampling rate of 3/8.
+
+## Prior art and alternatives
+
+Google's Dapper system propagated a field in its trace context called
+"inverse_probability", which is equivalent to adjusted count.  This
+proposal uses the base-2 logarithm of adjusted count to save space and
+limit required randomness.