OpenTelemetry TraceIdRatioBased sampler requirements following OTEP 235 #4166

Open
wants to merge 20 commits into
base: main

Conversation

@jmacd jmacd commented Jul 29, 2024

Fixes #1413.

Changes

Updates Trace SDK and TraceState handling specifications with OTEP 235 sampling thresholds. This PR depends on #4162 to introduce the concept of Trace Randomness. This PR is the second of two parts; it focuses on thresholds.

  • Revise TraceIdRatioBased algorithm section. The existing TODO implies this is not a breaking change.
  • Change text about TraceIdRatioBased construction
  • Move text about TraceIdRatioBased description (leave unmodified).

The content of OTEP 235 was revised for clarity by @kalyanaj in open-telemetry/oteps#261. I've heavily copied from the final text in that still-unmerged OTEP. I introduced new content explaining how to compute thresholds from probabilities with use of variable precision, referring to the OTel Collector-Contrib pkg/sampling reference implementation. The new (Golang) demonstration code is validated here, https://go.dev/play/p/7eLM6FkuoA5.

A proof of concept for this specification along with #4162 can be found in open-telemetry/opentelemetry-go#5645.

Part of #3602.

Product of the Sampling SIG members @kentquirk @kalyanaj @oertl @PeterF778 and myself.

jmacd commented Jul 30, 2024

Feedback from the OTel Spec SIG meeting discussion cc/ @jsuereth:

  • Please add a migration guide to explain how transitioning samplers will work; in particular, it's not safe to begin using non-root independent sampling until TraceIdRatioBased samplers are replaced everywhere in a trace. Until then, it is only safe to continue using ParentBased sampling with a root TraceIdRatioBased decision.

Update: 68fa270

github-actions bot commented Aug 7, 2024

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot removed the Stale label Aug 8, 2024
jmacd added a commit that referenced this pull request Aug 15, 2024
This reduces the number of lines of diff in PR 4166, which replaces the
entire `tracestate-probability-sampling.md` file with new contents.

Part of #4166.

## Changes

Move a file, place a link to it and explain that a change is in
progress.
jmacd commented Aug 15, 2024

@kalyanaj @PeterF778 @oertl @kentquirk Please take another look at this PR, especially the file tracestate-probability-sampling.md which now reads as a new file, not as a major rewrite. The contents are derived from open-telemetry/oteps#261.

jmacd commented Aug 15, 2024

@open-telemetry/specs-trace-approvers @open-telemetry/specs-approvers @open-telemetry/technical-committee this PR has reached consensus in the Sampling SIG, we have multiple prototypes implemented, and we are looking for final approvals.

specification/trace/sdk.md Outdated Show resolved Hide resolved
specification/trace/tracestate-handling.md Outdated Show resolved Hide resolved
specification/trace/tracestate-handling.md Outdated Show resolved Hide resolved
This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Aug 28, 2024
jmacd commented Aug 29, 2024

@open-telemetry/specs-trace-approvers @open-telemetry/specs-approvers @open-telemetry/technical-committee this PR has reached consensus in the Sampling SIG, we have multiple prototypes implemented, and we are looking for final approvals.

@github-actions github-actions bot removed the Stale label Aug 30, 2024
specification/trace/sdk.md Show resolved Hide resolved
specification/trace/sdk.md Show resolved Hide resolved
Comment on lines +420 to +426
The `TraceIdRatioBased` GetDescription MUST return a string of the form `"TraceIdRatioBased{RATIO}"`
with `RATIO` replaced with the Sampler instance's trace sampling ratio
represented as a decimal number. The precision of the number SHOULD follow
implementation language standards and SHOULD be high enough to identify when
Samplers have different ratios. For example, if a TraceIdRatioBased Sampler
had a sampling ratio of 1 to every 10,000 spans it could return
`"TraceIdRatioBased{0.000100}"` as its description.
Contributor Author

Note I left this as-is for compatibility purposes. I'd be happy also to say that this was never defined as a stable string, and that we should extend the TraceIdRatioBased sampler's description with the actually configured threshold (which can vary according to precision).

Member

Is there a reason to define it as a stable string now?

Contributor Author

We have explicitly stated that Description need not be a stable string to allow Samplers to vary dynamically, and I can't see why it should be defined as stable anyway. If anyone has been parsing this value to determine the effective sampling probability, leaving it as-is will be good, but they should begin using the encoded threshold instead.

@jpkrohling jpkrohling self-requested a review September 18, 2024 07:50
Member

@jpkrohling jpkrohling left a comment

Partial review, will try to complete by tomorrow.

@@ -87,6 +87,7 @@ formats is required. Implementing more than one format is optional.
| [Built-in `SpanProcessor`s implement `ForceFlush` spec](specification/trace/sdk.md#forceflush-1) | | | + | | + | + | + | + | + | + | + | |
| [Attribute Limits](specification/common/README.md#attribute-limits) | X | | + | | + | + | + | + | | | | |
| Fetch InstrumentationScope from ReadableSpan | | | + | | + | | | + | | | | |
| TraceIdRatioBased implements OpenTelemetry tracestate `th` field | | | | | | | | | | | | |
Member

Same question as the other PR: if this is required, shouldn't there be a couple of implementations lined up before the spec change is merged?

Contributor Author

I have shared my draft, open-telemetry/opentelemetry-go#5645, and @oertl has already merged an equivalent sampler in the Java contrib repository. (I would add that the OTel-Collector-Contrib probabilistic sampler processor acts as a near-prototype.)

(in combination with [`ParentBased`](#parentbased)) because different language
SDKs or even different versions of the same language SDKs may produce inconsistent
results for the same input.
The `TraceIdRatioBased` sampler implements simple, ratio-based probability sampling using randomness features specified in the [W3C Trace Context Level 2][W3CCONTEXTMAIN] Candidate Recommendation.
Member

I feel very bad for this comment, but does this file currently have a word wrap at around 80 characters? I personally prefer not to force line wraps and let people configure their editors to their preferences, but I prefer consistency even more.

Contributor Author

There is inconsistent style in this file, and I feel like reviewers have asked for me to do it both ways. I'll do anything you want!!

Member

I'm fine leaving it as is, we (you?) can send a PR afterwards having it consistent :-)

specification/trace/sdk.md Show resolved Hide resolved

##### `TraceIdRatioBased` sampler algorithm

A Trace configured with sampling threshold `T`, a 56-bit unsigned number corresponding with the sampling ratio, has `ShouldSample()` called for a trace having randomness value `R`, a 56-bit unsigned random number.
Member

I'm having trouble parsing this section. Can we simplify it?

Here's a suggestion which might still need some improvement:

Suggested change
A Trace configured with sampling threshold `T`, a 56-bit unsigned number corresponding with the sampling ratio, has `ShouldSample()` called for a trace having randomness value `R`, a 56-bit unsigned random number.
Given a trace with a sampling threshold `T` and a randomness value `R` (typically, the 7 rightmost bytes of the trace ID), when `ShouldSample()` is called, it checks whether `R >= T` and returns `RECORD_AND_SAMPLE`, otherwise returns `DROP`.

But I think you might have the case in mind where R is not set yet and we are at the root span. In that case, the first "trace" would be "tracer". Related question: is this here supposed to replace the OTEP? I like how we have it in the OTEP:

The R value MUST be derived as follows:

  • If the key rv is present in the Tracestate header, then R = rv.
  • Else if the Random Trace ID Flag is true in the traceparent header, then R is the lowest-order 56 bits of the trace-id.
  • Else R MUST be generated as a random value in the range [0, (2**56)-1] and added to the Tracestate header with key rv.
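
For illustration, the three rules quoted above could be sketched in Python roughly as follows. This is only a sketch under stated assumptions, not text from the OTEP or this PR: the function name `derive_randomness`, the dict used to model the OpenTelemetry tracestate sub-keys, and the constant names are all hypothetical. The value `0x02` is the random-trace-id bit of the W3C Trace Context Level 2 trace flags, and `rv` is written as 14 hex digits (56 bits).

```python
import secrets

RANDOM_FLAG = 0x02          # W3C Trace Context Level 2 random-trace-id flag bit
MASK_56 = (1 << 56) - 1     # selects the lowest-order 56 bits


def derive_randomness(trace_id_hex, trace_flags, otel_tracestate):
    """Return (R, tracestate) following the three quoted rules.

    otel_tracestate is a plain dict of OpenTelemetry tracestate sub-keys
    (a hypothetical representation chosen for this sketch).
    """
    rv = otel_tracestate.get("rv")
    if rv is not None:
        # Rule 1: an explicit rv value wins.
        return int(rv, 16), otel_tracestate
    if trace_flags & RANDOM_FLAG:
        # Rule 2: use the lowest-order 56 bits of the trace ID.
        return int(trace_id_hex, 16) & MASK_56, otel_tracestate
    # Rule 3: generate 56 random bits and record them under rv.
    r = secrets.randbits(56)
    updated = dict(otel_tracestate)
    updated["rv"] = format(r, "014x")
    return r, updated
```

With the trace ID `0af7651916cd43dd8448eb211c80319c` and the random flag set, the sketch yields `R = 0x48eb211c80319c`, the 7 rightmost bytes of the trace ID.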

Contributor

A Trace configured with sampling threshold T

Do you mean "A Tracer configured with sampling threshold T"? Note the typo?

It is also not entirely clear to me whether the better term is "A sampler".

Contributor Author

The way that R is calculated is meant to be part of #4162. We ran into a difficulty after the OTEP (235) merged which is discussed in the subsequent and unmerged (but widely approved by Sampling SIG members) open-telemetry/oteps#261. The Random flag isn't useful in all cases as a signal for this purpose, so we have instead a "presumption of trace randomness". Root tracers should either use the correct randomness or support user-supplied rv randomness values (optional). Non-root tracers should check for rv and otherwise assume a random TraceID.

Contributor Author

How about

Given a Sampler configured with a sampling threshold `T` and Context with randomness value `R` (typically, the 7 rightmost bytes of the trace ID), when `ShouldSample()` is called, it uses the expression `R >= T` to decide whether to return `RECORD_AND_SAMPLE` or `DROP`. 

* If randomness value (R) is greater or equal to the rejection threshold (T), meaning when (R >= T), return `RECORD_AND_SAMPLE`, otherwise, return `DROP`.
* When (R >= T), the OpenTelemetry TraceState SHOULD be modified to include the key-value `th:T` for rejection threshold value (T), as specified for the [OpenTelemetry TraceState `th` sub-key][TRACESTATEHANDLING].
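
As a rough illustration of the proposed wording, the decision could look like the Python sketch below. The function name `should_sample`, the string return values, and the dict model of the OpenTelemetry tracestate are assumptions made for this example, not SDK API. Removing `th` on `DROP` follows the remark elsewhere in this thread that the `th` value is cleared when a span is dropped.

```python
def should_sample(r, t, otel_tracestate):
    """Compare randomness R with rejection threshold T (both 56-bit ints).

    Returns (decision, tracestate). The tracestate is a plain dict of
    OpenTelemetry sub-keys, a hypothetical representation for this sketch.
    """
    updated = dict(otel_tracestate)
    if r >= t:
        # Encode T as up to 14 hex digits with trailing zeros removed.
        updated["th"] = format(t, "014x").rstrip("0") or "0"
        return "RECORD_AND_SAMPLE", updated
    # Dropped spans carry no threshold.
    updated.pop("th", None)
    return "DROP", updated
```

For a 25% sampler, `T = 0xc0000000000000`, so a trace with `R = 0xd0000000000000` is kept with `th:c`, while `R = 0x10000000000000` is dropped.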


### Sampling Probability

Sampling probability is the likelihood that a span will be *kept*. Each participant can choose a different sampling probability for each span. For example, if the sampling probability is 0.25, around 25% of the spans will be kept.
Member

I'm starting to think that you mean "tracer" here instead of "participant", potentially being "collector" when this is made not by a tracer. So, participant is "a tracer or a Collector" ?

Contributor Author

Yes. I've added ", which includes a set of tracers and collectors" to the first use of "participants". This term originated in https://github.com/open-telemetry/oteps/pull/261/files.


Sampling probability is valid in the range 2^-56 through 1. Note that the zero value is not defined and that "never" sampling is not a form of probability sampling.
Member

2^-56 might seem a bit random for the non-initiated: would it be worth saying that this is so that we have 7 bytes, matching the 7 bytes we get from the "randomness" (typically the 7 rightmost bytes from the trace ID)?

Contributor Author

I added "The value 56 appearing in this expression corresponds with 7 bytes of randomness (i.e., 56 bits) which are specified for W3C Trace Context Level 2 TraceIDs. ".
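
To make the range concrete: since a span is kept when `R >= T` and `R` is uniform over 56 bits, a probability `p` corresponds to the rejection threshold `T = (1 - p) * 2^56`. A minimal Python sketch of that mapping follows; the function and constant names are hypothetical, and rounding to the nearest integer is just one possible choice.

```python
MAX_THRESHOLD = 1 << 56  # 2**56, one past the largest 56-bit value


def probability_to_threshold(p):
    """Map a sampling probability in [2**-56, 1] to a 56-bit rejection threshold."""
    if not (2.0 ** -56 <= p <= 1.0):
        raise ValueError("sampling probability must be in [2**-56, 1]")
    # The kept fraction is (2**56 - T) / 2**56, so T = 2**56 - p * 2**56.
    return MAX_THRESHOLD - round(p * MAX_THRESHOLD)
```

Under this mapping, probability 1 gives `T = 0` (nothing is rejected), 0.25 gives `T = 0xc0000000000000`, and the smallest valid probability, 2^-56, gives the largest threshold, `2^56 - 1`. Probability zero would require `T = 2^56`, which does not fit in 56 bits.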

specification/trace/tracestate-probability-sampling.md Outdated Show resolved Hide resolved

This proposal supports two sources of randomness:

- **A custom source of randomness**: This proposal allows for a *random* (or pseudo-random) 56-bit value. We refer to this as `rv`. This can be generated and propagated through the `tracestate` header and the tracestate attribute in each span.
Member

I think I commented this elsewhere, but when should I, as a user, consider having a custom source of randomness?

Contributor Author

This is meant to be part of #4162 which focuses on randomness. It writes "To enable sampling in this and other situations where TraceIDs lack sufficient randomness,"

However, I tried to stay away from the advanced use-cases some might mention. If you have a reason to use independent trace IDs and still want them to sample consistently, this is what you'd choose.


If `R` >= `T`, *keep* the span, else *drop* the span.

`T` represents the maximum threshold that was applied in all previous consistent sampling stages. If the current sampling stage applies a greater threshold value than any stage before, it MUST update (increase) the threshold correspondingly.
Member

Perhaps this comes later, but the OTEP also mentions that this cannot be lowered, only increased.

Member

I just came to the part where it says that it can be lowered at head samplers, but not for downstream samplers. This statement here might need to be adjusted then.

Contributor Author

Probability can be lowered after the fact by re-sampling with a higher threshold, but not raised after the fact (by a lower threshold). I'll look for any inconsistencies.

Member

@jpkrohling jpkrohling left a comment

Other than my previous comments, LGTM!

specification/trace/tracestate-probability-sampling.md Outdated Show resolved Hide resolved


The original TraceIdRatioBased sampler specification gave a workaround for the underspecified behavior, that it was safe to use for root spans: "It is recommended to use this sampler algorithm only for root spans (in combination with [`ParentBased`](./sdk.md#parentbased)) because different language SDKs or even different versions of the same language SDKs may produce inconsistent results for the same input."

To avoid inconsistency during this transition, users SHOULD follow this guidance until all TraceIdRatioBased samplers used in a system have been upgraded to the modern `TraceIdRatioBased` specification based on W3C Trace Context Level 2 randomness. After all `TraceIdRatioBased` samplers have been upgraded, it is safe to use `TraceIdRatioBased` sampler without also using the `ParentBased` sampler.
Member

How can users assess that they reached this? Should we keep a table, showing from which versions which SDKs support the new spec?

Contributor Author

Another way they can do this is to wait for all spans to have the W3C trace Random flag set across a system. How does that sound?

specification/trace/tracestate-probability-sampling.md Outdated Show resolved Hide resolved

Threshold values are encoded with trailing zeros removed, which allows for variable precision. This can be accomplished by rounding, and there are several practical ways to do this with built-in string formatting libraries.
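
As a hedged illustration of the trailing-zero encoding, here is a small Python sketch with hypothetical helper names `encode_threshold` and `decode_threshold`. It assumes a 56-bit threshold written as up to 14 hexadecimal digits. Padding to 14 digits before stripping keeps any leading zeros significant. Rounding to a shorter precision is not shown.

```python
def encode_threshold(t):
    """Encode a 56-bit threshold as hex digits with trailing zeros removed."""
    if not 0 <= t < (1 << 56):
        raise ValueError("threshold must fit in 56 bits")
    # Pad to 14 hex digits first, then strip; a zero threshold becomes "0".
    return format(t, "014x").rstrip("0") or "0"


def decode_threshold(th):
    """Decode a th value by restoring the removed trailing zeros."""
    if not 1 <= len(th) <= 14:
        raise ValueError("th must have between 1 and 14 hex digits")
    return int(th.ljust(14, "0"), 16)
```

For example, the 25% threshold `0xc0000000000000` encodes as `c`, and `0x00100000000000` encodes as `001`.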

With up to 56 bits of precision available, implementations that use built-in floating point number support will be limited by the precision of the underlying number support. If the language supports IEEE 754-2008-standard hexadecimal floating point, for example in Golang,
Member

The last statement sounds a bit strange.

Contributor Author

Yes. Fixed:

One way to encode thresholds uses the IEEE 754-2008-standard hexadecimal floating point representation as a simple solution.  For example, in Golang,

A downstream sampler, in contrast, may output a given ended Span with a *modified* trace state, complying with the following rules:

- If the chosen sampling probability is 1, the sampler MUST NOT modify any existing `th`, nor set any `th`.
- Otherwise, the chosen sampling probability is in `(0, 1)`. In this case the sampler MUST output the span with a `th` equal to `max(input th, chosen th)`. In other words, `th` MUST NOT be decreased (as it is not possible to retroactively adjust an earlier stage's sampling probability), and it MUST be increased if a lower sampling probability was used. This case represents the common case where a downstream sampler is reducing span throughput in the system.
Member

Suggested change
- Otherwise, the chosen sampling probability is in `(0, 1)`. In this case the sampler MUST output the span with a `th` equal to `max(input th, chosen th)`. In other words, `th` MUST NOT be decreased (as it is not possible to retroactively adjust an earlier stage's sampling probability), and it MUST be increased if a lower sampling probability was used. This case represents the common case where a downstream sampler is reducing span throughput in the system.
- Otherwise, the chosen sampling probability is in `[0, 1)`. In this case the sampler MUST output the span with a `th` equal to `max(input th, chosen th)`. In other words, `th` MUST NOT be decreased (as it is not possible to retroactively adjust an earlier stage's sampling probability), and it MUST be increased if a lower sampling probability was used. This case represents the common case where a downstream sampler is reducing span throughput in the system.

Contributor Author

We can't represent 0-probability sampling with a threshold. The corresponding threshold is out-of-range on purpose. If you want 0-probability sampling, you can simply not export a span, or you can export a span w/o the th value set, which says "unknown sampling probability".
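
The downstream rule under discussion, `max(input th, chosen th)`, might be sketched in Python as below. This is an illustration only: `downstream_th` is a hypothetical name, thresholds are modeled as 56-bit integers, and the case of a span arriving without any `th` is deliberately left out. Probability 0 is rejected because, as noted above, its threshold would be out of range.

```python
MAX_THRESHOLD = 1 << 56  # 2**56


def downstream_th(input_th, chosen_probability):
    """Return the threshold a downstream sampler would emit for a kept span."""
    if chosen_probability == 1:
        # Probability 1: leave any existing th untouched.
        return input_th
    if not (2.0 ** -56 <= chosen_probability < 1):
        raise ValueError("probability must be in [2**-56, 1]")
    chosen_th = MAX_THRESHOLD - round(chosen_probability * MAX_THRESHOLD)
    # th is never decreased; it increases when this stage samples more aggressively.
    return max(input_th, chosen_th)
```

A span that arrives with the 50% threshold `0x80000000000000` and passes a 25% downstream sampler leaves with `0xc0000000000000`; the reverse order leaves the 25% threshold unchanged.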

jmacd and others added 3 commits September 25, 2024 15:28
Co-authored-by: Juraci Paixão Kröhling <juraci.github@kroehling.de>
@jmacd jmacd requested review from a team as code owners September 25, 2024 22:29

specification/trace/tracestate-probability-sampling.md Outdated Show resolved Hide resolved
specification/trace/tracestate-probability-sampling.md Outdated Show resolved Hide resolved
First, a consistent probability `Sampler` may choose its own sampling rate. The higher the chosen sampling rate, the lower the rejection threshold (T). It MAY select any value of T. If a valid `SpanContext` is provided in the call to `ShouldSample` (indicating that the span being created will be a child span), there are two possibilities:

- **The child span chooses a T greater than the parent span's T**: The parent span may be *kept* but it is possible that its child, the current span, may be dropped because of the lower sampling rate. At the same time, in the case where the decision for the child span is to *keep* it, the decision for the parent span would have also been to *keep* (due to our consistent sampling approach) since the parent's sampling rate is greater than the child's sampling rate.
- **The child span chooses a T less than or equal to the parent span's T**: The parent span might have been *dropped* but it is possible that its child, the current span, may be *kept* because of the higher sampling rate. At the same time, in case where the parent span is *kept*, the child span would be *kept* as well (due to our consistent sampling approach) since the child's sampling rate is greater than the parent's sampling rate.
@PeterF778 PeterF778 Sep 27, 2024

The phrase "parent span might have been dropped" is not correct in this context. According to our specs, if the parent span was dropped, there wouldn't be parent's T-value available (we clear th value when dropping a span). I suggest saying "Downstream samplers may decide to drop the parent span".

Contributor Author

I read this section and it has me a bit confused. I propose to cut a lot of text, since I'm not sure it was answering real questions.

```python
h = hex(tvalue).rstrip('0')
# remove leading 0x
tv = 'tv='+h[2:]
```

Nit: wouldn't this code result in an empty string if the threshold is zero?

Contributor Author

Yes. I propose

```python
if tvalue == 0:
  add_otel_trace_state('tv:0')
else:
  h = hex(tvalue).rstrip('0')
  # remove leading 0x
  add_otel_trace_state('tv:'+h[2:])
```

Development

Successfully merging this pull request may close these issues.

Review approach & specify algorithm for TraceIdRatioBasedSampler (ProbabilitySampler)
8 participants