[feat] Allow additional sampling hooks #115

toumorokoshi · 2020-06-11T17:01:30Z

Hi,

This OTEP tries to consolidate a few discussions around sampling and deferred span creation. A few related issues have been open in the spec repository for a while, so I thought it would be good to raise this as an OTEP.


yurishkuro · 2020-06-26T16:02:42Z

text/trace/115-add-sample-to-span-end.md

+
+shouldRetry is only valid when paired with a RECORD or RECORD_AND_SAMPLED result.
+A NOT_RECORD result results in no further calls to the sampler, regardless of
+the shouldRetry value.


This is the opposite of how Jaeger samplers behave: there, a YES decision is final, while a NO decision may be retryable.

toumorokoshi · 2020-07-22

Yes, this differs a bit from Jaeger. The intention was to enable the SDK to decide how to handle "MAYBE" decisions (whether to consider them a Yes or a No), as well as to allow the SDK / Sampler to decide whether a Yes or No is final.

More rationale here: #115 (comment)
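For illustration, the rule in the diff above can be sketched as follows. This is a minimal sketch, not the OTEP's API: `Decision`, `SamplingResult`, `should_retry`, and `needs_another_sampler_call` are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    NOT_RECORD = 0
    RECORD = 1
    RECORD_AND_SAMPLED = 2


@dataclass(frozen=True)
class SamplingResult:
    # Hypothetical result type: a decision plus a retry flag.
    decision: Decision
    should_retry: bool = False


def needs_another_sampler_call(result: SamplingResult) -> bool:
    """Return True if the SDK should invoke the sampler at a later hook."""
    # NOT_RECORD is final: should_retry is ignored for it.
    if result.decision is Decision.NOT_RECORD:
        return False
    # For RECORD and RECORD_AND_SAMPLED the flag is honored.
    return result.should_retry
```

Under this reading, a sampler that wants a retryable "no" has to express it as RECORD with `should_retry=True`, which is the point of contention in the thread.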


toumorokoshi · 2020-08-11T03:58:49Z

@jmacd @yurishkuro I'd like to do whatever I can to help move this forward. I'm itching to do some other OT stuff but don't want to leave work unfinished.

Would you mind giving more feedback and calling out what you feel are blockers, or approving? I feel your approval would go a long way toward the viability of this OTEP.

yurishkuro · 2020-08-11T22:01:11Z

@toumorokoshi I like the overall direction, but my concern is with the handling of what you called "MAYBE" state. Specifically, per https://github.com/open-telemetry/oteps/pull/115/files#r446273010, I would not be able to implement Jaeger SDK behavior, because this OTEP takes the opposite approach, without conclusive arguments imo. What if we made this aspect configurable?

> shouldRetry is only valid when paired with a RECORD or RECORD_AND_SAMPLED result.
> A NOT_RECORD result results in no further calls to the sampler, regardless of
> the shouldRetry value.

toumorokoshi · 2020-08-12T16:07:25Z

> @toumorokoshi I like the overall direction, but my concern is with the handling of what you called "MAYBE" state. Specifically, per https://github.com/open-telemetry/oteps/pull/115/files#r446273010, I would not be able to implement Jaeger SDK behavior, because this OTEP takes the opposite approach, without conclusive arguments imo. What if we made this aspect configurable?
>
> > shouldRetry is only valid when paired with a RECORD or RECORD_AND_SAMPLED result.
> > A NOT_RECORD result results in no further calls to the sampler, regardless of
> > the shouldRetry value.

Thanks @yurishkuro for looking!

I think I personally am ambivalent about the concept of recording / not recording, but the rationale for this behavior was to allow the usage of noop spans if the initial decision is to not record. I'm having trouble finding the original ticket; I'll see if I can find it.

Looking through the Jaeger code linked, I don't think the concept of "record" exists: a span is never replaced with a dummy version, it is always recording values.

If that is the case, I think one could just not use the "NOT_RECORD" state at all in a sampler to achieve a Jaeger client-like behavior. The Jaeger-style sampler would return one of the following decisions:

"RECORD" -> sampled: false, retryable: true
"RECORD_AND_SAMPLED" -> sampled: true, retryable: false

Alternative: remove NOT_RECORD state

Another option is to remove the concept of a NOT_RECORD state altogether and, by extension, not inject noop spans at all. I personally feel this is a good idea for other reasons, as it would avoid:

- SDK users having to reason about edge cases where attributes are not recorded or stored
- inconsistent performance of functions depending on the recording decision

Either way I believe Jaeger-client style behavior is achievable.
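A Jaeger-style sampler of the kind described above could be sketched roughly as follows. This is only an illustration under assumptions: the hook names `should_sample` and `on_set_attribute`, the `SamplingResult` tuple, and the trigger-attribute policy are hypothetical and not part of any SDK.

```python
from enum import Enum
from typing import NamedTuple


class Decision(Enum):
    RECORD = "RECORD"
    RECORD_AND_SAMPLED = "RECORD_AND_SAMPLED"


class SamplingResult(NamedTuple):
    decision: Decision
    should_retry: bool


# A "no for now" answer: keep recording, ask again at the next hook.
NOT_YET = SamplingResult(Decision.RECORD, should_retry=True)
# A final "yes": sampled, no further sampler calls needed.
YES = SamplingResult(Decision.RECORD_AND_SAMPLED, should_retry=False)


class JaegerStyleSampler:
    """Hypothetical sampler that never returns NOT_RECORD."""

    def __init__(self, trigger_key: str) -> None:
        self.trigger_key = trigger_key

    def should_sample(self, span_name: str) -> SamplingResult:
        # Initial decision: not sampled, but retryable.
        return NOT_YET

    def on_set_attribute(self, key: str, value: object) -> SamplingResult:
        # Flip to a final yes once the trigger attribute shows up.
        return YES if key == self.trigger_key else NOT_YET
```

The trade-off of this approach is that every span keeps recording until a final yes arrives, so it never benefits from a cheap noop span.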

yurishkuro · 2020-08-15T00:34:07Z

The reason Jaeger does not use true no-op spans is that, regardless of the sampling decision for trace collection, users usually still want a unique trace ID for every trace, which can be used for other correlation purposes (e.g. tagging logs). I don't see how that is possible with no-op spans unless one still allocates unique instances of no-op spans pointing to distinct SpanContexts (which is not how no-op works in OpenTracing: there, the no-op span is a singleton that incurs no allocation cost). Jaeger SDKs do support an effective no-op, because when the span is not sampled, all its write operations are short-circuited to do nothing (this requires an atomic read of a flag, so it is slightly more expensive than a real no-op span). Strictly speaking, it's possible to implement another span type that has true no-op functions, as long as the decision is known at span creation.
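The "effective no-op" idea can be illustrated with a minimal sketch. This is not Jaeger's actual code: the `Span` class, its `sampled` flag, and the ID generation are assumptions made for the example, and a real SDK would read the flag atomically.

```python
import secrets


class Span:
    """Hypothetical span that always has an identity but may drop writes."""

    def __init__(self, sampled: bool) -> None:
        # Every span gets a unique trace ID, sampled or not,
        # so logs can still be correlated with it.
        self.trace_id = secrets.token_hex(16)
        self.sampled = sampled
        self.attributes: dict = {}

    def set_attribute(self, key: str, value: object) -> None:
        # Short-circuit: writes are dropped while the span is unsampled.
        if not self.sampled:
            return
        self.attributes[key] = value
```

Unlike a singleton no-op span, each instance here costs an allocation, which is the price of keeping a distinct trace ID per trace.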

> Looking through the Jaeger code linked, I don't think the concept of "record" exists: A span is never replaced with a dummy version, it is always recording values.

That's not completely accurate: Jaeger SDKs have another hidden sampling state called "finalized", which is set to true as soon as the sampler returns a non-retryable decision. This makes the most common cases and most common samplers much more efficient, since once sampling is finalized the write ops (e.g. span.setAttribute) no longer need to be echoed into sampler callbacks. I think it's kind of related to RECORD mode in OTel:

| Jaeger: `decision.sample` | Jaeger: `decision.retryable` | OTEL decision |
| --- | --- | --- |
| false | true | RECORD |
| false | false | NOT_RECORD |
| true | false | RECORD_AND_SAMPLED |

However, the additional differences are:

- in Jaeger the sampled state is shared between all spans of a given trace that are not finished in the given process (which allows sampling decisions based on tags on child spans to affect the parent span)
- the "finalized" state is explicitly set to true if the parent span context was remote, i.e. downstream services don't attempt to make changes to sampling (which means it's possible that parent A starts non-sampled, calls child B that does not record its data, then A gets an attribute that trips sampling to on, and child C will be recording/sampling data).
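The two differences above can be sketched as follows. This is a rough illustration, not Jaeger's implementation: `TraceSamplingState`, the `remote_parent_sampled` parameter, and the "error" attribute acting as the sampling trigger are all assumptions for the example.

```python
from typing import Optional


class TraceSamplingState:
    """Hypothetical per-trace state shared by all local spans of a trace."""

    def __init__(self, sampled: bool = False, finalized: bool = False) -> None:
        self.sampled = sampled
        self.finalized = finalized


class Span:
    def __init__(self, parent: Optional["Span"] = None,
                 remote_parent_sampled: Optional[bool] = None) -> None:
        if parent is not None:
            # Local child: share the parent's state object.
            self.state = parent.state
        elif remote_parent_sampled is not None:
            # Remote parent: adopt its decision and never revisit it.
            self.state = TraceSamplingState(remote_parent_sampled, finalized=True)
        else:
            # Local root: undecided and still retryable.
            self.state = TraceSamplingState()

    def set_attribute(self, key: str, value: object) -> None:
        # Stand-in for a sampler callback: an "error" tag trips sampling
        # unless the decision is already final.
        if not self.state.finalized and key == "error":
            self.state.sampled = True
            self.state.finalized = True
```

Because local spans share one state object, an attribute set on a child flips the decision for its parent too, while a span with a remote parent keeps whatever decision it inherited.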

Is it worth including these in the OTEP? The main objective of that design in Jaeger was graceful fallback onto "normal" behavior where you only have simplistic samplers saying yes/no and making it efficient (i.e. short-circuiting the rest of the callbacks).

toumorokoshi · 2020-08-18T05:59:23Z

@yurishkuro thanks for the thorough reply!

I believe the Jaeger behavior as you described it is pretty much how OpenTelemetry works today:

- effective no-op spans: writes are short-circuited, but a custom SpanContext is passed to allow a unique traceId.
- the choice of a no-op span is based on the initial sampling choice (which is the only path that will cause an OTEL decision to be "NOT_RECORD")

I think I understand your concern a bit better now: the sticking point is the ability of the Sampler to return a "NOT_RECORD" result. If you want a "no" decision to be retryable, then I would just implement the Sampler to never return NOT_RECORD, and return only "RECORD" or "RECORD_AND_SAMPLED".

But looking at the mechanics, I feel like this could all be simplified by having "record" be a calculation from the sample decision and the retryable decision. So, slightly modifying your table, we could have is_recording be a function:

| Jaeger: decision.sample / OTEL sampled | Jaeger: decision.retryable / OTEL shouldRetry | OTEL is_recording() return value |
| --- | --- | --- |
| false | true | true |
| false | false | false |
| true | false | true |
| true | true | true |

This would be a modification of the "SamplingResult" struct in the SDK. This would also still enable the mostly no-op spans that you described.
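As a minimal sketch of that table, assuming a hypothetical `SamplingResult` with `sampled` and `should_retry` fields (the field and method names are illustrative, not the SDK's):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplingResult:
    sampled: bool
    should_retry: bool

    def is_recording(self) -> bool:
        # Record if the span is sampled, or if a later hook may still sample it.
        return self.sampled or self.should_retry
```

The only combination that stops recording is a final, non-retryable "no", which corresponds to the NOT_RECORD row in the earlier table.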

> Is it worth including these in the OTEP? The main objective of that design in Jaeger was graceful fallback onto "normal" behavior where you only have simplistic samplers saying yes/no and making it efficient (i.e. short-circuiting the rest of the callbacks).

I think the two additional behaviors you called out are really good additions to the spec, but can be tackled outside of this OTEP (which is just advocating for additional sampling hooks).

I'm happy to keep filing follow-up PRs and discussions to get all the behavioral issues smoothed out. Unless you feel like this is a blocker for your own approval, I'd like to commit to following up separately on those instead.

Regarding the risks if we go that route:

- the shared sampling choice in a trace is not possible (or at least not defined in the current spec). To enable Jaeger-client-like behavior, this would have to be addressed.
- honoring the "finalized" state will be possible with the APIs exposed, since they will consume the span and thus the SpanContext, i.e. this behavior could be replicated by a custom sampler. To make it possible in the SDK as well, I think this would be a matter of adding a flag to the built-in samplers to honor the parent's sampling choices.


lzchen · 2020-08-18T23:54:00Z

text/trace/115-add-sample-to-span-end.md

+
+- Pro: enables setting the recording decision early, which can skip additional
+       processing for instrumentations that use the isRecording field.
+- Pro: enables setting the recording decision early, which can be used


How can the decision to use a cheaper Span implementation (such as noop span) upon a hook decision be made without having a span created already? Unless I am misunderstanding something, if we aren't using deferred span creation, wouldn't a "real" span be created if the initial sampling decision (upon starting the span) was set to true?

e.g.

```python
# Pseudocode: the elided parts (...) are left as in the original comment.
class CustomSampler():
    ...
    def should_sample(self, *args, **kwargs):
        return True
    ...
    def onUpdateName(self, span, name):
        return False

...
tracer_provider = TracerProvider(sampler=CustomSampler())
span = tracer.start_as_current_span("oldname")  # always sampled, so a real Span is created
span.updateName("new name")  # what happens here?
```

The real Span is already created, so how do we "save on processing"? How is this "setting the recording decision early", compared to setting it on creation?

tedsuo · 2023-02-06T17:53:53Z

@toumorokoshi I'm closing this since the Sampler API has changed significantly since this OTEP was opened. Please update or open a new PR if you are still interested in this.

toumorokoshi added 2 commits June 11, 2020 09:58

Adding proposal to move sample to span end

ab72737

renaming as add sample to span end.

5942c60

toumorokoshi requested review from arminru, bogdandrutu, c24t, carlosalberto, iredelmeier, jmacd, reyang, SergeyKanzhelev, tedsuo, tigrannajaryan and yurishkuro as code owners June 11, 2020 17:01

toumorokoshi added 2 commits June 11, 2020 10:03

fixing markdown lint

be0a7db

renaming to PR number

762d7d2

yurishkuro reviewed Jun 11, 2020


yurishkuro reviewed Jun 11, 2020


toumorokoshi mentioned this pull request Jun 11, 2020

Allow samplers to be called during different moments in the Span lifetime open-telemetry/opentelemetry-specification#307

Open

yurishkuro reviewed Jun 11, 2020


yurishkuro changed the title ~~Feature/add sample to span end~~ [feat] Allow sampling at span end Jun 11, 2020

jmacd mentioned this pull request Jun 11, 2020

Proposal for sampling.priority #107

Closed

Addressing feedback

9302221

- making idempotence one option to ensure consistent sampling results and not skew statistical results - adding missing open discussion and pro of sample decisions propagating to other services.

toumorokoshi mentioned this pull request Jun 15, 2020

starlette instrumentation open-telemetry/opentelemetry-python#777

Merged

Adding shouldRetry, incremental sample hooks

940863c

yurishkuro reviewed Jun 26, 2020


yurishkuro reviewed Jun 26, 2020


Update text/trace/115-add-sample-to-span-end.md

3d8ac4e

Co-authored-by: Yuri Shkuro <yurishkuro@users.noreply.github.com>

toumorokoshi requested review from a team July 22, 2020 15:26

yurishkuro mentioned this pull request Jul 31, 2020

Remove warnings about Span.UpdateName open-telemetry/opentelemetry-specification#754

Merged

toumorokoshi changed the title ~~[feat] Allow sampling at span end~~ [feat] Allow additional sampling hooks Aug 11, 2020

Merge branch 'master' into feature/add-sample-to-span-end

fa165d1

yurishkuro mentioned this pull request Aug 14, 2020

Add Initial OpenTracing compatibility requirements open-telemetry/opentelemetry-specification#768

Closed

yurishkuro self-assigned this Aug 15, 2020

lzchen reviewed Aug 18, 2020


lzchen reviewed Aug 18, 2020


lzchen reviewed Aug 18, 2020


Oberon00 mentioned this pull request Oct 15, 2020

Missing spec for Span ID creation for dropped/record-only spans open-telemetry/opentelemetry-specification#1060

Closed

Base automatically changed from master to main January 27, 2021 20:37

evantorrie mentioned this pull request Mar 5, 2021

Remove WithRecord() option from trace.SpanOption when starting a span open-telemetry/opentelemetry-go#1660

Merged

MadLittleMods mentioned this pull request Aug 2, 2022

Draft: Migrate to OpenTelemetry tracing matrix-org/synapse#13400

Closed

17 tasks

plantfansam mentioned this pull request Nov 2, 2022

Best way to to support traceresponse and load balancer deferred sampling? open-telemetry/opentelemetry-specification#2914

Closed

tedsuo added the triaged label Feb 6, 2023

tedsuo closed this Feb 6, 2023
