-
Notifications
You must be signed in to change notification settings - Fork 164
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[feat] Allow additional sampling hooks #115
[feat] Allow additional sampling hooks #115
Conversation
- making idempotence one option to ensure consistent sampling results and not skew statistical results - adding missing open discussion and pro of sample decisions propagating to other services.
|
||
shouldRetry is only valid when paired with a RECORD or RECORD_AND_SAMPLED result. | ||
A NOT_RECORD result results in no further calls to the sampler, regardless of | ||
the shouldRetry value. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the opposite of how Jaeger samplers behave - a YES decision is final, a NO decision may be retryable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is different a bit from Jaeger. The intention was to enable the SDK to make a decision around how to handle "MAYBE" decisions (whether to consider them a Yes or No), as well as allow the SDK / Sampler make the decision on whether a Yes or No is final.
More rationalle here: #115 (comment)
Co-authored-by: Yuri Shkuro <yurishkuro@users.noreply.github.com>
@jmacd @yurishkuro I'd like to do whatever I can to help move this forward. I'm itching to do some other OT stuff but don't want to leave work unfinished. Would you mind giving more feedback and calling out what you feel like are blockers, or approving? I feel your approval would go a long way in the viability of this OTEP. |
@toumorokoshi I like the overall direction, but my concern is with the handling of what you called "MAYBE" state. Specifically, per https://github.com/open-telemetry/oteps/pull/115/files#r446273010, I would not be able to implement Jaeger SDK behavior, because this OTEP takes the opposite approach, without conclusive arguments imo. What if we made this aspect configurable?
|
Thanks @yurishkuro for looking! I think I personally am ambivalent about the concept of recording / not recording, but the rationalle for this behavior was to allow the usage of noop spans if the initial decision is to not record. I'm having trouble finding the original ticket, I'll see if I can find it. Looking through the Jaeger code linked, I don't think the concept of "record" exists: A span is never replaced with a dummy version, it is always recording values. If that is the case, I think one could just not use the "NOT_RECORD" state at all in a sampler to achieve a Jaeger client-like behavior. The Jaeger-style sampler would return on the following decisions:
Alternative: remove NOT_RECORD stateAn option is to also just remove the concept of a NOT_RECORD state, and by extension not inject noop spans at all. I personally feel this is a good idea for other reasons:
Either way I believe Jaeger-client style behavior is achievable. |
The reason Jaeger does not use true no-op spans is because regardless of the sampling decision for trace collection users usually still want unique trace ID for every trace, which can be used for other correlation purposes (e.g. tagging logs). I don't see how that's possible to achieve with no-op spans unless one still allocates unique instances of no-op spans pointing to distinct
That's not completely accurate - Jaeger SDKs have another hidden sampling state called "finalized", which is set to true as soon as the sampler returns non-retryable decision, which makes the most common cases and most common samplers much more efficient, since a finalized sampling means the write ops (e.g. span.setAttribute) do not need to be echoed into sampler callbacks anymore. I think it's kind of related to RECORD mode in OTel:
However, the additional differences are:
Is it worth including these in the OTEP? The main objective of that design in Jaeger was graceful fallback onto "normal" behavior where you only have simplistic samplers saying yes/no and making it efficient (i.e. short-circuiting the rest of the callbacks). |
@yurishkuro thanks for the thorough reply! I believe the Jaeger behavior as you described it is pretty much how OpenTelemetry works today:
I think I understand your concern a bit better now: the sticking point is the ability for the Sampler to return back a "NOT_RECORD" result. If you want a no decision to be retryble, then I would just implement the Sampler to never return NOT_RECORD, so either "RECORD" or "RECORD_AND_SAMPLED". But looking at the mechanics, I feel like this could all be simplified by having "record" by a calculation of the sample decision and the retryable decision. So slightly modifying your table, we could have is_recording be a function:
This would be a modification of the "SamplingResult" struct in the SDK. This would also still enable the mostly no-op spans that you described.
I think the two additional behaviors you called out are really good additions to the spec, but can be tackled outside of this OTEP (which is just advocating for additional sampling hooks). I'm happy to keep filing follow-up PRs and discussions to get all the behavioral issues smoothed out. Unless you feel like this is a blocker for your own approval, I'd like to commit to following up separately on those instead. Regarding the risks if we go that route:
|
|
||
- Pro: enables setting the recording decision early, which can skip additional | ||
processing for instrumentations that use the isRecording field. | ||
- Pro: enables setting the recording decision early, which can be used |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How can the decision to use a cheaper Span implementation (such as noop span) upon a hook decision be made without having a span created already? Unless I am misunderstanding something, if we aren't using deferred span creation, wouldn't a "real" span be created if the initial sampling decision (upon starting the span) was set to true?
e.g.
class CustomSampler():
...
def should_sample(...):
return True
...
def onUpdateName(span, name):
return False
...
tracer_provider = TracerProvider(sampler=CustomSampler())
span = tracer.start_as_current_span("oldname") -> always sampled, so real Span is created
span.updateName("new name") -> what happens here? The real Span is already created, so how do we "save on processing"? How is this "setting the recording decision early", compared to setting it on creation?
@toumorokoshi I'm closing this since the Sampler API has changed significantly since this OTEP was opened. Please update or open a new PR if you are still interested in this. |
Hi,
This OTEP is trying to consolidate a few discussions around sampling and deferred span creation. There's a few issues that have been in the spec issues for a while, I thought it would be good to maybe raise this as an OTEP.