How to achieve consistent sampling across linked traces? #2918
Comments
cc @jmacd |
https://github.com/open-telemetry/opentelemetry-dotnet/pull/1851/files It was in .NET originally, but was removed because it was not something the spec covered at that time. |
@kalyanaj Thank you for posing these questions. I would like to separate questions about span Links being created after the start of a span into a separate topic which may interest @pyohannes, below. About
If there is a producer publishing a message to a topic, it cannot know how many consumers are subscribed to the topic and are processing the message. In case some of the consumer traces aren't sampled, I don't see how directions on links would help. |
I agree with @jmacd here. Assuming we deal with a relatively high number of links, the question is which approach would maximize the number of complete groups of traces, since full consistency is impossible to achieve. This perspective also helps with the links-after-start discussion. It's up to the sampler to maximize consistency, but since full consistency is impossible anyway, we should allow adding links after start (with or without a direction). |
@pyohannes I apologize for the confusion--The idea didn't fully address the problem, as I realized from a discussion we had about this issue in today's Sampling SIG. I was trying to establish that Sampling as we know it, where new spans make a sampling decision somehow dependent on their parent context and the span contexts they are linked with at creation, is the reason why we do not support creating Span links after span start. The idea is that because a Sampler has access to the sampled flag of its parent context and other preceding (linked-to) contexts, then we have these capabilities:
The reason we prohibit creating span links after creation is that it breaks one or both of these. What we have is a situation where a link between spans must be recorded by the later-in-time span; the only way we have to control recording a span is in the sampling decision, therefore span links must be present at the time of sampling.

The creation of a span link after span start breaks the two requirements as follows. If the linked-to context is sampled, then the only way to make it complete is to record the linked-from span. If the linked-from span is already not being recorded because the sampling decision has passed, it becomes impossible to record the link. We have unverifiable incompleteness because the linked-to span has no awareness of the linked-from span, which was not recorded. The problem scenario, to be concrete, is a call to add a span link when the linked-to context is sampled and the linked-from span is a no-op span. We have nowhere to record the link.

The OTel Sampler API currently returns one of four states, described here: https://opentelemetry.io/docs/reference/specification/trace/sdk/#recording-sampled-reaction-table. To address both @kalyanaj's original question and support span links after creation, we need a new Span reaction that is a "conditionally recorded" span. A conditionally recorded span is one that is not itself sampled and is being held in memory, recording events and potential after-creation span links. When an after-creation span link occurs linking to a sampled span context, the conditionally recorded span would change states, entering a new state "exported-unsampled" where the span is passed to the exporter despite being unsampled. (If the span was also being probability sampled, the exported-unsampled spans MUST be assigned zero adjusted count.)

Then, to configure a Sampler that would ensure consistent, complete spans including their span links:
I hope this sketch is more complete! I didn't actually add a direction attribute to Links, I just require them to be recorded when either side is sampled, for completeness. A new "exported-unsampled" Sampler decision is required even without support for adding span links after creation (to @kalyanaj's point). A new "conditionally-recorded" Sampler decision would be required to support span links after creation (to @pyohannes's feature request). |
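For readers who want something concrete, the "conditionally recorded" to "exported-unsampled" transition described above could be sketched roughly as follows. This is only an illustration under assumptions: `SpanState`, `ConditionalSpan`, `add_link`, `end`, and the `export` callback are hypothetical names, not part of the OpenTelemetry API or any SDK.

```python
from enum import Enum


class SpanState(Enum):
    # Hypothetical state names taken from the comment above; not OTel API.
    CONDITIONALLY_RECORDED = "conditionally-recorded"
    EXPORTED_UNSAMPLED = "exported-unsampled"


class ConditionalSpan:
    """An unsampled span held in memory in case a late link needs recording."""

    def __init__(self, name):
        self.name = name
        self.state = SpanState.CONDITIONALLY_RECORDED
        self.links = []
        self.adjusted_count = None  # only set once the span will be exported

    def add_link(self, linked_trace_id, linked_span_id, linked_is_sampled):
        # Always remember the link; it is only exported if the state changes.
        self.links.append((linked_trace_id, linked_span_id))
        if linked_is_sampled:
            # A link to a sampled context forces export of this span,
            # with zero adjusted count because it was not itself sampled.
            self.state = SpanState.EXPORTED_UNSAMPLED
            self.adjusted_count = 0

    def end(self, export):
        # Pass the span to the exporter only if a sampled link was seen.
        if self.state is SpanState.EXPORTED_UNSAMPLED:
            export(self)
            return True
        return False
```

The span stays in the conditional state until it ends, so the cost of this approach is holding unsampled spans in memory for their whole lifetime.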
@jmacd btw, some Jaeger SDKs utilized a state similar to "conditionally-recorded" (we called it deferred sampling), to support sampling based on span attributes that become available only after span start. It's a bit of a kludge, because the state only makes sense until a child span is created, at which point the sampling decision needs to be finalized. I am, however, not convinced that sampling considerations are the deciding factor for allowing adding links post creation. The exact same arguments could be made for disallowing span attributes after span creation, yet we allow that. Just because sampling questions become more difficult with post-creation links, it does not negate the fact that there are use cases that can benefit from late links, especially in scenarios that sample everything (e.g. CI or other devexp workflows). |
This is true. However, I think ensuring completeness across linked traces makes you lose another crucial capability: effectively enforcing a fixed sampling rate. If you make a sampling decision based on links to two upstream spans, both sampled with a probability of 10%, you're sampling the span with the probability that at least one of the two upstream spans was sampled. This probability is higher than 10%, and the more links you have, the closer it gets to 100%. In cases where there is heavy batching and where there are several layers of links, the actual sampling volume could end up being much higher than what one might expect based on the probability decision at the root. While this is not to be seen as an argument for adding span links after span creation, I think it illustrates that probably not all capabilities we intend to provide can be fully utilized at the same time, and that there might be trade-offs based on usage scenarios. |
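To put rough numbers on this, here is a minimal sketch. It assumes the upstream spans are sampled independently with the same probability `p` (which would not hold for spans of the same trace under consistent sampling) and that the linking span is sampled whenever at least one linked span is. The function name is made up for illustration.

```python
def linked_sampling_probability(p: float, num_links: int) -> float:
    """Probability that at least one of num_links upstream spans is sampled.

    Assumes independent sampling decisions with identical probability p,
    so the result is 1 - (1 - p) ** num_links.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if num_links < 0:
        raise ValueError("num_links must be non-negative")
    return 1.0 - (1.0 - p) ** num_links


if __name__ == "__main__":
    # Show how the effective rate grows with the number of links at p = 10%.
    for n in (1, 2, 10, 50):
        print(n, round(linked_sampling_probability(0.10, n), 3))
```

Under these assumptions, two links at 10% give an effective rate of 19%, and ten links give roughly 65%.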
I agree. We can't avoid the fundamentals of sampling. What we can do is provide new Sampler implementations that give users a choice. If users would like to record a span that is linked-to by others, they should be able to do so without causing entire other traces to be collected. If that capability will co-exist with what we have today, it means two new Sampler decision codes as I outlined above, one to say "maybe record this span, depending" and one to say "record an untraced span". |
@yurishkuro About "deferred sampling": thanks for explaining. Comparing the two span states that I described with the one from Jaeger, Jaeger's "deferred sampling" state is similar to, but not the same as, the one I called "conditionally recorded". A span could remain in the conditionally recorded state after its first child is created, up until span end, because at any moment a new span link could appear and cause the span to become "exported-unsampled". Using the Jaeger term "deferred" instead of "conditionally" would give us a complete list of span states:
So, it looks like three new states if you combine Jaeger's deferred sampling decision with the deferred exporting decision requested to support span links after start. For us to adopt this kind of support in OpenTelemetry will require prototypes, in case anyone is wondering what the next steps are. Interested parties should look at #2179, too. |
Filing this issue per our discussion in the Sampling SIG today.
What are you trying to achieve?
OpenTelemetry supports Span Links that can be used to model asynchronous scenarios or batched operations (fan-out/fan-in). I am looking to achieve some level of consistent (head-based) sampling of all the linked traces. If the sampling decision happens at an individual trace level, customers cannot understand the whole story of what happened to a request.
Example of links usage: One use-case is in a producer - consumer scenario where a producer span (say Trace T1 / Span S1) enqueues a job to a queue; let's say such jobs are processed by a consuming service asynchronously. Since the lifetimes of the producer and consumer are different, the consuming operation is modelled as a separate trace (T2 / S2) that links to T1 / S1 using span-links. If there's a way to do consistent sampling across links, then if T1 was sampled then T2 also should be sampled.
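To make the desired behaviour concrete, here is a rough sketch of the decision rule I have in mind. It does not use the real OpenTelemetry Sampler interface: `LinkedContext`, `should_sample_with_links`, and the `fallback` callback are hypothetical stand-ins that only show the logic.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class LinkedContext:
    """Stand-in for a linked span context; only the fields used here."""
    trace_id: int
    span_id: int
    sampled: bool


def should_sample_with_links(
    links: Sequence[LinkedContext],
    fallback: Callable[[], bool],
) -> bool:
    """Sample if any linked context is sampled; otherwise ask the fallback."""
    if any(link.sampled for link in links):
        return True
    # No sampled link: defer to whatever sampler is configured otherwise.
    return fallback()


if __name__ == "__main__":
    # T1 / S1 was sampled, so the consumer trace T2 linking to it is sampled.
    t1_s1 = LinkedContext(trace_id=1, span_id=1, sampled=True)
    print(should_sample_with_links([t1_s1], fallback=lambda: False))
```

With this rule, the consumer span S2 is sampled whenever S1 was, and otherwise falls back to the consumer's own sampling decision.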
What did you expect to see?
Guidance / samples / out-of-the-box sampler to help achieve the above. For example, something like:
Additional context.