-
Notifications
You must be signed in to change notification settings - Fork 894
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Should zero-duration spans be considered good/reasonable practice? #4123
Comments
Effectively, the only alternative to zero-duration spans that I can come up with is a new 'thing' (Option 3) that's part of tracing and is very similar to span, but does not have a duration or status. Note: So my preference would be to stick to spans as a general-purpose solution for this problem. |
I think calling the spans representing "Msg 1" and "Msg 2" zero-duration is underselling what's happening with them. These spans represent queueing a message to a local buffer for publishing. This action is probably approaching zero-duration, but is not actually zero. Therefore, I'd say that any disagreement on the basis of zero-duration is invalid because these actions do in fact have a duration. The alternatives add additional complexity to the data model and do not appear to meaningfully improve the modeling of these types of scenarios. I'm in favor of using spans to solve this problem. |
@jack-berg is right that the spans won't actually have a zero duration, but in many cases a very short duration. Maybe the discussion should be re-framed and not be about zero duration, but about a useful duration? In many cases in question, the (nano-seconds) duration of such spans will not be of interest, and the only reason the spans are created is to obtain a distinct context. The span is more or less a side-effect. The question should be whether we're fine with that or not. |
Adding more point raised in the spec discussion on 7/2:
/cc @tedsuo in case you want to add any additional feedback on this |
let's continue the discussion then.
Since there was a consensus during the call and within messaging SIG, I feel we have different understanding of the principles. Regardless of the above:
|
I would like to understand if the following pseudo code models the situation:
If that's the case, then this appears to be a very normal situation, except for the fact that the span durations are presumably very short, and we have little interest in them. Normally we'd avoid such tiny spans, but they are necessary in this case for building the graph correctly. "Zero duration spans" are usually brought up as a way to model something that isn't really a span. But if the proposal is to create ordinary spans that are just very tiny, that seems reasonable. I would suggest that we avoid referring to them as "zero duration" to avoid confusion. If we're talking about a different situation, do you mind providing some simple pseudo code to describe an example? I do still have concerns about the overhead of generating so many spans. I agree that all other approaches proposed would have similar overhead... is this a problem? I haven't tested it, I really don't know the answer. If the overhead is unacceptable, then it seems like we have a real modeling challenge in front of us. If the overhead is fine, maybe this is just ordinary instrumentation, with nothing special about it? That would be a relief. :) |
Wanted to reiterate a comment I made in the 7/16/24 spec SIG: The defining characteristics of the trace / span data model are: 1. spans have a start & end time (i.e. duration). 2. spans are arranged into a hierarchy where each hierarchy is a trace, and these hierarchies can be linked together into a larger graph through links. When considering whether we should model something with a span, we should reject it if the operation being modeled has no useful duration AND does not contribute anything meaningful to the trace graph. Put the other way, if an operation either has a meaningful duration OR adds meaningful value to the trace graph, then its valuable to model as a span. |
Thank Jack, it's a good definition and I completely agree. In the 7/16/24 meeting, we reviewed the model for batch processing spans. These spans do represent meaningful operations in a trace graph, according to Jack's definition above. So there is nothing incorrect about using spans. In fact, this is a very normal use of spans, except for the fact that so many need to be created in order to correctly model the trace. So really, the only concern is efficiency. And concerns about this were raised. Generating timestamps, random IDs, and other data structures have overhead. Making large numbers of individual API calls have overhead. My request would be for the Messaging SIG to determine how bad this overhead is, and whether some kind of bulk API for span creation and injection could possibly lower this overhead in a meaningful way. If measuring the duration of these spans creates significant overhead, I would be in favor of marking them as zero duration as part of bulk span creation. |
From the messaging SIG, we want to provide flexibility and allow to trace individual messages independently. Otherwise tracing may become not quite useful. We can ask implementations to make such behavior configurable, but we want to allow creating spans per message. As an owner of messaging SDK that does batching in multiple ways, I have to provide per-message tracing for my users.
|
@jmacd made a point that we could use sampling. There is essentially the same problem in tracing any high volume system – a high volume of concurrent HTTP requests requires sampling to address overhead. So maybe, once again, batch operations are a normal tracing situation with a normal solution. |
The sampling applies to message spans in the same way as to any other spans, so yes, it will sample most of them out. Sampling could in theory be used to suppress message span creation, but now that we have a better suppression mechanism (tracer.IsEnabled() - #4063) we could let users suppress per-message traces in this way by giving per-message tracer a different name.
Please let me know if per-message tracing with above explanation on sampling/suppression/other means to disable it could be considered as a "normal solution" or you're looking for something different. |
Creating a tracer per message feels extreme? Also, is there a way to create "disabled" tracers? I thought this was only for the case where the SDK is not installed.
|
We can create different tracers for different types of telemetry:
With #3867 and #4063 users can decide to turn tracer off. It's configured like this OpenTelemetrySdk.builder()
.setTracerProvider(
SdkTracerProvider.builder()
.addTracerConfiguratorCondition(nameEquals("per-message-tracer"), TracerConfig.disabled()) // disable tracer by name
.build())
..... See open-telemetry/opentelemetry-java#6375 for the details. There is a great summary in #3867 description on why suppression could be a better choice than sampling in general case. |
Ok I see what you mean. A "per-message" tracer to disable the entire scope, not a new tracer for every message 😅. That makes a lot of sense! |
I've created open-telemetry/semantic-conventions#1273 (blocker for messaging stability) to encourage instrumentations to allow disabling per-message tracing for batch publish. I'm going to close this issue, please reopen if necessary. |
Zero-duration spans are used by messaging semantic conventions when publishing messages in batches open-telemetry/semantic-conventions#1187
Messages need to have unique context to be traceable individually and the only way to achieve it today is by creating a span for each message and injecting corresponding context into the message.
There could be other cases like local/fire-and-forget/no-ack cases where context-propagation is necessary:
Based on discussion with @tedsuo, zero-duration spans are controversial.
I'd like to discuss alternatives and/or settle on the zero-duration spans as legit solution.
What do we need to capture for a message/event/work-item:
Alternatives to zero-duration-span:
publish
, as discussed in Messaging: per-message tracing when sending batches semantic-conventions#1187 )The text was updated successfully, but these errors were encountered: