Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should zero-duration spans be considered good/reasonable practice? #4123

Closed
lmolkova opened this issue Jul 1, 2024 · 15 comments
Closed

Should zero-duration spans be considered good/reasonable practice? #4123

lmolkova opened this issue Jul 1, 2024 · 15 comments
Labels
spec:trace Related to the specification/trace directory triage:deciding:community-feedback Open to community discussion. If the community can provide sufficient reasoning, it may be accepted

Comments

@lmolkova
Copy link
Contributor

lmolkova commented Jul 1, 2024

Zero-duration spans are used by messaging semantic conventions when publishing messages in batches open-telemetry/semantic-conventions#1187

image

Messages need to have unique context to be traceable individually and the only way to achieve it today is by creating a span for each message and injecting corresponding context into the message.

There could be other cases like local/fire-and-forget/no-ack cases where context-propagation is necessary:

  • in-memory queues where each work item should have a unique context
  • gRPC streaming where individual messages need to have trace-context propagated
  • general-purpose eventing (e.g. Akka)

Based on discussion with @tedsuo, zero-duration spans are controversial.

I'd like to discuss alternatives and/or settle on the zero-duration spans as legit solution.


What do we need to capture for a message/event/work-item:

  • timestamp
  • unique context (span-id)
  • parent context
  • name (for visualization/identification purposes)
  • attributes (in some cases)

Alternatives to zero-duration-span:

  1. Event
    • ❌ does not have unique context
    • span events are (hopefully) going away, logs are not part of tracing signal
  2. Pure span-context + link (from publish, as discussed in Messaging: per-message tracing when sending batches semantic-conventions#1187 )
    • ❌ does not have means to capture parent context (e.g. incoming request)
    • needs publish span, which is not there for in-memory/streaming operations
    • minor: links would need to have name and timestamp
  3. Something new:
    • part of tracing (exported as (or along with) a span)
    • has all properties outlined above
    • does not have duration/status
@lmolkova lmolkova added the spec:trace Related to the specification/trace directory label Jul 1, 2024
@lmolkova
Copy link
Contributor Author

lmolkova commented Jul 1, 2024

Effectively, the only alternative to zero-duration spans that I can come up with is a new 'thing' (Option 3) that's part of tracing and is very similar to span, but does not have a duration or status.

Note:
The boundary between fully-local and network operation is still blurry - e.g. sending gRPC stream message or making UDP call still involves network, just no ack. The duration could be nano or micro-seconds and operation may still fail if connection is lost.

So my preference would be to stick to spans as a general-purpose solution for this problem.

@lmolkova lmolkova changed the title Are zero-duration spans valid? Should zero-duration spans be considered good/reasonable practice? Jul 1, 2024
@jack-berg
Copy link
Member

jack-berg commented Jul 2, 2024

I think calling the spans representing "Msg 1" and "Msg 2" zero-duration is underselling what's happening with them. These spans represent queueing a message to a local buffer for publishing. This action is probably approaching zero-duration, but is not actually zero. Therefore, I'd say that any disagreement on the basis of zero-duration is invalid because these actions do in fact have a duration.

The alternatives add additional complexity to the data model and do not appear to meaningfully improve the modeling of these types of scenarios.

I'm in favor of using spans to solve this problem.

@svrnm svrnm added the triage:deciding:community-feedback Open to community discussion. If the community can provide sufficient reasoning, it may be accepted label Jul 8, 2024
@pyohannes
Copy link
Contributor

This action is probably approaching zero-duration, but is not actually zero. Therefore, I'd say that any disagreement on the basis of zero-duration is invalid because these actions do in fact have a duration.

@jack-berg is right that the spans won't actually have a zero duration, but in many cases a very short duration. Maybe the discussion should be re-framed and not be about zero duration, but about a useful duration?

In many cases in question, the (nano-seconds) duration of such spans will not be of interest, and the only reason the spans are created is to obtain a distinct context. The span is more or less a side-effect. The question should be whether we're fine with that or not.

@lmolkova
Copy link
Contributor Author

lmolkova commented Jul 8, 2024

Adding more point raised in the spec discussion on 7/2:

  • we want something that can have children - this indicates it should be a span
  • we want it to be visualized similarly to spans
  • there were no objections proceeding with such spans in messaging semconv from the spec community

/cc @tedsuo in case you want to add any additional feedback on this

@lmolkova
Copy link
Contributor Author

lmolkova commented Jul 12, 2024

let's continue the discussion then.

SIGs regularly propose ideas that violate core design principles of OpenTelemetry

Since there was a consensus during the call and within messaging SIG, I feel we have different understanding of the principles.

Regardless of the above:

  • messaging SIG decided to give flexibility to create per-message spans for batch publish.
  • the alternative (create events with context) would have the same overhead.

@tedsuo
Copy link
Contributor

tedsuo commented Jul 12, 2024

I would like to understand if the following pseudo code models the situation:

foreach (messages as message){
  startSpan()
  injectContext(message)
  process(message)
  endSpan()
}

If that's the case, then this appears to be a very normal situation, except for the fact that the span durations are presumably very short, and we have little interest in them. Normally we'd avoid such tiny spans, but they are necessary in this case for building the graph correctly.

"Zero duration spans" are usually brought up as a way to model something that isn't really a span. But if the proposal is to create ordinary spans that are just very tiny, that seems reasonable. I would suggest that we avoid referring to them as "zero duration" to avoid confusion.

If we're talking about a different situation, do you mind providing some simple pseudo code to describe an example?

I do still have concerns about the overhead of generating so many spans. I agree that all other approaches proposed would have similar overhead... is this a problem? I haven't tested it, I really don't know the answer. If the overhead is unacceptable, then it seems like we have a real modeling challenge in front of us. If the overhead is fine, maybe this is just ordinary instrumentation, with nothing special about it? That would be a relief. :)

@jack-berg
Copy link
Member

Wanted to reiterate a comment I made in the 7/16/24 spec SIG:

The defining characteristics of the trace / span data model are: 1. spans have a start & end time (i.e. duration). 2. spans are arranged into a hierarchy where each hierarchy is a trace, and these hierarchies can be linked together into a larger graph through links.

When considering whether we should model something with a span, we should reject it if the operation being modeled has no useful duration AND does not contribute anything meaningful to the trace graph. Put the other way, if an operation either has a meaningful duration OR adds meaningful value to the trace graph, then its valuable to model as a span.

@tedsuo
Copy link
Contributor

tedsuo commented Jul 23, 2024

Thank Jack, it's a good definition and I completely agree.

In the 7/16/24 meeting, we reviewed the model for batch processing spans. These spans do represent meaningful operations in a trace graph, according to Jack's definition above. So there is nothing incorrect about using spans. In fact, this is a very normal use of spans, except for the fact that so many need to be created in order to correctly model the trace.

So really, the only concern is efficiency. And concerns about this were raised. Generating timestamps, random IDs, and other data structures have overhead. Making large numbers of individual API calls have overhead.

My request would be for the Messaging SIG to determine how bad this overhead is, and whether some kind of bulk API for span creation and injection could possibly lower this overhead in a meaningful way. If measuring the duration of these spans creates significant overhead, I would be in favor of marking them as zero duration as part of bulk span creation.

@lmolkova
Copy link
Contributor Author

lmolkova commented Jul 23, 2024

From the messaging SIG, we want to provide flexibility and allow to trace individual messages independently. Otherwise tracing may become not quite useful.

We can ask implementations to make such behavior configurable, but we want to allow creating spans per message.

As an owner of messaging SDK that does batching in multiple ways, I have to provide per-message tracing for my users.
We've had it implemented for several years. The learnings:

  • there are bunch of use-cases where per-message tracing is important. If OTel prohibits it or makes it opt-in even for batches, I'd not follow the spec.
  • it's confusing to some users and we want them to disable per-message tracing if they need to
  • perf impact depends A LOT on other things - message rate, batch size, message sizes, etc - there is no single number I can give you that would make sense without detailed explanation. In short: it's negligeable comparing to the impact of enabling tracing overall.

@tedsuo
Copy link
Contributor

tedsuo commented Jul 23, 2024

@jmacd made a point that we could use sampling. There is essentially the same problem in tracing any high volume system – a high volume of concurrent HTTP requests requires sampling to address overhead. So maybe, once again, batch operations are a normal tracing situation with a normal solution.

@lmolkova
Copy link
Contributor Author

The sampling applies to message spans in the same way as to any other spans, so yes, it will sample most of them out.

Sampling could in theory be used to suppress message span creation, but now that we have a better suppression mechanism (tracer.IsEnabled() - #4063) we could let users suppress per-message traces in this way by giving per-message tracer a different name.

So maybe, once again, batch operations are a normal tracing situation with a normal solution.

Please let me know if per-message tracing with above explanation on sampling/suppression/other means to disable it could be considered as a "normal solution" or you're looking for something different.

@tedsuo
Copy link
Contributor

tedsuo commented Jul 23, 2024

Creating a tracer per message feels extreme? Also, is there a way to create "disabled" tracers? I thought this was only for the case where the SDK is not installed.

span.IsRecording() could be used to suppress the injection operation when the span has not been sampled.

@lmolkova
Copy link
Contributor Author

lmolkova commented Jul 23, 2024

We can create different tracers for different types of telemetry:

  1. for all publishing calls
  2. another for all messages

With #3867 and #4063 users can decide to turn tracer off.

It's configured like this

OpenTelemetrySdk.builder()
    .setTracerProvider(
        SdkTracerProvider.builder()
            .addTracerConfiguratorCondition(nameEquals("per-message-tracer"), TracerConfig.disabled()) // disable tracer by name
            .build())
.....

See open-telemetry/opentelemetry-java#6375 for the details.

There is a great summary in #3867 description on why suppression could be a better choice than sampling in general case.

@tedsuo
Copy link
Contributor

tedsuo commented Jul 23, 2024

Ok I see what you mean. A "per-message" tracer to disable the entire scope, not a new tracer for every message 😅. That makes a lot of sense!

@lmolkova
Copy link
Contributor Author

lmolkova commented Jul 26, 2024

I've created open-telemetry/semantic-conventions#1273 (blocker for messaging stability) to encourage instrumentations to allow disabling per-message tracing for batch publish.

I'm going to close this issue, please reopen if necessary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
spec:trace Related to the specification/trace directory triage:deciding:community-feedback Open to community discussion. If the community can provide sufficient reasoning, it may be accepted
Projects
None yet
Development

No branches or pull requests

5 participants