Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[exporterhelper] Use span links when tracing across queue #12212

Open
jade-guiton-dd opened this issue Jan 30, 2025 · 7 comments
Open

[exporterhelper] Use span links when tracing across queue #12212

jade-guiton-dd opened this issue Jan 30, 2025 · 7 comments
Labels
area:exporter enhancement New feature or request

Comments

@jade-guiton-dd
Copy link
Contributor

jade-guiton-dd commented Jan 30, 2025

Is your feature request related to a problem? Please describe.

The observability requirements for stable components requires the processing performance of a component to be observable.

At the moment, this is not quite true for the OTLP exporter (cf. telemetry review), and prevents its stabilization. Using the exporterhelper, it emits a span for the exporting operation, (including retries), but there is no link with the initial trace that enqueued the data, which prevents analysis of latency and queueing times.

Update: In the case of the in-memory queue and with the fix at #12225, there will actually be a parent-child relationship. But even in that case, I think a span link would be more appropriate for the reasons below.

Describe the solution you'd like

When the queue is enabled in exporterhelper, I think a span should be emitted, its SpanContext should be saved in the queue (potentially serialized if using the persistent queue), and the span for the export operation should have a span link linking it to the queueing operation. Note that the links may be N to N, because the batching exporter may fuse or split batches. Adding an attribute on each link describing how many items were processed from an initial batch may be useful.

Describe alternatives you've considered

It would also be possible to measure the latency using a new metric, but this would provide less information, and would still require some context propagation through the queue.

Additional context

Related issues: #11740, #8804

@bogdandrutu
Copy link
Member

This is only a problem for the persistent queue implementation. The in-memory queue propagates everything but the deadline. See https://github.com/open-telemetry/opentelemetry-collector/blob/main/exporter/exporterhelper/internal/queue_sender.go#L167C15-L167C28

@jade-guiton-dd
Copy link
Contributor Author

jade-guiton-dd commented Jan 31, 2025

@bogdandrutu In my tests, I think the traces were broken even with the memory queue, but I guess that must be a bug then. I'll check again.

@jade-guiton-dd
Copy link
Contributor Author

jade-guiton-dd commented Jan 31, 2025

It looks like my observations are because of a bug with the batching code: when batching is not enabled, data goes through the DisabledBatcher, which discards the provided Context and creates a new one with context.Background for some reason. I filed a PR to fix that: #12225 (Edit: Fix merged in #12231)

However, even with we fix that bug, I think there is still an improvement to be made with the in-memory queue: it currently creates a parent-child relationship between the previous component's span and the export operation span, which will not always make sense because of the asynchronous nature of the queue, and the potentially N-to-N relation between enqueue operations and export operations caused by the batcher. This is why I am arguing we should use span links instead. I'll rename the issue.

@jade-guiton-dd jade-guiton-dd changed the title [exporterhelper] Preserve trace context through queue [exporterhelper] Use span links when tracing across queue Jan 31, 2025
@jade-guiton-dd
Copy link
Contributor Author

jade-guiton-dd commented Jan 31, 2025

To give an example of what I mean by the parent-child relationship not making sense, this is what traces look like with a fix to the above bug:

Image

The exporter/otlphttp/bar/traces span is not contained within the parent receiver/otlp/foo/TracesDataReceived span, and could hypothetically be arbitrary long after if the queue weren't empty.

@bogdandrutu
Copy link
Member

The exporter/otlphttp/bar/traces span is not contained within the parent receiver/otlp/foo/TracesDataReceived span, and could hypothetically be arbitrary long after if the queue weren't empty.

Is there a strong requirement for this?

@bogdandrutu
Copy link
Member

However, even with we fix that bug, I think there is still an improvement to be made with the in-memory queue: it currently creates a parent-child relationship between the previous component's span and the export operation span, which will not always make sense because of the asynchronous nature of the queue, and the potentially N-to-N relation between enqueue operations and export operations caused by the batcher. This is why I am arguing we should use span links instead. I'll rename the issue.

If:

  • queue and batch are disabled -> simple child is ok
  • queue is disabled and batch enabled -> In general we need a new trace and links to "parents", but if the batch is only one request do we do a child Span?
  • If queue is enabled and batch is disabled -> Is it "that" wrong if this is a child?

@jade-guiton-dd
Copy link
Contributor Author

jade-guiton-dd commented Feb 3, 2025

As you said, we have a technical requirement to use span links when an export operation has multiple parent enqueue operations because of batching (because a span can only have one parent, but can have multiple links). I agree that we could technically use child spans in the other cases (batching disabled, or enabled but no merging was performed), but I think it would be good to do the same thing in all cases to avoid confusion.

Moreover, while I agree it's not "that" wrong to use child spans when batching is disabled but the queue is enabled, I think span links would give easier-to-interpret output, notably under load where there might be a considerable delay between the enqueue operation and the export operation. It sounds like span links were designed specifically for this sort of asynchronous processing according to the docs:

For example, let’s say we have a distributed system where some operations are tracked by a trace. In response to some of these operations, an additional operation is queued to be executed, but its execution is asynchronous. We can track this subsequent operation with a trace as well. We would like to associate the trace for the subsequent operations with the first trace, but we cannot predict when the subsequent operations will start. We need to associate these two traces, so we will use a span link.

@mx-psi mx-psi added this to the Self observability milestone Feb 5, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:exporter enhancement New feature or request
Projects
Status: Todo
Development

No branches or pull requests

3 participants