-
Notifications
You must be signed in to change notification settings - Fork 896
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clarify how Application Developers are supposed to set the Span status when using Instrumentation Libraries #4131
Comments
I do not believe so. In the application code, you do not have access to the Span in question, not it's Span Id nor any Attributes, nothing. In a theoretical The issue seems much more fundamental. The entire idea of the Span Status API is for the Application Developer to be able to overwrite the Span Status in the end:
In practice, when using an instrumentation library, the Application developer has no ability to do this in a controlled way, unless the Instrumentation Library provides it for every single instrumented function. My initial example actually does not matter. The Application developer should be able to overwrite the Span Status of invoked instrumented functions, regardless if the Instrumentation Library set it to It is the most basic usage of the Tracing API, as it seems intended and is specified. |
I think what @svrnm was referring to is the possibility of using a Span processor to implement the I can see this being useful to handle spans created by instrumentation libraries. For instance, imagine you want to unset the Would this be something that would fulfil your requirements? |
In the I guess you can hardcode some rules, such as "always overwrite the status of a Span which is the nth child to a parent Span with name 'X'". // the second lib_function() call can result in an error, which is then handled
@instrument
def my_function():
lib_function();
result=lib_function();
if result.is_error():
handle_error(result);
lib_function();
This becomes even more difficult when you have different possible tracetrees (due to branches and loops) or if you don't actually handle the error in all cases and want to leave the status on I'm feeling a bit crazy here, because I don't think this is a special requirement of mine, but simply a requirement of the Tracing API itself: The Span Status needs to be overwriteable by the Application developer. And if I understand correctly, this is currently only realizable by the Instrumentation Library enabling it (which they probably wont) or by working at the Collector level? It is regrettable, but it would answer the question. It would mean that in practice, Application developers encounter an |
Thanks @fkoep, I think I got a clearer understanding of the issue now, and of course you're not crazy, this is clearly a valid use case. While #1089 would allow to update the status of a span, it can only consider the attributes present in a given span to do this, which would have to have been added initially by the instrumentation lib. This may help in many situations, as in my example, but not in others, as you pointed above. However, as you can see in the discussion in #1089 making a span writeable after it has been ended has many implications to be discussed. This type of change would also require having access to spans outside of their instrumentation scope, or outside a processor, which is also challenging. It would be a non-trivial change. While there may be alternative solutions for this, I will defer to the Technical Committee (TC) to continue the conversation and give their perspective on this issue. |
Thanks @danielgblanco and +1 to the point that you should not feel crazy @fkoep, this is a legitimate request, but might not be easy to implement. An alternative that came to my mind: if you wrap your spans from the downstream library into a parent span you have full control over, would it work to set that parent span with status
|
Just playing devils advocate here, but in the use-case described, I'd actually prefer to see the Of course this varies, but I think this is useful information I'd like to know from my apps. |
TC Triage: We are currently discussing this issue. We think broad guidance on error modelling in Spans for OpenTelemetry is needed and will guide the future of this request. |
@joaopgrassi I can see your point. However, leaving the One could argue that |
Thanks for the responses! Good to see that it is being discussed.
As far as I am aware, the status of the parent span has no bearing on how the status of the child spans are to be interpreted? Of course, if the OpenTelemetry semantics were to be changed this way, it would solve the problem.
Well, you can spin that argument further and say that the OpenTelemetry Tracing API does not allow to accurately model the reality that an error has occured and has been handled. (Maybe instead of the enumeration values I can see the argument that one would like Spans with handled errors still be marked as Good point about the tail sampling! Actually, in our case, it is even worse than I have described earlier... We work against a proprietary service we have no control over, and besides the sporadic 404's we get when a resource is deleted, we also need to look regularly look up metadata for each resource. About 10% of resources do not have any associated metadata, which will also be reported by a 404. Should they have designed it that way? Probably not, but we can't change it.... And having >10% of our traces be marked as erroneous makes our error overview useless. In the end, we simply decided to turn off the tracing feature of our REST client library. |
Discussed at TC meeting: Ability to change error status on spans based on the application logic is an important part of a wider problem: how applications or other instrumentations can interact with inner instrumentation layers to enrich and customize their telemetry. It affects spans, metrics, maybe other signals. We want to tackle this problem in general. One of possible solutions could be to have scoped cross-signal callbacks allowing to customize telemetry items and/or classify errors. See this writeup for more details. Related:
The immediate mitigation to this specific issue (brought up by @svrnm and @danielgblanco in the comments) is to use It should be possible to identify spans which status should be changed in the callback and change status only for them. E.g. in the example above the filter would be similar to this: span is |
There are some confusing assumptions being made in the thread above. While I agree with @lmolkova there is a larger ssue and potentially room for improvement of the tracing interfaces, I find the assertions about "error traces" problematic. @fkoep I understand you have a scenario where you have an inner span returning 404s, which makes it look like 10% of your traces are "errors". However, as you point out, your application is able to handle those errors and so the parent span does not have an error. I believe you should be looking at the root span of a trace for your tail sampling decision, and then you will only sample true errors. Otherwise, I'm not sure how you're applying the logic of "tail sampling": usually this refers to a decision made about a whole trace, and there is no reason you need to inspect the inner 404 error spans while making your tail sampling decision. Therefore, it seems the way to resolve this problem is simply to stop interpreting inner span errors as errors: let them propagate to the root span and then consider them errors. |
+1 to what @jmacd said - protocol level failures (even when they are real errors) are a normal part of network communication that are usually retried and recovered from - they should not be used alone. Tail-based sampling and alerts would be more efficient and accurate if built around server spans (or something that application reports about itself) rather than client communication. |
What assumptions do you mean? To consider a trace as "erronous" when it contains an error span? We've been using a Jaeger-based Observability Solution, so maybe that's where this interpretation comes from. (In Jaeger, any trace which contains an error span is being visualized in red color. And when debugging an issue, it's common practice to just search with a "error=true" filter across all spans of all traces.)
From different docs, it seems like sampling based on the span status of non-root spans is common practice:
Not to throw any shade on your post, you are providing a valid alternative approach. I'm just not sure which assumptions you are contending. To your approach, I see difficulties with reporting non-fatal unhandled errors in subroutines which don't propagate up. A "fatal unhandled error", I would expect to propagate all the way up to the Root Span, as it fits the way the code is structured, too. With "non-fatal unhandled error" (maybe there is a better term for it) I mean an error which occurred in some subroutine, but for which the application switched to some fallback behavior. The error was not really handled, it was just non-fatal. You still want an error span emitted, the trace sampled, and for a human to look at the issue. However, in the code no information about these non-fatal errors is propagated all the way up to the Root Span location. I'm not sure how common this is in other code bases, in ours it it rather common. We sit between different external services and if some minor things are not working as expected, we rather just log an error and continue instead of halting our entire service. @lmolkova Oh yeah, you can make use of
So summarized: Rather than overwriting any SpanStatus, you would recommend leaving the CLIENT-Spans as-is and only sampled based on the status of the higher-up INTERNAL/SERVER-Span? Do i understand that correctly? That would seem to be the solution to our personal issues. Though now I'm a bit troubled to understand in which practical cases you would ever overwrite a SpanStatus from ERROR to OK, as this issue seemed to be the model case for that. |
I'm not sure changing the error status on existing spans is a viable solution. An application might call to a different service, if an error is returned, the calling service might be able to handle that. However, there's no way to change the status of the remote span(s), which are errors. Errors can happen in some contexts, and be handled in other. That should be modeled in traces. The notion that all traces with errors are problematic is quite deeply engrained in many current tools and workflows, and I would argue that at least for synchronous workflows one should move away from that notion.
I agree with that, with the caveat that things are a bit more complicated for asynchronous workflows. |
What are you trying to achieve?
As an Application developer, I should be able to set the Span Status of Spans created by an Instrumentation Library.
In practice, a Spans lifetime is often tied to a library functions scope. It is not clear how you should gain access to the Span from the calling side or alternatively, how an Instrumentation Library should enable me to influence the Span Status.
Additional context.
In the Span::SetStatus operation (https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/api.md#set-status), the following line raises questions:
When using an Instrumentation Library, how are Application developers supposed to do that?
It is stated that:
unset
or set it toerror
.This means the only way an Application Developer can set the Span Status is by the Instrumentation Library giving him the ability to modify the Span Status while the Span is still open.
However, most often Spans are tied to a functions scope and not accessible by the caller.
Example: When using an instrumented HTTP Client library, an Application can expect 404 errors in certain places. The Application might look up an resource and create it if it does not exist yet:
If the first HTTP GET returns a 404, the resulting
GET /picture/{id}
Span would be marked with anerror
Span Status by the the HTTP Client library (as per Semantic Conventions for HTTP Spans). We handled that error, so we want to set the status toOK
. However, the Span is hidden away inhttp.get()
and cannot be modified by the outer Application logic.In order for the Application Developer the overwrite this Span Status, the HTTP Client Library would need to accept a parameter for expected HTTP statusses, for which the Span Status should not be set to
error
:Alternatively, the HTTP Client Library would need to return the still-open Span, so the Span Status can be overwritten by the Application code:
Is that really how it is supposed to be done? That seems like a major imposition on Instrumention Libraries.
In my pseudocode example, the design of the REST API and Application code is debatable, but the Tracing specification clearly states that setting the Span Status like this is intended usage. Various Github Issues I found suggest the same. However, I cannot find any real-life examples.
Maybe I am overseeing something really simple which could be explained or recommended in 1-2 sentences in the
Tracing API
specification:The text was updated successfully, but these errors were encountered: