Add support for more expressive span status API #123

noahfalk · 2020-07-01T03:48:51Z

A while back we discussed the status API and I was concerned the current API isn't expressive enough for many kinds of errors. I agreed to suggest an alternative, this OTEP. I know there has been a lot of past discussion and divergent opinion on this topic so I am not convinced there is a plan that everyone will get behind but I hope this at least contributes to moving the discussion forward.

Also I want to give a heads up that I may not have enough time to have extended discussions or revisions that something like this may deserve. With luck other contributors will also be motivated to push for improvements in this area too : )

linux-foundation-easycla · 2020-07-01T03:48:54Z

The committers are authorized under a signed CLA.

✅ Noah Falk (cd147a9)

text/0123-improve-span-status-api.md

Oberon00 · 2020-07-01T14:54:35Z

text/0123-improve-span-status-api.md

+
+**Closed vs. open-ended status descriptions** - We could make status represented by a fixed set of options such as the current gRPC codes or any larger/different set. It likely is simpler for tools consuming the data to only operate on fixed options and it is pleasant to imagine that all (or most) status results will be reasonably categorized into the fixed list. However I do not expect most users to take up the responsibility of mapping their status results if OpenTelemetry/back-end tools don't provide a clear representation. Instead these users will not set status or choose options that are generic and inconsistent rendering the data useless to everyone. For a fixed list to work I believe we either have to curate a very large list of options or accept that significant portions of the developer audience will not find the status representation useful.
+
+**API using attribute semantic conventions ** - It is also possible to do this via semantic conventions on attributes and although I think the strongly typed API is preferable I don't have that strong of an opinion. Semantic conventions are likely to have higher performance overhead, higher risk of error in key names and are less discoverable/refactorable in IDEs. The advantages are that there is some past precedent for doing this specific to http codes and new conventions can be added easily. If we go semantic conventions it does imply that Exception becomes a type that can be directly passed as an argument to SetAttribute(). Requiring the user to destructure the exception into a list of key value pairs would be overly onerous and error-prone for a common usage scenario. If desired the SDK or exporter could destructure it, but that can be determined independently from API design and I'd like to keep it out of scope.


Markdown syntax problem:

Suggested change

**API using attribute semantic conventions ** - It is also possible to do this via semantic conventions on attributes and although I think the strongly typed API is preferable I don't have that strong of an opinion. Semantic conventions are likely to have higher performance overhead, higher risk of error in key names and are less discoverable/refactorable in IDEs. The advantages are that there is some past precedent for doing this specific to http codes and new conventions can be added easily. If we go semantic conventions it does imply that Exception becomes a type that can be directly passed as an argument to SetAttribute(). Requiring the user to destructure the exception into a list of key value pairs would be overly onerous and error-prone for a common usage scenario. If desired the SDK or exporter could destructure it, but that can be determined independently from API design and I'd like to keep it out of scope.

**API using attribute semantic conventions** - It is also possible to do this via semantic conventions on attributes and although I think the strongly typed API is preferable I don't have that strong of an opinion. Semantic conventions are likely to have higher performance overhead, higher risk of error in key names and are less discoverable/refactorable in IDEs. The advantages are that there is some past precedent for doing this specific to http codes and new conventions can be added easily. If we go semantic conventions it does imply that Exception becomes a type that can be directly passed as an argument to SetAttribute(). Requiring the user to destructure the exception into a list of key value pairs would be overly onerous and error-prone for a common usage scenario. If desired the SDK or exporter could destructure it, but that can be determined independently from API design and I'd like to keep it out of scope.

(you see now why line breaks would be good? 😉 )

(I love reading paragraphs hunting for the changed space character! ... agreed and fixed 👍 )

Oberon00 · 2020-07-01T14:56:56Z

text/0123-improve-span-status-api.md

+
+**API using attribute semantic conventions ** - It is also possible to do this via semantic conventions on attributes and although I think the strongly typed API is preferable I don't have that strong of an opinion. Semantic conventions are likely to have higher performance overhead, higher risk of error in key names and are less discoverable/refactorable in IDEs. The advantages are that there is some past precedent for doing this specific to http codes and new conventions can be added easily. If we go semantic conventions it does imply that Exception becomes a type that can be directly passed as an argument to SetAttribute(). Requiring the user to destructure the exception into a list of key value pairs would be overly onerous and error-prone for a common usage scenario. If desired the SDK or exporter could destructure it, but that can be determined independently from API design and I'd like to keep it out of scope.
+
+**API using Span events** - Most of the rationale for attribute semantic conventions also applies here, events are effectively another key-value store mechanism. The timestamp that is attached to an event appears to hold little value as status is probably produced at the same approximate time the span end timestamp is recorded. Similar to attribute conventions it sounds like there is precedent for storing some errors as events.


Most of the rationale for attribute semantic conventions also applies here

I see this as completely orthogonal. If we use semantic conventions, they could prescribe span attributes or event attributes or both.

If we have additional API functions, they could be defined on the span or some event (builder) object, and even if defined on the span they could create event-level or span-level data.

I think we are in agreement? Perhaps my second paragraph made it sound like events aren't semantic conventions so I've clarified in the title that events are also conventions.

I've been limiting the scope of the discussion to the tracing API so here I was attempting to distinguish whether the user provides the information via strongly typed API or provides it via loosely typed conventions. I agree that a strongly typed API could store the data internally as attributes/events and doing so probably (but not necessarily) has some implications on the SDK API for reading the data back.

I added some text in the implementation section of the coming revision to explicitly call out the ability to store the data into attributes or events regardless of the tracing API that provides the data to us.

Oberon00 · 2020-07-01T15:00:15Z

text/0123-improve-span-status-api.md

+I would suggest an API called Span.SetStatus(...) that takes all the arguments above, and optionally overloads or default parameters that make common calls easier. For example in C#:
+
+````C#
+void SetStatus(Exception spanException, string statusType = "LanguageException", bool? successHint = null);


Capturing an exception with traceback and storing it until export is potentially extremely memory heavy (e.g. in Python the exception has a reference to a traceback, and the traceback has references to all local variables on the stack, preventing them from being GCd).

So the solution to this would probably be to call into some exporter-specific code already when calling this API.

Thanks! I added a comment about this issue and potential solution to the internal details section.

Oberon00 · 2020-07-01T15:11:21Z

text/0123-improve-span-status-api.md

+
+**Closed vs. open-ended status descriptions** - We could make status represented by a fixed set of options such as the current gRPC codes or any larger/different set. It likely is simpler for tools consuming the data to only operate on fixed options and it is pleasant to imagine that all (or most) status results will be reasonably categorized into the fixed list. However I do not expect most users to take up the responsibility of mapping their status results if OpenTelemetry/back-end tools don't provide a clear representation. Instead these users will not set status or choose options that are generic and inconsistent rendering the data useless to everyone. For a fixed list to work I believe we either have to curate a very large list of options or accept that significant portions of the developer audience will not find the status representation useful.
+
+**API using attribute semantic conventions ** - It is also possible to do this via semantic conventions on attributes and although I think the strongly typed API is preferable I don't have that strong of an opinion. Semantic conventions are likely to have higher performance overhead, higher risk of error in key names and are less discoverable/refactorable in IDEs. The advantages are that there is some past precedent for doing this specific to http codes and new conventions can be added easily. If we go semantic conventions it does imply that Exception becomes a type that can be directly passed as an argument to SetAttribute(). Requiring the user to destructure the exception into a list of key value pairs would be overly onerous and error-prone for a common usage scenario. If desired the SDK or exporter could destructure it, but that can be determined independently from API design and I'd like to keep it out of scope.


Note that you can provide a nicely typed API and still use semantic conventions. E.g. the API could just "serialize" whatever you give it into SetAttribute / CreateEvent calls. The difference is that the API between the exporter and the API is then "weakly typed".

To point at another project, the idea of "typed spans", i.e. utilities to have strongly typed APIs for semantic conventions, is at least as old as OpenTelemetry. E.g. see "Typed Spans" in the PR description of open-telemetry/opentelemetry-specification#571

As an additional advantage, the exception/error APIs could be provided as a package separate from the core API, (you could provide a function com.opentelemetry.contrib.errorutils.ErrorUtils.captureError(Span, other params) that sets any attributes/events on the span).

Thanks! I've added this suggestion to the list of alternatives.

arminru · 2020-07-01T17:22:13Z

text/0123-improve-span-status-api.md

+2. StatusData - A discriminated union of:
+   - An integer, a string, or a tuple of integer and string. These options can be used for:
+     - Enumerated status codes: For example an http status code could be represented as 404, "Not Found", or both. In the case of common status codes OpenTelemetry SDK or backend could optionally assist in filling out the remainder of a partially specified enumeration value. For enumerations that aren't well-known the community of users is responsible for determining any conventions.
+     - Free-form error messages


How would a named error with a separate message be captured? I would treat these separately and not combine them in one free-form string. This would allow for searching, filtering and grouping by the error name alone but still preserve the message. A backend could, for example, display the number of all EntityNotFound errors in one place and filter for them in another place where users can inspect the messages bearing the entities that were not found.

The current proposal wouldn't be able to capture this, but we could change the proposal to allow it. I've added some text to the trade-offs and mitigations section mentioning the idea. I'd be fine adding it, not sure if we'll see additional opinions about it?

Personally, I'd add it right from the start. I'd expect users want to capture the information in a way that most closely resembles how it's reported by their respective language or framework, rather than having to combine both in one string that potentially has to be parsed/decomposed for analysis later again.

My original intent wasn't that users would combine strings, but that they would provide either an enumerated status code OR a message, not both. If we anticipate it is common to provide both I agree I'd rather provide an API that easily facilitates that vs. an API that encourages users to create private conventions to merge and split text.

I expect it to be common enough to support it in the API. Often it will indeed be a type of error that can be enumerated but will not necessarily have an integer representation like we have in HTTP but rather a name of the error (like "Entity not found" or an exception name) with a (detailed) description (e.g. "The entity with ID '7583' was not found in the table 'products'" or an exception message).
Therefore I'm suggesting to have two strings called, for example, name, brief or kind and a separate string called details, message or description.
Happy to hear others' opinions on that.

text/0123-improve-span-status-api.md

noahfalk · 2020-07-06T21:08:03Z

I've updated the text based on the feedback, thanks for that! If you were looking for different/more changes vs. what I did please just let me know : )

arminru · 2020-07-07T11:13:57Z

text/0123-improve-span-status-api.md

+2. StatusData - A discriminated union of:
+   - An integer, a string, or a tuple of integer and string. These options can be used for:
+     - Enumerated status codes: For example an http status code could be represented as 404, "Not Found", or both. In the case of common status codes OpenTelemetry SDK or backend could optionally assist in filling out the remainder of a partially specified enumeration value. For enumerations that aren't well-known the community of users is responsible for determining any conventions.
+     - Free-form error messages


Personally, I'd add it right from the start. I'd expect users want to capture the information in a way that most closely resembles how it's reported by their respective language or framework, rather than having to combine both in one string that potentially has to be parsed/decomposed for analysis later again.

text/0123-improve-span-status-api.md

jmacd · 2020-07-07T15:38:02Z

text/0123-improve-span-status-api.md

+data hard to work with because it can no longer be treated uniformly.
+- **Loss of fidelity** - Mapping from a status representation that
+distinguishes a large number of different results to one that only
+distinguishes a few is inherently lossy. This prevents users from isolating


My understanding of the span status code is slightly different than the summary in this document. I see the span status code as a categorization of the status meant to augment our understanding of the span's condition. I would not characterize the span status code as a lossy transformation: I'd describe it as additive.

As an example, suppose we have an HTTP server and it encounters some kind of condition. In the HTTP domain we have well-defined status codes, which can be encoded in a non-lossy way using the http.status_code semantic convention.

I believe one of the sources of confusion here is that in order to categorize an error into one of the canonical span status codes, the author has to take a position, and that position will not be unique. Depending on positionality, different users will categorize the various HTTP status codes differently. Accordingly, I would define the span status code as the span actor's perspective on the activity it performed, how it would categorize its own status. Other actors in a system may witness consequential failures (e.g., a client receiving a HTTP 500 code) and handle them in a way that disagrees with the self-assessment used by a particular span. E.g., a server may decide that a 500 status is maps to Internal errors, and the client that receives a 500 may decide to retry, thus suppressing the 500. That 500 could become a span event with http.status_code and http.status_text attributes, and still the client could have a successful retry and set the span status to OK.

Summarizing, I don't see a problem with Status code as it stands, but think we need to recognize that the Status code is not a unique position.

Thanks for the thoughtful feedback!

Summarizing, I don't see a problem with Status code as it stands

I'm not sure how to relate your opinion on the status code concept to the OTEP overall. How do you feel about the goals outlined in the goals section of the OTEP:
a) They aren't useful and don't need to be pursued?
b) They are useful but the existing Span API already satisfies them?
c) They are useful, and the existing Span API doesn't satisfy them. However we shouldn't frame the proposal as a change to the 'Status' API because that API is already fulfilling a different important purpose
d) Something else?

I believe one of the sources of confusion here is that in order to categorize an error into one of the canonical span status codes, the author has to take a position, and that position will not be unique

That sounds like the problem I named Inconsistency in the text, yes? "If the mapping from native representation to
OpenTelemetry representation isn't well-defined then API users or SDK implementations are unlikely to choose the same mapping."

jmacd · 2020-07-07T15:42:02Z

FYI in the Spec SIG call today I requested that Google provide a design document on the canonical status codes, as there was such a document written when this practice was introduced. @mtwo will follow up.

mtwo · 2020-07-07T15:50:05Z

@jmacd @bogdandrutu I read the doc that you directed me to and it describes high level concepts like how to return an error or how to check for an error, but doesn't actually include the status code proto. I'm guessing that we want both?

Let me know and I'll reach out to Sanjay and will CC both of you.

Oberon00 · 2020-07-07T15:54:15Z

Maybe one alternative to this OTEP would be to:

Definitely change the default span status to UNKNOWN instead of OK when not explicitly set.
Maybe rename "status" to "statusCategory"
If we find an error condition that is not representable with current semantic conventions, add one. The most common one I can think of would be the exception semantic conventions discussed in Add semantic conventions for errors opentelemetry-specification#427. But we could also add an status.hresult attribute to encode a HRESULT integer, or status.posix to store a symbolic POSIX error such as ECONNREFUSED.

jmacd · 2020-07-07T19:39:38Z

@mtwo Yes please if you could ask Sanjay to locate a document that lays out the status codes and justifies them, probably this is what Bogdan and I remember reading many years ago. I think that would help this discussion.

mtwo · 2020-07-07T19:48:14Z

On it! I’ll CC both of you on the mail, as I don’t actually know Sanjay well From: Joshua MacDonald<mailto:notifications@github.com> Sent: Tuesday, July 7, 2020 12:39 PM To: open-telemetry/oteps<mailto:oteps@noreply.github.com> Cc: Morgan McLean<mailto:morgan_mclean@hotmail.com>; Mention<mailto:mention@noreply.github.com> Subject: Re: [open-telemetry/oteps] Add support for more expressive span status API (#123) @mtwo<https://eur06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fmtwo&data=02%7C01%7C%7C699e7e8630844254a81308d822ad85f2%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637297475950934777&sdata=hr6w9SbIrUlaFAf79uZgS88tof8EWnKhfa%2Bibvu0%2Bys%3D&reserved=0> Yes please if you could ask Sanjay to locate a document that lays out the status codes and justifies them, probably this is what Bogdan and I remember reading many years ago. I think that would help this discussion. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub<https://eur06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fopen-telemetry%2Foteps%2Fpull%2F123%23issuecomment-655082541&data=02%7C01%7C%7C699e7e8630844254a81308d822ad85f2%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637297475950934777&sdata=DE4%2FZ%2FFMhEFPITfMMjyOd%2BkUiZteLKciKyvskdTf684%3D&reserved=0>, or unsubscribe<https://eur06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAAIXLK6XBRKBGARNPNJXRWDR2N2YTANCNFSM4OM7Y6YQ&data=02%7C01%7C%7C699e7e8630844254a81308d822ad85f2%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637297475950944762&sdata=qxnGL72pFCLTHH6is9lYCdqKZRtJaPhVYE5NQcC9Q3k%3D&reserved=0>.

mtwo · 2020-07-08T02:31:39Z

Message sent! You and Bogdan are CC’d From: Joshua MacDonald<mailto:notifications@github.com> Sent: Tuesday, July 7, 2020 12:39 PM To: open-telemetry/oteps<mailto:oteps@noreply.github.com> Cc: Morgan McLean<mailto:morgan_mclean@hotmail.com>; Mention<mailto:mention@noreply.github.com> Subject: Re: [open-telemetry/oteps] Add support for more expressive span status API (#123) @mtwo<https://eur06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fmtwo&data=02%7C01%7C%7C699e7e8630844254a81308d822ad85f2%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637297475950934777&sdata=hr6w9SbIrUlaFAf79uZgS88tof8EWnKhfa%2Bibvu0%2Bys%3D&reserved=0> Yes please if you could ask Sanjay to locate a document that lays out the status codes and justifies them, probably this is what Bogdan and I remember reading many years ago. I think that would help this discussion. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub<https://eur06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fopen-telemetry%2Foteps%2Fpull%2F123%23issuecomment-655082541&data=02%7C01%7C%7C699e7e8630844254a81308d822ad85f2%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637297475950934777&sdata=DE4%2FZ%2FFMhEFPITfMMjyOd%2BkUiZteLKciKyvskdTf684%3D&reserved=0>, or unsubscribe<https://eur06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAAIXLK6XBRKBGARNPNJXRWDR2N2YTANCNFSM4OM7Y6YQ&data=02%7C01%7C%7C699e7e8630844254a81308d822ad85f2%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637297475950944762&sdata=qxnGL72pFCLTHH6is9lYCdqKZRtJaPhVYE5NQcC9Q3k%3D&reserved=0>.

mwear · 2020-07-08T23:50:43Z

text/0123-improve-span-status-api.md

+easier. For example in C#:
+
+````C#
+void SetStatus(Exception spanException, string statusType = "LanguageException", bool? successHint = null);


Multiple exceptions can occur during the lifetime of a span and experiencing an exception may or may not result in a non-ok status depending on the circumstances. How can multiple exceptions can handled using this API?

Multiple functions returning error codes may be called during the lifetime of a span and an error return may result in a non-ok status or not.

I think it's up to the user/instrumentation to decide which exception was the semantically important one. Most likely the one that went uncaugth.

Agreed with @Oberon00. I also attempted to capture this concept in the scoping section above but perhaps the wording wasn't good?

The operation represented by a span might be composed of sub-operations (for
example a function that calls into other lower level functions) and each of
those sub-operations might have a notion of status or error. If those
sub-operations aren't represented with a child span then it is out of scope
how their status is captured.

Feedback suggestions Co-authored-by: Armin Ruech <armin.ruech@dynatrace.com>

noahfalk · 2020-07-10T12:45:46Z

Maybe one alternative to this OTEP would be to:

Definitely change the default span status to UNKNOWN instead of OK when not explicitly set.

Maybe rename "status" to "statusCategory"

If we find an error condition that is not representable with current semantic conventions, add one. The most common one I can think of would be the exception semantic conventions discussed in Add semantic conventions for errors opentelemetry-specification#427. But we could also add an status.hresult attribute to encode a HRESULT integer, or status.posix to store a symbolic POSIX error such as ECONNREFUSED.

Thanks for the suggestion! The first two parts sound fine to me but there isn't enough information for me to stake an opinion on the 3rd bullet. I did want to keep the spotlight on the API/data structure that developer directly observes when providing the status and I wasn't sure if you were refering to one of these options or something else?

Exception is a primitive that can be stored via SetAttribute

Span s = ...
Exception myError = ...
s.SetAttribute("status.exception", myError);

Exception is destructured by the user and stored as a series of SetAttribute calls

Span s = ...
Exception myError = ...
s.SetAtrribute("status.exception.type", myError.GetType());
s.SetAttribute("status.exception.message", myError.Message);
s.SetAttribute("status.exception.stackTrace", myError.StackTrace);
s.SetAttribute("status.exception.AnotherThing", myError.AnotherThing);

Option 1 or 2 above, except stored in a newly created event instead of span attribute
An OpenTelemetry provided strongly typed API whose implementation make use of underlying span attributes or events to store the data

Span s = ...
Exception myError = ...
s.SetException(myError);

IMO 1 or 4 could look pretty reasonable, but 2 requires the developer to write too much code.

Oberon00 · 2020-07-10T13:22:08Z

I like a variant of option 1: Instead of making it a primitive, we allow configuring a TracerProvider with AttributeSerializers that convert a non-primitive attribute value into one or more primitive attributes.

jmacd · 2020-07-10T16:30:21Z

@noahfalk in your comment #123 (comment)
Has this become entirely about exception reporting? There's no mention of Status Codes any longer. I think you'll find support for this, both (1) and (4). There is a recent proposal on exception reporting, again no mention of Status Code: open-telemetry/opentelemetry-specification#697

Note that the OTel-Go API already has a similar API to your (1) and (4). We're just waiting for semantic conventions to be finished. https://github.com/open-telemetry/opentelemetry-go/blob/9e99b441d2cb63bf7359b22860b47d8a8bbbb87d/api/trace/api.go#L126

@mtwo asked for some information from Google and we got a little bit of info from Sanjay Ghemawat. Here's a summary:

Google's Status Code were first developed in 2006 for an abstract file-system interface. They were adopted for use in RPC systems several years later. In 2012, these codes were recommended for all APIs. Over the 14 years, only one new code was introduced. UNAUTHORIZED was split apart from PERMISSION_DENIED. The primary motivation for the recommendation to use canonical codes in all APIs resulted from bugs caused by translation of error codes in multi-layer systems. Google still has other error code spaces in use, but the author of an alternative error code space is required to provide a translation into the canonical error space.

All of this leads me to think that we should remove Span Status as a special field and use semantic conventions for error codes. To me, asking users to specify a canonical status code when they are using another error code space raises the original problem that Google meant to solve: "bugs caused by translation of error codes". We should not be asking users to translate error code spaces. If we can ignore backwards compatibility for the time being. As a first step, I would specify that users should simply record their status in its natural code space, e.g., "hresult.status_code", "posix.status_code", "http.status_code". These conventional code spaces are standardized, so a vendor can automatically translate into canonical status codes. If the user has no natural error code space of their own, they may consider using Google's canonical error space.

The problem with all of this, IMO, is that Google has done nowhere near enough to promote their error code space as a generic solution to propagating errors in a multi-layer system. When gRPC was released, public documentation emerged that talked about the meaning of the codes and how they can be used with gRPC, but there is still no public design document. We need a design document that goes into great detail about the fine distinctions between the codes. For example, how and why is UNAUTHORIZED different from PERMISSION_DENIED? Note that the first Google result for "grpc status codes" is this, a page from the Google Maps Booking API documentation. This is not a design document, though it goes into depth on FAILED_PRECONDITION vs. ABORTED vs. UNAVAILABLE and INVALID_ARGUMENT vs. FAILED_PRECONDITION vs. OUT_OF_RANGE. @mtwo if I may make a suggestion, this feels like a responsibility for the gRPC team. Is there a gRPC product manager who could dig up an old internal design document, the one that was used in 2012 to justify the universal use of these codes, sanitize it, and publish externally?

Oberon00 · 2020-07-10T17:17:33Z

I would be in favor of removing the dedicated status code field from Spans. For our use cases in Dynatrace so far, we haven't used it so far and instead did what you suggest in your comment, i.e., adding dedicated semantic conventions for specific kinds of errors.

yurishkuro · 2020-07-10T21:59:31Z

+1 to remove the explicit span status. We didn't have it in OpenTracing and I never run into a user ask for adding it. Some users at Uber did ask for more consistent application of the error attribute by instrumentation, but not for detailed status code (aside from domain-specific codes already captured as attributes).

noahfalk · 2020-07-10T23:47:15Z

[@jmacd] @noahfalk in your comment #123 (comment)
Has this become entirely about exception reporting?

I also care about enumerated status codes (including ones like Http, HRESULT or POSIX) and error messages. I was using Exception as an example because @Oberon00 had put some focus on it in the comment I was responding to ("The most common one I can think of would be the exception semantic conventions..."). Specific to .NET I expect Exception to be the most common case but I also anticipate enumerated error codes and error messages to show up in some scenarios.

[@Oberon00] I would be in favor of removing the dedicated status code field from Spans. For our use cases in Dynatrace so far, we haven't used it
[@yurishkuro] We didn't have it in OpenTracing and I never run into a user ask for adding it

Probably implied from the OTEP that I also agree : ) To add the .NET evidence to the others - Activity, the API .NET has used for many years for distributed tracing has nothing like the current Grpc code specification of Status. To the best of my knowledge no user or telemetry tool has ever told us it would be valuable to them prior to its appearance in OpenTelemetry.

iNikem · 2020-07-15T06:03:56Z

We have "mostly approved" open-telemetry/opentelemetry-specification#697 for recording exceptions on Spans. Among other things, it was decided to decouple span status from exceptions. This means that mentioning error messages or exception objects and having them in API proposed by this PR is a contradiction to that decision.

We already have technology/runtime specific semantic conventions for various status/error codes, such as http.status_code and will definitely have more of them.

I am cautiously sceptical that operations statuses from very different domains can be mapped to gRPC status codes. I was surprised in the past when discovered, that we have status codes copied from gRPC.

Taking into account all of the above I will join the proposal to remove span.status field from the specification. We should close this PR then.

noahfalk · 2020-07-15T10:23:42Z

Taking into account all of the above I will join the proposal to remove span.status field from the specification. We should close this PR then.

No objection to closing the PR and thanks for everyone's consideration and discussion : ) I'd also be glad if the generic term 'status' is no longer defined as a grpc code.

I do want to call out that open-telemetry/opentelemetry-specification#697 only partially satisfied the scope of proposal here for anyone looking back on this in the future. Remaining open issues are how to capture:

free-form error messages
well-known enumerated status codes such as POSIX or HRESULT
User-defined or less well known enumerated codes

I'm fine with the premise that solutions to these issues could be designed later, potentially post-GA, and potentially via semantic conventions only.

iNikem · 2020-07-16T05:34:31Z

@noahfalk will you close this PR? Or some action is needed before that could happen?

jmacd · 2020-07-17T17:38:10Z

I'm not sure where the discussion about Span Status will go, but there's interest in the Metrics specification of having a nice way to label span-like duration measurements with a status, and I think it would be nice if we could use the natural status code, not the canonical status code. See open-telemetry/opentelemetry-specification#657 (comment)

noahfalk requested review from bogdandrutu, carlosalberto, jmacd, reyang, SergeyKanzhelev, tedsuo, tigrannajaryan and yurishkuro as code owners July 1, 2020 03:48

Add support for more expressive status API

cd147a9

noahfalk force-pushed the span_status branch from 68bfcb0 to cd147a9 Compare July 1, 2020 03:50

Oberon00 reviewed Jul 1, 2020

View reviewed changes

text/0123-improve-span-status-api.md Outdated Show resolved Hide resolved

Oberon00 reviewed Jul 1, 2020

View reviewed changes

arminru reviewed Jul 1, 2020

View reviewed changes

Shorten line length

32bab4b

noahfalk requested a review from a team July 6, 2020 20:03

Addressing review feedback

ddb7f49

Oberon00 mentioned this pull request Jul 7, 2020

Library instrumentation: StatusCanonicalCode open-telemetry/opentelemetry-specification#306

Closed

arminru reviewed Jul 7, 2020

View reviewed changes

jmacd reviewed Jul 7, 2020

View reviewed changes

mwear reviewed Jul 8, 2020

View reviewed changes

Apply suggestions from code review

6c0a4de

Feedback suggestions Co-authored-by: Armin Ruech <armin.ruech@dynatrace.com>

Oberon00 mentioned this pull request Jul 10, 2020

Exception reporting specification open-telemetry/opentelemetry-specification#697

Merged

trask mentioned this pull request Jul 16, 2020

Verify actual status instead of just errored open-telemetry/opentelemetry-java-instrumentation#608

Closed

noahfalk closed this Jul 16, 2020

trask mentioned this pull request Jul 16, 2020

Need to map HTTP and gRPC status codes to Span.setStatus() open-telemetry/opentelemetry-java-instrumentation#326

Closed

iNikem mentioned this pull request Jul 16, 2020

Remove Span.Status from API open-telemetry/opentelemetry-specification#706

Closed

jmacd mentioned this pull request Jul 17, 2020

Add metrics semantic conventions for timed operations open-telemetry/opentelemetry-specification#657

Closed

mwear mentioned this pull request Aug 18, 2020

Span Status Improvements (without errors) #133

Closed

noahfalk mentioned this pull request Oct 2, 2023

REQUEST: New membership for noahfalk open-telemetry/community#1706

Closed

6 tasks


		Closed vs. open-ended status descriptions - We could make status represented by a fixed set of options such as the current gRPC codes or any larger/different set. It likely is simpler for tools consuming the data to only operate on fixed options and it is pleasant to imagine that all (or most) status results will be reasonably categorized into the fixed list. However I do not expect most users to take up the responsibility of mapping their status results if OpenTelemetry/back-end tools don't provide a clear representation. Instead these users will not set status or choose options that are generic and inconsistent rendering the data useless to everyone. For a fixed list to work I believe we either have to curate a very large list of options or accept that significant portions of the developer audience will not find the status representation useful.

		API using attribute semantic conventions - It is also possible to do this via semantic conventions on attributes and although I think the strongly typed API is preferable I don't have that strong of an opinion. Semantic conventions are likely to have higher performance overhead, higher risk of error in key names and are less discoverable/refactorable in IDEs. The advantages are that there is some past precedent for doing this specific to http codes and new conventions can be added easily. If we go semantic conventions it does imply that Exception becomes a type that can be directly passed as an argument to SetAttribute(). Requiring the user to destructure the exception into a list of key value pairs would be overly onerous and error-prone for a common usage scenario. If desired the SDK or exporter could destructure it, but that can be determined independently from API design and I'd like to keep it out of scope.


		API using attribute semantic conventions - It is also possible to do this via semantic conventions on attributes and although I think the strongly typed API is preferable I don't have that strong of an opinion. Semantic conventions are likely to have higher performance overhead, higher risk of error in key names and are less discoverable/refactorable in IDEs. The advantages are that there is some past precedent for doing this specific to http codes and new conventions can be added easily. If we go semantic conventions it does imply that Exception becomes a type that can be directly passed as an argument to SetAttribute(). Requiring the user to destructure the exception into a list of key value pairs would be overly onerous and error-prone for a common usage scenario. If desired the SDK or exporter could destructure it, but that can be determined independently from API design and I'd like to keep it out of scope.

		API using Span events - Most of the rationale for attribute semantic conventions also applies here, events are effectively another key-value store mechanism. The timestamp that is attached to an event appears to hold little value as status is probably produced at the same approximate time the span end timestamp is recorded. Similar to attribute conventions it sounds like there is precedent for storing some errors as events.

Add support for more expressive span status API #123

Add support for more expressive span status API #123

Conversation

noahfalk commented Jul 1, 2020

linux-foundation-easycla bot commented Jul 1, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Oberon00 Jul 1, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Oberon00 Jul 1, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

noahfalk commented Jul 6, 2020

Choose a reason for hiding this comment

jmacd Jul 7, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jmacd commented Jul 7, 2020

mtwo commented Jul 7, 2020

Oberon00 commented Jul 7, 2020

jmacd commented Jul 7, 2020

mtwo commented Jul 7, 2020 via email

mtwo commented Jul 8, 2020 via email

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

noahfalk commented Jul 10, 2020

Oberon00 commented Jul 10, 2020

jmacd commented Jul 10, 2020

Oberon00 commented Jul 10, 2020

yurishkuro commented Jul 10, 2020

noahfalk commented Jul 10, 2020

iNikem commented Jul 15, 2020

noahfalk commented Jul 15, 2020 • edited Loading

iNikem commented Jul 16, 2020

jmacd commented Jul 17, 2020

linux-foundation-easycla bot commented Jul 1, 2020 •

edited

Loading

Oberon00 Jul 1, 2020 •

edited

Loading

Oberon00 Jul 1, 2020 •

edited

Loading

jmacd Jul 7, 2020 •

edited

Loading

noahfalk commented Jul 15, 2020 •

edited

Loading