Add semantic conventions for errors #427
@@ -0,0 +1,19 @@
# Semantic conventions for errors

This document defines semantic conventions for recording errors.

## Recording an Error

An error SHOULD be recorded as an `Event` on the span during which it occurred.
The name of the event MUST be `"error"`.
> The implication of having a name like "error" is that there is exactly one error that can be logged per span. What is the recommendation for tracking errors where a single request forks out into multiple requests (scatter-gather pattern) and there are multiple failures?

> I think the proper way would be to create a span for each sub-request. If a sub-request failure causes the parent request to fail, there has to be a single "combined" error anyway (e.g. the first error, a generic "one or more subrequests have failed" error, etc.).

> In golang, we are creating a separate span event for each error message.

> @mwear A reminder about what we discussed in the last WG meeting: possibly introduce SeverityNumber instead of the "error" event name as the indication of an erroneous event. This would align the Span Event better with the Log data model: https://github.com/open-telemetry/oteps/blob/master/text/logs/0097-log-data-model.md#error-semantics
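A minimal sketch of the pattern suggested in this thread, assuming the OpenTelemetry Python API (`get_tracer`, `start_as_current_span`, `add_event`); the attribute keys anticipate the table in the Attributes section below, and `SubRequestError` is a hypothetical aggregate type, not something this PR defines:

```python
# Sketch only: each sub-request gets its own span, each failure is recorded
# as an "error" event on that span, and the parent records one combined error.
from opentelemetry import trace

tracer = trace.get_tracer(__name__)


def scatter_gather(sub_requests):
    failures = []
    with tracer.start_as_current_span("scatter-gather") as parent:
        for send in sub_requests:
            with tracer.start_as_current_span("sub-request") as child:
                try:
                    send()
                except Exception as exc:
                    child.add_event(
                        "error",
                        attributes={
                            "error.type": type(exc).__qualname__,
                            "error.message": str(exc),
                        },
                    )
                    failures.append(exc)
        if failures:
            # A single "combined" error on the parent span, as suggested above.
            parent.add_event(
                "error",
                attributes={
                    "error.type": "SubRequestError",  # hypothetical aggregate type
                    "error.message": f"{len(failures)} sub-request(s) failed",
                },
            )
```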
## Attributes

The table below indicates which attributes should be added to the `Event` and
their types.
| Attribute name | Type | Notes and examples | Required? |
| :------------- | :--------- | :----------------------------------------------------------- | :-------- |
| error.type | String | The type of the error (its fully-qualified class name, if applicable). E.g. "java.net.ConnectException", "OSError" | Yes |
| error.message | String | The error message. E.g. `"Division by zero"`, `"Can't convert 'int' object to str implicitly"` | Yes |
| error.stack | Array\<String\> | A stack trace formatted as an array of strings. The stack frame at which the exception occurred should be the first entry. E.g. `["Exception in thread \"main\" java.lang.RuntimeException: Test exception", " at com.example.GenerateTrace.methodB(GenerateTrace.java:13)", " at com.example.GenerateTrace.methodA(GenerateTrace.java:9)", " at com.example.GenerateTrace.main(GenerateTrace.java:5)" ]` | No |

> @mwear thanks for tagging this PR @ open-telemetry/oteps#86. I like the discussion on error.stack. We should add the below tag: https://github.com/opentracing/specification/blob/master/semantic_conventions.md#span-tags-table

> If this was added, I'd rather call it

> @tigrannajaryan suggested that we use

> Thanks @jmacd. Small correction: the log data model uses SeverityNumber. It would be useful to keep the name consistent here too (and I am not sure what casing we want to use here - the log data model does not mandate a particular casing).
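As a non-normative illustration of the table above, this Python sketch derives the three attributes from a caught exception; `traceback.format_exception` returns the oldest frame first, so the output is reversed here to put the line where the exception occurred first:

```python
# Sketch only: derive the attributes from the table above for a caught
# exception. The reversal puts the exception line and innermost frame first,
# matching the ordering described for error.stack.
import traceback


def error_event_attributes(exc: BaseException) -> dict:
    stack = traceback.format_exception(type(exc), exc, exc.__traceback__)
    return {
        "error.type": type(exc).__qualname__,  # e.g. "ZeroDivisionError"
        "error.message": str(exc),             # e.g. "division by zero"
        "error.stack": [entry.rstrip("\n") for entry in reversed(stack)],
    }
```

Passing the result to `span.add_event("error", attributes=...)` attaches it to the enclosing span, as in the earlier sketch.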
> I like how simple this is, but it does not expose information about file:line etc. in a structured format for backends, which may use it for grouping and/or filtering. And given that most languages (at least JS, Java, Go, Python, Ruby & PHP) provide access to stack info in a structured/already-parsed way, we will end up with
>
> That is unneeded extra work both in OpenTelemetry and backends. So my 2 cents are that
>
> ```
> [
>   "{file: \"GenerateTrace.java\", line: 13, func: \"com.example.GenerateTrace.methodB\"}",
>   "{file: \"GenerateTrace.java\", line: 9, func: \"com.example.GenerateTrace.methodA\"}",
> ]
> ```
>
> And as a fallback, if a string can't be parsed as JSON, it is a string in the human-readable format you proposed.

> Defining an internal structure makes sense if languages have a mechanism (other than string parsing) to access the data. If folks can chime in with what fields are readily available for stack frames, perhaps we can come up with a set of required and optional attributes to collect. For Ruby we have Thread::Backtrace::Location. The

> There are existing systems that solved this problem a long time ago. We don't need to reinvent the wheel. E.g. https://docs.sentry.io/development/sdk-dev/event-payloads/stacktrace/

> I suppose that I have some reservations about the cost and complexity of collecting data at such a granular level. Most (if not all) of the relevant information is available in a stack trace formatted as an array of strings (or a newline-delimited one). If we added the language as an attribute, backends would have the information they need to parse the relevant data, and it would spare OpenTelemetry users from having to gather it. While this does increase the complexity for the backend, the trade-off is simplicity and improved performance for OpenTelemetry users. If we do go the route of more granular data, I think we should have actual structures for collecting it, and possibly separate APIs. That might make OTEP-69 a better fit.

> @mwear It would be helpful if you described the complexity (ideally write some code) even for one language of your choice, like Ruby. I insist that practically for most languages there is no significant difference between generating
>
> @yurishkuro all those error reporting services reinvent their own wheels to solve exactly the same task. And Sentry is not a particularly good example - it is a very old wheel with some legacy. I would start with something very simple like:
>
> It is self-documented and covers 99% of use cases. I intentionally omit

> @vmihailenco - I think that stack frame addressing may be a bit more varied than that. In Java you need both a class and a function to fully qualify a function unit, and you won't have a column. In some cases a frame will not have a file or line number - I believe that an exception thrown by a system framework in .NET would be an example of the latter.

> In Java I see
>
> That information can be nicely fitted in
>
> Ack, but that is not a very popular or useful case - debugging becomes very complicated without such basic info. But we could omit missing fields and use only

> That was actually just the first case I thought of, and that case is indeed useful - you may not go to the source to view the location, but a system framework frame is as useful for troubleshooting as an application source location. I believe a C/C++ stack trace won't have line numbers when debugging symbols have been stripped, either. I think that, to @yurishkuro's point, having a "simple" tuple is liable to back us into a corner, and there is some prior work that could inform a more extensible approach.

> Well, the idea that there is some prior work with good ideas has been repeated several times already - what are those ideas, and how do they solve the C++-without-debugging-symbols use case?

> AFAIK, in C++ you often have at least unwind information for exception handling, so you can get the stack in the form of a list of return addresses (e.g. using libunwind). With that, you record the base address of the module into which that address points, so you get something like

> You say you needed arrays for these conventions, but I can't see any advantages of the array over a newline-separated string.

> I was (incorrectly) under the assumption that there were length limits for attribute values and that a collection would be treated differently, in that the length limits would apply to each entry rather than the value as a whole. While it's true that OpenTelemetry doesn't specify length limits, backends and alternate SDK implementations might choose to do so. Having a stack trace stored as an array might make it easier to avoid truncation in this situation, so I still think the idea has some merit. Aside from that, I think a newline-delimited string and an array of strings are roughly equivalent, and transforming between the two representations is fairly trivial.

> A couple of attributes that will be important, and which are covered in open-telemetry/oteps#69, are a unique hash for the particular exception and another unique hash for the exception type and the line it was thrown in the code. This is important for deduplication and aggregation. Also, I still think it is important to support more kinds of errors besides application exceptions - for example, core dumps from IoT devices which would get sent after the device reboots.

> There also needs to be an attribute which contains specific input parameter values. This is supported by several backend systems. It helps in debugging exceptions caused by unexpected data inputs.

> @kbrockhoff could you explain why the hashes can't be computed in the backend?

> In some cases, the only way to effectively calculate the identity of an exception object is its address in memory. The only alternative I can think of that would work on the backend would be if the file and line number of the entire nested exception chain plus a full list of local variable values from the stack were recorded.

> Address of what in memory? That sounds deployment-dependent. At Uber we have an internal system that performs deduping of exceptions, all in the backend, by hashing the stacktrace.

> Hashing of the stacktrace works fine for identifying multiple triggers of the same exception/event, but does not help in identifying multiple reports of the same invocation or event - for example, if auto-instrumentation of a method call records the exception, which is then also recorded by tracing instrumentation higher up in the call stack.

> Hashing of the stacktrace could be important when information is dropped - for example, a Java stacktrace with nested causes can run into the hundreds of lines, so it's not uncommon to limit the data to the nearest so many frames and drop the rest; retaining a hash of the full exception would allow determining whether two now-identical stack traces started out that way. On the other hand, creating a hash of the exception is going to require serializing the whole exception in order to hash it, which might not be needed if you are only going to send a few frames.

> A string representation, whether it's an array of strings or newline-separated, does encode data in a way that can be troublesome to parse later on. In most languages I'm familiar with, the language offers access to the stack at the time of exception handling - it would seem that putting the stack frames into a normalized form would make sense at that point, rather than requiring a language-dependent parse stage on the backend.
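To make the two recurring ideas in this thread concrete, here is a rough Python sketch of (a) the per-frame `{file, line, func}` encoding proposed above and (b) a stacktrace hash of the kind mentioned for deduplication. The field names and the hashing scheme come from the comments, not from this PR's conventions, and are illustrative only.

```python
# Illustrative only: structured frames and a dedup hash, as discussed above.
import hashlib
import json
import traceback


def structured_frames(exc: BaseException) -> list:
    # Most runtimes already expose parsed frames; in Python this is
    # traceback.extract_tb, which yields filename/lineno/name per frame.
    frames = traceback.extract_tb(exc.__traceback__)
    return [
        json.dumps({"file": f.filename, "line": f.lineno, "func": f.name})
        for f in reversed(frames)  # innermost frame first
    ]


def stack_hash(exc: BaseException) -> str:
    # Hash the formatted stacktrace, similar to the backend-side deduplication
    # described above; computing it client-side would preserve identity even
    # if frames are later truncated.
    text = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```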
> Reading the semantic conventions here (especially `error.type`) it seems that these conventions are specifically geared towards recording thrown exceptions rather than any kind of error (cf. #306 which, despite the title, contains quite some interesting discussion about the issue).