Add semantic conventions for errors #427

Closed - wants to merge 2 commits

19 changes: 19 additions & 0 deletions specification/data-error-semantic-conventions.md
@@ -0,0 +1,19 @@
# Semantic conventions for errors

This document defines semantic conventions for recording errors.
Member:
Reading the semantic conventions here (especially error.type), it seems that these conventions are specifically geared towards recording thrown exceptions rather than any kind of error (cf. #306 which, despite the title, contains quite a lot of interesting discussion about the issue).


## Recording an Error

An error SHOULD be recorded as an `Event` on the span during which it occurred.
The name of the event MUST be `"error"`.
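
For illustration, recording such an event with a present-day opentelemetry-java API might look roughly like the following. This is a minimal sketch, not part of the proposal; the helper name and the exact attribute set are illustrative.

```java
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.trace.Span;

final class ErrorEventRecorder {
  // Records a caught Throwable as an event named "error" on the given span,
  // using the attribute names proposed in this document.
  static void recordError(Span span, Throwable t) {
    span.addEvent("error", Attributes.of(
        AttributeKey.stringKey("error.type"), t.getClass().getName(),
        AttributeKey.stringKey("error.message"), String.valueOf(t.getMessage())));
  }
}
```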
Comment:
The implication of having a name like "error" is that there is exactly one error that can be logged per span. What is the recommendation to track errors where a single request forks out into multiple requests (scatter-gather pattern) and there are multiple failures?

Member @Oberon00 (Apr 8, 2020):

I think the proper way would be to create a span for each sub-request. If a sub-request failure causes the parent request to fail, there has to be a single "combined" error anyway (e.g. the first error, a generic "one or more subrequests have failed" error, etc.).

Member:
In Go, we create a separate span event for each error message.

Member:
@mwear A reminder about what we discussed in the last WG meeting: possibly introduce SeverityNumber instead of the "error" event name as the indication of an erroneous event. This would align the Span Event better with the Log data model: https://github.com/open-telemetry/oteps/blob/master/text/logs/0097-log-data-model.md#error-semantics
I think there is value in aligning embedded Span Events and standalone Log Records as much as possible.
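
As a rough, purely hypothetical sketch of that alternative: the attribute name `severity_number` and the free-form event name below are illustrative only and not proposed conventions; 17 is the ERROR value in the log data model's SeverityNumber scale.

```java
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.trace.Span;

final class SeverityEventSketch {
  // Hypothetical: instead of reserving the event name "error", mark the event
  // as erroneous with a severity attribute aligned with the log data model
  // (SeverityNumber 17 == ERROR).
  static void recordWithSeverity(Span span, Throwable t) {
    span.addEvent(t.getClass().getSimpleName(), Attributes.of(
        AttributeKey.longKey("severity_number"), 17L,
        AttributeKey.stringKey("error.message"), String.valueOf(t.getMessage())));
  }
}
```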


## Attributes

The table below indicates which attributes should be added to the `Event` and
their types.

| Attribute name | Type | Notes and examples | Required? |
| :------------- | :--------- | :----------------------------------------------------------- | :-------- |
@vikrambe (Feb 6, 2020):
@mwear thanks for tagging this PR in open-telemetry/oteps#86.
The proposed fields are already part of the OpenTracing specification. Please refer to https://github.com/opentracing/specification/blob/master/semantic_conventions.md#log-fields-table

I like the discussion on error.stack. We should also add the tag below, from https://github.com/opentracing/specification/blob/master/semantic_conventions.md#span-tags-table:

| error | bool | true if and only if the application considers the operation represented by the Span to have failed |

Member:

If this was added, I'd rather call it fatal_error, unrecoverable or similar. It would still be an error that has occurred even after being able to recover from it by retrying or using another fallback.

Contributor:

@tigrannajaryan suggested that we use log.severity to indicate how severe the exception is. I would prefer log.severity in this case.

Member:

Thanks @jmacd. Small correction: log data model uses SeverityNumber. Would be useful to keep the name consistent here too (and I am not sure what casing we want to use here - log data model does not mandate a particular casing).

| error.type | String | The type of the error (its fully-qualified class name, if applicable). E.g. "java.net.ConnectException", "OSError" | Yes |
| error.message | String | The error message. E.g. `"Division by zero"`, `"Can't convert 'int' object to str implicitly"` | Yes |
| error.stack | Array\<String\> | A stack trace formatted as an array of strings. The stack frame at which the exception occurred should be the first entry. E.g. `["Exception in thread \"main\" java.lang.RuntimeException: Test exception", " at com.example.GenerateTrace.methodB(GenerateTrace.java:13)", " at com.example.GenerateTrace.methodA(GenerateTrace.java:9)", " at com.example.GenerateTrace.main(GenerateTrace.java:5)" ]` | No |
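
As a sketch of how these attribute values could be assembled from a caught exception in Java, following the example above (the helper class below is illustrative, not part of the proposal):

```java
import java.util.ArrayList;
import java.util.List;

final class ErrorAttributeValues {
  final String type;         // error.type
  final String message;      // error.message
  final List<String> stack;  // error.stack

  ErrorAttributeValues(Throwable t) {
    this.type = t.getClass().getName();             // e.g. "java.net.ConnectException"
    this.message = String.valueOf(t.getMessage());  // e.g. "Division by zero"
    List<String> frames = new ArrayList<>();
    frames.add(t.toString());                       // header: class name plus message
    for (StackTraceElement e : t.getStackTrace()) {
      frames.add("    at " + e);                    // mirrors the printStackTrace layout
    }
    this.stack = frames;
  }
}
```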
@vmihailenco (Jan 28, 2020):

I like how simple this is, but it does not expose information about file:line etc. in a structured format for backends, which may use it for grouping and/or filtering. And given that most languages (at least JS, Java, Go, Python, Ruby & PHP) provide access to stack info in a structured, already-parsed way, we will end up with:

- Languages/runtimes providing stack info in a structured way like {file: string, line: number, func: string}
- OpenTelemetry requiring that stack info be encoded as an array of strings in a language-native, human-readable format
- Backends having to parse those strings to extract the information they need

That is unneeded extra work both in OpenTelemetry and in backends. So my 2 cents are that error.stack should have some internal structure, like in this proposal. Perhaps by default it should be an array of JSON-encoded strings like this:

[
  "{file: \"GenerateTrace.java\", line: 13, func: \"com.example.GenerateTrace.methodB\"}",
  "{file: \"GenerateTrace.java\", line: 9, func: \com.example.GenerateTrace.methodA\"}",
]

And as a fallback, if a string can't be parsed as JSON, it is a string in the human-readable format you proposed.
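
For what it's worth, a minimal Java sketch of producing such JSON-encoded frame strings from a Throwable (escaping is elided for brevity, and field availability varies; getFileName() can be null, for example):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class StructuredStackSketch {
  // Encodes each stack frame as a small {file, line, func} JSON object, as
  // suggested above. Real code would escape quotes/backslashes in the values.
  static List<String> jsonFrames(Throwable t) {
    return Arrays.stream(t.getStackTrace())
        .map(e -> String.format("{\"file\": \"%s\", \"line\": %d, \"func\": \"%s.%s\"}",
            e.getFileName(), e.getLineNumber(), e.getClassName(), e.getMethodName()))
        .collect(Collectors.toList());
  }
}
```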

Member Author:
Defining an internal structure makes sense if languages have a mechanism (other than string parsing) to access the data. If folks can chime in with what fields are readily available for stack frames, perhaps we can come up with a set of required and optional attributes to collect.

For Ruby we have Thread::Backtrace::Location. The line and file are readily available, but func is not.

Member:
There are existing systems that solved this problem a long time ago. We don't need to reinvent the wheel.

E.g. https://docs.sentry.io/development/sdk-dev/event-payloads/stacktrace/

Member Author:
I suppose that I have some reservations about the cost and complexity of collecting data at such a granular level. Most (if not all) of the relevant information is available in a stack trace formatted as an array of strings (or a newline-delimited one). If we added the language as an attribute, backends would have the information they need to parse the relevant data, and it would spare OpenTelemetry users from having to gather it. While this does increase the complexity for the backend, the trade-off is simplicity and improved performance for OpenTelemetry users.

If we do go the route of more granular data, I think we should have actual structures for collecting it, and possibly separate APIs. That might make OTEP-69 a better fit.

@vmihailenco (Jan 29, 2020):

@mwear It would be helpful if you described the complexity (ideally write some code), even for one language of your choice like Ruby. I insist that, practically, for most languages there is no significant difference between generating `at com.example.GenerateTrace.methodA(GenerateTrace.java:9)` and `{file: "com.example.GenerateTrace.methodA(GenerateTrace.java", line: 9}`. Or the difference will be the same as between sprintf and json_encode. And it is not that hard to use sprintf to generate JSON if you are worried about micro-optimizations. Overall, the requirement to use JSON is imposed by the OpenTelemetry spec - it is not specific to this use case. IMHO OpenTelemetry could be more flexible and let exporters decide what to do in this case, instead of fitting data structures to protobuf usage. E.g. JSON or MsgPack handles this without any problems.

@yurishkuro all those error reporting services reinvent their own wheels to solve exactly the same task. And Sentry is not a particularly good example - it is a very old wheel with some legacy.

I would start with something very simple like:

- file - required
- line - required
- func - required
- column - optional, for JS

It is self-documented and covers 99% of use cases. I intentionally omit:

- language-specific things like module, class, namespace, etc. - they can be stored in the func attribute and are not interesting enough to justify complicating things
- frame vars aka context-locals - AFAIK only supported by Python & Ruby, and they double or triple the stack size
- iOS and symbolication in general - too complex and too specific
- in_app, abs_path, raw_function - specific to Sentry
- source_version - it is not clear why every frame needs this information or how to obtain it

Comment:
@vmihailenco - I think that stack frame addressing may be a bit more varied than that. In Java you need both a class and a function to fully qualify a function unit, and you won't have a column. In some cases a frame will not have a file or line number; I believe that an exception thrown by a system framework in .NET would be an example of the latter.

@vmihailenco (Feb 20, 2020):

In Java I see StackTraceElement, which has:

- getMethodName
- getClassName, which can be used together with ClassUtil.getFilepath to obtain the file path
- getLineNumber

That information fits nicely into the {func, file, line} structure I suggested.

> In some cases a frame will not have a file or number

Ack, but that is not a very common or useful case - debugging becomes very complicated without such basic info. But we could omit the missing fields and use only {func} when file:line is not available.

Comment:

That was actually just the first case I thought of, and that case is indeed useful: you may not go to the source to view the location, but a system framework frame is as useful for troubleshooting as an application source location. I believe a C/C++ stack trace won't have line numbers when debugging symbols have been stripped, either. I think that, to @yurishkuro's point, having a "simple" tuple is liable to back us into a corner, and there is some prior work that could inform a more extensible approach.

Comment:

Well, the idea that there is some prior work with good ideas has been repeated several times already - what are those ideas, and how do they solve the C++-without-debugging-symbols use case?

Member @Oberon00 (Feb 21, 2020):

AFAIK, in C++ you often have at least unwind information for exception handling, so you can get the stack in the form of a list of return addresses (e.g. using libunwind). With that, you record the base address of the module into which each address points, so you get something like 0xfff12345 (libfoobar.so+0xabc12). At the backend, you may have (detached) debugging symbols for libfoobar.so available, so you can then combine that information and symbolize the module+offset pair to get something like 0xfff12345 (libfoobar.so!MyClass::myFunction(int,long)+0xa1, MyClass.cpp:356).

Member:

You say you needed arrays for these conventions, but I can't see any advantage of an array over a newline-separated string.

Member Author:

I was (incorrectly) under the assumption that there were length limits for attribute values and that a collection would be treated differently, in that the length limits would apply to each entry rather than the value as a whole.

While it's true that OpenTelemetry doesn't specify length limits, backends and alternate SDK implementations might choose to do so. Having a stack trace stored as an array might make it easier to avoid truncation in this situation, so I still think the idea has some merit. Aside from that, I think a newline delimited string and an array of strings are roughly equivalent, and transforming between the two representations is fairly trivial.
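
Indeed, converting between the two representations is nearly a one-liner; for example, in Java (a trivial sketch):

```java
import java.util.Arrays;
import java.util.List;

final class StackRepresentations {
  // Array-of-strings -> newline-delimited string.
  static String join(List<String> frames) {
    return String.join("\n", frames);
  }

  // Newline-delimited string -> array of strings.
  static List<String> split(String stack) {
    return Arrays.asList(stack.split("\n"));
  }
}
```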

Member:

A couple of attributes that will be important, and which are covered in open-telemetry/oteps#69, are a unique hash for the particular exception and another unique hash for the exception type and the line in the code where it was thrown. This is important for deduplication and aggregation.

Also, I still think it is important to support more kinds of errors besides application exceptions - for example, core dumps from IoT devices, which would get sent after the device reboots.

Member:

There also needs to be an attribute which contains specific input parameter values. This is supported by several backend systems. It helps in debugging exceptions caused by unexpected data inputs.

Member:

@kbrockhoff could you explain why the hashes can't be computed in the backend?

Member:

In some cases, the only way to effectively calculate the identity of an exception object is its address in memory. The only alternative I can think of that would work on the backend would be to record the file and line number of the entire nested exception chain, plus a full list of local variable values from the stack.

Member:

Address of what in memory? That sounds deployment-dependent. At Uber we have an internal system that performs deduping of exceptions, all in the backend, by hashing the stacktrace.

Member:

Hashing of the stacktrace works fine for identifying multiple triggers of the same exception/event, but it does not help in identifying multiple reports of the same invocation or event - for example, if auto-instrumentation of a method call records the exception, which is then also recorded by tracing instrumentation higher up in the call stack.

Comment:

Hashing of the stacktrace could be important when information is dropped. For example, a Java stacktrace with nested causes can run into hundreds of lines, so it's not uncommon to keep only the nearest so-many frames and drop the rest; retaining a hash of the full exception would allow determining whether two now-identical stack traces started out that way.

On the other hand, creating a hash of the exception is going to require serializing the whole exception in order to hash it, which might not otherwise be needed if you are only going to send a few frames.
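
A rough sketch of that idea in Java (hash the full rendered trace, then keep only the first N frames; the hashing scheme and frame limit are illustrative, not part of this proposal):

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

final class TruncatedStackWithHash {
  final List<String> frames;      // possibly truncated
  final String fullTraceSha256;   // hash of the complete, untruncated trace

  TruncatedStackWithHash(List<String> allFrames, int maxFrames) throws NoSuchAlgorithmException {
    byte[] digest = MessageDigest.getInstance("SHA-256")
        .digest(String.join("\n", allFrames).getBytes(StandardCharsets.UTF_8));
    this.fullTraceSha256 = String.format("%064x", new BigInteger(1, digest));
    this.frames = allFrames.subList(0, Math.min(maxFrames, allFrames.size()));
  }
}
```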

Comment:

A string representation, whether it's an array of strings or newline-separated, does encode data in a way that can be troublesome to parse later on. In most languages I'm familiar with, the language offers access to the stack at the time of exception handling; it would seem that putting the stack frames into a normalized form would make sense at that point, rather than requiring a language-dependent parse stage on the backend.