Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error flagging with status codes #136

Merged
merged 33 commits into from
Sep 17, 2020
Merged
Changes from 27 commits
Commits
Show all changes
33 commits
Select commit Hold shift + click to select a range
bd0177d
Add error flagging proposal
tedsuo Sep 9, 2020
54121cd
removed half finished sentence
tedsuo Sep 9, 2020
725cbc0
Remove error.message, it is redundant with the status message.
tedsuo Sep 9, 2020
ae86379
whitespace
tedsuo Sep 9, 2020
74dd7f5
whitespace
tedsuo Sep 9, 2020
0d34ea1
clarify convenience methods are in a seprate package
tedsuo Sep 9, 2020
d951674
whitespace
tedsuo Sep 9, 2020
c814974
whitespace
tedsuo Sep 9, 2020
e787f94
use correct RFC language
tedsuo Sep 9, 2020
3a25b4c
spelling
tedsuo Sep 9, 2020
4a42c3a
spellingUpdate text/trace/0000-error_flagging.md
tedsuo Sep 9, 2020
d204bc7
capitalization
tedsuo Sep 9, 2020
f551758
Capitalization
tedsuo Sep 9, 2020
44b1fa4
error mapping -> status mapping
tedsuo Sep 9, 2020
bb1bfb6
whitespace
tedsuo Sep 9, 2020
ccc0f5e
replace override status codes with user_override boolean
tedsuo Sep 9, 2020
241e2aa
rewrite based on error WG feedback
tedsuo Sep 10, 2020
db506d6
indicate that status source is a new field.
tedsuo Sep 10, 2020
7e5ef2b
remove some garbage
tedsuo Sep 10, 2020
6fb4bcb
Consolidate OPERATOR and APPLICATION into USER
tedsuo Sep 10, 2020
639e661
spelling
tedsuo Sep 10, 2020
0c8899b
nominal -> normal
tedsuo Sep 10, 2020
ec77065
markdownlint
tedsuo Sep 11, 2020
788844d
Add PR number to file name
tedsuo Sep 11, 2020
ed50989
more lint
tedsuo Sep 11, 2020
a6de9cc
clarify end users
tedsuo Sep 15, 2020
19a3d23
clarify end user
tedsuo Sep 15, 2020
cc8f305
clarify terms, update old intro
tedsuo Sep 15, 2020
2532107
fix intro
tedsuo Sep 15, 2020
7de13e9
clarify the meaning of normal
tedsuo Sep 15, 2020
1537cc3
clarify status mapping
tedsuo Sep 15, 2020
2b9e943
Update text/trace/0136-error_flagging.md
tedsuo Sep 16, 2020
76f0597
final nits
tedsuo Sep 16, 2020
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
89 changes: 89 additions & 0 deletions text/trace/0136-error_flagging.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,89 @@
# Error Flagging with Status Codes

This proposal adds two status codes explicitly for use as overrides by the end user, and proposes a canonical mapping of semantic conventions to status codes. This clarifies how error reporting should work in OpenTelemetry.
tedsuo marked this conversation as resolved.
Show resolved Hide resolved
tedsuo marked this conversation as resolved.
Show resolved Hide resolved

## Motivation

Error reporting is a fundamental use case for distributed tracing. While we prefer that error flagging occurs within analysis tools, and not within instrumentation plugins, a number of currently supported analysis tools and protocols rely on the existence of an explicit error flag reported from instrumentation. In OpenTelemetry, the error flag is called "status codes".
tedsuo marked this conversation as resolved.
Show resolved Hide resolved

However, there is confusion over the mapping of semantic conventions to status codes, and concern over the subjective nature of errors. Which network failures count as an error? Are 404s an error? The answer is often dependent on the situation, but without even a baseline of suggested status codes for each convention, the instrumentation author is placed under the heavy burden of making the decision. Worse, the decisions will not be in sync across different instrumentation packages.
tedsuo marked this conversation as resolved.
Show resolved Hide resolved

There is one other missing piece, required for proper error flagging. Both application developers and operators have a deep understanding of what constitutes an error in their system. OpenTelemetry must provide a way for these users to control error flagging, and explicitly indicate that it is the end user setting the status code, and not instrumentation plugins. In these specific cases, the error flagging is known to be correct: the end user has decided the status of the span, and they do not want another interpretation.
tedsuo marked this conversation as resolved.
Show resolved Hide resolved
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

End user would be the person using the application (e.g., the person shopping on the instrumented web shop) rather than the developer instrumenting the application, right? Maybe we can call it application or instrumentation developer? I think end user only makes sense from a back end vendor's POV (like we both are) but not for a general audience.


While generic instrumentation can only provide a generic schema, end users are capable of making subjective decisions about their systems. And, as the end user, they should get to have the final call in what constitutes an error. In order to accomplish this, there must be a way to differentiate between errors flagged by instrumentation, and errors flagged by the end user.
tedsuo marked this conversation as resolved.
Show resolved Hide resolved

## Explanation

The following changes add several missing features required for proper error reporting, and are completely backwards compatible with OpenTelemetry today.

### Status Codes

Currently, OpenTelemetry does not have a use case for differentiating between different types of errors. However, this use case may appear in the future. For now, we would like to reduce the number of status codes, and then add them back in as the need becomes clear. We would also like to differentiate between status codes which have not been
set, and an explicit OK status set by an end user.

* `UNSET` is the default status code.
* `ERROR` represents all error types.
tedsuo marked this conversation as resolved.
Show resolved Hide resolved
* `OK` represents a span which has been explicitly marked as being free of errors, and should not be counted against an error budget. Note that only end users should set this status. Instead, instrumentation should leave the status as `UNSET` for normal operations.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* `OK` represents a span which has been explicitly marked as being free of errors, and should not be counted against an error budget. Note that only end users should set this status. Instead, instrumentation should leave the status as `UNSET` for normal operations.
* `OK` represents a span which has been explicitly marked as being free of errors, and should not be counted against an error budget. Note that only end users should set this status. Instead, instrumentations should leave the status as `UNSET` for normal operations.

tedsuo marked this conversation as resolved.
Show resolved Hide resolved

### `Status Source`
tedsuo marked this conversation as resolved.
Show resolved Hide resolved

A new Status Source field identifies the origin of the status code on the span. This is important, as statuses set by application developers and operators have been confirmed by the end user to be correct to the particular situation. Statuses set by instrumentation, on the other hand, are only following a generic schema.
tedsuo marked this conversation as resolved.
Show resolved Hide resolved

* `INSTRUMENTATION` is the default source. This is used for instrumentation contained within shared code, such as OSS libraries and frameworks. All instrumentation plugins shipped with OpenTelemetry use this status code.
* `USER` identifies statuses set by application developers or operators, either in application code or the collector.

Analysis tools MAY disregard status codes, in favor of their own approach to error analysis. However, it is strongly suggested that analysis tools SHOULD pay attention to the status codes when set by `USER`, as it is a communication from the application developer or operator and contains valuable information.

### Status Mapping Schema

As part of the specification, OpenTelemetry provides a canonical mapping of semantic conventions to status codes. This removes any ambiguity as to what OpenTelemetry ships with out of the box.
arminru marked this conversation as resolved.
Show resolved Hide resolved

Please note that semantic conventions, and thus status mapping from conventions, are still a work in progress and will continue to change after GA.

### Status Processor

The collector will provide a processor and a configuration language to make adjustments to this status mapping schema. This provides the flexibility and customization needed for real world scenarios.

### Convenience methods

As a convenience, OpenTelemetry provides helper functions for adding semantic conventions and exceptions to a span. These helper functions will also set the correct status code. This simplifies the life of the instrumentation author, and helps ensure compliance and data quality.

Note that these convenience methods simply wire together multiple API calls. They should live in a helper package, and should not be directly added to existing API interfaces. Given how many semantic conventions we have, there will be a pile of them.

## Internal details

This proposal is mostly backwards compatible with existing code, protocols, and the OpenTracing bridge. The only potential exception is the removal of status codes enums from the current OTLP protocol, and the rewriting of the small number of instrumentation plugins that were making use of them.

## BUT ERRORS ARE SUBJECTIVE!! HOW CAN WE KNOW WHAT IS AN ERROR? WHO ARE WE TO DEFINE THIS?

First of all, every tracing system to-date comes with a default set of errors. No system requires that end users start completely from scratch. So... be calm!! Have faith!!

While flagging errors can be a subjective decision, it is true that many semantic conventions qualify as an error. By providing a default mapping of semantic conventions to errors, we ensure compatibility with existing analysis tools (e.g. Jaeger), and provide guidance to users and future implementers.

Obviously, all systems are different, and users will want to adjust error reporting on a case by case basis. Unwanted errors may be suppressed, and additional errors may be added. The collector will provide a processor and a configuration language to make this a straightforward process. Working from a baseline of standard errors will provide a better experience than having to define a schema from scratch.

Note that analysis tools MAY disregard Span Status, and do their own error analysis. There is no requirement that the status code is respected, even when Status Source is set. However, it is strongly suggested that analysis tools SHOULD pay attention to the status code when Status Source is set, as it represents a subjective decision made by either the operator or application developer.

## Remind me why we need status codes again?

Status codes provide a low overhead mechanism for checking if a span counts against an error budget, without having to scan every attribute and event. It is an inexpensive and low cardinality approach to track multiple types of error budgets. This reduces overhead and could be a benefit for many systems.

However, adding in an existing set of error types without first clearly defining their use and how they might be set has caused confusion. If the status codes are not set consistently and correctly, then the resulting error budgeting will not be useful. So we are consolidating all error types into a single ERROR type, to avoid this situation. We may add more error types back in if we can agree on their use cases and a method for applying them consistently.
tedsuo marked this conversation as resolved.
Show resolved Hide resolved

## Open questions

If we add error processing to the Collector, it is unclear what the overhead would be.

It is also unclear what the cost is for backends to scan for errors on every span, without a hint from instrumentation that an error might be present.

## Prior art and alternatives

In OpenTracing, the lack of a Collector and status mapping schema proved to be unwieldy. It placed a burden on instrumentation plugin authors to set the error flag correctly, and led to an explosion of non-standardized configuration options in every plugin just to adjust the default error flagging. This in turn placed a configuration burden on application developers.

An alternative is the `error.hint` proposal, paired with the removal of status code. This would work, but essentially provides the same mechanism provided in this proposal, only with a large number of breaking changes. It also does not address the need for user overrides.

## Future Work

The inclusion of status codes and status mappings help the OpenTelemetry community speak the same language in terms of error reporting. It lifts the burden on future analysis tools, and (when respected) it allows users to employ multiple analysis tools without having to synchronize an important form of configuration across multiple tools.

In the future, OpenTelemetry may add a control plane which allows dynamic configuration of the status mapping schema.