Skip to content

Commit

Permalink
error handling proposal (open-telemetry#153)
Browse files Browse the repository at this point in the history
* error handling proposal

* added self-monitoring

* Reword principles

* Reword guidance

* Reword diagnostics

* Reword exceptions

* Add note on logs, callbacks

* Formatting

* formatting and a mention of ToString

* dynamic languages

* returning noops, not nulls

* Reformat for open-telemetry#192

* Update specification/error-handling.md

Co-Authored-By: Armin Ruech <armin.ruech@gmail.com>
  • Loading branch information
2 people authored and Sergey Kanzhelev committed Feb 18, 2020
1 parent 2cbada1 commit bbb2acf
Showing 1 changed file with 58 additions and 0 deletions.
58 changes: 58 additions & 0 deletions specification/error-handling.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,58 @@
# Error handling in OpenTelemetry

OpenTelemetry generates telemetry data to help users monitor application code.
In most cases, the work that the library performs is not essential from the perspective of application business logic.
We assume that users would prefer to lose telemetry data rather than have the library significantly change the behavior of the instrumented application.

OpenTelemetry may be enabled via platform extensibility mechanisms, or dynamically loaded at runtime.
This makes the use of the library non-obvious for end users, and may even be outside of the application developer's control.
This makes for some unique requirements with respect to error handling.

## Basic error handling principles

OpenTelemetry implementations MUST NOT throw unhandled exceptions at run time.

1. API methods MUST NOT throw unhandled exceptions when used incorrectly by end users.
The API and SDK SHOULD provide safe defaults for missing or invalid arguments.
For instance, a name like `empty` may be used if the user passes in `null` as the span name argument during `Span` construction.
2. The API or SDK may _fail fast_ and cause the application to fail on initialization, e.g. because of a bad user config or environment, but MUST NOT cause the application to fail later at run time, e.g. due to dynamic config settings received from the Collector.
3. The SDK MUST NOT throw unhandled exceptions for errors in their own operations.
For example, an exporter should not throw an exception when it cannot reach the endpoint to which it sends telemetry data.

## Guidance

1. API methods that accept external callbacks MUST handle all errors.
2. Background tasks (e.g. threads, asynchronous tasks, and spawned processes) should run in the context of a global error handler to ensure that exceptions do not affect the end user application.
3. Long-running background tasks should not fail permanently in response to internal errors.
In general, internal exceptions should only affect the execution context of the request that caused the exception.
4. Internal error handling should follow language-specific conventions.
In general, developers should minimize the scope of error handlers and add special processing for expected exceptions.
5. Beware external callbacks and overrideable interfaces: Expect them to throw.
6. Beware to call any methods that wasn't explicitly provided by API and SDK users as a callbacks.
Method `ToString` that SDK may decide to call on user object may be badly implemented and lead to stack overflow.
It is common that the application never calls this method and this bad implementation would never be caught by an application owner.
7. Whenever API call returns values that is expected to be non-`null` value - in case of error in processing logic - SDK MUST return a "no-op" or any other "default" object that was (_ideally_) pre-allocated and readily available.
This way API call sites will not crash on attempts to access methods and properties of a `null` objects.

## Error handling and performance

Error handling and extensive input validation may cause performance degradation, especially on dynamic languages where the input object types are not guaranteed in compile time.
Runtime type checks will impact performance and are error prone, exceptions may occur despite the best effort.

It is recommended to have a global exception handling logic that will guarantee that exceptions are not leaking to the user code.
And make a reasonable trade off of the SDK performance and fullness of type checks that will provide a better on-error behavior and SDK errors troubleshooting.

## Self-diagnostics

All OpenTelemetry libraries -- the API, SDK, exporters, instrumentation adapters, etc. -- are encouraged to expose self-troubleshooting metrics, spans, and other telemetry that can be easily enabled and filtered out by default.

One good example of such telemetry is a `Span` exporter that indicates how much time exporters spend uploading telemetry.
Another example may be a metric exposed by a `SpanProcessor` that describes the current queue size of telemetry data to be uploaded.

Whenever the library suppresses an error that would otherwise have been exposed to the user, the library SHOULD log the error using language-specific conventions.
SDKs MAY expose callbacks to allow end users to handle self-diagnostics separately from application code.

## Exceptions to the rule

SDK authors MAY supply settings that allow end users to change the library's default error handling behavior.
Application developers may want to run with strict error handling in a staging environment to catch invalid uses of the API, or malformed config.

0 comments on commit bbb2acf

Please sign in to comment.