Error handling patterns in public API interfaces
This issue follows the MADR template and is written in the style of an architectural decision record (ADR). It attempts to provide a top-level view of how we should handle errors in our public traits, something that has come up in a number of discrete issues and a PR:
- [Metrics] Revisit every public Error enums and Results #2564
- Should OTLP Exporter Error enum be public and visible to the processor? #2561
- chore: modify LogExporter and TraceExporter interfaces to support returning failure #2381
My hope is that we can collectively make a design decision and argue the details of that here, and then the actual PRs and implementations become straightforward.
Additionally, if we like this format (ADRs!), we could begin capturing these in the repository itself for future reference, along with other architectural artefacts.
Context and Problem Statement
There is uncertainty around how to model errors in the opentelemetry-rust public API interfaces - that is, APIs facing the consumers. At the time of writing this is an important issue to resolve, because moving the signals towards RC and eventually a stable release is an urgent priority.
The situation is as follows; a concrete example is given, but the issue holds across various public traits, in particular the exporters:
- A given public interface in opentelemetry-sdk, such as trait LogExporter
- ... exposes multiple discrete actions with logically disjoint error types (e.g. export and shutdown) - that is, the classes of errors returned for each of these actions are foreseeably very different, as is the caller's reaction to them
- ... is implemented by multiple concrete types such as InMemoryLogExporter, OtlpLogExporter, StdOutLogExporter that have different error requirements - for instance, an OtlpLogExporter will experience network failures, an InMemoryLogExporter will not
- Potentially has operations on the API that, either in the direct implementation, or in a derived utility that utilises the direct implementation, call multiple API actions and therefore need to return an aggregated error type
Today, we have a situation where a single error type is used per API-trait, and some methods simply swallow their errors. In the example above of LogExporter, shutdown swallows errors, and export returns the LogError type, a type that could conceptually be thought of as belonging to the entire trait, not a particular method. For the exporters, the opentelemetry-specification tells us that they need to indicate success or failure, with a distinction made between 'failed' and 'timed out'.
There are also similar examples in the builders for providers and exporters.
Considered Options
Option 1: Continue as is
Continue the status quo, returning a mix of either nothing or the trait-wide error type. This is inconsistent and limits the caller's ability to handle errors.
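As a rough illustration of the limitation, here is a minimal, dependency-free sketch of the status quo shape. All names in it (StatusQuoExporter, FlakyExporter, the simplified LogError, and the flush_ok / shutdown_failed fields) are hypothetical stand-ins chosen for the example, not the real opentelemetry-sdk definitions.

```rust
use std::fmt;

// Hypothetical, simplified stand-in for the trait-wide error type.
#[derive(Debug)]
pub enum LogError {
    ExportFailed(String),
}

impl fmt::Display for LogError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            LogError::ExportFailed(msg) => write!(f, "export failed: {msg}"),
        }
    }
}

impl std::error::Error for LogError {}

// Hypothetical trait mirroring the status quo: `export` is fallible,
// while `shutdown` has no way to report a failure.
pub trait StatusQuoExporter {
    fn export(&mut self, batch: &[String]) -> Result<(), LogError>;
    fn shutdown(&mut self);
}

// Hypothetical exporter whose shutdown can fail internally.
pub struct FlakyExporter {
    pub flush_ok: bool,
    pub shutdown_failed: bool,
}

impl StatusQuoExporter for FlakyExporter {
    fn export(&mut self, batch: &[String]) -> Result<(), LogError> {
        if batch.is_empty() {
            return Err(LogError::ExportFailed("empty batch".into()));
        }
        Ok(())
    }

    fn shutdown(&mut self) {
        // The failure can only be recorded internally (or logged);
        // the signature gives the caller nothing to inspect.
        if !self.flush_ok {
            self.shutdown_failed = true;
        }
    }
}
```

In this sketch a caller of shutdown cannot distinguish a clean shutdown from a failed one without reaching into implementation details, which is the inconsistency this option would leave in place.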
Option 2: Extend trait-wide error type to all methods on trait
In this option we keep the existing error type, add it to the remaining methods on the trait, and extend the error type to include variants covering the new error conditions. This means that callers will have to know how and when to discard errors from a particular API call, based on an understanding of which subset of errors that particular call can return.
Conversely, it will reduce the number of error types in the code base.
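The caller-side cost of this option can be sketched as follows. This is an illustration under assumptions: the variant names, the simplified method signatures, and the AlwaysFailsShutdown and describe_shutdown helpers are all hypothetical and exist only to show the shape of the problem.

```rust
// Hypothetical trait-wide error type, extended to cover every method.
#[derive(Debug, PartialEq)]
pub enum LogError {
    ExportFailed,
    ExportTimedOut,
    ShutdownFailed,
}

// Hypothetical trait where every method returns the same error type.
pub trait LogExporter {
    fn export(&mut self, batch: &[String]) -> Result<(), LogError>;
    fn shutdown(&mut self) -> Result<(), LogError>;
}

// Hypothetical exporter used only to exercise the caller below.
pub struct AlwaysFailsShutdown;

impl LogExporter for AlwaysFailsShutdown {
    fn export(&mut self, _batch: &[String]) -> Result<(), LogError> {
        Ok(())
    }

    fn shutdown(&mut self) -> Result<(), LogError> {
        Err(LogError::ShutdownFailed)
    }
}

// A caller handling `shutdown` must still decide what to do with the
// export-only variants, because the type cannot express which subset
// of errors a given method returns.
pub fn describe_shutdown(exporter: &mut dyn LogExporter) -> &'static str {
    match exporter.shutdown() {
        Ok(()) => "shutdown ok",
        Err(LogError::ShutdownFailed) => "shutdown failed",
        // Not expected from `shutdown`, but the compiler requires an arm.
        Err(LogError::ExportFailed) | Err(LogError::ExportTimedOut) => {
            "unexpected export error from shutdown"
        }
    }
}
```

The last match arm is the knowledge burden described above: it encodes an assumption about the implementation that the signature itself does not guarantee.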
Option 3: Introduce an error-type per fallible operation, aggregate these into a single trait-wide error type
For example, in the above we'd have something like:
use std::time::Duration;

pub trait LogExporter {
    fn export(...) -> Result<..., ExportError>;
    fn shutdown(...) -> Result<..., ShutdownError>;
}

// Concrete errors for an export operation
// (ShutdownError is assumed to be defined analogously)
#[derive(Debug, thiserror::Error)]
pub enum ExportError {
    // The distinction between failure and timed out is part of the OTEL spec
    // we need to meet.
    #[error("Export failed")]
    ExportFailed,
    #[error("Export timed out after {0:?}")]
    ExportTimedOut(Duration),
    // Allow impls to box up errors that can't be logically mapped
    // back to one of the APIs errors
    #[error("Unknown error (should not occur): {source:?}")]
    Unknown {
        source: Box<dyn std::error::Error + Send + Sync>,
    },
}

// Aggregate error type for convenience
// Note: This will be added in response to need, not pre-emptively
#[derive(Debug, thiserror::Error)]
pub enum LogError {
    #[error("Export error: {0}")]
    ExportError(#[from] ExportError),
    #[error("Shutdown error: {0}")]
    ShutdownError(#[from] ShutdownError),
}

// A downcast helper for callers that need to work with impl-specific
// unknown errors concretely
impl ExportError {
    /// Attempt to downcast the inner `source` error to a specific type `T`
    pub fn downcast_ref<T: std::error::Error + 'static>(&self) -> Option<&T> {
        if let ExportError::Unknown { source } = self {
            source.downcast_ref::<T>()
        } else {
            None
        }
    }
}

Decision Outcome
Chosen option: "Option 3: Introduce an error-type per fallible operation, aggregate these into a single trait-wide error type"
Consequences
- Good, because callers can handle focussed errors with focussed remediation
- Good, because implementors of the pub traits can box up custom errors in a fashion that follows Canonical's error and panic discipline guide, by avoiding type erasure of impl-specific errors
- Good, because the per-trait error type (LogError for LogExporter above) provides consumers of the trait that hit multiple methods in a single method an error type they can use
- Bad, because there's more code than a single error type
- Bad, because a caller may need to use downcast_ref if they have a known trait impl and want to handle an Unknown error
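To show how these consequences might look from the caller's side, here is a dependency-free sketch of the chosen option. It is an illustration under assumptions, not the SDK's implementation: From is written by hand instead of being derived with thiserror, Display and Error impls for the API error types are omitted for brevity, the method signatures are simplified, and NetworkError, FailingOtlpLike, export_then_shutdown, and the unit-struct shape of ShutdownError are hypothetical names introduced for the example.

```rust
use std::error::Error;
use std::fmt;
use std::time::Duration;

// Per-operation errors, mirroring the Option 3 sketch.
#[derive(Debug)]
pub enum ExportError {
    ExportFailed,
    ExportTimedOut(Duration),
    Unknown { source: Box<dyn Error + Send + Sync> },
}

// Hypothetical shape: the issue does not define ShutdownError.
#[derive(Debug)]
pub struct ShutdownError;

// Aggregate error for code that calls several trait methods.
#[derive(Debug)]
pub enum LogError {
    Export(ExportError),
    Shutdown(ShutdownError),
}

impl From<ExportError> for LogError {
    fn from(e: ExportError) -> Self {
        LogError::Export(e)
    }
}

impl From<ShutdownError> for LogError {
    fn from(e: ShutdownError) -> Self {
        LogError::Shutdown(e)
    }
}

impl ExportError {
    /// Attempt to downcast the boxed `source` of an `Unknown` error to `T`.
    pub fn downcast_ref<T: Error + 'static>(&self) -> Option<&T> {
        match self {
            ExportError::Unknown { source } => source.downcast_ref::<T>(),
            _ => None,
        }
    }
}

// Hypothetical impl-specific error that has no matching API variant.
#[derive(Debug, PartialEq)]
pub struct NetworkError {
    pub code: u16,
}

impl fmt::Display for NetworkError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "network error, status {}", self.code)
    }
}

impl Error for NetworkError {}

pub trait LogExporter {
    fn export(&mut self, batch: &[String]) -> Result<(), ExportError>;
    fn shutdown(&mut self) -> Result<(), ShutdownError>;
}

// Hypothetical exporter that boxes its own error into `Unknown`.
pub struct FailingOtlpLike;

impl LogExporter for FailingOtlpLike {
    fn export(&mut self, _batch: &[String]) -> Result<(), ExportError> {
        Err(ExportError::Unknown {
            source: Box::new(NetworkError { code: 503 }),
        })
    }

    fn shutdown(&mut self) -> Result<(), ShutdownError> {
        Ok(())
    }
}

// A derived utility that hits two trait methods; `?` converts each
// per-operation error into the aggregate `LogError` via `From`.
pub fn export_then_shutdown(
    exporter: &mut dyn LogExporter,
    batch: &[String],
) -> Result<(), LogError> {
    exporter.export(batch)?;
    exporter.shutdown()?;
    Ok(())
}
```

A caller that knows it is talking to a particular implementation could match on LogError::Export(e) and then call e.downcast_ref::<NetworkError>() to recover the concrete error, while callers that do not care about impl-specific detail can treat Unknown as an opaque failure.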