Is your feature request related to a problem? Please describe.
The observability requirements for stable components recommend emitting telemetry in a way that allows users to differentiate between errors originating from a component and errors propagated from downstream components. This is currently somewhat complicated to do in receivers that use receiverhelper, notably the OTLP receiver (see OTLP receiver telemetry review), for two reasons:
All errors are surfaced as the same otelcol_receiver_refused_x metric;
If an internal error happens before the telemetry payload has been fully received and parsed, we cannot determine the number of telemetry items involved, and thus cannot properly surface the error with ObsReport.EndXOp. As a result, StartXOp may be delayed until everything is parsed (as in the OTLP receiver), which means internal failures are never surfaced through metrics.
Describe the solution you'd like
Following the precedent of the pipeline auto-instrumentation RFC, I believe we should differentiate between payloads that were "refused" by downstream components and requests that "failed".
Telemetry-wise, this would mean restricting the otelcol_receiver_refused_x metric to downstream errors (ones returned from nextConsumer.ConsumeX; this is already the de facto behavior in the OTLP receiver), and adding a new metric to account for internal errors:
Either a simple otelcol_receiver_failed_requests metric (maybe _operations if we want to account for scrapers?);
Or a generic otelcol_receiver_requests metric which counts all receiver operations, with an outcome: success / failure / refused attribute, following the convention in the above RFC.
API-wise, with the goal of avoiding breakage, I think the simplest way to implement this would be to add a new method to ObsReport which could be called in place of EndXOp, which would emit a "failure" metric instead of a "refused" metric, and encourage component authors to call StartXOp as early in processing as possible. (Note: This could also be used to improve the timing information provided by tracing by adding a span event signifying the end of internal processing). Under the assumption that most receivers behave like the OTLP receiver and mostly only wrap downstream processing in Start/EndXOp, components that haven't updated would continue to behave as before.
Describe alternatives you've considered
We could also leave things as-is, and let receiver component authors add their own internal failure metrics.