[receiverhelper] Add metric for requests that failed to be received #12207

jade-guiton-dd · 2025-01-29T11:06:40Z

Is your feature request related to a problem? Please describe.

The observability requirements for stable components recommend emitting telemetry in a way that allows users to differentiate between errors originating from a component and errors propagated from downstream components. This is currently somewhat complicated to do in receivers that use receiverhelper, notably the OTLP receiver (see OTLP receiver telemetry review), for two reasons:

All errors are surfaced as the same otelcol_receiver_refused_x metric;
If an internal error happens before the telemetry payload has been fully received and parsed, we cannot determine the number of telemetry items involved, and thus cannot properly surface the error with ObsReport.EndXOp. As a result, StartXOp may be delayed until everything is parsed (as in the OTLP receiver), which means internal failures are never surfaced through metrics.

Describe the solution you'd like

Following the precedent of the pipeline auto-instrumentation RFC, I believe we should differentiate between payloads that were "refused" by downstream components and requests that "failed".

Telemetry-wise, this would mean restricting the otelcol_receiver_refused_x metric to downstream errors (those returned from nextConsumer.ConsumeX; this is already de facto the case in the OTLP receiver), and adding a new metric to account for internal errors:

Either a simple otelcol_receiver_failed_requests metric (maybe _operations if we want to account for scrapers?);
Or a generic otelcol_receiver_requests metric which counts all receiver operations, with an outcome attribute (success / failure / refused), following the convention in the above RFC.

API-wise, with the goal of avoiding breakage, I think the simplest way to implement this would be to add a new method to ObsReport that could be called in place of EndXOp and would emit a "failure" metric instead of a "refused" metric, and to encourage component authors to call StartXOp as early in processing as possible. (Note: this could also improve the timing information provided by tracing, by adding a span event that marks the end of internal processing.) Under the assumption that most receivers behave like the OTLP receiver and mostly wrap only downstream processing in Start/EndXOp, components that haven't been updated would continue to behave as before.

Describe alternatives you've considered
We could also leave things as-is, and let receiver component authors add their own internal failure metrics.


jade-guiton-dd added the area:receiver and enhancement labels Jan 29, 2025

jade-guiton-dd mentioned this issue Jan 29, 2025

[receiver/otlp] Review telemetry #11139

Open

github-project-automation bot moved this to Todo in Collector: v1 Jan 29, 2025

jade-guiton-dd added this to Collector: v1 Jan 29, 2025

jade-guiton-dd removed this from Collector: v1 Jan 29, 2025

mx-psi added this to the Self observability milestone Feb 5, 2025

