Propagate errors from exporters to receivers #7486
Conversation
Codecov Report
Patch coverage:
Additional details and impacted files
@@ Coverage Diff @@
## main #7486 +/- ##
==========================================
+ Coverage 90.79% 91.00% +0.21%
==========================================
Files 296 300 +4
Lines 14790 14972 +182
==========================================
+ Hits 13428 13625 +197
+ Misses 1087 1071 -16
- Partials 275 276 +1
☔ View full report in Codecov by Sentry.
This is great, thanks a lot @jpkrohling. This will help with open-telemetry/opentelemetry-collector-contrib#20511 too.
This PR was marked stale due to lack of activity. It will be closed in 14 days.
Closed as inactive. Feel free to reopen if this PR is still being worked on.
Reopening, this is not stale; it's waiting on the dependent PR.
Force-pushed from 647e196 to 0e2400a
PR updated. I think we are ready to move forward with this one, given we don't have a final proposal for #7439.
This functionality may not work if we have any of the following:
Do we need to document or codify the requirements for this functionality to be used reliably?
I have a couple of questions about this:
A few comments from me.
Force-pushed from e1dc75d to a0e7a4a
I just updated this PR to address the latest review comments.
For the context propagation to work, the pipeline must be synchronous, which is the case when there are no batching processors or sending queues in it. When that happens, (4) may reject data, which will then be reflected in (2)'s response to (1). To be clear: (2) will not send a response to (1) without first hearing back from (4).
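For illustration, here is a minimal sketch of what that synchronous flow looks like from the receiver's side: the HTTP handler blocks on the next consumer and only answers the client afterwards. The names and the error-to-status mapping below are assumptions for illustration, not the PR's actual receiver code.

```go
package receiverexample

import (
	"net/http"

	"go.opentelemetry.io/collector/consumer"
	"go.opentelemetry.io/collector/pdata/ptrace"
)

// handler waits for the downstream consumer before responding. With a batch
// processor or sending queue in the pipeline, ConsumeTraces would return
// before the exporter runs, and the exporter's error could never reach (1).
func handler(next consumer.Traces) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		td := ptrace.NewTraces() // decoding of the request body omitted

		if err := next.ConsumeTraces(r.Context(), td); err != nil {
			// The status code here is a placeholder; the point is that the
			// response is only written after (4) has answered.
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}
```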
Given the constraints I mentioned earlier, the error from (2) is the same as the one from (4), but it might happen that (2) is the one returning the error without even touching (4). In that case, the source of the error is not clear to (1). I don't think it should matter to the client where the error is coming from. Operators of (2) and (4) should be able to determine where the error happened based on (2)'s and (4)'s metrics.
My question was why (2) doesn't reject data with the same 400 error while (4) does reject it. Do they have different OTLP validation logic?
I disagree here. As a developer of (1), if I see an error, I need to know whether it's caused by a problem in the collector or in the backend, to be able to report it to the responsible team or investigate it myself. I think the collector can wrap the error message while keeping the same error code, to make it clear where the error is coming from.
I think I'm not seeing what you are seeing and would appreciate more details. What I have in mind is that perhaps there's an extra processor (or, in the future, an interceptor) that will return a 400 on (4) while (2) is just acting as a proxy.
That's doable. We have the RequestError type with a constructor already; we can add extra context to the message.
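Plain Go error wrapping is enough for that idea: the sketch below uses only the standard library (not the PR's actual RequestError type) to add origin context to the message while errors.Is/errors.As still see the original error, so any status code derived from it is preserved.

```go
package errwrapexample

import (
	"errors"
	"fmt"
)

// wrapDownstream marks an error as coming from the backend (4), so a client
// of (2) can tell the collector was only relaying it.
func wrapDownstream(err error) error {
	return fmt.Errorf("error returned by downstream server: %w", err)
}

// originalPreserved shows that the wrapped error still matches the original,
// so status-code logic based on errors.Is/errors.As keeps working.
func originalPreserved(base error) bool {
	return errors.Is(wrapDownstream(base), base) // always true
}
```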
Force-pushed from 6934602 to 4024e2f
Force-pushed from 4024e2f to 6c3fd19
Force-pushed from 975a234 to e45443d
Force-pushed from bae14d0 to ba9fee0
@evan-bradley, @codeboten, could you please review this PR again? I believe all points have been addressed, apart from the documentation one brought up by Alex. If we agree that docs/design.md is the right place, I'll document it as part of a follow-up PR.
I updated this PR to return 500s if the receiver got a permanent error from the next consumer. I believe this, along with #7868, brings the correct behavior with minimal disruption to existing pipelines.
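As a sketch of that decision, using the consumererror package from this repository (the exact wiring in the PR may differ, and the retryable-to-503 mapping below is an assumption):

```go
package statusexample

import (
	"net/http"

	"go.opentelemetry.io/collector/consumer/consumererror"
)

// writeError maps a consumer error to an HTTP status: permanent errors
// become a 500 as described above; anything else is assumed retryable and
// reported as 503 so that a well-behaved client retries.
func writeError(w http.ResponseWriter, err error) {
	if consumererror.IsPermanent(err) {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	http.Error(w, err.Error(), http.StatusServiceUnavailable)
}
```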
Regarding this comment from @bogdandrutu on another PR:
It's worth noting (is it obvious? or is it not correct?) that the component that is sending the data (e.g. the collector's receiver) is not able to retry/re-send the data only to a selection of consumers, e.g. those that returned a non-permanent error, and not to those that returned success (or a permanent error).

In light of the above, here are the scenarios that I can identify and the options for each:

Scenario A: Multiple consumers, some succeeding, some failing with permanent errors, and some failing with retryable errors. For Scenario A, we seem to only have the following two options for the sender of the data:

In the second case, the sending component also shouldn't be passing the retryable error back to the caller. I lean towards the second option. We'd need to provide documentation educating users about the consequences of creating pipelines with a fan-out, like using the same receiver in different pipelines of the same signal type, or using more than one exporter in a pipeline.

Scenario B: When all consumers failed with retryable errors, I believe it's important for the sender to be able to pass the retryable error back to the caller, or retry itself, depending on its configuration.

Scenario C: The sending component should be able to act according to the error: i.e., in case of a retryable error, retry (if configured to do so) or return the retryable error back to the caller. If I'm not mistaken, this is what this PR was about from the start. A sketch of this fan-out decision logic follows below.
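To make the Scenario A trade-off concrete, here is a simplified sketch of a fan-out consumer that never passes a retryable error back when the results are mixed (so the caller can't re-send and duplicate data), and only returns a retryable error when every consumer failed retryably, as in Scenario B. This is an illustration, not the collector's actual fanoutconsumer.

```go
package fanoutexample

import (
	"context"
	"errors"

	"go.opentelemetry.io/collector/consumer"
	"go.opentelemetry.io/collector/consumer/consumererror"
	"go.opentelemetry.io/collector/pdata/ptrace"
)

// fanOut delivers the same batch to every consumer. Because the sender
// cannot later retry against only the consumers that failed, a mixed
// outcome is collapsed into a permanent error.
func fanOut(ctx context.Context, td ptrace.Traces, consumers []consumer.Traces) error {
	var errs []error
	allRetryable := true
	for _, c := range consumers {
		err := c.ConsumeTraces(ctx, td)
		if err == nil {
			allRetryable = false // a success means a retry would duplicate data
			continue
		}
		if consumererror.IsPermanent(err) {
			allRetryable = false // retrying would not help this consumer
		}
		errs = append(errs, err)
	}
	if len(errs) == 0 {
		return nil
	}
	combined := errors.Join(errs...)
	if allRetryable {
		return combined // Scenario B: safe to let the caller retry
	}
	return consumererror.NewPermanent(combined)
}
```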
For exporters, that is not a real problem, since with the retry and queuing mechanisms that should not happen (or only very rarely, when one exporter is slower and has its queue full).
@astencel-sumo @jpkrohling, before trying to solve these scenarios, I would like to understand when a retryable error can happen. I am also thinking that we may need a way to mark a "health/status" at the pipeline level (based on pipeline ID) that the fan-out/router consumer can use. Think about something like this: when the queue is full for an exporter, we mark the entire pipeline that exports to that exporter (or multiple pipelines) as "busy/resource exhausted"; then, when a receiver tries to put data into a fan-out consumer that includes that pipeline, we reject the data immediately, so it can be retried against a different collector instance without duplication. I have not thought through all the scenarios, but I would like one of you to think them through and document:
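A rough sketch of what such a pipeline-level health registry could look like; everything below is hypothetical, illustrating the suggestion rather than any existing collector API.

```go
package healthexample

import "sync"

// pipelineHealth is a hypothetical registry of per-pipeline status. An
// exporter whose queue is full would mark its pipeline(s) busy, and the
// fan-out consumer would consult it to reject data up front, letting the
// caller retry against another collector instance without duplication.
type pipelineHealth struct {
	mu   sync.RWMutex
	busy map[string]bool // keyed by pipeline ID
}

func newPipelineHealth() *pipelineHealth {
	return &pipelineHealth{busy: make(map[string]bool)}
}

func (h *pipelineHealth) SetBusy(pipelineID string, busy bool) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.busy[pipelineID] = busy
}

// AnyBusy reports whether any of the pipelines the fan-out consumer would
// write to is currently marked busy/resource-exhausted.
func (h *pipelineHealth) AnyBusy(pipelineIDs ...string) bool {
	h.mu.RLock()
	defer h.mu.RUnlock()
	for _, id := range pipelineIDs {
		if h.busy[id] {
			return true
		}
	}
	return false
}
```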
This PR was marked stale due to lack of activity. It will be closed in 14 days.
Closed as inactive. Feel free to reopen if this PR is still being worked on.
When using an OTel Collector as a gateway between OTLP clients and servers, errors from the server are not always propagated correctly from the exporter back to the client. When this happens, the client gets misleading error information, making debugging harder.
For example, given the following scenario:
External OTLP HTTP client (1) -> Collector's OTLP receiver (2) -> Collector's OTLP exporter (3) -> External OTLP gRPC server (4)
When (4) returns an "InvalidArgument" error, indicating that the client has sent bad input data, (1) should be informed about that by receiving a "400 Bad Request" from (2).
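For illustration, the gRPC-to-HTTP translation at (2) could look roughly like the sketch below; the set of codes mapped here is a small assumed subset, not the PR's exact table.

```go
package mappingexample

import (
	"net/http"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// httpStatusFromGRPC derives the HTTP status that (2) should return to (1)
// from the gRPC error received from (4).
func httpStatusFromGRPC(err error) int {
	s, ok := status.FromError(err)
	if !ok {
		return http.StatusInternalServerError // not a gRPC status
	}
	switch s.Code() {
	case codes.OK:
		return http.StatusOK
	case codes.InvalidArgument:
		return http.StatusBadRequest // the "400 Bad Request" case above
	case codes.ResourceExhausted:
		return http.StatusTooManyRequests
	case codes.Unavailable:
		return http.StatusServiceUnavailable
	default:
		return http.StatusInternalServerError
	}
}
```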
This PR changes both the OTLP exporter and receivers so that the error flow works in that combination. Other exporters should be advised to:
- use the internal/errs package to wrap client HTTP errors, so that receivers can reuse the original status code

This PR is in draft mode, currently lacking:
- a decision on whether internal/errs should be a new module

Signed-off-by: Juraci Paixão Kröhling juraci@kroehling.de