Vector source/sink not propagating non-retryable failures #17873

sbalmos · 2023-07-05T20:25:00Z

I'm still trying to pin down exactly what the bug is, but it seems that the vector source/sink does not propagate non-retryable delivery failures from the end sink back upstream. In my setup, one Vector instance acts as a sort of post-office delivery multiplexer: it reads from Kafka and distributes to one or more exporter Vector instances, connected via the vector sink.

                            / -> (vector sink) -> (vector source) Exporter 1 -> http sink
Kafka -> (kafka source) Mux | -> (vector sink) -> (vector source) Exporter 2 -> loki sink
                            \ -> (vector sink) -> (vector source) Exporter 3 -> splunk_hec_logs sink

In exporters where some failures are non-retryable (e.g. the HTTP sink receiving a 400), the event appears to be dropped appropriately at the exporter level. However, the drop apparently is not communicated back to the Mux instance through the vector protocol as a hard-failure acknowledgement or similar signal. The vector sink on the Mux apparently sees a delivery failure (or rather a lack of acknowledgement) and retries delivery to the exporter over and over. This continues until the retry count is exceeded or, more likely, the Mux's buffer for that exporter's vector sink fills up. That in turn stalls the whole mux, stopping delivery of all messages to all destination exporters.
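To make the stall concrete, here is a toy model of the behavior described above. It is only a sketch and not Vector's actual implementation: the `simulate` function, its parameters, and the one-arrival-and-one-attempt-per-tick timing are all assumptions made for illustration. It compares a sink that retries a rejected event against one that drops it immediately.

```python
from collections import deque


def simulate(events, buffer_size, max_attempts, retry_rejections):
    """Toy model: one arrival and one delivery attempt per tick.

    events: list of "ok" / "rejected" outcomes the downstream gives each event.
    retry_rejections: True models a sink that retries rejected events,
    False models a sink that drops them immediately.
    Returns (delivered, dropped, stalled_ticks).
    """
    pending = deque(events)  # events still upstream (e.g. in Kafka)
    buffer = deque()         # the sink's bounded buffer of [outcome, attempts]
    delivered = dropped = stalled_ticks = 0

    while pending or buffer:
        # Upstream pushes one event per tick unless the buffer is full.
        if pending:
            if len(buffer) < buffer_size:
                buffer.append([pending.popleft(), 0])
            else:
                stalled_ticks += 1  # backpressure: upstream has to wait

        # The sink attempts to deliver the event at the head of the buffer.
        if buffer:
            head = buffer[0]
            head[1] += 1
            if head[0] == "ok":
                buffer.popleft()
                delivered += 1
            elif not retry_rejections or head[1] >= max_attempts:
                buffer.popleft()
                dropped += 1
            # Otherwise the rejected event stays at the head and is retried.

    return delivered, dropped, stalled_ticks


events = ["rejected"] + ["ok"] * 5
print(simulate(events, buffer_size=2, max_attempts=10, retry_rejections=True))
print(simulate(events, buffer_size=2, max_attempts=10, retry_rejections=False))
```

In both runs the rejected event is eventually dropped, but only the retrying sink blocks the events queued behind it while it burns through its attempts. In this model that shows up as a non-zero `stalled_ticks` count.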

dsmith3197 · 2023-07-06T19:28:26Z

Hi @sbalmos,

The issue here is that the Vector source will respond with internal or data_loss error codes (code ref) on failed delivery, but the Vector sink treats those as retryable errors (code ref).

To resolve this, we need to update either the Vector source or sink to make them consistent in how they handle rejected data.
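As a rough illustration of what making the two sides consistent could look like on the sink side, the sketch below classifies gRPC status codes as retryable or not. The numeric values come from the gRPC status code list; the `NON_RETRYABLE` set and the `is_retryable` helper are hypothetical names. The choice to treat `DATA_LOSS` as the permanent-rejection signal is an assumption, since this thread does not say which of the two codes the source uses for a rejected batch.

```python
from enum import IntEnum


class GrpcCode(IntEnum):
    # Subset of the standard gRPC status codes, with their numeric values.
    INVALID_ARGUMENT = 3
    DEADLINE_EXCEEDED = 4
    INTERNAL = 13
    UNAVAILABLE = 14
    DATA_LOSS = 15


# Assumption for illustration: DATA_LOSS means the downstream permanently
# rejected the batch, so sending the same batch again cannot succeed.
NON_RETRYABLE = frozenset({GrpcCode.INVALID_ARGUMENT, GrpcCode.DATA_LOSS})


def is_retryable(code: GrpcCode) -> bool:
    """Return True if a request that failed with this code should be retried."""
    return code not in NON_RETRYABLE


for code in GrpcCode:
    print(f"{code.name}: {'retry' if is_retryable(code) else 'drop'}")
```

Whichever side ends up changing, the point is that the source's error code and the sink's retry policy have to agree on which codes mean "do not send this again".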

2 replies

dsmith3197 Jul 6, 2023

I created #17895 to track this.

sbalmos Jul 7, 2023
Author

Just PR'd it as #17904. From my perspective, the sink was at fault for not treating the properly reported gRPC error as non-retryable.

dsmith3197 · 2023-07-06T19:35:05Z

In the meantime, is there a need to have multiple Vector instances? From my understanding of your setup so far, you could consolidate everything into a single Vector instance and scale horizontally if needed with a Kafka consumer group.

2 replies

sbalmos Jul 6, 2023
Author

There's not really a need to have everything split out. I'm already working on collapsing the configs down into the mux's config. It just helped to keep things cleaner split out - destination-specific transforms, filters, and rate-limits were in an exporter's config, an exporter could be independently scaled or restarted separate from the mux, the mux only had to deal with event routing, etc. But I agree, collapsing everything back down and scaling at the mux level is certainly doable.

sbalmos Jul 10, 2023
Author

Spoke slightly too soon. ;) One exporter has to remain separate - egress IP whitelisting at the target system has me running that exporter on a tainted worker node. I'm looking into probably just using the HTTP source/sink. It looks like it would properly propagate the non-retryable failure back to the mux if I'm tracing the code correctly?

jszwedko · 2023-07-18T19:09:52Z

Maintainer

@sbalmos could you share the config and version you are using?

1 reply

sbalmos Jul 18, 2023
Author

Gah, I was writing a reply here thinking it was to my other thread from this morning. Big ctrl-a delete moment. ;) So as it stands here, I've successfully worked around this for the time being using the HTTP source/sink with the native-mode protobuf codec. I'll switch back to the vector source/sink once the merged #17904 ships in the next release.
