
Loki sink: out-of-order logs might cause loss of other in-order logs? #20706

Open
MartinEmrich opened this issue Jun 21, 2024 · 4 comments
Labels: sink: loki (Anything `loki` sink related), type: bug (A code related bug.)

Comments

@MartinEmrich
Contributor

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

I noticed a quite sudden drop in log volume. Looking at both the Vector and Loki logs, I noticed tons of "400 Bad Request" errors in Vector's loki sink log, and tons of "too late timestamp" messages in the Loki log.

So for reasons unknown, Vector tries to deliver some logs with yesterday's timestamp (way beyond what I would expect from any reordering within Vector).

But my impression is: Vector buffers and sends logs in big chunks. I assume Loki will reject such a chunked request if a single log line violates its timestamp monotonicity requirements, and that the whole request will then be discarded:

2024-06-21T09:05:52.140971Z ERROR sink{component_kind="sink" component_id=loki component_type=loki}:request{request_id=2592}: vector::sinks::util::retries: Non-retriable error; dropping the request. error=Server responded with an error: 400 Bad Request internal_log_rate_limit=true
2024-06-21T09:05:52.141070Z ERROR sink{component_kind="sink" component_id=loki component_type=loki}:request{request_id=2592}: vector_common::internal_event::service: Internal log [Service call failed. No retries or retries exhausted.] has been suppressed 1 times.
2024-06-21T09:05:52.141073Z ERROR sink{component_kind="sink" component_id=loki component_type=loki}:request{request_id=2592}: vector_common::internal_event::service: Service call failed. No retries or retries exhausted. error=Some(ServerError { code: 400 }) request_id=2592 error_type="request_failed" stage="sending" internal_log_rate_limit=true

Would that also drop all the other log events contained in that request, not only the streams Loki won't accept? Currently it looks that way.
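For context, a single push request to Loki's /loki/api/v1/push endpoint carries multiple streams and multiple entries per stream; in its JSON form the body looks roughly like this (labels, timestamps and lines below are made up for illustration, timestamps are unix nanoseconds as strings):

{
  "streams": [
    {
      "stream": { "platform": "example", "app": "demo" },
      "values": [
        ["1718960752000000000", "an in-order log line"],
        ["1718874352000000000", "a day-old log line from the same batch"]
      ]
    }
  ]
}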

Configuration

sinks:
  loki:
    type: loki
    inputs:
      - input1
      - input2
    encoding:
      codec: json
    out_of_order_action: accept
    labels:
      "*": "{{ loki_labels }}"
      "platform": "{{ platform }}"
    remove_label_fields: true
    endpoint: ...
    batch:
      max_bytes: 1048576
      max_events: 1000
      timeout_secs: 1
    buffer:
      when_full: block
      type: disk
      max_size: 1073741824
    healthcheck:
      enabled: true
      path: ...
      port: ...

Version

timberio/vector:0.39.0-distroless-libc

Debug Output

No response

Example Data

No response

Additional Context

No response

References

Related topic: #5024

@MartinEmrich added the type: bug label on Jun 21, 2024
@daddyrex1

Would be great if someone could look into it.

@jszwedko
Member

Would that also drop all the other log events contained in that request, not only the streams Loki won't accept? Currently it looks that way.

You are correct, the entire request is dropped. Only dropping the entry with the invalid timestamp would be a good improvement. It looks like the response from Loki does include the invalid entry, so seemingly Vector could drop just that entry and retry.

I don't know when we'll get to implementing this though. PRs would be welcome.

@jszwedko added the sink: loki label on Jun 21, 2024
@MartinEmrich
Contributor Author

Having thought about it for a few days, I have formed somewhat of an opinion:

I would assume Vector re-uses the buffering/batching code across all similar sinks, and it is most probably not a data structure optimized for selective deletion. So the obvious fix (deleting exactly the unaccepted events from the batch and submitting it again, all while further events are queuing up) might be quite expensive.

In my specific case, the offending log events were a day old. I am still trying to wrap my head around how a log event could survive that long within just two Vector instances without persistence (one as a DaemonSet on the K8s cluster collecting with the kubernetes_logs source, and a second aggregator stage then writing to Loki, both without disk buffers).

For more normal cases, Loki actually has a max-age window of 2h, so it will accept events ±1h out-of-order.

IMHO Loki should drop too-old events and keep the rest; this should not be the job of the submitting component (Vector).
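For reference, the knobs on the Loki side seem to live in limits_config; a sketch from memory (option names as I recall them from the Loki docs, values are examples and not necessarily the defaults):

limits_config:
  reject_old_samples: true            # reject samples older than the max age below
  reject_old_samples_max_age: 168h    # example value; defaults differ between Loki versions
  creation_grace_period: 10m          # how far into the future timestamps may be
  unordered_writes: true              # accept out-of-order writes within the allowed window

(If I recall correctly, the ±1h window mentioned above is half of the ingester's max_chunk_age, which defaults to 2h.)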

I now wonder what the best short-term fix is:

  • What is the Loki protocol exactly; would disabling batching make it explode? If it keeps persistent connections with pipelining and a lean binary protocol, that could work. But if it's HTTP/JSON based, sending HTTP request headers (bigger than the log event itself) with each event would most likely be prohibitively expensive.
  • I am thinking about dropping too-old events (or re-timestamping and marking them) within VRL... Are there any best practices for that?

Will do some experiments next...

@jszwedko
Member

jszwedko commented Jul 2, 2024

I would assume Vector re-uses the buffering/batching code across all similar sinks, and it is most probably not a data structure optimized for selective deletion. So the obvious fix (deleting exactly the unaccepted events from the batch and submitting it again, all while further events are queuing up) might be quite expensive.

That assumption is accurate: a change like this could be quite invasive. We have a similar need for the Elasticsearch sink to handle partial failures: #140.

  • What is the Loki protocol exactly; would disabling batching make it explode? If it keeps persistent connections with pipelining and a lean binary protocol, that could work. But if it's HTTP/JSON based, sending HTTP request headers (bigger than the log event itself) with each event would most likely be prohibitively expensive.

By default it sends protobuf messages over HTTP, but it can be configured to send JSON; see https://vector.dev/docs/reference/configuration/sinks/loki/#request-encoding.
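If I remember the sink behavior correctly, the payload format follows from the compression option: snappy (the default) sends snappy-compressed protobuf push requests, while none or gzip sends JSON. Roughly:

sinks:
  loki:
    type: loki
    # ...
    compression: snappy   # default: snappy-compressed protobuf push requests
    # compression: gzip   # (or none) switches the payload to JSON, if I read the docs right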

  • I am thinking about dropping too-old events (or re-timestamping and marking them) within VRL... Are there any best practices for that?

That's a good thought; you could do something like that with a remap transform and VRL: first check whether the timestamp is too old (compare it with now()), then either reset it or drop the event.
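Untested sketch, assuming the event timestamp lives in .timestamp and using a one-hour cutoff (drop_old is just a placeholder name; you would point the loki sink's inputs at this transform instead of the original inputs):

transforms:
  drop_old:
    type: remap
    inputs:
      - input1
    drop_on_abort: true   # aborted events are dropped instead of being forwarded
    source: |
      # Coerce .timestamp to a timestamp, falling back to now() if it is missing or invalid.
      ts = to_unix_timestamp(timestamp(.timestamp) ?? now())
      if to_unix_timestamp(now()) - ts > 3600 {
        abort  # drop the too-old event
        # ...or re-timestamp and mark it instead:
        # .original_timestamp = del(.timestamp)
        # .timestamp = now()
        # .late = true
      }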
