Loki sink: out-of-order logs might cause loss of other in-order logs? #20706
Would be great if someone could look into it.
You are correct, the entire request is dropped. Only dropping the entry with the invalid timestamp would be a good improvement. It looks like the response from Loki does include the invalid entry, so seemingly Vector could drop just that entry and retry. I don't know when we'll get to implementing this though. PRs would be welcome.
Thinking about it for a few days, I've formed somewhat of an opinion: I would assume Vector re-uses the buffering/batching code across all similar sinks, and it is most probably not a data structure optimized for selective deletion. So the obvious fix (deleting exactly the unaccepted events from the batch and submitting it again, all while further events are queuing up) might be quite expensive.

In my specific case, the offending log events were a day old. I'm still trying to wrap my head around how a log event could survive that long within just two Vectors without persistence (one as a DaemonSet on the K8s cluster collecting with the kubernetes_logs source, and a second aggregator stage writing to Loki, both without disk buffers). For more normal cases, Loki actually has a max-age window of 2h, so it will accept events up to ±1h out of order.

IMHO Loki should drop too-old events and keep the rest; this should not be the job of the submitting component (Vector). I now wonder what the best short-term fix would be.
Will do some experiments next.
That assumption is accurate: a change like this could be quite invasive. We have a similar need for the Elasticsearch sink to handle partial failures: #140.
By default it sends protobuf messages over HTTP, but it can be configured to send JSON; see https://vector.dev/docs/reference/configuration/sinks/loki/#request-encoding
That's a good thought; you could do something like that with a transform.
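For illustration, here is a minimal sketch of what such a pre-filter could look like, assuming a `filter` transform with a VRL condition and a one-hour cutoff; the component ids, the upstream input, and the threshold are hypothetical and would need to match the real pipeline:

```yaml
transforms:
  drop_stale_events:            # hypothetical component id
    type: filter
    inputs:
      - kubernetes_logs         # hypothetical upstream source/transform id
    condition:
      type: vrl
      source: |
        # Keep only events whose timestamp is less than one hour old.
        # Events whose timestamp cannot be read are treated as "now" and kept.
        event_ts = to_unix_timestamp(.timestamp) ?? to_unix_timestamp(now())
        to_unix_timestamp(now()) - event_ts < 3600
```

The Loki sink's `inputs` would then point at `drop_stale_events` instead of the source directly. The trade-off is that stale events are discarded in Vector rather than rejected by Loki, which avoids the whole-batch 400 at the cost of silently dropping those events; the transform's discarded-events internal metric should make the drop rate visible.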
Problem
I noticed a quite sudden drop in log volume. Looking at both the Vector and Loki logs, I noticed tons of "400 Bad Request" entries in the Loki sink log of Vector, and tons of "too late timestamp" messages in the Loki log.
So, for reasons unknown, Vector tries to deliver some logs with yesterday's timestamp (way beyond what any reordering within Vector could explain).
But my impression is: Vector buffers and sends logs in big chunks, and I assume Loki will reject such a chunked request if a single log line violates its timestamp ordering requirements. That whole request will then be discarded.
Would that also drop all other log events contained in that request, not only the streams Loki won't accept? Currently it looks that way.
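For context, a single push request in the JSON flavour of Loki's push API (`/loki/api/v1/push`) carries many streams and entries at once. The payload below is entirely made up (labels, timestamps, and log lines are illustrative), but it shows how one stale entry can, per the behaviour described above, cause the whole HTTP request, and every entry in it, to be answered with a 400:

```json
{
  "streams": [
    {
      "stream": { "app": "frontend" },
      "values": [
        ["1718000000000000000", "recent, in-order line"]
      ]
    },
    {
      "stream": { "app": "batch-job" },
      "values": [
        ["1717900000000000000", "roughly day-old line that triggers the 'too late timestamp' rejection"]
      ]
    }
  ]
}
```

Timestamps in this API are nanosecond-precision epoch strings, so a batch that mixes fresh and day-old entries is easy to produce once anything upstream delays or replays events.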
Configuration
Version
timberio/vector:0.39.0-distroless-libc
Debug Output
No response
Example Data
No response
Additional Context
No response
References
Related topic: #5024