Allow out of order log submission #1544
Hello, to be totally honest, this is how Cortex storage works, and Loki is based on it. I'm not saying we are unwilling to work on it, but the amount of work required is significant. So far we're trying to find alternatives instead of reworking the whole foundation of Loki. I'm curious to know how you are sending logs to Loki. Are you using Promtail or in-app API calls? Do you care about ordering? Something that could be done is using a server-side timestamp in Loki.
In-app API calls, this library to be exact: https://github.com/GreyZmeem/python-logging-loki. I would say the ordering is not ultra critical, as my data contains its own timestamps. Our app essentially proxies the Loki websocket to a front-end UI, and having the UI sort new log lines as they arrive is trivial. For querying exact time ranges we would just expand the range by a value (the largest amount of time we would expect a log to be 'late' by) and filter it down to the desired time range. So yes, a flag for Loki to just use its own timestamps would be perfect for us.
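For reference, this is roughly what such an in-app push with client-side sorting looks like against the JSON push endpoint (/loki/api/v1/push); a minimal sketch assuming the requests library, with a placeholder URL, labels, and helper name:

```python
import time

import requests

# Illustrative endpoint; adjust host/port for your deployment.
LOKI_PUSH_URL = "http://localhost:3100/loki/api/v1/push"


def push_sorted(entries, labels):
    """Push a batch of (unix_seconds, line) tuples as one Loki stream.

    Entries are sorted client-side so the stream arrives in
    non-decreasing timestamp order, which is what Loki expects today.
    """
    values = [
        [str(int(ts * 1e9)), line]  # Loki wants nanosecond epoch strings
        for ts, line in sorted(entries, key=lambda e: e[0])
    ]
    payload = {"streams": [{"stream": labels, "values": values}]}
    resp = requests.post(LOKI_PUSH_URL, json=payload, timeout=5)
    resp.raise_for_status()


push_sorted(
    [(time.time(), "worker finished"), (time.time() - 2, "worker started")],
    {"job": "my-app"},
)
```

Sorting only helps within a single sender, of course; interleaved timestamps from separate processes pushing to the same stream are exactly the case a server-assigned timestamp would cover.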
As @cyriltovena mentioned, trying to buffer the data and sort it inside Loki is a big task, and it would come with a big cost in memory usage, especially with a large number of streams. This is primarily why we aren't too excited about implementing this and have been deferring the task of sorting to the clients. However, the idea of applying the timestamp in Loki at the time of ingestion is certainly intriguing! If it's OK with you, @shughes-uk, I'd like to rename this issue to reflect this new feature.
(Or, if you'd like, you can create a separate issue for this and keep this one for posterity.)
Fine with me! Both solutions are totally workable for my use case, and I think it makes sense to do the easy one first.
I'm using Loki to log stack traces from distributed serverless workers, so this keeps coming up as a problem. Applying the timestamp at the time of ingestion would work well for this scenario.
If we do this, let's make sure that the time is applied by the distributor, so that all ingesters get the same value.
How do we handle batches? For performance reasons, data is sent in batches. Should we just realign out-of-order entries?
This is quite a concern for me. I'm planning to use HTTP API calls; if Loki drops out-of-order messages, how do you account for network latency? We should not use the server timestamp, as it could be incorrect.
I imagine using server timestamps for out-of-order submissions could be a flag that you can enable/disable? Still not the perfect solution, but for people who don't need exact precision it would get a solution out there faster.
Hello everyone, apologies if there's already another issue with more information on this topic. Are there any developments or further discussions on this? We realized we were getting a fair amount of log lines being dropped, and while we acknowledge that we could use more cardinality (we are probably too conservative with labels), this issue is of interest to us. Thank you in advance!
This would be a great feature, and this is one of the major blockers to our organization adopting Loki. Is there any setting to ensure a timestamp can't be in the future? Suppose there is a bug, a manual entry, or something stupid: is there a way of stopping the minimum timestamp from ending up in the future, so that logs won't be rejected for a long time afterwards?
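On the "timestamp accidentally in the future" worry: independent of whatever limit Loki enforces server side, a sender can defensively clamp outgoing timestamps to its own clock before pushing. A minimal sketch (the one-second skew allowance and the function name are arbitrary assumptions):

```python
import time


def clamp_future(entries, skew_allowance=1.0):
    """Clamp any timestamp ahead of the local clock down to roughly now.

    entries: list of (unix_seconds, line) tuples.
    skew_allowance: arbitrary slack for small clock differences.
    """
    limit = time.time() + skew_allowance
    return [(min(ts, limit), line) for ts, line in entries]


# A buggy producer stamped an entry one hour in the future; it gets pulled back.
print(clamp_future([(time.time() + 3600, "entry from a misbehaving clock")]))
```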
For Arduino projects where the controller has no idea of the time of day, this would be a fantastic feature - letting Loki set the timestamp.
We have a requirement to use the line timestamp so that logs can be searched and correlated properly in the time domain, and we are collecting quite a few different kinds of logs from AWS and EKS. Many of these have out-of-order entries. As far as I can see, it is not theoretically possible to sort these before inserting them into the index. We do not know what kind of delays there are in the log pipeline, for example due to maintenance breaks, so you cannot wait for X amount of time before submitting an entry, and ordering is not possible - at least not in a foolproof way while retaining a reasonable delay before log entries appear in the log system. Replacing the timestamp of an out-of-order line with the latest timestamp might make future arriving entries out of order too. What you could do, as a workaround if modifying the index is not an option, is insert the last used timestamp as the timestamp for the entry that was out of order. Ultimately the time index would need to be modifiable so that out-of-order entries can be inserted.
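A rough sketch of that last-used-timestamp workaround on the client side (class and method names are illustrative):

```python
class TimestampRealigner:
    """Re-stamp out-of-order entries with the last accepted timestamp.

    Entries are never reordered; any entry older than the previous one
    simply reuses the previous timestamp so the stream stays monotonic.
    """

    def __init__(self):
        self._last_ts = 0.0

    def realign(self, ts, line):
        if ts < self._last_ts:
            ts = self._last_ts  # reuse the last used timestamp
        else:
            self._last_ts = ts
        return ts, line


realigner = TimestampRealigner()
for ts, line in [(10.0, "a"), (12.0, "b"), (11.0, "late")]:
    print(realigner.realign(ts, line))  # the "late" entry is re-stamped to 12.0
```

The obvious downside is the one noted above: the original line timestamp is lost for the re-stamped entries, which hurts correlation in the time domain.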
I get why the caching and out-of-order inserts might be tricky. Would it be possible for Loki to assign the timestamp, so that the ordering of arrival is absolute?
How about using two timestamp fields?
I just ran into an issue of chronologically misaligned log entries with standard apache2 access logs. One could argue that Apache just lacks (or am I missing something here?) a field with another timestamp source to use instead. But looking at what can be done on the Loki side ... there is certainly an upper limit to how much timestamp jumping this produces - depending on timeouts, likely just a few seconds. So "just" running a few seconds behind the clock and then sorting the ingested log lines would be sufficient to eliminate this. I noticed exactly this had been proposed before: #3119 (comment)
I'm new to Loki and ran into this issue while ingesting already existing logfiles. My log file actually has the timestamps out of order - probably a threading problem.
Because this is in existing logfiles, it prevents Promtail from reading them at all.
Even if Loki were somewhat tolerant of out-of-order entries (for instance, if it permitted logs within 1 minute of the latest value and sorted them / dealt with that server side), it would go a long way towards making Loki more user friendly, and more practical in multi-threaded situations.
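That bounded-tolerance idea can be approximated on the client today by running a small reordering buffer: hold entries for a fixed window, then release them in timestamp order. A minimal sketch, with an assumed 60-second window and illustrative names:

```python
import heapq
import time


class SortingBuffer:
    """Hold entries for a short window and release them in timestamp order.

    Trades `delay_seconds` of ingestion latency for tolerance of up to
    that much timestamp jitter between entries.
    """

    def __init__(self, delay_seconds=60.0):
        self.delay = delay_seconds
        self._heap = []  # min-heap keyed on timestamp

    def add(self, ts, line):
        heapq.heappush(self._heap, (ts, line))

    def drain_ready(self):
        """Pop entries older than now - delay, in timestamp order."""
        cutoff = time.time() - self.delay
        ready = []
        while self._heap and self._heap[0][0] <= cutoff:
            ready.append(heapq.heappop(self._heap))
        return ready


buf = SortingBuffer(delay_seconds=60.0)
buf.add(time.time() - 90, "older line")
buf.add(time.time() - 120, "even older line, added later")
print(buf.drain_ready())  # released oldest-first despite arrival order
```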
I definitely agree with this issue. NTP drift on servers, and web access logs written in an order that depends on response time, are a nightmare to keep consistent. We have to keep Loki storage as is - its power is having immutable chunks composed of time-ordered blocks. But we could:
The only impact seems to be in the WAL controller.
Sorry for my impatience, but when can we expect WAL-based out-of-order toleration? I've now had to reconsider the architecture of one of our projects, and Loki is planned to be excluded exactly because of this issue. The project MVP is targeted for June 2021. Can we expect a Loki release with WAL reordering before May 2021? Thank you!
Personally I can't speak for the Loki team, but @owen-d, as you are the WAL implementation author, do you think my proposal is acceptable for handling this case?
No immediate promises on 2.4, but we're going to try and release it quickly relative to the delay between most releases.
@owen-d until this feature can be supported, are there any workarounds to tweak Fluent Bit? I checked the official documentation here https://grafana.com/docs/loki/latest/clients/fluentbit/#buffering and the buffering mechanism with dque didn't help; we were still getting an enormous number of out-of-order entry errors.
We ended up switching to FluentD. With FluentD the out-of-order errors are gone, thanks to two things:
Another update: after load testing the new setup with FluentD, we still had some amount of out-of-order errors.
@sherifkayad could you share your Fluentd config? For clarity, when you say Fluentd, do you mean in an aggregator role or actually replacing the Fluent Bit daemonset?
@stevehipwell I am using FluentD as a DaemonSet, acting as the aggregator and the processor at the same time. Also, due to a temporary transition, we have exposed a FluentD TCP syslog endpoint to allow one of our older legacy systems (let's assume it's called SystemX) to log via syslog. Here is the config:
fileConfigs:
00_system.conf: |-
<system>
workers 4
log_level warn
</system>
01_sources.conf: |-
## logs from podman
<worker 0>
<source>
@type tail
@id in_tail_container_logs
@label @KUBERNETES
path /var/log/containers/*.log
pos_file /var/log/fluentd-containers.log.pos
tag kubernetes.*
read_from_head true
<parse>
@type multi_format
<pattern>
format json
time_key time
time_type string
time_format "%Y-%m-%dT%H:%M:%S.%NZ"
keep_time_key false
</pattern>
<pattern>
format regexp
expression /^(?<time>.+) (?<stream>stdout|stderr)( (.))? (?<log>.*)$/
time_format '%Y-%m-%dT%H:%M:%S.%NZ'
keep_time_key false
</pattern>
</parse>
emit_unmatched_lines true
</source>
</worker>
## logs for SYSTEM X via syslog
<source>
@type syslog
@label @SYSTEMX
port 5140
<transport tcp>
</transport>
bind 0.0.0.0
tag systemx.*
frame_type octet_count
<parse>
message_format rfc5424
</parse>
emit_unmatched_lines true
</source>
02_filters.conf: |-
<label @KUBERNETES>
<filter kubernetes.**>
@type kubernetes_metadata
@id filter_kube_metadata
skip_labels false
skip_container_metadata false
skip_namespace_metadata false
skip_master_url true
</filter>
<filter kubernetes.**>
@type parser
key_name log
reserve_data true
remove_key_name_field true
emit_invalid_record_to_error false
<parse>
@type json
</parse>
</filter>
<match kubernetes.**>
@type relabel
@label @DISPATCH
</match>
</label>
<label @SYSTEMX>
<filter systemx.**>
@type parser
key_name message
reserve_data true
remove_key_name_field true
emit_invalid_record_to_error false
<parse>
@type json
</parse>
</filter>
<match systemx.**>
@type relabel
@label @DISPATCH
</match>
</label>
03_dispatch.conf: |-
<label @DISPATCH>
<filter **>
@type prometheus
<metric>
name fluentd_input_status_num_records_total
type counter
desc The total number of incoming records
<labels>
tag ${tag}
hostname ${hostname}
</labels>
</metric>
</filter>
<filter **>
@type record_modifier
<record>
fluentd_worker "#{worker_id}"
</record>
</filter>
<match **>
@type relabel
@label @OUTPUT
</match>
</label>
04_outputs.conf: |-
<label @OUTPUT>
<match kubernetes.**>
@type loki
url "http://loki-loki-distributed-distributor:3100"
flush_interval 1s
flush_at_shutdown true
retry_limit 60
buffer_chunk_limit 5m
remove_keys kubernetes,docker
line_format json
<label>
fluentd_worker
stream
container $.kubernetes.container_name
node $.kubernetes.host
app $.kubernetes.labels.app
namespace $.kubernetes.namespace_name
instance $.kubernetes.pod_name
# other custom labels we have ...
</label>
extra_labels {"job":"fluentd", "cluster":"my-cluster"}
</match>
<match systemx.**>
@type loki
url "http://loki-loki-distributed-distributor:3100"
flush_interval 1s
flush_at_shutdown true
retry_limit 60
buffer_chunk_limit 5m
extract_kubernetes_labels false
# remove_keys stream,kubernetes,docker
line_format json
<label>
fluentd_worker
# other custom labels we have ...
</label>
extra_labels {"job":"systemx"}
</match>
</label>
@francisdb that's for Grafana Cloud, we're still waiting on the v2.4.0 release. |
I'm very happy to be able to close this issue <3. Out-of-order support by default has been merged.
@ningyougang - I just observed the same (unexpected to me) error messages.
In that case, Loki was sending a request to itself at http://127.0.0.1:3100/loki/api/v1/cache/generation_numbers, but the handler for that path is registered only when compactor retention is enabled (and my Loki config did not have that enabled).
I'd love to be able to use Loki in a distributed system more easily, without being forced into relatively high-cardinality labels based on something like process ID. This goes double for systems like AWS Lambda.
The main obstacle to this for me is being unable to submit 'out of order' log lines; it would be great if Loki could have a feature that enables this.
At one point I found an old issue relating to this request, but it was closed with "not something we need before releasing". Perhaps it is time to revisit this?
Cheers