Allow out of order log submission #1544

Closed
shughes-uk opened this issue Jan 17, 2020 · 67 comments

@shughes-uk

I'd love to be able to use Loki in a distributed system more easily, without being forced into relatively high-cardinality labels based on something like a process ID. This goes double for systems like AWS Lambda.

The main obstacle for me is being unable to submit 'out of order' log lines; it would be great if Loki had a feature that enabled this.

At one point I found an old issue relating to this request, but it was closed with "not something we need before releasing". Perhaps it is time to revisit this?

Cheers

@cyriltovena
Contributor

Hello,

To be totally honest, this is how the Cortex storage works, and Loki is built on top of it. I'm not saying we aren't willing to work on it, but the amount of work required is substantial. So far we're trying to find alternatives instead of reworking the whole foundation of Loki.

I'm curious to know how you are sending logs to Loki. Are you using Promtail or in-app API calls? Do you care about ordering?

Something that could be done is using a server side timestamp in Loki.

@shughes-uk
Author

In-app API calls, this library to be exact: https://github.com/GreyZmeem/python-logging-loki.

I would say the ordering is not ultra critical, as my data contains its own timestamps. Our app essentially proxies the Loki websocket to a front-end UI, and having the UI sort new log lines as they arrive is trivial.

For querying exact time ranges, we would just expand the range by a value (the largest amount of time we would expect a log to be 'late' by) and filter it down to the desired time range.

So yes, a flag for Loki to just use its own timestamps would be perfect for us.
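
A rough sketch of the range-expansion workaround described above, assuming the true event timestamp is embedded in each log line as JSON and queried back through Loki's query_range API; the URL, label selector, field name and lateness bound are illustrative, not part of any client library:

import json
import requests  # assumes the requests library is available

LOKI_URL = "http://localhost:3100"        # illustrative endpoint
MAX_LATENESS_NS = 60 * 1_000_000_000      # assume logs arrive at most 60s late

def query_exact_range(selector: str, start_ns: int, end_ns: int):
    """Query a widened range, then filter and sort by the timestamp embedded in each line."""
    resp = requests.get(
        f"{LOKI_URL}/loki/api/v1/query_range",
        params={
            "query": selector,
            "start": start_ns,
            "end": end_ns + MAX_LATENESS_NS,  # widen the range to catch late arrivals
            "limit": 5000,
        },
    )
    resp.raise_for_status()
    entries = []
    for stream in resp.json()["data"]["result"]:
        for _arrival_ts, line in stream["values"]:
            event_ns = int(json.loads(line)["event_ts_ns"])  # hypothetical embedded field
            if start_ns <= event_ns <= end_ns:
                entries.append((event_ns, line))
    return [line for _, line in sorted(entries)]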

@slim-bean
Collaborator

As @cyriltovena mentioned, trying to buffer the data and sort it inside Loki is a big task, and it would come at a significant cost in memory usage, especially with a large number of streams. This is primarily why we aren't too excited about implementing this and have been deferring the task of sorting to the clients.

However, the idea of applying the timestamp in Loki at the time of ingestion is certainly intriguing!

If it's ok with you @shughes-uk, I'd like to rename this issue to reflect this new feature?

@slim-bean
Collaborator

(or if you'd like you can create a separate issue for this and keep this one for posterity)

@shughes-uk
Author

shughes-uk commented Jan 22, 2020

Fine with me! Both solutions are totally workable for my use case, and I think it makes sense to do the easy one first.

@tmikaeld

I'm using Loki to log stack traces from distributed serverless workers, so this keeps coming up as a problem. Applying the timestamp at the time of ingestion would work well for this scenario.

@cyriltovena added the component/loki, help wanted, and keepalive labels on Jan 28, 2020
@pstibrany
Member

If we do this, let's make sure that the time is applied by the distributor, so that all ingesters get the same value.

@cyriltovena
Contributor

How do we handle batches? For performance reasons, data is sent in batches. Should we just realign out-of-order entries?

@imtrobin

This is quite a concern for me. I'm planning to use HTTP API calls; if Loki drops out-of-order messages, how do you account for network latency? We should not use the server timestamp, as it could be incorrect.

@shughes-uk
Author

I imagine using server timestamps for out-of-order submissions could be a flag that you can enable/disable? Still not the perfect solution, but for people who don't need exact precision it would get a solution out there faster.

@judemorrissey

Hello everyone, apologies if there's already another issue with more information on this topic, but are there any developments or further discussions on this? We realized a fair number of our log lines were being dropped, and while we acknowledge that we could use more cardinality (we are probably too conservative with labels), this issue is of interest to us.

Thank you in advance!

@verticelo

This would be a great feature and one of the major blockers for our organization adopting Loki. Is there any setting to prevent a timestamp from being in the future? Suppose there is a bug, a bad manual entry, or something silly: is there a way of stopping the minimum accepted timestamp from ending up in the future, so that logs won't be rejected for a long time afterwards?

@HairingX

For Arduino projects where the controllers have no idea of the time of day, this would be a fantastic feature: letting Loki set the timestamp.
It would let IoT devices use less power while still having the possibility of using Loki's logging features.

@tlaukkan

tlaukkan commented Dec 30, 2020

We have a requirement to use the line timestamp so that logs can be searched and correlated properly in the time domain, and we are collecting many different kinds of logs from AWS and EKS. Many of these have out-of-order entries. As far as I can see, it is not theoretically possible to sort these before inserting into the index: we do not know what kind of delays there are in the log pipeline, for example due to maintenance breaks, so you cannot simply wait X amount of time before submitting an entry, and ordering is not possible. At least not in a foolproof way while retaining a reasonable delay before entries become visible in the log system. Replacing the timestamp of an out-of-order line with the latest timestamp might make future arriving entries out of order too. What you could do, as a workaround if modifying the index is not an option, is use the last used timestamp as the timestamp for the out-of-order entry.

Ultimately the time index would need to be modifiable so it is possible to enter out-of-order entries.
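
A minimal sketch of the clamping workaround mentioned above, applied on the client side before pushing; the per-stream state, function name, and the one-nanosecond bump are illustrative:

from collections import defaultdict

# Last timestamp (in nanoseconds) pushed per stream; illustrative in-memory state.
_last_ts_ns = defaultdict(int)

def clamp_timestamp(stream_key: str, ts_ns: int) -> int:
    """If an entry is older than the last one pushed for its stream, reuse the last
    timestamp (bumped by 1 ns) so the stream stays ordered; the original ordering
    information is lost, as noted above."""
    last = _last_ts_ns[stream_key]
    if ts_ns <= last:
        ts_ns = last + 1
    _last_ts_ns[stream_key] = ts_ns
    return ts_ns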

@HairingX

I get why the caching and out-of-order insert might be tricky.

Would it be possible for Loki to assign the timestamp, so that ordering by time of arrival is guaranteed?
IoT devices would then be able to use Loki for logging. Of course this would require you to ship the messages over HTTP or similar, as cloud solutions might deliver them in the wrong order, unless the order does not matter, which sometimes it doesn't.

@hanjm

hanjm commented Jan 23, 2021

How about using two timestamp fields?
For writes: two timestamp fields, one being the Loki receive timestamp and the other the log timestamp, allowing some tolerance between the receive timestamp and the log timestamp (such as 1 minute).
For reads: scan a time range widened by that tolerance and sort by the log timestamp field.
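
A minimal sketch of the write side of this idea, using Loki's push API with the receive (arrival) time as the Loki timestamp and the original log timestamp carried inside the line; the URL, labels, and field names are illustrative:

import json
import time
import requests  # assumes the requests library is available

LOKI_PUSH_URL = "http://localhost:3100/loki/api/v1/push"  # illustrative endpoint

def push_line(labels: dict, log_ts_ns: int, message: str) -> None:
    """Push one entry: Loki indexes the receive time, the original timestamp rides along."""
    receive_ts_ns = time.time_ns()
    line = json.dumps({"log_ts_ns": log_ts_ns, "msg": message})  # hypothetical field names
    payload = {"streams": [{"stream": labels, "values": [[str(receive_ts_ns), line]]}]}
    requests.post(LOKI_PUSH_URL, json=payload).raise_for_status()

On the read side, the query would widen its time range by the tolerance and sort on the embedded log timestamp, as in the earlier range-expansion sketch.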

@frittentheke
Contributor

frittentheke commented Feb 11, 2021

I just ran into an issue of chronologically misaligned log entries with standard Apache2 access logs.
I suppose this is simply because Apache logs %t, the "Time the request was received (standard english format)" (http://httpd.apache.org/docs/2.4/mod/mod_log_config.html), while the request then takes a varying amount of time to answer, at which point the line finally reaches the log file.

One could argue that Apache simply lacks (or am I missing something here?) a field with another timestamp source to use instead of %t. If Promtail were able to do a bit of math on the data / fields (#3129), one could use the seconds since epoch as the timestamp (%t{format}) and simply add the duration (%D) (while also logging it). This would get us back to an ever-increasing timestamp, just using the time when the request was answered.
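
A minimal sketch of the "%t plus %D" arithmetic described above, assuming the access log is configured to emit the request start time as epoch seconds and the request duration in microseconds; this is plain illustrative code, not an existing Apache or Promtail feature:

def corrected_timestamp_ns(start_epoch_s: int, duration_us: int) -> int:
    """Return the time the request finished (start time plus duration), in nanoseconds.

    Using the completion time keeps timestamps increasing in the access log,
    since lines are only written once the request has been answered."""
    return start_epoch_s * 1_000_000_000 + duration_us * 1_000

# Example: a request received at epoch second 1613237708 that took 2.5s to answer.
print(corrected_timestamp_ns(1613237708, 2_500_000))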

But looking at what can be done on the Loki side, there is certainly an upper limit to how much timestamp jumping this produces; depending on timeouts, likely just a few seconds. So "just" running a few seconds behind the clock and then sorting the ingested log lines would be sufficient to eliminate this.

I noticed exactly this had been proposed before: #3119 (comment)

@michaelosthege

I'm new to Loki and ran into this issue while ingesting already existing logfiles.

My log file actually has the timestamps out of order - probably a threading problem.

D	2021-02-13T17:35:08.981Z	app.py:221	step 2...
D	2021-02-13T17:35:08.011Z	app.py:221	step 1...

Because these are existing log files, it prevents Promtail from ingesting them at all.
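
One possible pre-processing workaround for a case like this is to sort the existing file by its timestamp column before handing it to Promtail. A minimal sketch, assuming the tab-separated layout of the sample above (an illustrative script, not a Promtail feature):

import sys

def sort_logfile(path: str) -> None:
    """Print the file's lines sorted by the second tab-separated column (the timestamp)."""
    with open(path) as f:
        lines = [line.rstrip("\n") for line in f if line.strip()]
    # Timestamps like 2021-02-13T17:35:08.981Z sort correctly as plain strings.
    lines.sort(key=lambda line: line.split("\t")[1])
    sys.stdout.write("\n".join(lines) + "\n")

if __name__ == "__main__":
    sort_logfile(sys.argv[1])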

@loganmc10

Even if Loki were somewhat tolerant of out-of-order entries (for instance, if it permitted logs within 1 minute of the latest value and sorted them / dealt with that server side), it would go a long way toward making Loki more user friendly, and more practical in multi-threaded situations.

@gillg

gillg commented Mar 2, 2021

I definitely agree with this issue. NTP drift on servers, and web access logs written in an order that depends on response time, are a nightmare for consistency.
Now that Loki has a WAL feature, I propose an idea.

We should keep Loki storage as is. Its strength is having immutable chunks composed of time-ordered blocks.

But we could:

  • Add a new config var out-of-order-toleration
  • This value can't be higher than the WAL duration
  • As a first solution, if you choose to use this option, we could accept that queries may miss logs within the tolerance period (if that is easier to implement)
  • Reorder logs before the WAL is written out as blocks / chunks

The only impact seems to be in the WAL controller (see the sketch below).
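
A minimal sketch of the kind of bounded reorder buffer this proposal implies: entries are held until they fall behind the newest timestamp seen by more than the tolerance, then released in sorted order. Names and structure are illustrative; this is not how Loki's WAL is actually implemented:

import heapq

class ReorderBuffer:
    """Hold entries up to tolerance_ns behind the newest timestamp seen,
    then release them in timestamp order."""

    def __init__(self, tolerance_ns: int):
        self.tolerance_ns = tolerance_ns
        self.max_seen_ns = 0
        self._heap = []  # min-heap of (timestamp_ns, line)

    def add(self, ts_ns: int, line: str):
        """Accept an entry and return any entries that are now safe to flush."""
        self.max_seen_ns = max(self.max_seen_ns, ts_ns)
        heapq.heappush(self._heap, (ts_ns, line))
        watermark = self.max_seen_ns - self.tolerance_ns
        flushed = []
        while self._heap and self._heap[0][0] <= watermark:
            flushed.append(heapq.heappop(self._heap))
        return flushed

    def drain(self):
        """Flush everything remaining (e.g. on shutdown), in order."""
        return [heapq.heappop(self._heap) for _ in range(len(self._heap))]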

@vladimirfx

Sorry for my impatience, but when can we expect WAL-based out-of-order tolerance?

I've now reconsidered the architecture of one of our projects, and Loki is planned to be excluded precisely because of this issue. The project MVP is targeted for June 2021. Can we expect a Loki release with WAL reordering before May 2021?

Thank you!

@gillg

gillg commented Mar 5, 2021

Sorry for my impatience, but when can we expect WAL-based out-of-order tolerance?

Personally I can't speak for the Loki team, but @owen-d, as the WAL implementation author, do you think my proposal is acceptable for handling this case?

@owen-d
Member

owen-d commented Aug 30, 2021

No immediate promises on 2.4, but we're going to try and release it quickly relative to the delay between most releases.

@sherifkayad

@owen-d until this feature is supported, are there any workarounds to tweak Fluent Bit? I checked the official documentation here https://grafana.com/docs/loki/latest/clients/fluentbit/#buffering and the mechanism with dque didn't help; we were still getting an enormous number of "entry out of order" errors.

@sherifkayad

We ended up switching to FluentD. With FluentD the out-of-order errors are gone, thanks to two things:

  • Proper usage of labels to distinguish logs
  • Adding the FluentD worker ID to the list of labels for multi-worker input plugins, e.g. syslog over TCP

@sherifkayad

sherifkayad commented Sep 13, 2021

Another update: after load testing the new setup with FluentD, we still had some out-of-order errors. The rate is in fact much lower than with Fluent Bit, but still, I believe the feature linked to this issue is a must-have for us 😄

@stevehipwell
Contributor

@sherifkayad could you share your Fluentd config? For clarity, when you say Fluentd, do you mean in an aggregator role or actually replacing the Fluent Bit daemonset?

@sherifkayad

@stevehipwell I am using FluentD as a DaemonSet, acting as both the aggregator and the processor. Also, due to a temporary transition, we have exposed a FluentD TCP syslog endpoint to allow one of our older legacy systems (let's assume it's called systemx) to ship data to Loki in the easiest way, without having to surgically operate on that older system. Below is our full FluentD config (we are using the FluentD Helm chart btw):

fileConfigs:
  00_system.conf: |-
    <system>
      workers 4
      log_level warn
    </system>

  01_sources.conf: |-
    ## logs from podman
    <worker 0>
      <source>
        @type tail
        @id in_tail_container_logs
        @label @KUBERNETES
        path /var/log/containers/*.log
        pos_file /var/log/fluentd-containers.log.pos
        tag kubernetes.*
        read_from_head true
        <parse>
          @type multi_format
          <pattern>
            format json
            time_key time
            time_type string
            time_format "%Y-%m-%dT%H:%M:%S.%NZ"
            keep_time_key false
          </pattern>
          <pattern>
            format regexp
            expression /^(?<time>.+) (?<stream>stdout|stderr)( (.))? (?<log>.*)$/
            time_format '%Y-%m-%dT%H:%M:%S.%NZ'
            keep_time_key false
          </pattern>
        </parse>
        emit_unmatched_lines true
      </source>
    </worker>

    ## logs for SYSTEM X via syslog
    <source>
      @type syslog
      @label @SYSTEMX
      port 5140
      <transport tcp>
      </transport>
      bind 0.0.0.0
      tag systemx.*
      frame_type octet_count
      <parse>
        message_format rfc5424
      </parse>
      emit_unmatched_lines true
    </source>

  02_filters.conf: |-
    <label @KUBERNETES>
      <filter kubernetes.**>
        @type kubernetes_metadata
        @id filter_kube_metadata
        skip_labels false
        skip_container_metadata false
        skip_namespace_metadata false
        skip_master_url true
      </filter>

      <filter kubernetes.**>
        @type parser
        key_name log
        reserve_data true
        remove_key_name_field true
        emit_invalid_record_to_error false
        <parse>
          @type json
        </parse>
      </filter>

      <match kubernetes.**>
        @type relabel
        @label @DISPATCH
      </match>
    </label>

    <label @SYSTEMX>
      <filter systemx.**>
        @type parser
        key_name message
        reserve_data true
        remove_key_name_field true
        emit_invalid_record_to_error false
        <parse>
          @type json
        </parse>
      </filter>

      <match systemx.**>
        @type relabel
        @label @DISPATCH
      </match>
    </label>
  
  03_dispatch.conf: |-
    <label @DISPATCH>
      <filter **>
        @type prometheus
        <metric>
          name fluentd_input_status_num_records_total
          type counter
          desc The total number of incoming records
          <labels>
            tag ${tag}
            hostname ${hostname}
          </labels>
        </metric>
      </filter>

      <filter **>
        @type record_modifier
        <record>
          fluentd_worker "#{worker_id}"
        </record>
      </filter>

      <match **>
        @type relabel
        @label @OUTPUT
      </match>
    </label>

  04_outputs.conf: |-
    <label @OUTPUT>
      <match kubernetes.**>
        @type loki
        url "http://loki-loki-distributed-distributor:3100"
        flush_interval 1s
        flush_at_shutdown true
        retry_limit 60
        buffer_chunk_limit 5m
        remove_keys kubernetes,docker
        line_format json
        <label>
          fluentd_worker
          stream
          container $.kubernetes.container_name
          node $.kubernetes.host
          app $.kubernetes.labels.app
          namespace $.kubernetes.namespace_name
          instance $.kubernetes.pod_name
          # other custom labels we have ... 
        </label>
        extra_labels {"job":"fluentd", "cluster":"my-cluster"}
      </match>

      <match systemx.**>
        @type loki
        url "http://loki-loki-distributed-distributor:3100"
        flush_interval 1s
        flush_at_shutdown true
        retry_limit 60
        buffer_chunk_limit 5m
        extract_kubernetes_labels false
        # remove_keys stream,kubernetes,docker
        line_format json
        <label>
          fluentd_worker
          # other custom labels we have ...
        </label>
        extra_labels {"job":"systemx"}
      </match>
    </label>

@francisdb

https://grafana.com/blog/2021/09/16/avoid-dropped-logs-due-to-out-of-order-timestamps-with-a-new-loki-feature/

@stevehipwell
Contributor

@francisdb that's for Grafana Cloud, we're still waiting on the v2.4.0 release.

@owen-d
Member

owen-d commented Oct 29, 2021

I'm very happy to be able to close this issue <3. Out of order support by default has been merged into main and we'll be releasing 2.4 within the next few weeks :).
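
A minimal way to verify this against a 2.4+ instance is to push two entries for the same stream with timestamps in reverse order through the standard push API and confirm the request is accepted; the URL and labels are illustrative:

import time
import requests  # assumes the requests library is available

LOKI_PUSH_URL = "http://localhost:3100/loki/api/v1/push"  # illustrative endpoint

now_ns = time.time_ns()
payload = {
    "streams": [
        {
            "stream": {"job": "ooo-test"},  # illustrative label set
            "values": [
                [str(now_ns), "newer entry first"],
                [str(now_ns - 30 * 1_000_000_000), "older entry, 30s earlier"],
            ],
        }
    ]
}

resp = requests.post(LOKI_PUSH_URL, json=payload)
print(resp.status_code)  # expect 204 on a release that accepts out-of-order writes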

@ningyougang

ningyougang commented Nov 25, 2022

@owen-d

I'm very happy to be able to close this issue <3. Out of order support by default has been merged into main and we'll be releasing 2.4 within the next few weeks :).

Many thanks for the new release. I have checked, and I don't see the out of order issue anymore.

But there are some error logs on the Loki side, e.g.

kubectl logs loki-0 -n ${mynamespace}
...
...
level=error ts=2022-11-25T05:55:56.264429151Z caller=gennumber_client.go:70 msg="error getting cache gen numbers from the store" err="unexpected status code: 404"
level=error ts=2022-11-25T05:55:56.264492588Z caller=gennumber_loader.go:129 msg="error loading cache generation numbers" err="unexpected status code: 404"
level=error ts=2022-11-25T05:55:56.264668532Z caller=gennumber_client.go:70 msg="error getting cache gen numbers from the store" err="unexpected status code: 404"
level=error ts=2022-11-25T05:55:56.264708865Z caller=gennumber_loader.go:129 msg="error loading cache generation numbers" err="unexpected status code: 404"
level=error ts=2022-11-25T05:55:56.264840943Z caller=gennumber_client.go:70 msg="error getting cache gen numbers from the store" err="unexpected status code: 404"
level=error ts=2022-11-25T05:55:56.264869936Z caller=gennumber_client.go:70 msg="error getting cache gen numbers from the store" err="unexpected status code: 404"
level=error ts=2022-11-25T05:55:56.264878406Z caller=gennumber_loader.go:129 msg="error loading cache generation numbers" err="unexpected status code: 404"
level=error ts=2022-11-25T05:55:56.264885372Z caller=delete_requests_client.go:211 msg="error getting delete requests from the store" err="unexpected status code: 404"
level=error ts=2022-11-25T05:55:56.264889201Z caller=gennumber_client.go:70 msg="error getting cache gen numbers from the store" err="unexpected status code: 404"
level=error ts=2022-11-25T05:55:56.264896823Z caller=gennumber_loader.go:129 msg="error loading cache generation numbers" err="unexpected status code: 404"
level=error ts=2022-11-25T05:55:56.264907359Z caller=gennumber_client.go:70 msg="error getting cache gen numbers from the store" err="unexpected status code: 404"
ts=2022-11-25T05:55:56.26492361Z caller=spanlogger.go:80 user=fake level=error msg="failed loading deletes for user" err="unexpected status code: 404"
level=error ts=2022-11-25T05:55:56.264923621Z caller=gennumber_loader.go:129 msg="error loading cache generation numbers" err="unexpected status code: 404"
level=error ts=2022-11-25T05:55:56.264937664Z caller=gennumber_client.go:70 msg="error getting cache gen numbers from the store" err="unexpected status code: 404"
level=error ts=2022-11-25T05:55:56.264940626Z caller=gennumber_loader.go:129 msg="error loading cache generation numbers" err="unexpected status code: 404"
level=error ts=2022-11-25T05:55:56.264948588Z caller=gennumber_client.go:70 msg="error getting cache gen numbers from the store" err="unexpected status code: 404"
level=error ts=2022-11-25T05:55:56.264911113Z caller=gennumber_client.go:70 msg="error getting cache gen numbers from the store" err="unexpected status code: 404"
level=error ts=2022-11-25T05:55:56.264970785Z caller=gennumber_loader.go:129 msg="error loading cache generation numbers" err="unexpected status code: 404"
level=error ts=2022-11-25T05:55:56.264986845Z caller=gennumber_loader.go:129 msg="error loading cache generation numbers" err="unexpected status code: 404"
level=error ts=2022-11-25T05:55:56.264991803Z caller=gennumber_loader.go:129 msg="error loading cache generation numbers" err="unexpected status code: 404"
level=error ts=2022-11-25T05:55:56.265039732Z caller=delete_requests_client.go:211 msg="error getting delete requests from the store" err="unexpected status code: 404"
ts=2022-11-25T05:55:56.265063024Z caller=spanlogger.go:80 user=fake level=error msg="failed loading deletes for user" err="unexpected status code: 404"
level=error ts=2022-11-25T05:55:56.265203547Z caller=delete_requests_client.go:211 msg="error getting delete requests from the store" err="unexpected status code: 404"
level=error ts=2022-11-25T05:55:56.26520212Z caller=delete_requests_client.go:211 msg="error getting delete requests from the store" err="unexpected status code: 404"
ts=2022-11-25T05:55:56.26523334Z caller=spanlogger.go:80 user=fake level=error msg="failed loading deletes for user" err="unexpected status code: 404"
ts=2022-11-25T05:55:56.265241025Z caller=spanlogger.go:80 user=fake level=error msg="failed loading deletes for user" err="unexpected status code: 404"
level=error ts=2022-11-25T05:55:56.265235331Z caller=delete_requests_client.go:211 msg="error getting delete requests from the store" err="unexpected status code: 404"
level=error ts=2022-11-25T05:55:56.265250397Z caller=delete_requests_client.go:211 msg="error getting delete requests from the store" err="unexpected status code: 404"
level=error ts=2022-11-25T05:55:56.265251806Z caller=delete_requests_client.go:211 msg="error getting delete requests from the store" err="unexpected status code: 404"
level=error ts=2022-11-25T05:55:56.265266956Z caller=delete_requests_client.go:211 msg="error getting delete requests from the store" err="unexpected status code: 404"
ts=2022-11-25T05:55:56.265278623Z caller=spanlogger.go:80 user=fake level=error msg="failed loading deletes for user" err="unexpected status code: 404"
ts=2022-11-25T05:55:56.26529171Z caller=spanlogger.go:80 user=fake level=error msg="failed loading deletes for user" err="unexpected status code: 404"
ts=2022-11-25T05:55:56.265300688Z caller=spanlogger.go:80 user=fake level=error msg="failed loading deletes for user" err="unexpected status code: 404"
ts=2022-11-25T05:55:56.265300497Z caller=spanlogger.go:80 user=fake level=error msg="failed loading deletes for user" err="unexpected status code: 404"
level=error ts=2022-11-25T05:55:56.265309266Z caller=delete_requests_client.go:211 msg="error getting delete requests from the store" err="unexpected status code: 404"
ts=2022-11-25T05:55:56.26534413Z caller=spanlogger.go:80 user=fake level=error msg="failed loading deletes for user" err="unexpected status code: 404"

Any idea about the above error logs? Btw, despite these errors being printed, queries do ultimately succeed.

I suspected it was related to the cache, so I added a cache configuration to Loki (screenshot omitted), but the same errors were still printed after executing a query_range operation.

@ericrrath

@ningyougang - I just observed the same error messages, which were unexpected to me as well, e.g.:

 2022-11-30T22:58:56.385213734+00:00 stderr F level=error ts=2022-11-30T22:58:56.373375959Z caller=gennumber_loader.go:129 msg="error loading cache generation numbers" err="unexpected status code: 404"

In that case, Loki was sending a request to itself at http://127.0.0.1:3100/loki/api/v1/cache/generation_numbers, but the handler for that path is registered only when compactor retention is enabled (and my Loki config did not have compactor.retention_enabled=true). I wonder if there's a similar explanation for the other error messages.
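
A quick way to test that hypothesis, assuming Loki is reachable locally: request the endpoint mentioned above and see whether it returns 404 (handler not registered, e.g. retention disabled) or succeeds:

import requests  # assumes the requests library is available

# Endpoint from the comment above; a 404 suggests the retention handler is not registered.
url = "http://127.0.0.1:3100/loki/api/v1/cache/generation_numbers"
resp = requests.get(url)
print(resp.status_code, resp.text[:200])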
