
Sink stops sending logs after Loki error 'entry out of order' (HTTP status 400) #49

Closed
raphaelquati opened this issue Aug 18, 2021 · 18 comments · Fixed by #55
Labels
bug

Comments

@raphaelquati
Contributor

raphaelquati commented Aug 18, 2021

When our application is under some load (a lot of broker messages to consume, using Rebus worker threads), the sink suddenly stops sending logs to Loki.

To investigate, we configured the application with two sinks enabled, console and this sink, and set up Promtail to ship the console log to Loki.
We found this line on Loki log:

level=warn ts=2021-08-17T21:57:55.768341371Z caller=grpc_logging.go:38 method=/logproto.Pusher/Push duration=1.39409ms err="rpc error: code = Code(400) desc = entry with timestamp 2021-08-17 21:56:54.73 +0000 UTC ignored, reason: 'entry out of order' for stream: {SourceContext=\"Rebus.Retry.ErrorTracking.InMemErrorTracker\", app=\"myapp\", container=\"myapp\", errorNumber=\"1\", messageId=\"52xxb042-xxaa-xxxx-813a-31fd6810bb25\", namespace=\"default\", node_name=\"aks-agentpool\", pod=\"myapp-7bd5d4fbcb-gjvxm\"},\ntotal ignored: 1 out of 4" msg="gRPC\n"

After this error, the sink does not recover and stops sending logs to Loki.

A similar problem (entry out of order) is discussed here.

We are using version 6.0.1
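
For reference, a minimal sketch of the two-sink setup described above (not our exact configuration; the Loki URI is a placeholder and it assumes the Serilog console sink package is installed):

```csharp
using Serilog;
using Serilog.Sinks.Grafana.Loki;

// Console output is scraped by Promtail and shipped to Loki;
// the Grafana Loki sink pushes the same events directly to Loki.
Log.Logger = new LoggerConfiguration()
    .Enrich.FromLogContext()
    .WriteTo.Console()
    .WriteTo.GrafanaLoki("http://loki:3100") // placeholder URI
    .CreateLogger();

Log.Information("Hello from both sinks");
```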

@mishamyte added the invalid label Aug 18, 2021
@mishamyte
Member

Hi, @raphaelquati!

As far as I can see, there are two problems here: the sink recovery issue and the 'entry out of order' case.

I will investigate both cases.

Could you please tell me whether this entry rejection happens for the same label set or for different ones?

@raphaelquati
Contributor Author

Hi, @mishamyte,

It happens with different label sets, but one particular label always changes: "pod".
The application is running at a k8s cluster, with auto-scaling enabled. Different pods (same application) suffer from the same problem.

https://gist.github.com/raphaelquati/0726f081a49e8d55e51314ef2ec86e3c

@mishamyte
Member

@raphaelquati, could you also give me some more info: some logs with the label pod="myapp-qa-7bd5d4fbcb-xttkn"?

Both successfully delivered and failed entries should be in that range.

@raphaelquati
Contributor Author

This pod started at around 08:40 and stopped at around 08:50.

[screenshot]

And I've found one error in Loki

[screenshot]

@raphaelquati
Contributor Author

Before the error in Loki, the sink was sending logs to Loki correctly.

Yellow lines (level detected): the sink sent the logs correctly
Gray lines (unknown level): console log (Promtail)

[screenshot]

@raphaelquati
Contributor Author

Last successful log sent:

[screenshot]

@mishamyte
Member

Thank you, @raphaelquati!

There are national holidays in my country, so I will return to the investigation next week, after the 25th of August.

@mishamyte
Member

Hi again @raphaelquati,

I just did some quick checks.

  1. The entry out of order error is produced by Loki. It happens when entries with the same label set do not arrive in timestamp order.

This behavior is by design and is discussed in grafana/loki#3379, as you mentioned before.

For this case I suggest using a unique label per pod instance (see the sketch below).

  2. Also, I tried several scenarios to stop the sink from working and could not reproduce such behavior. So I would be grateful if you have a reproduction.
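
To illustrate the unique-label suggestion from item 1, a rough sketch (the labels parameter name and the LokiLabel { Key, Value } shape are taken from memory of the README and may differ between sink versions; in Kubernetes the HOSTNAME variable usually holds the pod name):

```csharp
using System;
using System.Collections.Generic;
using Serilog;
using Serilog.Sinks.Grafana.Loki;

// HOSTNAME resolves to the pod name inside a Kubernetes pod,
// so every pod instance gets its own Loki stream.
var podName = Environment.GetEnvironmentVariable("HOSTNAME") ?? Environment.MachineName;

Log.Logger = new LoggerConfiguration()
    .WriteTo.GrafanaLoki(
        "http://loki:3100", // placeholder URI
        labels: new List<LokiLabel>
        {
            new LokiLabel { Key = "app", Value = "myapp" },
            new LokiLabel { Key = "pod", Value = podName } // unique per pod instance
        })
    .CreateLogger();
```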

@mishamyte
Member

mishamyte commented Aug 29, 2021

Hi @raphaelquati,

Good news: we have found a problem in the logic that creates a bottleneck for log events.

It is described in #52.

I will provide a fix, and I think it could help in your case.

@mishamyte added the bug label and removed the invalid label Aug 29, 2021
@mishamyte
Member

Hi, @raphaelquati!

I have just released v7.0.2, where entries rejected by Loki are dropped from the queue. This helps the sink deliver the next correct entries to Loki and fixes a memory bottleneck.

I hope you will find this useful in your situation!
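
The idea, as a very rough sketch (this is not the sink's actual code, only an illustration of the drop-on-reject behavior; the endpoint is Loki's standard /loki/api/v1/push on a placeholder host):

```csharp
using System.Net;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

public class LokiBatchSender
{
    private readonly HttpClient _httpClient = new HttpClient();
    private const string PushEndpoint = "http://loki:3100/loki/api/v1/push"; // placeholder host

    public async Task SendBatchAsync(string batchJson)
    {
        var content = new StringContent(batchJson, Encoding.UTF8, "application/json");
        var response = await _httpClient.PostAsync(PushEndpoint, content);

        if (response.StatusCode == HttpStatusCode.BadRequest)
        {
            // Loki rejected the batch (e.g. "entry out of order"); retrying cannot
            // succeed, so the entries are dropped instead of blocking the queue
            // and growing memory.
            return;
        }

        response.EnsureSuccessStatusCode(); // other failures may be retried by the caller
    }
}
```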

@raphaelquati
Contributor Author

raphaelquati commented Aug 30, 2021

Hi @mishamyte,

I just did some quick checks.

  1. The entry out of order error is produced by Loki. It happens when entries with the same label set do not arrive in timestamp order.

This behavior is by design and is discussed in grafana/loki#3379, as you mentioned before.

For this case I suggest using a unique label per pod instance.

The label sets are unique because the "pod" label is always unique (created by Kubernetes).

I have just released v7.0.2, where entries rejected by Loki are dropped from the queue. This helps the sink deliver the next correct entries to Loki and fixes a memory bottleneck.

I will update the code with the new version, and check the results.
Thank you!

@raphaelquati
Contributor Author

raphaelquati commented Aug 31, 2021

I'm testing the new version now, but I was thinking...

I realized that Promtail doesn't have the 'out of order' problem (as discussed in grafana/loki#3379) because it uses the timestamp injected by Kubernetes (when the console log is captured), and Promtail keeps the application timestamp as a second field:

[screenshot]

"time" and "ts" have little different time values, but this guarantees the stream will never have 'out of order' logs.

As Grafana.Loki sends directly to Loki, 'out of order' error can occur because of the multithreading nature of our application.

In our case, the idea to have two timestamps is acceptable, because we don't want to lose any logline.
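
A rough sketch of what I mean (not a feature of the sink; the names here are made up for illustration): the timestamp used for Loki stream ordering is taken at send time and forced to be monotonic, while the application's own timestamp stays inside the log line, like Promtail's "time" next to our "ts":

```csharp
using System;
using System.Globalization;

public class MonotonicTimestamper
{
    private long _lastUnixNanos;

    // Returns a Loki entry timestamp (Unix nanoseconds as a string) that never
    // goes backwards, so a single stream cannot trigger "entry out of order".
    // Not thread-safe as written; a real implementation would need a lock.
    public string NextEntryTimestamp()
    {
        var nowNanos = DateTimeOffset.UtcNow.ToUnixTimeMilliseconds() * 1_000_000;
        if (nowNanos <= _lastUnixNanos)
            nowNanos = _lastUnixNanos + 1;
        _lastUnixNanos = nowNanos;
        return nowNanos.ToString(CultureInfo.InvariantCulture);
    }
}

// Usage idea: the entry timestamp orders the stream, the app timestamp stays in the line.
// var entryTs = timestamper.NextEntryTimestamp();
// var line = $"{{\"ts\":\"{logEvent.Timestamp:O}\",\"message\":\"...\"}}";
```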

@mishamyte
Member

I don't think it is possible with Loki for now.

Why the Loki team did not implement this behavior is described here.

Allowing out-of-order entries is also discussed in grafana/loki#1544.

According to the latest info there:

No immediate promises on 2.4, but we're going to try and release it quickly relative to the delay between most releases.

Also, entries are ordered inside a batch.

But in the general case I can't offer a reliable option. You could make the batch size smaller, which would reduce the chance of concurrent out-of-order batches, but to me that is a weird and unreliable workaround.
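
If you want to try the smaller-batch workaround, roughly something like this (treat the batchPostingLimit and period parameter names as illustrative and check the sink's README for the exact options in your version):

```csharp
using System;
using Serilog;
using Serilog.Sinks.Grafana.Loki;

Log.Logger = new LoggerConfiguration()
    .WriteTo.GrafanaLoki(
        "http://loki:3100",                // placeholder URI
        batchPostingLimit: 100,            // smaller batches cover a narrower time window
        period: TimeSpan.FromSeconds(1))   // flush more frequently
    .CreateLogger();
```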

@raphaelquati
Contributor Author

I understand. But the problem occurs with one pod only, as I wrote before.
And yes, it's very weird.

Why the Loki team did not implement this behavior is described here.

Yes. But only the batch is sorted, not the stream. If the next batch contains an "older" timestamp (in a high-load situation, which is our case), the 'out of order' error will show up in Loki.
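
To illustrate (the timestamps are made up):

```csharp
// Each batch is sorted internally, but nothing orders events across batches,
// so under load a later batch can still start before the previous one ended.
var batch1 = new[] { "10:00:00.100", "10:00:00.250", "10:00:00.900" }; // sorted, sent first
var batch2 = new[] { "10:00:00.400", "10:00:01.100" };                 // sorted, sent second

// batch2's first entry (10:00:00.400) is older than batch1's last entry (10:00:00.900),
// so Loki rejects it with "entry out of order" even though both batches are sorted.
```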

@mishamyte
Member

mishamyte commented Aug 31, 2021

Yep, correct.

Could you tell me how many events are logged per second? Just out of interest.

@raphaelquati
Contributor Author

raphaelquati commented Aug 31, 2021

Loki is reporting (one pod):

Staging: Max 271 log lines per second
Production: Max 901 log lines per second

@mishamyte
Member

mishamyte commented Sep 9, 2021

Released in v7.1.0-beta.0

@github-actions

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 24, 2021