
Sink stops sending logs after Loki error 'entry out of order' (HTTP status 400) #49

Closed
raphaelquati opened this issue Aug 18, 2021 · 18 comments · Fixed by #55
Labels
bug

Comments

@raphaelquati
Contributor

raphaelquati commented Aug 18, 2021

When our application is under some load (a lot of broker messages to consume, using Rebus worker threads), the sink suddenly stops sending logs to Loki.

To investigate, we configured the application with two sinks enabled, console and this sink, and set up Promtail to ship the console log to Loki.
We found this line on Loki log:

level=warn ts=2021-08-17T21:57:55.768341371Z caller=grpc_logging.go:38 method=/logproto.Pusher/Push duration=1.39409ms err="rpc error: code = Code(400) desc = entry with timestamp 2021-08-17 21:56:54.73 +0000 UTC ignored, reason: 'entry out of order' for stream: {SourceContext=\"Rebus.Retry.ErrorTracking.InMemErrorTracker\", app=\"myapp\", container=\"myapp\", errorNumber=\"1\", messageId=\"52xxb042-xxaa-xxxx-813a-31fd6810bb25\", namespace=\"default\", node_name=\"aks-agentpool\", pod=\"myapp-7bd5d4fbcb-gjvxm\"},\ntotal ignored: 1 out of 4" msg="gRPC\n"

After this error, the sink does not recover and stops sending logs to Loki.

A similar problem (entry out of order) is discussed here.

We are using version 6.0.1
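
For reference, a minimal sketch of the two-sink setup described above (not our exact configuration; the Loki URI is a placeholder and it assumes the Serilog console sink package is installed):

```csharp
using Serilog;
using Serilog.Sinks.Grafana.Loki;

// Console output is scraped by Promtail and shipped to Loki;
// the Grafana Loki sink pushes the same events directly to Loki.
Log.Logger = new LoggerConfiguration()
    .Enrich.FromLogContext()
    .WriteTo.Console()
    .WriteTo.GrafanaLoki("http://loki:3100") // placeholder URI
    .CreateLogger();

Log.Information("Hello from both sinks");
```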

@mishamyte added the invalid label Aug 18, 2021
@mishamyte
Member

Hi, @raphaelquati!

As far as I can see, there are two problems here: the sink recovery issue and the 'entry out of order' case.

I will investigate both cases.

Could you please tell me whether this entry rejection happens for the same label set or for different ones?

@raphaelquati
Contributor Author

Hi, @mishamyte,

It happens with different label sets, but one particular label always changes: "pod".
The application is running at a k8s cluster, with auto-scaling enabled. Different pods (same application) suffer from the same problem.

https://gist.github.com/raphaelquati/0726f081a49e8d55e51314ef2ec86e3c

@mishamyte
Member

@raphaelquati, could you also give me some more info: some logs with the label pod="myapp-qa-7bd5d4fbcb-xttkn"?

Both successfully delivered and failed entries should be in that range.

@raphaelquati
Contributor Author

This pod started at around 08:40 and stopped at around 08:50.

[screenshot]

And I've found one error in Loki

[screenshot]

@raphaelquati
Contributor Author

Before the error in Loki, the sink was sending logs to Loki correctly.

Yellow lines (level detected): the sink sent the logs correctly
Gray lines (unknown level): console log (Promtail)

[screenshot]

@raphaelquati
Contributor Author

Last successful log sent:

[screenshot]

@mishamyte
Member

Thank you, @raphaelquati!

There are national holidays in my country, so I will return to the investigation next week, after the 25th of August.

@mishamyte
Member

Hi again @raphaelquati,

I just did some quick checks.

  1. The entry out of order error is produced by Loki. It happens when entries with the same label set do not arrive in timestamp order.

This behavior is by design and is discussed in grafana/loki#3379, as you mentioned before.

For this case I suggest using a unique label per pod instance (see the sketch below).

  2. Also, I tried several scenarios to stop the sink from working and could not reproduce such behavior. So I would be grateful if you have a reproduction.
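
To illustrate the unique-label suggestion from item 1, a rough sketch (the labels parameter name and the LokiLabel { Key, Value } shape are taken from memory of the README and may differ between sink versions; in Kubernetes the HOSTNAME variable usually holds the pod name):

```csharp
using System;
using System.Collections.Generic;
using Serilog;
using Serilog.Sinks.Grafana.Loki;

// HOSTNAME resolves to the pod name inside a Kubernetes pod,
// so every pod instance gets its own Loki stream.
var podName = Environment.GetEnvironmentVariable("HOSTNAME") ?? Environment.MachineName;

Log.Logger = new LoggerConfiguration()
    .WriteTo.GrafanaLoki(
        "http://loki:3100", // placeholder URI
        labels: new List<LokiLabel>
        {
            new LokiLabel { Key = "app", Value = "myapp" },
            new LokiLabel { Key = "pod", Value = podName } // unique per pod instance
        })
    .CreateLogger();
```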

@mishamyte
Member

mishamyte commented Aug 29, 2021

Hi @raphaelquati,

Good news: we have found a problem in the logic that creates a bottleneck for log events.

It is described in #52.

I will provide a fix, and I think it could help in your case.

@mishamyte added the bug label and removed the invalid label Aug 29, 2021
@mishamyte
Member

Hi, @raphaelquati!

I have just released v7.0.2, where entries rejected by Loki are dropped from the queue. This helps the sink deliver the next correct entries to Loki and fixes a memory bottleneck.

I hope you will find this useful in your situation!
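
The idea, as a very rough sketch (this is not the sink's actual code, only an illustration of the drop-on-reject behavior; the endpoint is Loki's standard /loki/api/v1/push on a placeholder host):

```csharp
using System.Net;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

public class LokiBatchSender
{
    private readonly HttpClient _httpClient = new HttpClient();
    private const string PushEndpoint = "http://loki:3100/loki/api/v1/push"; // placeholder host

    public async Task SendBatchAsync(string batchJson)
    {
        var content = new StringContent(batchJson, Encoding.UTF8, "application/json");
        var response = await _httpClient.PostAsync(PushEndpoint, content);

        if (response.StatusCode == HttpStatusCode.BadRequest)
        {
            // Loki rejected the batch (e.g. "entry out of order"); retrying cannot
            // succeed, so the entries are dropped instead of blocking the queue
            // and growing memory.
            return;
        }

        response.EnsureSuccessStatusCode(); // other failures may be retried by the caller
    }
}
```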

@raphaelquati
Contributor Author

raphaelquati commented Aug 30, 2021

Hi @mishamyte,

I just did some quick checks.

  1. The entry out of order error is produced by Loki. It happens when entries with the same label set do not arrive in timestamp order.

This behavior is by design and is discussed in grafana/loki#3379, as you mentioned before.

For this case I suggest using a unique label per pod instance.

The label sets are unique because the "pod" label is always unique (created by Kubernetes).

I have just released v7.0.2, where entries rejected by Loki are dropped from the queue. This helps the sink deliver the next correct entries to Loki and fixes a memory bottleneck.

I will update the code with the new version, and check the results.
Thank you!

@raphaelquati
Contributor Author

raphaelquati commented Aug 31, 2021

I'm testing the new version now, but I was thinking...

I realized that Promtail doesn't have the 'out of order' problem (as discussed in grafana/loki#3379) because it uses the timestamp injected by Kubernetes (when the console log is captured), and Promtail keeps the application timestamp as a second field:

[screenshot]

"time" and "ts" have little different time values, but this guarantees the stream will never have 'out of order' logs.

As Grafana.Loki sends directly to Loki, 'out of order' error can occur because of the multithreading nature of our application.

In our case, the idea to have two timestamps is acceptable, because we don't want to lose any logline.
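
A rough sketch of what I mean (not a feature of the sink; the names here are made up for illustration): the timestamp used for Loki stream ordering is taken at send time and forced to be monotonic, while the application's own timestamp stays inside the log line, like Promtail's "time" next to our "ts":

```csharp
using System;
using System.Globalization;

public class MonotonicTimestamper
{
    private long _lastUnixNanos;

    // Returns a Loki entry timestamp (Unix nanoseconds as a string) that never
    // goes backwards, so a single stream cannot trigger "entry out of order".
    // Not thread-safe as written; a real implementation would need a lock.
    public string NextEntryTimestamp()
    {
        var nowNanos = DateTimeOffset.UtcNow.ToUnixTimeMilliseconds() * 1_000_000;
        if (nowNanos <= _lastUnixNanos)
            nowNanos = _lastUnixNanos + 1;
        _lastUnixNanos = nowNanos;
        return nowNanos.ToString(CultureInfo.InvariantCulture);
    }
}

// Usage idea: the entry timestamp orders the stream, the app timestamp stays in the line.
// var entryTs = timestamper.NextEntryTimestamp();
// var line = $"{{\"ts\":\"{logEvent.Timestamp:O}\",\"message\":\"...\"}}";
```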

@mishamyte
Member

I don't think it is possible with Loki for now.

Why the Loki team did not implement this behavior is described here.

Allowing out-of-order entries is also discussed in grafana/loki#1544.

According to the latest info there:

No immediate promises on 2.4, but we're going to try and release it quickly relative to the delay between most releases.

Also, entries are ordered inside a batch.

But in the general case I can't offer a reliable option. You could make the batch size smaller, which would reduce the chance of concurrent out-of-order batches, but to me that is a weird and unreliable workaround.
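
If you want to try the smaller-batch workaround, roughly something like this (treat the batchPostingLimit and period parameter names as illustrative and check the sink's README for the exact options in your version):

```csharp
using System;
using Serilog;
using Serilog.Sinks.Grafana.Loki;

Log.Logger = new LoggerConfiguration()
    .WriteTo.GrafanaLoki(
        "http://loki:3100",                // placeholder URI
        batchPostingLimit: 100,            // smaller batches cover a narrower time window
        period: TimeSpan.FromSeconds(1))   // flush more frequently
    .CreateLogger();
```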

@raphaelquati
Contributor Author

I understand. But the problem occurs with one pod only, as I wrote before.
And yes, it's very weird.

Why the Loki team did not implement this behavior is described here.

Yes. But only the batch is sorted, not the stream. If the next batch contains an "older" timestamp (in a high-load situation, which is our case), the 'out of order' error will show up in Loki.
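
To illustrate (the timestamps are made up):

```csharp
// Each batch is sorted internally, but nothing orders events across batches,
// so under load a later batch can still start before the previous one ended.
var batch1 = new[] { "10:00:00.100", "10:00:00.250", "10:00:00.900" }; // sorted, sent first
var batch2 = new[] { "10:00:00.400", "10:00:01.100" };                 // sorted, sent second

// batch2's first entry (10:00:00.400) is older than batch1's last entry (10:00:00.900),
// so Loki rejects it with "entry out of order" even though both batches are sorted.
```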

@mishamyte
Member

mishamyte commented Aug 31, 2021

Yep, correct.

Could you tell me how many events are logged per second? Just out of interest.

@raphaelquati
Contributor Author

raphaelquati commented Aug 31, 2021

Loki is reporting (one pod):

Staging: Max 271 log lines per second
Production: Max 901 log lines per second

@mishamyte
Member

mishamyte commented Sep 9, 2021

Released in v7.1.0-beta.0

@github-actions

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 24, 2021