Out-of-order errors with log streams from multiple services despite unique labels #898

Closed
laitalaj opened this issue Aug 15, 2019 · 31 comments · Fixed by #1247

Comments

@laitalaj
Contributor

Describe the bug
We are getting some "400 entry out of order for stream" errors in our multiple-sending-hosts, multiple-sending-processes setup, despite every host-process combination having unique labels.

According to this comment, we should be able to have multiple parallel log streams getting pushed into Loki as long as the labels for these streams differ.

Our setup currently consists of

  1. Fluent Bit agents on our servers, forwarding logs to a load balancer
  2. the load balancer forwarding the logs to two Fluentd instances
  3. each Fluentd instance using two threads to process the logs and forward them to Loki using fluent-plugin-grafana-loki

In each log, we include both the hostname of the Fluentd instance and the thread ID as labels. Therefore, each parallel stream should have unique labels, and out-of-order errors shouldn't happen. That is not the case, though - we do get out-of-order errors when we use two hosts with two threads each. For some strange reason, however, the errors don't happen when using only one host with two threads (i.e. when skipping the load balancing).

As building a setup as complex as ours for reproducing this isn't feasible, I wrote a script that manages to reproduce a similar scenario locally. It uses just plain protobufs to send logs to /api/prom/push, so this doesn't seem to be a Fluentd/Fluentd-plugin problem. See the section below for reproduction instructions.

To Reproduce

  1. Start the loki/production stack with docker-compose (Loki version: master-78c6aa49; the other parts of the stack are not used and not relevant)
  2. Run the lokitoy script, found at https://github.com/laitalaj/lokitoy, according to the instructions in the README
  3. Notice the errors in the Loki logs and in the script output

Expected behavior
The script makes sure that each parallel logging process sends its log stream in order, and that the log streams from each process have unique label sets. According to this comment, there should not be any out-of-order errors in this case.

Environment:

  • Production: Fluentd with Loki plugin, forwarding to a Loki running on Docker in a cloud hosting service
  • Local reproduction: Python script sending mock logs to a Loki running on Docker in localhost

Screenshots, promtail config, or terminal output
Some error lines from the Loki logs:

loki_1      | level=warn ts=2019-08-15T08:16:09.0429259Z caller=grpc_logging.go:37 method=/logproto.Pusher/Push duration=659.1µs err="rpc error: code = Code(400) desc = entry out of order for stream: {app=\"m\", uniq0=\"0\", uniq1=\"0\"}" msg="gRPC\n"
loki_1      | level=warn ts=2019-08-15T08:16:10.5270068Z caller=grpc_logging.go:37 duration=289.4µs method=/logproto.Pusher/Push err="rpc error: code = Code(400) desc = entry out of order for stream: {app=\"l\", uniq0=\"1\", uniq1=\"1\"}" msg="gRPC\n"

Some findings
I've poked around a bit with the lokitoy script, and catalogued some behavior in https://github.com/laitalaj/lokitoy#some-findings

@cyriltovena
Contributor

Hey @laitalaj ,

Thanks for the script. I ran it on my machine, and after debugging, it simply sends entries out of order.

For a given unique stream, the max timestamp was 1566905035753341913, and the new entry that it was trying to add had timestamp 1566905035280495167.

So you're simply sending out-of-order entries, which is not possible in Loki.

@laitalaj
Contributor Author

@cyriltovena , could you elaborate a bit on how you figured out that the entries are being sent out of order for a given unique stream? If that is indeed the case, there's some bug in my script, and I'd like to reproduce and fix it to get a clean reproduction of the problem in Loki (as the issue still stands in the production setup that I described in the OP).

@laitalaj
Contributor Author

I quickly built a lightweight Flask app called lokimock that just receives pushes from lokitoy and checks that the timestamps for uniquely labeled streams are in order. That, at least, didn't catch anything out of order from the script. I also added an option for making lokitoy send entries deliberately out of order, and lokimock did catch the out-of-order entries in that case.

Therefore, it seems a lot like lokitoy is indeed sending entries in order for all the unique label sets - maybe whatever made it look out of order to you can shed some light on what's actually going on here, @cyriltovena
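
To illustrate the kind of check lokimock does, here's a minimal Go sketch of the idea (the real lokimock is a small Flask app, so all names here are made up; it just tracks the newest timestamp seen per unique label set and flags anything older):

package main

import (
	"fmt"
	"time"
)

// orderChecker remembers the newest timestamp seen per unique label set
// (serialized to a string key) and flags anything that goes backwards.
type orderChecker struct {
	last map[string]time.Time
}

func newOrderChecker() *orderChecker {
	return &orderChecker{last: map[string]time.Time{}}
}

// observe returns an error if ts is older than the newest timestamp already
// seen for the given label set; otherwise it records ts.
func (c *orderChecker) observe(labelSet string, ts time.Time) error {
	if prev, ok := c.last[labelSet]; ok && ts.Before(prev) {
		return fmt.Errorf("out of order for %s: %v is older than %v", labelSet, ts, prev)
	}
	c.last[labelSet] = ts
	return nil
}

func main() {
	c := newOrderChecker()
	base := time.Now()
	fmt.Println(c.observe(`{app="l", uniq0="0", uniq1="1"}`, base))                  // <nil>
	fmt.Println(c.observe(`{app="l", uniq0="0", uniq1="1"}`, base.Add(time.Second))) // <nil>
	fmt.Println(c.observe(`{app="l", uniq0="0", uniq1="1"}`, base))                  // out-of-order error
}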

@cyriltovena
Contributor

@laitalaj I ran your script and debugged Loki; I put a breakpoint where this error happens, and the timestamp received was earlier than the previous one.

@laitalaj
Contributor Author

It makes sense that, when the error happens, the timestamp shown as just received is older than one received earlier - that's what the error means. However, I'm fairly certain that the incoming streams with unique labels are in reality in order and that the mix-ups happen in Loki instead of in the reproduction script:

  • That lokimock thing that I created yesterday would notice if the script was sending stuff out-of-order
  • As detailed in the findings section of lokitoy's readme, changing the value of "shared_labels" in lokitoy to, for example, { "app": ['a', 'b', 'c', 'd', 'e', 'f'] }, prevents the errors from happening in Loki. If the script were actually sending entries out of order, there would be out-of-order errors regardless of the shared labels available.

Therefore, I think what's happening in Loki should be investigated further, @cyriltovena . For example, what's different when the value of the "app" label can be 'a', 'b', 'c', 'd', 'e' or 'f' (which results in no errors) compared to when it can be 'h', 'i', 'j', 'k', 'l' or 'm' (which results in out-of-order errors)?

@cyriltovena
Contributor

I will give it another look for sure.

@cyriltovena
Contributor

Are you waiting for a response before pushing a new batch? If you don't wait for a response, nothing stops one request from being processed before another.

@laitalaj
Contributor Author

@cyriltovena , yes. The log batch is sent at https://github.com/laitalaj/lokitoy/blob/master/lokitoy.py#L117 , using the requests library. Requests is blocking, so the thread waits for a response there before continuing (and lokitoy even checks the response code and content before starting to build the next batch).

@laitalaj
Contributor Author

The Loki plugin for Fluentd, which we use in the production stack where we originally faced this problem, also seems to wait for a response correctly.

@stale

stale bot commented Oct 26, 2019

This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale A stale issue or PR that will automatically be closed. label Oct 26, 2019
@stale stale bot closed this as completed Nov 2, 2019
@jgehrcke
Contributor

jgehrcke commented Nov 6, 2019

@cyriltovena can we please re-open this issue? It seems to be quite relevant.

@laitalaj thank you for the debugging work, and all the details you have provided here. Quite insightful, and indeed a bit frightening. How did you work around this problem in your production environment?

@woodsaj since you are the author of #168 (comment) (which is challenged by this issue here) -- can you maybe chime in and assess the situation? Thank you!

@laitalaj
Contributor Author

laitalaj commented Nov 7, 2019

@jgehrcke , we added a third Fluentd instance, and that fixed the problems 😅 That's pretty strange, but consistent with the findings that only very specific label sets cause the problems 🤔

@cyriltovena
Contributor

Sure, but please add your details then; this is a hard problem to solve, so give us as much input as possible on how it is affecting you.

@cyriltovena cyriltovena reopened this Nov 7, 2019
@stale stale bot removed the stale A stale issue or PR that will automatically be closed. label Nov 7, 2019
@pstibrany
Member

pstibrany commented Nov 11, 2019

@laitalaj Thanks a lot for providing the simple Python script. It helped to locate the issue.

The bug is in Loki, which uses the very simple "FastFingerprint" hash function to generate a key into the map of streams. You can probably see where this is going... this hash function is prone to collisions and creates the same fingerprint for different sets of labels.

Some examples:

  • Both (app=l,uniq0=0,uniq1=1) and (app=m,uniq0=1,uniq1=1) have fingerprint e002a3a451262627
  • Both (app=l,uniq0=1,uniq1=0) and (app=m,uniq0=0,uniq1=0) hash to e002a3a451262247
  • Both (app=l,uniq0=0,uniq1=0) and (app=m,uniq0=1,uniq1=0) hash to e002a2a4512624f4

All these label sets are generated by your script, thanks!

Loki effectively treats different label sets with the same fingerprint as the same series, which then leads to out-of-order errors.
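
You can reproduce the collision outside of Loki with the prometheus/common model package; a quick sketch, assuming model.LabelSet.FastFingerprint produces the same values as Loki's client.FastFingerprint:

package main

import (
	"fmt"

	"github.com/prometheus/common/model"
)

func main() {
	// Two of the label sets from the script that are reported to collide
	// under the fast hash.
	a := model.LabelSet{"app": "l", "uniq0": "0", "uniq1": "1"}
	b := model.LabelSet{"app": "m", "uniq0": "1", "uniq1": "1"}

	// FastFingerprint is the collision-prone hash; Fingerprint is the slower,
	// sorted variant that is much harder to collide.
	fmt.Println("fast:", a.FastFingerprint(), b.FastFingerprint(),
		"collide:", a.FastFingerprint() == b.FastFingerprint())
	fmt.Println("full:", a.Fingerprint(), b.Fingerprint(),
		"collide:", a.Fingerprint() == b.Fingerprint())
}

Running something like this should show the two label sets sharing a fast fingerprint while their full fingerprints differ.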

/cc @cyriltovena

@spahl

spahl commented Nov 11, 2019

Nice :-) I did the same investigation yesterday and landed on the same conclusions. Label sets with few or short labels are especially prone to collisions.

Swapping FastFingerprint for Fingerprint solves this issue.

diff --git a/pkg/ingester/instance.go b/pkg/ingester/instance.go
index a10fe6e9..76a8e123 100644
--- a/pkg/ingester/instance.go
+++ b/pkg/ingester/instance.go
@@ -86,7 +86,7 @@ func (i *instance) consumeChunk(ctx context.Context, labels []client.LabelAdapte
     i.streamsMtx.Lock()
     defer i.streamsMtx.Unlock()
 
-    fp := client.FastFingerprint(labels)
+    fp := client.Fingerprint(client.FromLabelAdaptersToLabels(labels))
     stream, ok := i.streams[fp]
     if !ok {
         stream = newStream(fp, labels, i.blockSize)
@@ -136,7 +136,7 @@ func (i *instance) Push(ctx context.Context, req *logproto.PushRequest) error {
 }
 
 func (i *instance) getOrCreateStream(labels []client.LabelAdapter) (*stream, error) {
-    fp := client.FastFingerprint(labels)
+    fp := client.Fingerprint(client.FromLabelAdaptersToLabels(labels))
 
     stream, ok := i.streams[fp]
     if ok {

@pstibrany
Member

Swapping FastFingerprint for Fingerprint solves this issue.

There is still a chance for collisions though :)

@spahl

spahl commented Nov 11, 2019

There is still a chance for collisions though :)

Under what conditions?

@pstibrany
Member

There is still a chance for collisions though :)

Under what conditions?

Collisions are just a fact of life with hash functions (unless you use strong cryptographic hash functions, where collisions are hard to find, but those also use a much larger space of hash values). A proper solution needs to take that into account.
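
To make "taking it into account" concrete: one way is to keep a small mapping layer that notices when a second, different label set arrives with an already-claimed raw fingerprint and assigns it a fresh synthetic one. That is roughly the idea behind Cortex's fpMapper; the sketch below is illustrative only, with made-up names, not the actual Cortex/Loki code:

package main

import (
	"fmt"
	"sync"

	"github.com/prometheus/common/model"
)

// fpMapperSketch resolves raw-fingerprint collisions by handing any
// additional label set that hashes to an already-claimed value a fresh
// synthetic fingerprint.
type fpMapperSketch struct {
	mtx       sync.Mutex
	owners    map[model.Fingerprint]map[string]model.Fingerprint // raw fp -> label-set key -> final fp
	nextSynth model.Fingerprint
}

func newFpMapperSketch() *fpMapperSketch {
	return &fpMapperSketch{
		owners:    map[model.Fingerprint]map[string]model.Fingerprint{},
		nextSynth: model.Fingerprint(1) << 63, // assumption: reserve high values for synthetic fps
	}
}

// mapFP returns a fingerprint that is unique per label set even when the raw
// (collision-prone) fingerprint is shared by several different label sets.
func (m *fpMapperSketch) mapFP(raw model.Fingerprint, ls model.LabelSet) model.Fingerprint {
	key := ls.String()

	m.mtx.Lock()
	defer m.mtx.Unlock()

	byLabels, ok := m.owners[raw]
	if !ok {
		// First label set seen for this raw fingerprint keeps it unchanged.
		m.owners[raw] = map[string]model.Fingerprint{key: raw}
		return raw
	}
	if fp, ok := byLabels[key]; ok {
		return fp // already mapped earlier
	}
	// Collision: a different label set already claimed this raw fingerprint.
	fp := m.nextSynth
	m.nextSynth++
	byLabels[key] = fp
	return fp
}

func main() {
	m := newFpMapperSketch()
	a := model.LabelSet{"app": "l", "uniq0": "0", "uniq1": "1"}
	b := model.LabelSet{"app": "m", "uniq0": "1", "uniq1": "1"}

	// These two label sets are reported above to share a fast fingerprint;
	// the mapper keeps them apart by giving the second one a synthetic value.
	fmt.Println(m.mapFP(a.FastFingerprint(), a))
	fmt.Println(m.mapFP(b.FastFingerprint(), b))
}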

@spahl

spahl commented Nov 11, 2019

I fully agree, but Fingerprint is used a lot in Prometheus and works fine for that case. FastFingerprint, on the other hand, is clearly called out as prone to collisions in the code (https://github.com/prometheus/common/blob/b5fe7d854c42dc7842e48d1ca58f60feae09d77b/model/labelset.go#L147).

Do you have a better solution in mind? One could switch to using a big string of all the key/value pairs in the label set, but that means a lot of string comparisons.

In the meantime, simply switching to Fingerprint would provide an immediate solution until something better comes along (since right now it is clearly broken).

@pstibrany
Member

@spahl

spahl commented Nov 11, 2019

Ah cool I did not know about the mapper. Thanks!

@pstibrany
Member

Ah cool I did not know about the mapper. Thanks!

I just learned about it an hour ago myself :-)

@laitalaj
Contributor Author

Great job finding the culprit @pstibrany @spahl , and good to hear that my little script proved useful!

pstibrany added a commit that referenced this issue Nov 12, 2019
Uses slightly adapted fpMapper code from Cortex.

Fixes issue #898

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>
@mr-karan
Contributor

mr-karan commented Dec 2, 2019

+1 I faced the same issue while using Fluent Bit as a DaemonSet with the configuration key AutoKubernetesLabels set to true. Is there a fix planned for this?

@cyriltovena
Contributor

cyriltovena commented Dec 2, 2019

Fluent Bit uses the same client code as Promtail, and the fix is on the Loki side, so it should cover Fluent Bit as well.

cyriltovena pushed a commit that referenced this issue Dec 5, 2019

* pkg/ingester: handle labels mapping to the same fast fingerprint.

Uses slightly adapted fpMapper code from Cortex.

Fixes issue #898

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* empty commit to force new build

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* Removed empty lines in imports.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* Added test that pushes entries concurrently.

To be run with -race.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* Stream now keeps both original and mapped fingerprint

Mapped fingerprint is used in streams map.
Original ("rawFP") is preserved when saving chunks.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* Vendoring

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* Reverted previous commit (keep the test, with updated checks).

Preserving raw fingerprints could lead to data loss when
doing chunk deduplication while storing chunks. We don't want
that.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* go mod vendor

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>
@chancez
Contributor

chancez commented Dec 19, 2019

fluent/fluent-bit#1746 seems related

@chancez
Contributor

chancez commented Dec 19, 2019

FWIW, I've seen a minor improvement in this situation (but still get out-of-order errors after a period of time) by using the timekey buffer option with Fluentd: https://docs.fluentd.org/configuration/buffer-section#time.

I'm currently looking at using namespace, container, and pod name plus time as the buffer chunk keys, so that Fluentd chunks are buffered per source and per time window:

<match kube.**>
  @type loki
  url "http://loki.loki:3100"
  extra_labels {"env":"dev"}

  <label>
    fluentd_worker
    container $.kubernetes.container_name
    namespace $.kubernetes.namespace_name
    pod $.kubernetes.pod_name
  </label>

  <buffer time, tag, $.kubernetes.container_name, $.kubernetes.namespace_name, $.kubernetes.pod_name, fluentd_worker>
    timekey 180
    timekey_wait 300
    chunk_limit_size 60MB
  </buffer>

</match>

This config produces a chunk per pod where each chunk contains 180 seconds of data for a given time frame, ensuring ordering, and it waits 300 seconds before flushing, allowing logs to arrive late before we submit to Loki.

So far with these values it seems to be working.

Also of note: I'm running fluent-bit -> fluentd aggregator -> Loki.

@srstsavage
Contributor

srstsavage commented Jan 9, 2020

@laitalaj Sorry to necropost, but can you elaborate on how you add the thread ID as a label on your Fluentd instances?

In each log, we include both the hostname of the Fluentd instance and the thread ID as labels.

I assume you're running Fluentd in multi-worker mode and setting the worker ID as described in the Loki plugin? Or have you found some way to include a buffer thread ID when setting flush_thread_count > 1 in the buffer section? I'm looking for the latter and haven't found a way.

@laitalaj
Contributor Author

@shane-axiom , yeah, I meant worker ID as a label, exactly as described in the plugin readme. We haven't run into a situation where we'd need multiple flush threads per worker yet.

@maxisam

maxisam commented Jul 28, 2021

I am using loki 2.2.1 and I still see this issue.

@jgehrcke
Contributor

I am using loki 2.2.1 and I still see this issue.

Well, there's a real-world scenario here where the error is perfectly legit: when one tries to add a sample (log entry) older than the newest one in the system for the same log stream (same label set). Did you verify that in your case it isn't simply that this happened? :)
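
If anyone wants to check that quickly against their own instance, here's a rough Go sketch that pushes two entries for the same label set with a decreasing timestamp via the JSON push API (assuming a local Loki at localhost:3100 and a version that still rejects out-of-order writes; the endpoint, label name, and expected responses are assumptions, not taken from this thread):

package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"time"
)

// push sends one log line for a fixed stream at the given timestamp via
// Loki's JSON push API (assumed to be /loki/api/v1/push on a local instance).
func push(ts time.Time, line string) error {
	body := fmt.Sprintf(
		`{"streams":[{"stream":{"app":"ordertest"},"values":[["%d",%q]]}]}`,
		ts.UnixNano(), line)

	resp, err := http.Post("http://localhost:3100/loki/api/v1/push",
		"application/json", bytes.NewBufferString(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	msg, _ := io.ReadAll(resp.Body)
	fmt.Printf("status=%d body=%s\n", resp.StatusCode, msg)
	return nil
}

func main() {
	now := time.Now()
	push(now, "first entry")                 // expected: 204 No Content
	push(now.Add(-time.Minute), "older one") // expected: 400, entry out of order
}

The first push should be accepted and the second rejected with a 400 "entry out of order" style error, which is the legitimate case described above.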
