
Using partition_key raises Local: Timed out errors #186

Closed
Draiken opened this issue Mar 9, 2022 · 11 comments

@Draiken

Draiken commented Mar 9, 2022

We've recently started using partition_key when producing messages and started seeing a large number of Rdkafka::RdkafkaError: Local: Timed out (timed_out) errors, even with a small number of messages.
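For context, the produce calls look roughly like this (a minimal sketch with placeholder broker/topic/key names, not our exact code):

require "rdkafka"

# Minimal sketch; broker, topic, payload and partition_key are placeholders.
config = Rdkafka::Config.new("bootstrap.servers" => "localhost:9092")
producer = config.producer

# Passing partition_key makes the client look up the topic's partition count
# via a metadata request so it can hash the key onto a partition.
handle = producer.produce(
  topic: "events",
  payload: "some payload",
  partition_key: "customer-123"
)
handle.wait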

I don't know if this value (the topic metadata) needs to be memoized (at least for a short time?), or if something else needs to be tweaked to get it working reliably.

We attempted to increase all the *.timeout.ms variables to no avail. They would take longer to raise errors, but they'd still raise them.
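For example, we bumped properties along these lines (illustrative librdkafka settings and values, not our exact config):

{
  "socket.timeout.ms"  => 120000, # default is 60000
  "message.timeout.ms" => 600000, # topic-level delivery timeout, default 300000
}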

We were using an older version of librdkafka but even after updating to 0.11.1 the error persisted.

The specific line that was raising the error was this one:
https://github.com/appsignal/rdkafka-ruby/blob/76e6221492c9dba8224412cf3f202e91574089e8/lib/rdkafka/metadata.rb#L20

I'm unsure if this is a bug in rdkafka-ruby, in librdkafka, or perhaps something misconfigured on our side. We searched the documentation for anything that could point to a culprit but couldn't find anything. The error from librdkafka is rather generic: https://github.com/edenhill/librdkafka/blob/master/INTRODUCTION.md#error-local-time-out

Any help is appreciated!

@mensfeld
Member

mensfeld commented Mar 9, 2022

Does it happen when producing messages to existing topics or to new topics (first message)?

@Draiken
Author

Draiken commented Mar 9, 2022

All existing topics. We have topic auto-creation set to false.

@Draiken
Author

Draiken commented Mar 9, 2022

To give a bit more context: we are using Heroku Kafka, so we don't have any access to logs on the Kafka side.

@geoff2k
Contributor

geoff2k commented Mar 9, 2022

Are you on the Basic plan type? Standard and Extended plans can access Kafka logs per their docs.

(Perhaps not helpful, but just clarifying that there are some mechanisms for accessing logs when using Heroku Kafka.)

@Draiken
Author

Draiken commented Mar 9, 2022

We are on Standard, but the logs have been completely useless so far. They just report consumption rates, IOPS, and things like that. Here's an example line of the only logs we get:

source=KAFKA addon=kafka-corrugated-39111 sample#load-avg-1m=0.0025 sample#load-avg-5m=0.005 sample#load-avg-15m=0 sample#read-iops=0 sample#write-iops=1.5339 sample#memory-total=16386156kB sample#memory-free=6966620kB sample#memory-cached=4228484kB sample#bytes-in-per-second=75.89 sample#bytes-out-per-second=5.74

@Draiken
Author

Draiken commented Mar 9, 2022

It just occurred to me that Heroku Kafka might be returning an error for the metadata request, but since we don't get a response within the 250ms timeout, we would never see the real error, only the timeout.
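Put differently, all we can observe on our side is the generic timeout code. A sketch of what the rescue sees (hypothetical handling, not our production code):

require "rdkafka"

producer = Rdkafka::Config.new("bootstrap.servers" => "localhost:9092").producer

begin
  producer.produce(topic: "events", payload: "...", partition_key: "customer-123").wait
rescue Rdkafka::RdkafkaError => e
  # Only the generic local timeout surfaces; any broker-side cause is hidden.
  puts e.code # => :timed_out
  puts e.to_s # => "Local: Timed out (timed_out)"
end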

According to their ACLs we should have the Describe permission on Topic, so it should not be a permissions issue.

@mensfeld
Member

mensfeld commented Mar 9, 2022

Would you mind also telling me:

  1. rate at which you send data
  2. exact config you have for production
  3. average and max size of messages
  4. does it happen without a partition key usage?

@Draiken
Author

Draiken commented Mar 10, 2022

Of course!

  1. Our producers peaked at around 60 messages per second, but we could later reproduce the error at much lower rates.
  2. This one is a bit tricky since we tried a bunch of configurations and Racecar wraps most of it. We saw this with all the default *.timeout.ms values. The only values we changed from the defaults were:
{
  "fetch.max.bytes" => 2000000,
  "auto.commit.interval.ms" => 2000,
}
  3. This varies a lot (and we had issues with all message sizes). The max size I saw was ~7 kB; the average across all topics is ~600 B.
  4. No. After tracking down the bug, we found the issue only happened after we started using partition_key, since it triggers that Metadata call, which then results in these timeouts (see the sketch after this list).
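Roughly, the reason partition_key forces that Metadata call is that the key has to be hashed onto one of the topic's partitions, which requires knowing the partition count. A conceptual sketch of that mapping (not the gem's actual implementation, which delegates to librdkafka's partitioner):

require "zlib"

# Conceptual illustration only. The partition count has to come from a
# metadata request for the topic, and that request is what was timing out.
def partition_for(partition_key, partition_count)
  Zlib.crc32(partition_key) % partition_count
end

partition_for("customer-123", 32) # => a partition index in 0..31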

@thijsc
Collaborator

thijsc commented Mar 10, 2022

This sounds like it's probably not caused by the Ruby layer, but by some interaction between librdkafka and Kafka. @leonmaia mentioned that there was a timeout in listing groups in #185. Could that maybe be related?

@leonmaia
Contributor

FWIW: there's a known issue in librdkafka around these timeouts, tracked in confluentinc/librdkafka#3705. The fix has only been released in an RC1 release, though.

@thijsc
Collaborator

thijsc commented Nov 8, 2022

This should be fixed; we are using librdkafka 1.9.2 now. Let me know if there are still issues.
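If anyone wants to double-check which librdkafka their install bundles, the gem exposes version constants (assuming the constants below are present in your gem version):

require "rdkafka"

puts Rdkafka::VERSION            # rdkafka-ruby gem version
puts Rdkafka::LIBRDKAFKA_VERSION # bundled librdkafka version, e.g. "1.9.2"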

@thijsc thijsc closed this as completed Nov 8, 2022