
Using partition_key raises Local: Timed out errors #186

Closed
Draiken opened this issue Mar 9, 2022 · 11 comments

@Draiken

Draiken commented Mar 9, 2022

We've recently started using partition_key when producing messages and started seeing a large number of Rdkafka::RdkafkaError: Local: Timed out (timed_out) errors, even with a small number of messages.
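For context, the produce calls look roughly like this (a minimal sketch with placeholder broker/topic/key names, not our exact code):

require "rdkafka"

# Minimal sketch; broker, topic, payload and partition_key are placeholders.
config = Rdkafka::Config.new("bootstrap.servers" => "localhost:9092")
producer = config.producer

# Passing partition_key makes the client look up the topic's partition count
# via a metadata request so it can hash the key onto a partition.
handle = producer.produce(
  topic: "events",
  payload: "some payload",
  partition_key: "customer-123"
)
handle.wait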

I don't know if this value (the topic metadata) needs to be memoized (at least for a short time?), or if something else needs to be tweaked to get it working reliably.

We attempted to increase all the *.timeout.ms variables to no avail. They would take longer to raise errors, but they'd still raise them.
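For example, we bumped properties along these lines (illustrative librdkafka settings and values, not our exact config):

{
  "socket.timeout.ms"  => 120000, # default is 60000
  "message.timeout.ms" => 600000, # topic-level delivery timeout, default 300000
}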

We were using an older version of librdkafka but even after updating to 0.11.1 the error persisted.

The specific line that was raising the error was this one:
https://github.com/appsignal/rdkafka-ruby/blob/76e6221492c9dba8224412cf3f202e91574089e8/lib/rdkafka/metadata.rb#L20

I'm unsure if this is a bug in rdkafka-ruby, in librdkafka, or perhaps something misconfigured on our side. We searched the documentation for anything that could point to a culprit but couldn't find anything. The error from librdkafka is rather generic: https://github.com/edenhill/librdkafka/blob/master/INTRODUCTION.md#error-local-time-out

Any help is appreciated!

@mensfeld
Member

mensfeld commented Mar 9, 2022

Does it happen when producing messages to existing topics or to new topics (first message)?

@Draiken
Author

Draiken commented Mar 9, 2022

All existing topics. We have topic auto-creation set to false.

@Draiken
Author

Draiken commented Mar 9, 2022

To give a bit more context: we are using Heroku Kafka, so we don't have any access to logs on the Kafka side.

@geoff2k
Contributor

geoff2k commented Mar 9, 2022

Are you on the Basic plan type? Standard and Extended plans can access Kafka logs per their docs.

(Perhaps not helpful, but just clarifying that there are some mechanisms for accessing logs when using Heroku Kafka.)

@Draiken
Author

Draiken commented Mar 9, 2022

We are on Standard, but the logs have been completely useless so far. They just report consumption rates, IOPS, and things like that. Here's an example line of the only logs we get:

source=KAFKA addon=kafka-corrugated-39111 sample#load-avg-1m=0.0025 sample#load-avg-5m=0.005 sample#load-avg-15m=0 sample#read-iops=0 sample#write-iops=1.5339 sample#memory-total=16386156kB sample#memory-free=6966620kB sample#memory-cached=4228484kB sample#bytes-in-per-second=75.89 sample#bytes-out-per-second=5.74

@Draiken
Author

Draiken commented Mar 9, 2022

It just occurred to me that Heroku Kafka might be returning an error for the metadata request, but since we don't get a response within the 250ms timeout, we would never see the real error, only the timeout.
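Put differently, all we can observe on our side is the generic timeout code. A sketch of what the rescue sees (hypothetical handling, not our production code):

require "rdkafka"

producer = Rdkafka::Config.new("bootstrap.servers" => "localhost:9092").producer

begin
  producer.produce(topic: "events", payload: "...", partition_key: "customer-123").wait
rescue Rdkafka::RdkafkaError => e
  # Only the generic local timeout surfaces; any broker-side cause is hidden.
  puts e.code # => :timed_out
  puts e.to_s # => "Local: Timed out (timed_out)"
end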

According to their ACLs we should have the Describe permission on Topic, so it should not be a permissions issue.

@mensfeld
Member

mensfeld commented Mar 9, 2022

Would you mind also telling me:

  1. rate at which you send data
  2. exact config you have for production
  3. average and max size of messages
  4. does it happen without a partition key usage?

@Draiken
Author

Draiken commented Mar 10, 2022

Of course!

  1. Our producers peaked at around 60 messages per second, but we could later reproduce the error at much lower rates.
  2. This one is a bit tricky since we tried a bunch of configurations and Racecar wraps most of it. We saw this with all the default *.timeout.ms values. The only values we changed from the defaults were:
{
  "fetch.max.bytes" => 2000000,
  "auto.commit.interval.ms" => 2000,
}
  3. This varies a lot (and we had issues with all message sizes). The max size I saw was ~7 kB; the average across all topics is ~600 B.
  4. No. After tracking down the bug, we found the issue only happened after we started using partition_key, since it triggers that Metadata call, which then results in these timeouts (see the sketch after this list).
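Roughly, the reason partition_key forces that Metadata call is that the key has to be hashed onto one of the topic's partitions, which requires knowing the partition count. A conceptual sketch of that mapping (not the gem's actual implementation, which delegates to librdkafka's partitioner):

require "zlib"

# Conceptual illustration only. The partition count has to come from a
# metadata request for the topic, and that request is what was timing out.
def partition_for(partition_key, partition_count)
  Zlib.crc32(partition_key) % partition_count
end

partition_for("customer-123", 32) # => a partition index in 0..31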

@thijsc
Collaborator

thijsc commented Mar 10, 2022

This sounds like it's probably not caused by the Ruby layer, but by some interaction between librdkafka and Kafka. @leonmaia mentioned that there was a timeout in listing groups in #185. Could that maybe be related?

@leonmaia
Contributor

FWIW: there's a known issue in librdkafka around these timeouts, tracked in confluentinc/librdkafka#3705. The fix has only been released in an RC1 release, though.

@thijsc
Collaborator

thijsc commented Nov 8, 2022

This should be fixed; we are using librdkafka 1.9.2 now. Let me know if there are still issues.
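If anyone wants to double-check which librdkafka their install bundles, the gem exposes version constants (assuming the constants below are present in your gem version):

require "rdkafka"

puts Rdkafka::VERSION            # rdkafka-ruby gem version
puts Rdkafka::LIBRDKAFKA_VERSION # bundled librdkafka version, e.g. "1.9.2"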

@thijsc thijsc closed this as completed Nov 8, 2022