librdkafka: poor performance of Redpanda (compared to Kafka) because of ApiVersionRequest timeouts #25269

Open
mdorier opened this issue Mar 6, 2025 · 2 comments

Comments


mdorier commented Mar 6, 2025

I'm using librdkafka in a C++ program to compare the performance of Redpanda and Kafka in a number of setups. The test case uses a topic with 32 partitions (across 2 Redpanda servers). The client application consists of 32 processes, all on the same 64-core machine. Each process is assigned one partition via rd_kafka_assign and then polls with rd_kafka_consume_batch_queue (100 ms timeout). The batch size is 1000 messages, and each message is 2 KB.

With Kafka, this code takes about 4.5 seconds to consume 1 million events per partition (32 million in aggregate). With Redpanda, performance is disastrous: while some processes start consuming right away, others take several seconds (sometimes up to 2 minutes) to get started, and print the following on stderr:

%5|1741270587.252|REQTMOUT|rdkafka#consumer-2| [thrd:x3005c0s37b1n0:9092/bootstrap]: x3005c0s37b1n0:9092/bootstrap: Timed out ApiVersionRequest in flight (after 10007ms, timeout #0)
%4|1741270587.252|FAIL|rdkafka#consumer-2| [thrd:x3005c0s37b1n0:9092/bootstrap]: x3005c0s37b1n0:9092/bootstrap: ApiVersionRequest failed: Local: Timed out: probably due to broker version < 0.10 (see api.version.request configuration) (after 10007ms in state APIVERSION_QUERY)
%3|1741270588.246|FAIL|rdkafka#consumer-2| [thrd:x3005c0s37b1n0:9092/bootstrap]: x3005c0s37b1n0:9092/bootstrap: Connect to ipv4#10.201.1.236:9092 failed: Connection refused (after 0ms in state CONNECT)
%3|1741270588.246|ERROR|rdkafka#consumer-2| [thrd:app]: rdkafka#consumer-2: x3005c0s37b1n0:9092/bootstrap: Connect to ipv4#10.201.1.236:9092 failed: Connection refused (after 0ms in state CONNECT)
%5|1741270599.256|REQTMOUT|rdkafka#consumer-2| [thrd:x3005c0s37b1n0:9092/bootstrap]: x3005c0s37b1n0:9092/bootstrap: Timed out ApiVersionRequest in flight (after 10009ms, timeout #0)
%3|1741270599.256|ERROR|rdkafka#consumer-2| [thrd:app]: rdkafka#consumer-2: x3005c0s37b1n0:9092/bootstrap: ApiVersionRequest failed: Local: Timed out: probably due to broker version < 0.10 (see api.version.request configuration) (after 10009ms in state APIVERSION_QUERY)
%5|1741270610.257|REQTMOUT|rdkafka#consumer-2| [thrd:x3005c0s37b1n0:9092/bootstrap]: x3005c0s37b1n0:9092/bootstrap: Timed out ApiVersionRequest in flight (after 10010ms, timeout #0)
%3|1741270610.257|ERROR|rdkafka#consumer-2| [thrd:app]: rdkafka#consumer-2: x3005c0s37b1n0:9092/bootstrap: ApiVersionRequest failed: Local: Timed out: probably due to broker version < 0.10 (see api.version.request configuration) (after 10010ms in state APIVERSION_QUERY, 1 identical error(s) suppressed)
%3|1741270611.248|FAIL|rdkafka#consumer-2| [thrd:x3005c0s37b1n0:9092/bootstrap]: x3005c0s37b1n0:9092/bootstrap: Connect to ipv4#10.201.1.236:9092 failed: Connection refused (after 0ms in state CONNECT)
%3|1741270612.250|FAIL|rdkafka#consumer-2| [thrd:GroupCoordinator]: GroupCoordinator: x3005c0s37b1n0:9092: Connect to ipv4#10.201.1.236:9092 failed: Connection refused (after 0ms in state CONNECT)
%5|1741270622.356|REQTMOUT|rdkafka#consumer-2| [thrd:GroupCoordinator]: GroupCoordinator/0: Timed out ApiVersionRequest in flight (after 10009ms, timeout #0)
%3|1741270622.356|ERROR|rdkafka#consumer-2| [thrd:app]: rdkafka#consumer-2: GroupCoordinator: x3005c0s37b1n0:9092: ApiVersionRequest failed: Local: Timed out: probably due to broker version < 0.10 (see api.version.request configuration) (after 10009ms in state APIVERSION_QUERY)
%3|1741270622.356|FAIL|rdkafka#consumer-2| [thrd:GroupCoordinator]: GroupCoordinator: x3005c0s37b1n0:9092: Connect to ipv4#10.201.1.236:9092 failed: Connection refused (after 0ms in state CONNECT)
%3|1741270623.534|FAIL|rdkafka#consumer-2| [thrd:GroupCoordinator]: GroupCoordinator: x3005c0s37b1n0:9092: Connect to ipv4#10.201.1.236:9092 failed: Connection refused (after 0ms in state CONNECT, 1 identical error(s) suppressed)

Once they get started, they are about as fast as with Kafka, but by that point the processes that started right away have finished consuming and are just waiting. From the log, it looks like ApiVersionRequest is timing out repeatedly, with a 10-second timeout each time.
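
The spacing between these timeouts can be read off the log mechanically. The following is a minimal sketch using only the C++ standard library, assuming the stderr layout seen above (`%level|epoch.millis|FACILITY|client| message`). The names `LogEvent`, `parse_log_line`, and `gaps_between` are hypothetical helpers for illustration and are not part of librdkafka.

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <string>
#include <vector>

// One parsed log record: epoch timestamp in seconds and the facility tag
// (e.g. "REQTMOUT", "FAIL", "ERROR").
struct LogEvent {
    double timestamp;
    std::string facility;
};

// Parses "%<level>|<timestamp>|<facility>|..." into `out`.
// Returns false for lines that do not match this assumed layout.
bool parse_log_line(const std::string& line, LogEvent& out) {
    if (line.empty() || line[0] != '%') return false;
    const std::size_t p1 = line.find('|');
    if (p1 == std::string::npos) return false;
    const std::size_t p2 = line.find('|', p1 + 1);
    if (p2 == std::string::npos) return false;
    const std::size_t p3 = line.find('|', p2 + 1);
    if (p3 == std::string::npos) return false;
    try {
        out.timestamp = std::stod(line.substr(p1 + 1, p2 - p1 - 1));
    } catch (const std::exception&) {
        return false;  // timestamp field was not numeric
    }
    out.facility = line.substr(p2 + 1, p3 - p2 - 1);
    return true;
}

// Returns the time in seconds between consecutive log lines whose facility
// equals `facility`, in the order the lines appear.
std::vector<double> gaps_between(const std::vector<std::string>& lines,
                                 const std::string& facility) {
    assert(!facility.empty());
    std::vector<double> gaps;
    bool have_previous = false;
    double previous = 0.0;
    for (const std::string& line : lines) {
        LogEvent event;
        if (!parse_log_line(line, event) || event.facility != facility) continue;
        if (have_previous) gaps.push_back(event.timestamp - previous);
        previous = event.timestamp;
        have_previous = true;
    }
    return gaps;
}
```

Calling `gaps_between(lines, "REQTMOUT")` on the four REQTMOUT lines above gives gaps of about 12.0 s, 11.0 s, and 12.1 s, which is consistent with one 10-second timeout plus a second or two for the refused reconnect in between.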

Any idea what could be causing this, and whether there are configuration parameters (in Redpanda or in librdkafka) that I can use to avoid it?

Redpanda version: 24.2.7
Librdkafka version: 2.8.0

Note that I am not using a container; I'm running the Redpanda binaries directly. I don't have sudo rights on the machine running it, so no kernel tuning is possible.

JIRA Link: CORE-9319


mdorier commented Mar 7, 2025

I think the timeouts above are only part of the problem. I set "api.version.request.timeout.ms" to "200", and the timeouts reported in the error log dropped from 10000 ms to 1000 ms (I'm not sure why they did not drop to 200 ms). The slowest process logged 8 such timed-out requests, yet it waited nearly 4 minutes before receiving its first message from Redpanda. The timestamps of the timeout messages show a gap of about 20 seconds between most requests, and a gap of 2 minutes between two of them. Looking for other ways to configure the consumer, I found that setting "reconnect.backoff.ms" and "reconnect.backoff.max.ms" both to 0 removes the 20-second delay between attempts. It does not remove the failed attempts, so a few seconds are still lost to these requests timing out, but performance is now much closer to Kafka's.
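
For reference, the three overrides described in this comment can be collected in one place. This is a sketch under stated assumptions: the property names and values are the ones quoted above, while `ConfigOverrides` and `fast_reconnect_overrides` are hypothetical names. The librdkafka call itself is left as a comment so that the block builds without the library.

```cpp
#include <string>
#include <utility>
#include <vector>

// Ordered list of (property name, value) pairs.
typedef std::vector<std::pair<std::string, std::string> > ConfigOverrides;

// Hypothetical helper: returns the consumer overrides quoted in the comment.
// In a real client, each pair would be applied before rd_kafka_new() with
//   rd_kafka_conf_set(conf, name, value, errstr, sizeof(errstr))
// and the return code checked against RD_KAFKA_CONF_OK.
ConfigOverrides fast_reconnect_overrides() {
    ConfigOverrides overrides;
    // Shorten the ApiVersionRequest timeout (librdkafka default: 10000 ms).
    overrides.emplace_back("api.version.request.timeout.ms", "200");
    // Disable the reconnect backoff so a refused connection is retried at once.
    overrides.emplace_back("reconnect.backoff.ms", "0");
    overrides.emplace_back("reconnect.backoff.max.ms", "0");
    return overrides;
}
```

With both backoff values at 0, the client retries a refused connection immediately, so this likely trades extra connection attempts against the broker for lower start-up latency. A small nonzero backoff may be a reasonable middle ground if the brokers are already under load.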


patrickangeles commented Mar 7, 2025

Do you have any more details on your environment? For starters:

  • Operating system / version
  • Hardware (cores, x86 or ARM, memory, etc.)
  • Storage medium and configuration
  • Filesystem type (ext4, xfs, etc.)

Also, any details on the setup? What startup parameters did you pass to Redpanda?

Finally, you mention:
> test case uses a topic with 32 partitions (across 2 Redpanda servers)
Did you mean three Redpanda servers? Or are you running two Redpanda servers with a replication factor of one?
