I'm using librdkafka in a C++ program and comparing performance between Redpanda and Kafka in a number of setups. The test case uses a topic with 32 partitions (across 2 Redpanda servers). The client application consists of 32 processes (on the same 64-core machine). Each process consumes from one partition by being assigned to it via rd_kafka_assign, then polling using rd_kafka_consume_batch_queue (with a 100ms timeout). Batching is set to 1000 messages, and these messages are 2KB each.
On top of Kafka, this code takes about 4.5 seconds to consume 1 million events per partition (so 32 million in aggregate). But on top of Redpanda, the performance is disastrous. While some of the processes start consuming right away, others take several seconds (sometimes up to 2 minutes) to get started, and show the following on their stderr:
%5|1741270587.252|REQTMOUT|rdkafka#consumer-2| [thrd:x3005c0s37b1n0:9092/bootstrap]: x3005c0s37b1n0:9092/bootstrap: Timed out ApiVersionRequest in flight (after 10007ms, timeout #0)
%4|1741270587.252|FAIL|rdkafka#consumer-2| [thrd:x3005c0s37b1n0:9092/bootstrap]: x3005c0s37b1n0:9092/bootstrap: ApiVersionRequest failed: Local: Timed out: probably due to broker version < 0.10 (see api.version.request configuration) (after 10007ms in state APIVERSION_QUERY)
%3|1741270588.246|FAIL|rdkafka#consumer-2| [thrd:x3005c0s37b1n0:9092/bootstrap]: x3005c0s37b1n0:9092/bootstrap: Connect to ipv4#10.201.1.236:9092 failed: Connection refused (after 0ms in state CONNECT)
%3|1741270588.246|ERROR|rdkafka#consumer-2| [thrd:app]: rdkafka#consumer-2: x3005c0s37b1n0:9092/bootstrap: Connect to ipv4#10.201.1.236:9092 failed: Connection refused (after 0ms in state CONNECT)
%5|1741270599.256|REQTMOUT|rdkafka#consumer-2| [thrd:x3005c0s37b1n0:9092/bootstrap]: x3005c0s37b1n0:9092/bootstrap: Timed out ApiVersionRequest in flight (after 10009ms, timeout #0)
%3|1741270599.256|ERROR|rdkafka#consumer-2| [thrd:app]: rdkafka#consumer-2: x3005c0s37b1n0:9092/bootstrap: ApiVersionRequest failed: Local: Timed out: probably due to broker version < 0.10 (see api.version.request configuration) (after 10009ms in state APIVERSION_QUERY)
%5|1741270610.257|REQTMOUT|rdkafka#consumer-2| [thrd:x3005c0s37b1n0:9092/bootstrap]: x3005c0s37b1n0:9092/bootstrap: Timed out ApiVersionRequest in flight (after 10010ms, timeout #0)
%3|1741270610.257|ERROR|rdkafka#consumer-2| [thrd:app]: rdkafka#consumer-2: x3005c0s37b1n0:9092/bootstrap: ApiVersionRequest failed: Local: Timed out: probably due to broker version < 0.10 (see api.version.request configuration) (after 10010ms in state APIVERSION_QUERY, 1 identical error(s) suppressed)
%3|1741270611.248|FAIL|rdkafka#consumer-2| [thrd:x3005c0s37b1n0:9092/bootstrap]: x3005c0s37b1n0:9092/bootstrap: Connect to ipv4#10.201.1.236:9092 failed: Connection refused (after 0ms in state CONNECT)
%3|1741270612.250|FAIL|rdkafka#consumer-2| [thrd:GroupCoordinator]: GroupCoordinator: x3005c0s37b1n0:9092: Connect to ipv4#10.201.1.236:9092 failed: Connection refused (after 0ms in state CONNECT)
%5|1741270622.356|REQTMOUT|rdkafka#consumer-2| [thrd:GroupCoordinator]: GroupCoordinator/0: Timed out ApiVersionRequest in flight (after 10009ms, timeout #0)
%3|1741270622.356|ERROR|rdkafka#consumer-2| [thrd:app]: rdkafka#consumer-2: GroupCoordinator: x3005c0s37b1n0:9092: ApiVersionRequest failed: Local: Timed out: probably due to broker version < 0.10 (see api.version.request configuration) (after 10009ms in state APIVERSION_QUERY)
%3|1741270622.356|FAIL|rdkafka#consumer-2| [thrd:GroupCoordinator]: GroupCoordinator: x3005c0s37b1n0:9092: Connect to ipv4#10.201.1.236:9092 failed: Connection refused (after 0ms in state CONNECT)
%3|1741270623.534|FAIL|rdkafka#consumer-2| [thrd:GroupCoordinator]: GroupCoordinator: x3005c0s37b1n0:9092: Connect to ipv4#10.201.1.236:9092 failed: Connection refused (after 0ms in state CONNECT, 1 identical error(s) suppressed)
Once they get started, they are about as fast as Kafka (but by this point the processes that had started right away have finished consuming and are just waiting). From the log, it looks like an ApiVersionRequest is timing out repeatedly, with a 10-second timeout.
Any idea what can be causing this and if there are configuration parameters (either in Redpanda or in librdkafka) that I can use?
Note that I am not using a container, I'm using the Redpanda binaries directly, and I don't have sudo rights on the machine running it, so no kernel tuning possible.
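For anyone wanting to quantify the spacing between the timed-out requests, here is a minimal sketch that pulls the epoch timestamps out of librdkafka-style stderr lines and computes the gap between consecutive lines carrying a given tag (for example REQTMOUT). It assumes the `%level|epoch.millis|TAG|...` layout seen in the log above; the function names `parse_tagged_timestamp` and `gaps_between` are made up for illustration and are not part of librdkafka.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Parse a log line shaped like "%<level>|<epoch.millis>|<TAG>|...".
// On success, stores the timestamp (in seconds) in `ts` and returns true.
// Returns false if the line has a different tag or an unparsable timestamp.
bool parse_tagged_timestamp(const std::string& line, const std::string& tag,
                            double& ts) {
    std::istringstream in(line);
    std::string level, stamp, name;
    if (!std::getline(in, level, '|')) return false;
    if (!std::getline(in, stamp, '|')) return false;
    if (!std::getline(in, name, '|')) return false;
    if (name != tag) return false;
    try {
        ts = std::stod(stamp);
    } catch (...) {
        return false;  // timestamp field was not a number
    }
    return true;
}

// Return the gaps, in seconds, between consecutive lines that carry `tag`.
// Lines with other tags are skipped.
std::vector<double> gaps_between(const std::vector<std::string>& lines,
                                 const std::string& tag) {
    std::vector<double> gaps;
    bool have_prev = false;
    double prev = 0.0;
    double ts = 0.0;
    for (const auto& line : lines) {
        if (!parse_tagged_timestamp(line, tag, ts)) continue;
        if (have_prev) gaps.push_back(ts - prev);
        prev = ts;
        have_prev = true;
    }
    return gaps;
}
```

Applied to the four REQTMOUT lines in the log above (timestamps 1741270587.252, 1741270599.256, 1741270610.257 and 1741270622.356), this gives gaps of roughly 12.0, 11.0 and 12.1 seconds, i.e. each 10-second timeout plus one to two seconds before the next attempt.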
I think the timeouts above are only part of the problem. I set "api.version.request.timeout.ms" to "200" and the timeout values in the error log went down to 1000ms instead of 10000ms (not sure why it's not 200ms); the slowest process showed 8 such timed-out requests in its log, yet it waited nearly 4 minutes before receiving its first message from Redpanda. The timestamps of the timeout messages in the error logs show a gap of ~20 seconds between most requests, except for a gap of 2 minutes between two of them. I looked for more ways to configure the consumer and found that setting "reconnect.backoff.ms" and "reconnect.backoff.max.ms" both to 0 removed the 20-second delay between attempts. It doesn't remove the failed attempts, so I still lose a few seconds on these requests timing out, but at least the performance is much closer to that of Kafka.
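As a rough sketch of how the three overrides mentioned above could be kept in one place and applied, the snippet below stores them as name/value pairs and pushes them through a setter callback. The property names and values are the ones from this comment; the helpers `fast_reconnect_overrides` and `apply_settings` are hypothetical and not part of librdkafka. In a real consumer, the callback would presumably wrap `rd_kafka_conf_set` and return whether it reported `RD_KAFKA_CONF_OK`.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// One configuration property: (name, value), both as strings, which is
// how librdkafka's rd_kafka_conf_set takes them.
using Setting = std::pair<std::string, std::string>;

// The overrides described in the comment above. The values are the ones
// that were tried there, not general recommendations.
std::vector<Setting> fast_reconnect_overrides() {
    return {
        {"api.version.request.timeout.ms", "200"},
        {"reconnect.backoff.ms", "0"},
        {"reconnect.backoff.max.ms", "0"},
    };
}

// Apply every setting through `set` and return the names of the
// properties the setter rejected. With librdkafka, `set` could wrap
//   rd_kafka_conf_set(conf, name.c_str(), value.c_str(),
//                     errstr, sizeof(errstr)) == RD_KAFKA_CONF_OK
// (hypothetical wiring; `conf` and `errstr` belong to the caller).
std::vector<std::string> apply_settings(
    const std::vector<Setting>& settings,
    const std::function<bool(const std::string&, const std::string&)>& set) {
    std::vector<std::string> rejected;
    for (const auto& s : settings) {
        if (!set(s.first, s.second)) rejected.push_back(s.first);
    }
    return rejected;
}
```

Collecting the rejected names rather than stopping at the first failure makes it easier to spot a misspelled property when several are changed at once.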
Do you have any more details on your environment? For starters:
Operating system / version
Hardware (cores, x86 or ARM, memory, etc.)
Storage medium and configuration
Filesystem type (ext4, xfs, etc.)
Also any details on the setup? What startup parameters did you give to Redpanda?
Finally, you mention:
| test case uses a topic with 32 partitions (across 2 Redpanda servers)
Did you mean three Redpanda servers? Or two Redpanda servers and you're using a replication factor of one?
Redpanda version: 24.2.7
Librdkafka version: 2.8.0
JIRA Link: CORE-9319