
Throughput decreases badly as broker number increases #3626

Open

eelyaj opened this issue Nov 22, 2021 · 6 comments

Comments

eelyaj commented Nov 22, 2021

Description

I have a topic with 100 partitions in my system. I found that with 3 Kafka brokers, librdkafka can send 1,500,000 packets per second to Kafka, but when I increase the broker count from 3 to 20, librdkafka can only send 680,000 packets per second.

I use the rd_kafka_produce_batch API in my producer, with parameters partition=RD_KAFKA_PARTITION_UA, msgflags=RD_KAFKA_MSG_F_COPY, message_cnt=1000.
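
A minimal sketch of that call pattern, for reference (this is not the reporter's actual code; the 60-byte payloads and the batch size of 1000 are assumptions taken from the description):

#include <string.h>
#include <librdkafka/rdkafka.h>

#define BATCH_CNT 1000
#define PKT_SIZE  60

/* Enqueue one batch of messages; `rkt` is an rd_kafka_topic_t * created elsewhere. */
static void produce_one_batch(rd_kafka_topic_t *rkt,
                              char payloads[BATCH_CNT][PKT_SIZE]) {
        rd_kafka_message_t msgs[BATCH_CNT];
        memset(msgs, 0, sizeof(msgs));

        for (int i = 0; i < BATCH_CNT; i++) {
                msgs[i].payload = payloads[i];
                msgs[i].len     = PKT_SIZE;
        }

        /* RD_KAFKA_PARTITION_UA lets the configured partitioner (murmur2_random
         * here) pick a partition per message; RD_KAFKA_MSG_F_COPY makes
         * librdkafka copy each payload. */
        int enqueued = rd_kafka_produce_batch(rkt, RD_KAFKA_PARTITION_UA,
                                              RD_KAFKA_MSG_F_COPY,
                                              msgs, BATCH_CNT);

        if (enqueued < BATCH_CNT) {
                /* Messages that failed have msgs[i].err set, e.g. QUEUE_FULL. */
        }
}

Each message enqueued this way goes through rd_kafka_msg_partitioner() and rd_kafka_toppar_enq_msg(), which is where the wakeup writes shown in the backtrace further down originate.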

The 'top' CPU output looks like this:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7435 root 21 1 1880768 1.2g 7744 R 59.7 3.7 9:58.51 KafkaProducer
7453 root 22 2 1880768 1.2g 7744 R 33.0 3.7 5:17.44 Serializer2
7452 root 22 2 1880768 1.2g 7744 R 32.7 3.7 5:16.51 Serializer1
7451 root 22 2 1880768 1.2g 7744 R 31.7 3.7 5:12.27 Serializer0
7018 root 20 0 1880768 1.2g 7744 S 13.2 3.7 1:56.81 rdk:broker15000
7021 root 20 0 1880768 1.2g 7744 S 10.2 3.7 1:39.75 rdk:broker15000
7027 root 20 0 1880768 1.2g 7744 S 9.2 3.7 1:36.97 rdk:broker15000
7454 root 23 3 1880768 1.2g 7744 R 8.6 3.7 1:20.87 UdpDispatch
7025 root 20 0 1880768 1.2g 7744 S 8.3 3.7 1:29.82 rdk:broker15000
7033 root 20 0 1880768 1.2g 7744 S 8.3 3.7 1:36.80 rdk:broker15000
7019 root 20 0 1880768 1.2g 7744 S 7.9 3.7 1:07.05 rdk:broker15000
7030 root 20 0 1880768 1.2g 7744 S 7.9 3.7 1:18.41 rdk:broker15000
7020 root 20 0 1880768 1.2g 7744 R 7.6 3.7 1:13.25 rdk:broker15000
7036 root 20 0 1880768 1.2g 7744 R 7.6 3.7 0:55.21 rdk:broker15000
7024 root 20 0 1880768 1.2g 7744 S 6.9 3.7 0:55.41 rdk:broker15000
7035 root 20 0 1880768 1.2g 7744 S 6.9 3.7 1:02.24 rdk:broker15000
7028 root 20 0 1880768 1.2g 7744 S 6.3 3.7 0:55.50 rdk:broker15000
7034 root 20 0 1880768 1.2g 7744 S 5.6 3.7 0:54.06 rdk:broker15000
7026 root 20 0 1880768 1.2g 7744 S 5.3 3.7 0:42.09 rdk:broker15000

The 'KafkaProducer' thread is where I call the rd_kafka_produce_batch API to send messages to Kafka.
I also used 'perf top' to inspect the CPU usage of 'KafkaProducer':
Samples: 17K of event 'cycles', Event count (approx.): 2675618158 lost: 0/0
Children Self Shared Object Symbol

  • 25.78% 0.06% [kernel] [k] system_call_fastpath
  • 19.76% 0.12% [kernel] [k] sys_write
  • 2.60% sys_write
    • 2.33% vfs_write
      • 2.36% do_sync_write
        • 4.77% pipe_write
          • 1.70% __wake_up_sync_key
            • 1.64% __wake_up_common
              • 1.29% pollwake
                • 1.86% default_wake_function
                  • 3.44% try_to_wake_up
  • 18.65% 0.31% [kernel] [k] vfs_write
  • 17.77% 0.31% [kernel] [k] do_sync_write
  • 17.31% 0.74% [kernel] [k] pipe_write
  • 14.63% 1.16% [kernel] [k] try_to_wake_up
  • 13.04% 0.16% libpthread-2.17.so [.] __lll_unlock_wake
  • 12.90% 0.07% [kernel] [k] __wake_up_sync_key
  • 11.64% 0.35% [kernel] [k] __wake_up_common
  • 11.26% 0.52% [kernel] [k] pollwake
  • 10.75% 0.06% [kernel] [k] default_wake_function
  • 10.72% 0.79% [kernel] [k] __schedule
  • 10.71% 0.04% [kernel] [k] schedule_user
  • 10.24% 0.04% [kernel] [k] sysret_careful
  • 7.52% 0.05% [kernel] [k] ttwu_do_activate.constprop.95

The 'write' system calls are at the top of the list.

How to reproduce

Deploy a large number of Kafka brokers; it reproduces every time in my system.
I've tried versions 1.1.0 and 1.7.0.

Checklist

Please provide the following information:

  • librdkafka version (release number or git tag): 1.7.0
  • Apache Kafka version: 2.2.1
  • librdkafka client configuration (a sketch of applying these settings follows the checklist):
    "queue.buffering.max.ms": "1000",
    "queue.buffering.max.messages": "1000000",
    "queue.buffering.max.kbytes": "2048000",
    "batch.num.messages": "10000",
    "compression.codec": "zstd",
    "socket.send.buffer.bytes": "3200000",
    "socket.receive.buffer.bytes": "3200000",
    "message.max.bytes": "209715200",
    "request.required.acks": "1",
    "request.timeout.ms": "30000",
    "partitioner": "murmur2_random",
    "replication_factor": 2
  • Operating system: EulerOS 2.7 (CentOS-compatible)
  • Provide logs (with debug=.. as necessary) from librdkafka
  • Provide broker log excerpts
  • Critical issue
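
A sketch of how the client configuration above might be applied through the C API (error handling is trimmed; replication_factor is a topic-creation setting and is omitted here, and topic-level properties such as partitioner fall through to the default topic configuration when set this way):

#include <stdio.h>
#include <librdkafka/rdkafka.h>

static rd_kafka_conf_t *make_producer_conf(void) {
        char errstr[512];
        rd_kafka_conf_t *conf = rd_kafka_conf_new();
        static const char *kv[][2] = {
                { "queue.buffering.max.ms",       "1000" },
                { "queue.buffering.max.messages", "1000000" },
                { "queue.buffering.max.kbytes",   "2048000" },
                { "batch.num.messages",           "10000" },
                { "compression.codec",            "zstd" },
                { "socket.send.buffer.bytes",     "3200000" },
                { "socket.receive.buffer.bytes",  "3200000" },
                { "message.max.bytes",            "209715200" },
                { "request.required.acks",        "1" },
                { "request.timeout.ms",           "30000" },
                { "partitioner",                  "murmur2_random" },
        };

        for (size_t i = 0; i < sizeof(kv) / sizeof(kv[0]); i++) {
                if (rd_kafka_conf_set(conf, kv[i][0], kv[i][1],
                                      errstr, sizeof(errstr)) != RD_KAFKA_CONF_OK)
                        fprintf(stderr, "config error: %s\n", errstr);
        }
        return conf;
}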
eelyaj (Author) commented Nov 22, 2021

There are lots of IO-event write() calls in the 'KafkaProducer' thread. I don't know if this is normal.

Thread 16 "KafkaProducer" hit Breakpoint 1, 0x00007ffff6553fb0 in write () from /usr/lib64/libpthread.so.0
(gdb) bt
#0 0x00007ffff6553fb0 in write () from /usr/lib64/libpthread.so.0
#1 0x00007ffff7da8f3b in rd_kafka_q_io_event (rkq=0x14646e0) at rdkafka_queue.h:324
#2 rd_kafka_q_yield (rkq=0x14646e0) at rdkafka_queue.h:363
#3 rd_kafka_toppar_enq_msg (rktp=, rkm=) at rdkafka_partition.c:712
#4 0x00007ffff7d5038b in rd_kafka_msg_partitioner (rkt=rkt@entry=0x1441800, rkm=rkm@entry=0x2e1f2700, do_lock=do_lock@entry=RD_DONT_LOCK) at rdkafka_msg.c:1282
#5 0x00007ffff7d51d3d in rd_kafka_produce_batch (app_rkt=, partition=-1, msgflags=, rkmessages=, message_cnt=) at rdkafka_msg.c:781
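
For reference, the write() in frame #1 is librdkafka's queue wakeup, which signals a waiting thread by writing a small payload to a file descriptor registered on the queue. The same mechanism is exposed to applications via rd_kafka_queue_io_event_enable(); the sketch below only illustrates the pattern with a hypothetical pipe, it is not the internal code path from the backtrace.

#include <unistd.h>
#include <librdkafka/rdkafka.h>

/* Register a pipe write-end on the main event queue so a poll loop can be
 * woken when librdkafka enqueues an event; `rk` is an initialized handle. */
static void setup_wakeup_pipe(rd_kafka_t *rk, int pipefd[2]) {
        if (pipe(pipefd) == -1)
                return; /* error handling omitted in this sketch */

        rd_kafka_queue_t *mainq = rd_kafka_queue_get_main(rk);

        /* librdkafka writes the 1-byte payload "1" to pipefd[1] when the
         * queue transitions from empty to non-empty. */
        rd_kafka_queue_io_event_enable(mainq, pipefd[1], "1", 1);

        rd_kafka_queue_destroy(mainq);
}

Under a high message rate, frequent wakeups of this kind can turn into a large number of write() system calls, which would be consistent with the perf profile above.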

edenhill (Contributor) commented
I think this might be a duplicate of #3538

I'm working on an improved wakeup mechanism for 1.9.

eelyaj (Author) commented Nov 23, 2021

Thanks.
Is there any workaround I can apply to avoid this issue, such as a config option or parameter?
Or maybe I could roll back the librdkafka version in my system; which version should I use?

eelyaj (Author) commented Nov 23, 2021

Tested with 100 partitions, 3 brokers, 4 vCPUs, 16 GB memory, and 60-byte packets.

Version    Throughput (packets/second)
1.8.2      800,000
1.7.0      800,000
1.6.1      1,000,000
1.5.3      790,000
1.4.4      790,000
1.3.0      760,000
1.2.2      790,000
1.1.0      790,000
0.11.6     1,020,000

anchitj (Member) commented Jul 23, 2024

@eelyaj Is this still an issue?

anchitj (Member) commented Jul 23, 2024

Closing as the fix is merged already. Feel free to reopen if you still see the issue.
