rd_kafka_toppar_enq_msg() lock when produce msg #2447
(gdb) info threads
  14  Thread 0x7f0770400700 (LWP 17877) "producer_per"  0x00007f077176438d in poll () from /lib64/libc.so.6
(gdb) thread apply all bt
|
Thread 12 is the application calling produce():
Thread 11 is the rdkafka main thread which is updating topic state from metadata:
Thread 8 is the broker thread producing messages:
From what I can see there is no deadlock; thread 8 is busy moving messages from the partition queue to the partition xmit queue. Alternatively, the message queue being moved is corrupt and cyclic, which we can check in gdb:
(gdb) source path/to/librdkafka/.gdbmacros
(gdb) thread 8
(gdb) frame 3
(gdb) dump_msgq rktp.rktp_xmit_msgq
(gdb) dump_msgq rktp.rktp_msgq |
Following your advice, the contents of the two msgqs are below, @edenhill.
(gdb) dump_msgq rktp.rktp_xmit_msgq
(gdb) dump_msgq rktp.rktp_msgq
|
Does produce() stop for a while or indefinitely? |
Just for a while, but I receive no dr_cb reporting that the queue is full. So if the queue in the toppar is not empty, what else will cause produce() to stop? |
ERR__QUEUE_FULL is propagated by produce() failing and returning that error code, in which case the message you tried to produce will not have been added to the queue. If produce() blocks for a while it typically means your system can't keep up. |
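For illustration, here is a minimal sketch of how an application typically handles that case with the legacy produce() API; this is my own example, not code from this issue, and rk/rkt stand in for an already-created producer instance and topic handle.

```c
#include <librdkafka/rdkafka.h>
#include <stdio.h>

/* Sketch: back off and retry when the local producer queue is full.
 * rk and rkt are assumed to be a configured producer and topic handle. */
static void produce_with_backpressure(rd_kafka_t *rk, rd_kafka_topic_t *rkt,
                                      void *payload, size_t len) {
        while (rd_kafka_produce(rkt, RD_KAFKA_PARTITION_UA,
                                RD_KAFKA_MSG_F_COPY,
                                payload, len,
                                NULL, 0,   /* no key */
                                NULL) == -1) {
                if (rd_kafka_last_error() == RD_KAFKA_RESP_ERR__QUEUE_FULL) {
                        /* Queue full (queue.buffering.max.messages reached):
                         * serve delivery reports so the queue can drain,
                         * then retry the produce call. */
                        rd_kafka_poll(rk, 100);
                        continue;
                }
                fprintf(stderr, "produce failed: %s\n",
                        rd_kafka_err2str(rd_kafka_last_error()));
                break;
        }
}
```

The key point is that a QUEUE_FULL failure is local and retriable once rd_kafka_poll() has drained some delivery reports.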
It has only 2 CPUs (Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz) and 10 GB of memory. |
Monitor the CPU usage and system load to see if it is reaching its limits. |
top - 07:48:31 up 21 days, 3:50, 6 users, load average: 3.19, 1.75, 1.83
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
CPU is over 200%. |
Some of the relevant code:
while (running) {
    rd_kafka_poll(rk, 1000);
} |
Seems like your CPUs are saturated. |
I will use another VM with more CPUs to test. |
Actually, when the CPU is in a normal state,
Hi, @edenhill. I have tested again with a VM with 20 CPUs. I think I hit the same problem: the producer can't send messages, as you can see:
top output for each thread
the function the rdk:broker1 thread is in
And my vNIC throughput shows it is not sending messages; TX Data/Rate should be larger than 20 MB when it is normal. When I run another broker_example to produce messages, it works well.
With 2 producers, only one works well.
So is there anything that can help me figure out this dilemma? Thanks! |
Let me add some more info:
|
@edenhill, do you have any ideas? |
Are there any adverse cluster events when this happens, such as brokers going down or network congestion? |
Do you see any logs from the producer? |
I don't enable the debug config, but sometimes I receive
OR
The second one, I think, is just a warning. |
I also find that when it works well, the stack is like this:
But when it is stuck, the stack is like this:
Something is different in |
So what I think is happening here is that you have some connectivity issues with the cluster, or the cluster is unstable for other reasons, which causes broker connections to go down, which in turn makes the producer re-insert in-flight/in-tx-queue messages back onto the partition's queue. |
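One way to confirm that theory (my suggestion; the thread itself only mentions debug=.. later in the checklist) is to enable librdkafka's broker/message debug logging on the producer so connection drops and message re-queuing show up in the logs; a minimal sketch:

```c
#include <librdkafka/rdkafka.h>
#include <stdio.h>

/* Sketch: create a producer with "debug=broker,topic,msg" enabled.
 * "debug" and "bootstrap.servers" are standard librdkafka properties;
 * everything else here is illustrative. */
static rd_kafka_t *create_debug_producer(const char *brokers) {
        char errstr[512];
        rd_kafka_conf_t *conf = rd_kafka_conf_new();

        if (rd_kafka_conf_set(conf, "bootstrap.servers", brokers,
                              errstr, sizeof(errstr)) != RD_KAFKA_CONF_OK ||
            rd_kafka_conf_set(conf, "debug", "broker,topic,msg",
                              errstr, sizeof(errstr)) != RD_KAFKA_CONF_OK) {
                fprintf(stderr, "config error: %s\n", errstr);
                rd_kafka_conf_destroy(conf);
                return NULL;
        }

        /* rd_kafka_new() takes ownership of conf on success. */
        return rd_kafka_new(RD_KAFKA_PRODUCER, conf, errstr, sizeof(errstr));
}
```

With that enabled, broker connection state changes and message queue activity are logged, which should show whether connections are dropping when the stall happens.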
Thanks for the swift response, I will check now. |
It might be an issue related to NIC throughput. I changed the Kafka broker to a VM with a 10GE vNIC, and it then works well in the same scenario with more than 200 consumers and the producer at 5 MB/s. |
@edenhill hello
@firefeifei What librdkafka version? What Kafka version? What do you mean by "105 errors"? Is that the number of errors? |
librdkafka 1.2.0
rd_kafka_produce returns -1, errno is 105.
105 is ENOBUFS, which is the errno rd_kafka_produce() sets when the local producer queue is full (queue.buffering.max.messages has been reached), i.e. RD_KAFKA_RESP_ERR__QUEUE_FULL.
This error (105) occurs when the message rate suddenly increases and there are not enough Kafka partitions, but the error causes the Kafka thread to deadlock and it cannot recover, even after adding a partition. The main problem is a thread deadlock; the call stack looks like this:
Please try to reproduce this on the latest release, 1.6.0 |
Thanks, I will update the version and test.
77259 root 20 0 83.792g 0.016t 6620 R 99.6 82.9 9:48.35 rdk:broker7
Read the FAQ first: https://github.com/edenhill/librdkafka/wiki/FAQ
Description
After sending messages for a while (a day or 20 minutes, unpredictable), the thread that calls rd_kafka_produce() locks. I want to know what could cause this.
How to reproduce
Run a performance test, sending messages for a long time, and then the thread will lock.
The gdb info will be pasted in a comment.
IMPORTANT: Always try to reproduce the issue on the latest released version (see https://github.com/edenhill/librdkafka/releases); if it can't be reproduced on the latest version, the issue has been fixed.
Checklist
IMPORTANT: We will close issues where the checklist has not been completed.
Please provide the following information:
librdkafka-1.0.1
kafka_2.12-2.2.0
{"queue.buffering.max.messages", "10000000"}, {"acks", "all"}, {"linger.ms", "0"}, {"compression.codec", "none"}, {"socket.keepalive.enable", "true"}, {"enable.idempotence", "true"}, {"message.timeout.ms", "100000"}, {"reconnect.backoff.jitter.ms", "300"}
Centos 7 (x64)
Provide logs (with debug=.. as necessary) from librdkafka