rd_kafka_destroy_app got stuck in pthread_join #2339
Comments
Can you provide a small but complete program that reproduces this error?
Hi, unfortunately I can't provide one because the code is part of our product. :( The issue reproduced again and I captured the debug=all traces. Please check the attached file.
This time I got two call stacks with "rdkafka" in their names, possibly because of "debug=all":
The issue reproduces when there is an idle interval, with no rdkafka calls at all, before the producer is terminated, as in the trace. (Without the interval, it does not reproduce!) Producer creation, for example:
Some lines in the stack trace were removed because we have a wrapper implementation for pthread. Thanks in advance! Regards,
It would be great if you could modify the example code in the repo to reproduce this issue.
I found that the problem may be in our pthread wrapper implementation: a safe pointer exists for each thread, and its ref somehow goes wrong when external pthread functions are called. Thanks for your support :)
Read the FAQ first: https://github.com/edenhill/librdkafka/wiki/FAQ
Description
I've created and destroyed a producer, but rd_kafka_destroy_app hangs with the following bt:
#0 0x00007f504ef904c2 in pthread_join () from /lib64/libpthread.so.0
...
#3 0x00007f47e81153e2 in thrd_join (thr=thr@entry=139932621911808, res=res@entry=0x7f4cbc2a5e5c) at tinycthread.c:692
#4 0x00007f47e80b0812 in rd_kafka_destroy_app (rk=0x7f4e7f3e9000, flags=) at rdkafka.c:939
#5 0x00007f503d15673b in clear () at ../../KafkaAdaptor/KafkaProducerJob.cpp:165
#6 0x00007f503d15f7f6 in __base_dtor () at ../../KafkaAdaptor/KafkaProducerJob.cpp:152
...
It does not always happen, but I see the symptom often.
No other thread had a call stack containing an "rdkafka" filename.
I guess "rkb" (or the main internal thread) was already terminated and destroyed before the join.
(gdb) p *rk
$10 = {rk_rep = 0x7f4e7f013900, rk_ops = 0x7f4e7f00f040, rk_brokers = {tqh_first = 0x0, tqh_last = 0x7f4e7f3e9010}, rk_broker_by_id = {rl_size = 16, rl_cnt = 0, rl_elems = 0x7f4e9e731480, rl_free_cb = 0x0, rl_flags = 2, rl_elemsize = 0, rl_p = 0x0}, rk_broker_cnt = {val = 0},
rk_broker_up_cnt = {val = 0}, rk_broker_down_cnt = {val = 2}, rk_broker_addrless_cnt = {val = 0}, rk_internal_rkb_lock = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}},
__size = '\000' <repeats 39 times>, __align = 0}, rk_internal_rkb = 0x0, rk_broker_state_change_cnd = {__data = {__lock = 0, __futex = 0, __total_seq = 0, __wakeup_seq = 0, __woken_seq = 0, __mutex = 0x0, __nwaiters = 0, __broadcast_seq = 0}, __size = '\000' <repeats 47 times>,
__align = 0}, rk_broker_state_change_lock = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}, rk_broker_state_change_version = 17,
rk_broker_state_change_waiters = {rl_size = 8, rl_cnt = 0, rl_elems = 0x7f4e7ead8f80
...
Could you please have a look and check whether this is a bug or not?
I'll try to get the debug trace of this issue.
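A hang like the one in the backtrace can sometimes be narrowed down by replacing an unbounded join with a bounded one in the application's own thread code. The following is a minimal sketch, assuming glibc on Linux (pthread_timedjoin_np is a GNU extension, not POSIX). The names join_with_timeout, release_flag and waiting_worker are hypothetical and exist only for illustration; they are not part of librdkafka or of the pthread wrapper mentioned in this issue.

```c
#define _GNU_SOURCE /* pthread_timedjoin_np is a GNU extension */
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdatomic.h>
#include <time.h>

/* Hypothetical helper: join thr, giving up after timeout_ms milliseconds.
 * Returns 0 if the thread was joined, ETIMEDOUT if it is still running. */
static int join_with_timeout(pthread_t thr, int timeout_ms) {
    struct timespec deadline;

    assert(timeout_ms >= 0);

    /* pthread_timedjoin_np takes an absolute CLOCK_REALTIME deadline. */
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_ms / 1000;
    deadline.tv_nsec += (long)(timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec += 1;
        deadline.tv_nsec -= 1000000000L;
    }
    return pthread_timedjoin_np(thr, NULL, &deadline);
}

/* Hypothetical worker used to exercise the helper: it keeps running
 * until release_flag is set, imitating a thread that never exits. */
static atomic_int release_flag;

static void *waiting_worker(void *arg) {
    struct timespec nap = {0, 1000000L}; /* 1 ms */
    (void)arg;
    while (!atomic_load(&release_flag))
        nanosleep(&nap, NULL);
    return NULL;
}
```

If such a helper returns ETIMEDOUT where a plain pthread_join would have blocked, the target thread is still running, and the thread stays joinable so the join can be retried later. This only helps for joins issued by the application itself; the join inside rd_kafka_destroy_app is librdkafka's own and is not changed by it.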
How to reproduce
1.0.0 release
// create producer
rd_producer = rd_kafka_new(RD_KAFKA_PRODUCER, rd_gconf, errstr, sizeof(errstr));
// create topic
rd_topic = rd_kafka_topic_new(m_producerHandle->getProducer(), m_jobName.c_str(), rd_tconf);
// ... publish some events ...
// destroy topic
rd_kafka_topic_destroy(rd_topic);
// wait for outstanding messages, then destroy producer
while (rd_kafka_outq_len(rd_producer) > 0)
    rd_kafka_poll(rd_producer, 50);
rd_kafka_destroy(rd_producer); // => hangs here
Checklist
IMPORTANT: We will close issues where the checklist has not been completed.
Please provide the following information:
librdkafka version: 1.0.0 release
Apache Kafka version: 2.1.1 (confluent-5.1.2)
librdkafka client configuration: socket.timeout.ms=10000, others as defaults
Operating system: suse12
Provide logs (with debug=.. as necessary) from librdkafka