
rd_kafka_destroy_app got stuck in pthread_join #2339

Closed
4 of 7 tasks

sezruby opened this issue May 27, 2019 · 4 comments

sezruby commented May 27, 2019

Read the FAQ first: https://github.com/edenhill/librdkafka/wiki/FAQ

Description

I create a producer and later destroy it, but rd_kafka_destroy_app hangs with the following backtrace:

#0 0x00007f504ef904c2 in pthread_join () from /lib64/libpthread.so.0
...
#3 0x00007f47e81153e2 in thrd_join (thr=thr@entry=139932621911808, res=res@entry=0x7f4cbc2a5e5c) at tinycthread.c:692
#4 0x00007f47e80b0812 in rd_kafka_destroy_app (rk=0x7f4e7f3e9000, flags=<optimized out>) at rdkafka.c:939
#5 0x00007f503d15673b in clear () at ../../KafkaAdaptor/KafkaProducerJob.cpp:165
#6 0x00007f503d15f7f6 in __base_dtor () at ../../KafkaAdaptor/KafkaProducerJob.cpp:152
...

It doesn't always happen, but I can often see the symptom.
There weren't any other thread call stacks with an "rdkafka" filename.

I guess the "rkb" (or main internal) thread was already terminated and destroyed before the join:
(gdb) p *rk
$10 = {rk_rep = 0x7f4e7f013900, rk_ops = 0x7f4e7f00f040, rk_brokers = {tqh_first = 0x0, tqh_last = 0x7f4e7f3e9010}, rk_broker_by_id = {rl_size = 16, rl_cnt = 0, rl_elems = 0x7f4e9e731480, rl_free_cb = 0x0, rl_flags = 2, rl_elemsize = 0, rl_p = 0x0}, rk_broker_cnt = {val = 0},
rk_broker_up_cnt = {val = 0}, rk_broker_down_cnt = {val = 2}, rk_broker_addrless_cnt = {val = 0}, rk_internal_rkb_lock = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}},
__size = '\000' <repeats 39 times>, __align = 0}, rk_internal_rkb = 0x0, rk_broker_state_change_cnd = {__data = {__lock = 0, __futex = 0, __total_seq = 0, __wakeup_seq = 0, __woken_seq = 0, __mutex = 0x0, __nwaiters = 0, __broadcast_seq = 0}, __size = '\000' <repeats 47 times>,
__align = 0}, rk_broker_state_change_lock = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}, rk_broker_state_change_version = 17,
rk_broker_state_change_waiters = {rl_size = 8, rl_cnt = 0, rl_elems = 0x7f4e7ead8f80
...

Could you please take a look and check whether this is a bug or not?
I'll try to get a debug trace of this issue.

How to reproduce

1.0.0 release

// create producer
rd_producer = rd_kafka_new(RD_KAFKA_PRODUCER, rd_gconf, errstr, sizeof(errstr));

// topic creation
rd_topic = rd_kafka_topic_new(m_producerHandle->getProducer(), m_jobName.c_str(), rd_tconf);

.. publish some events...

// topic deletion
rd_kafka_topic_destroy(rd_topic);

// destroy producer
while (rd_kafka_outq_len(rd_producer) > 0)
    rd_kafka_poll(rd_producer, 50);

rd_kafka_destroy(rd_producer);  // => hangs here
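
For reference, a minimal self-contained sketch of the same sequence (broker address, topic name, payload and config values are placeholders rather than our real ones, and error handling is omitted for brevity):

#include <string.h>
#include <librdkafka/rdkafka.h>

int main(void) {
    char errstr[512];

    // producer configuration (placeholder values)
    rd_kafka_conf_t *conf = rd_kafka_conf_new();
    rd_kafka_conf_set(conf, "bootstrap.servers", "MY_HOST:9092",
                      errstr, sizeof(errstr));
    rd_kafka_conf_set(conf, "socket.timeout.ms", "10000",
                      errstr, sizeof(errstr));

    // create producer (rd_kafka_new takes ownership of conf on success)
    rd_kafka_t *rk = rd_kafka_new(RD_KAFKA_PRODUCER, conf,
                                  errstr, sizeof(errstr));

    // topic creation
    rd_kafka_topic_t *rkt = rd_kafka_topic_new(rk, "MY_TOPIC_NAME", NULL);

    // publish some events
    const char *payload = "some event";
    rd_kafka_produce(rkt, RD_KAFKA_PARTITION_UA, RD_KAFKA_MSG_F_COPY,
                     (void *)payload, strlen(payload), NULL, 0, NULL);

    // topic deletion
    rd_kafka_topic_destroy(rkt);

    // destroy producer: drain the delivery report queue first
    while (rd_kafka_outq_len(rk) > 0)
        rd_kafka_poll(rk, 50);

    rd_kafka_destroy(rk);   // => hangs here (sometimes)
    return 0;
}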

Checklist

IMPORTANT: We will close issues where the checklist has not been completed.

Please provide the following information:

  • librdkafka version (release number or git tag): 1.0.0 release
  • Apache Kafka version: 2.1.1 (confluent-5.1.2)
  • librdkafka client configuration: socket.timeout.ms=10000, others as default
  • Operating system: suse12
  • Provide logs (with debug=.. as necessary) from librdkafka
  • Provide broker log excerpts
  • Critical issue
@edenhill
Contributor

Can you provide a small but complete program that reproduces this error?

sezruby commented May 27, 2019

Hi,

Unfortunately, I can't provide one because it's part of our product... :(
I'm currently working on a PoC with our product + librdkafka.
But it's very simple, like the example code in the repo; so far I've only used the produce() function, apart from creating/destroying the producer, topic, and config.

I reproduced the issue and got the debug=all traces. Please check the attached file.
It includes traces of the termination of two producers in a row. The first one was destroyed successfully, but the second one got stuck at the same location:

#0  0x00007fd9b11f34c2 in pthread_join () from /lib64/libpthread.so.0
...
#3  0x00007fd0dacd93e2 in thrd_join (thr=thr@entry=140563075479296, res=res@entry=0x7fd6ff9a8e5c) at tinycthread.c:692
#4  0x00007fd0dac74812 in rd_kafka_destroy_app (rk=0x7fd852e7c000, flags=<optimized out>) at rdkafka.c:939
#5  0x00007fd99f3b973b in clear () at ../../KafkaAdaptor/KafkaProducerJob.cpp:165
#6  0x00007fd99f3c27f6 in __base_dtor  () at ../../KafkaAdaptor/KafkaProducerJob.cpp:152
...

This time I got two call stacks with "rdkafka" in the filename; it might be because of debug=all:

#0  0x00007fd9b11f7ff0 in sem_wait () from /lib64/libpthread.so.0
...
#10 0x00007fd9af98fc49 in __dcigettext () from /lib64/libc.so.6
#11 0x00007fd9af9e14be in strerror_r () from /lib64/libc.so.6
#12 0x00007fd0dac7e093 in rd_strerror (err=11) at rdposix.h:103
#13 rd_kafka_broker_fail (rkb=rkb@entry=0x7fd852e79c00, level=level@entry=7, err=err@entry=RD_KAFKA_RESP_ERR__DESTROY, fmt=fmt@entry=0x7fd0dad275f7 "Client is terminating") at rdkafka_broker.c:370
#14 0x00007fd0dac856b7 in rd_kafka_broker_op_serve (rkb=rkb@entry=0x7fd852e79c00, rko=0x7fd22a6dee30) at rdkafka_broker.c:2907
#15 0x00007fd0dac8664e in rd_kafka_broker_ops_serve (rkb=rkb@entry=0x7fd852e79c00, timeout_ms=<optimized out>) at rdkafka_broker.c:2976
#16 0x00007fd0dac866e7 in rd_kafka_broker_ops_io_serve (rkb=rkb@entry=0x7fd852e79c00, abs_timeout=abs_timeout@entry=22098593681823) at rdkafka_broker.c:3028
#17 0x00007fd0dac87703 in rd_kafka_broker_producer_serve (abs_timeout=22098593681823, rkb=0x7fd852e79c00) at rdkafka_broker.c:3512
#18 rd_kafka_broker_serve (rkb=rkb@entry=0x7fd852e79c00, timeout_ms=<optimized out>, timeout_ms@entry=1000) at rdkafka_broker.c:4245
#19 0x00007fd0dac87c0d in rd_kafka_broker_thread_main (arg=arg@entry=0x7fd852e79c00) at rdkafka_broker.c:4386
#20 0x00007fd0dacd9117 in _thrd_wrapper_function (aArg=<optimized out>) at tinycthread.c:576
...
#24 0x00007fd9b11f20a4 in start_thread () from /lib64/libpthread.so.0
#25 0x00007fd9afa4502d in clone () from /lib64/libc.so.6
#0  0x00007fd9b11f34c2 in pthread_join () from /lib64/libpthread.so.0
...
#3  0x00007fd0dacd93e2 in thrd_join (thr=<optimized out>, res=res@entry=0x7fd7642e4f2c) at tinycthread.c:692
#4  0x00007fd0dac6db36 in rd_kafka_destroy_internal (rk=rk@entry=0x7fd852e7c000) at rdkafka.c:1084
#5  0x00007fd0dac6e38b in rd_kafka_thread_main (arg=arg@entry=0x7fd852e7c000) at rdkafka.c:1800
#6  0x00007fd0dacd9117 in _thrd_wrapper_function (aArg=<optimized out>) at tinycthread.c:576
...
#10 0x00007fd9b11f20a4 in start_thread () from /lib64/libpthread.so.0
#11 0x00007fd9afa4502d in clone () from /lib64/libc.so.6

The issue reproduces when there is an interval of doing nothing with rdkafka before the producer termination, as in the trace. (Without the interval, it does not reproduce!)

create producer
create topic
publish events
** do nothing for 5~10 minutes! **
destroy topic
destroy producer => hang

For example:

%7|1558951364.322|MSGSET|hehetest#producer-15| [thrd:MY_HOST:9092/0]: MY_HOST:9092/0: MY_TOPIC_NAME [0]: MessageSet with 1 message(s) (MsgId 0, BaseSeq -1) delivered
=> some interval
%7|1558951648.576|METADATA|hehetest#producer-15| [thrd:main]: Requesting metadata for 1/1 topics: periodic refresh
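
In code terms, the tail end of that sequence looks roughly like this (a fragment only; the interval length is approximate and the setup is the same as in the snippet in the description):

// ... create producer, create topic, publish some events ...

sleep(600);                          // do nothing with rdkafka for ~10 minutes

// topic deletion
rd_kafka_topic_destroy(rd_topic);

// destroy producer
while (rd_kafka_outq_len(rd_producer) > 0)
    rd_kafka_poll(rd_producer, 50);

rd_kafka_destroy(rd_producer);       // => hangs here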

Some lines in the stack traces were removed because we have a wrapper implementation for pthread.
Though it's just a simple wrapper for additional statistics, it might be causing this problem.
But I'd like you to confirm whether or not this is an rdkafka issue.

Thanks in advance!

Regards,
Jin

rdkafka_debug.txt

@edenhill
Contributor

But it's very simple, like the example code in the repo

It would be great if you could modify the example code in the repo to reproduce this issue.

sezruby commented May 29, 2019

I found that there can be a problem in our pthread wrapper implementation: it keeps a "safe pointer" for each thread, and its reference somehow goes wrong with external pthread function calls.
I'll reopen this after sufficient investigation.
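
For context, the wrapper is wired in roughly along these lines; this is only an illustrative interposition sketch, not our actual code, and the real bookkeeping is more involved:

#define _GNU_SOURCE
#include <dlfcn.h>
#include <pthread.h>

static int (*real_pthread_join)(pthread_t, void **);

// Interposed pthread_join: updates per-thread statistics, then forwards
// to the real libpthread implementation resolved via dlsym(RTLD_NEXT).
int pthread_join(pthread_t thread, void **retval) {
    if (!real_pthread_join)
        real_pthread_join = (int (*)(pthread_t, void **))
                dlsym(RTLD_NEXT, "pthread_join");

    // ... update the per-thread "safe pointer" statistics here; this is
    // the bookkeeping I suspect goes wrong when external code such as
    // librdkafka calls the pthread functions ...

    return real_pthread_join(thread, retval);
}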

Thanks for your support :)

sezruby closed this as completed May 29, 2019