
rd_kafka_destroy_app got stuck in pthread_join #2339

Closed
4 of 7 tasks

sezruby opened this issue May 27, 2019 · 4 comments

sezruby commented May 27, 2019

Read the FAQ first: https://github.com/edenhill/librdkafka/wiki/FAQ

Description

I create a producer and later destroy it, but rd_kafka_destroy_app hangs with the following backtrace:

#0 0x00007f504ef904c2 in pthread_join () from /lib64/libpthread.so.0
...
#3 0x00007f47e81153e2 in thrd_join (thr=thr@entry=139932621911808, res=res@entry=0x7f4cbc2a5e5c) at tinycthread.c:692
#4 0x00007f47e80b0812 in rd_kafka_destroy_app (rk=0x7f4e7f3e9000, flags=<optimized out>) at rdkafka.c:939
#5 0x00007f503d15673b in clear () at ../../KafkaAdaptor/KafkaProducerJob.cpp:165
#6 0x00007f503d15f7f6 in __base_dtor () at ../../KafkaAdaptor/KafkaProducerJob.cpp:152
...

It doesn't always happen, but I can often see the symptom.
There weren't any other thread call stacks with an "rdkafka" filename.

I guess the "rkb" (or main internal) thread was already terminated and destroyed before the join:
(gdb) p *rk
$10 = {rk_rep = 0x7f4e7f013900, rk_ops = 0x7f4e7f00f040, rk_brokers = {tqh_first = 0x0, tqh_last = 0x7f4e7f3e9010}, rk_broker_by_id = {rl_size = 16, rl_cnt = 0, rl_elems = 0x7f4e9e731480, rl_free_cb = 0x0, rl_flags = 2, rl_elemsize = 0, rl_p = 0x0}, rk_broker_cnt = {val = 0},
rk_broker_up_cnt = {val = 0}, rk_broker_down_cnt = {val = 2}, rk_broker_addrless_cnt = {val = 0}, rk_internal_rkb_lock = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}},
__size = '\000' <repeats 39 times>, __align = 0}, rk_internal_rkb = 0x0, rk_broker_state_change_cnd = {__data = {__lock = 0, __futex = 0, __total_seq = 0, __wakeup_seq = 0, __woken_seq = 0, __mutex = 0x0, __nwaiters = 0, __broadcast_seq = 0}, __size = '\000' <repeats 47 times>,
__align = 0}, rk_broker_state_change_lock = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}, rk_broker_state_change_version = 17,
rk_broker_state_change_waiters = {rl_size = 8, rl_cnt = 0, rl_elems = 0x7f4e7ead8f80
...

Could you please take a look and check whether this is a bug or not?
I'll try to get a debug trace of this issue.

How to reproduce

1.0.0 release

// create producer
rd_producer = rd_kafka_new(RD_KAFKA_PRODUCER, rd_gconf, errstr, sizeof(errstr));

// topic creation
rd_topic = rd_kafka_topic_new(m_producerHandle->getProducer(), m_jobName.c_str(), rd_tconf);

.. publish some events...

// topic deletion
rd_kafka_topic_destroy(rd_topic);

// destroy producer
while (rd_kafka_outq_len(rd_producer) > 0)
    rd_kafka_poll(rd_producer, 50);

rd_kafka_destroy(rd_producer);  // => hangs here
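
For reference, a minimal self-contained sketch of the same sequence (broker address, topic name, payload and config values are placeholders rather than our real ones, and error handling is omitted for brevity):

#include <string.h>
#include <librdkafka/rdkafka.h>

int main(void) {
    char errstr[512];

    // producer configuration (placeholder values)
    rd_kafka_conf_t *conf = rd_kafka_conf_new();
    rd_kafka_conf_set(conf, "bootstrap.servers", "MY_HOST:9092",
                      errstr, sizeof(errstr));
    rd_kafka_conf_set(conf, "socket.timeout.ms", "10000",
                      errstr, sizeof(errstr));

    // create producer (rd_kafka_new takes ownership of conf on success)
    rd_kafka_t *rk = rd_kafka_new(RD_KAFKA_PRODUCER, conf,
                                  errstr, sizeof(errstr));

    // topic creation
    rd_kafka_topic_t *rkt = rd_kafka_topic_new(rk, "MY_TOPIC_NAME", NULL);

    // publish some events
    const char *payload = "some event";
    rd_kafka_produce(rkt, RD_KAFKA_PARTITION_UA, RD_KAFKA_MSG_F_COPY,
                     (void *)payload, strlen(payload), NULL, 0, NULL);

    // topic deletion
    rd_kafka_topic_destroy(rkt);

    // destroy producer: drain the delivery report queue first
    while (rd_kafka_outq_len(rk) > 0)
        rd_kafka_poll(rk, 50);

    rd_kafka_destroy(rk);   // => hangs here (sometimes)
    return 0;
}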

Checklist

IMPORTANT: We will close issues where the checklist has not been completed.

Please provide the following information:

  • librdkafka version (release number or git tag): 1.0.0 release
  • Apache Kafka version: 2.1.1 (confluent-5.1.2)
  • librdkafka client configuration: socket.timeout.ms=10000, others as default
  • Operating system: suse12
  • Provide logs (with debug=.. as necessary) from librdkafka
  • Provide broker log excerpts
  • Critical issue
@edenhill
Contributor

Can you provide a small but complete program that reproduces this error?

sezruby commented May 27, 2019

Hi,

Unfortunately, I can't provide one because it's part of our product... :(
I'm currently working on a PoC with our product + librdkafka.
But it's very simple, like the example code in the repo; so far I've only used the produce() function, apart from creating/destroying the producer, topic, and config.

I reproduced the issue and got the debug=all traces. Please check the attached file.
It includes traces of the termination of two producers in a row. The first one was destroyed successfully, but the second one got stuck at the same location:

#0  0x00007fd9b11f34c2 in pthread_join () from /lib64/libpthread.so.0
...
#3  0x00007fd0dacd93e2 in thrd_join (thr=thr@entry=140563075479296, res=res@entry=0x7fd6ff9a8e5c) at tinycthread.c:692
#4  0x00007fd0dac74812 in rd_kafka_destroy_app (rk=0x7fd852e7c000, flags=<optimized out>) at rdkafka.c:939
#5  0x00007fd99f3b973b in clear () at ../../KafkaAdaptor/KafkaProducerJob.cpp:165
#6  0x00007fd99f3c27f6 in __base_dtor  () at ../../KafkaAdaptor/KafkaProducerJob.cpp:152
...

This time I got two call stacks with "rdkafka" in the filename; it might be because of debug=all:

#0  0x00007fd9b11f7ff0 in sem_wait () from /lib64/libpthread.so.0
...
#10 0x00007fd9af98fc49 in __dcigettext () from /lib64/libc.so.6
#11 0x00007fd9af9e14be in strerror_r () from /lib64/libc.so.6
#12 0x00007fd0dac7e093 in rd_strerror (err=11) at rdposix.h:103
#13 rd_kafka_broker_fail (rkb=rkb@entry=0x7fd852e79c00, level=level@entry=7, err=err@entry=RD_KAFKA_RESP_ERR__DESTROY, fmt=fmt@entry=0x7fd0dad275f7 "Client is terminating") at rdkafka_broker.c:370
#14 0x00007fd0dac856b7 in rd_kafka_broker_op_serve (rkb=rkb@entry=0x7fd852e79c00, rko=0x7fd22a6dee30) at rdkafka_broker.c:2907
#15 0x00007fd0dac8664e in rd_kafka_broker_ops_serve (rkb=rkb@entry=0x7fd852e79c00, timeout_ms=<optimized out>) at rdkafka_broker.c:2976
#16 0x00007fd0dac866e7 in rd_kafka_broker_ops_io_serve (rkb=rkb@entry=0x7fd852e79c00, abs_timeout=abs_timeout@entry=22098593681823) at rdkafka_broker.c:3028
#17 0x00007fd0dac87703 in rd_kafka_broker_producer_serve (abs_timeout=22098593681823, rkb=0x7fd852e79c00) at rdkafka_broker.c:3512
#18 rd_kafka_broker_serve (rkb=rkb@entry=0x7fd852e79c00, timeout_ms=<optimized out>, timeout_ms@entry=1000) at rdkafka_broker.c:4245
#19 0x00007fd0dac87c0d in rd_kafka_broker_thread_main (arg=arg@entry=0x7fd852e79c00) at rdkafka_broker.c:4386
#20 0x00007fd0dacd9117 in _thrd_wrapper_function (aArg=<optimized out>) at tinycthread.c:576
...
#24 0x00007fd9b11f20a4 in start_thread () from /lib64/libpthread.so.0
#25 0x00007fd9afa4502d in clone () from /lib64/libc.so.6
#0  0x00007fd9b11f34c2 in pthread_join () from /lib64/libpthread.so.0
...
#3  0x00007fd0dacd93e2 in thrd_join (thr=<optimized out>, res=res@entry=0x7fd7642e4f2c) at tinycthread.c:692
#4  0x00007fd0dac6db36 in rd_kafka_destroy_internal (rk=rk@entry=0x7fd852e7c000) at rdkafka.c:1084
#5  0x00007fd0dac6e38b in rd_kafka_thread_main (arg=arg@entry=0x7fd852e7c000) at rdkafka.c:1800
#6  0x00007fd0dacd9117 in _thrd_wrapper_function (aArg=<optimized out>) at tinycthread.c:576
...
#10 0x00007fd9b11f20a4 in start_thread () from /lib64/libpthread.so.0
#11 0x00007fd9afa4502d in clone () from /lib64/libc.so.6

The issue reproduces when there is an interval of doing nothing with rdkafka before the producer termination, as in the trace. (Without the interval, it does not reproduce!)

create producer
create topic
publish events
** do nothing for 5~10 minutes! **
destroy topic
destroy producer => hang

For example:

%7|1558951364.322|MSGSET|hehetest#producer-15| [thrd:MY_HOST:9092/0]: MY_HOST:9092/0: MY_TOPIC_NAME [0]: MessageSet with 1 message(s) (MsgId 0, BaseSeq -1) delivered
=> some interval
%7|1558951648.576|METADATA|hehetest#producer-15| [thrd:main]: Requesting metadata for 1/1 topics: periodic refresh
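
In code terms, the tail end of that sequence looks roughly like this (a fragment only; the interval length is approximate and the setup is the same as in the snippet in the description):

// ... create producer, create topic, publish some events ...

sleep(600);                          // do nothing with rdkafka for ~10 minutes

// topic deletion
rd_kafka_topic_destroy(rd_topic);

// destroy producer
while (rd_kafka_outq_len(rd_producer) > 0)
    rd_kafka_poll(rd_producer, 50);

rd_kafka_destroy(rd_producer);       // => hangs here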

Some lines in the stack traces were removed because we have a wrapper implementation for pthread.
Though it's just a simple wrapper for additional statistics, it might be causing this problem.
But I'd like you to confirm whether or not this is an rdkafka issue.

Thanks in advance!

Regards,
Jin

rdkafka_debug.txt

@edenhill
Contributor

But it's very simple, like the example code in the repo

It would be great if you could modify the example code in the repo to reproduce this issue.

sezruby commented May 29, 2019

I found that there can be a problem in our pthread wrapper implementation: it keeps a "safe pointer" for each thread, and its reference somehow goes wrong with external pthread function calls.
I'll reopen this after sufficient investigation.
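
For context, the wrapper is wired in roughly along these lines; this is only an illustrative interposition sketch, not our actual code, and the real bookkeeping is more involved:

#define _GNU_SOURCE
#include <dlfcn.h>
#include <pthread.h>

static int (*real_pthread_join)(pthread_t, void **);

// Interposed pthread_join: updates per-thread statistics, then forwards
// to the real libpthread implementation resolved via dlsym(RTLD_NEXT).
int pthread_join(pthread_t thread, void **retval) {
    if (!real_pthread_join)
        real_pthread_join = (int (*)(pthread_t, void **))
                dlsym(RTLD_NEXT, "pthread_join");

    // ... update the per-thread "safe pointer" statistics here; this is
    // the bookkeeping I suspect goes wrong when external code such as
    // librdkafka calls the pthread functions ...

    return real_pthread_join(thread, retval);
}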

Thanks for your support :)

sezruby closed this as completed May 29, 2019