librdkafka hang during reference count problem #1872

Closed
4 of 7 tasks
blackeyepanda opened this issue Jul 2, 2018 · 4 comments

@blackeyepanda
Contributor

blackeyepanda commented Jul 2, 2018

Read the FAQ first: https://github.com/edenhill/librdkafka/wiki/FAQ

Description

Hello Magnus,

We are using librdkafka actively, especially on the consumer side. For reasons internal to our application, our system will sometimes destroy an rdkafka instance immediately after creating it.

We are on librdkafka 0.11.3, and this pattern occasionally causes a hang during the teardown stage. We have investigated a little and the hang appears to be in the while loop inside the rd_kafka_broker_thread_main function.

The exit condition for the while loop is:
while (!rd_kafka_broker_terminating(rkb)) {...}
rd_kafka_broker_terminating(rkb) returns true once rkb->rkb_refcnt <= 1.
We saw the refcount drop to 2 and then get stuck there; it never decreased to 1.

We have a core dump that captured the hang. Here are the relevant thread backtraces:

  1. App thread:
    #0 0x00007f2690520f47 in pthread_join () from ./tmp/lib64/libpthread.so.0
    #1 0x00007f25ffb55342 in thrd_join (thr=, res=0x0) at tinycthread.c:749
    #2 0x00007f25ffafc2de in rd_kafka_destroy_app (blocking=1, rk=0x7f243836cf80) at rdkafka.c:708
    #3 rd_kafka_destroy (rk=0x7f243836cf80) at rdkafka.c:718
    #4 0x00007f25ffae985e in RdKafka::KafkaConsumerImpl::close (this=0x7f2438272c50) at KafkaConsumerImpl.cpp:251
    ......

  2. The internal Consumer thread:
    #0 0x00007f2690520f47 in pthread_join () from ./tmp/lib64/libpthread.so.0
    #1 0x00007f25ffb55342 in thrd_join (thr=, res=0x0) at tinycthread.c:749
    #2 0x00007f25ffaf2e9d in rd_kafka_destroy_internal (rk=rk@entry=0x7f243836cf80) at rdkafka.c:819
    #3 0x00007f25ffaf35a2 in rd_kafka_thread_main (arg=arg@entry=0x7f243836cf80) at rdkafka.c:1251
    #4 0x00007f25ffb54ec7 in _thrd_wrapper_function (aArg=) at tinycthread.c:624
    #5 0x00007f269051fdd5 in start_thread () from ./tmp/lib64/libpthread.so.0
    #6 0x00007f268fe3db3d in clone () from ./tmp/lib64/libc.so.6

  3. broker thread:
    #0 0x00007f268fe32e9d in poll () from ./tmp/lib64/libc.so.6
    #1 0x00007f25ffb1b75d in rd_kafka_transport_poll (rktrans=rktrans@entry=0x7f24fc1f92e0, tmout=tmout@entry=1) at rdkafka_transport.c:1472
    #2 0x00007f25ffb1b7db in rd_kafka_transport_io_serve (rktrans=0x7f24fc1f92e0, timeout_ms=1) at rdkafka_transport.c:1331
    #3 0x00007f25ffb0a598 in rd_kafka_broker_serve (rkb=rkb@entry=0x7f243833a410, abs_timeout=abs_timeout@entry=323930263456) at rdkafka_broker.c:2233
    #4 0x00007f25ffb0a9ef in rd_kafka_broker_ua_idle (rkb=rkb@entry=0x7f243833a410, timeout_ms=, timeout_ms@entry=-1) at rdkafka_broker.c:2295
    #5 0x00007f25ffb0aeb8 in rd_kafka_broker_thread_main (arg=arg@entry=0x7f243833a410) at rdkafka_broker.c:3150
    #6 0x00007f25ffb54ec7 in _thrd_wrapper_function (aArg=) at tinycthread.c:624
    #7 0x00007f269051fdd5 in start_thread () from ./tmp/lib64/libpthread.so.0
    #8 0x00007f268fe3db3d in clone () from ./tmp/lib64/libc.so.6

We can see that both the app thread and the internal consumer thread are inside the destroy call, waiting for the broker thread to finish. The broker thread shows up in poll() simply because that poll uses a 1 ms timeout, so the backtrace will almost always land there.

The broker thread entered rd_kafka_broker_ua_idle() from the RD_KAFKA_BROKER_STATE_APIVERSION_QUERY case, which seems to be part of the startup process. We also noticed that rkb->rkb_rk->rk_terminate had already been set to true inside rd_kafka_broker_ua_idle(), so it looks like the broker never finished its startup before the shutdown flag was set (as mentioned above, we destroy the rdkafka instance immediately after creating it).
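
To make sure we are describing the shape of the hang correctly, here is a toy model of how we read the situation (plain pthreads and C11 atomics, NOT librdkafka code; the names only mirror the roles of the threads in the backtraces above):

    /* Toy model of the suspected hang: the worker loop only exits once the
     * refcount drops to 1, and the destroying thread joins the worker.
     * If an extra reference is never released, the join blocks forever. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <unistd.h>

    static atomic_int  refcnt         = 3;     /* worker's own ref + two others  */
    static atomic_bool terminate_flag = false; /* plays the role of rk_terminate */

    /* Plays the role of rd_kafka_broker_thread_main(): the loop condition
     * only watches the refcount, like rd_kafka_broker_terminating(). */
    static void *worker_main(void *arg) {
            (void)arg;
            while (atomic_load(&refcnt) > 1)
                    usleep(1000);              /* the 1 ms poll() in the trace */
            return NULL;
    }

    int main(void) {
            pthread_t worker;
            pthread_create(&worker, NULL, worker_main, NULL);

            /* Teardown starts immediately after creation. */
            atomic_store(&terminate_flag, true);
            atomic_fetch_sub(&refcnt, 1);      /* one extra ref is released ... */

            /* ... but if the second extra ref is never released (the suspected
             * race), refcnt is stuck at 2, the worker loop never exits, and
             * this join blocks forever, just like the pthread_join() frames
             * in the app and internal consumer threads above. */
            pthread_join(worker, NULL);
            puts("worker joined; not reached in the buggy scenario");
            return 0;
    }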

Do you have any idea where a race condition could occur that would cause this?
And how should we proceed from here?

Thank you very much!

How to reproduce

Currently it takes us approximately 50 hours to reproduce this issue in our long-running test. It reproduces quite reliably; we have hit it in every recent long-running test.

The long-running test keeps creating rdkafka instances that consume from a Kafka cluster, destroying each instance afterwards. In some iterations the instance is destroyed immediately after creation, due to an internal calculation on our side, and we suspect this is what triggers the hang. We are trying to narrow the scope down by running this zero-duration create/destroy operation repeatedly, but unfortunately we have not managed to get a short-time reproducer yet.
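
For reference, here is a skeleton of what one iteration of the test does, written against the plain C API for brevity (our production code goes through the C++ wrapper; the broker address, group id and topic name below are placeholders rather than our real values, and error handling is trimmed):

    /* Skeleton of one test iteration; placeholders, error handling trimmed. */
    #include <librdkafka/rdkafka.h>

    static void create_consume_destroy(int consume_for_ms) {
            char errstr[512];
            rd_kafka_conf_t *conf = rd_kafka_conf_new();

            rd_kafka_conf_set(conf, "bootstrap.servers", "kafkabroker:9092",
                              errstr, sizeof(errstr));
            rd_kafka_conf_set(conf, "group.id", "longrun-test",
                              errstr, sizeof(errstr));
            rd_kafka_conf_set(conf, "queued.min.messages", "10000",
                              errstr, sizeof(errstr));

            rd_kafka_t *rk = rd_kafka_new(RD_KAFKA_CONSUMER, conf,
                                          errstr, sizeof(errstr));
            rd_kafka_poll_set_consumer(rk);

            rd_kafka_topic_partition_list_t *topics =
                    rd_kafka_topic_partition_list_new(1);
            rd_kafka_topic_partition_list_add(topics, "test-topic",
                                              RD_KAFKA_PARTITION_UA);
            rd_kafka_subscribe(rk, topics);
            rd_kafka_topic_partition_list_destroy(topics);

            /* In the suspected bad case consume_for_ms is effectively 0,
             * i.e. we tear down right after creation. */
            while (consume_for_ms > 0) {
                    rd_kafka_message_t *msg = rd_kafka_consumer_poll(rk, 100);
                    if (msg)
                            rd_kafka_message_destroy(msg);
                    consume_for_ms -= 100;
            }

            rd_kafka_consumer_close(rk);
            rd_kafka_destroy(rk);   /* this is the call that hangs */
    }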

Checklist

IMPORTANT: We will close issues where the checklist has not been completed.

Please provide the following information:

  • librdkafka version (release number or git tag): v0.11.3
  • Apache Kafka version: 1.1.0
  • librdkafka client configuration: queued.min.messages=10000
  • Operating system: RHEL 7
  • Provide logs (with debug=.. as necessary) from librdkafka
  • Provide broker log excerpts
  • Critical issue
@edenhill
Contributor

edenhill commented Jul 2, 2018

This is a great bug report!

Could you try to reproduce on v0.11.4 or preferably master? There have been quite a lot of changes since v0.11.3.

@blackeyepanda
Contributor Author

We are planning to upgrade to 0.11.4, but at the moment some SSL compatibility issues in our system prevent us from doing so. We can try that once we resolve them, but in the meantime, is there anything we can try to narrow down the problem on the current version? We have also noticed a very similar hang problem in 0.9.2 and 0.9.5.

Thank you very much

@edenhill
Contributor

edenhill commented Jul 5, 2018

If you could provide a minimal reproducible test case / program, I could try to reproduce it on the latest version.

@blackeyepanda
Contributor Author

I am trying to reproduce it with a much smaller reproducer, and we are simultaneously working on resolving the SSL conflicts in our 0.11.4 upgrade branch. We will let you know when we have updates.
