librdkafka version (release number or git tag): v0.11.3
We are planning to upgrade to 0.11.4, but some SSL compatibility issues in our system currently prevent it. We can try later once we resolve them, but in the meantime, is there anything we can try to narrow down the problem on the current version? We have also noticed a very similar hang in 0.9.2 and 0.9.5.
I am trying to reproduce it with a much smaller reproducer, and we are simultaneously working on resolving the SSL conflicts in our 0.11.4 upgrade branch. Will let you know if we have updates.
Description
Hello Magnus,
We use librdkafka heavily, especially on the consumer side. For some reason, our system will sometimes destroy the rdkafka instance immediately after creating it.
We are using librdkafka 0.11.3, and this seems to cause a hang during the teardown stage. We have investigated a bit and found that it appears stuck in the while loop inside the rd_kafka_broker_thread_main function.
The exit condition for the while loop is:
while (!rd_kafka_broker_terminating(rkb)) {...}
rd_kafka_broker_terminating(rkb) will return true if rkb->rkb_refcnt <= 1
We found the refcnt going down to 2 and then getting stuck there, never decreasing to 1.
We have a core dump that captured the hang. Here are the related thread backtraces:
App thread:
#0 0x00007f2690520f47 in pthread_join () from ./tmp/lib64/libpthread.so.0
#1 0x00007f25ffb55342 in thrd_join (thr=, res=0x0) at tinycthread.c:749
#2 0x00007f25ffafc2de in rd_kafka_destroy_app (blocking=1, rk=0x7f243836cf80) at rdkafka.c:708
#3 rd_kafka_destroy (rk=0x7f243836cf80) at rdkafka.c:718
#4 0x00007f25ffae985e in RdKafka::KafkaConsumerImpl::close (this=0x7f2438272c50) at KafkaConsumerImpl.cpp:251
......
The internal Consumer thread:
#0 0x00007f2690520f47 in pthread_join () from ./tmp/lib64/libpthread.so.0
#1 0x00007f25ffb55342 in thrd_join (thr=, res=0x0) at tinycthread.c:749
#2 0x00007f25ffaf2e9d in rd_kafka_destroy_internal (rk=rk@entry=0x7f243836cf80) at rdkafka.c:819
#3 0x00007f25ffaf35a2 in rd_kafka_thread_main (arg=arg@entry=0x7f243836cf80) at rdkafka.c:1251
#4 0x00007f25ffb54ec7 in _thrd_wrapper_function (aArg=) at tinycthread.c:624
#5 0x00007f269051fdd5 in start_thread () from ./tmp/lib64/libpthread.so.0
#6 0x00007f268fe3db3d in clone () from ./tmp/lib64/libc.so.6
Broker thread:
#0 0x00007f268fe32e9d in poll () from ./tmp/lib64/libc.so.6
#1 0x00007f25ffb1b75d in rd_kafka_transport_poll (rktrans=rktrans@entry=0x7f24fc1f92e0, tmout=tmout@entry=1) at rdkafka_transport.c:1472
#2 0x00007f25ffb1b7db in rd_kafka_transport_io_serve (rktrans=0x7f24fc1f92e0, timeout_ms=1) at rdkafka_transport.c:1331
#3 0x00007f25ffb0a598 in rd_kafka_broker_serve (rkb=rkb@entry=0x7f243833a410, abs_timeout=abs_timeout@entry=323930263456) at rdkafka_broker.c:2233
#4 0x00007f25ffb0a9ef in rd_kafka_broker_ua_idle (rkb=rkb@entry=0x7f243833a410, timeout_ms=, timeout_ms@entry=-1) at rdkafka_broker.c:2295
#5 0x00007f25ffb0aeb8 in rd_kafka_broker_thread_main (arg=arg@entry=0x7f243833a410) at rdkafka_broker.c:3150
#6 0x00007f25ffb54ec7 in _thrd_wrapper_function (aArg=) at tinycthread.c:624
#7 0x00007f269051fdd5 in start_thread () from ./tmp/lib64/libpthread.so.0
#8 0x00007f268fe3db3d in clone () from ./tmp/lib64/libc.so.6
We can see that both the App thread and the internal consumer thread are in the destroy call, waiting for the broker thread to finish. The broker thread shows up in poll() because the poll uses a 1 ms timeout, so the backtrace will mostly land there.
In the broker thread, execution entered rd_kafka_broker_ua_idle() from case RD_KAFKA_BROKER_STATE_APIVERSION_QUERY, which seems to be part of the startup process. We also noticed that rkb->rkb_rk->rk_terminate had already been set to true inside rd_kafka_broker_ua_idle(), so it looks like the broker never finished its startup before the shutdown flag was set (as mentioned above, we destroy the rdkafka instance immediately after creating it).
Do you have any idea where a race condition could happen that causes this?
And how should we proceed?
Thank you very much!
How to reproduce
Currently it takes us approximately ~50 hours to reproduce this issue in our long-running test. It reproduces quite reliably; we have hit it in every recent long-running test.
The long-running test keeps creating an rdkafka instance to consume from a Kafka cluster and destroys it afterwards. Some of these runs destroy the instance immediately after creation due to some internal calculation, and we suspect this is what triggers the hang. We are trying to narrow the scope by running this zero-duration create/destroy operation repeatedly, but unfortunately we haven't yet obtained a short-time reproducer.
Checklist
IMPORTANT: We will close issues where the checklist has not been completed.
Please provide the following information:
librdkafka version (release number or git tag): v0.11.3
Apache Kafka version: 1.1.0
librdkafka client configuration: queued.min.messages=10000
Operating system: RHEL 7
Provide logs (with debug=.. as necessary) from librdkafka