librdkafka hang during reference count problem #1872

Closed
4 of 7 tasks
blackeyepanda opened this issue Jul 2, 2018 · 4 comments

@blackeyepanda
Contributor

blackeyepanda commented Jul 2, 2018

Read the FAQ first: https://github.com/edenhill/librdkafka/wiki/FAQ

Description

Hello Magnus,

We are using librdkafka actively, especially on the consumer side. For reasons internal to our application, our system will sometimes destroy an rdkafka instance immediately after creating it.

We are on librdkafka 0.11.3, and this pattern occasionally causes a hang during the teardown stage. We have investigated a little and the hang appears to be in the while loop inside the rd_kafka_broker_thread_main function.

The exit condition for the while loop is:
while (!rd_kafka_broker_terminating(rkb)) {...}
rd_kafka_broker_terminating(rkb) returns true once rkb->rkb_refcnt <= 1.
We saw the refcount drop to 2 and then get stuck there; it never decreased to 1.

We have a core dump that captured the hang. Here are the relevant thread backtraces:

  1. App thread:
    #0 0x00007f2690520f47 in pthread_join () from ./tmp/lib64/libpthread.so.0
    #1 0x00007f25ffb55342 in thrd_join (thr=, res=0x0) at tinycthread.c:749
    #2 0x00007f25ffafc2de in rd_kafka_destroy_app (blocking=1, rk=0x7f243836cf80) at rdkafka.c:708
    #3 rd_kafka_destroy (rk=0x7f243836cf80) at rdkafka.c:718
    #4 0x00007f25ffae985e in RdKafka::KafkaConsumerImpl::close (this=0x7f2438272c50) at KafkaConsumerImpl.cpp:251
    ......

  2. The internal Consumer thread:
    #0 0x00007f2690520f47 in pthread_join () from ./tmp/lib64/libpthread.so.0
    #1 0x00007f25ffb55342 in thrd_join (thr=, res=0x0) at tinycthread.c:749
    #2 0x00007f25ffaf2e9d in rd_kafka_destroy_internal (rk=rk@entry=0x7f243836cf80) at rdkafka.c:819
    #3 0x00007f25ffaf35a2 in rd_kafka_thread_main (arg=arg@entry=0x7f243836cf80) at rdkafka.c:1251
    #4 0x00007f25ffb54ec7 in _thrd_wrapper_function (aArg=) at tinycthread.c:624
    #5 0x00007f269051fdd5 in start_thread () from ./tmp/lib64/libpthread.so.0
    #6 0x00007f268fe3db3d in clone () from ./tmp/lib64/libc.so.6

  3. broker thread:
    #0 0x00007f268fe32e9d in poll () from ./tmp/lib64/libc.so.6
    #1 0x00007f25ffb1b75d in rd_kafka_transport_poll (rktrans=rktrans@entry=0x7f24fc1f92e0, tmout=tmout@entry=1) at rdkafka_transport.c:1472
    #2 0x00007f25ffb1b7db in rd_kafka_transport_io_serve (rktrans=0x7f24fc1f92e0, timeout_ms=1) at rdkafka_transport.c:1331
    #3 0x00007f25ffb0a598 in rd_kafka_broker_serve (rkb=rkb@entry=0x7f243833a410, abs_timeout=abs_timeout@entry=323930263456) at rdkafka_broker.c:2233
    #4 0x00007f25ffb0a9ef in rd_kafka_broker_ua_idle (rkb=rkb@entry=0x7f243833a410, timeout_ms=, timeout_ms@entry=-1) at rdkafka_broker.c:2295
    #5 0x00007f25ffb0aeb8 in rd_kafka_broker_thread_main (arg=arg@entry=0x7f243833a410) at rdkafka_broker.c:3150
    #6 0x00007f25ffb54ec7 in _thrd_wrapper_function (aArg=) at tinycthread.c:624
    #7 0x00007f269051fdd5 in start_thread () from ./tmp/lib64/libpthread.so.0
    #8 0x00007f268fe3db3d in clone () from ./tmp/lib64/libc.so.6

We can see that both the app thread and the internal consumer thread are inside the destroy call, waiting for the broker thread to finish. The broker thread shows up in poll() simply because that poll uses a 1 ms timeout, so the backtrace will almost always land there.

The broker thread entered rd_kafka_broker_ua_idle() from the RD_KAFKA_BROKER_STATE_APIVERSION_QUERY case, which seems to be part of the startup process. We also noticed that rkb->rkb_rk->rk_terminate had already been set to true inside rd_kafka_broker_ua_idle(), so it looks like the broker never finished its startup before the shutdown flag was set (as mentioned above, we destroy the rdkafka instance immediately after creating it).
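
To make sure we are describing the shape of the hang correctly, here is a toy model of how we read the situation (plain pthreads and C11 atomics, NOT librdkafka code; the names only mirror the roles of the threads in the backtraces above):

    /* Toy model of the suspected hang: the worker loop only exits once the
     * refcount drops to 1, and the destroying thread joins the worker.
     * If an extra reference is never released, the join blocks forever. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <unistd.h>

    static atomic_int  refcnt         = 3;     /* worker's own ref + two others  */
    static atomic_bool terminate_flag = false; /* plays the role of rk_terminate */

    /* Plays the role of rd_kafka_broker_thread_main(): the loop condition
     * only watches the refcount, like rd_kafka_broker_terminating(). */
    static void *worker_main(void *arg) {
            (void)arg;
            while (atomic_load(&refcnt) > 1)
                    usleep(1000);              /* the 1 ms poll() in the trace */
            return NULL;
    }

    int main(void) {
            pthread_t worker;
            pthread_create(&worker, NULL, worker_main, NULL);

            /* Teardown starts immediately after creation. */
            atomic_store(&terminate_flag, true);
            atomic_fetch_sub(&refcnt, 1);      /* one extra ref is released ... */

            /* ... but if the second extra ref is never released (the suspected
             * race), refcnt is stuck at 2, the worker loop never exits, and
             * this join blocks forever, just like the pthread_join() frames
             * in the app and internal consumer threads above. */
            pthread_join(worker, NULL);
            puts("worker joined; not reached in the buggy scenario");
            return 0;
    }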

Do you have any idea where a race condition could occur that would cause this?
And how should we proceed from here?

Thank you very much!

How to reproduce

Currently it takes us approximately 50 hours to reproduce this issue in our long-running test. It reproduces quite reliably; we have hit it in every recent long-running test.

The long-running test keeps creating rdkafka instances that consume from a Kafka cluster, destroying each instance afterwards. In some iterations the instance is destroyed immediately after creation, due to an internal calculation on our side, and we suspect this is what triggers the hang. We are trying to narrow the scope down by running this zero-duration create/destroy operation repeatedly, but unfortunately we have not managed to get a short-time reproducer yet.
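
For reference, here is a skeleton of what one iteration of the test does, written against the plain C API for brevity (our production code goes through the C++ wrapper; the broker address, group id and topic name below are placeholders rather than our real values, and error handling is trimmed):

    /* Skeleton of one test iteration; placeholders, error handling trimmed. */
    #include <librdkafka/rdkafka.h>

    static void create_consume_destroy(int consume_for_ms) {
            char errstr[512];
            rd_kafka_conf_t *conf = rd_kafka_conf_new();

            rd_kafka_conf_set(conf, "bootstrap.servers", "kafkabroker:9092",
                              errstr, sizeof(errstr));
            rd_kafka_conf_set(conf, "group.id", "longrun-test",
                              errstr, sizeof(errstr));
            rd_kafka_conf_set(conf, "queued.min.messages", "10000",
                              errstr, sizeof(errstr));

            rd_kafka_t *rk = rd_kafka_new(RD_KAFKA_CONSUMER, conf,
                                          errstr, sizeof(errstr));
            rd_kafka_poll_set_consumer(rk);

            rd_kafka_topic_partition_list_t *topics =
                    rd_kafka_topic_partition_list_new(1);
            rd_kafka_topic_partition_list_add(topics, "test-topic",
                                              RD_KAFKA_PARTITION_UA);
            rd_kafka_subscribe(rk, topics);
            rd_kafka_topic_partition_list_destroy(topics);

            /* In the suspected bad case consume_for_ms is effectively 0,
             * i.e. we tear down right after creation. */
            while (consume_for_ms > 0) {
                    rd_kafka_message_t *msg = rd_kafka_consumer_poll(rk, 100);
                    if (msg)
                            rd_kafka_message_destroy(msg);
                    consume_for_ms -= 100;
            }

            rd_kafka_consumer_close(rk);
            rd_kafka_destroy(rk);   /* this is the call that hangs */
    }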

Checklist

IMPORTANT: We will close issues where the checklist has not been completed.

Please provide the following information:

  • librdkafka version (release number or git tag): v0.11.3
  • Apache Kafka version: 1.1.0
  • librdkafka client configuration: queued.min.messages=10000
  • Operating system: RHEL 7
  • Provide logs (with debug=.. as necessary) from librdkafka
  • Provide broker log excerpts
  • Critical issue
@edenhill
Contributor

edenhill commented Jul 2, 2018

This is a great bug report!

Could you try to reproduce on v0.11.4 or preferably master? There have been quite a lot of changes since v0.11.3.

@blackeyepanda
Contributor Author

We are planning to upgrade to 0.11.4, but at the moment some SSL compatibility issues in our system prevent us from doing so. We can try that once we resolve them, but in the meantime, is there anything we can try to narrow down the problem on the current version? We have also noticed a very similar hang problem in 0.9.2 and 0.9.5.

Thank you very much

@edenhill
Contributor

edenhill commented Jul 5, 2018

If you could provide a minimal reproducible test case / program, I could try to reproduce it on the latest version.

@blackeyepanda
Contributor Author

I am trying to reproduce it with a much smaller reproducer, and we are simultaneously working on resolving the SSL conflicts in our 0.11.4 upgrade branch. We will let you know when we have updates.
