Suspicious misuse of thread lock #464
Which version (sha1) of librdkafka are you using?
Consumer (legacy) config: #auto.commit.enable=true ("auto.commit.enable" defaults to true, doesn't it?) More symptoms:
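As an editorial aside for readers unfamiliar with the legacy consumer, here is a minimal configuration sketch of the properties being discussed; "auto.commit.enable" and "offset.store.path" are topic-level librdkafka settings, while the store path and error handling below are placeholders, not taken from the issue.

#include <stdio.h>
#include <librdkafka/rdkafka.h>

/* Sketch only: build a topic conf with explicit auto-commit and a local
 * offset-file directory ("/tmp/offsets" is a placeholder path). */
static rd_kafka_topic_conf_t *make_topic_conf(void) {
        char errstr[512];
        rd_kafka_topic_conf_t *tconf = rd_kafka_topic_conf_new();

        /* Defaults to true for the legacy consumer; set explicitly for clarity. */
        if (rd_kafka_topic_conf_set(tconf, "auto.commit.enable", "true",
                                    errstr, sizeof(errstr)) != RD_KAFKA_CONF_OK)
                fprintf(stderr, "conf error: %s\n", errstr);

        /* Directory where the legacy consumer writes its local offset files. */
        if (rd_kafka_topic_conf_set(tconf, "offset.store.path", "/tmp/offsets",
                                    errstr, sizeof(errstr)) != RD_KAFKA_CONF_OK)
                fprintf(stderr, "conf error: %s\n", errstr);

        return tconf;
}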
I'm using the latest master branch. md5: sha1:
Any idea?
After I replaced the master branch (master.zip) with the v0.9.0 release, everything seems OK.
Now with librdkafka v0.9.0, it has also happened that local offset files weren't written. At the same time, some threads had the following stack:
#0 0x0000003af8a0e264 in __lll_lock_wait () from /lib64/libpthread.so.0
Are the symptoms and stacks correlated?
When I used the latest master branch, both local offset file writing and Kafka message consuming stopped. |
When both local offset file writing and Kafka message consuming stopped, all consuming-related threads had stacks like the following:
#0 0x0000003af8a0e264 in __lll_lock_wait () from /lib64/libpthread.so.0
When only local offset file writing stopped, consuming-related threads had two types of stacks:
#0 0x0000003af8a0e264 in __lll_lock_wait () from /lib64/libpthread.so.0
or
#0 0x0000003af8a0b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
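For context (editorial note, not part of the original comment): a frame in pthread_cond_timedwait is a thread idling normally on a condition variable, while __lll_lock_wait is a thread blocked inside pthread_mutex_lock. Two threads permanently stuck in __lll_lock_wait is the signature of a lock-order inversion. The following is a minimal, self-contained sketch, deliberately not librdkafka code, that reproduces the same picture in gdb.

#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

static void *thread_one(void *arg) {
        (void)arg;
        pthread_mutex_lock(&lock_a);   /* holds A ... */
        sleep(1);
        pthread_mutex_lock(&lock_b);   /* ... blocks here: gdb shows __lll_lock_wait */
        pthread_mutex_unlock(&lock_b);
        pthread_mutex_unlock(&lock_a);
        return NULL;
}

static void *thread_two(void *arg) {
        (void)arg;
        pthread_mutex_lock(&lock_b);   /* holds B ... */
        sleep(1);
        pthread_mutex_lock(&lock_a);   /* ... blocks here too: classic ABBA deadlock */
        pthread_mutex_unlock(&lock_a);
        pthread_mutex_unlock(&lock_b);
        return NULL;
}

int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, thread_one, NULL);
        pthread_create(&t2, NULL, thread_two, NULL);
        pthread_join(t1, NULL);        /* never returns; both threads are stuck */
        pthread_join(t2, NULL);
        return 0;
}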
I am hitting the same issue with 0.8.6 when using the batch queue consumer. The Kafka threads keep waiting for a lock to be released, causing a deadlock; exactly the same stack trace as posted above by microwish.
@edenhill Happy X'mas! Any news about the issue? Thanks.
Now I am spawning a separate thread for each of the partitions queried from metadata and doing batch consumes on each individually. This works fine. It's a pity, though, that the batch queue consumer doesn't work well.
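A rough sketch of that per-partition workaround follows, assuming the legacy (simple) consumer API; the partition count, batch size, and timeout are placeholders, and topic creation, error handling, and shutdown are omitted.

#include <stdlib.h>
#include <sys/types.h>
#include <pthread.h>
#include <librdkafka/rdkafka.h>

struct part_arg {
        rd_kafka_topic_t *rkt;
        int32_t partition;
};

/* One thread per partition, each doing its own rd_kafka_consume_batch()
 * instead of all partitions sharing one rd_kafka_consume_batch_queue(). */
static void *consume_partition(void *p) {
        struct part_arg *arg = p;
        rd_kafka_message_t *msgs[1000];

        rd_kafka_consume_start(arg->rkt, arg->partition, RD_KAFKA_OFFSET_STORED);
        for (;;) {
                ssize_t n = rd_kafka_consume_batch(arg->rkt, arg->partition,
                                                   1000 /* timeout_ms */,
                                                   msgs, 1000);
                for (ssize_t i = 0; i < n; i++) {
                        /* ... handle msgs[i] ... */
                        rd_kafka_message_destroy(msgs[i]);
                }
        }
        return NULL;
}

/* partition_cnt would come from rd_kafka_metadata() in a real program. */
static void start_consumers(rd_kafka_topic_t *rkt, int32_t partition_cnt) {
        for (int32_t p = 0; p < partition_cnt; p++) {
                pthread_t tid;
                struct part_arg *arg = malloc(sizeof(*arg));
                arg->rkt = rkt;
                arg->partition = p;
                pthread_create(&tid, NULL, consume_partition, arg);
        }
}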
Any plan to fix this bug?
Yep, working on it.
@edenhill
I've done a number of fixes, and all the regression tests in tests/ pass without valgrind or helgrind warnings, so could you please try to reproduce this issue again and, if it happens again, provide the full output from helgrind?
Sure.
Yes, latest master
After switching to the latest master branch (sha1: fc7bee6f65556179454b515fc05e96cde5dcf3bd, librdkafka-master.zip), the symptom is almost the same. The following is a snippet of the process stack:
Thread 51 (Thread 0x7fa751475700 (LWP 20533)):
Later I'll email you the helgrind result. Thanks.
Thanks for all the troubleshooting, I think I've nailed this issue now.
So far my program has been running for about 3 hours and everything seems okay.
Superb, let me know how it goes and we'll reopen if there is still an issue.
@edenhill I emailed you the process stack. Thanks.
I used valgrind --tool=helgrind to debug the program, and there were many errors like lock order "0x636BCC8 before 0x636BC50" violated, e.g.:
==13891== Thread #11: lock order "0x636BCC8 before 0x636BC50" violated
==13891==
==13891== Observed (incorrect) order is: acquisition of lock at 0x636BC50
==13891== at 0x4A0B069: pthread_mutex_lock (hg_intercepts.c:495)
==13891== by 0x5EF0E58: mtx_lock (tinycthread.c:135)
==13891== by 0x5EEA1DC: rd_kafka_toppar_fetch_decide (rdkafka_partition.c:1208)
==13891== by 0x5ECDE1E: rd_kafka_broker_thread_main (rdkafka_broker.c:2098)
==13891== by 0x5EF0AEE: thrd_wrapper_function (tinycthread.c:596)
==13891== by 0x4A0C0D4: mythread_wrapper (hg_intercepts.c:219)
==13891== by 0x3EDEE07A50: start_thread (in /lib64/libpthread-2.12.so)
==13891== by 0x3EDE6E893C: clone (in /lib64/libc-2.12.so)
==13891==
==13891== followed by a later acquisition of lock at 0x636BCC8
==13891== at 0x4A0B069: pthread_mutex_lock (hg_intercepts.c:495)
==13891== by 0x5EF0E58: mtx_lock (tinycthread.c:135)
==13891== by 0x5EE9D59: rd_kafka_q_len (rdkafka_queue.h:193)
==13891== by 0x5EEA45C: rd_kafka_toppar_fetch_decide (rdkafka_partition.c:1245)
==13891== by 0x5ECDE1E: rd_kafka_broker_thread_main (rdkafka_broker.c:2098)
==13891== by 0x5EF0AEE: thrd_wrapper_function (tinycthread.c:596)
==13891== by 0x4A0C0D4: mythread_wrapper (hg_intercepts.c:219)
==13891== by 0x3EDEE07A50: start_thread (in /lib64/libpthread-2.12.so)
==13891== by 0x3EDE6E893C: clone (in /lib64/libc-2.12.so)
==13891==
==13891== Required order was established by acquisition of lock at 0x636BCC8
==13891== at 0x4A0B069: pthread_mutex_lock (hg_intercepts.c:495)
==13891== by 0x5EF0E58: mtx_lock (tinycthread.c:135)
==13891== by 0x5EDB6EF: rd_kafka_q_serve_rkmessages (rdkafka_queue.c:437)
==13891== by 0x5EC592B: rd_kafka_consume_batch (rdkafka.c:1387)
==13891== by 0x4C1598C: consume_messages(rd_kafka_s*, rd_kafka_topic_s*, int, long, void (*)(rd_kafka_message_s*, void*)) (PyKafkaClient.cpp:606)
==13891== by 0x406C33: consume_to_local(void*) (kafka2hdfs.cpp:583)
==13891== by 0x4A0C0D4: mythread_wrapper (hg_intercepts.c:219)
==13891== by 0x3EDEE07A50: start_thread (in /lib64/libpthread-2.12.so)
==13891== by 0x3EDE6E893C: clone (in /lib64/libc-2.12.so)
==13891==
==13891== followed by a later acquisition of lock at 0x636BC50
==13891== at 0x4A0B069: pthread_mutex_lock (hg_intercepts.c:495)
==13891== by 0x5EF0E58: mtx_lock (tinycthread.c:135)
==13891== by 0x5EDB8F4: rd_kafka_q_serve_rkmessages (rdkafka_offset.h:48)
==13891== by 0x5EC592B: rd_kafka_consume_batch (rdkafka.c:1387)
==13891== by 0x4C1598C: consume_messages(rd_kafka_s*, rd_kafka_topic_s*, int, long, void (*)(rd_kafka_message_s*, void*)) (PyKafkaClient.cpp:606)
==13891== by 0x406C33: consume_to_local(void*) (kafka2hdfs.cpp:583)
==13891== by 0x4A0C0D4: mythread_wrapper (hg_intercepts.c:219)
==13891== by 0x3EDEE07A50: start_thread (in /lib64/libpthread-2.12.so)
==13891== by 0x3EDE6E893C: clone (in /lib64/libc-2.12.so)
Is that normal?
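Editorial note on the question above: helgrind records the order in which each pair of locks is first acquired and reports any later acquisition that reverses that order, even if no deadlock occurs on that particular run, because the inversion can deadlock under different timing. A minimal standalone program (not librdkafka code) that produces the same class of report when run under valgrind --tool=helgrind is sketched below; the two functions loosely mirror the two call chains in the report, one in the broker thread and one in the application's consume path.

#include <pthread.h>

/* Two mutexes acquired in A-then-B order on one code path and
 * B-then-A on another. */
static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

static void *broker_like_path(void *arg) {      /* establishes "A before B" */
        (void)arg;
        pthread_mutex_lock(&lock_a);
        pthread_mutex_lock(&lock_b);
        pthread_mutex_unlock(&lock_b);
        pthread_mutex_unlock(&lock_a);
        return NULL;
}

static void *consumer_like_path(void *arg) {    /* later reverses it: "B before A" */
        (void)arg;
        pthread_mutex_lock(&lock_b);
        pthread_mutex_lock(&lock_a);
        pthread_mutex_unlock(&lock_a);
        pthread_mutex_unlock(&lock_b);
        return NULL;
}

int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, broker_like_path, NULL);
        pthread_join(t1, NULL);     /* run sequentially so it never actually deadlocks, */
        pthread_create(&t2, NULL, consumer_like_path, NULL);
        pthread_join(t2, NULL);     /* yet helgrind still flags the order inversion     */
        return 0;
}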