Suspicious misuse of thread lock #464

Closed
microwish opened this issue Dec 20, 2015 · 22 comments

Comments

@microwish

I used valgrind --tool=helgrind to debug my program, and there were many errors like "lock order "0x636BCC8 before 0x636BC50" violated", e.g.:

==13891== Thread #11: lock order "0x636BCC8 before 0x636BC50" violated
==13891==
==13891== Observed (incorrect) order is: acquisition of lock at 0x636BC50
==13891== at 0x4A0B069: pthread_mutex_lock (hg_intercepts.c:495)
==13891== by 0x5EF0E58: mtx_lock (tinycthread.c:135)
==13891== by 0x5EEA1DC: rd_kafka_toppar_fetch_decide (rdkafka_partition.c:1208)
==13891== by 0x5ECDE1E: rd_kafka_broker_thread_main (rdkafka_broker.c:2098)
==13891== by 0x5EF0AEE: thrd_wrapper_function (tinycthread.c:596)
==13891== by 0x4A0C0D4: mythread_wrapper (hg_intercepts.c:219)
==13891== by 0x3EDEE07A50: start_thread (in /lib64/libpthread-2.12.so)
==13891== by 0x3EDE6E893C: clone (in /lib64/libc-2.12.so)
==13891==
==13891== followed by a later acquisition of lock at 0x636BCC8
==13891== at 0x4A0B069: pthread_mutex_lock (hg_intercepts.c:495)
==13891== by 0x5EF0E58: mtx_lock (tinycthread.c:135)
==13891== by 0x5EE9D59: rd_kafka_q_len (rdkafka_queue.h:193)
==13891== by 0x5EEA45C: rd_kafka_toppar_fetch_decide (rdkafka_partition.c:1245)
==13891== by 0x5ECDE1E: rd_kafka_broker_thread_main (rdkafka_broker.c:2098)
==13891== by 0x5EF0AEE: thrd_wrapper_function (tinycthread.c:596)
==13891== by 0x4A0C0D4: mythread_wrapper (hg_intercepts.c:219)
==13891== by 0x3EDEE07A50: start_thread (in /lib64/libpthread-2.12.so)
==13891== by 0x3EDE6E893C: clone (in /lib64/libc-2.12.so)
==13891==
==13891== Required order was established by acquisition of lock at 0x636BCC8
==13891== at 0x4A0B069: pthread_mutex_lock (hg_intercepts.c:495)
==13891== by 0x5EF0E58: mtx_lock (tinycthread.c:135)
==13891== by 0x5EDB6EF: rd_kafka_q_serve_rkmessages (rdkafka_queue.c:437)
==13891== by 0x5EC592B: rd_kafka_consume_batch (rdkafka.c:1387)
==13891== by 0x4C1598C: consume_messages(rd_kafka_s*, rd_kafka_topic_s*, int, long, void (*)(rd_kafka_message_s*, void*)) (PyKafkaClient.cpp:606)
==13891== by 0x406C33: consume_to_local(void*) (kafka2hdfs.cpp:583)
==13891== by 0x4A0C0D4: mythread_wrapper (hg_intercepts.c:219)
==13891== by 0x3EDEE07A50: start_thread (in /lib64/libpthread-2.12.so)
==13891== by 0x3EDE6E893C: clone (in /lib64/libc-2.12.so)
==13891==
==13891== followed by a later acquisition of lock at 0x636BC50
==13891== at 0x4A0B069: pthread_mutex_lock (hg_intercepts.c:495)
==13891== by 0x5EF0E58: mtx_lock (tinycthread.c:135)
==13891== by 0x5EDB8F4: rd_kafka_q_serve_rkmessages (rdkafka_offset.h:48)
==13891== by 0x5EC592B: rd_kafka_consume_batch (rdkafka.c:1387)
==13891== by 0x4C1598C: consume_messages(rd_kafka_s*, rd_kafka_topic_s*, int, long, void (*)(rd_kafka_message_s*, void*)) (PyKafkaClient.cpp:606)
==13891== by 0x406C33: consume_to_local(void*) (kafka2hdfs.cpp:583)
==13891== by 0x4A0C0D4: mythread_wrapper (hg_intercepts.c:219)
==13891== by 0x3EDEE07A50: start_thread (in /lib64/libpthread-2.12.so)
==13891== by 0x3EDE6E893C: clone (in /lib64/libc-2.12.so)

Is that normal?

@edenhill
Contributor

Which version (sha1) of librdkafka are you using?

@microwish
Author

Consumer (legacy) config:

#auto.commit.enable=true
offset.store.method=file
offset.store.path=/data/users/data-infra/kafka2hdfs/offset

"auto.commit.enable" is default true, isn't it?

More symptoms:

  1. Local offset files are sometimes written, and sometimes not written at all.
  2. Many threads stay in the following state for a long time. Does that agree with what helgrind showed?
    #0 0x0000003edee0e2e4 in __lll_lock_wait () from /lib64/libpthread.so.0
    #1 0x0000003edee09588 in _L_lock_854 () from /lib64/libpthread.so.0
    #2 0x0000003edee09457 in pthread_mutex_lock () from /lib64/libpthread.so.0
    #3 0x00007fd4926e8e59 in mtx_lock () from /usr/local/lib/librdkafka.so.1
    #4 0x00007fd4926e1d5a in rd_kafka_q_len () from /usr/local/lib/librdkafka.so.1
    #5 0x00007fd4926e245d in rd_kafka_toppar_fetch_decide () from /usr/local/lib/librdkafka.so.1
    #6 0x00007fd4926c5e1f in rd_kafka_broker_thread_main () from /usr/local/lib/librdkafka.so.1

@microwish
Author

I'm using the latest master branch.

md5:
4bcf256e36c0158fde3b1506792a8c27 librdkafka-master.zip

sha1:
652796c96e68dc0711c01d82f0d39e352ba563b9 librdkafka-master.zip

@microwish
Author

Any idea?

@microwish
Author

After I replaced the master branch (master.zip) with the v0.9.0 release, everything seems OK.

@microwish
Author

Now with librdkafka v0.9.0, it has also happened that local offset files weren't written. At the same time, some threads had the following stacks:

#0 0x0000003af8a0e264 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x0000003af8a09508 in _L_lock_854 () from /lib64/libpthread.so.0
#2 0x0000003af8a093d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00007fe5fe52e4b9 in mtx_lock () from /usr/local/lib/librdkafka.so.1
#4 0x00007fe5fe50b008 in rd_kafka_topic_scan_all () from /usr/local/lib/librdkafka.so.1
#5 0x00007fe5fe4f6a86 in rd_kafka_topic_scan_tmr_cb () from /usr/local/lib/librdkafka.so.1
#6 0x00007fe5fe50cb50 in rd_kafka_timers_run () from /usr/local/lib/librdkafka.so.1
#7 0x00007fe5fe4fc6c5 in rd_kafka_thread_main () from /usr/local/lib/librdkafka.so.1
#8 0x00007fe5fe52e407 in _thrd_wrapper_function () from /usr/local/lib/librdkafka.so.1
#9 0x0000003af8a079d1 in start_thread () from /lib64/libpthread.so.0
#10 0x0000003af7ee88fd in clone () from /lib64/libc.so.6

or

#0 0x0000003af8a0e264 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x0000003af8a09508 in _L_lock_854 () from /lib64/libpthread.so.0
#2 0x0000003af8a093d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00007fe5fe52e4b9 in mtx_lock () from /usr/local/lib/librdkafka.so.1
#4 0x00007fe5fe5150ec in rd_kafka_q_serve_rkmessages () from /usr/local/lib/librdkafka.so.1
#5 0x00007fe5fe4fbf8d in rd_kafka_consume_batch () from /usr/local/lib/librdkafka.so.1
#6 0x00007fe5ff85be9a in consume_messages(rd_kafka_s*, rd_kafka_topic_s*, int, long, void (*)(rd_kafka_message_s*, void*)) () from /data/users/data-infra/kafkaclient/cpp/libpykafkaclient.so
#7 0x000000000040aae6 in consume_to_local(void*) ()
#8 0x0000003af8a079d1 in start_thread () from /lib64/libpthread.so.0
#9 0x0000003af7ee88fd in clone () from /lib64/libc.so.6

Are the symptoms and stacks correlated?

@microwish
Author

When I used the latest master branch, both local offset file writing and Kafka message consuming stopped.
When I used v0.9.0, only local offset file writing stopped; Kafka message consuming still went on.

@microwish
Author

When both local offset file writing and Kafka message consuming stopped, all consuming related threads had stacks like the following:

#0 0x0000003af8a0e264 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x0000003af8a09508 in _L_lock_854 () from /lib64/libpthread.so.0
#2 0x0000003af8a093d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00007fe5fe52e4b9 in mtx_lock () from /usr/local/lib/librdkafka.so.1
#4 0x00007fe5fe5150ec in rd_kafka_q_serve_rkmessages () from /usr/local/lib/librdkafka.so.1
#5 0x00007fe5fe4fbf8d in rd_kafka_consume_batch () from /usr/local/lib/librdkafka.so.1

When only local offset file writing stopped, consuming related threads had two types of stacks:

#0 0x0000003af8a0e264 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x0000003af8a09508 in _L_lock_854 () from /lib64/libpthread.so.0
#2 0x0000003af8a093d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00007fc2e37c74b9 in mtx_lock () from /usr/local/lib/librdkafka.so.1
#4 0x00007fc2e37ae0ec in rd_kafka_q_serve_rkmessages () from /usr/local/lib/librdkafka.so.1
#5 0x00007fc2e3794f8d in rd_kafka_consume_batch () from /usr/local/lib/librdkafka.so.1

or

#0 0x0000003af8a0b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007fc2e37c768e in cnd_timedwait_ms () from /usr/local/lib/librdkafka.so.1
#2 0x00007fc2e37adbb4 in rd_kafka_q_serve () from /usr/local/lib/librdkafka.so.1

@AppliedSoul

I am hitting the same issue with 0.8.6 when using the batch consume queue. The Kafka threads keep waiting on a lock to be released, causing a deadlock: exactly the same stack trace as microwish posted above.
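
To be concrete, by "consume batch queue" I mean roughly the pattern below: all partitions routed into one shared queue that is drained with rd_kafka_consume_batch_queue(). This is only a simplified sketch, not my production code; the function name, batch size, and timeout are placeholders:

#include <sys/types.h>
#include <librdkafka/rdkafka.h>

#define BATCH_SIZE 1000

static void consume_via_queue(rd_kafka_t *rk, rd_kafka_topic_t *rkt,
                              int32_t num_partitions) {
        rd_kafka_queue_t *rkqu = rd_kafka_queue_new(rk);
        rd_kafka_message_t *msgs[BATCH_SIZE];
        ssize_t cnt;

        /* Route every partition into the single shared queue. */
        for (int32_t p = 0; p < num_partitions; p++)
                rd_kafka_consume_start_queue(rkt, p, RD_KAFKA_OFFSET_STORED, rkqu);

        /* One loop serves all partitions; this call goes through
         * rd_kafka_q_serve_rkmessages(), which is where the threads
         * block on the queue lock in the stack traces above. */
        while ((cnt = rd_kafka_consume_batch_queue(rkqu, 1000,
                                                   msgs, BATCH_SIZE)) >= 0) {
                for (ssize_t i = 0; i < cnt; i++)
                        rd_kafka_message_destroy(msgs[i]);
        }

        for (int32_t p = 0; p < num_partitions; p++)
                rd_kafka_consume_stop(rkt, p);
        rd_kafka_queue_destroy(rkqu);
}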

@microwish
Author

@edenhill Happy X'mas! Any news about the issue? Thanks.

@AppliedSoul

Now I am spawning a separate thread for each partition queried from the metadata and running a batch consumer for each of them individually, as sketched below. This works fine. It's a pity, though, that the batch consume queue doesn't work well.
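
The per-partition workaround looks like this (again a simplified sketch under the same assumptions as above; in my setup each of these calls runs in its own thread, one per partition returned by the metadata request):

#include <sys/types.h>
#include <librdkafka/rdkafka.h>

#define BATCH_SIZE 1000

/* One of these runs per partition, each in its own thread. */
static void consume_partition(rd_kafka_topic_t *rkt, int32_t partition) {
        rd_kafka_message_t *msgs[BATCH_SIZE];
        ssize_t cnt;

        rd_kafka_consume_start(rkt, partition, RD_KAFKA_OFFSET_STORED);

        /* cnt == 0 means the timeout expired with no messages; -1 is an error. */
        while ((cnt = rd_kafka_consume_batch(rkt, partition, 1000,
                                             msgs, BATCH_SIZE)) >= 0) {
                for (ssize_t i = 0; i < cnt; i++)
                        rd_kafka_message_destroy(msgs[i]);
        }

        rd_kafka_consume_stop(rkt, partition);
}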

@edenhill added the bug label Dec 26, 2015
@microwish
Author

Any plan to fix this bug?

@edenhill
Contributor

Yep, working on it.

@microwish
Author

@edenhill
Hi Magnus,
Is everything going well?
Thanks.

@edenhill
Contributor

I've done a number of fixes, and all the regression tests in tests/ now pass without valgrind or helgrind warnings. Could you please try to reproduce this issue again, and if it happens again, provide the full output from helgrind?
Thanks.

@microwish
Author

Sure.
Should I use the latest master branch?

@edenhill
Contributor

Yes, latest master

@microwish
Author

After switching to the latest master branch (sha1: fc7bee6f65556179454b515fc05e96cde5dcf3bd librdkafka-master.zip), the symptom is almost the same.

The following are snippets of the process stack:

Thread 51 (Thread 0x7fa751475700 (LWP 20533)):
#0 0x0000003edee0e2e4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x0000003edee09588 in _L_lock_854 () from /lib64/libpthread.so.0
#2 0x0000003edee09457 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00007fa7525047d9 in mtx_lock () from /usr/local/lib/librdkafka.so.1
#4 0x00007fa7524e6125 in rd_kafka_topic_scan_all () from /usr/local/lib/librdkafka.so.1
#5 0x00007fa7524d45d7 in rd_kafka_topic_scan_tmr_cb () from /usr/local/lib/librdkafka.so.1
#6 0x00007fa7524e8dd9 in rd_kafka_timers_run () from /usr/local/lib/librdkafka.so.1
#7 0x00007fa7524da825 in rd_kafka_thread_main () from /usr/local/lib/librdkafka.so.1
#8 0x00007fa75250446f in _thrd_wrapper_function () from /usr/local/lib/librdkafka.so.1
#9 0x0000003edee07a51 in start_thread () from /lib64/libpthread.so.0
#10 0x0000003ede6e893d in clone () from /lib64/libc.so.6
Thread 50 (Thread 0x7fa750074700 (LWP 20534)):
#0 0x0000003edee0ba0e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007fa752504677 in cnd_timedwait_ms () from /usr/local/lib/librdkafka.so.1
#2 0x00007fa7524ef825 in rd_kafka_q_pop () from /usr/local/lib/librdkafka.so.1
#3 0x00007fa7524e01bd in rd_kafka_broker_serve () from /usr/local/lib/librdkafka.so.1
#4 0x00007fa7524e0c29 in rd_kafka_broker_ua_idle () from /usr/local/lib/librdkafka.so.1
#5 0x00007fa7524e3a90 in rd_kafka_broker_thread_main () from /usr/local/lib/librdkafka.so.1
#6 0x00007fa75250446f in _thrd_wrapper_function () from /usr/local/lib/librdkafka.so.1
#7 0x0000003edee07a51 in start_thread () from /lib64/libpthread.so.0
#8 0x0000003ede6e893d in clone () from /lib64/libc.so.6
Thread 49 (Thread 0x7fa74ec73700 (LWP 20535)):
#0 0x0000003edee0e2e4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x0000003edee09588 in _L_lock_854 () from /lib64/libpthread.so.0
#2 0x0000003edee09457 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00007fa7525047d9 in mtx_lock () from /usr/local/lib/librdkafka.so.1
#4 0x00007fa7524fd8da in rd_kafka_q_len () from /usr/local/lib/librdkafka.so.1
#5 0x00007fa7524fdfe8 in rd_kafka_toppar_fetch_decide () from /usr/local/lib/librdkafka.so.1
#6 0x00007fa7524e3c0f in rd_kafka_broker_thread_main () from /usr/local/lib/librdkafka.so.1
#7 0x00007fa75250446f in _thrd_wrapper_function () from /usr/local/lib/librdkafka.so.1
#8 0x0000003edee07a51 in start_thread () from /lib64/libpthread.so.0
#9 0x0000003ede6e893d in clone () from /lib64/libc.so.6
Thread 48 (Thread 0x7fa74d872700 (LWP 20536)):
#0 0x0000003edee0e2e4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x0000003edee09588 in _L_lock_854 () from /lib64/libpthread.so.0
#2 0x0000003edee09457 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00007fa7525047d9 in mtx_lock () from /usr/local/lib/librdkafka.so.1
#4 0x00007fa7524fd8da in rd_kafka_q_len () from /usr/local/lib/librdkafka.so.1
#5 0x00007fa7524fdfe8 in rd_kafka_toppar_fetch_decide () from /usr/local/lib/librdkafka.so.1
#6 0x00007fa7524e3c0f in rd_kafka_broker_thread_main () from /usr/local/lib/librdkafka.so.1
#7 0x00007fa75250446f in _thrd_wrapper_function () from /usr/local/lib/librdkafka.so.1
#8 0x0000003edee07a51 in start_thread () from /lib64/libpthread.so.0
#9 0x0000003ede6e893d in clone () from /lib64/libc.so.6

Later I'll be emailing you the helgrind result.

Thanks.

@edenhill
Contributor

Thanks for all the troubleshooting, I think I've nailed this issue now.
Please update to the latest master and verify on your end.

@microwish
Author

So far my program has been running for about 3 hours and everything seems okay.
I'll be keeping an eye on it.
Thanks a lot.

@edenhill
Contributor

Superb, let me know how it goes and we'll reopen if there is still an issue

@microwish
Author

@edenhill
The problem still seems to exist.
Around 5 minutes after the process consuming from the Kafka brokers had started, the consuming threads failed to consume any more messages.
rd_kafka_consume_batch kept returning 0.

I emailed you the process stack.

Thanks.
