Suspicious misuse of thread lock #464

Closed
microwish opened this issue Dec 20, 2015 · 22 comments

Comments

@microwish

I used valgrind --tool=helgrind to debug my program, and there were many errors like "lock order "0x636BCC8 before 0x636BC50" violated", e.g.:

==13891== Thread #11: lock order "0x636BCC8 before 0x636BC50" violated
==13891==
==13891== Observed (incorrect) order is: acquisition of lock at 0x636BC50
==13891== at 0x4A0B069: pthread_mutex_lock (hg_intercepts.c:495)
==13891== by 0x5EF0E58: mtx_lock (tinycthread.c:135)
==13891== by 0x5EEA1DC: rd_kafka_toppar_fetch_decide (rdkafka_partition.c:1208)
==13891== by 0x5ECDE1E: rd_kafka_broker_thread_main (rdkafka_broker.c:2098)
==13891== by 0x5EF0AEE: thrd_wrapper_function (tinycthread.c:596)
==13891== by 0x4A0C0D4: mythread_wrapper (hg_intercepts.c:219)
==13891== by 0x3EDEE07A50: start_thread (in /lib64/libpthread-2.12.so)
==13891== by 0x3EDE6E893C: clone (in /lib64/libc-2.12.so)
==13891==
==13891== followed by a later acquisition of lock at 0x636BCC8
==13891== at 0x4A0B069: pthread_mutex_lock (hg_intercepts.c:495)
==13891== by 0x5EF0E58: mtx_lock (tinycthread.c:135)
==13891== by 0x5EE9D59: rd_kafka_q_len (rdkafka_queue.h:193)
==13891== by 0x5EEA45C: rd_kafka_toppar_fetch_decide (rdkafka_partition.c:1245)
==13891== by 0x5ECDE1E: rd_kafka_broker_thread_main (rdkafka_broker.c:2098)
==13891== by 0x5EF0AEE: thrd_wrapper_function (tinycthread.c:596)
==13891== by 0x4A0C0D4: mythread_wrapper (hg_intercepts.c:219)
==13891== by 0x3EDEE07A50: start_thread (in /lib64/libpthread-2.12.so)
==13891== by 0x3EDE6E893C: clone (in /lib64/libc-2.12.so)
==13891==
==13891== Required order was established by acquisition of lock at 0x636BCC8
==13891== at 0x4A0B069: pthread_mutex_lock (hg_intercepts.c:495)
==13891== by 0x5EF0E58: mtx_lock (tinycthread.c:135)
==13891== by 0x5EDB6EF: rd_kafka_q_serve_rkmessages (rdkafka_queue.c:437)
==13891== by 0x5EC592B: rd_kafka_consume_batch (rdkafka.c:1387)
==13891== by 0x4C1598C: consume_messages(rd_kafka_s*, rd_kafka_topic_s*, int, long, void (*)(rd_kafka_message_s*, void*)) (PyKafkaClient.cpp:606)
==13891== by 0x406C33: consume_to_local(void*) (kafka2hdfs.cpp:583)
==13891== by 0x4A0C0D4: mythread_wrapper (hg_intercepts.c:219)
==13891== by 0x3EDEE07A50: start_thread (in /lib64/libpthread-2.12.so)
==13891== by 0x3EDE6E893C: clone (in /lib64/libc-2.12.so)
==13891==
==13891== followed by a later acquisition of lock at 0x636BC50
==13891== at 0x4A0B069: pthread_mutex_lock (hg_intercepts.c:495)
==13891== by 0x5EF0E58: mtx_lock (tinycthread.c:135)
==13891== by 0x5EDB8F4: rd_kafka_q_serve_rkmessages (rdkafka_offset.h:48)
==13891== by 0x5EC592B: rd_kafka_consume_batch (rdkafka.c:1387)
==13891== by 0x4C1598C: consume_messages(rd_kafka_s*, rd_kafka_topic_s*, int, long, void (*)(rd_kafka_message_s*, void*)) (PyKafkaClient.cpp:606)
==13891== by 0x406C33: consume_to_local(void*) (kafka2hdfs.cpp:583)
==13891== by 0x4A0C0D4: mythread_wrapper (hg_intercepts.c:219)
==13891== by 0x3EDEE07A50: start_thread (in /lib64/libpthread-2.12.so)
==13891== by 0x3EDE6E893C: clone (in /lib64/libc-2.12.so)

Is that normal?

@edenhill
Contributor

Which version (sha1) of librdkafka are you using?

@microwish
Author

Consumer (legacy) config:

#auto.commit.enable=true
offset.store.method=file
offset.store.path=/data/users/data-infra/kafka2hdfs/offset

"auto.commit.enable" is default true, isn't it?

More symptoms:

  1. Local offset files are sometimes written, and sometimes not written at all.
  2. Many threads stay in the following state for a long time. Does that agree with what helgrind showed?
    #0 0x0000003edee0e2e4 in __lll_lock_wait () from /lib64/libpthread.so.0
    #1 0x0000003edee09588 in _L_lock_854 () from /lib64/libpthread.so.0
    #2 0x0000003edee09457 in pthread_mutex_lock () from /lib64/libpthread.so.0
    #3 0x00007fd4926e8e59 in mtx_lock () from /usr/local/lib/librdkafka.so.1
    #4 0x00007fd4926e1d5a in rd_kafka_q_len () from /usr/local/lib/librdkafka.so.1
    #5 0x00007fd4926e245d in rd_kafka_toppar_fetch_decide () from /usr/local/lib/librdkafka.so.1
    #6 0x00007fd4926c5e1f in rd_kafka_broker_thread_main () from /usr/local/lib/librdkafka.so.1

@microwish
Author

I'm using the latest master branch.

md5:
4bcf256e36c0158fde3b1506792a8c27 librdkafka-master.zip

sha1:
652796c96e68dc0711c01d82f0d39e352ba563b9 librdkafka-master.zip

@microwish
Author

Any idea?

@microwish
Author

After I replaced the master branch (master.zip) with the v0.9.0 release, everything seems OK.

@microwish
Author

Now with librdkafka v0.9.0, it has also happened that local offset files weren't written. At the same time, some threads had the following stacks:

#0 0x0000003af8a0e264 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x0000003af8a09508 in _L_lock_854 () from /lib64/libpthread.so.0
#2 0x0000003af8a093d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00007fe5fe52e4b9 in mtx_lock () from /usr/local/lib/librdkafka.so.1
#4 0x00007fe5fe50b008 in rd_kafka_topic_scan_all () from /usr/local/lib/librdkafka.so.1
#5 0x00007fe5fe4f6a86 in rd_kafka_topic_scan_tmr_cb () from /usr/local/lib/librdkafka.so.1
#6 0x00007fe5fe50cb50 in rd_kafka_timers_run () from /usr/local/lib/librdkafka.so.1
#7 0x00007fe5fe4fc6c5 in rd_kafka_thread_main () from /usr/local/lib/librdkafka.so.1
#8 0x00007fe5fe52e407 in _thrd_wrapper_function () from /usr/local/lib/librdkafka.so.1
#9 0x0000003af8a079d1 in start_thread () from /lib64/libpthread.so.0
#10 0x0000003af7ee88fd in clone () from /lib64/libc.so.6

or

#0 0x0000003af8a0e264 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x0000003af8a09508 in _L_lock_854 () from /lib64/libpthread.so.0
#2 0x0000003af8a093d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00007fe5fe52e4b9 in mtx_lock () from /usr/local/lib/librdkafka.so.1
#4 0x00007fe5fe5150ec in rd_kafka_q_serve_rkmessages () from /usr/local/lib/librdkafka.so.1
#5 0x00007fe5fe4fbf8d in rd_kafka_consume_batch () from /usr/local/lib/librdkafka.so.1
#6 0x00007fe5ff85be9a in consume_messages(rd_kafka_s*, rd_kafka_topic_s*, int, long, void (*)(rd_kafka_message_s*, void*)) () from /data/users/data-infra/kafkaclient/cpp/libpykafkaclient.so
#7 0x000000000040aae6 in consume_to_local(void*) ()
#8 0x0000003af8a079d1 in start_thread () from /lib64/libpthread.so.0
#9 0x0000003af7ee88fd in clone () from /lib64/libc.so.6

Are the symptoms and stacks correlated?

@microwish
Author

When I used the latest master branch, both local offset file writing and Kafka message consuming stopped.
When I used v0.9.0, only local offset file writing stopped; Kafka message consuming still went on.

@microwish
Author

When both local offset file writing and Kafka message consuming stopped, all consuming related threads had stacks like the following:

#0 0x0000003af8a0e264 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x0000003af8a09508 in _L_lock_854 () from /lib64/libpthread.so.0
#2 0x0000003af8a093d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00007fe5fe52e4b9 in mtx_lock () from /usr/local/lib/librdkafka.so.1
#4 0x00007fe5fe5150ec in rd_kafka_q_serve_rkmessages () from /usr/local/lib/librdkafka.so.1
#5 0x00007fe5fe4fbf8d in rd_kafka_consume_batch () from /usr/local/lib/librdkafka.so.1

When only local offset file writing stopped, consuming related threads had two types of stacks:

#0 0x0000003af8a0e264 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x0000003af8a09508 in _L_lock_854 () from /lib64/libpthread.so.0
#2 0x0000003af8a093d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00007fc2e37c74b9 in mtx_lock () from /usr/local/lib/librdkafka.so.1
#4 0x00007fc2e37ae0ec in rd_kafka_q_serve_rkmessages () from /usr/local/lib/librdkafka.so.1
#5 0x00007fc2e3794f8d in rd_kafka_consume_batch () from /usr/local/lib/librdkafka.so.1

or

#0 0x0000003af8a0b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007fc2e37c768e in cnd_timedwait_ms () from /usr/local/lib/librdkafka.so.1
#2 0x00007fc2e37adbb4 in rd_kafka_q_serve () from /usr/local/lib/librdkafka.so.1

@AppliedSoul

I am hitting the same issue with 0.8.6 when using the batch consume queue. The Kafka threads keep waiting on a lock to be released, causing a deadlock: exactly the same stack trace as microwish posted above.
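
To be concrete, by "consume batch queue" I mean roughly the pattern below: all partitions routed into one shared queue that is drained with rd_kafka_consume_batch_queue(). This is only a simplified sketch, not my production code; the function name, batch size, and timeout are placeholders:

#include <sys/types.h>
#include <librdkafka/rdkafka.h>

#define BATCH_SIZE 1000

static void consume_via_queue(rd_kafka_t *rk, rd_kafka_topic_t *rkt,
                              int32_t num_partitions) {
        rd_kafka_queue_t *rkqu = rd_kafka_queue_new(rk);
        rd_kafka_message_t *msgs[BATCH_SIZE];
        ssize_t cnt;

        /* Route every partition into the single shared queue. */
        for (int32_t p = 0; p < num_partitions; p++)
                rd_kafka_consume_start_queue(rkt, p, RD_KAFKA_OFFSET_STORED, rkqu);

        /* One loop serves all partitions; this call goes through
         * rd_kafka_q_serve_rkmessages(), which is where the threads
         * block on the queue lock in the stack traces above. */
        while ((cnt = rd_kafka_consume_batch_queue(rkqu, 1000,
                                                   msgs, BATCH_SIZE)) >= 0) {
                for (ssize_t i = 0; i < cnt; i++)
                        rd_kafka_message_destroy(msgs[i]);
        }

        for (int32_t p = 0; p < num_partitions; p++)
                rd_kafka_consume_stop(rkt, p);
        rd_kafka_queue_destroy(rkqu);
}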

@microwish
Author

@edenhill Happy X'mas! Any news about the issue? Thanks.

@AppliedSoul

Now I am spawning a separate thread for each partition queried from the metadata and running a batch consumer for each of them individually, as sketched below. This works fine. It's a pity, though, that the batch consume queue doesn't work well.
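
The per-partition workaround looks like this (again a simplified sketch under the same assumptions as above; in my setup each of these calls runs in its own thread, one per partition returned by the metadata request):

#include <sys/types.h>
#include <librdkafka/rdkafka.h>

#define BATCH_SIZE 1000

/* One of these runs per partition, each in its own thread. */
static void consume_partition(rd_kafka_topic_t *rkt, int32_t partition) {
        rd_kafka_message_t *msgs[BATCH_SIZE];
        ssize_t cnt;

        rd_kafka_consume_start(rkt, partition, RD_KAFKA_OFFSET_STORED);

        /* cnt == 0 means the timeout expired with no messages; -1 is an error. */
        while ((cnt = rd_kafka_consume_batch(rkt, partition, 1000,
                                             msgs, BATCH_SIZE)) >= 0) {
                for (ssize_t i = 0; i < cnt; i++)
                        rd_kafka_message_destroy(msgs[i]);
        }

        rd_kafka_consume_stop(rkt, partition);
}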

@edenhill added the bug label Dec 26, 2015
@microwish
Author

Any plan to fix this bug?

@edenhill
Contributor

Yep, working on it.

@microwish
Author

@edenhill
Hi Magnus,
Is everything going well?
Thanks.

@edenhill
Contributor

I've done a number of fixes, and all the regression tests in tests/ now pass without valgrind or helgrind warnings. Could you please try to reproduce this issue again, and if it happens again, provide the full output from helgrind?
Thanks.

@microwish
Author

Sure.
Should I use the latest master branch?

@edenhill
Contributor

Yes, latest master

@microwish
Author

After switching to the latest master branch (sha1: fc7bee6f65556179454b515fc05e96cde5dcf3bd librdkafka-master.zip), the symptom is almost the same.

The following are snippets of the process stack:

Thread 51 (Thread 0x7fa751475700 (LWP 20533)):
#0 0x0000003edee0e2e4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x0000003edee09588 in _L_lock_854 () from /lib64/libpthread.so.0
#2 0x0000003edee09457 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00007fa7525047d9 in mtx_lock () from /usr/local/lib/librdkafka.so.1
#4 0x00007fa7524e6125 in rd_kafka_topic_scan_all () from /usr/local/lib/librdkafka.so.1
#5 0x00007fa7524d45d7 in rd_kafka_topic_scan_tmr_cb () from /usr/local/lib/librdkafka.so.1
#6 0x00007fa7524e8dd9 in rd_kafka_timers_run () from /usr/local/lib/librdkafka.so.1
#7 0x00007fa7524da825 in rd_kafka_thread_main () from /usr/local/lib/librdkafka.so.1
#8 0x00007fa75250446f in _thrd_wrapper_function () from /usr/local/lib/librdkafka.so.1
#9 0x0000003edee07a51 in start_thread () from /lib64/libpthread.so.0
#10 0x0000003ede6e893d in clone () from /lib64/libc.so.6
Thread 50 (Thread 0x7fa750074700 (LWP 20534)):
#0 0x0000003edee0ba0e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007fa752504677 in cnd_timedwait_ms () from /usr/local/lib/librdkafka.so.1
#2 0x00007fa7524ef825 in rd_kafka_q_pop () from /usr/local/lib/librdkafka.so.1
#3 0x00007fa7524e01bd in rd_kafka_broker_serve () from /usr/local/lib/librdkafka.so.1
#4 0x00007fa7524e0c29 in rd_kafka_broker_ua_idle () from /usr/local/lib/librdkafka.so.1
#5 0x00007fa7524e3a90 in rd_kafka_broker_thread_main () from /usr/local/lib/librdkafka.so.1
#6 0x00007fa75250446f in _thrd_wrapper_function () from /usr/local/lib/librdkafka.so.1
#7 0x0000003edee07a51 in start_thread () from /lib64/libpthread.so.0
#8 0x0000003ede6e893d in clone () from /lib64/libc.so.6
Thread 49 (Thread 0x7fa74ec73700 (LWP 20535)):
#0 0x0000003edee0e2e4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x0000003edee09588 in _L_lock_854 () from /lib64/libpthread.so.0
#2 0x0000003edee09457 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00007fa7525047d9 in mtx_lock () from /usr/local/lib/librdkafka.so.1
#4 0x00007fa7524fd8da in rd_kafka_q_len () from /usr/local/lib/librdkafka.so.1
#5 0x00007fa7524fdfe8 in rd_kafka_toppar_fetch_decide () from /usr/local/lib/librdkafka.so.1
#6 0x00007fa7524e3c0f in rd_kafka_broker_thread_main () from /usr/local/lib/librdkafka.so.1
#7 0x00007fa75250446f in _thrd_wrapper_function () from /usr/local/lib/librdkafka.so.1
#8 0x0000003edee07a51 in start_thread () from /lib64/libpthread.so.0
#9 0x0000003ede6e893d in clone () from /lib64/libc.so.6
Thread 48 (Thread 0x7fa74d872700 (LWP 20536)):
#0 0x0000003edee0e2e4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x0000003edee09588 in _L_lock_854 () from /lib64/libpthread.so.0
#2 0x0000003edee09457 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00007fa7525047d9 in mtx_lock () from /usr/local/lib/librdkafka.so.1
#4 0x00007fa7524fd8da in rd_kafka_q_len () from /usr/local/lib/librdkafka.so.1
#5 0x00007fa7524fdfe8 in rd_kafka_toppar_fetch_decide () from /usr/local/lib/librdkafka.so.1
#6 0x00007fa7524e3c0f in rd_kafka_broker_thread_main () from /usr/local/lib/librdkafka.so.1
#7 0x00007fa75250446f in _thrd_wrapper_function () from /usr/local/lib/librdkafka.so.1
#8 0x0000003edee07a51 in start_thread () from /lib64/libpthread.so.0
#9 0x0000003ede6e893d in clone () from /lib64/libc.so.6

Later I'll be emailing you the helgrind result.

Thanks.

@edenhill
Contributor

Thanks for all the troubleshooting, I think I've nailed this issue now.
Please update to the latest master and verify on your end.

@microwish
Author

So far my program has been running for about 3 hours and everything seems okay.
I'll be keeping an eye on it.
Thanks a lot.

@edenhill
Contributor

Superb, let me know how it goes and we'll reopen if there is still an issue

@microwish
Author

@edenhill
The problem still seems to exist.
Around 5 minutes after the process consuming from the Kafka brokers had started, the consuming threads failed to consume any more messages.
rd_kafka_consume_batch kept returning 0.

I emailed you the process stack.

Thanks.
