Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash on exit #72

Closed
winbatch opened this issue Feb 5, 2014 · 28 comments
Closed

Crash on exit #72

winbatch opened this issue Feb 5, 2014 · 28 comments
Labels

Comments

@winbatch
Copy link

winbatch commented Feb 5, 2014

I'm trying to exit cleanly, and periodically a get a core dump that looks like this:
#0 0x00007f4f51cea8d8 in _fini () from /lib64/libresolv.so.2
#1 0x00000039fa60eb6c in _dl_fini () from /lib64/ld-linux-x86-64.so.2
#2 0x00000039fae35d92 in exit () from /lib64/libc.so.6

(more stuff below)

This is my shutdown code. Is it valid? should I be doing something else to wait for kafka to exit cleanly? (note PluginReturn is just an enum)

PluginReturn LaraTargetShutdownFunc()
{
CommonConfig::rLog( LWARNING, "[%s] Shutting down kafka components\n", cfg->section.c_str());
rd_kafka_topic_destroy (rkt);
rd_kafka_destroy (rk);
int ret=0;
while ( 1 )
{
CommonConfig::rLog( LWARNING, "[%s] Waiting for kafka components to shutdown...\n", cfg->section.c_str());
ret = rd_kafka_wait_destroyed (1000);
if ( ret == 0 )
break;
}
CommonConfig::rLog( LWARNING, "[%s] Shutdown of kafka components complete\n", cfg->section.c_str());
return PLUGIN_SUCCESS;
}

@winbatch
Copy link
Author

winbatch commented Feb 5, 2014

And there are cases where it is stuck over and over in the 'Waiting' loop....

@edenhill
Copy link
Contributor

edenhill commented Feb 6, 2014

I fixed a topic refcnt problem on a private branch a couple of days ago. This is probably what you are seeing.
Will push it soon.

@edenhill edenhill added the bug label Feb 6, 2014
edenhill added a commit that referenced this issue Feb 6, 2014
@winbatch
Copy link
Author

winbatch commented Feb 6, 2014

That's be great. Is my logic ok?

On Wednesday, February 5, 2014, Magnus Edenhill notifications@github.com
wrote:

I fixed a topic refcnt problem on a private branch a couple of days ago.
This is probably what you are seeing.
Will push it soon.

Reply to this email directly or view it on GitHubhttps://github.com//issues/72#issuecomment-34286662
.

@edenhill
Copy link
Contributor

edenhill commented Feb 6, 2014

Almost!

For a producer you will also want to check that all messages have been sent (unless you control this yourself by the dr_cb):

 /* Shutdown */

 /* Wait for messages to be sent */
 while (rd_kafka_outq_len(rk) > 0)
   rd_kafka_poll(rk, 1000);

rd_kafka_topic_destroy(rkt);
rd_kafka_destroy(rkt);

/* And if you want a clean (all memory freed) shutdown, also wait for the
  * background destroy to finish */

rd_kafka_wait_destroyed(5000);

@edenhill edenhill closed this as completed Feb 6, 2014
@winbatch
Copy link
Author

winbatch commented Feb 6, 2014

I'm already doing the dr_db bit. My question was more around the loop. If I want to guarantee that everything shuts down cleanly, should I do a loop or just put a very high number in the wait_destroyed() function?

@edenhill
Copy link
Contributor

edenhill commented Feb 6, 2014

The only reason for calling wait_destroyed() repeatedly is if your application wants to do something, otherwise just put a larger number for the wait_destroyed().
Having said that, the destroy should be fairly instant, the only reason for a slow shutdown I can think of is if a broker is completely unreachable so the connect(2) call has to time out, but that should be fairly rare.

@winbatch
Copy link
Author

winbatch commented Feb 6, 2014

There were cases where it was in a constant loop, even at 1000 ms
intervals. Might be (hopefully) related to the refcount thing?

On Thu, Feb 6, 2014 at 6:50 AM, Magnus Edenhill notifications@github.comwrote:

The only reason for calling wait_destroyed() repeatedly is if your
application wants to do something, otherwise just put a larger number for
the wait_destroyed().
Having said that, the destroy should be fairly instant, the only reason
for a slow shutdown I can think of is if a broker is completely unreachable
so the connect(2) call has to time out, but that should be fairly rare.

Reply to this email directly or view it on GitHubhttps://github.com//issues/72#issuecomment-34316003
.

@edenhill
Copy link
Contributor

edenhill commented Feb 6, 2014

Yep, as the topic refcnt was not properly decreased in all situations it
would lead to rdkafka never terminating completely.
Should be fixed.

2014-02-06 winbatch notifications@github.com:

There were cases where it was in a constant loop, even at 1000 ms
intervals. Might be (hopefully) related to the refcount thing?

On Thu, Feb 6, 2014 at 6:50 AM, Magnus Edenhill <notifications@github.com

wrote:

The only reason for calling wait_destroyed() repeatedly is if your
application wants to do something, otherwise just put a larger number for
the wait_destroyed().
Having said that, the destroy should be fairly instant, the only reason
for a slow shutdown I can think of is if a broker is completely
unreachable
so the connect(2) call has to time out, but that should be fairly rare.

Reply to this email directly or view it on GitHub<
https://github.com/edenhill/librdkafka/issues/72#issuecomment-34316003>

.

Reply to this email directly or view it on GitHubhttps://github.com//issues/72#issuecomment-34316167
.

@winbatch
Copy link
Author

Looks like there are still cases where it won't shut down regardless of wait time

@winbatch
Copy link
Author

And cases where it still crashes. replicator: rdkafka_topic.c:323: rd_kafka_topic_destroy0: Assertion `rkt->rkt_refcnt == 0' failed.

Program terminated with signal 11, Segmentation fault.

#0 0x00000039fae7788a in _int_free () from /lib64/libc.so.6

Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.47.el6_2.9.x86_64 libgcc-4.4.6-3.el6.x86_64 libstdc++-4.4.6-3.el6.x86_64 zlib-1.2.3-27.el6.x86_64

(gdb) where

#0 0x00000039fae7788a in _int_free () from /lib64/libc.so.6

#1 0x00007fa3974813a6 in rd_kafka_destroy0 (rk=0x39fb18be80) at rdkafka.c:570

#2 0x00007fa39748beff in rd_kafka_topic_destroy0 (rkt=0x26e3ba0) at rdkafka_topic.c:326

#3 0x00007fa39748c0ee in ?? () at rdkafka_msg.h:100 from ../lib/liblarakafka.so

#4 0x00007fa388004000 in ?? ()

#5 0x00007fa3974884c5 in rd_kafka_broker_fail (rkb=0x7fa388002ec0, err=RD_KAFKA_RESP_ERR__DESTROY, fmt=0x0) at rdkafka_broker.c:386

#6 0x00007fa39748a79b in rd_kafka_broker_thread_main (arg=0x7fa388002ec0) at rdkafka_broker.c:3278

#7 0x00000039fb6077f1 in start_thread () from /lib64/libpthread.so.0

#8 0x00000039faee5ccd in clone () from /lib64/libc.so.6

@edenhill
Copy link
Contributor

Can you call rd_kafka_dump(rk); prior to destroying your topic(s) and the main handle=

And can you provide an example program that reproduces this? (e.g, patch on rdkafka_example)

@winbatch
Copy link
Author

So this is interesting. There are certain processes when killed, they
create the dump file (I open the FILE handle before giving it to the dump
program), but the file is empty. It's stuck in some sort of tight loop:

strace shows this:

futex(0xcdc2cc, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 30599,
{1392089159, 435281000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)

futex(0xcdc2a0, FUTEX_WAKE_PRIVATE, 1) = 0

futex(0xcdc2cc, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 30601,
{1392089159, 445411000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)

futex(0xcdc2a0, FUTEX_WAKE_PRIVATE, 1) = 0

gcore reveals:
Core was generated by `./replicator -c ../devconfig/superkafka.ini'.
#0 0x000000345700dff4 in __lll_lock_wait () from /lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.47.el6_2.9.x86_64 libgcc-4.4.6-3.el6.x86_64 libstdc++-4.4.6-3.el6.x86_64 zlib-1.2.3-27.el6.x86_64
(gdb) where
#0 0x000000345700dff4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x0000003457009328 in _L_lock_854 () from /lib64/libpthread.so.0
#2 0x00000034570091f7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00007f181cfe533a in rd_kafka_topic_scan_all (rk=0xcdc2a0, now=16795998647784) at rdkafka_topic.c:931
#4 0x00007f181cfdb116 in rd_kafka_topic_scan_tmr_cb (rk=0xcdc2a0, arg=) at rdkafka.c:816
#5 0x00007f181cfe7ab5 in rd_kafka_timers_run (rk=0xcdc2a0, timeout=) at rdkafka_timer.c:155
#6 0x00007f181cfda8bd in rd_kafka_thread_main (arg=0xcdc2a0) at rdkafka.c:840
#7 0x00000034570077f1 in start_thread () from /lib64/libpthread.so.0
#8 0x00000034568e5ccd in clone () from /lib64/libc.so.6

On Mon, Feb 10, 2014 at 9:01 PM, Magnus Edenhill
notifications@github.comwrote:

Can you call rd_kafka_dump(rk); prior to destroying your topic(s) and the
main handle=

And can you provide an example program that reproduces this? (e.g, patch
on rdkafka_example)

Reply to this email directly or view it on GitHubhttps://github.com//issues/72#issuecomment-34719780
.

@winbatch
Copy link
Author

Another core inside dump:
Core was generated by `./replicator -c ../devconfig/superkafka.ini'.
Program terminated with signal 11, Segmentation fault.
#0 0x00000034570091c0 in pthread_mutex_lock () from /lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.47.el6_2.9.x86_64 libgcc-4.4.6-3.el6.x86_64 libstdc++-4.4.6-3.el6.x86_64 zlib-1.2.3-27.el6.x86_64
(gdb) where
#0 0x00000034570091c0 in pthread_mutex_lock () from /lib64/libpthread.so.0
#1 0x00007fa9eea75b7f in rd_kafka_dump (fp=0x1b121f0, rk=0x1b121f0) at rdkafka.c:1355
#2 0x00007fa9eea70dc4 in LaraTargetShutdownFunc () at kafka/kafka.cc:419
#3 0x000000000040fc06 in slowdown (num=26) at replicator.cc:625
#4
#5 0x0000003456009d4c in _dl_lookup_symbol_x () from /lib64/ld-linux-x86-64.so.2
#6 0x000000345600dfb0 in _dl_fixup () from /lib64/ld-linux-x86-64.so.2
#7 0x0000003456014675 in _dl_runtime_resolve () from /lib64/ld-linux-x86-64.so.2
#8 0x0000003457005666 in __do_global_dtors_aux () from /lib64/libpthread.so.0
#9 0x0000000000000000 in ?? ()

@edenhill
Copy link
Contributor

Seems like a lock ordering problem, can you do this in gdb:
thread apply all bt

@winbatch
Copy link
Author

which core do you want me to apply that to?

@edenhill
Copy link
Contributor

The one where dump blocks on mutex_lock

@edenhill
Copy link
Contributor

output is truncated, do set height 0

@edenhill
Copy link
Contributor

It will be quite big so you can mail it to me directly.

@winbatch
Copy link
Author

I re-pasted. hopefully better?

On Mon, Feb 10, 2014 at 10:44 PM, Magnus Edenhill
notifications@github.comwrote:

output is truncated, do set height 0

Reply to this email directly or view it on GitHubhttps://github.com//issues/72#issuecomment-34724076
.

@edenhill
Copy link
Contributor

oh, huhm, I think I removed your comment at the same time, sorry :| Again please?

@winbatch
Copy link
Author

Core was generated by `./replicator -c ../devconfig/superkafka.ini'.
#0 0x000000345700dff4 in __lll_lock_wait () from /lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.47.el6_2.9.x86_64 libgcc-4.4.6-3.el6.x86_64 libstdc++-4.4.6-3.el6.x86_64 zlib- 1.2.3-27.el6.x86_64
(gdb) where
#0 0x000000345700dff4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x0000003457009328 in _L_lock_854 () from /lib64/libpthread.so.0
#2 0x00000034570091f7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00007fa9eea8033a in rd_kafka_topic_scan_all (rk=0x1b125b0, now=16796575924155) at rdkafka_topic.c:931
#4 0x00007fa9eea76116 in rd_kafka_topic_scan_tmr_cb (rk=0x1b125b0, arg=) at rdkafka.c:816
#5 0x00007fa9eea82ab5 in rd_kafka_timers_run (rk=0x1b125b0, timeout=) at rdkafka_timer.c:155
#6 0x00007fa9eea758bd in rd_kafka_thread_main (arg=0x1b125b0) at rdkafka.c:840
#7 0x00000034570077f1 in start_thread () from /lib64/libpthread.so.0
#8 0x00000034568e5ccd in clone () from /lib64/libc.so.6
(gdb) thread apply all bt

Thread 8 (Thread 0x7fa9eec98720 (LWP 7703)):
#0 0x000000345700b75b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007fa9eea74537 in pthread_cond_timedwait_ms (cond=0x1b125d8, mutex=0x1b125b0, timeout_ms=) at rdkafka.c:107
#2 0x00007fa9eea750a8 in rd_kafka_q_serve (rk=0x1b125b0, rkq=0x1b125b0, timeout_ms=10, callback=0x7fa9eea74700 <rd_kafka_poll_cb>,
opaque=0x1b125b0) at rdkafka.c:344
#3 0x00007fa9eea70c0a in LaraTargetAppendFileFunc (pathOnDisk=0x7fffa91005c0 "/local/scratch/hoffda/beacon/beacons.log.o",
path=0x1b27408 "/d111530-004.masked.com/NOMORELOGS4/local/scratch/hoffda/beacon/beacons.log.o",
content=0x7fa9e9a09010 "J,0,gemini1,d10976-018.masked.com,10.205.59.11,LJTM,1.2.13,1/31/12,1,Sat Jan 25 06:45:50 EST 2014\r\nJ,0,asp_qa1,ln1 25060-003.masked.com,10.238.226.104,LJTM,1.2.12,06/01/10,1,Sat Jan 25 11:45:50 GMT 2014\r"..., length=1048492) at kafka/kafka.cc:387
#4 0x000000000040de39 in handleMoreData (fw=0x1b0b950, file=0x7fa9ef0ef990,
path=0x7fffa91005c0 "/local/scratch/hoffda/beacon/beacons.log.o", cfg=0x1b0f950) at replicator.cc:259
#5 0x000000000040e9fe in myCallback (fw=0x1b0b950, file=0x7fa9ef0ef990, type=MORE_DATA, data=0x1b0f950, info=0x0) at replicator.cc:398
#6 0x00007fa9eeecb4e7 in checkFileEvent (handle=0x1b0b950, file=0x7fa9ef0ef990) at filewatch.c:2088
#7 0x00007fa9eeecb651 in checkEvents (handle=0x1b0b950) at filewatch.c:2130
#8 0x000000000040cdfe in checkForNewData (data=0x1b0b950) at replicator.cc:96
#9 0x000000000041c27c in TimerHandlerEventProc (evPtr=, flags=)
at /home/infradev/resources/Tcl-8.4.19_CUSTOM_BUILD/source/tcl8.4.19/unix/../generic/tclTimer.c:546
#10 0x000000000041b67a in Tcl_ServiceEvent (flags=-3)
at /home/infradev/resources/Tcl-8.4.19_CUSTOM_BUILD/source/tcl8.4.19/unix/../generic/tclNotify.c:640
#11 0x000000000041b9e7 in Tcl_DoOneEvent (flags=-3)
at /home/infradev/resources/Tcl-8.4.19_CUSTOM_BUILD/source/tcl8.4.19/unix/../generic/tclNotify.c:945
#12 0x0000000000411dfa in startSection (section="NOMORELOGS4") at replicator.cc:859
#13 0x000000000041242b in runSections () at replicator.cc:911
#14 0x000000000041393b in handleGlobalSection (configfile=0x1af01d8 "../devconfig/superkafka.ini",
---Type to continue, or q to quit---
overrides=std::vector of length 0, capacity 0) at replicator.cc:1057
#15 0x0000000000413b44 in main (argc=3, argv=0x7fffa9102908) at replicator.cc:1102

Thread 7 (Thread 0x7fa9e5e02700 (LWP 7885)):
#0 0x000000345700dff4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x0000003457009328 in _L_lock_854 () from /lib64/libpthread.so.0
#2 0x00000034570091f7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00007fa9eea7a373 in rd_kafka_broker_update (rkb=, req_rkt=0x0, buf=,
size=) at rdkafka_broker.c:3522
#4 rd_kafka_metadata_handle (rkb=, req_rkt=0x0, buf=, size=)
at rdkafka_broker.c:808
#5 0x00007fa9eea7a72b in rd_kafka_broker_metadata_reply (rkb=0x7fa9dc003550, err=0, reply=0x7fa9c8001310, request=0x7fa9c8001440, opaque=0x0)
at rdkafka_broker.c:864
#6 0x00007fa9eea7cac3 in rd_kafka_req_response (rkb=0x7fa9dc003550) at rdkafka_broker.c:1101
#7 rd_kafka_recv (rkb=0x7fa9dc003550) at rdkafka_broker.c:1287
#8 0x00007fa9eea7d558 in rd_kafka_broker_io_serve (rkb=0x7fa9dc003550) at rdkafka_broker.c:2141
#9 0x00007fa9eea7dd90 in rd_kafka_broker_producer_serve (rkb=0x7fa9dc003550) at rdkafka_broker.c:2261
#10 0x00007fa9eea7ed58 in rd_kafka_broker_thread_main (arg=0x7fa9dc003550) at rdkafka_broker.c:3270
#11 0x00000034570077f1 in start_thread () from /lib64/libpthread.so.0
#12 0x00000034568e5ccd in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7fa9e6803700 (LWP 7884)):
#0 0x000000345700dff4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x0000003457009328 in _L_lock_854 () from /lib64/libpthread.so.0
#2 0x00000034570091f7 in pthread_mutex_lock () from /lib64/libpthread.so.0
---Type to continue, or q to quit---
#3 0x00007fa9eea75b7f in rd_kafka_dump (fp=0x7fa9d41430a0, rk=0x1b125b0) at rdkafka.c:1355
#4 0x00007fa9eea70dc4 in LaraTargetShutdownFunc () at kafka/kafka.cc:419
#5 0x000000000040fc06 in slowdown (num=26) at replicator.cc:625
#6
#7 0x00000034574044ed in ?? () from /lib64/libz.so.1
#8 0x0000003457404ec1 in ?? () from /lib64/libz.so.1
#9 0x0000003457405a62 in deflate () from /lib64/libz.so.1
#10 0x00007fa9eea7e367 in rd_kafka_broker_produce_toppar (rkb=0x7fa9dc002ec0) at rdkafka_broker.c:1931
#11 rd_kafka_broker_producer_serve (rkb=0x7fa9dc002ec0) at rdkafka_broker.c:2236
#12 0x00007fa9eea7ed58 in rd_kafka_broker_thread_main (arg=0x7fa9dc002ec0) at rdkafka_broker.c:3270
#13 0x00000034570077f1 in start_thread () from /lib64/libpthread.so.0
#14 0x00000034568e5ccd in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7fa9e7204700 (LWP 7883)):
#0 0x000000345700dff4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x0000003457009328 in _L_lock_854 () from /lib64/libpthread.so.0
#2 0x00000034570091f7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00007fa9eea7a373 in rd_kafka_broker_update (rkb=, req_rkt=0x0, buf=,
size=) at rdkafka_broker.c:3522
#4 rd_kafka_metadata_handle (rkb=, req_rkt=0x0, buf=, size=)
at rdkafka_broker.c:808
#5 0x00007fa9eea7a72b in rd_kafka_broker_metadata_reply (rkb=0x7fa9dc002830, err=0, reply=0x7fa9d0001310, request=0x7fa9d0001440, opaque=0x0)
at rdkafka_broker.c:864
#6 0x00007fa9eea7cac3 in rd_kafka_req_response (rkb=0x7fa9dc002830) at rdkafka_broker.c:1101
#7 rd_kafka_recv (rkb=0x7fa9dc002830) at rdkafka_broker.c:1287
---Type to continue, or q to quit---
#8 0x00007fa9eea7d558 in rd_kafka_broker_io_serve (rkb=0x7fa9dc002830) at rdkafka_broker.c:2141
#9 0x00007fa9eea7dd90 in rd_kafka_broker_producer_serve (rkb=0x7fa9dc002830) at rdkafka_broker.c:2261
#10 0x00007fa9eea7ed58 in rd_kafka_broker_thread_main (arg=0x7fa9dc002830) at rdkafka_broker.c:3270
#11 0x00000034570077f1 in start_thread () from /lib64/libpthread.so.0
#12 0x00000034568e5ccd in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7fa9e7c05700 (LWP 7805)):
#0 0x000000345700dff4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x0000003457009328 in _L_lock_854 () from /lib64/libpthread.so.0
#2 0x00000034570091f7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00007fa9eea7a373 in rd_kafka_broker_update (rkb=, req_rkt=0x0, buf=,
size=) at rdkafka_broker.c:3522
#4 rd_kafka_metadata_handle (rkb=, req_rkt=0x0, buf=, size=)
at rdkafka_broker.c:808
#5 0x00007fa9eea7a72b in rd_kafka_broker_metadata_reply (rkb=0x1b138d0, err=0, reply=0x7fa9d8001310, request=0x7fa9d8001440, opaque=0x0)
at rdkafka_broker.c:864
#6 0x00007fa9eea7cac3 in rd_kafka_req_response (rkb=0x1b138d0) at rdkafka_broker.c:1101
#7 rd_kafka_recv (rkb=0x1b138d0) at rdkafka_broker.c:1287
#8 0x00007fa9eea7d558 in rd_kafka_broker_io_serve (rkb=0x1b138d0) at rdkafka_broker.c:2141
#9 0x00007fa9eea7eca5 in rd_kafka_broker_ua_idle (arg=0x1b138d0) at rdkafka_broker.c:2160
#10 rd_kafka_broker_thread_main (arg=0x1b138d0) at rdkafka_broker.c:3268
#11 0x00000034570077f1 in start_thread () from /lib64/libpthread.so.0
#12 0x00000034568e5ccd in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7fa9e8606700 (LWP 7804)):
---Type to continue, or q to quit---
#0 0x000000345700dff4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x0000003457009328 in _L_lock_854 () from /lib64/libpthread.so.0
#2 0x00000034570091f7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00007fa9eea7a373 in rd_kafka_broker_update (rkb=, req_rkt=0x0, buf=,
size=) at rdkafka_broker.c:3522
#4 rd_kafka_metadata_handle (rkb=, req_rkt=0x0, buf=, size=)
at rdkafka_broker.c:808
#5 0x00007fa9eea7a72b in rd_kafka_broker_metadata_reply (rkb=0x1b13240, err=0, reply=0x7fa9dc001310, request=0x7fa9dc001440, opaque=0x0)
at rdkafka_broker.c:864
#6 0x00007fa9eea7cac3 in rd_kafka_req_response (rkb=0x1b13240) at rdkafka_broker.c:1101
#7 rd_kafka_recv (rkb=0x1b13240) at rdkafka_broker.c:1287
#8 0x00007fa9eea7d558 in rd_kafka_broker_io_serve (rkb=0x1b13240) at rdkafka_broker.c:2141
#9 0x00007fa9eea7eca5 in rd_kafka_broker_ua_idle (arg=0x1b13240) at rdkafka_broker.c:2160
#10 rd_kafka_broker_thread_main (arg=0x1b13240) at rdkafka_broker.c:3268
#11 0x00000034570077f1 in start_thread () from /lib64/libpthread.so.0
#12 0x00000034568e5ccd in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7fa9e9007700 (LWP 7803)):
#0 0x000000345700dff4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x0000003457009328 in _L_lock_854 () from /lib64/libpthread.so.0
#2 0x00000034570091f7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00007fa9eea7a373 in rd_kafka_broker_update (rkb=, req_rkt=0x0, buf=,
size=) at rdkafka_broker.c:3522
#4 rd_kafka_metadata_handle (rkb=, req_rkt=0x0, buf=, size=)
at rdkafka_broker.c:808
---Type to continue, or q to quit---
#5 0x00007fa9eea7a72b in rd_kafka_broker_metadata_reply (rkb=0x1b12bb0, err=0, reply=0x7fa9e0001310, request=0x7fa9e0001460, opaque=0x0)
at rdkafka_broker.c:864
#6 0x00007fa9eea7cac3 in rd_kafka_req_response (rkb=0x1b12bb0) at rdkafka_broker.c:1101
#7 rd_kafka_recv (rkb=0x1b12bb0) at rdkafka_broker.c:1287
#8 0x00007fa9eea7d558 in rd_kafka_broker_io_serve (rkb=0x1b12bb0) at rdkafka_broker.c:2141
#9 0x00007fa9eea7eca5 in rd_kafka_broker_ua_idle (arg=0x1b12bb0) at rdkafka_broker.c:2160
#10 rd_kafka_broker_thread_main (arg=0x1b12bb0) at rdkafka_broker.c:3268
#11 0x00000034570077f1 in start_thread () from /lib64/libpthread.so.0
#12 0x00000034568e5ccd in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7fa9e9a08700 (LWP 7802)):
#0 0x000000345700dff4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x0000003457009328 in _L_lock_854 () from /lib64/libpthread.so.0
#2 0x00000034570091f7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00007fa9eea8033a in rd_kafka_topic_scan_all (rk=0x1b125b0, now=16796575924155) at rdkafka_topic.c:931
#4 0x00007fa9eea76116 in rd_kafka_topic_scan_tmr_cb (rk=0x1b125b0, arg=) at rdkafka.c:816
#5 0x00007fa9eea82ab5 in rd_kafka_timers_run (rk=0x1b125b0, timeout=) at rdkafka_timer.c:155
#6 0x00007fa9eea758bd in rd_kafka_thread_main (arg=0x1b125b0) at rdkafka.c:840
#7 0x00000034570077f1 in start_thread () from /lib64/libpthread.so.0
#8 0x00000034568e5ccd in clone () from /lib64/libc.so.6

@edenhill
Copy link
Contributor

This thread looks weird:
#3 0x00007fa9eea75b7f in rd_kafka_dump (fp=0x7fa9d41430a0, rk=0x1b125b0) at rdkafka.c:1355
#4 0x00007fa9eea70dc4 in LaraTargetShutdownFunc () at kafka/kafka.cc:419
#5 0x000000000040fc06 in slowdown (num=26) at replicator.cc:625
#6
#7 0x00000034574044ed in ?? () from /lib64/libz.so.1
#8 0x0000003457404ec1 in ?? () from /lib64/libz.so.1
#9 0x0000003457405a62 in deflate () from /lib64/libz.so.1
#10 0x00007fa9eea7e367 in rd_kafka_broker_produce_toppar (rkb=0x7fa9dc002ec0) at rdkafka_broker.c:1931

First it's rdkafka compression a message and then we're back in your code calling dump.
Is that the signal handler perhaps? And are you calling rdkafka functions from your signal handler?

@winbatch
Copy link
Author

I am. Is that not allowed?

On Mon, Feb 10, 2014 at 10:52 PM, Magnus Edenhill
notifications@github.comwrote:

This thread looks weird:
#3 #3 0x00007fa9eea75b7f
in rd_kafka_dump (fp=0x7fa9d41430a0, rk=0x1b125b0) at rdkafka.c:1355
#4 #4 0x00007fa9eea70dc4
in LaraTargetShutdownFunc () at kafka/kafka.cc:419
#5 #5 0x000000000040fc06
in slowdown (num=26) at replicator.cc:625
#6 #6
#7 #7 0x00000034574044ed
in ?? () from /lib64/libz.so.1
#8 #8 0x0000003457404ec1 in
?? () from /lib64/libz.so.1
#9 #9 0x0000003457405a62 in
deflate () from /lib64/libz.so.1
#10 #10 0x00007fa9eea7e367
in rd_kafka_broker_produce_toppar (rkb=0x7fa9dc002ec0) at
rdkafka_broker.c:1931

First it's rdkafka compression a message and then we're back in your code
calling dump.
Is that the signal handler perhaps? And are you calling rdkafka functions
from your signal handler?

Reply to this email directly or view it on GitHubhttps://github.com//issues/72#issuecomment-34724380
.

@winbatch
Copy link
Author

Should I be setting a simple boolean in the signal handler and then calling
the rdkafka functions later outside of the signal based on the boolean?

On Mon, Feb 10, 2014 at 10:53 PM, Dan Hoffman hoffmandan@gmail.com wrote:

I am. Is that not allowed?

On Mon, Feb 10, 2014 at 10:52 PM, Magnus Edenhill <
notifications@github.com> wrote:

This thread looks weird:
#3 #3 0x00007fa9eea75b7f
in rd_kafka_dump (fp=0x7fa9d41430a0, rk=0x1b125b0) at rdkafka.c:1355
#4 #4 0x00007fa9eea70dc4
in LaraTargetShutdownFunc () at kafka/kafka.cc:419
#5 #5 0x000000000040fc06
in slowdown (num=26) at replicator.cc:625
#6 #6
#7 #7 0x00000034574044ed
in ?? () from /lib64/libz.so.1
#8 #8 0x0000003457404ec1 in
?? () from /lib64/libz.so.1
#9 #9 0x0000003457405a62 in
deflate () from /lib64/libz.so.1
#10 #10 0x00007fa9eea7e367
in rd_kafka_broker_produce_toppar (rkb=0x7fa9dc002ec0) at
rdkafka_broker.c:1931

First it's rdkafka compression a message and then we're back in your code
calling dump.
Is that the signal handler perhaps? And are you calling rdkafka functions
from your signal handler?

Reply to this email directly or view it on GitHubhttps://github.com//issues/72#issuecomment-34724380
.

@edenhill
Copy link
Contributor

Generally it is a very bad idea to do anything from a signal handler, except for setting some "amIsupposedToRun" variable to false, since the signal handler can be called at any time in any thread.

And I think thats whats happening here.

rdkafka should block signals in its own threads though, which will make this situation somewhat better for rdkafka calls, and I will fix that. But I strongly recommend you to move out your code from the signal handler.

@edenhill
Copy link
Contributor

Spot on! :)

@winbatch
Copy link
Author

I'll make that change and see how it goes.

On Mon, Feb 10, 2014 at 10:59 PM, Magnus Edenhill
notifications@github.comwrote:

Spot on! :)

Reply to this email directly or view it on GitHubhttps://github.com//issues/72#issuecomment-34724608
.

@winbatch
Copy link
Author

Looking good so far. Consider this closed (again ;) )

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants