Crash on exit #72
Comments
And there are cases where it gets stuck over and over in the 'Waiting' loop.
I fixed a topic refcnt problem on a private branch a couple of days ago. This is probably what you are seeing.
That'd be great. Is my logic OK?
Almost! For a producer you will also want to check that all messages have been sent (unless you control this yourself via the dr_cb):
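A minimal sketch of such a check, using the standard librdkafka C API (the helper name drain_outq and the 100 ms poll interval are illustrative, not from this thread):
#include <librdkafka/rdkafka.h>

/* Serve delivery report callbacks until every outstanding message has
 * been delivered (or failed), so nothing is lost on shutdown. */
static void drain_outq(rd_kafka_t *rk)
{
    while (rd_kafka_outq_len(rk) > 0)
        rd_kafka_poll(rk, 100);  /* fires dr_cb for completed messages */
}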
I'm already doing the dr_cb bit. My question was more around the loop. If I want to guarantee that everything shuts down cleanly, should I do a loop or just put a very high number in the wait_destroyed() function?
The only reason for calling wait_destroyed() repeatedly is if your application wants to do something between attempts; otherwise just pass a larger timeout to wait_destroyed().
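In other words, the while loop can be replaced by a single bounded call; a sketch, where the 30000 ms timeout is only an example value:
/* Returns 0 once all rdkafka handles are destroyed, or -1 if the
 * timeout expires first. */
if (rd_kafka_wait_destroyed(30000) == -1) {
    /* log a warning and give up; looping will not help if a refcnt leaked */
}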
There were cases where it was in a constant loop, even at 1000 ms.
Yep, that is because the topic refcnt was not properly decreased in all situations.
Looks like there are still cases where it won't shut down regardless of wait time.
And cases where it still crashes.
replicator: rdkafka_topic.c:323: rd_kafka_topic_destroy0: Assertion `rkt->rkt_refcnt == 0' failed.
Program terminated with signal 11, Segmentation fault.
#0  0x00000039fae7788a in _int_free () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.47.el6_2.9.x86_64 libgcc-4.4.6-3.el6.x86_64 libstdc++-4.4.6-3.el6.x86_64 zlib-1.2.3-27.el6.x86_64
(gdb) where
#0  0x00000039fae7788a in _int_free () from /lib64/libc.so.6
#1  0x00007fa3974813a6 in rd_kafka_destroy0 (rk=0x39fb18be80) at rdkafka.c:570
#2  0x00007fa39748beff in rd_kafka_topic_destroy0 (rkt=0x26e3ba0) at rdkafka_topic.c:326
#3  0x00007fa39748c0ee in ?? () at rdkafka_msg.h:100 from ../lib/liblarakafka.so
#4  0x00007fa388004000 in ?? ()
#5  0x00007fa3974884c5 in rd_kafka_broker_fail (rkb=0x7fa388002ec0, err=RD_KAFKA_RESP_ERR__DESTROY, fmt=0x0) at rdkafka_broker.c:386
#6  0x00007fa39748a79b in rd_kafka_broker_thread_main (arg=0x7fa388002ec0) at rdkafka_broker.c:3278
#7  0x00000039fb6077f1 in start_thread () from /lib64/libpthread.so.0
#8  0x00000039faee5ccd in clone () from /lib64/libc.so.6
Can you call …? And can you provide an example program that reproduces this? (e.g., a patch on rdkafka_example)
So this is interesting. There are certain processes that, when killed, show this in strace:
futex(0xcdc2cc, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 30599,
futex(0xcdc2a0, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xcdc2cc, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 30601,
futex(0xcdc2a0, FUTEX_WAKE_PRIVATE, 1) = 0
gcore reveals:
Another core inside dump:
Seems like a lock ordering problem; can you do this in gdb:
Which core do you want me to apply that to?
The one where dump blocks on mutex_lock.
Output is truncated, do
It will be quite big, so you can mail it to me directly.
I re-pasted. Hopefully better?
oh, huhm, I think I removed your comment at the same time, sorry :| Again please?
Core was generated by `./replicator -c ../devconfig/superkafka.ini'.
Thread 8 (Thread 0x7fa9eec98720 (LWP 7703)):
Thread 7 (Thread 0x7fa9e5e02700 (LWP 7885)):
Thread 6 (Thread 0x7fa9e6803700 (LWP 7884)):
Thread 5 (Thread 0x7fa9e7204700 (LWP 7883)):
Thread 4 (Thread 0x7fa9e7c05700 (LWP 7805)):
Thread 3 (Thread 0x7fa9e8606700 (LWP 7804)):
Thread 2 (Thread 0x7fa9e9007700 (LWP 7803)):
Thread 1 (Thread 0x7fa9e9a08700 (LWP 7802)):
This thread looks weird: first it's rdkafka compressing a message, and then we're back in your code calling dump.
I am. Is that not allowed?
Should I be setting a simple boolean in the signal handler and then calling the actual shutdown/dump code from outside the handler?
Generally it is a very bad idea to do anything from a signal handler except setting some "amIsupposedToRun" variable to false, since the signal handler can be called at any time in any thread. And I think that's what's happening here. rdkafka should block signals in its own threads though, which will make this situation somewhat better for rdkafka calls, and I will fix that. But I strongly recommend you move your code out of the signal handler.
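A sketch of the flag-only pattern being described here (the names run and handle_signal, and the choice of SIGINT/SIGTERM, are illustrative and not from this thread):
#include <signal.h>
#include <unistd.h>

/* The handler does nothing except flip an async-signal-safe flag. */
static volatile sig_atomic_t run = 1;

static void handle_signal(int sig)
{
    (void)sig;
    run = 0;
}

int main(void)
{
    signal(SIGINT, handle_signal);
    signal(SIGTERM, handle_signal);

    while (run) {
        /* normal produce / rd_kafka_poll() work goes here */
        sleep(1);
    }

    /* All real shutdown work (rd_kafka_topic_destroy, rd_kafka_destroy,
     * rd_kafka_wait_destroyed, dumping state, ...) happens here, in the
     * main thread, never inside the signal handler. */
    return 0;
}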
Spot on! :)
I'll make that change and see how it goes.
Looking good so far. Consider this closed (again ;) )
I'm trying to exit cleanly, and periodically I get a core dump that looks like this:
#0 0x00007f4f51cea8d8 in _fini () from /lib64/libresolv.so.2
#1 0x00000039fa60eb6c in _dl_fini () from /lib64/ld-linux-x86-64.so.2
#2 0x00000039fae35d92 in exit () from /lib64/libc.so.6
(more stuff below)
This is my shutdown code. Is it valid? Should I be doing something else to wait for kafka to exit cleanly? (Note: PluginReturn is just an enum.)
PluginReturn LaraTargetShutdownFunc()
{
    CommonConfig::rLog( LWARNING, "[%s] Shutting down kafka components\n", cfg->section.c_str());
    rd_kafka_topic_destroy (rkt);
    rd_kafka_destroy (rk);
    int ret = 0;
    while ( 1 )
    {
        CommonConfig::rLog( LWARNING, "[%s] Waiting for kafka components to shutdown...\n", cfg->section.c_str());
        ret = rd_kafka_wait_destroyed (1000);
        if ( ret == 0 )
            break;
    }
    CommonConfig::rLog( LWARNING, "[%s] Shutdown of kafka components complete\n", cfg->section.c_str());
    return PLUGIN_SUCCESS;
}
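Putting the advice from the comments together, a possible revision of the function above might look like the following sketch. It reuses the declarations from the snippet (cfg, rkt, rk, CommonConfig::rLog, PLUGIN_SUCCESS); the outq drain and the single bounded wait are the changes, and the 30000 ms timeout is only an illustrative value.
PluginReturn LaraTargetShutdownFunc()
{
    CommonConfig::rLog( LWARNING, "[%s] Shutting down kafka components\n", cfg->section.c_str());

    /* Make sure every produced message has been delivered (or failed)
     * before tearing the handles down; this also serves dr_cb callbacks. */
    while (rd_kafka_outq_len(rk) > 0)
        rd_kafka_poll(rk, 100);

    rd_kafka_topic_destroy (rkt);
    rd_kafka_destroy (rk);

    /* One bounded wait instead of an endless loop; returns 0 on success,
     * -1 if rdkafka did not finish within the timeout. */
    if (rd_kafka_wait_destroyed (30000) != 0)
        CommonConfig::rLog( LWARNING, "[%s] Timed out waiting for kafka components to shutdown\n", cfg->section.c_str());

    CommonConfig::rLog( LWARNING, "[%s] Shutdown of kafka components complete\n", cfg->section.c_str());
    return PLUGIN_SUCCESS;
}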