rdkafka stopping connection retries on RdKafka::ERR__ALL_BROKERS_DOWN #373

Closed
DEvil0000 opened this issue Sep 18, 2015 · 11 comments

Comments

@DEvil0000
Contributor

rdkafka tries to connect to all brokers in the list. If they all fail, I get an event_cb (ERR__ALL_BROKERS_DOWN), but no further connection attempts are made.
I would expect rdkafka to report the error but keep trying to connect to the brokers.
If you do not think this should be the default behaviour, I would suggest a config value for retrying after all brokers are down.

@edenhill
Contributor

librdkafka should try connecting to all brokers it knows about forever:
https://github.com/edenhill/librdkafka/blob/master/src/rdkafka_broker.c#L4216

What it does do is suppress log messages of failed connection attempts if the error is the same as the last attempt (e.g., Connection refused), so maybe that is what you are seeing (or not seeing), lack of log messages?

As soon as brokers start coming back up it should reconnect to them within a couple of seconds.
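
To illustrate why suppressed log messages can look like stopped retries, here is a minimal sketch of such a filter. It is not librdkafka's actual implementation; the class `ErrorLogFilter` and its methods are hypothetical names made up for this example. It logs an error only when the message differs from the previous one.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper (not librdkafka code): forwards an error message to the
// log only when it differs from the previously reported one.
class ErrorLogFilter {
public:
    // Returns true if the message was logged, false if it was suppressed.
    bool report(const std::string &msg) {
        assert(!msg.empty());  // an empty string means "no previous error"
        if (msg == last_) {
            ++suppressed_;
            return false;
        }
        last_ = msg;
        logged_.push_back(msg);
        return true;
    }

    // Call after a successful connection so the next failure is logged again.
    void reset() { last_.clear(); }

    const std::vector<std::string> &logged() const { return logged_; }
    int suppressed() const { return suppressed_; }

private:
    std::string last_;
    std::vector<std::string> logged_;
    int suppressed_ = 0;
};
```

With a filter like this, a long run of identical "Connection refused" failures yields a single log line, even though every attempt still takes place.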

@DEvil0000
Contributor Author

If a single broker goes down, it reconnects to it as I expect (reconnecting).
But after the "all brokers down" message, all rdkafka threads exit (exiting -.-).
Example log:

ERROR (Local: All broker connections are down): 4/4 brokers are down
LOG-3-FAIL: test-zkkafka-857feb05-1b9c-4c04-9918-a6727dc263cb:9092/bootstrap: Failed to connect to broker at test-zkkafka-857feb05-1b9c-4c04-9918-a6727dc263cb:9092: Connection refused
ERROR (Local: Broker transport failure): test-zkkafka-857feb05-1b9c-4c04-9918-a6727dc263cb:9092/bootstrap: Failed to connect to broker at test-zkkafka-857feb05-1b9c-4c04-9918-a6727dc263cb:9092: Connection refused
[Threads exiting]

Code that produces these logs:

switch (iraEvent.type())
{
case RdKafka::Event::EVENT_ERROR:
    L_ERROR << "ERROR (" << RdKafka::err2str(iraEvent.err()) << "): " <<
        iraEvent.str() << LEnd;
    // Remember that no broker is reachable any more.
    if (iraEvent.err() == RdKafka::ERR__ALL_BROKERS_DOWN) {
        setConnected(false);
    }
    break;
case RdKafka::Event::EVENT_STATS:
    L_INFO << "\"STATS\": " << iraEvent.str() << LEnd;
    break;
case RdKafka::Event::EVENT_LOG:
    L_DEBUG << "LOG-" << iraEvent.severity() << "-" <<
        iraEvent.fac().c_str() << ": " << iraEvent.str().c_str() << LEnd;
    break;
default:
    // Any other event type is logged as an error.
    L_ERROR << "EVENT " << iraEvent.type() <<
        " (" << RdKafka::err2str(iraEvent.err()) << "): " <<
        iraEvent.str() << LEnd;
    break;
}

@edenhill
Contributor

The librdkafka broker threads should not exit unless the rd_kafka_t handle is marked for destruction (through rd_kafka_destroy() and rk->rk_terminate).

You could set a breakpoint at the last line of rd_kafka_broker_thread_main() to help figure out why it is exiting.
At the breakpoint, try: p rkb->rkb_rk->rk_terminate

@DEvil0000
Contributor Author

I tried to trace it with gdb:
break rd_kafka_broker_thread_main
run
record
fin
(The plan was to use "reverse-step" afterwards, but it segfaulted.)
I am not sure whether this is the real problem or is caused by using gdb.
More on this: http://pastebin.com/jPNpRSAp

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xb00f6b40 (LWP 619)]
rd_kafka_broker_thread_main (arg=0x2) at rdkafka_broker.c:4121
4121 in rdkafka_broker.c
(gdb) bt
#0 rd_kafka_broker_thread_main (arg=0x2) at rdkafka_broker.c:4121
#1 0xb490aba0 in ?? ()
#2 0x000228a3 in ?? ()
#3 0xd5c2e800 in ?? ()
#4 0x2de8ffff in ?? ()
#5 0xe8fffffd in ?? ()
#6 0xffffd608 in ?? ()
#7 0x448bfff0 in ?? ()
#8 0x0f000001 in ?? ()
#9 0xc084c094 in ?? ()
#10 0x016f850f in ?? ()
#11 0x458b0000 in ?? ()
#12 0x0db88008 in ?? ()
#13 0x00000002 in ?? ()
#14 0x0133850f in ?? ()
#15 0x458b0000 in ?? ()
#16 0x8883f008 in ?? ()
#17 0x00000084 in ?? ()
#18 0xd4f6e810 in ?? ()
#19 0x758bffff in ?? ()
#20 0x70968b08 in ?? ()
#21 0xf7000002 in ?? ()
#22 0x89c189d8 in ?? ()
#23 0x21d029e8 in ?? ()
#24 0x74863bc8 in ?? ()
#25 0x0f000002 in ?? ()
#26 0x0000da83 in ?? ()
#27 0x40003d00 in ?? ()
#28 0x870f0000 in ?? ()
#29 0x000000b1 in ?? ()
#30 0x8b08458b in ?? ()
#31 0x00022480 in ?? ()
#32 0x08453900 in ?? ()
#33 0x0f08458b in ?? ()
#34 0x00009284 in ?? ()
#35 0x8480f600 in ?? ()
#36 0x40000000 in ?? ()
#37 0x01b84f75 in ?? ()
#38 0x90000000 in ?? ()
#39 0x80cddb31 in ?? ()
#40 0xd231faeb in ?? ()
#41 0xbd8d20b1 in ?? ()
#42 0xffffff68 in ?? ()
#43 0xbd89d089 in ?? ()
#44 0xffffff64 in ?? ()
#45 0x000008be in ?? ()
#46 0xbfabf300 in ?? ()

@edenhill
Contributor

Ah, it dies due to a segmentation fault. Try to figure out how; if gdb doesn't help, try running it with valgrind.

@DEvil0000
Contributor Author

It looks like this segfault only occurs when debugging with record.
With the following gdb steps it does not segfault:

break rd_kafka_broker_thread_main
finish
(Value returned is $2 = (void *) 0x0)

@edenhill
Contributor

Check rkb->rkb_rk->rk_terminate

@DEvil0000
Contributor Author

(gdb) print ((rd_kafka_broker*)arg)->rkb_state
No symbol "rd_kafka_broker" in current context.
(gdb) print ((rd_kafka_broker_t*)arg)->rkb_state
$7 = RD_KAFKA_BROKER_STATE_DOWN
(gdb) print ((rd_kafka_broker_t*)arg)->rkb_source
$8 = RD_KAFKA_LEARNED
(gdb) print ((rd_kafka_broker_t*)arg)->rkb_rk->rk_terminate
$9 = 0

It calls broker_connect, then sleeps once,
and then appears to shut down.

@edenhill
Contributor

That's very weird. As you can see in rd_kafka_broker_thread_main(), the only case where it returns is when rk_terminate is set to non-zero.

Try running it in valgrind to see if there is some memory corruption going on.
I suggest you use the suppressions file available in the tests/ directory.
Something like:

valgrind --suppressions=librdkafka/tests/librdkafka.suppressions ./your-program ..
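
For readers following along, the loop shape described above can be sketched as follows. This is a simplified model, not librdkafka's code: `FakeHandle`, `broker_loop` and the `after_attempt` callback are hypothetical names, and only the terminate flag mirrors `rk_terminate` from this thread. Failed connection attempts never end the loop; only the flag does.

```cpp
#include <cassert>
#include <functional>

// Hypothetical stand-in for the client handle; only the terminate flag
// mirrors a name (rk_terminate) mentioned in this thread.
struct FakeHandle {
    bool terminate = false;    // set by the owner when destroying the handle
    int connect_attempts = 0;  // failed attempts so far
};

// Models the shape of a broker thread's main loop: it keeps retrying and
// returns only once the terminate flag is set. `after_attempt` stands in for
// whatever runs between attempts, such as the application's event callback.
void broker_loop(FakeHandle &h,
                 const std::function<void(FakeHandle &)> &after_attempt) {
    assert(after_attempt != nullptr);
    while (!h.terminate) {
        ++h.connect_attempts;  // every attempt fails in this sketch
        after_attempt(h);
    }
}
```

If the callback never touches the flag, the loop retries indefinitely. If application code sets it, for example by destroying the handle in reaction to an error event, the loop returns right after that attempt.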

@DEvil0000
Contributor Author

Thanks for helping me track this down.
It was a misunderstanding of the library's behaviour on my side, plus a bug in my wrapper:
if there is no broker left, I stop my producer (the only one using this Kafka cluster).
Stopping it includes destroying the producer (reusing it was provoking rdkafka issues).
Destroying the (only) producer means rdkafka shuts down, which sounds valid.

Hardware watchpoint 2: -location ((rd_kafka_broker_t*)arg)->rkb_rk->rk_terminate
Old value = 0
New value = 1
rd_kafka_destroy (rk=0xb5b04b70) at rdkafka.c:850
850 rdkafka.c: No such file or directory.
(gdb) bt
#0 rd_kafka_destroy (rk=0xb5b04b70) at rdkafka.c:850
#1 0xb7882380 in ~ProducerImpl (this=0xb5b01878, __in_chrg=, __vtt_parm=) at rdkafkacpp_int.h:339
#2 RdKafka::ProducerImpl::~ProducerImpl (this=0xb5b01878, __in_chrg=, __vtt_parm=) at rdkafkacpp_int.h:339
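
One way to avoid this in a wrapper is to let the error handler only record the outage and to destroy the producer solely on an explicit shutdown. The sketch below is self-contained and does not use librdkafka: `Err`, `FakeProducer` and `ProducerWrapper` are hypothetical stand-ins, and only `AllBrokersDown` corresponds to a name from this thread (RdKafka::ERR__ALL_BROKERS_DOWN). How the wrapper learns that brokers are reachable again is left out.

```cpp
#include <cassert>
#include <memory>

// Stand-in error codes for this sketch; only AllBrokersDown corresponds to a
// name from the thread (RdKafka::ERR__ALL_BROKERS_DOWN).
enum class Err { Transport, AllBrokersDown };

// Hypothetical stand-in for the producer handle owned by the wrapper.
struct FakeProducer {};

class ProducerWrapper {
public:
    ProducerWrapper() : producer_(new FakeProducer()) {}

    // Error handler: record the outage but keep the producer alive, so the
    // library underneath can go on retrying its connections.
    void on_error(Err err) {
        assert(producer_ != nullptr);  // events are expected only while the handle exists
        if (err == Err::AllBrokersDown) connected_ = false;
    }

    // Explicit shutdown is the only place where the producer is destroyed.
    void shutdown() {
        producer_.reset();
        connected_ = false;
    }

    void set_connected(bool c) { connected_ = c; }
    bool connected() const { return connected_; }
    bool has_producer() const { return producer_ != nullptr; }

private:
    std::unique_ptr<FakeProducer> producer_;
    bool connected_ = true;
};
```

After `on_error(Err::AllBrokersDown)` the wrapper reports itself as disconnected but still owns its producer; the handle goes away only when `shutdown()` is called.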

@edenhill
Contributor

Glad you found it!
