Why might I get ALL_BROKERS_DOWN at 10 second intervals on a consumer? #227
Looks like you have issues with your broker.
With "debug": "broker", I get: some normal looking init stuff: %7|1501855630.618|BROKER|rdkafka#consumer-1| [thrd:app]: 198.18.130.41:9092/bootstrap: Added new broker with N Then for each partition, one of these lines: %7|1501855640.776|TOPBRK|rdkafka#consumer-1| [thrd:198.18.130.41:9092/bootstrap]: 198.18.130.41:9092/1: Topic empvoc_notifications [0]: joining broker (rktp 0xf4801e30) But then: %7|1501855649.619|REQERR|rdkafka#consumer-1| [thrd:main]: 198.18.130.41:9092/1: HeartbeatRequest failed: Broker: Group rebalance in progress: explicit actions 0x0 Then one of these for each partition: %7|1501855653.779|REQERR|rdkafka#consumer-1| [thrd:main]: 198.18.130.41:9092/1: OffsetRequest failed: Local: Timed out: explicit actions 0x0 Followed by what looks like a retry: %7|1501855653.879|CONNECT|rdkafka#consumer-1| [thrd:198.18.130.41:9092/bootstrap]: 198.18.130.41:9092/1: broker in state DOWN connecting Once again, I'm confident that there are no connectivity issues between the broker and consumer, and the broker is up and running smoothly. |
Please provide:
Thanks
confluent_kafka.version(): ('0.9.2', 590080), confluent_kafka.libversion(): ('0.9.5', 591359)
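For anyone reproducing this, those tuples come from the client's version helpers; a minimal check:

    import confluent_kafka

    # Python client version and underlying librdkafka version, each as a (string, int) tuple
    print(confluent_kafka.version())     # e.g. ('0.9.2', 590080)
    print(confluent_kafka.libversion())  # e.g. ('0.9.5', 591359)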
I wonder if this is an issue with socket.timeout.ms being smaller than session.timeout.ms.
Well, well, well. That seems to have done the trick. Many thanks. But why?

I think I decreased socket.timeout.ms to 3000 a long time ago because I didn't want such a big delay before finding out about a true network connectivity problem; my app wanted to notice sooner than 30 seconds. Should I decrease session.timeout.ms instead of increasing socket.timeout.ms? Maybe they should be the same, somewhere in the middle, like 10 or 15 seconds? Intuitively, it seems unlikely that you would ever have a "legitimate" network delay of more than a few seconds. I just made them both 30000 and it fixed my problem. Is it OK if they are equal?

Here is what CONFIGURATION.md says about socket.timeout.ms: "Timeout for network requests." But that doesn't make it clear why I should make socket.timeout.ms > session.timeout.ms.
The JoinGroupRequest can block for up to session.timeout.ms on the broker, so if the client uses a lower timeout than that when waiting for the response, this can happen. Now, that isn't exactly what is happening here, because the client knows the JoinGroupRequest can take up to session.timeout.ms (plus some grace period on top of that). But any other requests in flight behind it will be head-of-line blocked and not processed by the broker until the JoinGroupRequest is done, so it is likely some other request that is actually timing out, such as a Metadata request, a Fetch, etc.

The workaround is simple enough: keep socket.timeout.ms above session.timeout.ms. I would advise against a low socket timeout anyway: a TCP connection may stall for more than a couple of seconds if either of the connected systems is swapping, garbage collecting, or hitting an IO peak, and if there is a short network outage or re-routing it may take some time for TCP to recover by retransmitting lost packets.
So definitely, socket.timeout.ms greater than session.timeout.ms, not merely equal? |
Yes, I would recommend session.timeout.ms + 10s or more. |
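A minimal consumer configuration sketch reflecting that advice, with illustrative values (30s session timeout, 40s socket timeout); the bootstrap address comes from the log excerpt and the group id is a placeholder:

    from confluent_kafka import Consumer

    consumer = Consumer({
        'bootstrap.servers': '198.18.130.41:9092',
        'group.id': 'example-group',   # placeholder group id
        'session.timeout.ms': 30000,   # consumer group session timeout
        'socket.timeout.ms': 40000,    # kept ~10s above session.timeout.ms, per the advice above
    })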
I am running a consumer in a test lab, and the producer is on the same machine as my single Kafka server. That is, it is a single-node cluster, with ZooKeeper running on the same machine as well. The consumer is on the same network, and I am sure there are no connectivity problems between the consumer and the broker. The consumer gets ALL_BROKERS_DOWN a lot, either exactly 10, 20, or 30 seconds after the last time, and when I say exactly, I mean almost to the thousandth of a second. Here is an excerpt from my log file:
[2017-08-03 20:24:12,652] [18018] [vocfe_notify_server.py:634] [DEBUG] vocfe_kafka_error(198.18.130.41): code -187, name _ALL_BROKERS_DOWN (former kafka_error_code -192)
[2017-08-03 20:24:22,652] [18018] [vocfe_notify_server.py:634] [DEBUG] vocfe_kafka_error(198.18.130.41): code -187, name _ALL_BROKERS_DOWN (former kafka_error_code -192)
[2017-08-03 20:24:42,652] [18018] [vocfe_notify_server.py:634] [DEBUG] vocfe_kafka_error(198.18.130.41): code -187, name _ALL_BROKERS_DOWN (former kafka_error_code -192)
[2017-08-03 20:24:52,652] [18018] [vocfe_notify_server.py:634] [DEBUG] vocfe_kafka_error(198.18.130.41): code -187, name _ALL_BROKERS_DOWN (former kafka_error_code -192)
[2017-08-03 20:25:02,652] [18018] [vocfe_notify_server.py:634] [DEBUG] vocfe_kafka_error(198.18.130.41): code -187, name _ALL_BROKERS_DOWN (former kafka_error_code -192)
[2017-08-03 20:25:22,652] [18018] [vocfe_notify_server.py:634] [DEBUG] vocfe_kafka_error(198.18.130.41): code -187, name _ALL_BROKERS_DOWN (former kafka_error_code -192)
[2017-08-03 20:25:32,652] [18018] [vocfe_notify_server.py:634] [DEBUG] vocfe_kafka_error(198.18.130.41): code -187, name _ALL_BROKERS_DOWN (former kafka_error_code -192)
There are occasional exceptions to this 10-second rule, but look at all the 652s in the lines above; those are the thousandths of a second.
I get MSG_TIMED_OUT as well (on my consumer). I also sometimes manage to receive some actual messages between the times I get ALL_BROKERS_DOWN. What could this mean? Is there some periodic heartbeat/handshake or something that takes place at 10-second intervals between the consumer and the broker/cluster? The consumer is not doing anything every 10 seconds.
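For context, the log lines above come from an error callback; a hedged sketch of how such a callback might be wired up (the function name, log format, and group id are assumptions, not the reporter's actual code):

    import logging
    from confluent_kafka import Consumer

    log = logging.getLogger(__name__)

    def on_kafka_error(err):
        # err is a confluent_kafka.KafkaError; _ALL_BROKERS_DOWN is a local
        # (client-side) error meaning no broker connection is currently up.
        log.debug("kafka error: code %d, name %s", err.code(), err.name())

    consumer = Consumer({
        'bootstrap.servers': '198.18.130.41:9092',
        'group.id': 'example-group',   # placeholder group id
        'error_cb': on_kafka_error,    # invoked for client-level errors such as _ALL_BROKERS_DOWN
    })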