
Why might I get ALL_BROKERS_DOWN at 10 second intervals on a consumer? #227

Closed

johnrgregg3 opened this issue Aug 3, 2017 · 9 comments

@johnrgregg3

I am running a consumer in a test lab, and the producer is on the same machine as my single Kafka broker. That is, it is a single-node cluster, with ZooKeeper running on the same machine as well. The consumer is on the same network, and I am sure there are no connectivity problems between the consumer and the broker. The consumer gets ALL_BROKERS_DOWN a lot, either exactly 10, 20, or 30 seconds after the previous occurrence, and when I say exactly, I mean to within about a thousandth of a second. Here is an excerpt from my log file:

[2017-08-03 20:24:12,652] [18018] [vocfe_notify_server.py:634] [DEBUG] vocfe_kafka_error(198.18.130.41): code -187, name _ALL_BROKERS_DOWN (former kafka_error_code -192)
[2017-08-03 20:24:22,652] [18018] [vocfe_notify_server.py:634] [DEBUG] vocfe_kafka_error(198.18.130.41): code -187, name _ALL_BROKERS_DOWN (former kafka_error_code -192)
[2017-08-03 20:24:42,652] [18018] [vocfe_notify_server.py:634] [DEBUG] vocfe_kafka_error(198.18.130.41): code -187, name _ALL_BROKERS_DOWN (former kafka_error_code -192)
[2017-08-03 20:24:52,652] [18018] [vocfe_notify_server.py:634] [DEBUG] vocfe_kafka_error(198.18.130.41): code -187, name _ALL_BROKERS_DOWN (former kafka_error_code -192)
[2017-08-03 20:25:02,652] [18018] [vocfe_notify_server.py:634] [DEBUG] vocfe_kafka_error(198.18.130.41): code -187, name _ALL_BROKERS_DOWN (former kafka_error_code -192)
[2017-08-03 20:25:22,652] [18018] [vocfe_notify_server.py:634] [DEBUG] vocfe_kafka_error(198.18.130.41): code -187, name _ALL_BROKERS_DOWN (former kafka_error_code -192)
[2017-08-03 20:25:32,652] [18018] [vocfe_notify_server.py:634] [DEBUG] vocfe_kafka_error(198.18.130.41): code -187, name _ALL_BROKERS_DOWN (former kafka_error_code -192)

There are occasional exceptions to this 10-second rule, but note all the 652s in the lines above; those are the millisecond fields of the timestamps.

I get MSG_TIMED_OUT as well (on my consumer). I also sometimes manage to receive some actual messages between the ALL_BROKERS_DOWN errors. What could this mean? Is there some periodic heartbeat/handshake that takes place at 10-second intervals between the consumer and the broker/cluster? The consumer is not doing anything every 10 seconds.
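
For reference, the error codes above arrive through the client's error callback, along the lines of the vocfe_kafka_error handler in the log. A minimal sketch of such a handler using confluent-kafka-python's error_cb hook (the function name, group id, and log format here are illustrative, not my actual code):

import logging
from confluent_kafka import Consumer

log = logging.getLogger(__name__)

def on_kafka_error(err):
    # err is a KafkaError: code() is the numeric error (-187 for
    # _ALL_BROKERS_DOWN), name() the symbolic name, str() a description.
    log.debug("kafka error: code %d, name %s (%s)",
              err.code(), err.name(), err.str())

consumer = Consumer({
    'bootstrap.servers': '198.18.130.41',
    'group.id': 'example-group',  # placeholder group id
    'error_cb': on_kafka_error,
})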

@edenhill
Contributor

edenhill commented Aug 4, 2017

It looks like you have issues with your broker.
Enable the "debug": "broker" configuration to get some hints on what's going on, and also look at the broker logs.

@johnrgregg3
Author

With "debug": "broker", I get:

First, some normal-looking init output:

%7|1501855630.618|BROKER|rdkafka#consumer-1| [thrd:app]: 198.18.130.41:9092/bootstrap: Added new broker with NodeId -1
%7|1501855630.618|BRKMAIN|rdkafka#consumer-1| [thrd::0/internal]: :0/internal: Enter main broker thread
%7|1501855630.618|STATE|rdkafka#consumer-1| [thrd::0/internal]: :0/internal: Broker changed state INIT -> UP
%7|1501855630.618|BRKMAIN|rdkafka#consumer-1| [thrd:198.18.130.41:9092/bootstrap]: 198.18.130.41:9092/bootstrap: Enter main broker thread
%7|1501855630.618|CONNECT|rdkafka#consumer-1| [thrd:198.18.130.41:9092/bootstrap]: 198.18.130.41:9092/bootstrap: broker in state INIT connecting
%7|1501855630.618|CONNECT|rdkafka#consumer-1| [thrd:198.18.130.41:9092/bootstrap]: 198.18.130.41:9092/bootstrap: Connecting to ipv4#198.18.130.41:9092 (plaintext) with socket 9
%7|1501855630.618|STATE|rdkafka#consumer-1| [thrd:198.18.130.41:9092/bootstrap]: 198.18.130.41:9092/bootstrap: Broker changed state INIT -> CONNECT
%7|1501855630.628|CONNECT|rdkafka#consumer-1| [thrd:198.18.130.41:9092/bootstrap]: 198.18.130.41:9092/bootstrap: Connected to ipv4#198.18.130.41:9092
%7|1501855630.628|CONNECTED|rdkafka#consumer-1| [thrd:198.18.130.41:9092/bootstrap]: 198.18.130.41:9092/bootstrap: Connected (#1)
%7|1501855630.628|FEATURE|rdkafka#consumer-1| [thrd:198.18.130.41:9092/bootstrap]: 198.18.130.41:9092/bootstrap: Updated enabled protocol features +ApiVersion to ApiVersion
%7|1501855630.628|STATE|rdkafka#consumer-1| [thrd:198.18.130.41:9092/bootstrap]: 198.18.130.41:9092/bootstrap: Broker changed state CONNECT -> APIVERSION_QUERY
%7|1501855630.648|FEATURE|rdkafka#consumer-1| [thrd:198.18.130.41:9092/bootstrap]: 198.18.130.41:9092/bootstrap: Updated enabled protocol features to MsgVer1,ApiVersion,BrokerBalancedConsumer,ThrottleTime,Sasl,SaslHandshake,BrokerGroupCoordinator,LZ4
%7|1501855630.648|STATE|rdkafka#consumer-1| [thrd:198.18.130.41:9092/bootstrap]: 198.18.130.41:9092/bootstrap: Broker changed state APIVERSION_QUERY -> UP
%7|1501855630.669|UPDATE|rdkafka#consumer-1| [thrd:198.18.130.41:9092/bootstrap]: 198.18.130.41:9092/bootstrap: NodeId changed from -1 to 1
%7|1501855630.669|UPDATE|rdkafka#consumer-1| [thrd:198.18.130.41:9092/bootstrap]: 198.18.130.41:9092/1: Name changed from 198.18.130.41:9092/bootstrap to 198.18.130.41:9092/1
%7|1501855630.669|LEADER|rdkafka#consumer-1| [thrd:198.18.130.41:9092/bootstrap]: 198.18.130.41:9092/1: Mapped 0 partition(s) to broker
%7|1501855630.669|STATE|rdkafka#consumer-1| [thrd:198.18.130.41:9092/bootstrap]: 198.18.130.41:9092/1: Broker changed state UP -> UPDATE
%7|1501855630.679|STATE|rdkafka#consumer-1| [thrd:198.18.130.41:9092/bootstrap]: 198.18.130.41:9092/1: Broker changed state UPDATE -> UP

Then for each partition, one of these lines:

%7|1501855640.776|TOPBRK|rdkafka#consumer-1| [thrd:198.18.130.41:9092/bootstrap]: 198.18.130.41:9092/1: Topic empvoc_notifications [0]: joining broker (rktp 0xf4801e30)

But then:

%7|1501855649.619|REQERR|rdkafka#consumer-1| [thrd:main]: 198.18.130.41:9092/1: HeartbeatRequest failed: Broker: Group rebalance in progress: explicit actions 0x0
%7|1501855653.779|REQTMOUT|rdkafka#consumer-1| [thrd:198.18.130.41:9092/bootstrap]: 198.18.130.41:9092/1: Timed out 32+64+0 requests
%7|1501855653.779|BROKERFAIL|rdkafka#consumer-1| [thrd:198.18.130.41:9092/bootstrap]: 198.18.130.41:9092/1: failed: err: Local: Message timed out: (errno: Connection timed out)

Then one of these for each partition:

%7|1501855653.779|REQERR|rdkafka#consumer-1| [thrd:main]: 198.18.130.41:9092/1: OffsetRequest failed: Local: Timed out: explicit actions 0x0

Followed by what looks like a retry:

%7|1501855653.879|CONNECT|rdkafka#consumer-1| [thrd:198.18.130.41:9092/bootstrap]: 198.18.130.41:9092/1: broker in state DOWN connecting
%7|1501855653.879|CONNECT|rdkafka#consumer-1| [thrd:198.18.130.41:9092/bootstrap]: 198.18.130.41:9092/1: Connecting to ipv4#198.18.130.41:9092 (plaintext) with socket 9
%7|1501855653.879|STATE|rdkafka#consumer-1| [thrd:198.18.130.41:9092/bootstrap]: 198.18.130.41:9092/1: Broker changed state DOWN -> CONNECT
%7|1501855653.889|CONNECT|rdkafka#consumer-1| [thrd:198.18.130.41:9092/bootstrap]: 198.18.130.41:9092/1: Connected to ipv4#198.18.130.41:9092
%7|1501855653.889|CONNECTED|rdkafka#consumer-1| [thrd:198.18.130.41:9092/bootstrap]: 198.18.130.41:9092/1: Connected (#2)
%7|1501855653.889|STATE|rdkafka#consumer-1| [thrd:198.18.130.41:9092/bootstrap]: 198.18.130.41:9092/1: Broker changed state CONNECT -> APIVERSION_QUERY
%7|1501855653.909|STATE|rdkafka#consumer-1| [thrd:198.18.130.41:9092/bootstrap]: 198.18.130.41:9092/1: Broker changed state APIVERSION_QUERY -> UP

Once again, I'm confident that there are no connectivity issues between the broker and consumer, and the broker is up and running smoothly.

@edenhill
Contributor

edenhill commented Aug 7, 2017

Please provide:

  • confluent-kafka-python version
  • librdkafka version
  • broker version
  • client configuration

Thanks

@johnrgregg3
Author

confluent_kafka.version ('0.9.2', 590080), confluent_kafka.libversion ('0.9.5', 591359)
NOTE: this is really confluent_kafka version 0.9.4 - it misreports itself as 0.9.2.
Broker version is Kafka 0.10.0.0.
Client config:
{'api.version.request': True,
 'error_cb': <function vocbe_kafka_error at 0xf5c6f374>,
 'client.id': 'Master_BE_198.18.130.41',
 'message.send.max.retries': 3,
 'bootstrap.servers': '198.18.130.41',
 'statistics.interval.ms': 3000,
 'socket.max.fails': 1,
 'stats_cb': <function kafka_stats_cb at 0xf5c6f33c>,
 'default.topic.config': {'compression.codec': u'gzip', 'acks': 1, 'message.timeout.ms': 300000},
 'partition.assignment.strategy': 'roundrobin',
 'socket.timeout.ms': 3000}

@edenhill
Contributor

edenhill commented Aug 9, 2017

I wonder if this is an issue with socket.timeout.ms being smaller than session.timeout.ms.
Try increasing socket.timeout.ms so that it is greater than session.timeout.ms (default 30000).
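
A minimal sketch of that change against the config above (other keys omitted; 40000 is just an example value, not a prescribed one):

conf = {
    'bootstrap.servers': '198.18.130.41',
    'session.timeout.ms': 30000,  # librdkafka default, shown for clarity
    'socket.timeout.ms': 40000,   # was 3000; now greater than session.timeout.ms
}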

@johnrgregg3
Author

Well, well, well. That seems to have done the trick. Many thanks. But why? I think I decreased socket.timeout.ms to 3000 a long time ago because I didn't want such a big delay before I found out about some true network connectivity problem. My app wanted to notice sooner than 30 seconds. Should I decrease the session.timeout.ms instead of increasing socket.timeout.ms? Maybe they should be the same, somewhere in the middle, like 10 or 15 seconds? Intuitively, it seems to me that it is unlikely that you would ever have a "legitimate" network delay of more than a few seconds. I just made them both 30000 and it fixed my problem. Is it OK if they are equal? Here is what CONFIGURATION.md says:

socket.timeout.ms: Timeout for network requests.
session.timeout.ms: Client group session and failure detection timeout.

But this doesn't make it clear why I should make socket.timeout.ms > session.timeout.ms.

@edenhill
Contributor

edenhill commented Aug 9, 2017

The JoinGroupRequest can block for up to session.timeout.ms on the broker, so if the client uses a lower timeout than that when waiting for the response, this can happen.

Now, this isn't exactly what is happening here, because the client knows that the JoinGroupRequest can take up to session.timeout.ms (plus some grace period on top of that), but any other requests in flight behind it will be head-of-line blocked and not processed by the broker until the JoinGroupRequest is done, so it is likely some other request that is actually timing out, such as a Metadata request, a Fetch, etc.

The workaround is simple enough: keep socket.timeout.ms above session.timeout.ms.

And I would advise against a low socket timeout anyway: a TCP connection may stall for more than a couple of seconds if either of the connected systems is swapping, garbage collecting, or hitting an I/O peak, and if there is a brief network outage or re-routing, it may take some time for TCP to recover by retransmitting lost packets.

@johnrgregg3
Author

So definitely, socket.timeout.ms greater than session.timeout.ms, not merely equal?

@edenhill
Contributor

Yes, I would recommend session.timeout.ms + 10s or more.
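
Expressed as configuration, that rule of thumb would look roughly like this (values are illustrative):

session_timeout_ms = 30000

conf = {
    'session.timeout.ms': session_timeout_ms,
    # keep the request timeout at least 10 seconds above the group
    # session timeout so a blocking JoinGroupRequest never trips it
    'socket.timeout.ms': session_timeout_ms + 10000,
}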

@edenhill edenhill closed this as completed Sep 1, 2017