
Why might I get ALL_BROKERS_DOWN at 10 second intervals on a consumer? #227

Closed

johnrgregg3 opened this issue Aug 3, 2017 · 9 comments

@johnrgregg3

I am running a consumer in a test lab, and the producer is on the same machine as my single Kafka broker. That is, it is a single-node cluster, with ZooKeeper running on the same machine as well. The consumer is on the same network, and I am sure there are no connectivity problems between the consumer and the broker. The consumer gets ALL_BROKERS_DOWN a lot, either exactly 10, 20, or 30 seconds after the previous occurrence, and when I say exactly, I mean to within about a thousandth of a second. Here is an excerpt from my log file:

[2017-08-03 20:24:12,652] [18018] [vocfe_notify_server.py:634] [DEBUG] vocfe_kafka_error(198.18.130.41): code -187, name _ALL_BROKERS_DOWN (former kafka_error_code -192)
[2017-08-03 20:24:22,652] [18018] [vocfe_notify_server.py:634] [DEBUG] vocfe_kafka_error(198.18.130.41): code -187, name _ALL_BROKERS_DOWN (former kafka_error_code -192)
[2017-08-03 20:24:42,652] [18018] [vocfe_notify_server.py:634] [DEBUG] vocfe_kafka_error(198.18.130.41): code -187, name _ALL_BROKERS_DOWN (former kafka_error_code -192)
[2017-08-03 20:24:52,652] [18018] [vocfe_notify_server.py:634] [DEBUG] vocfe_kafka_error(198.18.130.41): code -187, name _ALL_BROKERS_DOWN (former kafka_error_code -192)
[2017-08-03 20:25:02,652] [18018] [vocfe_notify_server.py:634] [DEBUG] vocfe_kafka_error(198.18.130.41): code -187, name _ALL_BROKERS_DOWN (former kafka_error_code -192)
[2017-08-03 20:25:22,652] [18018] [vocfe_notify_server.py:634] [DEBUG] vocfe_kafka_error(198.18.130.41): code -187, name _ALL_BROKERS_DOWN (former kafka_error_code -192)
[2017-08-03 20:25:32,652] [18018] [vocfe_notify_server.py:634] [DEBUG] vocfe_kafka_error(198.18.130.41): code -187, name _ALL_BROKERS_DOWN (former kafka_error_code -192)

There are occasional exceptions to this 10-second rule, but note all the 652s in the lines above; those are the millisecond fields of the timestamps.

I get MSG_TIMED_OUT as well (on my consumer). I also sometimes manage to receive some actual messages between the ALL_BROKERS_DOWN errors. What could this mean? Is there some periodic heartbeat/handshake that takes place at 10-second intervals between the consumer and the broker/cluster? The consumer is not doing anything every 10 seconds.
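
For reference, the error codes above arrive through the client's error callback, along the lines of the vocfe_kafka_error handler in the log. A minimal sketch of such a handler using confluent-kafka-python's error_cb hook (the function name, group id, and log format here are illustrative, not my actual code):

import logging
from confluent_kafka import Consumer

log = logging.getLogger(__name__)

def on_kafka_error(err):
    # err is a KafkaError: code() is the numeric error (-187 for
    # _ALL_BROKERS_DOWN), name() the symbolic name, str() a description.
    log.debug("kafka error: code %d, name %s (%s)",
              err.code(), err.name(), err.str())

consumer = Consumer({
    'bootstrap.servers': '198.18.130.41',
    'group.id': 'example-group',  # placeholder group id
    'error_cb': on_kafka_error,
})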

@edenhill
Contributor

edenhill commented Aug 4, 2017

It looks like you have issues with your broker.
Enable the "debug": "broker" configuration to get some hints on what's going on, and also look at the broker logs.

@johnrgregg3
Author

With "debug": "broker", I get:

First, some normal-looking init output:

%7|1501855630.618|BROKER|rdkafka#consumer-1| [thrd:app]: 198.18.130.41:9092/bootstrap: Added new broker with NodeId -1
%7|1501855630.618|BRKMAIN|rdkafka#consumer-1| [thrd::0/internal]: :0/internal: Enter main broker thread
%7|1501855630.618|STATE|rdkafka#consumer-1| [thrd::0/internal]: :0/internal: Broker changed state INIT -> UP
%7|1501855630.618|BRKMAIN|rdkafka#consumer-1| [thrd:198.18.130.41:9092/bootstrap]: 198.18.130.41:9092/bootstrap: Enter main broker thread
%7|1501855630.618|CONNECT|rdkafka#consumer-1| [thrd:198.18.130.41:9092/bootstrap]: 198.18.130.41:9092/bootstrap: broker in state INIT connecting
%7|1501855630.618|CONNECT|rdkafka#consumer-1| [thrd:198.18.130.41:9092/bootstrap]: 198.18.130.41:9092/bootstrap: Connecting to ipv4#198.18.130.41:9092 (plaintext) with socket 9
%7|1501855630.618|STATE|rdkafka#consumer-1| [thrd:198.18.130.41:9092/bootstrap]: 198.18.130.41:9092/bootstrap: Broker changed state INIT -> CONNECT
%7|1501855630.628|CONNECT|rdkafka#consumer-1| [thrd:198.18.130.41:9092/bootstrap]: 198.18.130.41:9092/bootstrap: Connected to ipv4#198.18.130.41:9092
%7|1501855630.628|CONNECTED|rdkafka#consumer-1| [thrd:198.18.130.41:9092/bootstrap]: 198.18.130.41:9092/bootstrap: Connected (#1)
%7|1501855630.628|FEATURE|rdkafka#consumer-1| [thrd:198.18.130.41:9092/bootstrap]: 198.18.130.41:9092/bootstrap: Updated enabled protocol features +ApiVersion to ApiVersion
%7|1501855630.628|STATE|rdkafka#consumer-1| [thrd:198.18.130.41:9092/bootstrap]: 198.18.130.41:9092/bootstrap: Broker changed state CONNECT -> APIVERSION_QUERY
%7|1501855630.648|FEATURE|rdkafka#consumer-1| [thrd:198.18.130.41:9092/bootstrap]: 198.18.130.41:9092/bootstrap: Updated enabled protocol features to MsgVer1,ApiVersion,BrokerBalancedConsumer,ThrottleTime,Sasl,SaslHandshake,BrokerGroupCoordinator,LZ4
%7|1501855630.648|STATE|rdkafka#consumer-1| [thrd:198.18.130.41:9092/bootstrap]: 198.18.130.41:9092/bootstrap: Broker changed state APIVERSION_QUERY -> UP
%7|1501855630.669|UPDATE|rdkafka#consumer-1| [thrd:198.18.130.41:9092/bootstrap]: 198.18.130.41:9092/bootstrap: NodeId changed from -1 to 1
%7|1501855630.669|UPDATE|rdkafka#consumer-1| [thrd:198.18.130.41:9092/bootstrap]: 198.18.130.41:9092/1: Name changed from 198.18.130.41:9092/bootstrap to 198.18.130.41:9092/1
%7|1501855630.669|LEADER|rdkafka#consumer-1| [thrd:198.18.130.41:9092/bootstrap]: 198.18.130.41:9092/1: Mapped 0 partition(s) to broker
%7|1501855630.669|STATE|rdkafka#consumer-1| [thrd:198.18.130.41:9092/bootstrap]: 198.18.130.41:9092/1: Broker changed state UP -> UPDATE
%7|1501855630.679|STATE|rdkafka#consumer-1| [thrd:198.18.130.41:9092/bootstrap]: 198.18.130.41:9092/1: Broker changed state UPDATE -> UP

Then for each partition, one of these lines:

%7|1501855640.776|TOPBRK|rdkafka#consumer-1| [thrd:198.18.130.41:9092/bootstrap]: 198.18.130.41:9092/1: Topic empvoc_notifications [0]: joining broker (rktp 0xf4801e30)

But then:

%7|1501855649.619|REQERR|rdkafka#consumer-1| [thrd:main]: 198.18.130.41:9092/1: HeartbeatRequest failed: Broker: Group rebalance in progress: explicit actions 0x0
%7|1501855653.779|REQTMOUT|rdkafka#consumer-1| [thrd:198.18.130.41:9092/bootstrap]: 198.18.130.41:9092/1: Timed out 32+64+0 requests
%7|1501855653.779|BROKERFAIL|rdkafka#consumer-1| [thrd:198.18.130.41:9092/bootstrap]: 198.18.130.41:9092/1: failed: err: Local: Message timed out: (errno: Connection timed out)

Then one of these for each partition:

%7|1501855653.779|REQERR|rdkafka#consumer-1| [thrd:main]: 198.18.130.41:9092/1: OffsetRequest failed: Local: Timed out: explicit actions 0x0

Followed by what looks like a retry:

%7|1501855653.879|CONNECT|rdkafka#consumer-1| [thrd:198.18.130.41:9092/bootstrap]: 198.18.130.41:9092/1: broker in state DOWN connecting
%7|1501855653.879|CONNECT|rdkafka#consumer-1| [thrd:198.18.130.41:9092/bootstrap]: 198.18.130.41:9092/1: Connecting to ipv4#198.18.130.41:9092 (plaintext) with socket 9
%7|1501855653.879|STATE|rdkafka#consumer-1| [thrd:198.18.130.41:9092/bootstrap]: 198.18.130.41:9092/1: Broker changed state DOWN -> CONNECT
%7|1501855653.889|CONNECT|rdkafka#consumer-1| [thrd:198.18.130.41:9092/bootstrap]: 198.18.130.41:9092/1: Connected to ipv4#198.18.130.41:9092
%7|1501855653.889|CONNECTED|rdkafka#consumer-1| [thrd:198.18.130.41:9092/bootstrap]: 198.18.130.41:9092/1: Connected (#2)
%7|1501855653.889|STATE|rdkafka#consumer-1| [thrd:198.18.130.41:9092/bootstrap]: 198.18.130.41:9092/1: Broker changed state CONNECT -> APIVERSION_QUERY
%7|1501855653.909|STATE|rdkafka#consumer-1| [thrd:198.18.130.41:9092/bootstrap]: 198.18.130.41:9092/1: Broker changed state APIVERSION_QUERY -> UP

Once again, I'm confident that there are no connectivity issues between the broker and consumer, and the broker is up and running smoothly.

@edenhill
Contributor

edenhill commented Aug 7, 2017

Please provide:

  • confluent-kafka-python version
  • librdkafka version
  • broker version
  • client configuration

Thanks

@johnrgregg3
Author

confluent_kafka.version ('0.9.2', 590080), confluent_kafka.libversion ('0.9.5', 591359)
NOTE: this is really confluent_kafka version 0.9.4 - it misreports itself as 0.9.2.
Broker version is Kafka 0.10.0.0.
Client config:
{'api.version.request': True,
 'error_cb': <function vocbe_kafka_error at 0xf5c6f374>,
 'client.id': 'Master_BE_198.18.130.41',
 'message.send.max.retries': 3,
 'bootstrap.servers': '198.18.130.41',
 'statistics.interval.ms': 3000,
 'socket.max.fails': 1,
 'stats_cb': <function kafka_stats_cb at 0xf5c6f33c>,
 'default.topic.config': {'compression.codec': u'gzip', 'acks': 1, 'message.timeout.ms': 300000},
 'partition.assignment.strategy': 'roundrobin',
 'socket.timeout.ms': 3000}

@edenhill
Contributor

edenhill commented Aug 9, 2017

I wonder if this is an issue with socket.timeout.ms being smaller than session.timeout.ms.
Try increasing socket.timeout.ms so that it is greater than session.timeout.ms (default 30000).
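
A minimal sketch of that change against the config above (other keys omitted; 40000 is just an example value, not a prescribed one):

conf = {
    'bootstrap.servers': '198.18.130.41',
    'session.timeout.ms': 30000,  # librdkafka default, shown for clarity
    'socket.timeout.ms': 40000,   # was 3000; now greater than session.timeout.ms
}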

@johnrgregg3
Author

Well, well, well. That seems to have done the trick. Many thanks. But why? I think I decreased socket.timeout.ms to 3000 a long time ago because I didn't want such a big delay before I found out about some true network connectivity problem. My app wanted to notice sooner than 30 seconds. Should I decrease the session.timeout.ms instead of increasing socket.timeout.ms? Maybe they should be the same, somewhere in the middle, like 10 or 15 seconds? Intuitively, it seems to me that it is unlikely that you would ever have a "legitimate" network delay of more than a few seconds. I just made them both 30000 and it fixed my problem. Is it OK if they are equal? Here is what CONFIGURATION.md says:

socket.timeout.ms: Timeout for network requests.
session.timeout.ms: Client group session and failure detection timeout.

But this doesn't make it clear why I should make socket.timeout.ms > session.timeout.ms.

@edenhill
Contributor

edenhill commented Aug 9, 2017

The JoinGroupRequest can block for up to session.timeout.ms on the broker, so if the client uses a lower timeout than that when waiting for the response, this can happen.

Now, this isn't exactly what is happening here, because the client knows that the JoinGroupRequest can take up to session.timeout.ms (plus some grace period on top of that), but any other requests in flight behind it will be head-of-line blocked and not processed by the broker until the JoinGroupRequest is done, so it is likely some other request that is actually timing out, such as a Metadata request, a Fetch, etc.

The workaround is simple enough: keep socket.timeout.ms above session.timeout.ms.

And I would advise against a low socket timeout anyway: a TCP connection may stall for more than a couple of seconds if either of the connected systems is swapping, garbage collecting, or hitting an I/O peak, and if there is a brief network outage or re-routing, it may take some time for TCP to recover by retransmitting lost packets.

@johnrgregg3
Author

So definitely, socket.timeout.ms greater than session.timeout.ms, not merely equal?

@edenhill
Contributor

Yes, I would recommend session.timeout.ms + 10s or more.
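
Expressed as configuration, that rule of thumb would look roughly like this (values are illustrative):

session_timeout_ms = 30000

conf = {
    'session.timeout.ms': session_timeout_ms,
    # keep the request timeout at least 10 seconds above the group
    # session timeout so a blocking JoinGroupRequest never trips it
    'socket.timeout.ms': session_timeout_ms + 10000,
}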

@edenhill edenhill closed this as completed Sep 1, 2017