
Intermittent FailedToRebalanceConsumerError: Exception: NODE_EXISTS[-110] - when consumer starts #234

Closed
BadLambdaJamma opened this issue Aug 19, 2015 · 18 comments

Comments

@BadLambdaJamma

stack trace:

FailedToRebalanceConsumerError: Exception: NODE_EXISTS[-110]
    at new FailedToRebalanceConsumerError (/Users/jonathannewell/workspace/kafkacop/node_modules/kafka-node/lib/errors/FailedToRebalanceConsumerError.js:11:11)
    at /Users/jonathannewell/workspace/kafkacop/node_modules/kafka-node/lib/highLevelConsumer.js:167:51
    at /Users/jonathannewell/workspace/kafkacop/node_modules/kafka-node/lib/highLevelConsumer.js:413:17
    at /Users/jonathannewell/workspace/kafkacop/node_modules/kafka-node/node_modules/async/lib/async.js:251:17
    at /Users/jonathannewell/workspace/kafkacop/node_modules/kafka-node/node_modules/async/lib/async.js:148:21
    at /Users/jonathannewell/workspace/kafkacop/node_modules/kafka-node/node_modules/async/lib/async.js:248:21
    at /Users/jonathannewell/workspace/kafkacop/node_modules/kafka-node/node_modules/async/lib/async.js:612:34
    at /Users/jonathannewell/workspace/kafkacop/node_modules/kafka-node/lib/highLevelConsumer.js:393:29
    at /Users/jonathannewell/workspace/kafkacop/node_modules/kafka-node/node_modules/async/lib/async.js:148:21
    at /Users/jonathannewell/workspace/kafkacop/node_modules/kafka-node/lib/highLevelConsumer.js:383:41
    at /Users/jonathannewell/workspace/kafkacop/node_modules/kafka-node/node_modules/async/lib/async.js:251:17
    at /Users/jonathannewell/workspace/kafkacop/node_modules/kafka-node/node_modules/async/lib/async.js:148:21
    at /Users/jonathannewell/workspace/kafkacop/node_modules/kafka-node/node_modules/async/lib/async.js:248:21
    at /Users/jonathannewell/workspace/kafkacop/node_modules/kafka-node/node_modules/async/lib/async.js:612:34
    at /Users/jonathannewell/workspace/kafkacop/node_modules/kafka-node/lib/highLevelConsumer.js:376:49
    at /Users/jonathannewell/workspace/kafkacop/node_modules/kafka-node/lib/zookeeper.js:362:22

I have reproduced this issue locally on my OS X 3-broker cluster as well as on our 3-broker cluster in AWS (Ubuntu based). One consumer will always start. Intermittently, when any subsequent consumer starts (e.g. consumer 2 or 3), it will receive this error. If an error is NOT encountered when consumers 2 and 3 start (this is an intermittent issue, so sometimes they start properly), all 3 consumers will process tens of millions of messages without any errors. I can stop and start them with no problems. Getting all 3 consumers to start on a consistent basis is the issue.

setup (using kafka-node 0.2.27)

  1. Using consumer groups - group name DEVNULLCONSUMER
  2. Client ID = 'DEVNULLCONSUMER' + shortid.generate(); (consumer setup sketched after this list)
  3. 1 topic, 3 partitions, 3 consumers, 3 brokers, 1 zookeeper (OS X config, single machine)
  4. 1 topic, 3 partitions, 3 consumers, 3 brokers, 3 zookeepers w/ executor (Ubuntu - 3 nodes)
  5. Zookeeper zoo.cfg - maxSessionTimeout=5000 (based on previous issue discussions for this error)
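
Roughly how each consumer is created (a minimal sketch; the ZooKeeper connection string, topic name, and handlers are placeholders, not the exact application code):

var kafka = require('kafka-node');
var shortid = require('shortid');

// Placeholder ZooKeeper connection string - adjust for your cluster.
var client = new kafka.Client('localhost:2181/', 'DEVNULLCONSUMER' + shortid.generate());

// One HighLevelConsumer per process, all joining the same consumer group.
var consumer = new kafka.HighLevelConsumer(
  client,
  [{ topic: 'my-topic' }],            // hypothetical topic name
  { groupId: 'DEVNULLCONSUMER' }
);

consumer.on('message', function (message) {
  // normal message processing
});

consumer.on('error', function (err) {
  // the intermittent FailedToRebalanceConsumerError: NODE_EXISTS[-110] surfaces here
  console.error(err);
});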

Suspect: highLevelConsumer.js, line #150

// Nasty hack to retry 3 times to re-balance - TBD fix this
var oldTopicPayloads = self.topicPayloads;
var operation = retry.operation({
  retries: 10,
  factor: 2,
  minTimeout: 1 * 100,
  maxTimeout: 1 * 1000,
  randomize: true
});
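
For context, the retry module drives an operation like this (a generic sketch of the retry API's attempt/retry loop, not kafka-node's exact rebalance code; rebalance and callback are hypothetical):

var retry = require('retry');

var operation = retry.operation({
  retries: 10,
  factor: 2,
  minTimeout: 1 * 100,
  maxTimeout: 1 * 1000,
  randomize: true
});

operation.attempt(function (currentAttempt) {
  rebalance(function (err) {          // hypothetical rebalance step
    if (operation.retry(err)) {
      return;                         // schedules another attempt with backoff
    }
    // after the retries are exhausted, surface the last error
    callback(err ? operation.mainError() : null);
  });
});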

What would the proper way to "unhack" the above code be? Happy to dig deeper and do a PR if I can help.

Thanks
-Jonathan

@BadLambdaJamma
Author

After closer inspection, it appears this is not an easy fix, as it is systemic to how consumers rebalance. But as a workaround, if my consumers exit upon error and then restart, and don't initially start in close proximity to one another, I get proper operation from the Kafka cluster. In some cases it is not the consumer I am starting that gets the error; instead, one of the other running consumers errors as it rebalances in the same timeframe as the joining consumer and the other existing consumers. Again, an auto-restart approach cleans this up with a restart that eventually does not fail. This can be cyclic, with several consumers exiting and restarting before they can successfully rebalance.

My consumer restarts are via upstart on Ubuntu. It seems hackish that a "consumer supervisor" (in this case upstart) would be needed for every implementation. I would like, instead, to code the supervisor directly into kafka-node. I have not reasoned about what that might look like, but I want opinions on the idea.
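
A minimal sketch of the exit-on-error workaround, assuming upstart (or pm2, etc.) is configured to respawn the process; the error-name check is illustrative, not an official kafka-node contract:

consumer.on('error', function (err) {
  if (err && err.name === 'FailedToRebalanceConsumerError') {
    // Let the process supervisor (upstart here) restart us; a later attempt
    // usually rebalances cleanly once the other consumers have settled.
    console.error('rebalance failed, exiting so the supervisor restarts us', err);
    process.exit(1);
  }
});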

@BadLambdaJamma
Author

I have validated the above behavior in our production infrastructure (Amazon AWS, 3 Zookeepers, 3 Kafka Brokers). Just keep restarting the consumers. Eventually they will balance. This gave me the sadz.

@mustafamamun

Having the same issue. It can be solved by restarting the agent. However, is there a more elegant way?

@felipesabino

Isn't it a duplicate of #90?

@BadLambdaJamma
Author

It may be a duplicate of issue #90, but that reporter did not provide enough information for me to draw a strict correlation to that reported bug. @mustafamamun: I do believe it could be solved in a more elegant way with the kafka-node client as it stands today (0.2.27). You probably want to listen for all possible events, then re-establish the clients and consumers/producers. This would not prevent a rebalance error; it would provide wrapping code that gracefully keeps retrying until the rebalance succeeds.

The main theme is this: this is not a bug, it is a systemic behavior of Kafka and the high-level consumer implementation (AKA the split-brain problem). The rebalance code retries the operation 10 times over a one-second period. This is exactly one half the time interval for the ZooKeeper heartbeat (by default). It is intended to fail and be retried on the next heartbeat. Remember: high-level consumers must all agree on partition assignments within the context of a single ZooKeeper heartbeat.

Also this: start using a low-level consumer, track offsets yourself, and manage partitions yourself (a lot harder than it sounds), and this problem won't exist. The high-level consumer is bad-ass; it just has some prickly edges to be worked around.

Here are all the events you can code for:

HighLevelConsumer.on('rebalancing', ...)
client.on('ready', ...)
client.on('brokersChanged', ...)
client.on('close', ...)
consumer.on('error', ...)
consumer.on('offsetOutOfRange', ...)
consumer.on('message', ...)
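
A rough sketch of that wrapping approach (purely illustrative; createConsumer, the topic name, and the 5-second backoff are assumptions, not part of kafka-node):

var kafka = require('kafka-node');

function createConsumer() {
  var client = new kafka.Client('localhost:2181/', 'DEVNULLCONSUMER' + Date.now());
  var consumer = new kafka.HighLevelConsumer(
    client,
    [{ topic: 'my-topic' }],                  // hypothetical topic
    { groupId: 'DEVNULLCONSUMER' }
  );

  consumer.on('message', function (message) {
    // normal processing
  });

  consumer.on('offsetOutOfRange', function (err) {
    console.error('offsetOutOfRange', err);
  });

  consumer.on('error', function (err) {
    // NODE_EXISTS / FailedToRebalanceConsumerError lands here; tear down and retry
    console.error('consumer error, recreating', err);
    consumer.close(function () {
      client.close(function () {
        setTimeout(createConsumer, 5000);     // crude backoff before rejoining the group
      });
    });
  });
}

createConsumer();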

@BadLambdaJamma
Author

I forgot these two events (probably more), but you get the idea: plenty of events to wag the dog with. Maybe I'll take a shot at a HighLevelResilientConsumer that keeps rebalancing until the cows come home.

HighLevelConsumer.on('registered', ...)
client.on('error', ...)

Also don't forget:

  1. You almost always get 1-N "node exists errors" during a rebalance.
  2. Failed rebalances can tear down existing consumers (via error) but allow the new consumer to start.

@bendpx

bendpx commented Sep 7, 2015

+1

@jezzalaycock
Contributor

Hopefully, when a move is made to Kafka 0.9.0, the split-brain problem will go away - see https://cwiki.apache.org/confluence/display/KAFKA/Consumer+Client+Re-Design - though there will be some code to write.

Using a tool such as pm2 to restart consumers automatically takes all the heavy lifting out of failing to rebalance correctly. We have been using this along with https://www.npmjs.com/package/kafka-node-manager on our production system with no major issues. Happy to share the config setup if you would like.
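
Not the exact setup referenced above, but a minimal pm2 process file for auto-restarting consumers might look roughly like this (app name, script path, and instance count are placeholders):

{
  "apps": [
    {
      "name": "devnull-consumer",
      "script": "./consumer.js",
      "instances": 3,
      "exec_mode": "fork"
    }
  ]
}

Started with pm2 start processes.json, pm2 launches the consumer processes and respawns any that exit after a failed rebalance (restarting on crash is pm2's default behavior).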

@ersumitchh

@jezzalaycock Could you please share your code.

@BadLambdaJamma
Author

I hope the Confluent team leaves the split brain in place even when delivering 0.9. It seems to be the fastest pattern for processing with Kafka. I have heard some of the suggestions for 0.9 about bringing more of the "brain" back into the Kafka cluster and out of the consumers, and I understand. I just want backward support for full consumer rebalancing, with all the power and prickly edges it may bring. Considering these factors, I tip my hat to the kafka-node contributors/team. Most Kafka specifications for consumers and producers are axiomatically multithreaded; as such, coding up a client in Node.js I am sure gave many a grey hair. The good news for me: this stuff all works in AWS production under load. And it does not just work, it screams! Good work, guys!

@BadLambdaJamma
Author

PS: Posting my first open-source Node.js module this weekend: SimplePartitioner, a Kafka message partitioner.

It supports:

  1. Real round robin
  2. Relational follower

It also supports a pluggable backing provider for relational-follower configurations, and a message graph n messages deep. The tip of the graph is round robin and all subsequent entities in the graph are relational followers. It supports "sticky round robin" like the Kafka default, but not sticky for so long... (a generic sketch of the idea is below).
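
As a generic illustration of those two modes (this is not the SimplePartitioner API, which isn't published yet; function and variable names are made up):

// Generic round-robin partitioner sketch - not the SimplePartitioner API.
function roundRobinPartitioner(partitionCount) {
  var next = 0;
  return function pick() {
    var p = next;
    next = (next + 1) % partitionCount;
    return p;
  };
}

// "Relational follower" sketch: messages sharing a key follow whatever
// partition the tip of the graph (round robin) assigned first.
function followerPartitioner(pickPartition) {
  var assignments = {};                  // key -> partition
  return function pick(key) {
    if (!(key in assignments)) {
      assignments[key] = pickPartition();
    }
    return assignments[key];
  };
}

// usage: tip of the graph is round robin, followers ride along by key
var tip = roundRobinPartitioner(3);
var follower = followerPartitioner(tip);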

@BadLambdaJamma
Author

If you're interested in a specific feature that you need for production that I have not listed, please hit me up. Happy coding!

@mindfireon812

I am facing this issue. Is there any update on it?

@mindfireon812

For this issue: whenever your client's/consumer's close or error event gets called, just restart them again.

@acoulon99

+1

@BadLambdaJamma
Author

A word of caution about the "just restart the consumer" approach that I and others have mentioned: if your consumer count reaches a high enough number, or you have significant network latency, "just restart" will fail to work. The "just restart" methodology works for me in production AWS when I have 3 consumers. If I crank that number up to 10 consumers, I start to see the inevitable "rebalance thrash". If you keep increasing the consumer count, you eventually reach "endless rebalance": your consumers are always failing the rebalance. Keep this in mind when you let your consumers "just restart":

  1. Consumer restarts - when a consumer restarts, all other consumers rebalance. If they fail, they restart and the problem compounds.
  2. Data flow - all data flow stops during a rebalance. Rebalance takes 5 minutes? No data for 5 minutes for you!
  3. A rebalance can take a long time - 5 minutes, 10 minutes, or never.

Final word: this NPM library is not suitable for high-volume production work when using the high-level consumer. You must use the low-level consumer and avoid rebalancing altogether; a minimal sketch of that approach follows. It is also suggested that you switch to Kafka 0.9, where the broker cluster manages the partition assignments rather than the "split brain" of the consumers.
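
A sketch of the low-level approach with manual offset management (the topic, partition assignment, and connection string are placeholders; in a real deployment you decide which process owns which partitions yourself):

var kafka = require('kafka-node');

var groupId = 'DEVNULLCONSUMER';
var client = new kafka.Client('localhost:2181/', 'lowlevel-consumer-1');
var offset = new kafka.Offset(client);

// Plain Consumer: this process explicitly owns partition 0 - no rebalance, ever.
var consumer = new kafka.Consumer(
  client,
  [{ topic: 'my-topic', partition: 0, offset: 0 }],    // hypothetical topic/partition
  { groupId: groupId, autoCommit: false, fromOffset: true }
);

consumer.on('message', function (message) {
  // process the message, then record the offset yourself
  offset.commit(groupId, [{
    topic: message.topic,
    partition: message.partition,
    offset: message.offset
  }], function (err) {
    if (err) console.error('offset commit failed', err);
  });
});

consumer.on('offsetOutOfRange', function (topicPartition) {
  // fall back to the earliest available offset and resume from there
  topicPartition.maxNum = 1;
  topicPartition.time = -2;                             // -2 = earliest, -1 = latest
  offset.fetch([topicPartition], function (err, data) {
    if (err) return console.error(err);
    var earliest = data[topicPartition.topic][topicPartition.partition][0];
    consumer.setOffset(topicPartition.topic, topicPartition.partition, earliest);
  });
});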

@panlilu

panlilu commented May 31, 2016

+1

@BadLambdaJamma
Author

The rebalance misery continues in Kafka 0.9 and an entirely new Kafka client (not kafka-node). With the standard default ZooKeeper epoch at 2 seconds and a high consumer count, rebalances still fail (never end) with Kafka 0.9 broker clusters. As such, the guidance is still the same: use the low-level consumer EVEN with Kafka 0.9. I have come to the conclusion that the method by which rebalances are coordinated (essentially ZooKeeper eventing) is a failed software pattern for this type of auto-scaling. My efforts will now be shifting from this client lib to another that supports Kafka 0.9. The intent is to solve the rebalance problem in Kafka 0.9 (conceptually and with various language bindings).
