
Intermittent FailedToRebalanceConsumerError: Exception: NODE_EXISTS[-110] - when consumer starts #234

Closed
BadLambdaJamma opened this issue Aug 19, 2015 · 18 comments

Comments

@BadLambdaJamma

stack trace:

FailedToRebalanceConsumerError: Exception: NODE_EXISTS[-110]
    at new FailedToRebalanceConsumerError (/Users/jonathannewell/workspace/kafkacop/node_modules/kafka-node/lib/errors/FailedToRebalanceConsumerError.js:11:11)
    at /Users/jonathannewell/workspace/kafkacop/node_modules/kafka-node/lib/highLevelConsumer.js:167:51
    at /Users/jonathannewell/workspace/kafkacop/node_modules/kafka-node/lib/highLevelConsumer.js:413:17
    at /Users/jonathannewell/workspace/kafkacop/node_modules/kafka-node/node_modules/async/lib/async.js:251:17
    at /Users/jonathannewell/workspace/kafkacop/node_modules/kafka-node/node_modules/async/lib/async.js:148:21
    at /Users/jonathannewell/workspace/kafkacop/node_modules/kafka-node/node_modules/async/lib/async.js:248:21
    at /Users/jonathannewell/workspace/kafkacop/node_modules/kafka-node/node_modules/async/lib/async.js:612:34
    at /Users/jonathannewell/workspace/kafkacop/node_modules/kafka-node/lib/highLevelConsumer.js:393:29
    at /Users/jonathannewell/workspace/kafkacop/node_modules/kafka-node/node_modules/async/lib/async.js:148:21
    at /Users/jonathannewell/workspace/kafkacop/node_modules/kafka-node/lib/highLevelConsumer.js:383:41
    at /Users/jonathannewell/workspace/kafkacop/node_modules/kafka-node/node_modules/async/lib/async.js:251:17
    at /Users/jonathannewell/workspace/kafkacop/node_modules/kafka-node/node_modules/async/lib/async.js:148:21
    at /Users/jonathannewell/workspace/kafkacop/node_modules/kafka-node/node_modules/async/lib/async.js:248:21
    at /Users/jonathannewell/workspace/kafkacop/node_modules/kafka-node/node_modules/async/lib/async.js:612:34
    at /Users/jonathannewell/workspace/kafkacop/node_modules/kafka-node/lib/highLevelConsumer.js:376:49
    at /Users/jonathannewell/workspace/kafkacop/node_modules/kafka-node/lib/zookeeper.js:362:22

I have reproduced this issue locally on my OS X 3-broker cluster as well as on our 3-broker cluster in AWS (Ubuntu based). One consumer will always start. Intermittently, when any subsequent consumer starts (e.g. consumer 2 or 3), it will receive this error. If an error is NOT encountered when consumers 2 and 3 start (this is an intermittent issue, so sometimes they start properly), all 3 consumers will process tens of millions of messages without any errors. I can stop and start them with no problems. Getting all 3 consumers to start on a consistent basis is the issue.

setup (using kafka-node 0.2.27)

  1. Using consumer groups - group name DEVNULLCONSUMER
  2. Client ID = 'DEVNULLCONSUMER' + shortid.generate(); (consumer setup sketched after this list)
  3. 1 topic, 3 partitions, 3 consumers, 3 brokers, 1 zookeeper (OS X config, single machine)
  4. 1 topic, 3 partitions, 3 consumers, 3 brokers, 3 zookeepers w/ executor (Ubuntu - 3 nodes)
  5. Zookeeper zoo.cfg - maxSessionTimeout=5000 (based on previous issue discussions for this error)
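
Roughly how each consumer is created (a minimal sketch; the ZooKeeper connection string, topic name, and handlers are placeholders, not the exact application code):

var kafka = require('kafka-node');
var shortid = require('shortid');

// Placeholder ZooKeeper connection string - adjust for your cluster.
var client = new kafka.Client('localhost:2181/', 'DEVNULLCONSUMER' + shortid.generate());

// One HighLevelConsumer per process, all joining the same consumer group.
var consumer = new kafka.HighLevelConsumer(
  client,
  [{ topic: 'my-topic' }],            // hypothetical topic name
  { groupId: 'DEVNULLCONSUMER' }
);

consumer.on('message', function (message) {
  // normal message processing
});

consumer.on('error', function (err) {
  // the intermittent FailedToRebalanceConsumerError: NODE_EXISTS[-110] surfaces here
  console.error(err);
});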

Suspect: highLevelConsumer.js, line #150

// Nasty hack to retry 3 times to re-balance - TBD fix this
var oldTopicPayloads = self.topicPayloads;
var operation = retry.operation({
  retries: 10,
  factor: 2,
  minTimeout: 1 * 100,
  maxTimeout: 1 * 1000,
  randomize: true
});
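
For context, the retry module drives an operation like this (a generic sketch of the retry API's attempt/retry loop, not kafka-node's exact rebalance code; rebalance and callback are hypothetical):

var retry = require('retry');

var operation = retry.operation({
  retries: 10,
  factor: 2,
  minTimeout: 1 * 100,
  maxTimeout: 1 * 1000,
  randomize: true
});

operation.attempt(function (currentAttempt) {
  rebalance(function (err) {          // hypothetical rebalance step
    if (operation.retry(err)) {
      return;                         // schedules another attempt with backoff
    }
    // after the retries are exhausted, surface the last error
    callback(err ? operation.mainError() : null);
  });
});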

What would the proper way to "unhack" the above code be? Happy to dig deeper and do a PR if I can help.

Thanks
-Jonathan

@BadLambdaJamma
Author

After closer inspection, it appears this is not an easy fix, as it is systemic to how consumers rebalance. But as a workaround, if my consumers exit upon error and then restart, and don't initially start in close proximity to one another, I get proper operation from the Kafka cluster. In some cases it is not the consumer I am starting that gets the error; instead, one of the other running consumers errors as it rebalances in the same timeframe as the joining consumer and the other existing consumers. Again, an auto-restart approach cleans this up with a restart that eventually does not fail. This can be cyclic, with several consumers exiting and restarting before they can successfully rebalance.

My consumer restarts are via upstart on Ubuntu. It seems hackish that a "consumer supervisor" (in this case upstart) would be needed for every implementation. I would like, instead, to code the supervisor directly into kafka-node. I have not reasoned about what that might look like, but I want opinions on the idea.
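
A minimal sketch of the exit-on-error workaround, assuming upstart (or pm2, etc.) is configured to respawn the process; the error-name check is illustrative, not an official kafka-node contract:

consumer.on('error', function (err) {
  if (err && err.name === 'FailedToRebalanceConsumerError') {
    // Let the process supervisor (upstart here) restart us; a later attempt
    // usually rebalances cleanly once the other consumers have settled.
    console.error('rebalance failed, exiting so the supervisor restarts us', err);
    process.exit(1);
  }
});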

@BadLambdaJamma
Author

I have validated the above behavior in our production infrastructure (Amazon AWS, 3 Zookeepers, 3 Kafka Brokers). Just keep restarting the consumers. Eventually they will balance. This gave me the sadz.

@mustafamamun

Having the same issue. It can be solved by restarting the agent. However, is there a more elegant way?

@felipesabino

Isn't it a duplicate of #90?

@BadLambdaJamma
Author

It may be a duplicate of issue #90, but that reporter did not provide enough information for me to draw a strict correlation to that reported bug. @mustafamamun: I do believe it could be solved in a more elegant way with the kafka-node client as it stands today (0.2.27). You probably want to listen for all possible events, then re-establish the clients and consumers/producers. This would not prevent a rebalance error; it would provide wrapping code that gracefully keeps retrying until the rebalance succeeds.

The main theme is this: this is not a bug, it is a systemic behavior of Kafka and the high-level consumer implementation (AKA the split-brain problem). The rebalance code retries the operation 10 times over a one-second period. This is exactly one half the time interval for the ZooKeeper heartbeat (by default). It is intended to fail and be retried on the next heartbeat. Remember: high-level consumers must all agree on partition assignments within the context of a single ZooKeeper heartbeat.

Also this: start using a low-level consumer, track offsets yourself, and manage partitions yourself (a lot harder than it sounds), and this problem won't exist. The high-level consumer is bad-ass; it just has some prickly edges to be worked around.

Here are all the events you can code for:

HighLevelConsumer.on('rebalancing', ...)
client.on('ready', ...)
client.on('brokersChanged', ...)
client.on('close', ...)
consumer.on('error', ...)
consumer.on('offsetOutOfRange', ...)
consumer.on('message', ...)
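
A rough sketch of that wrapping approach (purely illustrative; createConsumer, the topic name, and the 5-second backoff are assumptions, not part of kafka-node):

var kafka = require('kafka-node');

function createConsumer() {
  var client = new kafka.Client('localhost:2181/', 'DEVNULLCONSUMER' + Date.now());
  var consumer = new kafka.HighLevelConsumer(
    client,
    [{ topic: 'my-topic' }],                  // hypothetical topic
    { groupId: 'DEVNULLCONSUMER' }
  );

  consumer.on('message', function (message) {
    // normal processing
  });

  consumer.on('offsetOutOfRange', function (err) {
    console.error('offsetOutOfRange', err);
  });

  consumer.on('error', function (err) {
    // NODE_EXISTS / FailedToRebalanceConsumerError lands here; tear down and retry
    console.error('consumer error, recreating', err);
    consumer.close(function () {
      client.close(function () {
        setTimeout(createConsumer, 5000);     // crude backoff before rejoining the group
      });
    });
  });
}

createConsumer();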

@BadLambdaJamma
Author

I forgot these two events (probably more), but you get the idea: plenty of events to wag the dog with. Maybe I'll take a shot at a HighLevelResilientConsumer that keeps rebalancing until the cows come home.

HighLevelConsumer.on('registered', ...)
client.on('error', ...)

Also don't forget:

  1. You almost always get 1-N "node exists errors" during a rebalance.
  2. Failed rebalances can tear down existing consumers (via error) but allow the new consumer to start.

@bendpx

bendpx commented Sep 7, 2015

+1

@jezzalaycock
Contributor

Hopefully, when a move is made to Kafka 0.9.0, the split-brain problem will go away - see https://cwiki.apache.org/confluence/display/KAFKA/Consumer+Client+Re-Design - though there will be some code to write.

Using a tool such as pm2 to restart consumers automatically takes all the heavy lifting out of failing to rebalance correctly. We have been using this along with https://www.npmjs.com/package/kafka-node-manager on our production system with no major issues. Happy to share the config setup if you would like.
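
Not the exact setup referenced above, but a minimal pm2 process file for auto-restarting consumers might look roughly like this (app name, script path, and instance count are placeholders):

{
  "apps": [
    {
      "name": "devnull-consumer",
      "script": "./consumer.js",
      "instances": 3,
      "exec_mode": "fork"
    }
  ]
}

Started with pm2 start processes.json, pm2 launches the consumer processes and respawns any that exit after a failed rebalance (restarting on crash is pm2's default behavior).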

@ersumitchh

@jezzalaycock Could you please share your code.

@BadLambdaJamma
Author

I hope the Confluent team leaves the split brain in place even when delivering 0.9. It seems to be the fastest pattern for processing with Kafka. I have heard some of the suggestions for 0.9 about bringing more of the "brain" back into the Kafka cluster and out of the consumers, and I understand. I just want backward support for full consumer rebalancing, with all the power and prickly edges it may bring. Considering these factors, I tip my hat to the kafka-node contributors/team. Most Kafka specifications for consumers and producers are axiomatically multithreaded; as such, coding up a client in Node.js I am sure gave many a grey hair. The good news for me: this stuff all works in AWS production under load. And it does not just work, it screams! Good work, guys!

@BadLambdaJamma
Author

PS: Posting my first open-source Node.js module this weekend: SimplePartitioner, a Kafka message partitioner.

It supports:

  1. Real round robin
  2. Relational follower

It also supports a pluggable backing provider for relational-follower configurations, and a message graph n messages deep. The tip of the graph is round robin and all subsequent entities in the graph are relational followers. It supports "sticky round robin" like the Kafka default, but not sticky for so long... (a generic sketch of the idea is below).
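
As a generic illustration of those two modes (this is not the SimplePartitioner API, which isn't published yet; function and variable names are made up):

// Generic round-robin partitioner sketch - not the SimplePartitioner API.
function roundRobinPartitioner(partitionCount) {
  var next = 0;
  return function pick() {
    var p = next;
    next = (next + 1) % partitionCount;
    return p;
  };
}

// "Relational follower" sketch: messages sharing a key follow whatever
// partition the tip of the graph (round robin) assigned first.
function followerPartitioner(pickPartition) {
  var assignments = {};                  // key -> partition
  return function pick(key) {
    if (!(key in assignments)) {
      assignments[key] = pickPartition();
    }
    return assignments[key];
  };
}

// usage: tip of the graph is round robin, followers ride along by key
var tip = roundRobinPartitioner(3);
var follower = followerPartitioner(tip);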

@BadLambdaJamma
Author

If you're interested in a specific feature that you need for production that I have not listed, please hit me up. Happy coding!

@mindfireon812

I am facing this issue. Is there any update on it?

@mindfireon812

For this issue: whenever your client's/consumer's close or error event gets called, just restart them again.

@acoulon99

+1

@BadLambdaJamma
Author

A word of caution about the "just restart the consumer" approach that I and others have mentioned: if your consumer count reaches a high enough number, or you have significant network latency, "just restart" will fail to work. The "just restart" methodology works for me in production AWS when I have 3 consumers. If I crank that number up to 10 consumers, I start to see the inevitable "rebalance thrash". If you keep increasing the consumer count, you eventually reach "endless rebalance": your consumers are always failing the rebalance. Keep this in mind when you let your consumers "just restart":

  1. Consumer restarts - when a consumer restarts, all other consumers rebalance. If they fail, they restart and the problem compounds.
  2. Data flow - all data flow stops during a rebalance. Rebalance takes 5 minutes? No data for 5 minutes for you!
  3. A rebalance can take a long time - 5 minutes, 10 minutes, or never.

Final word: this NPM library is not suitable for high-volume production work when using the high-level consumer. You must use the low-level consumer and avoid rebalancing altogether; a minimal sketch of that approach follows. It is also suggested that you switch to Kafka 0.9, where the broker cluster manages the partition assignments rather than the "split brain" of the consumers.
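
A sketch of the low-level approach with manual offset management (the topic, partition assignment, and connection string are placeholders; in a real deployment you decide which process owns which partitions yourself):

var kafka = require('kafka-node');

var groupId = 'DEVNULLCONSUMER';
var client = new kafka.Client('localhost:2181/', 'lowlevel-consumer-1');
var offset = new kafka.Offset(client);

// Plain Consumer: this process explicitly owns partition 0 - no rebalance, ever.
var consumer = new kafka.Consumer(
  client,
  [{ topic: 'my-topic', partition: 0, offset: 0 }],    // hypothetical topic/partition
  { groupId: groupId, autoCommit: false, fromOffset: true }
);

consumer.on('message', function (message) {
  // process the message, then record the offset yourself
  offset.commit(groupId, [{
    topic: message.topic,
    partition: message.partition,
    offset: message.offset
  }], function (err) {
    if (err) console.error('offset commit failed', err);
  });
});

consumer.on('offsetOutOfRange', function (topicPartition) {
  // fall back to the earliest available offset and resume from there
  topicPartition.maxNum = 1;
  topicPartition.time = -2;                             // -2 = earliest, -1 = latest
  offset.fetch([topicPartition], function (err, data) {
    if (err) return console.error(err);
    var earliest = data[topicPartition.topic][topicPartition.partition][0];
    consumer.setOffset(topicPartition.topic, topicPartition.partition, earliest);
  });
});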

@panlilu

panlilu commented May 31, 2016

+1

@BadLambdaJamma
Author

The rebalance misery continues in Kafka 0.9 and an entirely new Kafka client (not kafka-node). With the standard default ZooKeeper epoch at 2 seconds and a high consumer count, rebalances still fail (never end) with Kafka 0.9 broker clusters. As such, the guidance is still the same: use the low-level consumer EVEN with Kafka 0.9. I have come to the conclusion that the method by which rebalances are coordinated (essentially ZooKeeper eventing) is a failed software pattern for this type of auto-scaling. My efforts will now be shifting from this client lib to another that supports Kafka 0.9. The intent is to solve the rebalance problem in Kafka 0.9 (conceptually and with various language bindings).
