Intermittent FailedToRebalanceConsumerError: Exception: NODE_EXISTS[-110] - when consumer starts #234
After closer inspection, this does not look like an easy fix: it is systemic to how consumers rebalance. As a workaround, if my consumers exit on error and then restart, and don't initially start in close proximity to one another, I get proper operation from the Kafka cluster. In some cases it is not the consumer I am starting that gets the error; instead, one of the other running consumers errors as it rebalances in the same timeframe as the joining consumer and the other existing consumers. Again, an auto-restart approach cleans this up with a restart that eventually does not fail. This can be cyclic, with several consumers exiting and restarting before they can successfully rebalance. My consumer restarts are via upstart on Ubuntu. It seems hackish that a "consumer supervisor" (in this case upstart) would be needed for every implementation. I would like instead to code the supervisor directly into kafka-node. I have not reasoned about what that might look like, but I want opinions on the idea.
I have validated the above behavior in our production infrastructure (Amazon AWS, 3 ZooKeepers, 3 Kafka brokers). Just keep restarting the consumers; eventually they will balance. This gave me the sadz.
Having the same issue. It can be solved by restarting the agent, but is there a more elegant way?
Isn't it a duplicate of #90?
It may be a duplicate of issue #90, but that reporter did not provide enough information for me to draw a strict correlation to that reported bug.
mustafamamun: I do believe it could be solved in a more elegant way with the kafka-node client as it stands today (0.2.27). You probably want to listen for all possible events, then re-establish the clients and consumers/producers. This would not prevent a rebalance error; it would provide wrapping code to gracefully keep retrying until the rebalance succeeds.
The main theme is this: this is not a bug. It is systemic behavior of Kafka and the high-level consumer implementation (a.k.a. the split-brain problem). The rebalance code retries the operation 10 times within about a second. This is exactly one half the time interval of the zookeeper heartbeat (by default); it is intended to fail and be retried on the next heartbeat. Remember: high-level consumers must all agree on partition assignments within the context of a single zookeeper heartbeat.
Also this: start using a low-level consumer, track offsets yourself, and manage partitions yourself (a lot harder than it sounds), and this problem won't exist. The high-level consumer is bad-ass; it just has some prickly edges to be worked around.
Here are all the events you can code for:
HighLevelConsumer.on('rebalancing')
client.on('ready')
client.on('brokersChanged')
client.on('close')
consumer.on('error')
consumer.on('offsetOutOfRange')
consumer.on('message')
I forgot these two events (there are probably more), but you get the idea: plenty of events to wag the dog with. Maybe I'll take a shot at a HighLevelResilientConsumer that keeps rebalancing until the cows come home. Also, don't forget:
HighLevelConsumer.on('registered')
client.on('error')
+1
Hopefully when a move is made to Kafka 0.9.0 the split-brain problem goes away (see https://cwiki.apache.org/confluence/display/KAFKA/Consumer+Client+Re-Design), though there will be some code to write. Using a tool such as pm2 to restart consumers automatically takes all the heavy lifting out of failing to rebalance correctly. We have been using this along with https://www.npmjs.com/package/kafka-node-manager on our production system with no major issues. Happy to share the config setup if you would like.
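For illustration only (this is not the poster's actual config), a pm2 ecosystem file that auto-restarts a consumer process might look like the following; the app name and script path are assumptions:

```javascript
// ecosystem.config.js — sketch of a pm2 process file that keeps a
// consumer running. pm2 restarts the process when it exits on a
// rebalance error; restart_delay spaces the restarts out so consumers
// don't all rejoin at the same instant.
module.exports = {
  apps: [{
    name: 'kafka-consumer',      // assumed process name
    script: './consumer.js',     // assumed entry point
    instances: 3,                // one process per consumer
    autorestart: true,           // restart on crash/exit
    restart_delay: 2000,         // ms to wait before restarting
    max_restarts: 50             // give up after repeated failures
  }]
};
```

Run with `pm2 start ecosystem.config.js`; `pm2 logs` then shows each consumer's restart cycle.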
@jezzalaycock Could you please share your code?
I hope the Kafka team leaves the split brain in place even when delivering 0.9. It seems to be the fastest pattern for processing with Kafka. I have heard some of the suggestions for 0.9 about bringing more of the "brain" back into the Kafka cluster and out of the consumers, and I understand. I just want backward support for full consumer rebalancing, with all the power and prickly edges it may bring. Considering these factors, I tip my hat to the kafka-node contributors/team. Most Kafka specifications for consumers and producers are axiomatically multithreaded; as such, I am sure coding up a client in Node.js gave many a grey hair. The good news for me: this stuff all works in AWS production under load. And it does not just work, it screams! Good work, guys!
PS: I'm posting my first open-source Node.js module this weekend: SimplePartitioner, a Kafka message partitioner. It supports:
a pluggable backing provider for relational follower configurations, and a message graph n messages deep. The tip of the graph is round robin, and all subsequent entities in the graph are relational followers. It also supports "sticky round robin" like the Kafka default, but not sticky for so long.
If you're interested in a specific feature you need for production that I have not listed, please hmu. Happy coding!
I am facing this issue. Is there any update?
For this issue: whenever your client's or consumer's close/error handler gets called, just restart them.
+1 |
A word of caution about the "just restart the consumer" approach that I and others have mentioned: if your consumer count reaches a high enough number, or you have significant network latency, "just restart" will fail to work. The "just restart" methodology works for me in production AWS when I have 3 consumers. If I crank that number up to 10 consumers, I start to see the inevitable "rebalance thrash". If you keep increasing the consumer count, you finally reach "endless rebalance", where your consumers are always failing the rebalance. Keep this in mind when you let your consumers "just restart".
Final word: this npm library is not suitable for high-volume production work when using the high-level consumer. You must use the low-level consumer and avoid rebalancing altogether. Alternatively, switch to Kafka 0.9, where the broker cluster manages the partition assignments rather than the "split brain" of the consumers.
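"Track offsets yourself" with the low-level consumer amounts to bookkeeping like the following minimal sketch. `OffsetTracker` is a hypothetical helper, pure JavaScript with no kafka-node calls; the actual fetch and commit wiring is out of scope here.

```javascript
// Sketch of manual offset tracking for a low-level consumer: remember the
// highest processed offset per topic/partition, and compute where to
// resume fetching after a restart.
class OffsetTracker {
  constructor() { this.offsets = new Map(); }
  key(topic, partition) { return topic + ':' + partition; }
  // Call after successfully processing a message.
  record(topic, partition, offset) {
    const k = this.key(topic, partition);
    const prev = this.offsets.get(k);
    if (prev === undefined || offset > prev) this.offsets.set(k, offset);
  }
  // Offset to resume from: one past the last processed, or 0 if unseen.
  resumeFrom(topic, partition) {
    const prev = this.offsets.get(this.key(topic, partition));
    return prev === undefined ? 0 : prev + 1;
  }
}

const tracker = new OffsetTracker();
tracker.record('events', 0, 41);
tracker.record('events', 0, 42);
console.log(tracker.resumeFrom('events', 0)); // 43
console.log(tracker.resumeFrom('events', 1)); // 0
```

In practice you would persist this map (to ZooKeeper, a file, or a database) so a restarted consumer can pass the resume offset when constructing its fetch requests.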
+1 |
The rebalance misery continues in Kafka 0.9 and an entirely new Kafka client (not kafka-node).
Stack trace:
FailedToRebalanceConsumerError: Exception: NODE_EXISTS[-110]
    at new FailedToRebalanceConsumerError (/Users/jonathannewell/workspace/kafkacop/node_modules/kafka-node/lib/errors/FailedToRebalanceConsumerError.js:11:11)
    at /Users/jonathannewell/workspace/kafkacop/node_modules/kafka-node/lib/highLevelConsumer.js:167:51
    at /Users/jonathannewell/workspace/kafkacop/node_modules/kafka-node/lib/highLevelConsumer.js:413:17
    at /Users/jonathannewell/workspace/kafkacop/node_modules/kafka-node/node_modules/async/lib/async.js:251:17
    at /Users/jonathannewell/workspace/kafkacop/node_modules/kafka-node/node_modules/async/lib/async.js:148:21
    at /Users/jonathannewell/workspace/kafkacop/node_modules/kafka-node/node_modules/async/lib/async.js:248:21
    at /Users/jonathannewell/workspace/kafkacop/node_modules/kafka-node/node_modules/async/lib/async.js:612:34
    at /Users/jonathannewell/workspace/kafkacop/node_modules/kafka-node/lib/highLevelConsumer.js:393:29
    at /Users/jonathannewell/workspace/kafkacop/node_modules/kafka-node/node_modules/async/lib/async.js:148:21
    at /Users/jonathannewell/workspace/kafkacop/node_modules/kafka-node/lib/highLevelConsumer.js:383:41
    at /Users/jonathannewell/workspace/kafkacop/node_modules/kafka-node/node_modules/async/lib/async.js:251:17
    at /Users/jonathannewell/workspace/kafkacop/node_modules/kafka-node/node_modules/async/lib/async.js:148:21
    at /Users/jonathannewell/workspace/kafkacop/node_modules/kafka-node/node_modules/async/lib/async.js:248:21
    at /Users/jonathannewell/workspace/kafkacop/node_modules/kafka-node/node_modules/async/lib/async.js:612:34
    at /Users/jonathannewell/workspace/kafkacop/node_modules/kafka-node/lib/highLevelConsumer.js:376:49
    at /Users/jonathannewell/workspace/kafkacop/node_modules/kafka-node/lib/zookeeper.js:362:22
I have reproduced this issue locally on my OS X 3-broker cluster as well as our 3-broker cluster in AWS (Ubuntu based). One consumer will always start. Intermittently, when any subsequent consumer starts (e.g. consumer 2 or 3), it will receive this error. If an error is NOT encountered when consumers 2 and 3 start (this is intermittent, so sometimes they start properly), all 3 consumers will process tens of millions of messages without any errors. I can stop and start them, no problems. Getting all 3 consumers to start on a consistent basis is the issue.
Setup: kafka-node 0.2.27.
Suspect: highLevelConsumer.js, around line #150:

```js
// Nasty hack to retry 3 times to re-balance - TBD fix this
var oldTopicPayloads = self.topicPayloads;
var operation = retry.operation({
    retries: 10,
    factor: 2,
    minTimeout: 1 * 100,
    maxTimeout: 1 * 1000,
    randomize: true
});
```
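Assuming the npm `retry` module's documented backoff formula (each delay is roughly minTimeout × factor^attempt, capped at maxTimeout, with jitter added when `randomize` is set), those options produce the following schedule; this helper just makes the timing visible:

```javascript
// Compute the nominal (jitter-free) delays the `retry` options above
// would produce: minTimeout * factor^attempt, capped at maxTimeout.
function backoffSchedule({ retries, factor, minTimeout, maxTimeout }) {
  const delays = [];
  for (let i = 0; i < retries; i++) {
    delays.push(Math.min(minTimeout * Math.pow(factor, i), maxTimeout));
  }
  return delays;
}

const delays = backoffSchedule({ retries: 10, factor: 2, minTimeout: 100, maxTimeout: 1000 });
console.log(delays);
// [100, 200, 400, 800, 1000, 1000, 1000, 1000, 1000, 1000]
console.log(delays.reduce((a, b) => a + b, 0) + 'ms total'); // 7500ms total
```

So the retry window is several seconds of mostly maxed-out 1-second delays; one "unhack" direction would be to derive these values from the ZooKeeper session/heartbeat settings instead of hard-coding them.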
What would be the proper way to "unhack" the above code? Happy to dig deeper and do a PR if I can help.
Thanks
-Jonathan