High-level consumer rebalance algorithm having issues #124
Rebalancing in Kafka relies on the assumption that all consumers attempt it at roughly the same time; see this extract from the Kafka docs: "Each consumer does the following during rebalancing: ... When rebalancing is triggered at one consumer, rebalancing should be triggered in other consumers within the same group at about the same time." Essentially, ZooKeeper must know about all consumers in a group at the same time to ensure partitions are allocated appropriately. Rebalancing in Kafka is a pain; see issues such as https://issues.apache.org/jira/browse/KAFKA-242. I would suggest that client code manage offsets manually, though this is itself complicated, because you must keep track of rebalances and commits.
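The manual offset management suggested above could be sketched roughly like this. This is a hypothetical illustration, not kafka-node's actual API: it assumes messages arrive with `{ topic, partition, offset }` fields, and that before releasing partitions during a rebalance you would flush the tracked offsets and commit them.

```javascript
// Hypothetical sketch of client-side offset tracking. On a rebalance,
// flush() would be called and the returned offsets committed before the
// consumer releases its partitions.
class OffsetTracker {
  constructor() {
    this.pending = {}; // "topic:partition" -> highest processed offset
  }
  // Record a processed message; keep only the highest offset per partition.
  track(msg) {
    const key = msg.topic + ':' + msg.partition;
    const prev = this.pending[key];
    if (prev === undefined || msg.offset > prev) this.pending[key] = msg.offset;
  }
  // Return the offsets to commit and reset the pending map.
  flush() {
    const out = this.pending;
    this.pending = {};
    return out;
  }
}
```

The tricky part the comment alludes to is sequencing: the flush-and-commit must happen before the rebalance releases a partition, or another consumer may re-read already-processed messages.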
Ah, this is very interesting. I searched for a while but could not find that document on the rebalancing algorithm. However, that still doesn't explain the issue I was seeing when I ran
I have issued a pull request that uses retry to retry the rebalance. I manage the offsets in my client. I've got a multi-node ZooKeeper and broker setup which has been running 24/7 with no issues; I will try to share my client code with you. I rely heavily on pm2 to manage restarts, because in some circumstances I simply want the client to die and reconnect, for instance when I lose the connection or brokers die. With pm2 it will simply restart, reconnect, and consume messages from the last committed offset. If you want the consumer to co-exist with other consumers (not necessarily a node client), then you must follow the regime set out in the Kafka docs.
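The die-and-let-pm2-restart regime above could look roughly like the following. The error names here are illustrative placeholders, not kafka-node's actual error classes: the point is just to classify errors as fatal and exit, letting the supervisor restart the process.

```javascript
// Sketch of the fail-fast approach: on an unrecoverable error, exit and
// let pm2 restart the process, which then resumes from the last committed
// offset. The error names below are assumptions for illustration.
const FATAL_ERRORS = ['BrokerNotAvailableError', 'ConnectionClosedError'];

function isFatal(err) {
  return FATAL_ERRORS.indexOf(err.name) !== -1;
}

// Wiring it up would look roughly like:
//   consumer.on('error', (err) => { if (isFatal(err)) process.exit(1); });
// with the process started under pm2, e.g.: pm2 start consumer.js
```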
Ahh, very interesting. I searched for a while but could not find the documentation on how consumers are supposed to handle a rebalance. I actually ended up writing a Java client that makes an HTTP request to node, and that seems to be a lot more stable. PM2 sounds interesting, thanks.
@jezzalaycock Can you share the code with me? Thanks.
Hey all, I saw a number of issues with the rebalancing algorithm in the high-level consumer. Namely, when several new consumers joined the cluster at the same time, the algorithm had trouble deciding how many partitions each consumer should be responsible for. This resulted in partitions becoming unassigned and, more importantly, in cases where a consumer would somehow fail to resolve its conflicts within Zookeeper but read messages from its partitions in Kafka anyway, so multiple consumers would end up reading messages from the same partition(s). This was not good.
Additionally, I ran into #90, wherein multiple consumers repeatedly try to claim the same partition. I think this is simply because the number of retries is not high enough, and all the consumers retry at the same time, creating contention. Using the retry library, I was able to increase the number of retries without more nesting, and to randomize the timeout to hopefully reduce contention.
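The randomized-retry idea is the core fix here, so a minimal sketch may help. This is a hand-rolled illustration of exponential backoff with jitter, not the retry library itself, though the option names (`retries`, `factor`, `minTimeout`, `randomize`) deliberately mirror retry's.

```javascript
// Compute a backoff delay for the given attempt number. Randomizing the
// delay spreads competing consumers out so they do not all retry at the
// same instant and re-create the contention.
function computeBackoff(attempt, opts) {
  opts = opts || {};
  const minTimeout = opts.minTimeout || 100;
  const factor = opts.factor || 2;
  const maxTimeout = opts.maxTimeout || 10000;
  const base = Math.min(minTimeout * Math.pow(factor, attempt), maxTimeout);
  return opts.randomize ? base * (1 + Math.random()) : base;
}

// Retry an async operation (callback style, matching the era of this
// codebase) up to `retries` times with backoff between attempts.
function retryWithBackoff(fn, retries, opts, cb) {
  let attempt = 0;
  function run() {
    fn((err, result) => {
      if (!err) return cb(null, result);
      if (attempt >= retries) return cb(err);
      setTimeout(run, computeBackoff(attempt++, opts));
    });
  }
  run();
}
```

With the retry library itself, `retry.operation({ retries: 10, randomize: true })` plus `operation.attempt(...)` / `operation.retry(err)` achieves the same effect without the nesting.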
I also ran into #112, which arises because in the rebalance code the consumer will successfully unassign itself from its partitions, but fail to claim a new partition. It will then retry the whole rebalance, and try again to unassign itself from its partitions, even though it has already done so. I solved this by retrying each part of the rebalance code separately.
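The structure of that fix can be sketched as follows. The function names (`releaseAll`, `claimAll`) are stand-ins for the real ZooKeeper operations; the point is that each phase gets its own retry loop, so a failed claim never re-runs the already-completed release phase.

```javascript
// Retry a single phase up to `retries` additional times (synchronous
// callback style for simplicity; real ZooKeeper calls are async but the
// shape is the same).
function retryPhase(fn, retries, cb) {
  fn((err) => {
    if (!err) return cb(null);
    if (retries <= 0) return cb(err);
    retryPhase(fn, retries - 1, cb);
  });
}

// Run the two rebalance phases in sequence, each with its own retries.
// Once releaseAll succeeds, only claimAll is ever retried, avoiding the
// double-unassign that triggers #112.
function rebalance(releaseAll, claimAll, retriesPerPhase, cb) {
  retryPhase(releaseAll, retriesPerPhase, (err) => {
    if (err) return cb(err);
    retryPhase(claimAll, retriesPerPhase, cb);
  });
}
```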
Anyway, due to these four issues (unassigned partitions, double-assigned partitions, NODE_EXISTS, and NO_NODE), I rewrote the rebalance code for my purposes and it seems to work quite well. I do not have the time to make it a full pull request, but I am attaching the code here in hopes that it will be useful to you. Perhaps at some point in the future I can make a more formal PR.
Two brief notes: (a) I used promises instead of bluebird for async flow, for consistency with the rest of my codebase; (b) in my logic, each consumer claims only one partition from each topic. I simply run the same number of consumers as I have partitions, though this definitely would not be ideal for everyone. Perhaps the logic here can be extended for those who need more partitions than consumers.
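The one-partition-per-consumer scheme in note (b) can be captured in a few lines. This is an illustrative sketch, not the attached code: it assumes each consumer can read the full list of consumer ids from ZooKeeper, and it assigns partitions deterministically by sorted index, so all consumers compute the same mapping independently.

```javascript
// Deterministic one-to-one assignment: sort the consumer ids and have
// each consumer claim the partition matching its own index. Returns null
// when this consumer gets nothing (more consumers than partitions, or an
// id not present in the list).
function partitionForConsumer(consumerIds, myId, partitionCount) {
  const sorted = consumerIds.slice().sort();
  const index = sorted.indexOf(myId);
  if (index === -1 || index >= partitionCount) return null;
  return index;
}
```

Because every consumer sorts the same id list, no coordination beyond reading ZooKeeper is needed; extending this to multiple partitions per consumer would just mean claiming every partition `p` with `p % consumerCount === index`.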