HighLevelConsumer throw FailedToRebalanceConsumerError: NODE_EXISTS when rebalancing #981

Merged
merged 2 commits into from
May 24, 2018


sodawy
Contributor

@sodawy sodawy commented May 23, 2018

  • env

    • kafka version: 0.8
    • kafka-node version: ^2.3.0
    • consumer process count: 16 (on different machines)
    • partition count: 46
  • reproduce:

    • consumer1 and consumer2 run in different processes
    • consumer1 owns partition1+2; on consumerChanged it starts rebalancing and runs rebalanceAttempt:
      • decides to own partition3+4
      • ZK update for partition3 succeeds
      • ZK update for partition4 fails
      • retries rebalanceAttempt
    • consumer2 registers while consumer1 is rebalancing, starts rebalancing and runs rebalanceAttempt:
      • decides to own partition3
      • ZK update fails with FailedToRebalanceConsumerError: NODE_EXISTS
      • retries continuously and never emits rebalanced
    • consumer1 retries rebalanceAttempt:
      • finds consumer2 and decides to own partition4
      • ZK update for partition4 succeeds
      • rebalanced
  • result:

    consumer1 is the owner of partition3 in ZK but never consumes it, because it only owns partition4 in its in-memory topicPayloads.
    consumer2 cannot complete the rebalance and throws FailedToRebalanceConsumerError, because the owner node in ZK already exists (it was created by consumer1).

  • solution:

    If the ZK update for a topic-partition (tp) fails during rebalanceAttempt, release the tps that were already updated successfully. In other words, when the long rebalancing process fails, the ZK updates that already succeeded need to be rolled back.
    This PR has fixed the issue in our production environment: all consumers are working and all partitions are being consumed.

  • about testcase:
    Sorry for not including one. The scenario is difficult to reproduce in a simulation. I would be glad if someone added a test, thanks.
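
The rollback idea in the solution above can be sketched as follows. This is only a minimal illustration, not the code in this PR: `OwnerRegistry`, its `create`/`release` methods and this standalone `rebalanceAttempt` function are hypothetical names, the ZooKeeper owner nodes are replaced by an in-memory Map, and the reason partition4 fails (a previous owner, here called `consumer0`, has not released it yet) is an assumption.

```javascript
// Hypothetical in-memory stand-in for the ZooKeeper partition-owner nodes.
// None of these names come from kafka-node; they only illustrate the rollback idea.
class OwnerRegistry {
  constructor() {
    this.owners = new Map(); // partition -> consumerId
  }

  // Mimics creating an owner node: fails if an owner already exists.
  create(partition, consumerId) {
    if (this.owners.has(partition)) {
      throw new Error('NODE_EXISTS');
    }
    this.owners.set(partition, consumerId);
  }

  // Mimics deleting an owner node, but only if this consumer owns it.
  release(partition, consumerId) {
    if (this.owners.get(partition) === consumerId) {
      this.owners.delete(partition);
    }
  }
}

// One rebalance attempt: claim every partition or none of them.
function rebalanceAttempt(registry, consumerId, partitions) {
  const claimed = [];
  try {
    for (const p of partitions) {
      registry.create(p, consumerId);
      claimed.push(p);
    }
    return claimed;
  } catch (err) {
    // Roll back the claims that already succeeded so no stale owner is left behind.
    for (const p of claimed) {
      registry.release(p, consumerId);
    }
    throw err;
  }
}

// Replay of the scenario. Assumption: partition4 is still held by a previous owner.
const registry = new OwnerRegistry();
registry.create('partition4', 'consumer0');

let firstAttemptError = null;
try {
  rebalanceAttempt(registry, 'consumer1', ['partition3', 'partition4']);
} catch (err) {
  firstAttemptError = err.message;
}

// partition3 was rolled back, so consumer2 can claim it instead of hitting NODE_EXISTS.
const consumer2Claimed = rebalanceAttempt(registry, 'consumer2', ['partition3']);
```

Without the rollback in the `catch` branch, consumer1 would remain the registered owner of partition3 after its failed attempt, and consumer2's claim would fail with NODE_EXISTS on every retry, which is the state described in the result above.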

Fixes #487 #449

sodawang added 2 commits May 23, 2018 18:50
…eacuse not own the partition in memory. when the right owner highLevelConsumerB rebalancing throw FailedToRebalanceConsumerError: Exception: NODE_EXISTS
@crzidea
Member

crzidea commented May 24, 2018

Good job! With this bug fixed, we can finally run a cluster with 150 partitions and 16 consumers. This bug has existed for 5 years, and it is the main reason why ConsumerGroup is more stable than HighLevelConsumer and why nobody uses this HighLevelConsumer implementation in (very large cluster) production.

@hyperlink check this PR ASAP, please!

This PR also fixes the following issue:

@hyperlink hyperlink merged commit 68dbc90 into SOHU-Co:master May 24, 2018