kafka-4295: ConsoleConsumer does not delete the temporary group in zookeeper #2054
huxihx wants to merge 27 commits into apache:trunk from huxihx:kafka-4295_ConsoleConsumer_fail_to_remove_zknode_onexit
Conversation
…roup in zookeeper Author: huxi. Since the consumer stop logic and the zk node removal code run in separate threads, when the two threads execute in an interleaved manner the persistent node '/consumers/<consumer-group>' might not be removed for those console consumer groups that do not specify "group.id". This pollutes Zookeeper with lots of inactive console consumer offset information.
@mjsax It seems the failure is not related to this commit; how should I handle this situation? Please advise. Thanks.
@amethystic Could you re-open the PR so that a Jenkins build can be triggered again? cc @hachikuji @ijuma for reviews.
@hachikuji @ijuma please review this PR. Thanks.
```scala
// if we generated a random group id (as none specified explicitly) then avoid polluting zookeeper with persistent group data, this is a hack
if (!conf.groupIdPassed && conf.options.has(conf.zkConnectOpt))
  ZkUtils.maybeDeletePath(conf.options.valueOf(conf.zkConnectOpt), "/consumers/" + conf.consumerProps.get("group.id"))
```
I think we have a utility in AdminUtils for this.
@hachikuji Yes, although AdminUtils already offers 'deleteConsumerGroupInZK' and 'deleteConsumerGroupInfoForTopicInZK' for this purpose, I noticed that ConsoleConsumer originally employed this snippet of code to delete the inactive console consumer group.
The key point here is that we must add this cleanup in addShutdownHook, otherwise the unused zk nodes might never be deleted if the first auto-commit task did not get started.
Makes sense. It looks like AdminUtils.deleteConsumerGroupInZK is the one we want. Perhaps we could replace both usages? Actually, if we've added this to the shutdown hook, do we still need it in the finally?
The finally block already contains this cleanup call:

```scala
consumer.cleanup()
conf.formatter.close()
reportRecordCount()
// if we generated a random group id (as none specified explicitly) then avoid polluting zookeeper with persistent group data, this is a hack
if (!conf.groupIdPassed)
  ZkUtils.maybeDeletePath(conf.options.valueOf(conf.zkConnectOpt), "/consumers/" + conf.consumerProps.get("group.id"))
shutdownLatch.countDown()
```
ZkUtils.maybeDeletePath is idempotent: it can be called many times without any negative impact. I call it from the shutdown hook thread to make sure that thread will still try to delete the znode if the main thread fails to do so.
As for the question of whether we should replace ZkUtils.maybeDeletePath with deleteConsumerGroupInZK, I recommend keeping the original, since adding two lines in the shutdown hook thread is enough to fix the bug, and we would not have to run any additional regression tests to make sure deleteConsumerGroupInZK behaves as expected. But if you insist, I will close this PR and open a new one using AdminUtils.deleteConsumerGroupInZK.
@hachikuji, what do you think?
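A minimal sketch of the idempotence argument, using a toy in-memory store (the `MaybeDelete` class and its behavior are illustrative inventions, not Kafka's actual ZkUtils): deleting an already-removed path is a silent no-op, so both the finally block and the shutdown hook can call the cleanup without coordination.

```java
import java.util.HashSet;
import java.util.Set;

// Toy stand-in for a ZK tree; names and behavior are illustrative only.
public class MaybeDelete {
    private final Set<String> nodes = new HashSet<>();

    public MaybeDelete(String... paths) {
        for (String p : paths) nodes.add(p);
    }

    // Idempotent delete in the spirit of maybeDeletePath: an absent path is
    // simply ignored (the real ZkUtils catches ZkNoNodeException to the same effect).
    public synchronized boolean maybeDeletePath(String path) {
        return nodes.remove(path);
    }

    public static void main(String[] args) {
        MaybeDelete zk = new MaybeDelete("/consumers/console-consumer-12345");
        // The main thread's finally block deletes the node first...
        System.out.println(zk.maybeDeletePath("/consumers/console-consumer-12345")); // prints true
        // ...and a later call from the shutdown hook thread does no harm.
        System.out.println(zk.maybeDeletePath("/consumers/console-consumer-12345")); // prints false
    }
}
```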
As far as I can tell, both methods delegate to ZkClient.deleteRecursive, and they both handle ZkNoNodeException, so I'm not sure I see the concern about idempotence. One difference is that maybeDeletePath catches a Throwable, but I don't see a good reason why we need to preserve that, especially since we're shutting down. Maybe I'm missing something?
Another question I have is about the shutdownLatch.await() below. If we're hitting a shutdown path where the maybeDeletePath is not being executed, wouldn't that mean we end up blocking in await()? Can you clarify the specific case that you're trying to handle in this patch?
Also, you don't need to close the PR if you want to change something. Just push a new commit. Your commits will get squashed when we merge anyway.
The reason the zk node failed to be removed is not that 'maybeDeletePath' was not executed, but that delete(path) failed inside deleteRecursive: the parent znode cannot be deleted if it still has child znodes at the moment delete(path) is called. So if ConsoleConsumer ran exactly long enough for the auto-commit thread to create the znode /consumers/<consumer_group>/offsets, everything is fine. But if not, then while the consumer is shutting down it commits the offset in the interval between deleting the child znodes (namely, ids and owners) and deleting the parent znode (/consumers/***). Do I make myself clear?
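To make the race concrete, here is a sketch with a toy in-memory znode store (`FakeZkStore` and `raceWindowHook` are invented for the demonstration; the real ZkClient.deleteRecursive exposes no such hook): the parent delete fails because a new child lands between the child-deletion pass and the parent delete.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Toy znode store: deleting a parent fails while it still has children,
// mirroring why a recursive delete can leave /consumers/<group> behind.
class FakeZkStore {
    private final Map<String, List<String>> children = new ConcurrentHashMap<>();
    // Stands in for the concurrent auto-commit thread; invented for this demo.
    Runnable raceWindowHook = () -> {};

    void create(String parent, String child) {
        children.computeIfAbsent(parent, k -> new CopyOnWriteArrayList<>()).add(child);
    }

    // Mimics a plain delete: refuses to remove a node that still has children.
    boolean delete(String path) {
        List<String> kids = children.get(path);
        if (kids != null && !kids.isEmpty())
            return false; // "node not empty" failure
        children.remove(path);
        return true;
    }

    // Mimics a recursive delete: remove the children, then the parent.
    boolean deleteRecursive(String path) {
        List<String> kids = children.getOrDefault(path, List.of());
        for (String kid : List.copyOf(kids))
            kids.remove(kid);
        raceWindowHook.run(); // an offset commit can create a child right here
        return delete(path);
    }
}

public class DeleteRace {
    public static void main(String[] args) {
        FakeZkStore zk = new FakeZkStore();
        String group = "/consumers/console-consumer-12345"; // illustrative path
        zk.create(group, "ids");
        zk.create(group, "owners");
        // The shutdown-time offset commit sneaks in during the race window.
        zk.raceWindowHook = () -> zk.create(group, "offsets");
        System.out.println("parent deleted? " + zk.deleteRecursive(group)); // prints "parent deleted? false"
    }
}
```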
Besides, I have already tested the patch under these scenarios:
- new consumer: verified there is no impact on the new consumer
- old short-running consumer: node successfully deleted
- old long-running consumer: node successfully deleted
- old short/long-running consumer shut down via 'kill -9': same behaviour as before, since when the JVM process is killed this way, even the JVM shutdown hook thread is not guaranteed a chance to run
- old consumer with group id specified: no impact, since we only want to clean up the inactive groups that did not set a group id
@hachikuji Does it make sense?
Just one question: do we need the additional call to maybeDeletePath in the finally block anymore? Adding it to the shutdown hook seems sufficient based on some testing. Is there any advantage to having it in both locations?
And just one note: it is actually possible to shut down the console consumer before the shutdown hook gets registered, in which case you can still end up with a leftover node in Zk. You can try to trigger this by shutting down the console consumer just after you see it register in zookeeper. This edge case is pretty tricky to address, so it's probably more important to get the main paths right.
@joestein Could you help review this PR? Thanks.

@hachikuji Could you address the comments above? Thanks.
@amethystic Apologies for the delay. I'll get to it today or tomorrow.

@hachikuji Any chance to address the comments soon? Please review the pull request. Thank you!
Refer to this link for build results (access rights to CI server needed):
…' of https://github.com/amethystic/kafka into kafka-4295_ConsoleConsumer_fail_to_remove_zknode_onexit
…Consumer_fail_to_remove_zknode_onexit
1. Remove the same cleanup code from within the JVM shutdown hook code block 2. Refine ZkUtils.maybeDeletePath to catch ZkException if the to-be-deleted znode is not empty
…okeeper: remove useless imports in ZkUtils.scala
@hachikuji yes, you are right. As per your comments, I removed the unnecessary resource cleanup code. ZkUtils.maybeDeletePath was also refined to catch the ZkException thrown when the to-be-deleted path is not empty. The fix only applies when ConsoleConsumer is shut down via SIGINT or SIGTERM; for "kill -9" it seems we cannot guarantee it works, since the application cannot catch that signal. Does all of this make sense?
During Acceptor initialization, if "Address already in use" error is encountered, the shutdown latch in each Processor is never counted down. Consequently, the Kafka server hangs when `Processor.shutdown` is called. Author: huxi <huxi@zhenrongbao.com> Author: amethystic <huxi_2b@hotmail.com> Reviewers: Jun Rao <junrao@gmail.com>, Ismael Juma <ismael@juma.me.uk> Closes #2156 from amethystic/kafka-4428_Kafka_noexit_for_port_already_use
Some of the recent changes to `kafka-run-class.sh` have not been applied to `kafka-run-class.bat`. These recent changes include setting proper streams or connect classpaths. So any streams or connect use case that leverages `kafka-run-class.bat` would fail with an error like ``` Error: Could not find or load main class org.apache.kafka.streams.??? ``` Author: Vahid Hashemian <vahidhashemian@us.ibm.com> Reviewers: Ewen Cheslack-Postava <ewen@confluent.io> Closes #2238 from vahidhashemian/minor/sync_up_kafka-run-class.bat
Mx4jLoader.scala should explicitly `return true` if the class is successfully loaded and started, otherwise it will return false even if the class is loaded. Author: Edward Ribeiro <edward.ribeiro@gmail.com> Reviewers: Ewen Cheslack-Postava <ewen@confluent.io> Closes #2295 from eribeiro/mx4jloader-bug
The original Javadoc description for `ConsumerRecord` is slightly confusing in that it can be read in a way such that an object is a key value pair received from Kafka, but (only) consists of the metadata associated with the record. This PR makes it clearer that the metadata is _included_ with the record, and moves the comma so that the phrase "topic name and partition number" in the sentence is more closely associated with the phrase "from which the record is being received". Author: LoneRifle <LoneRifle@users.noreply.github.com> Reviewers: Ismael Juma <ismael@juma.me.uk>, Ewen Cheslack-Postava <ewen@confluent.io> Closes #2290 from LoneRifle/patch-1
…ted regex This makes it consistent with MirrorMaker with the old consumer. Author: huxi <huxi@zhenrongbao.com> Author: amethystic <huxi_2b@hotmail.com> Reviewers: Vahid Hashemian <vahidhashemian@us.ibm.com>, Ismael Juma <ismael@juma.me.uk> Closes #2072 from amethystic/kafka-4351_Regex_behavior_change_for_new_consumer
…to be removed does not exist Author: Vahid Hashemian <vahidhashemian@us.ibm.com> Reviewers: Guozhang Wang <wangguoz@gmail.com> Closes #2218 from vahidhashemian/KAFKA-4480
@hachikuji please take some time to review. Thanks.
Author: Himani Arora <1himani.arora@gmail.com> Reviewers: Ismael Juma <ismael@juma.me.uk> Closes #2297 from himani1/refactored_code
zookeeper Addressed Ijuma's comments 1. Restored ZkUtils to trunk code 2. Restored ConsoleConsumerTest to trunk code 3. Restored ZkUtils.maybeDeletePath to trunk code 4. Replaced ZkUtils.maybeDeletePath with AdminUtils.deleteConsumerGroupInZK
@ijuma Please review the PR again. Thanks.
@ijuma Please take time to review this PR. Thanks.
@ijuma @hachikuji Please take some time to review the PR. Thanks.
@guozhangwang Do you know how to retrigger the checks for this fix? They have all failed, although it is not clear why.
ijuma left a comment
Do I understand correctly that this PR now does the following:

- Fixes an issue where we tried to delete a path in ZK for the new consumer.
- Uses AdminUtils.deleteConsumerGroupInZK instead of ZkUtils.maybeDeletePath.

And the issue described in the JIRA remains, since it's an edge case for the old consumer (which we intend to deprecate and remove)?
```scala
  assertTrue("Consumer group should be created.", zkUtils.getChildren(ZkUtils.ConsumersPath).head == groupID)
} finally {
  consumer.stop()
  ConsoleConsumer.deleteZkPathForConsumerGroup(conf.options.valueOf(conf.zkConnectOpt), conf.consumerProps.getProperty("group.id"))
```
It seems like the only thing this test is checking is that this call works. And since that is just calling AdminUtils, not sure if the benefit is worth it given the cost of starting up Kafka and ZK.
Agreed, but is there any way we could simulate sending terminate signals? The issue is caused by the fact that ZkUtils.maybeDeletePath fails to delete the whole directory when ConsoleConsumer exits after receiving an INT or TERM signal, so the best we can do is test whether the replacement method deleteConsumerGroupInZK works as expected. Any good ideas for testing this fix?
For your questions: as I said above, I am not sure whether you still want to check in any code fixing old consumer problems, especially since the community is planning to remove the old consumer soon. It's up to you. If you don't, I am happy to close the PR.
@ijuma What's the status of this PR? Have I addressed all your comments?
@huxihx sorry for the delay. I was going to look at this PR, but it seems like it now includes other changes. Maybe you could simply revive the part where we fix the new consumer not to call ZK unnecessarily?