Increased number of rd_kafka_cgrp_terminated with 2.5.0 and shutdown stability degradation #4792
Comments
After adding an |
I still don't know what the issue is. I mitigated it by unsubscribing, waiting for the unsubscribe to finish (it does not always finish), and then, after a short wait, shutting down. With that sequence the problem does not occur. |
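The mitigation described above (unsubscribe, wait a bounded amount for it to complete, then shut down regardless) could be sketched roughly as follows. This is only an illustration of the ordering, not librdkafka code: `fake_consumer_t` and every `fake_*` function are hypothetical stand-ins for a real consumer handle and its unsubscribe, poll, and close calls, and the wait is bounded by a poll count because the unsubscribe reportedly does not always finish.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a consumer handle (not a librdkafka type). */
typedef struct {
    int  polls_until_revoked; /* simulated polls left until the assignment
                               * is empty; negative means it never empties */
    bool unsubscribed;
    bool closed;
} fake_consumer_t;

/* Hypothetical: request that the subscription be dropped. */
static void fake_unsubscribe(fake_consumer_t *c) {
    c->unsubscribed = true;
}

/* Hypothetical: one poll step; returns true once the assignment is empty. */
static bool fake_poll_assignment_empty(fake_consumer_t *c) {
    if (!c->unsubscribed || c->polls_until_revoked < 0)
        return false;
    if (c->polls_until_revoked > 0)
        c->polls_until_revoked--;
    return c->polls_until_revoked == 0;
}

/* Hypothetical: close the consumer. */
static void fake_close(fake_consumer_t *c) {
    c->closed = true;
}

/* Unsubscribe, wait for at most max_polls poll steps, then close.
 * Returns true if the unsubscribe completed before closing. */
static bool graceful_shutdown(fake_consumer_t *c, int max_polls) {
    bool revoked = false;

    fake_unsubscribe(c);
    for (int i = 0; i < max_polls && !revoked; i++)
        revoked = fake_poll_assignment_empty(c);

    /* A real client might pause briefly here before closing. */
    fake_close(c);
    return revoked;
}
```

The return value only reports whether the wait succeeded; the close happens either way, so a stuck unsubscribe cannot block shutdown forever.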
@mensfeld that's great, is it possible to gather debug logs from the run? I'm not sure whether it's related to 2.5.0 or just something that happens with some particular timing. The assertion suggests that after entering the RD_KAFKA_CGRP_STATE_TERM state one of the variables in this check was increased:
    return rk->rk_consumer.wait_commit_cnt > 0 ||
           rk->rk_consumer.assignment.wait_stop_cnt > 0 ||
           rk->rk_consumer.assignment.pending->cnt > 0 ||
           rk->rk_consumer.assignment.queried->cnt > 0 ||
           rk->rk_consumer.assignment.removed->cnt > 0;
Maybe from the logs it's possible to detect what happens. |
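To make the check quoted above easier to reason about, here is a small self-contained model of it. The struct and both function names are hypothetical and simplified: in librdkafka the pending, queried, and removed fields are lists with a `cnt` member, which are reduced to plain counters here. The second helper is an assumption about what might be useful while debugging, namely reporting which counter is non-zero.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, flattened model of the counters in the quoted check. */
typedef struct {
    int wait_commit_cnt;
    int wait_stop_cnt;
    int pending_cnt;
    int queried_cnt;
    int removed_cnt;
} consumer_counters_t;

/* Mirrors the quoted expression: true while any counter is non-zero. */
static bool work_in_progress(const consumer_counters_t *c) {
    return c->wait_commit_cnt > 0 ||
           c->wait_stop_cnt > 0 ||
           c->pending_cnt > 0 ||
           c->queried_cnt > 0 ||
           c->removed_cnt > 0;
}

/* Returns the name of the first non-zero counter, or NULL if all are
 * zero. Checked in the same order as the quoted expression. */
static const char *first_busy_counter(const consumer_counters_t *c) {
    if (c->wait_commit_cnt > 0) return "wait_commit";
    if (c->wait_stop_cnt > 0)   return "wait_stop";
    if (c->pending_cnt > 0)     return "pending";
    if (c->queried_cnt > 0)     return "queried";
    if (c->removed_cnt > 0)     return "removed";
    return NULL;
}
```

In this model, termination is only safe while `work_in_progress` returns false; a counter going up after that point is the situation the assertion is described as catching.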
Thank you @emasab, I will dive deeper into this in the upcoming weeks. So far I have mitigated this as described above, and I do not see it happening in my rather extensive test suite. When I have some time I will roll back those stability fixes and try to crash it with logs enabled. |
I haven't had time, but I see the same in 2.6.0 once every few weeks in production. So far I have not been able to reproduce it reliably :( |
Description
I do not know the reason yet, but after upgrading librdkafka to 2.5.0 (with no other changes) the Karafka ecosystem CI crashes more often on shutdown with an rd_kafka_cgrp_terminated assertion failure (librdkafka/src/rdkafka_cgrp.c, line 3312 in 6eaf89f). The consumer destroy also hangs, which did not happen with 2.4.0.
I recall a different issue a while ago having a similar problem where the solution was to gracefully unsubscribe the consumer prior to the shutdown. I wonder if this would mitigate this as well 🤔
update: I can now trigger segfaults. Not sure yet exactly why but at least I can crash it on my machine.
How to reproduce
I have been unable to reproduce it so far in an isolated environment, and stress tests on my test setup do not show this behavior. However, in all of my previous reports (including ones that have since been fixed), the failing specs always pointed to valid issues. I will keep investigating and will provide more details when available.
Checklist
IMPORTANT: We will close issues where the checklist has not been completed.
Please provide the following information:
librdkafka version: 2.5.0 (not happening with 2.4.0)
Apache Kafka version: confluentinc/cp-kafka:7.6.1
Logs (with debug=.. as necessary) from librdkafka: will be provided, trying to repro with logs. I cannot provide them at this stage because it only crashes in CI, which runs without log collection (will work on this).