
Increased number of rd_kafka_cgrp_terminated with 2.5.0 and shutdown stability degradation #4792

mensfeld opened this issue Jul 22, 2024 · 5 comments

mensfeld commented Jul 22, 2024

Description

I do not yet know the reason, but after upgrading librdkafka to 2.5.0 (with no other changes), the Karafka ecosystem CI crashes more often with:

 *** rdkafka_cgrp.c:3312:rd_kafka_cgrp_terminated: assert: !rd_kafka_assignment_in_progress(rkcg->rkcg_rk) ***
Aborted (core dumped)

on shutdown, and the consumer destroy also hangs, which did not happen with 2.4.0.

rd_kafka_assert(NULL, !rd_kafka_assignment_in_progress(rkcg->rkcg_rk));

I recall a different issue a while ago with a similar problem, where the solution was to gracefully unsubscribe the consumer prior to shutdown. I wonder whether that would mitigate this as well 🤔
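
For clarity, the sequence I have in mind is roughly the following (a minimal C sketch against the public librdkafka consumer API; the grace-period loop and the 100 ms / 50-iteration timeouts are my own arbitrary choices, not a prescribed pattern):

#include <librdkafka/rdkafka.h>

/* rk is an already-running, subscribed high-level consumer. */
static void graceful_shutdown(rd_kafka_t *rk) {
        /* Leave the subscription first so the group can rebalance us
         * out before the instance is torn down. */
        rd_kafka_unsubscribe(rk);

        /* Keep serving rebalance callbacks / pending events for a short
         * grace period so the revocation can actually complete. */
        for (int i = 0; i < 50; i++) {
                rd_kafka_message_t *msg = rd_kafka_consumer_poll(rk, 100);
                if (msg)
                        rd_kafka_message_destroy(msg);
        }

        /* Close the consumer (leaves the group), then destroy the handle. */
        rd_kafka_consumer_close(rk);
        rd_kafka_destroy(rk);
}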

Update: I can now trigger segfaults. I am not yet sure exactly why, but at least I can crash it on my machine.

How to reproduce

So far I have been unable to reproduce it in an isolated environment, and stress tests on my test setup do not show this behavior. However, all of my previous reports (including ones that were fixed) also began with some specs failing on valid issues. I will keep investigating and will provide more details when available.

Checklist

IMPORTANT: We will close issues where the checklist has not been completed.

Please provide the following information:

  • librdkafka version (release number or git tag): 2.5.0 (not happening with 2.4.0)
  • Apache Kafka version: confluentinc/cp-kafka:7.6.1
  • librdkafka client configuration: absolute defaults
  • Operating system: ubuntu-latest from a GitHub Actions CI shared runner
  • Provide logs (with debug=.. as necessary) from librdkafka: not possible at this stage, because the crash only occurs in CI, which runs without log collection (I will work on this)
  • Provide broker log excerpts: same as above; when CI crashes, the VM is shut down. I will try introducing a mode with heavy debug tracing.
  • Critical issue
mensfeld commented:

After adding an unsubscribe invocation prior to shutdown, it no longer seems to cause any issues (as of now).


mensfeld commented Aug 5, 2024

I still don't know what the issue is. I mitigated it by issuing an unsubscribe, waiting for it to finish (it does not always finish), and then, after a short wait, shutting down. With that in place the problem does not emerge.
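
The "waiting for it to finish" part boils down to polling until the current assignment has drained, with a deadline (a sketch using the public rd_kafka_assignment() API; the 100 ms poll interval and the bounded wait are arbitrary choices of mine):

/* After rd_kafka_unsubscribe(): wait (bounded) until the current
 * assignment is empty, polling so rebalance events keep being served. */
static void wait_for_revocation(rd_kafka_t *rk, int max_ms) {
        while (max_ms > 0) {
                rd_kafka_topic_partition_list_t *assignment = NULL;

                if (rd_kafka_assignment(rk, &assignment) ==
                    RD_KAFKA_RESP_ERR_NO_ERROR) {
                        int cnt = assignment->cnt;
                        rd_kafka_topic_partition_list_destroy(assignment);
                        if (cnt == 0)
                                return; /* revocation completed */
                }

                rd_kafka_message_t *msg = rd_kafka_consumer_poll(rk, 100);
                if (msg)
                        rd_kafka_message_destroy(msg);
                max_ms -= 100;
        }
        /* Deadline hit: the unsubscribe did not finish; I proceed with
         * shutdown anyway, which is the case that can still hang. */
}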


emasab commented Aug 8, 2024

Update: I can now trigger segfaults. I am not yet sure exactly why, but at least I can crash it on my machine.

@mensfeld that's great. Is it possible to gather debug logs from the run? I'm not sure if it's related to 2.5.0 or just something that happens with particular timing.

The assertion suggests that, after entering the RD_KAFKA_CGRP_STATE_TERM state, one of these counters was increased:

        return rk->rk_consumer.wait_commit_cnt > 0 ||
               rk->rk_consumer.assignment.wait_stop_cnt > 0 ||
               rk->rk_consumer.assignment.pending->cnt > 0 ||
               rk->rk_consumer.assignment.queried->cnt > 0 ||
               rk->rk_consumer.assignment.removed->cnt > 0;

Maybe from the logs it's possible to detect what happens.
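
For example, enabling the consumer-group debug contexts in the client configuration should surface the relevant state transitions (a minimal sketch; the exact debug context list here is just a suggestion):

#include <stdio.h>
#include <librdkafka/rdkafka.h>

static rd_kafka_conf_t *make_debug_conf(void) {
        char errstr[512];
        rd_kafka_conf_t *conf = rd_kafka_conf_new();

        /* "cgrp" traces consumer-group state transitions,
         * "consumer" traces high-level consumer events. */
        if (rd_kafka_conf_set(conf, "debug", "consumer,cgrp",
                              errstr, sizeof(errstr)) != RD_KAFKA_CONF_OK)
                fprintf(stderr, "%s\n", errstr);
        return conf;
}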


mensfeld commented Aug 8, 2024

Thank you @emasab, I will dive deeper into this in the upcoming weeks. So far I have mitigated it as described above, and I do not see it happening in my rather extensive test suite. When I have some time, I will roll back those stability fixes and try to crash it with logs.

mensfeld commented:

I haven't had the time, but I see the same thing with 2.6.0 once every few weeks in production. So far I have not been able to replicate it reliably :(
