Segfault in round-robin assignor #2121
Comments
Can you provide:
This way I can reproduce this locally.
1 partition per topic.
2 members in total. The subscriptions seen in the gdb status seem reasonable to me, in the sense that In other words: The application startup works by incrementally adding subscriptions to the
FWIW: I now have a specific container image and matching coredump, so I can look into the state again. I've also checked how often this happens for us, and it seems to be very frequent, so for now I'll have to switch back to the "range" strategy for our production systems.
It might be worthwhile to recompile librdkafka with --disable-optimization to get better backtraces, and then do the full gdb dance to provide enough information for me to create a reproducible test case.
I ran into this while using roundrobin assignment. It seemed to be caused by a race condition where a new topic is picked up by only a subset of the consumers, and the code handling the case of an unsubscribed consumer overflows memory. Reproducing:
Okay, so it sounds like it is a problem with asymmetric subscriptions. We should add a unit test to trigger this.
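A rough sketch of the kind of test this suggests, written as standalone C rather than against librdkafka's internals. The `member_t` struct and the `member_subscribes` and `assign_topic` helpers are hypothetical stand-ins, not the library's actual types or functions. The sketch uses the setup reported above (2 members, 1 partition per topic), assumes only one member subscribes to the second topic, and keeps the rotating cursor inside the members array by wrapping it modulo `member_cnt` and giving up after one full lap.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_TOPICS 8

/* Hypothetical, simplified group member; NOT librdkafka's member type. */
typedef struct {
    const char *id;
    const char *topics[MAX_TOPICS]; /* names of subscribed topics */
    size_t topic_cnt;
    size_t assigned_cnt;            /* partitions handed to this member */
} member_t;

/* Returns 1 if member m subscribes to topic, 0 otherwise. */
static int member_subscribes(const member_t *m, const char *topic) {
    for (size_t i = 0; i < m->topic_cnt; i++)
        if (strcmp(m->topics[i], topic) == 0)
            return 1;
    return 0;
}

/* Assigns partition_cnt partitions of topic round-robin.
 * *next is the rotating cursor shared across topics; it always stays
 * below member_cnt. Returns the number of partitions assigned, which is
 * 0 when no member subscribes to the topic. */
static size_t assign_topic(member_t *members, size_t member_cnt,
                           const char *topic, size_t partition_cnt,
                           size_t *next) {
    size_t assigned = 0;

    assert(next != NULL);
    if (member_cnt == 0)
        return 0;

    for (size_t p = 0; p < partition_cnt; p++) {
        size_t tried = 0;

        /* Skip members that do not subscribe to this topic, wrapping
         * around and stopping after one full lap over the array. */
        while (tried < member_cnt &&
               !member_subscribes(&members[*next % member_cnt], topic)) {
            *next = (*next + 1) % member_cnt;
            tried++;
        }
        if (tried == member_cnt)
            break; /* nobody subscribes to this topic */

        members[*next % member_cnt].assigned_cnt++;
        *next = (*next + 1) % member_cnt;
        assigned++;
    }
    return assigned;
}
```

Calling `assign_topic` once per topic with a shared cursor that starts at 0 exercises the asymmetric case: when the cursor points at a member that lacks the topic, the loop moves on instead of indexing past the end of the array. The lap counter is a design choice in this sketch so that a topic with no subscribers terminates cleanly rather than spinning.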
When is this expected to be completed/closed?
This will be addressed after the transactions PR has been merged.
This also adds declarative unit tests of the assignors.
Description
We're using the 'roundrobin' partition assignment strategy, and sometimes I get segfaults in rdk:main. I looked at one of these in more detail:
Note:
It seems that in `rd_kafka_roundrobin_assignor_assign_cb` the code is running over the length of the members array: (cf. "2", an index, to the member_cnt "2" seen in frame 2).
I also looked at the subscriptions involved for `members[0]`: ... and `members[1]`:

The crash doesn't always happen, and looking at the subscriptions there seems to be a good chance that it is caused by some form of concurrent modification of the subscription list. This would not be impossible: our consumer does sometimes see changes in subscribed topics.
Looking at https://github.com/edenhill/librdkafka/blob/d7d58e5407852d17485cf8a82841523eb2b8f6d1/src/rdkafka_roundrobin_assignor.c#L85-L90: It seems that `next++` should be bounded by the length of the members array, for example:

I'm also a bit in doubt about the later part in https://github.com/edenhill/librdkafka/blob/d7d58e5407852d17485cf8a82841523eb2b8f6d1/src/rdkafka_roundrobin_assignor.c#L105: Instead of checking the eligible topic, this should probably use `member_cnt` as well.

Checklist
Please provide the following information:
v0.11.5 (from pkgconfig)
2.0.0
`partition.assignment.strategy=roundrobin` (and others)
Fedora 29/x86_64
`debug=..` as necessary) from librdkafka