Segfault in round-robin assignor #2121

Closed
ankon opened this issue Nov 27, 2018 · 8 comments

@ankon (Contributor) commented Nov 27, 2018

Description

We're using the 'roundrobin' partition assignment strategy, and I sometimes get segfaults in the rdk:main thread. I looked at one of these in more detail:

#0  0x00007fc0f32d6fd3 in rd_kafka_group_member_find_subscription (rk=rk@entry=0x7fc0d4002e50, rkgm=rkgm@entry=0x7fc0b806a380, topic=0x7fc0b8076de0 "ml3-commands") at rdkafka_assignor.c:81
#1  0x00007fc0f32d7fb4 in rd_kafka_roundrobin_assignor_assign_cb (rk=0x7fc0d4002e50, member_id=<optimized out>, protocol_name=<optimized out>, metadata=<optimized out>, members=0x7fc0b806a2e0, 
    member_cnt=<optimized out>, eligible_topics=0x7fc0b805bfa0, eligible_topic_cnt=12, errstr=0x7fc0d97f9a00 "", errstr_size=8, opaque=0x0) at rdkafka_roundrobin_assignor.c:80
#2  0x00007fc0f32d74d8 in rd_kafka_assignor_run (rkcg=rkcg@entry=0x7fc0d4003b20, protocol_name=protocol_name@entry=0x6b8ec40 "roundrobin", metadata=0x7fc0b80769f0, members=members@entry=0x7fc0b806a2e0, 
    member_cnt=member_cnt@entry=2, errstr=errstr@entry=0x7fc0d97f9a00 "", errstr_size=512) at rdkafka_assignor.c:331
#3  0x00007fc0f32be5f7 in rd_kafka_cgrp_assignor_run (rkcg=0x7fc0d4003b20, protocol_name=0x6b8ec40 "roundrobin", err=<optimized out>, metadata=<optimized out>, members=0x7fc0b806a2e0, member_cnt=2)
    at rdkafka_cgrp.c:671
#4  0x00007fc0f32be7cf in rd_kafka_cgrp_assignor_handle_Metadata_op (rk=<optimized out>, rkq=<optimized out>, rko=<optimized out>) at rdkafka_cgrp.c:733
#5  0x00007fc0f32ab5fc in rd_kafka_op_call (rk=<optimized out>, rkq=<optimized out>, rko=0x7fc0b802d090) at rdkafka_op.c:519
#6  0x00007fc0f32ac7e5 in rd_kafka_op_handle (rk=0x7fc0d4002e50, rkq=0x7fc0d97f9cc0, rko=0x7fc0b802d090, cb_type=RD_KAFKA_Q_CB_CALLBACK, opaque=0x0, callback=0x0) at rdkafka_op.c:647
#7  0x00007fc0f32a58df in rd_kafka_q_serve (rkq=0x7fc0d40038f0, timeout_ms=<optimized out>, max_cnt=max_cnt@entry=0, cb_type=cb_type@entry=RD_KAFKA_Q_CB_CALLBACK, callback=callback@entry=0x0, 
    opaque=opaque@entry=0x0) at rdkafka_queue.c:481
#8  0x00007fc0f327d4e4 in rd_kafka_thread_main (arg=arg@entry=0x7fc0d4002e50) at rdkafka.c:1393
#9  0x00007fc0f32da907 in _thrd_wrapper_function (aArg=<optimized out>) at tinycthread.c:624
#10 0x00007fc0f5d8358e in start_thread () from /lib64/libpthread.so.0
#11 0x00007fc0f5cb0653 in clone () from /lib64/libc.so.6

Note:

  1. This is librdkafka used via node-rdkafka in Node.js
  2. The gdb commands below have been trimmed and reordered for readability :D

It seems that in rd_kafka_roundrobin_assignor_assign_cb the code runs past the end of the members array:

(gdb) up
#1  0x00007fc0f32d7fb4 in rd_kafka_roundrobin_assignor_assign_cb (rk=0x7fc0d4002e50, member_id=<optimized out>, protocol_name=<optimized out>, metadata=<optimized out>, members=0x7fc0b806a2e0, 
    member_cnt=<optimized out>, eligible_topics=0x7fc0b805bfa0, eligible_topic_cnt=12, errstr=0x7fc0d97f9a00 "", errstr_size=8, opaque=0x0) at rdkafka_roundrobin_assignor.c:80
80			for (partition = 0 ;
(gdb) p next
$12 = 2

(cf. "2", an index, to the member_cnt "2" seen in frame 2).

(gdb) p members[2]
$13 = {rkgm_subscription = 0x0, rkgm_assignment = 0x3885, rkgm_eligible = {rl_size = 0, rl_cnt = 0, rl_elems = 0xf4240, rl_free_cb = 0x0, rl_flags = 2, rl_elemsize = 0, rl_p = 0x8000000007}, 
  rkgm_member_id = 0xff, rkgm_userdata = 0xd00000100, rkgm_member_metadata = 0x700}

I also looked at the subscriptions involved for members[0]:

(gdb) p *members[0].rkgm_subscription 
$21 = {cnt = 12, size = 24, elems = 0x7fc0b80591c0}
(gdb) p members[0].rkgm_subscription->elems[0]
$23 = {topic = 0x7fc0b8017ad0 "collaborne-commands", partition = -1, offset = -1001, metadata = 0x0, metadata_size = 0, opaque = 0x0, err = RD_KAFKA_RESP_ERR_NO_ERROR, _private = 0x0}
(gdb) p members[0].rkgm_subscription->elems[1]
$24 = {topic = 0x7fc0b8056e10 "collaborne-events", partition = -1, offset = -1001, metadata = 0x0, metadata_size = 0, opaque = 0x0, err = RD_KAFKA_RESP_ERR_NO_ERROR, _private = 0x0}
(gdb) p members[0].rkgm_subscription->elems[2]
$25 = {topic = 0x7fc0b805c110 "ml1-commands", partition = -1, offset = -1001, metadata = 0x0, metadata_size = 0, opaque = 0x0, err = RD_KAFKA_RESP_ERR_NO_ERROR, _private = 0x0}
(gdb) p members[0].rkgm_subscription->elems[3]
$26 = {topic = 0x7fc0b8016620 "ml1-events", partition = -1, offset = -1001, metadata = 0x0, metadata_size = 0, opaque = 0x0, err = RD_KAFKA_RESP_ERR_NO_ERROR, _private = 0x0}
(gdb) p members[0].rkgm_subscription->elems[4]
$27 = {topic = 0x7fc0b8016670 "ml2-commands", partition = -1, offset = -1001, metadata = 0x0, metadata_size = 0, opaque = 0x0, err = RD_KAFKA_RESP_ERR_NO_ERROR, _private = 0x0}
(gdb) p members[0].rkgm_subscription->elems[5]
$28 = {topic = 0x7fc0b8015bf0 "ml2-events", partition = -1, offset = -1001, metadata = 0x0, metadata_size = 0, opaque = 0x0, err = RD_KAFKA_RESP_ERR_NO_ERROR, _private = 0x0}
(gdb) p members[0].rkgm_subscription->elems[6]
$29 = {topic = 0x7fc0b803e260 "ml3-commands", partition = -1, offset = -1001, metadata = 0x0, metadata_size = 0, opaque = 0x0, err = RD_KAFKA_RESP_ERR_NO_ERROR, _private = 0x0}
(gdb) p members[0].rkgm_subscription->elems[7]
$30 = {topic = 0x7fc0b8058170 "ml3-events", partition = -1, offset = -1001, metadata = 0x0, metadata_size = 0, opaque = 0x0, err = RD_KAFKA_RESP_ERR_NO_ERROR, _private = 0x0}
(gdb) p members[0].rkgm_subscription->elems[8]
$31 = {topic = 0x7fc0b803e2b0 "ml4-commands", partition = -1, offset = -1001, metadata = 0x0, metadata_size = 0, opaque = 0x0, err = RD_KAFKA_RESP_ERR_NO_ERROR, _private = 0x0}
(gdb) p members[0].rkgm_subscription->elems[9]
$32 = {topic = 0x7fc0b80508f0 "ml4-events", partition = -1, offset = -1001, metadata = 0x0, metadata_size = 0, opaque = 0x0, err = RD_KAFKA_RESP_ERR_NO_ERROR, _private = 0x0}
(gdb) p members[0].rkgm_subscription->elems[10]
$33 = {topic = 0x7fc0b805ee50 "qa-commands", partition = -1, offset = -1001, metadata = 0x0, metadata_size = 0, opaque = 0x0, err = RD_KAFKA_RESP_ERR_NO_ERROR, _private = 0x0}
(gdb) p members[0].rkgm_subscription->elems[11]
$34 = {topic = 0x7fc0b80378d0 "qa-events", partition = -1, offset = -1001, metadata = 0x0, metadata_size = 0, opaque = 0x0, err = RD_KAFKA_RESP_ERR_NO_ERROR, _private = 0x0}

... and members[1]:

(gdb)  p *members[1].rkgm_subscription
$50 = {cnt = 5, size = 10, elems = 0x7fc0b8015660}
(gdb) p members[1].rkgm_subscription->elems[0]
$42 = {topic = 0x7fc0b8057760 "collaborne-events", partition = -1, offset = -1001, metadata = 0x0, metadata_size = 0, opaque = 0x0, err = RD_KAFKA_RESP_ERR_NO_ERROR, _private = 0x0}
(gdb) p members[1].rkgm_subscription->elems[1]
$43 = {topic = 0x7fc0b8046c10 "ml1-events", partition = -1, offset = -1001, metadata = 0x0, metadata_size = 0, opaque = 0x0, err = RD_KAFKA_RESP_ERR_NO_ERROR, _private = 0x0}
(gdb) p members[1].rkgm_subscription->elems[2]
$44 = {topic = 0x7fc0b805e730 "ml2-events", partition = -1, offset = -1001, metadata = 0x0, metadata_size = 0, opaque = 0x0, err = RD_KAFKA_RESP_ERR_NO_ERROR, _private = 0x0}
(gdb) p members[1].rkgm_subscription->elems[3]
$45 = {topic = 0x7fc0b8037340 "ml3-events", partition = -1, offset = -1001, metadata = 0x0, metadata_size = 0, opaque = 0x0, err = RD_KAFKA_RESP_ERR_NO_ERROR, _private = 0x0}
(gdb) p members[1].rkgm_subscription->elems[4]
$46 = {topic = 0x7fc0b803d9e0 "qa-events", partition = -1, offset = -1001, metadata = 0x0, metadata_size = 0, opaque = 0x0, err = RD_KAFKA_RESP_ERR_NO_ERROR, _private = 0x0}

The crash doesn't always happen. Looking at the subscriptions, there is a good chance that it is caused by some form of concurrent modification of the subscription list; that would not be impossible, since our consumer does change its subscribed topics from time to time.

Looking at https://github.com/edenhill/librdkafka/blob/d7d58e5407852d17485cf8a82841523eb2b8f6d1/src/rdkafka_roundrobin_assignor.c#L85-L90: it seems that the next++ should be guarded against running past the number of members, for example:

next = (next+1) % member_cnt;

I'm also a bit in doubt about the later part at https://github.com/edenhill/librdkafka/blob/d7d58e5407852d17485cf8a82841523eb2b8f6d1/src/rdkafka_roundrobin_assignor.c#L105: instead of checking against the eligible topic, this should probably use member_cnt as well.
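
To make that concrete, here is a minimal sketch of such a guard (a hypothetical helper, not actual librdkafka code; it assumes rd_kafka_group_member_find_subscription returns non-zero when the member subscribes to the topic, which is what the backtrace above suggests):

/* Hypothetical helper: scan for the next member subscribed to `topic`,
 * wrapping `next` modulo member_cnt so it can never index past the end
 * of the members array. `scanned` bounds the loop in case no member
 * subscribes to the topic at all. */
static int
find_next_subscribed_member (rd_kafka_t *rk,
                             rd_kafka_group_member_t *members,
                             int member_cnt,
                             int *next,
                             const char *topic) {
        int scanned;

        for (scanned = 0 ; scanned < member_cnt ; scanned++) {
                if (rd_kafka_group_member_find_subscription(
                            rk, &members[*next], topic))
                        return *next;
                *next = (*next + 1) % member_cnt;
        }

        return -1; /* no member subscribes to this topic */
}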

Checklist

Please provide the following information:

  • librdkafka version (release number or git tag): v0.11.5 (from pkgconfig)
  • Apache Kafka version: 2.0.0
  • librdkafka client configuration: partition.assignment.strategy=roundrobin (and others)
  • Operating system: Fedora 29/x86_64
  • Provide logs (with debug=.. as necessary) from librdkafka
  • Provide broker log excerpts
  • Critical issue
@edenhill (Contributor)

Can you provide:

  • the number of members and each member's topics subscription
  • the number of partitions for each subscribed topic

This way I can reproduce this locally.

@ankon (Contributor, Author) commented Nov 29, 2018

the number of partitions for each subscribed topic

1 partition per topic.

the number of members and each member's topics subscription

2 members in total. The subscriptions seen in the gdb output seem reasonable to me, in the sense that members[1] was starting up and adding subscriptions, while members[0] was already running with the full set. The crash happened before the second member was fully started.

In other words:
member0 subscriptions: collaborne-commands, collaborne-events, ml1-commands, ml1-events, ml2-commands, ml2-events, ml3-commands, ml3-events, ml4-commands, ml4-events, qa-commands, qa-events
member1 subscriptions: collaborne-events, ml1-events, ml2-events, ml3-events, qa-events

On startup the application incrementally subscribes the consumer to the ...-events topics, and then adds further subscriptions for the linked ...-commands topics a while later. The crash seems to have happened during the startup phase of the second member, so seeing only some of the ...-events topics there "makes complete sense to me" :)
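
For illustration only, the same two-phase pattern expressed against the librdkafka C API would look roughly like the sketch below (our application actually does this through node-rdkafka, so the topic names, the split into two calls, and the helper name are examples rather than the literal application code):

#include <librdkafka/rdkafka.h>

/* `rk` is an already-created consumer handle
 * (rd_kafka_new(RD_KAFKA_CONSUMER, ...)). */
static void subscribe_in_phases (rd_kafka_t *rk) {
        rd_kafka_topic_partition_list_t *topics;

        /* Phase 1: subscribe to the ...-events topics only. */
        topics = rd_kafka_topic_partition_list_new(2);
        rd_kafka_topic_partition_list_add(topics, "collaborne-events",
                                          RD_KAFKA_PARTITION_UA);
        rd_kafka_topic_partition_list_add(topics, "ml1-events",
                                          RD_KAFKA_PARTITION_UA);
        rd_kafka_subscribe(rk, topics);
        rd_kafka_topic_partition_list_destroy(topics);

        /* ... a while later ... */

        /* Phase 2: resubscribe with the linked ...-commands topics added.
         * Each rd_kafka_subscribe() call replaces the previous subscription
         * and triggers a rebalance while the other member may still hold
         * the full topic set. */
        topics = rd_kafka_topic_partition_list_new(4);
        rd_kafka_topic_partition_list_add(topics, "collaborne-events",
                                          RD_KAFKA_PARTITION_UA);
        rd_kafka_topic_partition_list_add(topics, "collaborne-commands",
                                          RD_KAFKA_PARTITION_UA);
        rd_kafka_topic_partition_list_add(topics, "ml1-events",
                                          RD_KAFKA_PARTITION_UA);
        rd_kafka_topic_partition_list_add(topics, "ml1-commands",
                                          RD_KAFKA_PARTITION_UA);
        rd_kafka_subscribe(rk, topics);
        rd_kafka_topic_partition_list_destroy(topics);
}
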
Unfortunately I wasn't able to capture the logs of this run, and I had to learn the hard way that coredumpctl debug (or something related to it) for some reason decided to delete the crash dump on me.
I have also tried to debug the situation with the debug level set to cgrp, but so far haven't been able to capture anything useful there.

@ankon (Contributor, Author) commented Nov 30, 2018

FWIW: I now have a specific container image and a matching coredump, so I can look into the state again. I've also checked how often this happens for us, and it seems to be very frequent, so for now I'll have to switch back to the "range" strategy for our production systems.

@edenhill (Contributor)

It might be worthwhile to recompile librdkafka with --disable-optimization to get better backtraces, and then do the full gdb dance to provide enough information for me to create a reproducible test case.

@wilmai commented Dec 10, 2019

#2650

I ran into this while using roundrobin assignment. It seems to be caused by a race condition where a new topic is picked up by only a subset of the consumers, and the code handling the case of a consumer that is not subscribed to the topic then overruns the members array.

Reproducing (a consumer-side sketch follows these steps):

  • Run 3 consumers in a group, subscribed to "foo.*"
  • Create a topic matching the regex with 3 partitions; alternatively, 3 topics with 1 partition each
  • Wait for the new topic to trigger rebalancing
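
A rough sketch of the consumer side of these steps with the librdkafka C API is below (the broker address, group id, and error handling are placeholders; topics prefixed with "^" are treated as regex subscriptions by librdkafka):

#include <librdkafka/rdkafka.h>

/* Run three instances of this to get three members in the group. */
static void run_consumer (void) {
        char errstr[512];
        rd_kafka_conf_t *conf = rd_kafka_conf_new();
        rd_kafka_t *rk;
        rd_kafka_topic_partition_list_t *topics;

        rd_kafka_conf_set(conf, "bootstrap.servers", "localhost:9092",
                          errstr, sizeof(errstr));
        rd_kafka_conf_set(conf, "group.id", "repro-group",
                          errstr, sizeof(errstr));
        rd_kafka_conf_set(conf, "partition.assignment.strategy", "roundrobin",
                          errstr, sizeof(errstr));

        rk = rd_kafka_new(RD_KAFKA_CONSUMER, conf, errstr, sizeof(errstr));
        rd_kafka_poll_set_consumer(rk);

        /* Regex subscription: matches any topic whose name starts with "foo". */
        topics = rd_kafka_topic_partition_list_new(1);
        rd_kafka_topic_partition_list_add(topics, "^foo.*",
                                          RD_KAFKA_PARTITION_UA);
        rd_kafka_subscribe(rk, topics);
        rd_kafka_topic_partition_list_destroy(topics);

        /* Keep polling so group rebalances are served; creating a new
         * matching topic on the broker then triggers the rebalance. */
        while (1) {
                rd_kafka_message_t *rkm = rd_kafka_consumer_poll(rk, 1000);
                if (rkm)
                        rd_kafka_message_destroy(rkm);
        }
}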

@edenhill (Contributor)

Okay, so it sounds like it is a problem with asymmetric subscriptions; we should add a unit test to trigger this.

@takaomag commented Feb 5, 2020

When is this expected to be completed/closed?
I want to avoid the uneven message distribution caused by the range strategy.

@edenhill (Contributor) commented Feb 5, 2020

This will be addressed after the transactions PR has been merged.

wilmai added a commit to wilmai/librdkafka that referenced this issue Feb 12, 2020

Check that the next roundrobin consumer is in the topic's eligible list; otherwise select the next eligible consumer.
The previous code overflowed members[next] and crashed.
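
In rough terms, the intent of that fix is sketched below (not the literal patch; `topic` stands for the eligible topic currently being assigned and `partition` for its partition index, with the other names taken from the backtrace and gdb output above):

/* Only hand the partition to members[next] if that member actually
 * subscribes to `topic`; otherwise advance with wrap-around to the next
 * member that does, so members[next] can never run past the array.
 * This assumes every eligible topic has at least one subscribed member,
 * so the scan terminates. */
while (!rd_kafka_group_member_find_subscription(rk, &members[next], topic))
        next = (next + 1) % member_cnt;

rd_kafka_topic_partition_list_add(members[next].rkgm_assignment,
                                  topic, partition);
next = (next + 1) % member_cnt;
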
edenhill added this to the v1.5.0 milestone Mar 12, 2020
edenhill added a commit that referenced this issue May 4, 2020
This also adds declarative unit tests of the assignors.
edenhill added a commit that referenced this issue May 5, 2020
This also adds declarative unit tests of the assignors.
edenhill closed this as completed May 6, 2020