
rdkafka_partition.c:165:rd_kafka_toppar_remove: assert: rktp->rktp_removed==0 #558

Closed
wucanhui opened this issue Feb 22, 2016 · 12 comments

@wucanhui

We are using librdkafka-0.9.0.99, and it crashes after consuming messages continuously for hours.
The log messages:
Consume failed: Local: Operation in progress
Consume failed: Local: Operation in progress
Consume failed: Local: Operation in progress
Consume failed: Local: Operation in progress
Consume failed: Local: Operation in progress
*** rdkafka_partition.c:165:rd_kafka_toppar_remove: assert: rktp->rktp_removed==0 ***
Aborted (core dumped)

The gdb info is like this:
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

warning: no loadable sections found in added symbol-file system-supplied DSO at 0x7fff40195000
Core was generated by `./etld etl.properties'.
Program terminated with signal 6, Aborted.
#0 0x00007ff1e34630d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6

(gdb) bt
#0 0x00007ff1e34630d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007ff1e346683b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007ff1e28402aa in rd_kafka_crash (file=<optimized out>, line=<optimized out>, function=<optimized out>, rk=0x0, reason=<optimized out>) at rdkafka.c:2096
#3 0x00007ff1e2870ad5 in rd_kafka_toppar_remove (rktp=0x7ff1bc003020) at rdkafka_partition.c:165
#4 0x00007ff1e2870af2 in rd_kafka_toppar_destroy_final (rktp=0x7ff1bc003020) at rdkafka_partition.c:183
#5 0x00007ff1e28514d8 in rd_kafka_topic_leader_update (rkb=0x7ff1d8002890, partition=1, rkt=0x7ff1bc0011d0, leader_id=<optimized out>) at rdkafka_topic.c:351
#6 rd_kafka_topic_metadata_update (rkb=0x7ff1b8003f40, mdt=0x7ff1b8001547) at rdkafka_topic.c:735
#7 0x00007ff1e2860d31 in rd_kafka_parse_Metadata (rkb=0x7ff1b8003f40, rkt=0x7ff1bc0011d0, rkbuf=<optimized out>) at rdkafka_request.c:1657
#8 0x00007ff1e28692b8 in rd_kafka_op_handle_Metadata (rkb=0x7ff1b8003f40, err=RD_KAFKA_RESP_ERR_NO_ERROR, rkbuf=0x2325640, request=<optimized out>, opaque=0x7ff1bc000970) at rdkafka_request.c:1721
#9 0x00007ff1e2859959 in rd_kafka_buf_callback (rkb=0x7ff1b8003f40, err=RD_KAFKA_RESP_ERR_NO_ERROR, response=0x2325640, request=0x23257f0) at rdkafka_buf.c:472
#10 0x00007ff1e2838609 in rd_kafka_toppar_q_cb (rk=<optimized out>, rko=<optimized out>, cb_type=<optimized out>, opaque=<optimized out>) at rdkafka.c:886
#11 0x00007ff1e285bd0f in rd_kafka_q_serve (rkq=0x7ff1d8001670, timeout_ms=<optimized out>, max_cnt=<optimized out>, cb_type=0, callback=0x7ff1e2838600 <rd_kafka_toppar_q_cb>, opaque=0x0) at rdkafka_queue.c:371
#12 0x00007ff1e2841dbd in rd_kafka_toppars_q_serve (timeout_ms=<optimized out>, rkq=0x7ff1d8001670) at rdkafka.c:892
#13 rd_kafka_thread_main (arg=0x7ff1d80015e0) at rdkafka.c:929
#14 0x00007ff1e28781ff in _thrd_wrapper_function (aArg=<optimized out>) at tinycthread.c:599
#15 0x00007ff1e3d08e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#16 0x00007ff1e352038d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#17 0x0000000000000000 in ?? ()

Do you know why it crashes like this? Thanks.

@edenhill
Contributor

edenhill commented Mar 9, 2016

From the backtrace it seems that one of your topics was suddenly missing from the metadata returned by the broker, or that the partition count changed for a topic. Does that sound familiar?

librdkafka still shouldn't fail because of this, and I'll try to reproduce it.

@edenhill edenhill added this to the 0.9.1 milestone Mar 9, 2016
@edenhill edenhill modified the milestones: 0.9.2, 0.9.1 Apr 11, 2016
@alanyin

alanyin commented May 9, 2016

I have the exact same issue; the coredump stack is identical.
Is this issue still open? Any idea when it will be fixed?

@edenhill
Contributor

edenhill commented May 9, 2016

Is this still happening on latest master?

@alanyin

alanyin commented May 9, 2016

I downloaded the source on 9 Apr and haven't tried the latest one. Should I try latest master?

Alan


@edenhill
Contributor

edenhill commented May 9, 2016

Yes please, try latest master, and if the problem persists, outline the exact steps to reproduce. Thanks!

@edenhill
Contributor

This seems to be caused by connecting to two clusters simultaneously where the topic state differs.
This is a problem on the user side, but we still shouldn't crash.

@jleducq

jleducq commented Aug 19, 2016

This error is still happening with the latest version:
librdkafka version: latest version
kafka version: 0.10.0

My setup:

bin/kafka-topics.sh --zookeeper server1,server2,server3,server4,server5,server6 --describe --topic myTopic
Topic:myTopic PartitionCount:3 ReplicationFactor:2 Configs:
Topic: myTopic Partition: 0 Leader: 1 Replicas: 1,2 Isr: 2,1
Topic: myTopic Partition: 1 Leader: 3 Replicas: 3,4 Isr: 4,3
Topic: myTopic Partition: 2 Leader: 5 Replicas: 5,6 Isr: 5,6

My application is producing on partition 1 and consuming from partition 0 and 2.

Scenario:

  1. While the application is running (producing), I shut down Kafka servers 3 and 4 (both replicas for partition 1) to observe the application's reaction.
  2. I also hit this issue without doing anything (all my partitions/brokers were up): it failed randomly overnight while the application was not producing/consuming. I had to restart every single Kafka broker before restarting the application; otherwise it would fail at startup with the error below.

Application log:

INFO - DeliveryReportCallback: Ack of message delivery for (124 bytes): Success
INFO - DeliveryReportCallback: Ack of message delivery for (124 bytes): Success
INFO - DeliveryReportCallback: Ack of message delivery for (124 bytes): Success
INFO - DeliveryReportCallback: Ack of message delivery for (124 bytes): Success
INFO - KafkaEventCallback: Received a log event: server4:9092/4: Receive failed: Disconnected(3, FAIL)
ERROR - KafkaEventCallback: Received an error event: server4:9092/4: Receive failed: Disconnected(Local: Broker transport failure)
INFO - KafkaEventCallback: Received a log event: Topic myTopic partition count changed from 3 to 1(5, PARTCNT)
INFO - KafkaEventCallback: Received a log event: Topic myTopic: partition [1] is unknown (partition_cnt 1)(4, LEADER)
*** rdkafka_partition.c:207:rd_kafka_toppar_remove: assert: rktp->rktp_removed==0 ***
Aborted (core dumped)

GDB back trace:

#0 0x0000003c24432625 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install cyrus-sasl-lib-2.1.23-15.el6_6.2.x86_64 cyrus-sasl-plain-2.1.23-15.el6_6.2.x86_64 db4-4.7.25-18.el6_4.x86_64 glibc-2.12-1.149.el6_6.7.x86_64 gperftools-libs-2.0-11.el6.3.x86_64 keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-37.el6_6.x86_64 libcom_err-1.41.12-21.el6.x86_64 libcurl-7.19.7-40.el6_6.4.x86_64 libgcc-4.4.7-11.el6.x86_64 libidn-1.18-2.el6.x86_64 libselinux-2.0.94-5.8.el6.x86_64 libssh2-1.4.2-1.el6_6.1.x86_64 libstdc++-4.4.7-11.el6.x86_64 libunwind-1.1-2.el6.x86_64 libuuid-2.17.2-12.18.el6.x86_64 nspr-4.10.6-1.el6_5.x86_64 nss-3.16.2.3-3.el6_6.x86_64 nss-softokn-freebl-3.14.3-22.el6_6.x86_64 nss-util-3.16.2.3-2.el6_6.x86_64 openssl-1.0.1e-30.el6.8.x86_64 xerces-c-3.0.1-20.el6.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0 0x0000003c24432625 in raise () from /lib64/libc.so.6
#1 0x0000003c24433e05 in abort () from /lib64/libc.so.6
#2 0x00007f4cc612ad33 in rd_kafka_crash (file=file@entry=0x7f4cc617f6a5 "rdkafka_partition.c", line=line@entry=207, function=function@entry=0x7f4cc6180460 <FUNCTION.22025> "rd_kafka_toppar_remove", rk=rk@entry=0x0,
reason=reason@entry=0x7f4cc617f70a "assert: rktp->rktp_removed==0") at rdkafka.c:2395
#3 0x00007f4cc6169b31 in rd_kafka_toppar_remove (rktp=rktp@entry=0x7f4c68000b40) at rdkafka_partition.c:207
#4 0x00007f4cc6169b52 in rd_kafka_toppar_destroy_final (rktp=0x7f4c68000b40) at rdkafka_partition.c:225
#5 0x00007f4cc614dd8c in rd_kafka_op_destroy (rko=0x7f4c680238b0) at rdkafka_op.c:106
#6 0x00007f4cc613b827 in rd_kafka_broker_op_serve (rkb=rkb@entry=0x13275f0, rko=0x7f4c680238b0) at rdkafka_broker.c:3016
#7 0x00007f4cc613c32b in rd_kafka_broker_serve (rkb=rkb@entry=0x13275f0, timeout_ms=timeout_ms@entry=10) at rdkafka_broker.c:3035
#8 0x00007f4cc613c689 in rd_kafka_broker_ua_idle (rkb=rkb@entry=0x13275f0, timeout_ms=timeout_ms@entry=0) at rdkafka_broker.c:3095
#9 0x00007f4cc613cf29 in rd_kafka_broker_thread_main (arg=arg@entry=0x13275f0) at rdkafka_broker.c:4407
#10 0x00007f4cc6171107 in _thrd_wrapper_function (aArg=<optimized out>) at tinycthread.c:611
#11 0x0000003c24c079d1 in start_thread () from /lib64/libpthread.so.0
#12 0x0000003c244e88fd in clone () from /lib64/libc.so.6

Note that I am using the KafkaConsumer, not the legacy Consumer, and I do not have a custom rebalance callback.

Thanks.

@edenhill
Contributor

It looks like this happens when different brokers report different partition counts for the same topic.

@jleducq

jleducq commented Sep 15, 2016

Just saw that this bug was part of 0.9.2 milestone. Do you have any date in mind for the release? Thanks.

@edenhill edenhill removed the consumer label Sep 15, 2016
@edenhill
Contributor

I've reproduced it and will fix it soon.

@edenhill
Contributor

This is fixed on the partition_changes branch.

@edenhill
Contributor

This issue is now fixed on master, please try to verify the fix in your environment.
Thank you.
