"Bad message format" error in consumers during rebalance #1696
Comments
We have seen similar issues that have since been fixed. Could you try to reproduce this on the latest librdkafka master? Thanks
The issue also exists on librdkafka version 0.11.3. Some details about the corrupted partitions, in case they help to understand the root cause: all 33 partitions with a partition number > 100 and broker 2 as leader are corrupted. The other 134 partitions are fine and are being consumed without any issue. The corruption looks like a domino effect: some of these partitions were consumed after the rebalance, but became corrupted a couple of minutes later and have not been consumed since.
Another FYI - What could be a cause of the issue?
@edenhill
Can you reproduce this on latest master?
Tried on master; it happens there too. I have attached some logs that seem useful.
It seems odd that the error log says "Broker down" while the broker was up. The same partitions are consumed again after restarting the consumers. Please let me know if you need more information. @edenhill
If this is easy to reproduce, could you try the
Yes, it's easy to reproduce: start the consumers while there is significant lag. Attaching the logs from the badmsg branch.
I think this is related to confluentinc/confluent-kafka-go#100 (comment). The workaround for now is to set fetch.message.max.bytes to a reasonable value given the number of partitions you consume, and then make sure that receive.message.max.bytes is at least fetch.message.max.bytes * numPartitions + 5%.
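For anyone applying the workaround above, here is a minimal sketch of the sizing arithmetic. It is only an illustration: the helper names are hypothetical, the 1 MiB fetch size is an assumed example value rather than a recommendation, and "+ 5%" is interpreted as 5% of fetch.message.max.bytes * numPartitions. The 50 partitions match the per-consumer assignment after the rebalance reported in this issue.

```python
def min_receive_message_max_bytes(fetch_message_max_bytes: int,
                                  num_partitions: int,
                                  headroom_percent: int = 5) -> int:
    """Lower bound for receive.message.max.bytes per the workaround:
    fetch.message.max.bytes * numPartitions, plus a percentage of headroom.

    Hypothetical helper; integer arithmetic avoids float rounding surprises.
    """
    if fetch_message_max_bytes <= 0 or num_partitions <= 0:
        raise ValueError("fetch size and partition count must be positive")
    base = fetch_message_max_bytes * num_partitions
    # Round the headroom up so the result never falls below the bound.
    headroom = (base * headroom_percent + 99) // 100
    return base + headroom


def consumer_config_sketch(fetch_message_max_bytes: int,
                           num_partitions: int) -> dict:
    """Build the two properties named in the workaround (hypothetical helper).

    Other consumer settings (brokers, group id, ...) are left out on purpose.
    """
    return {
        "fetch.message.max.bytes": fetch_message_max_bytes,
        "receive.message.max.bytes": min_receive_message_max_bytes(
            fetch_message_max_bytes, num_partitions
        ),
    }


if __name__ == "__main__":
    # Assumed example: 1 MiB per-partition fetch size, 50 assigned partitions.
    print(consumer_config_sketch(1024 * 1024, 50))
```

Note that the partition count per consumer can grow after a rebalance (25 to 50 in this report), so the bound is best sized for the largest assignment a single consumer may receive.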
Makes sense.
Description
I have 8 consumers consuming from 200 partitions, with a message rate of 150 per second and a message size of ~2 KB. The consumers are evenly balanced, each consuming from 25 partitions. I have 3 brokers with a replication factor of 3.
I shut down 4 consumer processes. The partitions are rebalanced correctly and each remaining consumer is assigned 50 partitions. But some partitions are never consumed after the reassignment. The consumer error log says "Bad message format" for these partitions. All such partitions have the same broker as their leader. Even if I restart the consumers that were shut down, I am stuck in this loop because the partitions somehow became corrupted during the first rebalance.
What is the issue here?