"Bad message format" error in consumers during rebalance #1696

Closed
krishanfrog opened this issue Feb 19, 2018 · 10 comments

Comments

@krishanfrog

Description

I have 8 consumers consuming from 200 partitions, with a message rate of 150 messages/sec and a message size of ~2 KB. The consumers are evenly balanced, each consuming from 25 partitions. I have 3 brokers with a replication factor of 3.

I shut down 4 of the consumer processes. The partitions are rebalanced correctly and each remaining consumer is assigned 50 partitions. However, some partitions are never consumed after the reassignment; the consumer error log says "Bad message format" for these partitions. All such partitions have the same broker as their leader. Even if I start the consumers that were shut down, I stay stuck in this loop because the partitions somehow got corrupted during the first rebalance.

What is the issue here?

Checklist

  • librdkafka version: 0.11.1
  • Apache Kafka version: 0.11.0.0
  • librdkafka client configuration (global): { "partition.assignment.strategy": "roundrobin", "enable.auto.commit": true, "auto.commit.interval.ms": 1000, "queued.min.messages": 1, "fetch.message.max.bytes": 102400, "queued.max.messages.kbytes": 512000, "receive.message.max.bytes": 4194304, "fetch.error.backoff.ms": 100, "enable.partition.eof": false, "socket.keepalive.enable": true }; topic configuration ("topicConfig"): { "auto.offset.reset": "earliest" } (see the configuration sketch below, after the broker logs)
  • Operating system: Ubuntu 14.04
  • Verified that all consumers run the same librdkafka version (0.11.1)
  • Broker logs during rebalancing:
[2018-02-05 13:00:27,039] INFO [GroupCoordinator 1]: Preparing to rebalance group 'X' with old generation 19 (__consumer_offsets-19) (kafka.coordinator.group.GroupCoordinator)
[2018-02-05 13:00:29,754] INFO [GroupCoordinator 1]: Stabilized group 'X' generation 20 (__consumer_offsets-19) (kafka.coordinator.group.GroupCoordinator)
[2018-02-05 13:00:29,799] INFO [GroupCoordinator 1]: Assignment received from leader for group 'X' for generation 20 (kafka.coordinator.group.GroupCoordinator)
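
For reference, a minimal sketch of how the configuration reported in the checklist maps onto the librdkafka C API; "bootstrap.servers" and "group.id" are placeholders, not values taken from this issue.

/* Sketch only: the reported consumer configuration expressed via the
 * librdkafka C API. Broker list and group id are placeholders. */
#include <librdkafka/rdkafka.h>
#include <stdio.h>

int main(void) {
    char errstr[512];
    rd_kafka_conf_t *conf = rd_kafka_conf_new();
    rd_kafka_topic_conf_t *tconf = rd_kafka_topic_conf_new();

    const char *kv[][2] = {
        {"bootstrap.servers", "localhost:9092"},          /* placeholder */
        {"group.id", "my-consumer-group"},                /* placeholder */
        {"partition.assignment.strategy", "roundrobin"},
        {"enable.auto.commit", "true"},
        {"auto.commit.interval.ms", "1000"},
        {"queued.min.messages", "1"},
        {"fetch.message.max.bytes", "102400"},
        {"queued.max.messages.kbytes", "512000"},
        {"receive.message.max.bytes", "4194304"},
        {"fetch.error.backoff.ms", "100"},
        {"enable.partition.eof", "false"},
        {"socket.keepalive.enable", "true"},
    };
    for (size_t i = 0; i < sizeof(kv) / sizeof(kv[0]); i++)
        if (rd_kafka_conf_set(conf, kv[i][0], kv[i][1],
                              errstr, sizeof(errstr)) != RD_KAFKA_CONF_OK)
            fprintf(stderr, "%s: %s\n", kv[i][0], errstr);

    /* Topic-level default: start from the earliest offset when the group
     * has no committed offset for a partition. */
    if (rd_kafka_topic_conf_set(tconf, "auto.offset.reset", "earliest",
                                errstr, sizeof(errstr)) != RD_KAFKA_CONF_OK)
        fprintf(stderr, "auto.offset.reset: %s\n", errstr);
    rd_kafka_conf_set_default_topic_conf(conf, tconf);  /* conf now owns tconf */

    rd_kafka_t *rk = rd_kafka_new(RD_KAFKA_CONSUMER, conf, errstr, sizeof(errstr));
    if (!rk) {
        fprintf(stderr, "rd_kafka_new failed: %s\n", errstr);
        return 1;
    }
    rd_kafka_destroy(rk);
    return 0;
}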
@edenhill
Contributor

We have seen similar issues that have since been fixed; could you try to reproduce this on the latest librdkafka master?

Thanks

@krishanfrog
Author

krishanfrog commented Feb 23, 2018

The issue also exists on librdkafka 0.11.3.
I enabled debug logging for "cgrp,topic,fetch", but it doesn't give much information about the error.
The broker logs are also the same.

FYI about the corrupted partitions, in case it helps identify the root cause:

All 33 partitions numbered > 100 whose leader is broker 2 are corrupted.
All 33 partitions numbered < 100 whose leader is broker 1 are corrupted.

The other 134 partitions are fine and are being consumed without any issue.

The corruption looks like a domino effect: some of these partitions were consumed after the rebalance but became corrupted a couple of minutes later and have not been consumed since.

@edenhill

@krishanfrog
Author

Another FYI:
After restarting the consumer processes, the messages in the corrupted partitions are being consumed again.

What could be causing this?

@krishanfrog
Author

@edenhill
Any suggestions? Thanks

@edenhill
Contributor

edenhill commented Mar 6, 2018

Can you reproduce this on latest master?

@krishanfrog
Author

Tried on master; it happens there too.

I've attached some logs that seem useful:

Broker-0:9092/2: failed: err: Local: Bad message format: (errno: Resource temporarily unavailable)
BROKERFAIL
Broker-0:9092/2: Broker changed state UP -> DOWN
STATE
Broker-0:9092/2: Purging bufq with 1 buffers
BUFQ
Broker-0:9092/2: Fetch backoff for 100ms: Local: Bad message format
BACKOFF
Broker-0:9092/2: Updating 0 buffers on connection reset
BUFQ
Requesting metadata for 1/1 topics: broker down
METADATA
Request metadata for 1 topic(s): broker down
METADATA
Sent MetadataRequest (v2, 45 bytes @ 0, CorrId 561)
SEND
Received MetadataResponse (v2, 8521 bytes, CorrId 595, rtt 1.32ms)
RECV
===== Received metadata (for 1 requested topics): broker down =====
ClusterId: <Id>, ControllerId: 1
3 brokers, 1 topics
Broker #0/3: Broker-0:9092 NodeId 2
Broker #1/3: Broker-1:9092 NodeId 1
Broker #2/3: Broker-2:9092 NodeId 0
Topic #0/1: <topic> with 200 partitions
Topic <topic> partition 137 Leader 2

It seems odd that the error log says the broker is down even though the broker was up. The same partitions are consumed again once the consumers are restarted.

Please let me know if you need more information on the same. @edenhill

@edenhill
Contributor

If this is easy to reproduce, could you try the badmsg branch, which contains some extra log messages to track down the source of the BAD_MSG errors?

@krishanfrog
Author

Yes, it's easy to reproduce when there is significant lag and the consumers are then started.

Attaching the logs from the badmsg branch:

20thMarchBadMsg.txt.zip

@edenhill
Contributor

I think this is related to confluentinc/confluent-kafka-go#100 (comment):

Okay, so I think I know what the issue is.
fetch.message.max.bytes changed from meaning the maximum number of bytes to fetch in total per request to the maximum number of bytes to fetch per partition per request.
So if you are fetching 10 partitions with a fetch.message.max.bytes of 1M and your receive.message.max.bytes is 5M, it is possible for the returned response to be 1M * 10 = 10M, thus > 5M.

The workaround for now is to set fetch.message.max.bytes to a reasonable value given the number of partitions you consume, and then to make sure that receive.message.max.bytes is at least fetch.message.max.bytes * numPartitions + 5%.
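
To make the sizing concrete, here is a small, hypothetical arithmetic check (not librdkafka code) using the figures reported earlier in this thread: a 102400-byte per-partition fetch limit, a 4194304-byte receive limit, and 25 partitions per consumer before the rebalance versus 50 after.

/* Hypothetical sizing check using the numbers reported in this issue; it only
 * illustrates the arithmetic behind the workaround above. */
#include <stdio.h>

int main(void) {
    const long fetch_message_max_bytes   = 102400;    /* per partition, per fetch request */
    const long receive_message_max_bytes = 4194304;   /* cap on a single response */
    const long partitions_before = 25;                /* 8 consumers, 200 partitions */
    const long partitions_after  = 50;                /* 4 consumers, 200 partitions */

    long worst_before = fetch_message_max_bytes * partitions_before;  /* 2,560,000 */
    long worst_after  = fetch_message_max_bytes * partitions_after;   /* 5,120,000 */

    printf("before rebalance: worst-case fetch response %ld bytes (%s the receive limit)\n",
           worst_before,
           worst_before <= receive_message_max_bytes ? "within" : "exceeds");
    printf("after rebalance:  worst-case fetch response %ld bytes (%s the receive limit)\n",
           worst_after,
           worst_after <= receive_message_max_bytes ? "within" : "exceeds");

    /* Workaround from the comment above:
     * receive.message.max.bytes >= fetch.message.max.bytes * numPartitions + 5% */
    long recommended = worst_after + worst_after / 20;                 /* 5,376,000 */
    printf("recommended receive.message.max.bytes >= %ld\n", recommended);
    return 0;
}

Under these assumed numbers the worst-case response only exceeds the receive limit once each consumer owns 50 partitions, which would match the error appearing only after the rebalance.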

@krishanfrog
Author

Makes sense.
