Consumer offset "stuck" on certain partitions #1072

Open
Shamshiel opened this issue Apr 9, 2021 · 17 comments

@Shamshiel

Observed behavior
I consume messages from a topic that has 24 partitions. I started consuming from the beginning, and at first everything was fine, but after some time the consumer stopped consuming messages from certain partitions.

The issue is very similar to issue #562, but I'm using the current version of KafkaJS (v1.15.0), so I'm at a loss as to what the problem could be. As far as I'm aware, the topic also uses log compaction.

I wrote a simple partition assigner that only consumes from the partitions that were "stuck", and I added some console.log statements to the KafkaJS code (consumerGroup.js) to debug the problem further. I got to the point where the response from broker.fetch always contained zero messages; a sketch of that kind of assigner and the actual response follow below.
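For context, a minimal sketch of that kind of restricted assigner, following the custom assigner interface from the KafkaJS docs (the topic name and partition list here are placeholders, not the exact code used):

const { AssignerProtocol } = require('kafkajs')
const { MemberMetadata, MemberAssignment } = AssignerProtocol

// Partitions under investigation; adjust to whichever partitions are "stuck".
const STUCK_PARTITIONS = [1]

const SpecificPartitionAssigner = ({ cluster }) => ({
  name: 'SpecificPartitionAssigner',
  version: 1,

  // Hand all stuck partitions to the first member and nothing to the others,
  // so the fetch loop only touches the partitions being debugged.
  async assign({ members, topics }) {
    return members.map(({ memberId }, index) => ({
      memberId,
      memberAssignment: MemberAssignment.encode({
        version: this.version,
        assignment: index === 0 ? { MyTopic: STUCK_PARTITIONS } : {},
      }),
    }))
  },

  protocol({ topics }) {
    return {
      name: this.name,
      metadata: MemberMetadata.encode({
        version: this.version,
        topics,
      }),
    }
  },
})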

This was the response:

[
  {
    topicName: "MyTopic",
    partitions: [
      {
        partition: 1,
        errorCode: 0,
        highWatermark: "532672",
        lastStableOffset: "532672",
        lastStartOffset: "0",
        abortedTransactions: [],
        preferredReadReplica: -1,
        messages: [],
      },
    ],
  }
]

The offset that was used to fetch the next messages was like this:
{ MyTopic: { '1': '484158' } }

There are clearly still messages to consume, but the fetch always returns zero messages because offset 484158 is used every time. I changed the offset manually via the admin interface to a higher, valid offset, and after that the consumer worked again.
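For anyone needing the same workaround, the manual offset bump can also be done with the KafkaJS admin client. A sketch using the group, topic, and offset values from above (the consumer group must not have any running members while the offsets are being set):

const admin = kafka.admin()
await admin.connect()

// Move the consumer group past the stuck offset on partition 1.
await admin.setOffsets({
  groupId: 'MyGroupId',
  topic: 'MyTopic',
  partitions: [{ partition: 1, offset: '484207' }],
})

await admin.disconnect()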

Expected behavior
I would expect to receive all messages up to the latest offset.

Environment:

  • OS: Mac OS 10.15.7
  • KafkaJS version 1.15.0
  • Kafka version 2.6.1
  • NodeJS version 12.18.3

Additional context
If further logs are needed, I can provide them. I couldn't find any useful debug messages related to this problem.

@tulios
Owner

tulios commented Apr 9, 2021

Can you check the size of the message at offset 484158? What is the format of the messages (JSON, Avro, etc.)? I've seen this issue many times when maxBytes is configured at 1 MB and there is a 2 MB message. It could be something else, but I'm just checking the common case first.
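If it helps, a rough sketch of one way to check this from Node: a throwaway consumer group that seeks to the suspicious offset and logs message sizes. The seek-after-run pattern follows the KafkaJS docs; the topic, partition, and offset values are taken from this issue, purely as an illustration.

const probe = kafka.consumer({ groupId: 'offset-probe-' + Date.now() })
await probe.connect()
await probe.subscribe({ topic: 'MyTopic', fromBeginning: false })

await probe.run({
  eachMessage: async ({ partition, message }) => {
    const keySize = message.key ? message.key.length : 0
    const valueSize = message.value ? message.value.length : 0
    console.log(`partition ${partition} offset ${message.offset}: ~${keySize + valueSize} bytes`)
  },
})

// seek has to be called after run(); point partition 1 at the suspicious offset
probe.seek({ topic: 'MyTopic', partition: 1, offset: '484158' })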

@Shamshiel
Author

Thank you for the very quick answer. The messages are Avro encoded.
I'm currently not sure how big the message at this offset is, but usually they shouldn't be bigger than 20 KB.

I already tried the following settings for my consumer:
const consumer = kafka.consumer({
  partitionAssigners: [SpecificPartitionAssigner],
  groupId: 'MyGroupId',
  maxBytes: 2147483647,
  maxBytesPerPartition: 214748364,
});

(I forgot to mention this above because I had already tried so many different things.)

@Shamshiel
Author

Shamshiel commented Apr 9, 2021

I think there might still be a problem with log compaction and offsets?

Here are the offsets of partition 1 in order as they appear:
...
offset = 484134,
offset = 484135,
offset = 484136,
offset = 484143,
offset = 484156,
offset = 484157, <-- this was the last offset that KafkaJS received
offset = 484207  <-- this is the next available offset
...

I played around with the offsets a bit, and the next offset that works is "484197". I'm not sure whether this is helpful, but there isn't actually a message at that offset; the next message KafkaJS received (as mentioned above) is at offset 484207.
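For checking this kind of thing, the group's committed offset can be compared with the partition watermarks via the admin client. A sketch, with the group and topic names from above and the API as of KafkaJS 1.15:

const admin = kafka.admin()
await admin.connect()

// High/low watermarks for each partition of the topic.
const watermarks = await admin.fetchTopicOffsets('MyTopic')
// e.g. [{ partition: 1, offset: '532672', high: '532672', low: '0' }, ...]

// Offsets currently committed by the consumer group.
const committed = await admin.fetchOffsets({ groupId: 'MyGroupId', topic: 'MyTopic' })
// e.g. [{ partition: 1, offset: '484158' }, ...]

console.log({ watermarks, committed })
await admin.disconnect()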

@Shamshiel
Author

I tested two other clients (C# and Java) and both were able to consume past this offset (484158) with the same configuration.

As I mentioned above, KafkaJS receives zero messages from Kafka when it tries to fetch at this offset (484158). I'm not sure, but maybe the other clients also receive zero messages, check the highest offset of the partition, and simply skip the fetch offset if the partition's end offset is higher?
I'm quite new to Kafka, but if this Kafka proposal was implemented, Fetch should always return at least one message. So maybe the other clients treat an empty fetch as an error when the partition's end offset is higher than the fetch offset, and skip ahead.

@Nevon
Collaborator

Nevon commented Apr 12, 2021

This feels incredibly familiar, like we worked on this bug before, but maybe I'm just having déjà vu.

Related issues from other clients:

EDIT: I knew I had worked on this before: #577. I guess there's some other case that isn't covered by that fix.

Nevon added the bug label Apr 12, 2021
@Shamshiel
Author

@Nevon Do you need any further details to find the cause of this bug?

@Nevon
Collaborator

Nevon commented Apr 12, 2021

I'm not sure I will have the time to look into this myself, but a way to reproduce the issue would be great. The relevant topic configuration would be a start, but ideally a script that creates a topic and produces messages in whatever way is needed to trigger the bug would be 💯

@alfamegaxq

I have a similar problem.

The last offset successfully consumed was 221235053 and the next available offset is 221306740, a difference of roughly 70k between the two messages.

The consumer is stuck and does not consume any further; it constantly tries to fetch offset 221235053 and gets no messages back.

I have to set a ridiculously high maxBytes for the consumer to get past it to the next offset. But that shouldn't be the solution, because it's not optimal to fetch such a large number of messages all at once.

I think there should be a check for whether a batch is empty but not the last one, either by using the offsets API or by checking whether the Fetch API returned OFFSET_OUT_OF_RANGE. A rough sketch of that kind of check, done as an application-level watchdog, follows below.
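Purely as an illustration of the idea (not the internal fix): a watchdog outside KafkaJS that notices a partition whose committed offset has stopped moving even though the high watermark is ahead, and seeks forward. All names are placeholders, admin and consumer are assumed to be connected kafkajs instances, and seeking to the high watermark skips the messages in between.

// Check once a minute whether any partition is "stuck".
const STALL_CHECK_INTERVAL_MS = 60000
const lastSeen = {} // partition -> committed offset observed on the previous check

setInterval(async () => {
  const watermarks = await admin.fetchTopicOffsets('MyTopic')
  const committed = await admin.fetchOffsets({ groupId: 'MyGroupId', topic: 'MyTopic' })

  for (const { partition, offset } of committed) {
    const partitionInfo = watermarks.find((w) => w.partition === partition)
    if (!partitionInfo) continue

    const stalled =
      lastSeen[partition] === offset && Number(offset) < Number(partitionInfo.high)

    if (stalled) {
      // The group has not moved past `offset` although the log continues:
      // jump to the high watermark (this skips everything in between!).
      consumer.seek({ topic: 'MyTopic', partition, offset: partitionInfo.high })
    }

    lastSeen[partition] = offset
  }
}, STALL_CHECK_INTERVAL_MS)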

@ThomasFerro
Contributor

Hi,

We have what seems to be a similar issue.

One of our partitions is stuck at the same offset for all three of our consumer groups. It is also a compacted topic, and using eachMessage instead of eachBatch does not help.

How can we help resolve the issue? Do you know of any workaround other than moving the offset manually?

Thanks!

@anaistournoisAdeo

Hi,

We have also hit the same issue twice this week. Each time, one or two partitions of a 3-partition topic were stuck at the same offset for all of our consumer groups (this topic is compacted too). Do you know if anyone has made progress on this issue? Is there a way we can help solve it?

Thanks in advance

@Nevon
Collaborator

Nevon commented Feb 8, 2022

Is there a way we can help solve it?

Like I mentioned a year ago, a way to consistently reproduce the issue is the best way to resolve it. Ideally a fork with a failing test, but even just a script that creates a topic and produces to it with whatever parameters are required to trigger the bug, and then a consumer that gets stuck, would be helpful.
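Not a confirmed reproduction, but as a starting-point skeleton along the lines described above: create a compacted topic, produce lots of messages with only a few distinct keys so compaction leaves offset gaps, then consume from the beginning and watch for a stuck partition. The topic settings (segment.ms, min.cleanable.dirty.ratio) are guesses meant to make compaction kick in quickly, not values from any of the affected environments.

const { Kafka } = require('kafkajs')

const kafka = new Kafka({ clientId: 'compaction-repro', brokers: ['localhost:9092'] })
const topic = 'compaction-repro'

const run = async () => {
  const admin = kafka.admin()
  await admin.connect()
  await admin.createTopics({
    topics: [
      {
        topic,
        numPartitions: 3,
        configEntries: [
          { name: 'cleanup.policy', value: 'compact' },
          { name: 'segment.ms', value: '10000' },
          { name: 'min.cleanable.dirty.ratio', value: '0.01' },
        ],
      },
    ],
  })
  await admin.disconnect()

  const producer = kafka.producer()
  await producer.connect()
  for (let i = 0; i < 100000; i++) {
    // Only a handful of distinct keys, so most offsets become eligible for compaction.
    await producer.send({
      topic,
      messages: [{ key: `key-${i % 10}`, value: `value-${i}` }],
    })
  }
  await producer.disconnect()

  const consumer = kafka.consumer({ groupId: 'compaction-repro-group' })
  await consumer.connect()
  await consumer.subscribe({ topic, fromBeginning: true })
  await consumer.run({
    eachMessage: async ({ partition, message }) => {
      console.log(`partition ${partition} offset ${message.offset}`)
    },
  })
}

run().catch(console.error)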

@dhdnicodemus

Is there any consensus that this IS related to compaction? We're seeing something similar using 2.1.0, but not on a compacted topic; it does, however, also have 3 partitions.

@dhruvrathore93

Can you try reducing max.poll.records or increasing max.poll.interval.ms?
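Worth noting that max.poll.records and max.poll.interval.ms are settings of the Java client; KafkaJS doesn't expose them under those names. The nearest knobs on the KafkaJS consumer (values shown are the documented defaults) are roughly these:

const consumer = kafka.consumer({
  groupId: 'MyGroupId',
  sessionTimeout: 30000,         // broker gives up on the member after this many ms without heartbeats
  rebalanceTimeout: 60000,       // time allowed to rejoin the group during a rebalance
  heartbeatInterval: 3000,
  maxBytesPerPartition: 1048576, // caps how much is fetched per partition per request
  maxWaitTimeInMs: 5000,         // how long the broker may wait to fill minBytes
})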

@dhdnicodemus

I will look into it, thanks.

@dhdnicodemus

A quick question @dhruvrathore93: if either of these were the problem, wouldn't we be seeing the broker rebalance the consumer group?

@dhdnicodemus

One more follow-up: does KafkaJS support setting max.poll.records?

@NorDroN

NorDroN commented Jun 5, 2023

Any updates on this?
