Consumer offset "stuck" on certain partitions #1072

Open
Shamshiel opened this issue Apr 9, 2021 · 17 comments

@Shamshiel

Observed behavior
I consume messages from a topic that has 24 partitions. I started consuming from the beginning, and at first everything was fine, but after some time the consumer stopped consuming messages from certain partitions.

The issue is very similar to issue #562, but I'm using the current version of KafkaJS (v1.15.0), so I'm at a loss as to what the problem could be. As far as I'm aware, the topic also uses log compaction.

I wrote a simple partition assigner that only consumes from the partitions that were "stuck", and I added some console.log statements to the KafkaJS code (consumerGroup.js) to debug the problem further. I got to the point where the response from broker.fetch always contained zero messages; a sketch of that kind of assigner and the actual response follow below.
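For context, a minimal sketch of that kind of restricted assigner, following the custom assigner interface from the KafkaJS docs (the topic name and partition list here are placeholders, not the exact code used):

const { AssignerProtocol } = require('kafkajs')
const { MemberMetadata, MemberAssignment } = AssignerProtocol

// Partitions under investigation; adjust to whichever partitions are "stuck".
const STUCK_PARTITIONS = [1]

const SpecificPartitionAssigner = ({ cluster }) => ({
  name: 'SpecificPartitionAssigner',
  version: 1,

  // Hand all stuck partitions to the first member and nothing to the others,
  // so the fetch loop only touches the partitions being debugged.
  async assign({ members, topics }) {
    return members.map(({ memberId }, index) => ({
      memberId,
      memberAssignment: MemberAssignment.encode({
        version: this.version,
        assignment: index === 0 ? { MyTopic: STUCK_PARTITIONS } : {},
      }),
    }))
  },

  protocol({ topics }) {
    return {
      name: this.name,
      metadata: MemberMetadata.encode({
        version: this.version,
        topics,
      }),
    }
  },
})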

This was the response:

[
  {
    topicName: "MyTopic",
    partitions: [
      {
        partition: 1,
        errorCode: 0,
        highWatermark: "532672",
        lastStableOffset: "532672",
        lastStartOffset: "0",
        abortedTransactions: [],
        preferredReadReplica: -1,
        messages: [],
      },
    ],
  }
]

The offset that was used to fetch the next messages was like this:
{ MyTopic: { '1': '484158' } }

There are clearly still messages to consume, but the fetch always returns zero messages because offset 484158 is used every time. I changed the offset manually via the admin interface to a higher, valid offset, and after that the consumer worked again.
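For anyone needing the same workaround, the manual offset bump can also be done with the KafkaJS admin client. A sketch using the group, topic, and offset values from above (the consumer group must not have any running members while the offsets are being set):

const admin = kafka.admin()
await admin.connect()

// Move the consumer group past the stuck offset on partition 1.
await admin.setOffsets({
  groupId: 'MyGroupId',
  topic: 'MyTopic',
  partitions: [{ partition: 1, offset: '484207' }],
})

await admin.disconnect()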

Expected behavior
I would expect to receive all messages up to the latest offset.

Environment:

  • OS: Mac OS 10.15.7
  • KafkaJS version 1.15.0
  • Kafka version 2.6.1
  • NodeJS version 12.18.3

Additional context
If further logs are needed, I can provide them. I couldn't find any useful debug messages related to this problem.

@tulios
Owner

tulios commented Apr 9, 2021

Can you check the size of the message at offset 484158? What is the format of the messages (JSON, Avro, etc.)? I've seen this issue many times when maxBytes is configured at 1 MB and there is a 2 MB message. It could be something else, but I'm just checking the common case first.
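If it helps, a rough sketch of one way to check this from Node: a throwaway consumer group that seeks to the suspicious offset and logs message sizes. The seek-after-run pattern follows the KafkaJS docs; the topic, partition, and offset values are taken from this issue, purely as an illustration.

const probe = kafka.consumer({ groupId: 'offset-probe-' + Date.now() })
await probe.connect()
await probe.subscribe({ topic: 'MyTopic', fromBeginning: false })

await probe.run({
  eachMessage: async ({ partition, message }) => {
    const keySize = message.key ? message.key.length : 0
    const valueSize = message.value ? message.value.length : 0
    console.log(`partition ${partition} offset ${message.offset}: ~${keySize + valueSize} bytes`)
  },
})

// seek has to be called after run(); point partition 1 at the suspicious offset
probe.seek({ topic: 'MyTopic', partition: 1, offset: '484158' })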

@Shamshiel
Author

Thank you for the very quick answer. The messages are Avro encoded.
I'm currently not sure how big the message at this offset is, but usually they shouldn't be bigger than 20 KB.

I already tried the following settings for my consumer:
const consumer = kafka.consumer({
  partitionAssigners: [SpecificPartitionAssigner],
  groupId: 'MyGroupId',
  maxBytes: 2147483647,
  maxBytesPerPartition: 214748364,
});

(I forgot to mention this above because I had already tried so many different things.)

@Shamshiel
Author

Shamshiel commented Apr 9, 2021

I think there might still be a problem with log compaction and offsets?

Here are the offsets of partition 1 in order as they appear:
...
offset = 484134,
offset = 484135,
offset = 484136,
offset = 484143,
offset = 484156,
offset = 484157, <-- this was the last offset that KafkaJS received
offset = 484207  <-- this is the next available offset
...

I played around with the offsets a bit, and the next offset that works is "484197". I'm not sure whether this is helpful, but there isn't actually a message at that offset; the next message KafkaJS received (as mentioned above) is at offset 484207.
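For checking this kind of thing, the group's committed offset can be compared with the partition watermarks via the admin client. A sketch, with the group and topic names from above and the API as of KafkaJS 1.15:

const admin = kafka.admin()
await admin.connect()

// High/low watermarks for each partition of the topic.
const watermarks = await admin.fetchTopicOffsets('MyTopic')
// e.g. [{ partition: 1, offset: '532672', high: '532672', low: '0' }, ...]

// Offsets currently committed by the consumer group.
const committed = await admin.fetchOffsets({ groupId: 'MyGroupId', topic: 'MyTopic' })
// e.g. [{ partition: 1, offset: '484158' }, ...]

console.log({ watermarks, committed })
await admin.disconnect()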

@Shamshiel
Author

I tested two other clients (C# and Java) and both were able to consume past this offset (484158) with the same configuration.

As I mentioned above, KafkaJS receives zero messages from Kafka when it tries to fetch at this offset (484158). I'm not sure, but maybe the other clients also receive zero messages, check the highest offset of the partition, and simply skip the fetch offset if the partition's end offset is higher?
I'm quite new to Kafka, but if this Kafka proposal was implemented, Fetch should always return at least one message. So maybe the other clients treat an empty fetch as an error when the partition's end offset is higher than the fetch offset, and skip ahead.

@Nevon
Collaborator

Nevon commented Apr 12, 2021

This feels incredibly familiar, like we worked on this bug before, but maybe I'm just having déjà vu.

Related issues from other clients:

EDIT: I knew I had worked on this before: #577. I guess there's some other case that isn't covered by that fix.

Nevon added the bug label Apr 12, 2021
@Shamshiel
Author

@Nevon Do you need any further details to find the cause of this bug?

@Nevon
Collaborator

Nevon commented Apr 12, 2021

I'm not sure I will have the time to look into this myself, but a way to reproduce the issue would be great. The relevant topic configuration would be a start, but ideally a script that creates a topic and produces messages in whatever way is needed to trigger the bug would be 💯

@alfamegaxq

I have a similar problem.

The last offset successfully consumed was 221235053 and the next available offset is 221306740, a difference of roughly 70k between the two messages.

The consumer is stuck and does not consume any further; it constantly tries to fetch offset 221235053 and gets no messages back.

I have to set a ridiculously high maxBytes for the consumer to get past it to the next offset. But that shouldn't be the solution, because it's not optimal to fetch such a large number of messages all at once.

I think there should be a check for whether a batch is empty but not the last one, either by using the offsets API or by checking whether the Fetch API returned OFFSET_OUT_OF_RANGE. A rough sketch of that kind of check, done as an application-level watchdog, follows below.
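Purely as an illustration of the idea (not the internal fix): a watchdog outside KafkaJS that notices a partition whose committed offset has stopped moving even though the high watermark is ahead, and seeks forward. All names are placeholders, admin and consumer are assumed to be connected kafkajs instances, and seeking to the high watermark skips the messages in between.

// Check once a minute whether any partition is "stuck".
const STALL_CHECK_INTERVAL_MS = 60000
const lastSeen = {} // partition -> committed offset observed on the previous check

setInterval(async () => {
  const watermarks = await admin.fetchTopicOffsets('MyTopic')
  const committed = await admin.fetchOffsets({ groupId: 'MyGroupId', topic: 'MyTopic' })

  for (const { partition, offset } of committed) {
    const partitionInfo = watermarks.find((w) => w.partition === partition)
    if (!partitionInfo) continue

    const stalled =
      lastSeen[partition] === offset && Number(offset) < Number(partitionInfo.high)

    if (stalled) {
      // The group has not moved past `offset` although the log continues:
      // jump to the high watermark (this skips everything in between!).
      consumer.seek({ topic: 'MyTopic', partition, offset: partitionInfo.high })
    }

    lastSeen[partition] = offset
  }
}, STALL_CHECK_INTERVAL_MS)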

@ThomasFerro
Contributor

Hi,

We have what seems to be a similar issue.

One of our partitions is stuck at the same offset for all three of our consumer groups. It is also a compacted topic, and using eachMessage instead of eachBatch does not help.

How can we help resolve the issue? Do you know of any workaround other than moving the offset manually?

Thanks!

@anaistournoisAdeo

Hi,

We have also hit the same issue twice this week. Each time, one or two partitions of a 3-partition topic were stuck at the same offset for all of our consumer groups (this topic is compacted too). Do you know if anyone has made progress on this issue? Is there a way we can help solve it?

Thanks in advance

@Nevon
Collaborator

Nevon commented Feb 8, 2022

Is there a way we can help solve it?

Like I mentioned a year ago, a way to consistently reproduce the issue is the best way to resolve it. Ideally a fork with a failing test, but even just a script that creates a topic and produces to it with whatever parameters are required to trigger the bug, and then a consumer that gets stuck, would be helpful.
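Not a confirmed reproduction, but as a starting-point skeleton along the lines described above: create a compacted topic, produce lots of messages with only a few distinct keys so compaction leaves offset gaps, then consume from the beginning and watch for a stuck partition. The topic settings (segment.ms, min.cleanable.dirty.ratio) are guesses meant to make compaction kick in quickly, not values from any of the affected environments.

const { Kafka } = require('kafkajs')

const kafka = new Kafka({ clientId: 'compaction-repro', brokers: ['localhost:9092'] })
const topic = 'compaction-repro'

const run = async () => {
  const admin = kafka.admin()
  await admin.connect()
  await admin.createTopics({
    topics: [
      {
        topic,
        numPartitions: 3,
        configEntries: [
          { name: 'cleanup.policy', value: 'compact' },
          { name: 'segment.ms', value: '10000' },
          { name: 'min.cleanable.dirty.ratio', value: '0.01' },
        ],
      },
    ],
  })
  await admin.disconnect()

  const producer = kafka.producer()
  await producer.connect()
  for (let i = 0; i < 100000; i++) {
    // Only a handful of distinct keys, so most offsets become eligible for compaction.
    await producer.send({
      topic,
      messages: [{ key: `key-${i % 10}`, value: `value-${i}` }],
    })
  }
  await producer.disconnect()

  const consumer = kafka.consumer({ groupId: 'compaction-repro-group' })
  await consumer.connect()
  await consumer.subscribe({ topic, fromBeginning: true })
  await consumer.run({
    eachMessage: async ({ partition, message }) => {
      console.log(`partition ${partition} offset ${message.offset}`)
    },
  })
}

run().catch(console.error)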

@dhdnicodemus

Is there any consensus that this IS related to compaction? We're seeing something similar using 2.1.0, but not on a compacted topic; it does, however, also have 3 partitions.

@dhruvrathore93

Can you try reducing max.poll.records or increasing max.poll.interval.ms?
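Worth noting that max.poll.records and max.poll.interval.ms are settings of the Java client; KafkaJS doesn't expose them under those names. The nearest knobs on the KafkaJS consumer (values shown are the documented defaults) are roughly these:

const consumer = kafka.consumer({
  groupId: 'MyGroupId',
  sessionTimeout: 30000,         // broker gives up on the member after this many ms without heartbeats
  rebalanceTimeout: 60000,       // time allowed to rejoin the group during a rebalance
  heartbeatInterval: 3000,
  maxBytesPerPartition: 1048576, // caps how much is fetched per partition per request
  maxWaitTimeInMs: 5000,         // how long the broker may wait to fill minBytes
})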

@dhdnicodemus

I will look into it, thanks.

@dhdnicodemus

A quick question @dhruvrathore93: if either of these were the problem, wouldn't we be seeing the broker rebalance the consumer group?

@dhdnicodemus

One more follow-up: does KafkaJS support setting max.poll.records?

@NorDroN

NorDroN commented Jun 5, 2023

Any updates on this?
