Unprocessed records due to wrong offset in on_assign callback in consumer.subscribe() #1329

Closed
jindalshivam09 opened this issue Apr 25, 2022 · 7 comments

@jindalshivam09

Description

The TopicPartition list I receive after consumer rebalancing (in the on_assign callback) has the offset set to -1001 for every partition.

Timeline:

10:22:05 partition x was revoked
10:22:15 some messages were produced to partition x
10:22:17 the in-flight commit failed due to rebalancing (KafkaError.REBALANCE_IN_PROGRESS)
10:22:17 partition x was re-assigned to a consumer

That's it: the new consumer never read the messages produced during the rebalance.

Notable config:

'auto.offset.reset': 'latest'
'auto.commit.enable': False

'confluent_kafka.version()': cp-kafka:5.0.0-1
OS: linux

I checked the logs and saw that the on_assign callback receives a list of topic partitions with the offset set to -1001. Since this is an out-of-range offset, my guess is that the consumer falls back to 'latest', hence the missing messages.
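For context, a minimal sketch of the subscription described above, assuming the confluent-kafka-python API (broker address, group id, and topic name are placeholders):

```python
from confluent_kafka import Consumer

consumer = Consumer({
    'bootstrap.servers': 'localhost:9092',  # placeholder
    'group.id': 'my-group',                 # placeholder
    'auto.offset.reset': 'latest',
    'enable.auto.commit': False,
})

def on_assign(consumer, partitions):
    # After the rebalance, every TopicPartition arrives with offset == -1001
    for tp in partitions:
        print('assigned %s [%d] @ %d' % (tp.topic, tp.partition, tp.offset))

consumer.subscribe(['my-topic'], on_assign=on_assign)
```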

How to reproduce

Checklist

Please provide the following information:

  • confluent-kafka-python and librdkafka version (confluent_kafka.version() and confluent_kafka.libversion()):
  • Apache Kafka broker version:
  • Client configuration: {...}
  • Operating system:
  • Provide client logs (with 'debug': '..' as necessary)
  • Provide broker log excerpts
  • Critical issue
@edenhill
Contributor

-1001 is the unset offset, i.e., no value.
Since you only get a list of partitions in your rebalance callback, this is expected. If you pass partitions with offset -1001 to assign(), it will first try to fetch the committed offsets for those partitions, and if that fails it will resort to auto.offset.reset. You may also change the offset to a logical offset (BEGINNING, END) or an absolute offset (>= 0), as sketched below.
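For illustration, a hedged sketch of overriding the unset offsets before calling assign(), using confluent-kafka-python's exported offset constants:

```python
from confluent_kafka import OFFSET_BEGINNING, OFFSET_INVALID

def on_assign(consumer, partitions):
    for tp in partitions:
        if tp.offset == OFFSET_INVALID:    # -1001: no offset set
            tp.offset = OFFSET_BEGINNING   # or OFFSET_END, or an absolute offset >= 0
    consumer.assign(partitions)            # replaces the default assignment
```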

If you have auto-commit disabled and fail to commit on rebalance (because the partition is no longer owned), then there will be no committed offset to resume from, so the consumer will employ auto.offset.reset, which you've set to latest, thus skipping messages.
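A sketch of that failure mode, assuming a synchronous commit attempt in the revoke callback (error handling simplified):

```python
from confluent_kafka import KafkaException

def on_revoke(consumer, partitions):
    # Try to commit processed offsets before ownership is lost. If the group
    # is already rebalancing, this can fail (e.g. REBALANCE_IN_PROGRESS),
    # leaving no committed offset to resume from on re-assignment.
    try:
        consumer.commit(asynchronous=False)
    except KafkaException as e:
        print('commit on revoke failed: %s' % e)
```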

Do note, though, that you are setting auto.commit.enable, which is a legacy property that should not be used (you should see a warning on startup); set enable.auto.commit instead.

A recommended approach to control what is committed is to set enable.auto.offset.store=false, leave enable.auto.commit=true (the default) as is, and then call store_offsets() after processing a message.
This ensures that only processed messages will be committed, while the actual committing of offsets to the broker is taken care of automatically.
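A sketch of that pattern (broker address, topic, and the processing function are placeholders):

```python
from confluent_kafka import Consumer

def process(msg):
    pass  # application logic (placeholder)

consumer = Consumer({
    'bootstrap.servers': 'localhost:9092',  # placeholder
    'group.id': 'my-group',                 # placeholder
    'enable.auto.commit': True,             # default: commits happen automatically...
    'enable.auto.offset.store': False,      # ...but only for offsets we explicitly store
})
consumer.subscribe(['my-topic'])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    process(msg)
    consumer.store_offsets(message=msg)  # eligible for commit only after processing
```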

@jindalshivam09
Author

jindalshivam09 commented Apr 26, 2022

> If you have auto-commit disabled...

Even if one commit fails, there might be earlier successful commits by other consumers; shouldn't it use those? Or could there be no earlier committed offset in this case (which seems highly unlikely)?

Also, we don't want to set the offset to -2 (BEGINNING) in assign() (as mentioned here), as it would lead to re-processing all the messages.

> Do note, though, that you are setting auto.commit.enable...

My bad, we are actually setting both.

> A recommended approach to control...

What's the added benefit here?

@jindalshivam09
Author

@edenhill ping..

@edenhill
Contributor

edenhill commented May 3, 2022

> Even if one commit fails, there might be earlier successful commits by other consumers; shouldn't it use those? Or could there be no earlier committed offset in this case (which seems highly unlikely)?

It will use a previously committed offset first, if available.
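A sketch of how one might verify this from the client, using the Consumer.committed() API:

```python
def on_assign(consumer, partitions):
    # Ask the broker for the committed offsets of the assigned partitions;
    # partitions with no committed offset come back with offset == -1001.
    committed = consumer.committed(partitions, timeout=10.0)
    for tp in committed:
        print('%s [%d] committed @ %d' % (tp.topic, tp.partition, tp.offset))
```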

> What's the added benefit here?

At-least-once delivery and fine-grained offset commit control.

@jindalshivam09
Author

Thanks @edenhill for answering the questions.

@jindalshivam09
Author

I have noticed that every entry in the TopicPartition list we receive as part of consumer reassignment in the on_assign event has its offset set to -1001. For instance: TopicPartition{topic=<topic_name>,partition=2,offset=-1001,error=None}

This happens regardless of whether a previously committed offset is present. I looked in server.log on the broker side but didn't find anything other than:

[2022-05-13 22:12:11,255] INFO [GroupCoordinator 1]: Preparing to rebalance group <consumer_id> with old generation 0 (__consumer_offsets-32) (kafka.coordinator.group.GroupCoordinator)
[2022-05-13 22:12:14,256] INFO [GroupCoordinator 1]: Stabilized group <consumer_id> generation 1 (__consumer_offsets-32) (kafka.coordinator.group.GroupCoordinator)
[2022-05-13 22:12:24,256] INFO [GroupCoordinator 1]: Member <id> in group <consumer_id> has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2022-05-13 22:12:24,256] INFO [GroupCoordinator 1]: Preparing to rebalance group <consumer_id> with old generation 1 (__consumer_offsets-32) (kafka.coordinator.group.GroupCoordinator)
[2022-05-13 22:12:24,256] INFO [GroupCoordinator 1]: Group <consumer_id> with generation 2 is now empty (__consumer_offsets-32) (kafka.coordinator.group.GroupCoordinator)
[2022-05-13 22:15:09,399] INFO [GroupMetadataManager brokerId=1] Group <consumer_id> transitioned to Dead in generation 2 (kafka.coordinator.group.GroupMetadataManager)

We are also seeing continuous rebalancing (assignments and revokes).

@sjindal-moveworks

sjindal-moveworks commented May 16, 2022

> Since you only get a list of partitions in your rebalance callback, this is expected.

Oh, this explains the -1001 behavior.
