Fluctuations and resets in committed offsets due to temporary network outage #4427

Closed · 7 tasks done
krasin-ga opened this issue Sep 10, 2023 · 1 comment · Fixed by #4447

krasin-ga commented Sep 10, 2023

Description

We were testing a network outage scenario in which one of three data centers became unavailable, and after the data center came back online we noticed strange fluctuations and resets in committed offsets. I've observed some anomalies in the logs that might be related.

Temporary errors in host resolution that result in offsets being reset:

[thrd:main]: test-topic [2]: offset reset (at offset INVALID (leader epoch 6059), broker 1014) to offset BEGINNING (leader epoch -1): Unable to validate offset and epoch: Local: Host resolution failure: Local: Partition log truncation detected

[thrd:main]: test-topic [19]: offset reset (at offset BEGINNING (leader epoch -1), broker 1014) to offset BEGINNING (leader epoch -1): failed to query logical offset: Local: Host resolution failure

Suspicious updates of committed offsets

2023-09-08T00:13:46.246Z // Started at correct offset

Consumer in the group "testbench-1000": "[thrd:main]: Partition test-topic [0] start fetching at offset 57842809 (leader epoch 7007)" Code: "FETCH"; SysLevel: Debug;

2023-09-08T00:13:50.051Z // Race condition? The committed offset and leader epoch reported for partition 0 are actually those of partition 7 (see the log below)

Consumer in the group "testbench-1000": "[thrd:main]: Topic test-topic [0]: stored offset INVALID (leader epoch -1), committed offset 55171745 (leader epoch 6476): not including in commit" Code: "OFFSET"; SysLevel: Debug;

2023-09-08T00:13:50.058Z

Consumer in the group "testbench-1000": "[thrd:main]: Topic test-topic [7]: stored offset 55181424 (leader epoch 6476), committed offset 55171745 (leader epoch 6476): setting stored offset 55181424 (leader epoch 6476) for commit" Code: "OFFSET"; SysLevel: Debug;

2023-09-08T00:13:55.056Z // Back to normal

Consumer in the group "testbench-1000": "[thrd:main]: Topic test-topic [0]: stored offset 57842809 (leader epoch 7007), committed offset 57842809 (leader epoch 7007): not including in commit" Code: "OFFSET";

Here is a graph of the committed offsets by partition for that consumer group:
[graph: committed offsets by partition]

Please note that on our test bench the probability of encountering this race condition is higher because the Kubernetes pods running the consumer are constantly being throttled.
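
Not part of the original report, but relevant to reproducing it: librdkafka can surface these automatic resets to the application as consumer errors, so they can be monitored directly rather than inferred from debug logs. A minimal sketch, assuming a librdkafka version that reports RD_KAFKA_RESP_ERR__AUTO_OFFSET_RESET (available in v2.2.0); the log format is illustrative:

    /* Sketch: surface automatic offset resets via the standard error
     * callback instead of digging them out of debug logs. */
    #include <stdio.h>
    #include <librdkafka/rdkafka.h>

    static void error_cb(rd_kafka_t *rk, int err, const char *reason,
                         void *opaque) {
            (void)rk;
            (void)opaque;
            if (err == RD_KAFKA_RESP_ERR__AUTO_OFFSET_RESET)
                    /* The resets quoted above ("offset reset ... to offset
                     * BEGINNING ... Host resolution failure") land here. */
                    fprintf(stderr, "offset reset: %s\n", reason);
            else
                    fprintf(stderr, "consumer error %s: %s\n",
                            rd_kafka_err2name((rd_kafka_resp_err_t)err),
                            reason);
    }

    /* During consumer setup, before rd_kafka_new():
     *     rd_kafka_conf_set_error_cb(conf, error_cb);
     */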

Checklist

Please provide the following information:

  • librdkafka version (release number or git tag): v2.2.0
  • Apache Kafka version: v2.7.2
  • librdkafka client configuration (see the sketch after this list):
auto.offset.reset: earliest
  • Operating system: Debian 11.7
  • Provide logs (with debug=.. as necessary) from librdkafka
  • Provide broker log excerpts: N/A
  • Critical issue
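
For completeness, a minimal sketch (not from the report) of how that configuration maps onto the librdkafka C API. The bootstrap server is a placeholder; the group id and auto.offset.reset value come from the logs and checklist above, and error checking is omitted for brevity:

    #include <librdkafka/rdkafka.h>

    rd_kafka_t *create_consumer(void) {
            char errstr[512];
            rd_kafka_conf_t *conf = rd_kafka_conf_new();

            /* "broker:9092" is a placeholder; the rest mirrors the report. */
            rd_kafka_conf_set(conf, "bootstrap.servers", "broker:9092",
                              errstr, sizeof(errstr));
            rd_kafka_conf_set(conf, "group.id", "testbench-1000",
                              errstr, sizeof(errstr));
            /* This setting is what turns a failed offset/epoch validation
             * into the jump to BEGINNING seen in the logs above. */
            rd_kafka_conf_set(conf, "auto.offset.reset", "earliest",
                              errstr, sizeof(errstr));

            /* rd_kafka_new() takes ownership of conf on success. */
            return rd_kafka_new(RD_KAFKA_CONSUMER, conf, errstr,
                                sizeof(errstr));
    }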
emasab (Contributor) commented Sep 19, 2023

The offset reset was triggered by RD_KAFKA_RESP_ERR__RESOLVE, which corresponds to a host resolution failure during offset validation.

                        /* Permanent error */
                        rd_kafka_offset_reset(
                            rktp, rd_kafka_broker_id(rkb),
                            RD_KAFKA_FETCH_POS(RD_KAFKA_OFFSET_INVALID,
                                               rktp->rktp_leader_epoch),
                            RD_KAFKA_RESP_ERR__LOG_TRUNCATION,
                            "Unable to validate offset and epoch: %s",
                            rd_kafka_err2str(err));

We need to remove this reset and retry even in the case of a permanent error here, as the Java client does.
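
For illustration only, a hedged sketch of that direction as a drop-in replacement for the block quoted above (the actual change landed in #4447 and may differ): keep the current fetch position on a permanent validation error and schedule another validation attempt instead of resetting. The rd_kafka_toppar_set_fetch_state()/rd_kafka_offset_validate() calls mirror existing librdkafka internals; the exact retry plumbing here is an assumption.

                        /* Sketch, not the merged patch: on a permanent
                         * error, stay on the current position and re-run
                         * the offset/epoch validation instead of falling
                         * back to auto.offset.reset. */
                        rd_kafka_toppar_set_fetch_state(
                            rktp, RD_KAFKA_TOPPAR_FETCH_VALIDATE_EPOCH_WAIT);
                        rd_kafka_offset_validate(
                            rktp, "retrying offset/epoch validation: %s",
                            rd_kafka_err2str(err));

With that behavior a transient RESOLVE error would leave the committed offset untouched, instead of producing the resets to BEGINNING visible in the graph above.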
