Consumer stalled after commit failure during rebalance #2933

Closed
gridaphobe opened this issue Jun 12, 2020 · 13 comments

@gridaphobe
Contributor


Description

If an OffsetCommit request overlaps with a rebalance, the partition fetcher threads are not restarted until the OffsetCommit response is received. If the commit fails, they are not restarted at all, and the consumer appears to hang.

How to reproduce

See https://gist.github.com/gridaphobe/d1c544631c9569af810b405e572144cd. We force a commit failure by unassigning ourselves before committing, which causes the commit to be processed by the broker after the generation id has been incremented. (The actual consumer code where we discovered this issue does an asynchronous commit on a regular interval, which only sometimes overlaps with rebalances in this way.) Start one instance of the consumer, and after it has received a few messages start another one. After the rebalance, only the new consumer will receive messages even though the partitions have been evenly distributed. The original consumer never recovers, even if you shut down the other one.
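For illustration, here is a rough C sketch of the same idea against the librdkafka C API (broker address, topic name, and group id are placeholders; the linked gist has the actual reproduction):

/* Reproduction sketch (illustrative only; see the linked gist for the real
 * code).  Build with: cc repro.c -lrdkafka */
#include <stdio.h>
#include <librdkafka/rdkafka.h>

int main(void) {
        char errstr[512];
        rd_kafka_conf_t *conf = rd_kafka_conf_new();

        /* Placeholder broker/group/topic names. */
        rd_kafka_conf_set(conf, "bootstrap.servers", "localhost:9092",
                          errstr, sizeof(errstr));
        rd_kafka_conf_set(conf, "group.id", "test", errstr, sizeof(errstr));
        rd_kafka_conf_set(conf, "enable.auto.commit", "false",
                          errstr, sizeof(errstr));
        rd_kafka_conf_set(conf, "debug", "cgrp,topic,fetch",
                          errstr, sizeof(errstr));

        rd_kafka_t *rk = rd_kafka_new(RD_KAFKA_CONSUMER, conf,
                                      errstr, sizeof(errstr));
        rd_kafka_poll_set_consumer(rk);

        rd_kafka_topic_partition_list_t *topics =
                rd_kafka_topic_partition_list_new(1);
        rd_kafka_topic_partition_list_add(topics, "test_topic",
                                          RD_KAFKA_PARTITION_UA);
        rd_kafka_subscribe(rk, topics);
        rd_kafka_topic_partition_list_destroy(topics);

        while (1) {
                rd_kafka_message_t *msg = rd_kafka_consumer_poll(rk, 1000);
                if (!msg)
                        continue;
                if (!msg->err) {
                        /* Force the failure mode: drop the assignment, then
                         * commit the message's offset.  The broker ends up
                         * processing the commit after the generation id has
                         * been bumped, so the commit fails with
                         * REBALANCE_IN_PROGRESS / ILLEGAL_GENERATION. */
                        rd_kafka_topic_partition_list_t *offs =
                                rd_kafka_topic_partition_list_new(1);
                        rd_kafka_topic_partition_list_add(
                                offs, rd_kafka_topic_name(msg->rkt),
                                msg->partition)->offset = msg->offset + 1;
                        rd_kafka_assign(rk, NULL);
                        fprintf(stderr, "commit: %s\n",
                                rd_kafka_err2str(
                                        rd_kafka_commit(rk, offs, 0)));
                        rd_kafka_topic_partition_list_destroy(offs);
                }
                rd_kafka_message_destroy(msg);
        }
        /* not reached */
}

Run one instance, let it consume a few messages, then start a second instance to trigger the rebalance.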


@gridaphobe
Contributor Author

I forgot to mention: I also tested this against librdkafka 1.4.0, and the consumer does not stall there. Here's the relevant portion of the debug logs.

%7|1592004943.002|JOIN|rdkafka#consumer-1| [thrd:main]: localhost:62625/2: Joining group "test" with 1 subscribed topic(s)
%7|1592004943.002|CGRPJOINSTATE|rdkafka#consumer-1| [thrd:main]: Group "test" changed join state init -> wait-join (v6, state up)
%7|1592004943.002|CGRPOP|rdkafka#consumer-1| [thrd:main]: Group "test" received op OFFSET_COMMIT (v0) in state up (join state wait-join, v6 vs 0)
%7|1592004943.008|COMMIT|rdkafka#consumer-1| [thrd:main]: GroupCoordinator/2: Committing offsets for 4 partition(s): manual
%7|1592004943.008|JOINGROUP|rdkafka#consumer-1| [thrd:main]: JoinGroup response: GenerationId 11, Protocol range, LeaderId rdkafka-d98c643b-b00d-4c7c-b7d3-c70e75dbaa43 (me), my MemberId rdkafka-d98c643b-b00d-4c7c-b7d3-c70e75dbaa43, 2 members in group: (no error)
%7|1592004943.008|JOINGROUP|rdkafka#consumer-1| [thrd:main]: Elected leader for group "test" with 2 member(s)
%7|1592004943.009|GRPLEADER|rdkafka#consumer-1| [thrd:main]: Group "test": resetting group leader info: JoinGroup response clean-up
%7|1592004943.009|CGRPJOINSTATE|rdkafka#consumer-1| [thrd:main]: Group "test" changed join state wait-join -> wait-metadata (v6, state up)
%7|1592004943.009|COMMIT|rdkafka#consumer-1| [thrd:main]: GroupCoordinator/2: OffsetCommit for 4 partition(s): manual: returned: Broker: Group rebalance in progress
%4|1592004943.009|COMMITFAIL|rdkafka#consumer-1| [thrd:main]: Offset commit (manual) failed for 4/4 partition(s): Broker: Group rebalance in progress: test_topic[0]@72(Broker: Group rebalance in progress), test_topic[1]@63(Broker: Group rebalance in progress), test_topic[2]@55(Broker: Group rebalance in progress), test_topic[3]@67(Broker: Group rebalance in progress)
%7|1592004943.009|ASSIGN|rdkafka#consumer-1| [thrd:main]: Group "test" running range assignment for 2 member(s):
%7|1592004943.009|ASSIGN|rdkafka#consumer-1| [thrd:main]:  Member "rdkafka-d98c643b-b00d-4c7c-b7d3-c70e75dbaa43" (me) with 1 subscription(s):
%7|1592004943.009|ASSIGN|rdkafka#consumer-1| [thrd:main]:   test_topic [-1]
%7|1592004943.009|ASSIGN|rdkafka#consumer-1| [thrd:main]:  Member "rdkafka-998173de-95f6-48ce-bba1-0b564bbea3c0" with 1 subscription(s):
%7|1592004943.009|ASSIGN|rdkafka#consumer-1| [thrd:main]:   test_topic [-1]
%7|1592004943.009|ASSIGN|rdkafka#consumer-1| [thrd:main]: range: Topic test_topic with 4 partition(s) and 2 subscribing member(s)
%7|1592004943.009|ASSIGN|rdkafka#consumer-1| [thrd:main]: range: Member "rdkafka-998173de-95f6-48ce-bba1-0b564bbea3c0": assigned topic test_topic partitions 0..1
%7|1592004943.009|ASSIGN|rdkafka#consumer-1| [thrd:main]: range: Member "rdkafka-d98c643b-b00d-4c7c-b7d3-c70e75dbaa43": assigned topic test_topic partitions 2..3
%7|1592004943.010|ASSIGN|rdkafka#consumer-1| [thrd:main]: Group "test" range assignment for 2 member(s) finished in 0.051ms:
%7|1592004943.010|ASSIGN|rdkafka#consumer-1| [thrd:main]:  Member "rdkafka-d98c643b-b00d-4c7c-b7d3-c70e75dbaa43" (me) assigned 2 partition(s):
%7|1592004943.010|ASSIGN|rdkafka#consumer-1| [thrd:main]:   test_topic [2]
%7|1592004943.010|ASSIGN|rdkafka#consumer-1| [thrd:main]:   test_topic [3]
%7|1592004943.010|ASSIGN|rdkafka#consumer-1| [thrd:main]:  Member "rdkafka-998173de-95f6-48ce-bba1-0b564bbea3c0" assigned 2 partition(s):
%7|1592004943.010|ASSIGN|rdkafka#consumer-1| [thrd:main]:   test_topic [0]
%7|1592004943.010|ASSIGN|rdkafka#consumer-1| [thrd:main]:   test_topic [1]
%7|1592004943.010|ASSIGNOR|rdkafka#consumer-1| [thrd:main]: Group "test": "range" assignor run for 2 member(s)
%7|1592004943.010|CGRPJOINSTATE|rdkafka#consumer-1| [thrd:main]: Group "test" changed join state wait-metadata -> wait-sync (v6, state up)
%7|1592004943.011|SYNCGROUP|rdkafka#consumer-1| [thrd:main]: SyncGroup response: Success (34 bytes of MemberState data)
%7|1592004943.011|ASSIGN|rdkafka#consumer-1| [thrd:main]: Group "test": delegating assign of 2 partition(s) to application rebalance callback on queue rd_kafka_cgrp_new: new assignment
%7|1592004943.011|CGRPJOINSTATE|rdkafka#consumer-1| [thrd:main]: Group "test" changed join state wait-sync -> wait-assign-rebalance_cb (v6, state up)
%7|1592004943.011|CGRPOP|rdkafka#consumer-1| [thrd:main]: Group "test" received op ASSIGN (v0) in state up (join state wait-assign-rebalance_cb, v6 vs 0)
%7|1592004943.011|ASSIGN|rdkafka#consumer-1| [thrd:main]: Group "test": new assignment of 2 partition(s) in join state wait-assign-rebalance_cb
%7|1592004943.011|BARRIER|rdkafka#consumer-1| [thrd:main]: Group "test": rd_kafka_cgrp_assign:2566: new version barrier v7
%7|1592004943.011|ASSIGN|rdkafka#consumer-1| [thrd:main]: Group "test": assigning 2 partition(s) in join state wait-assign-rebalance_cb
%7|1592004943.011|CGRPJOINSTATE|rdkafka#consumer-1| [thrd:main]: Group "test" changed join state wait-assign-rebalance_cb -> assigned (v7, state up)
%7|1592004943.011|BARRIER|rdkafka#consumer-1| [thrd:main]: Group "test": rd_kafka_cgrp_partitions_fetch_start0:1848: new version barrier v8
%7|1592004943.011|FETCHSTART|rdkafka#consumer-1| [thrd:main]: Group "test": starting fetchers for 2 assigned partition(s) in join-state assigned (usable_offsets=no, v8, line 2619)

It looks like the OffsetCommit response is received before the new partitions are assigned, so the assign() is not deferred pending the commit. That seems like a race condition, but the ordering of events is very consistent in my testing: in 1.4.0 the OffsetCommit fails before we assign(), and in 1.4.2 it fails afterwards.

@gridaphobe
Contributor Author

I just bisected the issue to commit 757b376.

@gridaphobe
Contributor Author

@edenhill I think I see what's going on here. The above commit changed the semantics of OffsetCommit failures; in particular, it now retries the OffsetCommit on RD_KAFKA_RESP_ERR_REBALANCE_IN_PROGRESS errors, which keeps the commit outstanding long enough that the partition fetchers are not restarted when the new assignment is received. Eventually the OffsetCommit fails permanently with RD_KAFKA_RESP_ERR_ILLEGAL_GENERATION, but the partition fetchers are still not restarted.

I can fix the issue by changing the error action for RD_KAFKA_RESP_ERR_REBALANCE_IN_PROGRESS to RD_KAFKA_ERR_ACTION_PERMANENT, which seems pretty sensible to me. If a rebalance is in progress, we know the commit will eventually fail due to an old generation id, so why bother retrying? (I think you could argue similarly that retrying RD_KAFKA_RESP_ERR_ILLEGAL_GENERATION is pointless.)
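Roughly, the change amounts to classifying that error as permanent in the OffsetCommit response handling. A simplified, hypothetical sketch of the policy (the names below are illustrative, not the actual librdkafka source):

#include <librdkafka/rdkafka.h>

/* Hypothetical policy sketch; librdkafka drives this through its own
 * error-action handling rather than a helper like this. */
typedef enum { ERR_ACTION_RETRY, ERR_ACTION_PERMANENT } err_action_t;

static err_action_t offset_commit_err_action(rd_kafka_resp_err_t err) {
        switch (err) {
        case RD_KAFKA_RESP_ERR_REBALANCE_IN_PROGRESS:
                /* Today: retried, which keeps the commit outstanding across
                 * the rebalance and blocks the fetcher restart.
                 * Proposed: fail fast -- the retried commit would only die
                 * later with ILLEGAL_GENERATION anyway. */
                return ERR_ACTION_PERMANENT;
        case RD_KAFKA_RESP_ERR_ILLEGAL_GENERATION:
                return ERR_ACTION_PERMANENT;
        default:
                /* Other errors are out of scope for this sketch. */
                return ERR_ACTION_RETRY;
        }
}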

But this fix doesn't sit well with me. Why does an outstanding commit request prevent restarting the partition fetchers in the first place? If there's a good reason, would it make sense to unconditionally restart them when the request succeeds or fails?

@edenhill
Contributor

Wow, very impressed by your root cause analysis! 💯

@edenhill
Contributor

This will not make v1.4.4; we'll address it in v1.5.0.

@edenhill
Contributor

edenhill commented Jul 7, 2020

But this fix doesn't sit well with me. Why does an outstanding commit request prevent restarting the partition fetchers in the first place? If there's a good reason, would it make sense to unconditionally restart them when the request succeeds or fails?

The reason is that the fetcher might need to resume from the committed offset (which is the default behaviour), so we want to make sure any outstanding commits are done before we try to read the committed offsets back from the broker.
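For context, this is the kind of read-back the fetcher's starting point depends on. A small application-level sketch (assuming rk is the consumer handle) that fetches the committed offsets for the current assignment from the group coordinator:

#include <stdio.h>
#include <inttypes.h>
#include <librdkafka/rdkafka.h>

/* Illustration: read the committed offsets back from the broker for the
 * current assignment.  An in-flight OffsetCommit could still change what
 * this returns, which is why the fetcher waits for it to complete. */
static void log_committed_offsets(rd_kafka_t *rk) {
        rd_kafka_topic_partition_list_t *parts = NULL;

        if (rd_kafka_assignment(rk, &parts) != RD_KAFKA_RESP_ERR_NO_ERROR)
                return;

        if (rd_kafka_committed(rk, parts, 5000 /* ms */) ==
            RD_KAFKA_RESP_ERR_NO_ERROR) {
                for (int i = 0; i < parts->cnt; i++)
                        printf("%s [%"PRId32"] committed offset %"PRId64"\n",
                               parts->elems[i].topic,
                               parts->elems[i].partition,
                               parts->elems[i].offset);
        }

        rd_kafka_topic_partition_list_destroy(parts);
}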

@edenhill edenhill added this to the v1.6.0 milestone Jul 7, 2020
@gridaphobe
Contributor Author

Ok, so we want to delay starting the fetcher until the OffsetCommit request returns to ensure it has an accurate starting point, which could either be (1) the offset we're trying to commit (if the OffsetCommit request succeeds) or (2) the previously-committed offset on the cluster (if the OffsetCommit request fails). In that case, we'd want to unconditionally restart the fetchers when outstanding OffsetCommits return, regardless of success/failure, right?
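As a strawman, the proposed behaviour could look something like this (purely illustrative types and names, not librdkafka's internals):

#include <librdkafka/rdkafka.h>

/* Hypothetical state for illustration only; the real cgrp state machine is
 * more involved. */
typedef struct {
        int outstanding_commits;
        int has_pending_assignment;
} cgrp_state_t;

static void start_partition_fetchers(cgrp_state_t *cgrp) {
        /* Placeholder: in librdkafka this would start the per-partition
         * fetchers for the current assignment. */
        (void)cgrp;
}

/* Proposed behaviour: when an OffsetCommit completes -- success or failure
 * alike -- and it was the last one outstanding, (re)start the fetchers. */
static void on_offset_commit_done(cgrp_state_t *cgrp, rd_kafka_resp_err_t err) {
        (void)err;  /* outcome deliberately ignored */
        if (--cgrp->outstanding_commits == 0 && cgrp->has_pending_assignment)
                start_partition_fetchers(cgrp);
}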

edenhill added a commit that referenced this issue Sep 16, 2020
…2933)

This also refactors the cgrp's OffsetCommit handling.
@guttulasunil

Thanks for sharing the issue, and nice RCA. Is there any workaround for this, other than using v1.4.0?

@edenhill
Contributor

I think it might be possible to call assign() again on the current assignment.
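Untested, but in C terms the workaround would look roughly like this (rk being the stalled consumer's handle):

#include <librdkafka/rdkafka.h>

/* Workaround sketch (untested): re-apply the current assignment so the
 * partition fetchers get (re)started. */
static void reassign_current_assignment(rd_kafka_t *rk) {
        rd_kafka_topic_partition_list_t *cur = NULL;

        if (rd_kafka_assignment(rk, &cur) != RD_KAFKA_RESP_ERR_NO_ERROR)
                return;
        rd_kafka_assign(rk, cur);
        rd_kafka_topic_partition_list_destroy(cur);
}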

@guttulasunil

Ok, let me check that.

edenhill added a commit that referenced this issue Sep 22, 2020
…2933)

This also refactors the cgrp's OffsetCommit handling.
@mkevac

mkevac commented Oct 7, 2020

Hi! Any plans to fix this? We are seeing this constantly in production and it's a huge problem for us. Thanks.

@edenhill
Contributor

There's a fix for #2933 in librdkafka v1.5.2 that you'll definitely want to have.

andremissaglia added a commit to arquivei/goduck that referenced this issue Nov 12, 2020
Besides keeping the worker updated, this release solves a bug where the consumer would randomly get stuck during rebalances.

Links:
 - confluentinc/librdkafka#2933
@edenhill
Contributor

Can't reproduce on master

mfelsche pushed a commit to tremor-rs/tremor-runtime that referenced this issue Sep 28, 2021
This mitigates confluentinc/librdkafka#2933 which is fixed in 1.5.2.
This issue was leading to hanging consumers when they tried to commit during a rebalance operation.

Signed-off-by: Matthias Wahl <mwahl@wayfair.com>
Licenser pushed a commit to tremor-rs/tremor-runtime that referenced this issue Oct 29, 2021
This mitigates confluentinc/librdkafka#2933 which is fixed in 1.5.2.
This issue was leading to hanging consumers when they tried to commit during a rebalance operation.

Signed-off-by: Matthias Wahl <mwahl@wayfair.com>
Signed-off-by: Heinz N. Gies <heinz@licenser.net>