
Fix messages being processed twice by the same consumer instance after rebalancing #591

Closed
wants to merge 102 commits

Conversation

svroonland
Collaborator

@svroonland commented Jan 16, 2023

Fix for issue #590

After rebalancing, partitions that were revoked and then reassigned to the consumer are resumed from the last committed offset, which is lower than the last fetched offset that may still be in the partition stream's buffer. This leads to the same consumer instance processing messages for the revoked topic-partition twice. The issue is exacerbated for higher values of perPartitionChunkPrefetch and the bufferSize of Consumer.plainStream.

The fix minimizes this issue by closing the old partition streams, waiting until they are fully drained so that the last offset can be committed by the user code (stream topology matters here; e.g. batching offsets over multiple partitions makes the issue worse), seeking to the last committed offset, and only then starting to fetch for the partition again.
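
Roughly, the seek step could look like the following sketch against the raw Kafka consumer API (conceptual only, not the actual Runloop code; `seekToCommitted` is a name made up for illustration):

```scala
import java.util.{ Map => JMap }
import scala.jdk.CollectionConverters._

import org.apache.kafka.clients.consumer.{ KafkaConsumer, OffsetAndMetadata }
import org.apache.kafka.common.TopicPartition

// Conceptual sketch: after the revoked partition streams have been drained
// and their offsets committed, reposition the consumer to the committed
// offset before fetching resumes.
def seekToCommitted(
  consumer: KafkaConsumer[Array[Byte], Array[Byte]],
  partitions: Set[TopicPartition]
): Unit = {
  // committed() maps a partition to null when nothing was ever committed for it.
  val committed: JMap[TopicPartition, OffsetAndMetadata] =
    consumer.committed(partitions.asJava)
  partitions.foreach { tp =>
    Option(committed.get(tp)).foreach { offsetAndMetadata =>
      consumer.seek(tp, offsetAndMetadata.offset()) // resume right after the last commit
    }
  }
}
```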

When using plainStream, there is a race condition between finalizing the partition stream and committing the last offsets, so some messages may still be processed twice. When using partitionedStream and committing offsets as part of each partition's stream, the number of messages processed twice can be reduced to zero.
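
To illustrate the zero-duplicates variant, a minimal sketch of the per-partition commit pattern with the zio-kafka 2.x API (`processRecord` and the topic name are placeholders):

```scala
import zio._
import zio.kafka.consumer.{ CommittableRecord, Consumer, Subscription }
import zio.kafka.serde.Serde

// Placeholder for the application's processing logic.
def processRecord(record: CommittableRecord[String, String]): Task[Unit] = ???

// Offsets are committed inside each partition's stream, so draining a revoked
// stream also commits its last offset before the partition is resumed.
val consume: ZIO[Consumer, Throwable, Unit] =
  Consumer
    .partitionedStream(Subscription.topics("my-topic"), Serde.string, Serde.string)
    .flatMapPar(Int.MaxValue) { case (_, partitionStream) =>
      partitionStream
        .mapZIO(record => processRecord(record).as(record.offset))
        .aggregateAsync(Consumer.offsetBatches) // batch commits per partition only
        .mapZIO(_.commit)
    }
    .runDrain
```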

The restartStreamsOnRebalancing setting is removed; the former behavior for the value true is now the default. As a consequence, the semantics of the partitionedAssignmentStream method improve: each emitted chunk of partition streams now represents a generation of the consumer group.
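
A sketch of what the improved semantics make possible; this only logs each generation's assignment (a real consumer would of course also run the partition streams):

```scala
import zio._
import zio.kafka.consumer.{ Consumer, Subscription }
import zio.kafka.serde.Serde

// Every chunk emitted by partitionedAssignmentStream is the complete
// assignment for one consumer group generation.
val generations: ZIO[Consumer, Throwable, Unit] =
  Consumer
    .partitionedAssignmentStream(Subscription.topics("my-topic"), Serde.string, Serde.string)
    .zipWithIndex
    .mapZIO { case (assignment, generation) =>
      ZIO.logInfo(s"Generation $generation: assigned ${assignment.map(_._1).mkString(", ")}")
    }
    .runDrain
```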

Also in this PR:

  • Some added ZIO logging with annotations for topic and partition (see the sketch after this list).
  • Fix for the throughput issue (#428), replacing "Increase commit vs poll fairness" (#430).
  • Removed the use of a currentState Ref outside of the runFoldZIO by moving more logic from the rebalance handler into handlePoll.
  • Improved support for the CooperativeStickyAssignor by examining the consumer group generation ID after rebalance events and adding more tests.
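
For the logging item above, a small illustration of the kind of annotation meant; the annotation keys are assumptions, not necessarily the exact keys used in this PR:

```scala
import zio._
import org.apache.kafka.common.TopicPartition

// Attaches topic and partition annotations to every log line emitted by `zio`.
def annotated[R, E, A](tp: TopicPartition)(zio: ZIO[R, E, A]): ZIO[R, E, A] =
  ZIO.logAnnotate("topic", tp.topic()) {
    ZIO.logAnnotate("partition", tp.partition().toString)(zio)
  }
```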

TODO:

  • Fix (flaky) tests after latest merge
  • See if we can refactor the restarting behavior a bit more outside of handlePoll
  • Figure out how to deal with buffered records in combination with seek()ing.
    When do buffered records (= records for unrequested partitions in a poll result) appear?
  • Transactional produce & commit test for the CooperativeStickyAssignor: what is the expected behavior in this case?

@erikvanoosten
Collaborator

I was asked for a review, so I'll try. This change is too large for me to give a qualified opinion that is correct in the details.
Instead I'll just give a few high-level worries.

This PR proposes to ignore the last committed offset and instead 'seek' to the last known consumed offset.

My first worry is that this only solves part of the problem. When a partition is removed from one consumer and assigned to another, the other consumer does not know what has been consumed and has no choice but to look at what has been committed. In other words, when a partition moves to another consumer, we will continue to have duplicates.

Another worry is that consumption is no guarantee of processing; only commits give this guarantee. This makes the strategy of this PR unsuitable for conservative use cases where every message is important.

My final worry is that the current approach is highly dependent on knowing when a rebalance is happening. Unfortunately, using the rebalance listener for that is not reliable. It could be that the only change is that a partition is revoked. In this case we do not detect the end of a rebalance, because only onRevoke will be called and not onAssigned.

IMHO the best solution is to make sure a revoked stream gets the chance to finish its work and commit everything before the rebalance starts. This is the purpose of the onRevoked callback of the rebalance listener. This means quite an overhaul of how the runloop works and I am happy to assist with doing this.
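
Concretely, the direction I have in mind looks roughly like this (a sketch only; `awaitStreamsDrainedAndCommitted` is a hypothetical hook into the runloop, not an existing API):

```scala
import java.util.{ Collection => JCollection }
import scala.jdk.CollectionConverters._

import org.apache.kafka.clients.consumer.ConsumerRebalanceListener
import org.apache.kafka.common.TopicPartition

// Kafka holds up the rebalance until onPartitionsRevoked returns, so blocking
// here gives the revoked streams time to finish their work and commit.
def rebalanceListener(
  awaitStreamsDrainedAndCommitted: Set[TopicPartition] => Unit
): ConsumerRebalanceListener =
  new ConsumerRebalanceListener {
    override def onPartitionsRevoked(partitions: JCollection[TopicPartition]): Unit =
      awaitStreamsDrainedAndCommitted(partitions.asScala.toSet)

    override def onPartitionsAssigned(partitions: JCollection[TopicPartition]): Unit =
      () // nothing special needed on assignment in this sketch
  }
```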

@erikvanoosten
Collaborator

It appears I was mistaken about a few things. Let me correct that here:

This PR proposes to ignore the last committed offset and instead 'seek' to the last known consumed offset.

I understand now that this PR will seek to the last known committed offset. This sounds safe to me!

My first worry is that this only solves part of the problem...

This worry stands. However, this PR still makes the situation better.

Another worry is that consumption is no guarantee for processing, only commits give this guarantee...

This worry now falls away.

My final worry is that the current approach is highly dependent on knowing when a rebalance is happening....

This worry was taken away by another PR. The Runloop no longer depends on this.

In other words, this PR looks much better than I thought before!

@svroonland
Collaborator Author

#788 will implement a significant portion of this PR. We should probably harvest some things from this PR into separate PRs.
