
KAFKA-16637: AsyncKafkaConsumer removes offset fetch responses from cache too aggressively#16310

Merged

chia7712 merged 23 commits into apache:trunk from kirktrue:KAFKA-16637-long-running-offset-fetch on Jun 15, 2024

Conversation

@kirktrue (Contributor)

Allow the committed offsets fetch to run for as long as needed. This handles the case where a user invokes Consumer.poll() with a very small timeout (including zero).
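The fix can be illustrated with a minimal standalone sketch (all names here are hypothetical, and a CompletableFuture stands in for the real offset-fetch event; this is not the actual consumer code): a short poll() timeout no longer discards the in-flight fetch, so a later poll() picks up the same request instead of restarting it.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class PendingFetchSketch {
    // Hypothetical stand-in for the in-flight offset-fetch event. The real
    // AsyncKafkaConsumer keeps a pendingOffsetFetchEvent field; this sketch
    // only models the "keep it across short poll() timeouts" behavior.
    private CompletableFuture<String> pendingFetch;

    String poll(long timeoutMs) {
        if (pendingFetch == null) {
            pendingFetch = startFetch(); // first poll() issues the fetch
        }
        try {
            String offsets = pendingFetch.get(timeoutMs, TimeUnit.MILLISECONDS);
            pendingFetch = null;         // result consumed; clear for the next fetch
            return offsets;
        } catch (TimeoutException e) {
            // The fix: on timeout, keep the pending fetch so the next poll()
            // can reuse the same in-flight request instead of restarting it.
            return null;
        } catch (Exception e) {
            pendingFetch = null;         // unrecoverable; drop the pending event
            throw new RuntimeException(e);
        }
    }

    private CompletableFuture<String> startFetch() {
        return CompletableFuture.supplyAsync(() -> {
            try { Thread.sleep(50); } catch (InterruptedException ignored) { }
            return "offsets";
        });
    }

    public static void main(String[] args) {
        PendingFetchSketch consumer = new PendingFetchSketch();
        String result = null;
        int polls = 0;
        while (result == null) {         // tight poll(0)-style loop
            result = consumer.poll(1);
            polls++;
        }
        System.out.println(result + " after " + polls + " polls");
    }
}
```

Before this change, each timed-out poll() would have abandoned the pending fetch, so a tight loop with a tiny timeout could restart the fetch forever and never complete.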

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@kirktrue (Contributor, Author)

@AndrewJSchofield @cadonna @lianetm @philipnee—please review this PR. It's an alternative take on #16241 that seems simpler and not fraught with peril.

Thanks!

@AndrewJSchofield (Member) left a comment

Thanks for the PR. I like the new approach and the PR looks pretty good. I have left one comment to do with exception handling.

@cadonna (Member) left a comment

Thanks for the PR, @kirktrue !

I like the approach.

Once @AndrewJSchofield's comments are addressed, I guess we are good to go.

WDYT, @lianetm ?

@lianetm (Member) commented Jun 13, 2024

Thanks for the PR @kirktrue, I love the simplicity of the approach and its consistency with the legacy logic. Thanks for your patience here :)

Same as @cadonna , LGTM once the error handling on getResult that @AndrewJSchofield pointed out is addressed.

Thanks!

@AndrewJSchofield (Member)

I'm happy with the approach now. Thanks @kirktrue.

@lianetm (Member) commented Jun 13, 2024

Thanks for the updates @kirktrue, happy with the approach too. I left an answer to your concerns about the tests; that can be done separately if you prefer.

@kirktrue (Contributor, Author)

@AndrewJSchofield @cadonna @lianetm @philipnee: this PR is ready to be re-reviewed. Thanks all for your input 😄

@kirktrue (Contributor, Author)

@jlprat—This Jira/PR is a blocker for the KIP-848 Java client work. It sounds like we're really close to merging this within the next day or two. Thanks!

@lianetm (Member) left a comment

Thanks for the fix and updates @kirktrue, LGTM.

@chia7712 (Member) left a comment

@kirktrue nice patch and approach. Two small questions are left. PTAL.

if (pendingOffsetFetchEvent == null)
return false;

if (!pendingOffsetFetchEvent.partitions().equals(partitions))
Member:

Can the pending event be reused if its partitions include all of the input partitions? It seems refreshCommittedOffsets can ignore the extra partitions.

Contributor Author:

@chia7712—that's definitely an interesting optimization!

IIUC, the suggestion is to relax the requirement to allow reuse if the partitions for the current request are a subset of (or equal to) the previous request, right? So basically:

Suggested change
if (!pendingOffsetFetchEvent.partitions().equals(partitions))
if (!pendingOffsetFetchEvent.partitions().containsAll(partitions))

The behavior of the existing LegacyKafkaConsumer is to allow reuse only if the partitions for the current request exactly equal those of the previous request. That logic is the basis for the behavior used in the AsyncKafkaConsumer. We've been deliberate about matching the behavior between the two Consumer implementations as closely as possible, unless there's a specific reason not to.

It's a small change, and it does make sense (to me). My main concern is that it introduces a subtle difference in behavior between the two Consumer implementations. Also, the specific case we're trying to solve with this change is when the user has passed in a very low timeout and we're in a tight poll() loop, which suggests the partitions wouldn't be changing between those loops (CMIIW).

If I understand correctly, this seems like an optimization, rather than something needed for correctness. If that's the case, can I file a new Jira to implement this when we have a little more time to investigate and test?

Thanks!
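The difference between the two reuse policies can be shown with a minimal, self-contained sketch (plain string sets stand in for TopicPartition, and the method names are hypothetical, not the actual consumer code):

```java
import java.util.Set;

public class ReuseCheck {
    // Current behavior: reuse the pending fetch only on an exact partition match.
    static boolean reuseExact(Set<String> pending, Set<String> requested) {
        return pending.equals(requested);
    }

    // Proposed relaxation: reuse whenever the pending fetch covers the request.
    static boolean reuseSubset(Set<String> pending, Set<String> requested) {
        return pending.containsAll(requested);
    }

    public static void main(String[] args) {
        Set<String> pending = Set.of("topic-0", "topic-1");
        Set<String> subset = Set.of("topic-0");

        System.out.println(reuseExact(pending, subset));   // false: not an exact match
        System.out.println(reuseSubset(pending, subset));  // true: pending covers the request
    }
}
```

Under the relaxed check, a caller asking for a subset of the in-flight partitions would still reuse the pending response; under the exact check it issues a new fetch.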

Member:

can I file a new Jira to implement this when we have a little more time to investigate and test?

Sure, and thanks for sharing. I wasn't even aware of the "match the behavior" goal before 🥲

Contributor Author:

I've filed KAFKA-16966 to track this optimization.

refreshCommittedOffsets(offsets, metadata, subscriptions);
return true;
} catch (TimeoutException e) {
log.error("Couldn't refresh committed offsets before timeout expired");
Member:

Does poll(0) fill users' logs with this error message? If so, should we change the log level or add more explanation to avoid freaking users out?

Member:

Good point! ERROR level seems excessive here. At least WARN should be used. However, WARN would still fill users' logs. So maybe DEBUG?

Contributor Author:

Changed to debug and updated the text of the log to be slightly more helpful.

Contributor Author:

I'm not sure where that ERROR-level logging crept in. I'd assumed it was a holdover from the LegacyKafkaConsumer, but its implementation doesn't log anything in the case of timeouts.

I think it's helpful to keep it there in DEBUG form.

@jlprat (Contributor) commented Jun 14, 2024

Hi @kirktrue. Let's check again when the PR is approved :)

@cadonna (Member) left a comment

Thanks for all the updates and for being open to all the feedback, @kirktrue!

LGTM!

However, you should address the concern about excessive logging before we merge.

Thanks to all the other reviewers for their input!


} catch (InterruptException e) {
throw e;
} catch (Throwable t) {
// Clear the pending event on errors that are not timeout- or interrupt-related.
Member:

nit: Could you please remove the comment? It does not really add any information because it just spells out what the code does.

Contributor Author:

Removed.

throw ConsumerUtils.maybeWrapAsKafkaException(t);
} finally {
if (shouldClearPendingEvent)
pendingOffsetFetchEvent = null;
Member:

nit: Why not clear pendingOffsetFetchEvent at the sites where you set shouldClearPendingEvent = true? That would avoid the if here and the additional shouldClearPendingEvent variable.

Contributor Author:

I went back and forth on this a couple of times. The idea was to have a single place where the variable is assigned and a single place where it is cleared, to make the code easier to reason about when troubleshooting, debugging, or modifying later. But it's clearly debatable whether that's cleaner, so I went ahead and removed the flag.
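The refactoring under discussion can be sketched abstractly (field and method names are hypothetical, not the actual consumer code): before, a boolean flag routed the cleanup through a finally block; after, each site clears the field directly.

```java
public class ClearSiteSketch {
    // Hypothetical stand-in for the consumer's pendingOffsetFetchEvent field.
    private Object pendingEvent = new Object();

    // Before: a flag set at each decision site, acted on once in finally.
    boolean processWithFlag(boolean timedOut) {
        boolean shouldClearPendingEvent = false;
        try {
            if (timedOut)
                return false;            // timeout: keep the event for reuse
            shouldClearPendingEvent = true;
            return true;
        } finally {
            if (shouldClearPendingEvent)
                pendingEvent = null;
        }
    }

    // After: clear the field directly at the site that consumes the event,
    // avoiding both the flag and the if in finally.
    boolean processDirect(boolean timedOut) {
        if (timedOut)
            return false;                // timeout: keep the event for reuse
        pendingEvent = null;             // consumed; clear it right here
        return true;
    }

    boolean hasPending() { return pendingEvent != null; }

    public static void main(String[] args) {
        ClearSiteSketch s = new ClearSiteSketch();
        s.processWithFlag(true);
        System.out.println("after timeout: pending=" + s.hasPending());  // true
        s.processWithFlag(false);
        System.out.println("after success: pending=" + s.hasPending());  // false
    }
}
```

Both shapes are behaviorally equivalent; the direct form trades the single clearing site for fewer moving parts, which is the trade-off weighed above.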

kirktrue requested review from cadonna, chia7712 and lianetm on June 14, 2024 at 18:39
@kirktrue (Contributor, Author)

@AndrewJSchofield @cadonna @chia7712 @lianetm @philipnee: this PR is ready to be re-reviewed. Thanks all for your continued input 😄

@lianetm (Member) left a comment

Thanks for the simplification and adjusted log level, that will definitely avoid misleading red flags (been there). LGTM.

@chia7712 (Member) left a comment

LGTM

chia7712 merged commit 8f86b9c into apache:trunk on Jun 15, 2024
chia7712 pushed a commit that referenced this pull request Jun 15, 2024
…che too aggressively (#16310)

Allow the committed offsets fetch to run for as long as needed. This handles the case where a user invokes Consumer.poll() with a very small timeout (including zero).

Reviewers: Andrew Schofield <aschofield@confluent.io>, Lianet Magrans <lianetmr@gmail.com>, Chia-Ping Tsai <chia7712@gmail.com>
@kirktrue (Contributor, Author)

@jlprat Can this get merged into 3.8.0?

@chia7712 (Member)

Can this get merged into 3.8.0?

Sorry that I did not wait for @jlprat's feedback before backporting to 3.8 ...

@jlprat I will revert that if it is unsuitable to be in 3.8 :)

@jlprat (Contributor) commented Jun 15, 2024

Hi @chia7712 we can let it stay in the 3.8 branch

kirktrue deleted the KAFKA-16637-long-running-offset-fetch branch on June 19, 2024 at 19:35
cadonna added the ctr (Consumer Threading Refactor, KIP-848) label on Dec 28, 2024
