
KAFKA-16637: AsyncKafkaConsumer removes offset fetch responses from cache too aggressively#16310

Merged

chia7712 merged 23 commits into apache:trunk from kirktrue:KAFKA-16637-long-running-offset-fetch on Jun 15, 2024

Conversation

@kirktrue (Contributor)

Allow the committed offsets fetch to run for as long as needed. This handles the case where a user invokes Consumer.poll() with a very small timeout (including zero).
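The fix can be illustrated with a minimal standalone sketch (all names here are hypothetical, and a CompletableFuture stands in for the real offset-fetch event; this is not the actual consumer code): a short poll() timeout no longer discards the in-flight fetch, so a later poll() picks up the same request instead of restarting it.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class PendingFetchSketch {
    // Hypothetical stand-in for the in-flight offset-fetch event. The real
    // AsyncKafkaConsumer keeps a pendingOffsetFetchEvent field; this sketch
    // only models the "keep it across short poll() timeouts" behavior.
    private CompletableFuture<String> pendingFetch;

    String poll(long timeoutMs) {
        if (pendingFetch == null) {
            pendingFetch = startFetch(); // first poll() issues the fetch
        }
        try {
            String offsets = pendingFetch.get(timeoutMs, TimeUnit.MILLISECONDS);
            pendingFetch = null;         // result consumed; clear for the next fetch
            return offsets;
        } catch (TimeoutException e) {
            // The fix: on timeout, keep the pending fetch so the next poll()
            // can reuse the same in-flight request instead of restarting it.
            return null;
        } catch (Exception e) {
            pendingFetch = null;         // unrecoverable; drop the pending event
            throw new RuntimeException(e);
        }
    }

    private CompletableFuture<String> startFetch() {
        return CompletableFuture.supplyAsync(() -> {
            try { Thread.sleep(50); } catch (InterruptedException ignored) { }
            return "offsets";
        });
    }

    public static void main(String[] args) {
        PendingFetchSketch consumer = new PendingFetchSketch();
        String result = null;
        int polls = 0;
        while (result == null) {         // tight poll(0)-style loop
            result = consumer.poll(1);
            polls++;
        }
        System.out.println(result + " after " + polls + " polls");
    }
}
```

Before this change, each timed-out poll() would have abandoned the pending fetch, so a tight loop with a tiny timeout could restart the fetch forever and never complete.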

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@kirktrue (Contributor, Author)

@AndrewJSchofield @cadonna @lianetm @philipnee—please review this PR. It's an alternative take on #16241 that seems simpler and not fraught with peril.

Thanks!

@AndrewJSchofield (Member) left a comment

Thanks for the PR. I like the new approach and the PR looks pretty good. I have left one comment to do with exception handling.

@cadonna (Member) left a comment

Thanks for the PR, @kirktrue !

I like the approach.

Once @AndrewJSchofield's comments are addressed, I guess we are good to go.

WDYT, @lianetm ?

@lianetm (Member) commented Jun 13, 2024

Thanks for the PR @kirktrue, I love the simplicity of the approach and its consistency with the legacy logic. Thanks for your patience here :)

Same as @cadonna , LGTM once the error handling on getResult that @AndrewJSchofield pointed out is addressed.

Thanks!

@AndrewJSchofield (Member)

I'm happy with the approach now. Thanks @kirktrue.

@lianetm (Member) commented Jun 13, 2024

Thanks for the updates @kirktrue, happy with the approach too. I left an answer to your concerns about the tests; that can be done separately if you prefer.

@kirktrue (Contributor, Author)

@AndrewJSchofield @cadonna @lianetm @philipnee: this PR is ready to be re-reviewed. Thanks all for your input 😄

@kirktrue (Contributor, Author)

@jlprat—This Jira/PR is a blocker for the KIP-848 Java client work. It sounds like we're really close to merging this within the next day or two. Thanks!

@lianetm (Member) left a comment

Thanks for the fix and updates @kirktrue, LGTM.

@chia7712 (Member) left a comment

@kirktrue nice patch and approach. Two small questions are left. PTAL.

if (pendingOffsetFetchEvent == null)
return false;

if (!pendingOffsetFetchEvent.partitions().equals(partitions))
Member:

Can the pending event be reused if its partitions include all of the input partitions? It seems refreshCommittedOffsets can ignore the extra partitions.

Contributor Author:

@chia7712—that's definitely an interesting optimization!

IIUC, the suggestion is to relax the requirement to allow reuse if the partitions for the current request are a subset of (or equal to) the previous request, right? So basically:

Suggested change
if (!pendingOffsetFetchEvent.partitions().equals(partitions))
if (!pendingOffsetFetchEvent.partitions().containsAll(partitions))

The behavior of the existing LegacyKafkaConsumer is to allow reuse only if the partitions for the current request exactly equal those of the previous request. That logic is the basis for the behavior used in the AsyncKafkaConsumer. We've been deliberate about matching the behavior between the two Consumer implementations as closely as possible, unless there's a specific reason not to.

It's a small change, and it does make sense (to me). My main concern is that it introduces a subtle difference in behavior between the two Consumer implementations. Also, the specific case we're trying to solve with this change is when the user has passed in a very low timeout and we're in a tight poll() loop, which suggests the partitions wouldn't be changing between those loops (CMIIW).

If I understand correctly, this seems like an optimization, rather than something needed for correctness. If that's the case, can I file a new Jira to implement this when we have a little more time to investigate and test?

Thanks!
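The difference between the two reuse policies can be shown with a minimal, self-contained sketch (plain string sets stand in for TopicPartition, and the method names are hypothetical, not the actual consumer code):

```java
import java.util.Set;

public class ReuseCheck {
    // Current behavior: reuse the pending fetch only on an exact partition match.
    static boolean reuseExact(Set<String> pending, Set<String> requested) {
        return pending.equals(requested);
    }

    // Proposed relaxation: reuse whenever the pending fetch covers the request.
    static boolean reuseSubset(Set<String> pending, Set<String> requested) {
        return pending.containsAll(requested);
    }

    public static void main(String[] args) {
        Set<String> pending = Set.of("topic-0", "topic-1");
        Set<String> subset = Set.of("topic-0");

        System.out.println(reuseExact(pending, subset));   // false: not an exact match
        System.out.println(reuseSubset(pending, subset));  // true: pending covers the request
    }
}
```

Under the relaxed check, a caller asking for a subset of the in-flight partitions would still reuse the pending response; under the exact check it issues a new fetch.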

Member:

can I file a new Jira to implement this when we have a little more time to investigate and test?

Sure, and thanks for sharing. I wasn't even aware of the "match the behavior" goal before 🥲

Contributor Author:

I've filed KAFKA-16966 to track this optimization.

refreshCommittedOffsets(offsets, metadata, subscriptions);
return true;
} catch (TimeoutException e) {
log.error("Couldn't refresh committed offsets before timeout expired");
Member:

Does poll(0) fill users' logs with this error message? If so, should we change the log level or add more explanation to avoid freaking users out?

Member:

Good point! ERROR level seems excessive here. At least WARN should be used. However, WARN would still fill users' logs. So maybe DEBUG?

Contributor Author:

Changed to debug and updated the text of the log to be slightly more helpful.

Contributor Author:

I'm not sure where that ERROR-level logging crept in. I'd assumed it was a holdover from the LegacyKafkaConsumer, but its implementation doesn't log anything in the case of timeouts.

I think it's helpful to keep it there in DEBUG form.

@jlprat (Contributor) commented Jun 14, 2024

Hi @kirktrue. Let's check again when the PR is approved :)

@cadonna (Member) left a comment

Thanks for all the updates and for being open to all the feedback, @kirktrue!

LGTM!

However, you should address the concern about excessive logging before we merge.

Thanks to all the other reviewers for their input!


} catch (InterruptException e) {
throw e;
} catch (Throwable t) {
// Clear the pending event on errors that are not timeout- or interrupt-related.
Member:

nit: Could you please remove the comment? It does not really add any information because it just spells out what the code does.

Contributor Author:

Removed.

throw ConsumerUtils.maybeWrapAsKafkaException(t);
} finally {
if (shouldClearPendingEvent)
pendingOffsetFetchEvent = null;
Member:

nit: Why not clear pendingOffsetFetchEvent at the sites where you set shouldClearPendingEvent = true? That would avoid the if here and the additional shouldClearPendingEvent variable.

Contributor Author:

I went back and forth on this a couple of times. The idea was to have a single place where the variable is assigned and a single place where it is cleared, to make the code easier to reason about when troubleshooting, debugging, or modifying later. But it's clearly debatable whether that's cleaner, so I went ahead and removed the flag.
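The refactoring under discussion can be sketched abstractly (field and method names are hypothetical, not the actual consumer code): before, a boolean flag routed the cleanup through a finally block; after, each site clears the field directly.

```java
public class ClearSiteSketch {
    // Hypothetical stand-in for the consumer's pendingOffsetFetchEvent field.
    private Object pendingEvent = new Object();

    // Before: a flag set at each decision site, acted on once in finally.
    boolean processWithFlag(boolean timedOut) {
        boolean shouldClearPendingEvent = false;
        try {
            if (timedOut)
                return false;            // timeout: keep the event for reuse
            shouldClearPendingEvent = true;
            return true;
        } finally {
            if (shouldClearPendingEvent)
                pendingEvent = null;
        }
    }

    // After: clear the field directly at the site that consumes the event,
    // avoiding both the flag and the if in finally.
    boolean processDirect(boolean timedOut) {
        if (timedOut)
            return false;                // timeout: keep the event for reuse
        pendingEvent = null;             // consumed; clear it right here
        return true;
    }

    boolean hasPending() { return pendingEvent != null; }

    public static void main(String[] args) {
        ClearSiteSketch s = new ClearSiteSketch();
        s.processWithFlag(true);
        System.out.println("after timeout: pending=" + s.hasPending());  // true
        s.processWithFlag(false);
        System.out.println("after success: pending=" + s.hasPending());  // false
    }
}
```

Both shapes are behaviorally equivalent; the direct form trades the single clearing site for fewer moving parts, which is the trade-off weighed above.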

kirktrue requested review from cadonna, chia7712 and lianetm on June 14, 2024 at 18:39
@kirktrue (Contributor, Author)

@AndrewJSchofield @cadonna @chia7712 @lianetm @philipnee: this PR is ready to be re-reviewed. Thanks all for your continued input 😄

@lianetm (Member) left a comment

Thanks for the simplification and adjusted log level, that will definitely avoid misleading red flags (been there). LGTM.

@chia7712 (Member) left a comment

LGTM

chia7712 merged commit 8f86b9c into apache:trunk on Jun 15, 2024
chia7712 pushed a commit that referenced this pull request Jun 15, 2024
…che too aggressively (#16310)

Allow the committed offsets fetch to run for as long as needed. This handles the case where a user invokes Consumer.poll() with a very small timeout (including zero).

Reviewers: Andrew Schofield <aschofield@confluent.io>, Lianet Magrans <lianetmr@gmail.com>, Chia-Ping Tsai <chia7712@gmail.com>
@kirktrue (Contributor, Author)

@jlprat Can this get merged into 3.8.0?

@chia7712 (Member)

Can this get merged into 3.8.0?

Sorry that I did not wait for @jlprat's feedback before backporting to 3.8 ...

@jlprat I will revert that if it is unsuitable to be in 3.8 :)

@jlprat (Contributor) commented Jun 15, 2024

Hi @chia7712 we can let it stay in the 3.8 branch

kirktrue deleted the KAFKA-16637-long-running-offset-fetch branch on June 19, 2024 at 19:35
cadonna added the ctr (Consumer Threading Refactor, KIP-848) label on Dec 28, 2024
