KAFKA-15974: Enforce that event processing respects user-provided timeout #15640

cadonna merged 160 commits into apache:trunk from
Conversation
Yes, the network layer changes are captured in KAFKA-16200 and build on top of this PR.
| }
|
| @Test
| void testEnsureEventsAreCompleted() {
Why did you remove this test without replacement?
Actually, it seems to me that we shouldn't have this test here (and maybe this is why @kirktrue removed it before?). As I see it, this unit test is testing something that is not the ConsumerNetworkThread's responsibility (and that's why it ends up being complicated, having to mimic the reaper behaviour and spying). It is testing that events are completed, and that's the responsibility of reaper.reap, so it seems to me we need to:

1. test that the ConsumerNetworkThread calls the reaper with the full list of events -> already done in testCleanupInvokesReaper
2. test that CompletableEventReaper.reap(Collection<?> events) completes the events -> done in CompletableEventReaperTest (testIncompleteQueue and testIncompleteTracked)

In the end, as it is, we end up asserting a behaviour we're mocking ourselves in the doAnswer, so not much value, I would say? I agree with @cadonna that we need coverage, but I would say we have it, per my points 1 and 2, and this test should be removed. Makes sense?
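To make point 1 concrete, the existing coverage essentially boils down to something like this (a sketch only; the mock and field names in ConsumerNetworkThreadTest are assumptions):

```java
@Test
void testCleanupInvokesReaper() {
    // On cleanup, the network thread is expected to hand every outstanding
    // application event to the reaper; the completion side of the contract
    // (futures completed exceptionally) is covered in CompletableEventReaperTest.
    consumerNetworkThread.cleanup();
    verify(applicationEventReaper).reap(applicationEventQueue);
}
```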
Yes, the test was a little suspect in terms of its value-add, so I'd removed it.
I was planning to file a Jira to move several of the tests (including this one) from ConsumerNetworkThreadTest to ApplicationEventProcessorTest. Then we could fix up some of the funkiness in this test as a separate task.
That is all fine! I was not arguing that we need to keep the test, but if I see a test removed without replacement, I suspect a mistake, which apparently did not happen in this case. Next time, please comment on the PR why you removed the test.
| consumer = newConsumer();
| completeUnsubscribeApplicationEventSuccessfully();
| consumer.unsubscribe();
| verify(backgroundEventReaper).reap(any(Long.class));
You control the time here. Why do you not verify that reap() is called with the correct time?
| @Test
| void testRunOnceInvokesReaper() {
|     consumerNetworkThread.runOnce();
|     verify(applicationEventReaper).reap(any(Long.class));
You control the time here. Why do you not verify that reap() is called with the correct time?
And done here, too.
Do you still have the change locally? As it is here, it still does not verify the correct time.
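For illustration, the stricter verification could look roughly like this (assuming the test drives a controlled MockTime instance named time; the names here are assumptions):

```java
@Test
void testRunOnceInvokesReaper() {
    consumerNetworkThread.runOnce();
    // With a controlled clock, the exact expiration timestamp can be asserted
    // instead of matching any(Long.class).
    verify(applicationEventReaper).reap(time.milliseconds());
}
```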
@lianetm Thanks for the explanation!
| // Close the consumer here as we know it will cause a FencedInstanceIdException to be thrown.
| // If we get an error other than the FencedInstanceIdException, we'll raise a ruckus.
| try {
|     consumer.close();
| } catch (KafkaException e) {
|     assertNotNull(e.getCause());
|     assertInstanceOf(FencedInstanceIdException.class, e.getCause());
| } finally {
|     consumer = null;
| }
Do we expect the close to throw? If so, we should verify that (at the moment our test will just complete successfully if the close does not throw). If that's the expectation, maybe this simpler snippet would cover it all:
- // Close the consumer here as we know it will cause a FencedInstanceIdException to be thrown.
- // If we get an error other than the FencedInstanceIdException, we'll raise a ruckus.
- try {
-     consumer.close();
- } catch (KafkaException e) {
-     assertNotNull(e.getCause());
-     assertInstanceOf(FencedInstanceIdException.class, e.getCause());
- } finally {
-     consumer = null;
- }
+ Throwable e = assertThrows(KafkaException.class, () -> consumer.close());
+ assertInstanceOf(FencedInstanceIdException.class, e.getCause());
+ consumer = null;
How did we resolve this? I see the section got completely removed; is the verification not needed anymore?
Yes, it turns out that changes made elsewhere have obviated the need for this check.
| final Timer timer) {
|     if (!shouldAutoCommit)
|         return;
| void maybeAutoCommitSync(final Timer timer) {
This is not a "maybe" anymore, so what about autoCommitSyncAllConsumed?
Changed to just autoCommitSync(). Is that OK?
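Just to spell out what the rename amounts to, a minimal sketch (the commitSync helper and the subscriptions field are assumptions about the surrounding class, not the actual implementation):

```java
// The shouldAutoCommit guard now lives at the call site, so the method
// commits unconditionally and the "maybe" prefix no longer applies.
void autoCommitSync(final Timer timer) {
    // Synchronously commit all consumed offsets, bounded by the caller's timer.
    commitSync(subscriptions.allConsumed(), timer);
}
```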
| // First, complete (exceptionally) any events that have passed their deadline AND aren't already complete.
| tracked.stream()
|     .filter(e -> !e.future().isDone())
|     .filter(e -> currentTimeMs > e.deadlineMs())
Don't we want >= here when identifying expired events? I would expect so (that's the semantics applied in the Timer class's isExpired, for instance).
This is an interesting point 🤔
If a user provides a timeout of 1000 milliseconds, is it expired at 1000 milliseconds or at 1001 milliseconds?
Regardless, I will change it to >= to be consistent.
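With that change, the expiration filter reads as below (a fragment of the stream shown above; the completeExceptionally step and the exception message are illustrative of the documented expiration behaviour):

```java
// An event counts as expired once the current time reaches its deadline (>=),
// matching the semantics of Timer.isExpired().
tracked.stream()
        .filter(e -> !e.future().isDone())
        .filter(e -> currentTimeMs >= e.deadlineMs())
        .forEach(e -> e.future().completeExceptionally(
                new TimeoutException("The event was not completed before its deadline")));
```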
|  * could occur when processing the events. In such cases, the processor will take a reference to the first
|  * error, continue to process the remaining events, and then throw the first error that occurred.
|  */
| private boolean processBackgroundEvents(EventProcessor<BackgroundEvent> processor) {
This processor passed as an argument is, in the end, always a reference to the backgroundEventProcessor, so could we simplify this by removing the arg and referencing the field directly? It caught my attention when seeing how this is used: it seems a bit redundant that every call has to provide the same processBackgroundEvents(backgroundEventProcessor, ... which feels like an internal detail that processBackgroundEvents could know about.
There is a unit test that passes in a mocked event processor. Let me look at refactoring this.
Done. That's much better 😄
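For the record, the simplification roughly amounts to the following (a sketch: the drainBackgroundEvents helper and the meaning of the boolean return are assumptions; only the error-handling contract comes from the Javadoc above):

```java
// No processor argument anymore: the method references the backgroundEventProcessor field directly.
private boolean processBackgroundEvents() {
    KafkaException firstError = null;
    boolean hadEvents = false;

    for (BackgroundEvent event : drainBackgroundEvents()) {  // hypothetical drain helper
        hadEvents = true;
        try {
            backgroundEventProcessor.process(event);
        } catch (KafkaException e) {
            // Per the Javadoc: remember the first error, keep processing the remaining events.
            if (firstError == null)
                firstError = e;
        }
    }

    if (firstError != null)
        throw firstError;

    return hadEvents;
}
```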
Co-authored-by: Lianet Magrans <98415067+lianetm@users.noreply.github.com>
Thanks for your patience and great effort here @kirktrue, LGTM to merge and move on with the follow-ups. Just to recap, this is what I see should be addressed next related to timeout enforcement:
- https://issues.apache.org/jira/browse/KAFKA-16637
- https://issues.apache.org/jira/browse/KAFKA-16200
- https://issues.apache.org/jira/browse/KAFKA-16792
Also, please let's have a Jira to address this comment and remove the test we agreed brings no value.
Thanks again!
cc. @cadonna
I added KAFKA-16818 to cover the cases to refactor/migrate/remove tests.
The intention of the CompletableApplicationEvent is for a Consumer to enqueue the event and then block, waiting for it to complete. The application thread will block for up to the amount of the timeout. This change introduces a consistent manner in which events are expired by checking their timeout values.

The CompletableEventReaper is a new class that tracks CompletableEvents that are enqueued. Both the application thread and the network I/O thread maintain their own reaper instances. The application thread will track any CompletableBackgroundEvents that it receives, and the network I/O thread will do the same with any CompletableApplicationEvents it receives. The application and network I/O threads will check their tracked events, and if any are expired, the reaper will invoke each event's CompletableFuture.completeExceptionally() method with a TimeoutException.

On closing the AsyncKafkaConsumer, both threads will invoke their respective reapers to cancel any unprocessed events in their queues. In this case, the reaper will invoke each event's CompletableFuture.completeExceptionally() method with a CancellationException instead of a TimeoutException to differentiate the two cases.

The overall design for the expiration mechanism is captured on the Apache wiki, and the original issue (KAFKA-15848) has more background on the cause.

Note: this change only handles the event expiration and does not cover the network request expiration. That is handled in a follow-up Jira (KAFKA-16200) that builds atop this change.

This change also includes some minor refactoring of the EventProcessor and its implementations. This allows the event processor logic to focus on processing individual events rather than also handling batches of events.

Reviewers: Lianet Magrans <lianetmr@gmail.com>, Philip Nee <pnee@confluent.io>, Bruno Cadonna <cadonna@apache.org>

Committer Checklist (excluded from commit message)
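As a minimal illustration of the mechanism described above (a sketch only; the real class lives in the consumer internals and differs in detail, and CompletableEvent is assumed here to expose future() and deadlineMs()):

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.concurrent.CancellationException;

import org.apache.kafka.common.errors.TimeoutException;

// Sketch only: tracks completable events and either expires or cancels them.
public class CompletableEventReaper {

    private final List<CompletableEvent<?>> tracked = new ArrayList<>();

    public void add(CompletableEvent<?> event) {
        tracked.add(event);
    }

    // Called periodically by the owning thread: expire tracked events whose deadline has passed.
    public void reap(long currentTimeMs) {
        for (CompletableEvent<?> event : tracked) {
            if (!event.future().isDone() && currentTimeMs >= event.deadlineMs())
                event.future().completeExceptionally(
                        new TimeoutException("The event was not completed before its deadline"));
        }
        tracked.removeIf(event -> event.future().isDone());
    }

    // Called on close: cancel any tracked or still-queued events that never completed.
    public void reap(Collection<?> queued) {
        List<CompletableEvent<?>> remaining = new ArrayList<>(tracked);
        for (Object o : queued) {
            if (o instanceof CompletableEvent)
                remaining.add((CompletableEvent<?>) o);
        }
        for (CompletableEvent<?> event : remaining) {
            if (!event.future().isDone())
                event.future().completeExceptionally(new CancellationException("The consumer is closing"));
        }
        tracked.clear();
    }
}
```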