
KAFKA-12738: track processing errors and implement constant-time task backoff (#11787)

Merged: ableegoldman merged 8 commits into apache:trunk from ableegoldman:12738-ErrorHandling-track-task-errors on Feb 24, 2022

Conversation

ableegoldman (Member) commented on Feb 17, 2022:

Part 1 in the initial series of error handling work for named topologies*:

Part 1: Track tasks with errors within a named topology and implement constant-time task backoff (sketched below)
Part 2: Implement exponential task backoff to account for recurring errors
Part 3: Pause/back off all tasks within a named topology in case of a long backoff or frequent errors for any individual task

*Note: not sure whether some or all of this would require a KIP, so I'm limiting the error handling work to just named topologies for now.
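
For orientation, a minimal sketch of the constant-time backoff behind Part 1 (illustrative only; the class and names here are simplified stand-ins for the PR's actual TaskExecutionMetadata changes):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch only -- not the PR's actual code. Task ids are
// simplified to strings, and the constant mirrors the 15s value discussed
// in the review below.
class TaskBackoffSketch {
    static final long CONSTANT_BACKOFF_MS = 15_000L;
    private final Map<String, Long> taskIdToErrorTime = new ConcurrentHashMap<>();

    void registerTaskError(final String taskId, final long now) {
        taskIdToErrorTime.put(taskId, now); // remember when the task last failed
    }

    boolean canProcessTask(final String taskId, final long now) {
        final Long errorTime = taskIdToErrorTime.get(taskId);
        if (errorTime == null) {
            return true;                      // no recent error: process freely
        } else if (now - errorTime > CONSTANT_BACKOFF_MS) {
            taskIdToErrorTime.remove(taskId); // backoff elapsed: clear and retry
            return true;
        }
        return false;                         // still backing off: skip this task
    }
}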

ableegoldman marked this pull request as draft on February 17, 2022 11:57
ableegoldman force-pushed the 12738-ErrorHandling-track-task-errors branch 2 times, most recently from f7c4bc5 to d703436 on February 22, 2022 10:34
ableegoldman marked this pull request as ready for review on February 22, 2022 10:38
ableegoldman (Member, Author):

Not sure whether it is, or will be, cleaner in the long run to have this separate class that now has to keep up with topology additions/removals, vs just doing all this bookkeeping inside the TopologyMetadata/InternalTopologyBuilder classes. But until we can carve out time for a real tech-debt cleanup of those classes, which are already pretty out of control, I felt it best to pull everything out, even if it meant duplicated un/registration of topologies.

Contributor:

Thanks for bringing this up. I think we can come back and clean this up after we've gained confidence and are ready to extend beyond named topologies later.

guozhangwang (Contributor) left a comment:

Made a pass; just minor comments, otherwise LGTM.

Contributor:

Should we check, under the `if`, that the topologyName is always UNNAMED_TOPOLOGY or null?

ableegoldman (Member, Author):

See this -- we do check to make sure that if named topologies are used, then the topologyName is not null or UNNAMED_TOPOLOGY. However, for apps that do not use named topologies I think it's fair to just return true here and not require any assertions about the name, since it's a dummy name in that case anyway.

SG?

Contributor:

SG.

Contributor:

I think it's a bit overkill to first organize all tasks into activeTasksByTopology; it also may introduce unexpected scheduling bias compared to what we do today, which is more or less a random round-robin. What about just checking, for each task, if canProcess(taskName) && canProcessTopology(task.topologyName())?

ableegoldman (Member, Author):

Guess this was a case of premature optimization -- I'll update this PR with your suggestion, and work out a better solution that doesn't skew processing for the later PR where this matters. (For some context, I did this partly because in one of the followups we will back off entire named topologies when one of their tasks is failing recurrently, to avoid getting out of sync; in that case it seemed wasteful to check each task in the topology when we already know it's not ready to process.)

But we can revisit this when we get to that PR 🙂
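
For illustration, the per-task gate suggested above might look roughly like the following (canProcess and canProcessTopology are the hypothetical helpers named in the suggestion; the surrounding types are simplified):

// Hypothetical shape of the per-task check in the processing loop; iterating
// tasks in their existing, roughly round-robin order avoids the scheduling
// bias of grouping them by topology first.
int processTasks(final List<Task> activeTasks, final long now) {
    int processed = 0;
    for (final Task task : activeTasks) {
        // gate on both the task-level and the topology-level backoff
        if (canProcess(task, now) && canProcessTopology(task.id().topologyName(), now)) {
            if (task.process(now)) {
                processed++;
            }
        }
    }
    return processed;
}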


ableegoldman force-pushed the 12738-ErrorHandling-track-task-errors branch from 9466b32 to efaad14 on February 23, 2022 07:14
ableegoldman force-pushed the 12738-ErrorHandling-track-task-errors branch from efaad14 to 3d90058 on February 23, 2022 07:49
ableegoldman force-pushed the 12738-ErrorHandling-track-task-errors branch from 513d067 to 36560d2 on February 24, 2022 09:49
}

@Test
public void shouldBackOffTaskAndEmitDataWithinSameTopology() throws Exception {
ableegoldman (Member, Author):

Would like to add a test for the case of tasks in different named topologies, but perhaps in a followup PR to unblock this one

Contributor:

A meta question: do we really need an integration test that brings up the full stack for this coverage? I'd think a unit test with mock time, just on the task executor, would be sufficient?

ableegoldman (Member, Author):

I guess the reason I felt an integration test would be good is that there's some subtlety in how we may poll multiple times, and also in how the thread replacement interplays with the backoff.

Now that the integration test only takes 0.5s, WDYT about leaving it as an integration test but moving it to the NamedTopologyIntegrationTest class, since what does take a long time is bringing up the CLUSTER for each integration test class?

Contributor:

Sounds good to me. Indeed, the setup time to bring up a CLUSTER was the first concern I had, the second being that relying on system time is more vulnerable to flakiness than mock time. Since this test seems less exposed to the latter concern, moving it into an existing test class is sufficient.
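
As a concrete example of that mock-time approach, a unit test against the illustrative TaskBackoffSketch from the description above (not the real executor class) could look like:

import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;

import org.apache.kafka.common.utils.MockTime;
import org.junit.Test;

public class TaskBackoffSketchTest {

    @Test
    public void shouldBackOffTaskUntilConstantBackoffElapses() {
        final MockTime time = new MockTime();
        final TaskBackoffSketch backoff = new TaskBackoffSketch();

        // a fresh error puts the task into backoff
        backoff.registerTaskError("0_0", time.milliseconds());
        assertFalse(backoff.canProcessTask("0_0", time.milliseconds()));

        // advance mock time past the 15s constant backoff: the task frees up
        // without any wall-clock sleeping or embedded cluster
        time.sleep(16_000L);
        assertTrue(backoff.canProcessTask("0_0", time.milliseconds()));
    }
}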

public void registerTaskError(final Task task, final Throwable t, final long now) {
    if (hasNamedTopologies) {
        final String topologyName = task.id().topologyName();
        topologyNameToErrorMetadata.computeIfAbsent(topologyName, n -> new NamedTopologyMetadata(topologyName))
            .registerTaskError(task, t, now);
    }
}
ableegoldman (Member, Author):

Changed it so that we only track topologies in this map if they have an active backoff/task in error, rather than registering and unregistering named topologies and trying to keep that in sync between the TopologyMetadata and the individual StreamThreads' view (which was starting to look pretty ugly).

Instead we just pop the topology's metadata into the map when one of its tasks hits a new error, and clear it if/when all its tasks are healthy again.
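
A rough sketch of that lazy scheme, with the same simplified naming as the earlier sketch (again, not the actual TaskExecutionMetadata code):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative: a topology appears in this map only while at least one of
// its tasks is in error, so no topology un/registration bookkeeping is needed.
class TopologyErrorTrackerSketch {
    private final Map<String, Map<String, Long>> topologyToTaskErrorTimes = new ConcurrentHashMap<>();

    void registerTaskError(final String topologyName, final String taskId, final long now) {
        // pop the topology's metadata into the map when a task hits a new error
        topologyToTaskErrorTimes
            .computeIfAbsent(topologyName, n -> new ConcurrentHashMap<>())
            .put(taskId, now);
    }

    void clearTaskError(final String topologyName, final String taskId) {
        final Map<String, Long> taskErrors = topologyToTaskErrorTimes.get(topologyName);
        if (taskErrors != null) {
            taskErrors.remove(taskId);
            if (taskErrors.isEmpty()) {
                // all tasks healthy again: drop the topology entry entirely
                topologyToTaskErrorTimes.remove(topologyName);
            }
        }
    }
}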

guozhangwang (Contributor) left a comment:

Made a second pass.

My major comment is around the integration test -- I think we're a bit overkill in adding integration tests for cases like this, which not only increase test time but are also vulnerable to flakiness. A unit test with mock time on the task executor looks pretty sufficient to me. All others are minor.

final Long errorTime = tasksToErrorTime.get(task.id());
if (errorTime == null) {
return true;
} else if (now - errorTime > 15000L) {
Contributor:

Curious why the magic number of 15s?

ableegoldman (Member, Author) commented on Feb 26, 2022:

Because it was actually taking the thread 10s to come back up (in the integration test, where we overrode session.timeout to 10s) before we had #11801.

Now with that fix it takes 0.5-4s for the thread to be replaced, so there's no particular reason for it to be 15s. I think it makes sense to lower it to maybe 5s for now, and then when we have true exponential backoff it can obviously start lower and grow from there.
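
For a sense of where Part 2 is headed, an exponential version might compute the window like this; the 5s starting point echoes the comment above, while the doubling factor and the cap are assumptions:

// Hypothetical sketch of exponential task backoff, not code from this PR.
final class ExponentialBackoffSketch {
    static final long INITIAL_BACKOFF_MS = 5_000L;  // start low, per the comment above
    static final long MAX_BACKOFF_MS = 60_000L;     // assumed cap on the window

    static long backoffMs(final int consecutiveErrors) {
        // double the window per consecutive error; clamp the shift to avoid
        // overflow, then cap so a task is never kept out for too long
        final int exp = Math.min(Math.max(consecutiveErrors - 1, 0), 10);
        return Math.min(INITIAL_BACKOFF_MS << exp, MAX_BACKOFF_MS);
    }
}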

        throw e;
    } catch (final StreamsException e) {
-       log.error("Failed to process stream task {} due to the following error:", task.id(), e);
+       log.error(String.format("Failed to process stream task %s due to the following error:", task.id()), e);
Contributor:

Sorry to see log4j still hasn't figured out a way to support both string parameters and an exception in the same call...


wcarlson5 (Contributor) left a comment:

LGTM. Not sure why you moved the integration test to a separate file, but I don't have anything real against it.

ableegoldman (Member, Author):

Test failures are unrelated (filed https://issues.apache.org/jira/browse/KAFKA-13690)

Merging to trunk, but will do an immediate followup PR to:

  1. delete the new ErrorHandlingIntegrationTest file and see if I can consolidate the test in there with the existing tests in NamedTopologyIntegrationTest
  2. see if I can also consolidate/cut down on some of the tests in NamedTopologyIntegrationTest (I believe there are some cases that can be compressed into a single test, saving the overhead of starting up the KafkaStreams app)
  3. add unit tests
  4. address nits

ableegoldman changed the title from "KAFKA-12738: tracking task errors" to "KAFKA-12738: track processing errors and implement constant-time task backoff" on Feb 24, 2022
ableegoldman merged commit cd4a1cb into apache:trunk on Feb 24, 2022
ableegoldman added a commit that referenced this pull request on Feb 25, 2022:

KAFKA-12738: send LeaveGroup request when thread dies to optimize replacement time (#11801)

Quick followup to #11787 to optimize the impact of the task backoff by reducing the time to replace a thread. When a thread is going through a dirty close, i.e. shutting down from an uncaught exception, we should send a LeaveGroup request to make sure the broker acknowledges the thread has died and won't wait up to the `session.timeout` for it to rejoin the group if the user opts to `REPLACE_THREAD` in the handler.

Reviewers: Walker Carlson <wcarlson@confluent.io>, John Roesler <vvcephei@apache.org>
yyu1993 added a commit to confluentinc/kafka that referenced this pull request Feb 25, 2022
* apache-kafka/trunk: (49 commits)
  KAFKA-12738: send LeaveGroup request when thread dies to optimize replacement time (apache#11801)
  MINOR: Skip fsync on parent directory to start Kafka on ZOS (apache#11793)
  KAFKA-12738: track processing errors and implement constant-time task backoff (apache#11787)
  MINOR: Cleanup admin creation logic in integration tests (apache#11790)
  KAFKA-10199: Add interface for state updater (apache#11499)
  KAFKA-10000: Utils methods for overriding user-supplied properties and dealing with Enum types (apache#11774)
  KAFKA-10000: Add new metrics for source task transactions (apache#11772)
  KAFKA-13676: Commit successfully processed tasks on error (apache#11791)
  KAFKA-13511: Add support for different unix precisions in TimestampConverter SMT (apache#11575)
  MINOR: Improve Connect docs (apache#11642)
  ...
ableegoldman added a commit to ableegoldman/kafka that referenced this pull request Feb 26, 2022
ableegoldman added a commit that referenced this pull request Mar 1, 2022
… of PR #11787 (#11812)

This PR addresses the remaining nits from the final review of #11787

It also deletes two integration test classes which had only one test each, moving those tests into another test class file to save the time of bringing up an entire embedded Kafka cluster for a single test.

Reviewers: Guozhang Wang <guozhang@confluent.io>, Walker Carlson <wcarlson@confluent.io>
