KAFKA-10199: Implement removing active and standby tasks from the state updater #12270

cadonna · 2022-06-08T20:40:22Z

This PR adds removing of active and standby tasks from the default implementation of the state updater. The PR also includes refactorings that clean up the code.

Committer Checklist (excluded from commit message)

Verify design and implementation
Verify test coverage and CI build status
Verify documentation (including upgrade notes)

cadonna · 2022-06-08T20:42:59Z

streams/src/main/java/org/apache/kafka/streams/processor/internals/DefaultStateUpdater.java

        private final AtomicBoolean isRunning = new AtomicBoolean(true);
        private final Consumer<Set<TopicPartition>> offsetResetter;
-        private final Map<TaskId, Task> updatingTasks = new HashMap<>();
+        private final Map<TaskId, Task> updatingTasks = new ConcurrentHashMap<>();


This allows the main thread to read the map in a thread-safe way.
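
A minimal sketch of why the map type matters here, assuming a simplified setup in which plain strings stand in for `TaskId` and `Task`. The class `UpdatingTasksSketch` and its methods are hypothetical and not part of this PR. With a plain `HashMap`, the reading thread could see stale data or hit a `ConcurrentModificationException` while copying; `ConcurrentHashMap` iterators are weakly consistent and do not throw.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the state updater thread's bookkeeping.
class UpdatingTasksSketch {
    // Written by the updater thread, read by the main thread.
    private final Map<String, String> updatingTasks = new ConcurrentHashMap<>();

    // Called on the updater thread.
    void add(final String taskId, final String task) {
        updatingTasks.put(taskId, task);
    }

    // Called on the updater thread; returns the removed task or null.
    String remove(final String taskId) {
        return updatingTasks.remove(taskId);
    }

    // Called on the main thread. Copying the values is safe even while
    // the updater thread keeps writing to the map.
    Set<String> snapshot() {
        return new HashSet<>(updatingTasks.values());
    }

    // Demo: one writer thread while the calling thread reads concurrently.
    static int demo() {
        final UpdatingTasksSketch sketch = new UpdatingTasksSketch();
        final Thread writer = new Thread(() -> {
            for (int i = 0; i < 1000; i++) {
                sketch.add("0_" + i, "task-" + i);
            }
        });
        writer.start();
        while (writer.isAlive()) {
            sketch.snapshot(); // concurrent read, no external locking
        }
        try {
            writer.join();
        } catch (final InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return sketch.snapshot().size();
    }
}
```

Each snapshot reflects some recent state of the map rather than an atomic view, which should be acceptable for a read-only inspection from the main thread.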

cadonna · 2022-06-08T20:44:15Z

streams/src/main/java/org/apache/kafka/streams/processor/internals/DefaultStateUpdater.java

-    enum Action {
-        ADD
-    }
-
-    private static class TaskAndAction {
-        public final Task task;
-        public final Action action;
-
-        public TaskAndAction(final Task task, final Action action) {
-            this.task = task;
-            this.action = action;
-        }
-    }
-


Moved this to a new class to make construction of the object safer.
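
One way construction could be made safer is a private constructor plus static factory methods, so that an `ADD` always carries a task and a `REMOVE` always carries a task ID. The following is only a sketch under that assumption; the class name `TaskAndActionSketch` is hypothetical, strings stand in for `Task` and `TaskId`, and the class in the PR may look different.

```java
import java.util.Objects;

// Hypothetical sketch; String stands in for Task and TaskId.
final class TaskAndActionSketch {
    enum Action { ADD, REMOVE }

    private final String task;
    private final String taskId;
    private final Action action;

    // Private: callers must go through the factory methods below.
    private TaskAndActionSketch(final String task, final String taskId, final Action action) {
        this.task = task;
        this.taskId = taskId;
        this.action = action;
    }

    static TaskAndActionSketch createAddTask(final String task) {
        Objects.requireNonNull(task, "Task to add is null!");
        return new TaskAndActionSketch(task, null, Action.ADD);
    }

    static TaskAndActionSketch createRemoveTask(final String taskId) {
        Objects.requireNonNull(taskId, "Task ID of task to remove is null!");
        return new TaskAndActionSketch(null, taskId, Action.REMOVE);
    }

    // Only valid for ADD; guards against reading a field that was never set.
    String task() {
        if (action != Action.ADD) {
            throw new IllegalStateException("Action " + action + " carries no task");
        }
        return task;
    }

    // Only valid for REMOVE.
    String taskId() {
        if (action != Action.REMOVE) {
            throw new IllegalStateException("Action " + action + " carries no task ID");
        }
        return taskId;
    }

    Action action() {
        return action;
    }
}
```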

cadonna · 2022-06-08T20:56:34Z

streams/src/main/java/org/apache/kafka/streams/processor/internals/DefaultStateUpdater.java

+    public Set<StandbyTask> getUpdatingStandbyTasks() {
+        return Collections.unmodifiableSet(new HashSet<>(stateUpdaterThread.getUpdatingStandbyTasks()));
+    }
+
+    public Set<Task> getUpdatingTasks() {
+        return Collections.unmodifiableSet(new HashSet<>(stateUpdaterThread.getUpdatingTasks()));
+    }
+
+    public Set<StreamTask> getRestoredActiveTasks() {


Added a couple of get* methods to improve unit testing. These methods allow looking into the queues without draining them. They are not part of the StateUpdater interface.
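
To illustrate the difference between looking into a queue and draining it, here is a small sketch. It assumes a `LinkedBlockingQueue` of strings as a stand-in for the real task queue; the class `RestoredTasksSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch; String stands in for StreamTask.
class RestoredTasksSketch {
    private final BlockingQueue<String> restoredActiveTasks = new LinkedBlockingQueue<>();

    void addRestored(final String task) {
        restoredActiveTasks.add(task);
    }

    // Non-destructive: returns a read-only copy, the queue keeps its elements.
    Set<String> getRestoredActiveTasks() {
        return Collections.unmodifiableSet(new HashSet<>(restoredActiveTasks));
    }

    // Destructive: moves all currently queued elements out of the queue.
    Set<String> drainRestoredActiveTasks() {
        final List<String> drained = new ArrayList<>();
        restoredActiveTasks.drainTo(drained);
        return new HashSet<>(drained);
    }
}
```

A test can call the getter repeatedly to assert where a task currently sits, whereas a drain call changes the state under test.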

cadonna · 2022-06-08T21:00:41Z

streams/src/test/java/org/apache/kafka/streams/processor/internals/DefaultStateUpdaterTest.java

-        final List<ExceptionAndTasks> failedTasks = getFailedTasks(1);
-        assertEquals(1, failedTasks.size());
-        final ExceptionAndTasks actualFailedTasks = failedTasks.get(0);
-        assertEquals(2, actualFailedTasks.tasks.size());
-        assertTrue(actualFailedTasks.tasks.containsAll(Arrays.asList(task1, task2)));
-        assertTrue(actualFailedTasks.exception instanceof StreamsException);
-        final StreamsException actualException = (StreamsException) actualFailedTasks.exception;
-        assertFalse(actualException.taskId().isPresent());
-        assertEquals(expectedMessage, actualException.getMessage());
-        assertTrue(stateUpdater.getAllTasks().isEmpty());


Replaced with a call to containsAll() in verifyExceptionsAndFailedTasks(). There, reference equality is verified for the exception and the tasks, which I think is fine in this case.

Yup, agreed.

cadonna · 2022-06-08T21:06:02Z

streams/src/test/java/org/apache/kafka/streams/processor/internals/DefaultStateUpdaterTest.java

+        verifyRemovedTasks(task);
+        verifyRestoredActiveTasks();
+        verifyUpdatingTasks();
+        verifyExceptionsAndFailedTasks();


Now, I verify that tasks are in the correct location.

guozhangwang · 2022-06-09T17:02:09Z

streams/src/main/java/org/apache/kafka/streams/processor/internals/DefaultStateUpdater.java

        }

+        private void removeTask(final TaskId taskId) {
+            final Task task = updatingTasks.remove(taskId);


Not a comment: one of the bug fixes we want to piggy-back on restoration is to write a new checkpoint when we stop restoring a task; otherwise the restoration progress so far would be lost. The restore callback should be triggered as well (KAFKA-10575).

I will try to add this in a separate PR, which is also a good exercise in how the state updater could interact with the processor state manager.

Yeah, I was also thinking about checkpointing, but was not clear about all the details. Here are my thoughts about checkpointing:
We only checkpoint if we are not in EOS mode, because otherwise we would have a checkpoint file when we close dirty. On the other hand, also in EOS the offsets in that checkpoint file should be safe since it was written during restoration and not during a commit.
Is this correct, or am I missing something?

During processing: yes, today we should not write a checkpoint file when we commit.
During restoring: we can always write a checkpoint file regardless of EOS or ALOS, since if there's a failure we would just over-restore upon recovery, so no EOS violation happens.

Also during restoring: when we complete the restore or remove a task, we should enforce a checkpoint as well (in failure cases, though, we should not write a new one).

OK, my understanding was correct then. Thank you for the clarification!
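
The restoration-phase rules from this thread could be summarized as a small decision function. This is only a sketch of the discussion, not code from Kafka Streams: the class `RestoreCheckpointSketch`, the enum `RestorationEvent`, and the method name are all hypothetical.

```java
// Hypothetical sketch of the checkpoint rules discussed in this thread for
// the restoration phase; none of these names exist in Kafka Streams.
final class RestoreCheckpointSketch {
    enum RestorationEvent { RESTORE_COMPLETED, TASK_REMOVED, TASK_FAILED }

    private RestoreCheckpointSketch() { }

    // eosEnabled is accepted but deliberately ignored: during restoration a
    // failure only leads to over-restoring on recovery, so the processing
    // guarantee (EOS or ALOS) does not change the decision.
    static boolean shouldEnforceCheckpoint(final RestorationEvent event, final boolean eosEnabled) {
        switch (event) {
            case RESTORE_COMPLETED:
            case TASK_REMOVED:
                return true;   // keep the restoration progress made so far
            case TASK_FAILED:
                return false;  // do not write a new checkpoint after a failure
            default:
                throw new IllegalArgumentException("Unknown event: " + event);
        }
    }
}
```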

guozhangwang · 2022-06-09T17:10:35Z

streams/src/main/java/org/apache/kafka/streams/processor/internals/DefaultStateUpdater.java


    @Override
-    public List<ExceptionAndTasks> getFailedTasksAndExceptions() {
+    public Set<Task> drainRemovedTasks() {


When we add the logic for recycling a task, which is still to be done in the task manager, we would need two round-trips: first remove the task as active/standby, then, after recycling it, add the new task as standby/active. I'm still trying to flesh out the details here. In case we also need a timeout for removed tasks, similar to restored active tasks, this data structure may need to change from a blocking queue to a lock + condition + queue.
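
A rough sketch of what a lock + condition + queue with a drain timeout could look like, assuming strings as stand-ins for tasks. The class `RemovedTasksSketch` and the `Duration` parameter on `drainRemovedTasks` are assumptions for illustration; the method in this PR takes no timeout.

```java
import java.time.Duration;
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch; String stands in for Task.
class RemovedTasksSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition notEmpty = lock.newCondition();
    private final Queue<String> removedTasks = new ArrayDeque<>();

    // Called by the updater thread once a task has been removed.
    void addRemoved(final String task) {
        lock.lock();
        try {
            removedTasks.add(task);
            notEmpty.signalAll();
        } finally {
            lock.unlock();
        }
    }

    // Called by the main thread: waits up to the timeout for at least one
    // removed task, then returns everything queued so far (possibly nothing).
    Set<String> drainRemovedTasks(final Duration timeout) {
        long remainingNanos = timeout.toNanos();
        lock.lock();
        try {
            while (removedTasks.isEmpty() && remainingNanos > 0) {
                try {
                    // awaitNanos returns an estimate of the time left to wait.
                    remainingNanos = notEmpty.awaitNanos(remainingNanos);
                } catch (final InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
            final Set<String> drained = new HashSet<>(removedTasks);
            removedTasks.clear();
            return drained;
        } finally {
            lock.unlock();
        }
    }
}
```

Compared with a blocking queue, the explicit lock lets the drain wait for the first element and then take all queued elements in one critical section.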

guozhangwang · 2022-06-09T17:14:40Z

streams/src/main/java/org/apache/kafka/streams/processor/internals/DefaultStateUpdater.java

-        } finally {
-            tasksAndActionsLock.unlock();
-        }
+    public List<ExceptionAndTasks> getExceptionsAndFailedTasks() {


I feel some of the getters here will not end up being needed in non-testing code, since we have the corresponding drainXX functions that are declared in the interface and would be used there. If that turns out to be true, let's cluster them together and comment that they are for testing only.

Since we will use the interface StateUpdater in production code and not its implementation DefaultStateUpdater, those getters should not be visible to production code as long as no casting is used.
I am not a fan of the "for testing only" comment since it does not enforce anything and only pollutes the code.
Moreover, I think there is a good chance that we will need those getters to expose the tasks to the task manager. But let's see what the future brings.

Sounds good. I'm not advocating for the "for testing only" comment here; maybe it's just my paranoid gene that would like to see those functions clustered together, but I don't feel strongly either way :)

guozhangwang · 2022-06-09T17:22:15Z

streams/src/main/java/org/apache/kafka/streams/processor/internals/DefaultStateUpdater.java

+                log.debug((task.isActive() ? "Active" : "Standby")
+                    + " task " + task.id() + " was removed from the updating tasks and added to the removed tasks.");
+            } else {
+                log.debug("Task " + taskId + " was not removed since it is not updating.");


This is a meta comment: for tasks that have been restored or have failed, should we still include them in the removed tasks returned by the drain function?

The reason I'm wondering is that the caller of updater.remove would likely expect to eventually see the task show up in a future drain call (again, one example would be the recycle scenario). If we do not add them there, the caller's logic gets more complicated, since it also has to check the restored / failed sets from the updater when checking whether the task has been removed completely.

The downside, of course, is that a task can show up in multiple such channels, but I feel the caller's logic to "de-dup" such cases would be easier as long as there's a deterministic ordering for checking removed/completed/failed tasks from the updater.

WDYT?

I am not sure that one way is less complex than the other. In any case, users need to keep track of the tasks they want to remove in one way or another. However, I am open to changes here since it is not trivial how to give users feedback about what happened to the removed tasks.

Yeah, fair enough, we can revisit this when we are about to close the loop. Just wanted to bring this onto your radar sooner rather than later.
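
For illustration only, the caller-side "de-dup" with a deterministic checking order might look like the sketch below. The precedence used here (failed, then restored, then removed) is an assumption made for the example and was not agreed on in this thread; the class `RemovalOutcomeSketch` is hypothetical and strings stand in for task IDs.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical caller-side sketch; String stands in for TaskId.
final class RemovalOutcomeSketch {
    enum Outcome { FAILED, RESTORED, REMOVED }

    private RemovalOutcomeSketch() { }

    // Resolves each requested removal to exactly one outcome, even if the
    // task shows up in several of the drained sets. Assumed precedence:
    // FAILED, then RESTORED, then REMOVED.
    static Map<String, Outcome> resolve(final Set<String> requestedRemovals,
                                        final Set<String> failed,
                                        final Set<String> restored,
                                        final Set<String> removed) {
        final Map<String, Outcome> outcomes = new HashMap<>();
        for (final String taskId : requestedRemovals) {
            if (failed.contains(taskId)) {
                outcomes.put(taskId, Outcome.FAILED);
            } else if (restored.contains(taskId)) {
                outcomes.put(taskId, Outcome.RESTORED);
            } else if (removed.contains(taskId)) {
                outcomes.put(taskId, Outcome.REMOVED);
            }
            // Otherwise the updater has not reported the task yet, so the
            // caller keeps tracking it and no entry is added.
        }
        return outcomes;
    }
}
```

Either way, the caller still has to remember which removals it requested, which matches the point made above that users need to track those tasks themselves.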

guozhangwang · 2022-06-09T17:25:20Z

streams/src/main/java/org/apache/kafka/streams/processor/internals/StateUpdater.java

-     * Get all tasks (active and standby) that are managed by the state updater.
+     * Drains the removed tasks (active and standbys) from the state updater.
+     *
+     * Removed tasks returned by this method are tasks extraordinarily removed from the state updater. These do not


Like the word "extraordinarily" :)

Jokes aside, I have a slightly different thought about the semantics here, i.e. whether the drained removed tasks and the restored/failed tasks should be mutually exclusive or possibly overlapping, mainly in terms of how easily the caller could handle the overlapping scenarios. I left an earlier comment above.

guozhangwang · 2022-06-09T17:26:10Z

streams/src/test/java/org/apache/kafka/streams/processor/internals/DefaultStateUpdaterTest.java

-        final List<ExceptionAndTasks> failedTasks = getFailedTasks(1);
-        assertEquals(1, failedTasks.size());
-        final ExceptionAndTasks actualFailedTasks = failedTasks.get(0);
-        assertEquals(2, actualFailedTasks.tasks.size());
-        assertTrue(actualFailedTasks.tasks.containsAll(Arrays.asList(task1, task2)));
-        assertTrue(actualFailedTasks.exception instanceof StreamsException);
-        final StreamsException actualException = (StreamsException) actualFailedTasks.exception;
-        assertFalse(actualException.taskId().isPresent());
-        assertEquals(expectedMessage, actualException.getMessage());
-        assertTrue(stateUpdater.getAllTasks().isEmpty());


Yup, agreed.

guozhangwang · 2022-06-09T17:28:46Z

Merged to trunk, thanks @cadonna! Please feel free to discuss the comments in follow-up PRs.

cadonna added 2 commits June 8, 2022 22:41

KAFKA-10199: Implement removing active and standby tasks from the sta…

7b804e5

…te updater This PR adds removing of active and standby tasks from the default implementation of the state updater. The PR also includes refactorings to clean up the code.

Include feedback from previous PR

04fd567

cadonna force-pushed the impl-removing_active_tasks_from_state_updater branch from 1873054 to 04fd567 Compare June 8, 2022 20:41

cadonna requested a review from guozhangwang June 8, 2022 21:08

Remove unnecessary code

a4b7cc6

guozhangwang reviewed Jun 9, 2022

guozhangwang merged commit e67408c into apache:trunk Jun 9, 2022
