
KAFKA-14124: improve quorum controller fault handling #12447

Merged: 4 commits merged into apache:trunk from the faults-II branch on Aug 5, 2022

Conversation

cmccabe (Contributor) commented on Jul 28, 2022

Before trying to commit a batch of records to the __cluster_metadata log, the active controller should try to apply them to its current in-memory state. If this application process fails, the active controller process should exit, allowing another node to take leadership. This will prevent most bad metadata records from ending up in the log and help to surface errors during testing.

Similarly, if the active controller attempts to renounce leadership, and the renunciation process itself fails, the process should exit. This will help avoid bugs where the active controller continues in an undefined state.

In contrast, standby controllers that experience metadata application errors should continue on, in order to avoid a scenario where a bad record brings down the whole controller cluster. The intended effect of these changes is to make it harder to commit a bad record to the metadata log, but to continue to ride out the bad record as well as possible if such a record does get committed.
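As a rough sketch of the leader-side flow this describes (illustrative only: replay, records, curClaimEpoch, and fatalFaultHandler are stand-in names, though RaftClient#scheduleAppend is the real raft append entry point):

// Illustrative sketch of the active-controller path: apply records to the
// in-memory state first, and only hand them to the raft layer on success.
void appendRecords(List<ApiMessageAndVersion> records) {
    try {
        for (ApiMessageAndVersion message : records) {
            replay(message); // throws if the record cannot be applied
        }
    } catch (Throwable t) {
        // Fatal on the active controller: the handler logs and exits the
        // process, letting another node take over leadership.
        fatalFaultHandler.handleFault("error applying records", t);
        return;
    }
    // Only reached if every record applied cleanly.
    raftClient.scheduleAppend(curClaimEpoch, records);
}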

This PR introduces the FaultHandler interface to implement these concepts. In junit tests, we use a FaultHandler implementation which does not exit the process. This allows us to avoid terminating the gradle test runner, which would be very disruptive. It also allows us to ensure that the test surfaces these exceptions, which we previously were not doing (the mock fault handler stores the exception).
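The shape of the interface and of the test handler is roughly the following, inferred from the calls quoted later in this review (the two-argument handleFault and its single-argument default); the committed code may differ in detail:

// Sketch of the FaultHandler interface, inferred from the calls shown in
// this review; each type would live in its own file in practice.
interface FaultHandler {
    // Default convenience overload for faults with no associated exception.
    default void handleFault(String failureMessage) {
        handleFault(failureMessage, null);
    }

    void handleFault(String failureMessage, Throwable cause);
}

// Sketch of a junit-friendly handler: it stores the first fault for the
// test to assert on instead of exiting the process.
class MockFaultHandler implements FaultHandler {
    private Throwable firstFault;

    @Override
    public synchronized void handleFault(String failureMessage, Throwable cause) {
        if (firstFault == null) {
            firstFault = cause != null ? cause : new RuntimeException(failureMessage);
        }
    }

    public synchronized Throwable firstFault() {
        return firstFault;
    }
}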

In addition to the above, this PR fixes a bug where RaftClient#resign was not being called from the renounce() function. This bug could have resulted in the raft layer not being informed of an active controller resigning.
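Concretely, the renunciation path now tells the raft layer to step down, roughly as follows (a simplified sketch; the fault message matches the one seen in the logs quoted later on this page, the rest is illustrative):

// Simplified sketch of the renounce() fix: notify the raft layer that this
// node is giving up leadership for its claimed epoch.
private void renounce() {
    try {
        raftClient.resign(curClaimEpoch); // the call that was missing
        // ... revert in-memory state to the last committed offset, fail
        // any in-flight operations, and so on.
    } catch (Throwable t) {
        // If renunciation itself fails, exit rather than continue in an
        // undefined state.
        fatalFaultHandler.handleFault("exception while renouncing leadership", t);
    }
}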

cmccabe changed the title from "MINOR: improve quorum controller fault handling" to "KAFKA-14124: improve quorum controller fault handling" on Jul 28, 2022
mumrah (Member) left a comment

Thanks for the patch, @cmccabe. I'm happy to see us tightening up the error handling :)

I like the new interface for fault handling. I left some comments inline

Just to make sure I understand the behavioral changes in this patch, it looks like we are:

  • Applying records to the leader before sending to raft (i.e., verified input)
  • Killing the leader if we cannot apply a record (again, prior to sending to raft)
  • Not killing a follower if there is an error applying a record
  • Not killing a follower if there is an error applying a snapshot

It would be nice to amend the javadoc for QuorumController to include some notes on the error handling semantics.
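Such a note might read along these lines (a sketch of possible wording, not the committed javadoc):

/**
 * Error-handling semantics: on the active controller, records are applied
 * to the in-memory state before they are handed to the raft layer; if that
 * application fails, the fatal fault handler runs and the process exits.
 * On standby controllers, errors while applying committed records or
 * snapshots are reported through the non-fatal fault handler and the
 * controller keeps running, so one bad record cannot bring down the whole
 * quorum.
 */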

}
i++;
}

// If the operation returned a batch of records, those records need to be
mumrah (Member):

We should update this comment to something like "if the records could be applied ... "

);
fatalFaultHandler.handleFault(String.format("Asked to load snapshot " +
"(%s) when it is the active controller (%d)", reader.snapshotId(),
curClaimEpoch), null);
mumrah (Member):

Can call the default method on the fault handler here instead of passing null

Comment on lines +931 to +932
int i = 1;
for (ApiMessageAndVersion message : messages) {
mumrah (Member):

We have this pattern in three places now in QuorumController. Any benefit of refactoring into a private method? Maybe we add some helper to iterate the messages along with the index?

cmccabe (Contributor, Author):

Yeah, but the problem is it's subtly different in each of these loops. The message, for example, is different, and whether the snapshot ID is passed in, etc. I think it would just be confusing to try to unify them (it's a single "for" loop after all)
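For context, the helper being suggested might have looked something like this (entirely hypothetical, and, per the reply above, not adopted because each loop differs slightly):

// Hypothetical helper (not in the patch): replay a batch of messages,
// reporting the index of the failing message in the fault description.
private void replayAll(List<ApiMessageAndVersion> messages, String context) {
    int i = 1;
    for (ApiMessageAndVersion message : messages) {
        try {
            replay(message);
        } catch (Throwable t) {
            fatalFaultHandler.handleFault(String.format(
                "error replaying record %d of %d in %s",
                i, messages.size(), context), t);
        }
        i++;
    }
}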



/**
* A metadata fault.
mumrah (Member):

Can we elaborate on when it's expected to use this exception? Is it just when applying records?

cmccabe (Contributor, Author):

yeah, will do.
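The elaborated javadoc might end up reading something like this (a sketch, not the committed wording):

/**
 * A metadata fault: an error encountered while manipulating the in-memory
 * metadata state, such as a failure to apply a metadata record or to load
 * a snapshot.
 */
public class MetadataFaultException extends RuntimeException {
    public MetadataFaultException(String message) {
        super(message);
    }

    public MetadataFaultException(String message, Throwable cause) {
        super(message, cause);
    }
}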

// NoOpRecord is an empty record and doesn't need to be replayed
break;
default:
throw new RuntimeException("Unhandled record type " + type);
mumrah (Member):

MetadataFaultException?

cmccabe (Contributor, Author):

this gets wrapped by the caller
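Presumably the call-site wrapping looks something like this (illustrative, not the literal patch code):

// Illustrative call-site wrapping: the bare RuntimeException thrown by the
// default branch above is caught and re-wrapped by the caller.
try {
    replay(message);
} catch (Exception e) {
    throw new MetadataFaultException("error replaying " + message, e);
}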


@Override
public void handleFault(String failureMessage, Throwable cause) {
FaultHandler.logFailureMessage(log, failureMessage, cause);
mumrah (Member):

Is this where we would increment one of the metrics being discussed in KIP-859?

cmccabe (Contributor, Author):

yeah. exactly
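Presumably something like the following once KIP-859 lands (metadataFaultCount is a placeholder name, not taken from the KIP):

// Sketch: where a KIP-859-style fault metric could be incremented.
@Override
public void handleFault(String failureMessage, Throwable cause) {
    FaultHandler.logFailureMessage(log, failureMessage, cause);
    metadataFaultCount.increment(); // hypothetical metric hook
}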

mumrah (Member) left a comment

Changes look good, thanks @cmccabe! :shipit:

cmccabe added 2 commits August 4, 2022 10:13
cmccabe merged commit 555744d into apache:trunk on Aug 5, 2022
cmccabe deleted the faults-II branch on August 5, 2022
ijuma (Contributor) commented on Aug 5, 2022

There are a bunch of test failures in the PR builder; how come this was merged?

ijuma (Contributor) commented on Aug 5, 2022

Synced with Colin offline; he submitted a fix here: #12488

cmccabe added a commit that referenced this pull request Aug 9, 2022
Reviewers: David Arthur <mumrah@gmail.com>
ijuma added a commit to confluentinc/kafka that referenced this pull request Aug 10, 2022
…(10 August 2022)

Trivial conflict in gradle/dependencies.gradle due to the newer Netty
version in confluentinc/kafka.

* apache-github/trunk:
  MINOR: Upgrade gradle to 7.5.1 and bump other build/test dependencies (apache#12495)
  KAFKA-14140: Ensure an offline or in-controlled-shutdown replica is not eligible to join ISR in ZK mode (apache#12487)
  KAFKA-14114: Add Metadata Error Related Metrics
  MINOR: BrokerMetadataSnapshotter must avoid exceeding batch size (apache#12486)
  MINOR: Upgrade mockito test dependencies (apache#12460)
  KAFKA-14144; Compare AlterPartition LeaderAndIsr before fencing partition epoch (apache#12489)
  KAFKA-14134: Replace EasyMock with Mockito for WorkerConnectorTest (apache#12472)
  MINOR: Update scala version in bin scripts to 2.13.8 (apache#12477)
  KAFKA-14104; Add CRC validation when iterating over Metadata Log Records (apache#12457)
  MINOR: add :server-common test dependency to :storage (apache#12488)
  KAFKA-14107: Upgrade Jetty version for CVE fixes (apache#12440)
  KAFKA-14124: improve quorum controller fault handling (apache#12447)
ijuma added a commit to franz1981/kafka that referenced this pull request Aug 12, 2022
* apache-github/trunk: (447 commits)
  KAFKA-13959: Controller should unfence Broker with busy metadata log (apache#12274)
  KAFKA-10199: Expose read only task from state updater (apache#12497)
  KAFKA-14154; Return NOT_CONTROLLER from AlterPartition if leader is ahead of controller (apache#12506)
  KAFKA-13986; Brokers should include node.id in fetches to metadata quorum (apache#12498)
  KAFKA-14163; Retry compilation after zinc compile cache error (apache#12507)
  Remove duplicate common.message.* from clients:test jar file (apache#12407)
  KAFKA-13060: Replace EasyMock and PowerMock with Mockito in WorkerGroupMemberTest.java (apache#12484)
  Fix the rate window size calculation for edge cases (apache#12184)
  MINOR: Upgrade gradle to 7.5.1 and bump other build/test dependencies (apache#12495)
  KAFKA-14140: Ensure an offline or in-controlled-shutdown replica is not eligible to join ISR in ZK mode (apache#12487)
  KAFKA-14114: Add Metadata Error Related Metrics
  MINOR: BrokerMetadataSnapshotter must avoid exceeding batch size (apache#12486)
  MINOR: Upgrade mockito test dependencies (apache#12460)
  KAFKA-14144; Compare AlterPartition LeaderAndIsr before fencing partition epoch (apache#12489)
  KAFKA-14134: Replace EasyMock with Mockito for WorkerConnectorTest (apache#12472)
  MINOR: Update scala version in bin scripts to 2.13.8 (apache#12477)
  KAFKA-14104; Add CRC validation when iterating over Metadata Log Records (apache#12457)
  MINOR: add :server-common test dependency to :storage (apache#12488)
  KAFKA-14107: Upgrade Jetty version for CVE fixes (apache#12440)
  KAFKA-14124: improve quorum controller fault handling (apache#12447)
  ...
tamama commented on May 13, 2023

Hi @cmccabe @mumrah

  • Our production cluster (Apache Kafka 3.3.1) is having occasional problems with the KRaft quorum controller.
  • A little searching brought me to this PR.
  • A quick question: does this PR resolve this issue?
  • Thank you!

[2023-05-13 00:35:47,509] ERROR Encountered fatal fault: exception while renouncing leadership (org.apache.kafka.server.fault.ProcessExitingFaultHandler) java.lang.NullPointerException: Cannot invoke "org.apache.kafka.timeline.BaseHashTable.baseAddOrReplace(Object)" because "this.deltaTable" is null ...


tamama commented on May 14, 2023

I can confirm that this issue is not resolved in the latest Apache Kafka, 3.4.0.

ijuma (Contributor) commented on May 14, 2023

@tamama Can you please file a ticket in JIRA?

showuon (Contributor) commented on May 15, 2023

@tamama, this issue will be fixed in v3.4.1/v3.5.0 via this patch: #13653. Thanks.

tamama commented on May 15, 2023

@showuon Thanks for your update!

@ijuma Will raise a JIRA ticket if the problem remains in 3.5.0 :)

tamama commented on Jun 5, 2023

Hi @showuon, any chance of the Kafka 3.5.0 release any time soon?

Thank you very much!

mumrah (Member) commented on Jun 5, 2023

@tamama you can subscribe to the kafka-users mailing list for updates on releases. https://kafka.apache.org/contact.

Edit: looks like 3.5.0 RC1 is out now, so the release is fairly close at this point.

tamama commented on Jun 5, 2023

@mumrah Subscribed with thanks.
