KAFKA-14154; Return NOT_CONTROLLER from AlterPartition if leader is ahead of controller #12506

hachikuji · 2022-08-11T16:58:23Z

It is possible for the leader to send an AlterPartition request to a zombie controller which includes either a partition or leader epoch which is larger than what is found in the controller context. Prior to #12032, the controller handled this in the following way:

If the LeaderAndIsr state exactly matches the current state on the controller excluding the partition epoch, then the AlterPartition request is considered successful and no error is returned. The risk with this handling is that this may cause the leader to incorrectly assume that the state had been successfully updated. Since the controller's state is stale, there is no way to know what the latest ISR state is.
Otherwise, the controller will attempt to update the state in zookeeper with the leader/partition epochs from the AlterPartition request. This operation would fail if the controller's epoch was not still current in Zookeeper and the result would be a NOT_CONTROLLER error.

Following #12032, the controller's validation is stricter. If the partition epoch is larger than expected, then the controller will return INVALID_UPDATE_VERSION without attempting the operation. Similarly, if the leader epoch is larger than expected, the controller will return FENCED_LEADER_EPOCH. The problem with this new handling is that the leader treats the errors from the controller as authoritative. For example, if it sees the FENCED_LEADER_EPOCH error, then it will not retry the request and will simply wait until the next leader epoch arrives. The ISR state gets suck in a pending state, which can lead to persistent URPs until the leader epoch gets bumped.

In this patch, we want to fix the issues with this handling, but we don't want to restore the buggy idempotent check. The approach is straightforward. If the controller sees a partition/leader epoch which is larger than what it has in the controller context, then it assumes that has become a zombie and returns NOT_CONTROLLER to the leader. This will cause the leader to attempt to reset the controller from its local metadata cache and retry the AlterPartition request.

Committer Checklist (excluded from commit message)

Verify design and implementation
Verify test coverage and CI build status
Verify documentation (including upgrade notes)

…head of controller

hachikuji · 2022-08-11T16:59:06Z

cc @jsancio @dajac

hachikuji · 2022-08-11T17:06:13Z

One note: this patch fixes the problem for the zk controller, but I believe the problem exists for kraft as well. I am thinking we can address that separately with a slightly different approach. Basically we can use the handleLeaderChange notification on RaftClient.Listener in order to push controller changes directly to BrokerToControllerChannelManager. This can give us a guarantee that AlterPartition can only be sent to controllers which have an epoch which is at least as large as any partition change that the request is based off of.

jsancio · 2022-08-11T17:21:39Z

but I believe the problem exists for kraft as well.

Are you planning to use the same Jira (KAFKA-14154) to track this? I assume that the KRaft fix is also a blocker for 3.3.0.

jsancio

Sounds reasonable to assume that it is not the controller if the leader and partition epoch are larger. Since these are async calls it is always correct to return NOT_CONTROLLER and have the client/broker discover the controller and retry.

dajac

LGTM. Left one nit for consideration.

core/src/main/scala/kafka/controller/KafkaController.scala

hachikuji · 2022-08-11T18:01:07Z

@jsancio Yeah, let's use the same jira for the kraft issue if that's ok with you. Let me see if I can get a PR out today.

…head of controller (#12506) It is possible for the leader to send an `AlterPartition` request to a zombie controller which includes either a partition or leader epoch which is larger than what is found in the controller context. Prior to #12032, the controller handled this in the following way: 1. If the `LeaderAndIsr` state exactly matches the current state on the controller excluding the partition epoch, then the `AlterPartition` request is considered successful and no error is returned. The risk with this handling is that this may cause the leader to incorrectly assume that the state had been successfully updated. Since the controller's state is stale, there is no way to know what the latest ISR state is. 2. Otherwise, the controller will attempt to update the state in zookeeper with the leader/partition epochs from the `AlterPartition` request. This operation would fail if the controller's epoch was not still current in Zookeeper and the result would be a `NOT_CONTROLLER` error. Following #12032, the controller's validation is stricter. If the partition epoch is larger than expected, then the controller will return `INVALID_UPDATE_VERSION` without attempting the operation. Similarly, if the leader epoch is larger than expected, the controller will return `FENCED_LEADER_EPOCH`. The problem with this new handling is that the leader treats the errors from the controller as authoritative. For example, if it sees the `FENCED_LEADER_EPOCH` error, then it will not retry the request and will simply wait until the next leader epoch arrives. The ISR state gets suck in a pending state, which can lead to persistent URPs until the leader epoch gets bumped. In this patch, we want to fix the issues with this handling, but we don't want to restore the buggy idempotent check. The approach is straightforward. If the controller sees a partition/leader epoch which is larger than what it has in the controller context, then it assumes that has become a zombie and returns `NOT_CONTROLLER` to the leader. This will cause the leader to attempt to reset the controller from its local metadata cache and retry the `AlterPartition` request. Reviewers: David Jacot <djacot@confluent.io>, José Armando García Sancio <jsancio@users.noreply.github.com>

* apache-github/trunk: (447 commits) KAFKA-13959: Controller should unfence Broker with busy metadata log (apache#12274) KAFKA-10199: Expose read only task from state updater (apache#12497) KAFKA-14154; Return NOT_CONTROLLER from AlterPartition if leader is ahead of controller (apache#12506) KAFKA-13986; Brokers should include node.id in fetches to metadata quorum (apache#12498) KAFKA-14163; Retry compilation after zinc compile cache error (apache#12507) Remove duplicate common.message.* from clients:test jar file (apache#12407) KAFKA-13060: Replace EasyMock and PowerMock with Mockito in WorkerGroupMemberTest.java (apache#12484) Fix the rate window size calculation for edge cases (apache#12184) MINOR: Upgrade gradle to 7.5.1 and bump other build/test dependencies (apache#12495) KAFKA-14140: Ensure an offline or in-controlled-shutdown replica is not eligible to join ISR in ZK mode (apache#12487) KAFKA-14114: Add Metadata Error Related Metrics MINOR: BrokerMetadataSnapshotter must avoid exceeding batch size (apache#12486) MINOR: Upgrade mockito test dependencies (apache#12460) KAFKA-14144:; Compare AlterPartition LeaderAndIsr before fencing partition epoch (apache#12489) KAFKA-14134: Replace EasyMock with Mockito for WorkerConnectorTest (apache#12472) MINOR: Update scala version in bin scripts to 2.13.8 (apache#12477) KAFKA-14104; Add CRC validation when iterating over Metadata Log Records (apache#12457) MINOR: add :server-common test dependency to :storage (apache#12488) KAFKA-14107: Upgrade Jetty version for CVE fixes (apache#12440) KAFKA-14124: improve quorum controller fault handling (apache#12447) ...

… epoch is ahead (#12514) Similar to #12506. For the Kraft controller, we should return NOT_CONTROLLER if the leader/partition epoch in the request is ahead of the controller. Reviewers: José Armando García Sancio <jsancio@users.noreply.github.com>

KAFKA-14154; Return NOT_CONTROLLER from AlterPartition if leader is a…

7497c13

…head of controller

jsancio approved these changes Aug 11, 2022

View reviewed changes

jsancio added core Kafka Broker 3.3 labels Aug 11, 2022

dajac approved these changes Aug 11, 2022

View reviewed changes

core/src/main/scala/kafka/controller/KafkaController.scala Outdated Show resolved Hide resolved

Fix alignment of new check

5b2dadc

hachikuji merged commit 520f729 into apache:trunk Aug 11, 2022

hachikuji mentioned this pull request Aug 15, 2022

KAFKA-14154; Kraft controller should return NOT_CONTROLLER if request epochs are ahead #12514

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

KAFKA-14154; Return NOT_CONTROLLER from AlterPartition if leader is ahead of controller #12506

KAFKA-14154; Return NOT_CONTROLLER from AlterPartition if leader is ahead of controller #12506

hachikuji commented Aug 11, 2022

hachikuji commented Aug 11, 2022

hachikuji commented Aug 11, 2022

jsancio commented Aug 11, 2022 •

edited

Loading

jsancio left a comment

dajac left a comment

hachikuji commented Aug 11, 2022

KAFKA-14154; Return NOT_CONTROLLER from AlterPartition if leader is ahead of controller #12506

KAFKA-14154; Return NOT_CONTROLLER from AlterPartition if leader is ahead of controller #12506

Conversation

hachikuji commented Aug 11, 2022

Committer Checklist (excluded from commit message)

hachikuji commented Aug 11, 2022

hachikuji commented Aug 11, 2022

jsancio commented Aug 11, 2022 • edited Loading

jsancio left a comment

Choose a reason for hiding this comment

dajac left a comment

Choose a reason for hiding this comment

hachikuji commented Aug 11, 2022

jsancio commented Aug 11, 2022 •

edited

Loading