-
Notifications
You must be signed in to change notification settings - Fork 14.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KAFKA-14154; Return NOT_CONTROLLER from AlterPartition if leader is ahead of controller #12506
Conversation
…head of controller
One note: this patch fixes the problem for the zk controller, but I believe the problem exists for kraft as well. I am thinking we can address that separately with a slightly different approach. Basically we can use the |
Are you planning to use the same Jira (KAFKA-14154) to track this? I assume that the KRaft fix is also a blocker for 3.3.0. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds reasonable to assume that it is not the controller if the leader and partition epoch are larger. Since these are async calls it is always correct to return NOT_CONTROLLER and have the client/broker discover the controller and retry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Left one nit for consideration.
@jsancio Yeah, let's use the same jira for the kraft issue if that's ok with you. Let me see if I can get a PR out today. |
…head of controller (#12506) It is possible for the leader to send an `AlterPartition` request to a zombie controller which includes either a partition or leader epoch which is larger than what is found in the controller context. Prior to #12032, the controller handled this in the following way: 1. If the `LeaderAndIsr` state exactly matches the current state on the controller excluding the partition epoch, then the `AlterPartition` request is considered successful and no error is returned. The risk with this handling is that this may cause the leader to incorrectly assume that the state had been successfully updated. Since the controller's state is stale, there is no way to know what the latest ISR state is. 2. Otherwise, the controller will attempt to update the state in zookeeper with the leader/partition epochs from the `AlterPartition` request. This operation would fail if the controller's epoch was not still current in Zookeeper and the result would be a `NOT_CONTROLLER` error. Following #12032, the controller's validation is stricter. If the partition epoch is larger than expected, then the controller will return `INVALID_UPDATE_VERSION` without attempting the operation. Similarly, if the leader epoch is larger than expected, the controller will return `FENCED_LEADER_EPOCH`. The problem with this new handling is that the leader treats the errors from the controller as authoritative. For example, if it sees the `FENCED_LEADER_EPOCH` error, then it will not retry the request and will simply wait until the next leader epoch arrives. The ISR state gets suck in a pending state, which can lead to persistent URPs until the leader epoch gets bumped. In this patch, we want to fix the issues with this handling, but we don't want to restore the buggy idempotent check. The approach is straightforward. If the controller sees a partition/leader epoch which is larger than what it has in the controller context, then it assumes that has become a zombie and returns `NOT_CONTROLLER` to the leader. This will cause the leader to attempt to reset the controller from its local metadata cache and retry the `AlterPartition` request. Reviewers: David Jacot <djacot@confluent.io>, José Armando García Sancio <jsancio@users.noreply.github.com>
* apache-github/trunk: (447 commits) KAFKA-13959: Controller should unfence Broker with busy metadata log (apache#12274) KAFKA-10199: Expose read only task from state updater (apache#12497) KAFKA-14154; Return NOT_CONTROLLER from AlterPartition if leader is ahead of controller (apache#12506) KAFKA-13986; Brokers should include node.id in fetches to metadata quorum (apache#12498) KAFKA-14163; Retry compilation after zinc compile cache error (apache#12507) Remove duplicate common.message.* from clients:test jar file (apache#12407) KAFKA-13060: Replace EasyMock and PowerMock with Mockito in WorkerGroupMemberTest.java (apache#12484) Fix the rate window size calculation for edge cases (apache#12184) MINOR: Upgrade gradle to 7.5.1 and bump other build/test dependencies (apache#12495) KAFKA-14140: Ensure an offline or in-controlled-shutdown replica is not eligible to join ISR in ZK mode (apache#12487) KAFKA-14114: Add Metadata Error Related Metrics MINOR: BrokerMetadataSnapshotter must avoid exceeding batch size (apache#12486) MINOR: Upgrade mockito test dependencies (apache#12460) KAFKA-14144:; Compare AlterPartition LeaderAndIsr before fencing partition epoch (apache#12489) KAFKA-14134: Replace EasyMock with Mockito for WorkerConnectorTest (apache#12472) MINOR: Update scala version in bin scripts to 2.13.8 (apache#12477) KAFKA-14104; Add CRC validation when iterating over Metadata Log Records (apache#12457) MINOR: add :server-common test dependency to :storage (apache#12488) KAFKA-14107: Upgrade Jetty version for CVE fixes (apache#12440) KAFKA-14124: improve quorum controller fault handling (apache#12447) ...
It is possible for the leader to send an
AlterPartition
request to a zombie controller which includes either a partition or leader epoch which is larger than what is found in the controller context. Prior to #12032, the controller handled this in the following way:LeaderAndIsr
state exactly matches the current state on the controller excluding the partition epoch, then theAlterPartition
request is considered successful and no error is returned. The risk with this handling is that this may cause the leader to incorrectly assume that the state had been successfully updated. Since the controller's state is stale, there is no way to know what the latest ISR state is.AlterPartition
request. This operation would fail if the controller's epoch was not still current in Zookeeper and the result would be aNOT_CONTROLLER
error.Following #12032, the controller's validation is stricter. If the partition epoch is larger than expected, then the controller will return
INVALID_UPDATE_VERSION
without attempting the operation. Similarly, if the leader epoch is larger than expected, the controller will returnFENCED_LEADER_EPOCH
. The problem with this new handling is that the leader treats the errors from the controller as authoritative. For example, if it sees theFENCED_LEADER_EPOCH
error, then it will not retry the request and will simply wait until the next leader epoch arrives. The ISR state gets suck in a pending state, which can lead to persistent URPs until the leader epoch gets bumped.In this patch, we want to fix the issues with this handling, but we don't want to restore the buggy idempotent check. The approach is straightforward. If the controller sees a partition/leader epoch which is larger than what it has in the controller context, then it assumes that has become a zombie and returns
NOT_CONTROLLER
to the leader. This will cause the leader to attempt to reset the controller from its local metadata cache and retry theAlterPartition
request.Committer Checklist (excluded from commit message)