Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KAFKA-14154; Return NOT_CONTROLLER from AlterPartition if leader is ahead of controller #12506

Merged
merged 2 commits into from
Aug 11, 2022

Conversation

hachikuji
Copy link
Contributor

It is possible for the leader to send an AlterPartition request to a zombie controller which includes either a partition or leader epoch which is larger than what is found in the controller context. Prior to #12032, the controller handled this in the following way:

  1. If the LeaderAndIsr state exactly matches the current state on the controller excluding the partition epoch, then the AlterPartition request is considered successful and no error is returned. The risk with this handling is that this may cause the leader to incorrectly assume that the state had been successfully updated. Since the controller's state is stale, there is no way to know what the latest ISR state is.
  2. Otherwise, the controller will attempt to update the state in zookeeper with the leader/partition epochs from the AlterPartition request. This operation would fail if the controller's epoch was not still current in Zookeeper and the result would be a NOT_CONTROLLER error.

Following #12032, the controller's validation is stricter. If the partition epoch is larger than expected, then the controller will return INVALID_UPDATE_VERSION without attempting the operation. Similarly, if the leader epoch is larger than expected, the controller will return FENCED_LEADER_EPOCH. The problem with this new handling is that the leader treats the errors from the controller as authoritative. For example, if it sees the FENCED_LEADER_EPOCH error, then it will not retry the request and will simply wait until the next leader epoch arrives. The ISR state gets suck in a pending state, which can lead to persistent URPs until the leader epoch gets bumped.

In this patch, we want to fix the issues with this handling, but we don't want to restore the buggy idempotent check. The approach is straightforward. If the controller sees a partition/leader epoch which is larger than what it has in the controller context, then it assumes that has become a zombie and returns NOT_CONTROLLER to the leader. This will cause the leader to attempt to reset the controller from its local metadata cache and retry the AlterPartition request.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@hachikuji
Copy link
Contributor Author

cc @jsancio @dajac

@hachikuji
Copy link
Contributor Author

One note: this patch fixes the problem for the zk controller, but I believe the problem exists for kraft as well. I am thinking we can address that separately with a slightly different approach. Basically we can use the handleLeaderChange notification on RaftClient.Listener in order to push controller changes directly to BrokerToControllerChannelManager. This can give us a guarantee that AlterPartition can only be sent to controllers which have an epoch which is at least as large as any partition change that the request is based off of.

@jsancio
Copy link
Member

jsancio commented Aug 11, 2022

but I believe the problem exists for kraft as well.

Are you planning to use the same Jira (KAFKA-14154) to track this? I assume that the KRaft fix is also a blocker for 3.3.0.

Copy link
Member

@jsancio jsancio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds reasonable to assume that it is not the controller if the leader and partition epoch are larger. Since these are async calls it is always correct to return NOT_CONTROLLER and have the client/broker discover the controller and retry.

@jsancio jsancio added core Kafka Broker 3.3 labels Aug 11, 2022
Copy link
Member

@dajac dajac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Left one nit for consideration.

@hachikuji
Copy link
Contributor Author

@jsancio Yeah, let's use the same jira for the kraft issue if that's ok with you. Let me see if I can get a PR out today.

@hachikuji hachikuji merged commit 520f729 into apache:trunk Aug 11, 2022
hachikuji added a commit that referenced this pull request Aug 11, 2022
…head of controller (#12506)

It is possible for the leader to send an `AlterPartition` request to a zombie controller which includes either a partition or leader epoch which is larger than what is found in the controller context. Prior to #12032, the controller handled this in the following way:

1. If the `LeaderAndIsr` state exactly matches the current state on the controller excluding the partition epoch, then the `AlterPartition` request is considered successful and no error is returned. The risk with this handling is that this may cause the leader to incorrectly assume that the state had been successfully updated. Since the controller's state is stale, there is no way to know what the latest ISR state is.
2. Otherwise, the controller will attempt to update the state in zookeeper with the leader/partition epochs from the `AlterPartition` request. This operation would fail if the controller's epoch was not still current in Zookeeper and the result would be a `NOT_CONTROLLER` error.

Following #12032, the controller's validation is stricter. If the partition epoch is larger than expected, then the controller will return `INVALID_UPDATE_VERSION` without attempting the operation. Similarly, if the leader epoch is larger than expected, the controller will return `FENCED_LEADER_EPOCH`. The problem with this new handling is that the leader treats the errors from the controller as authoritative. For example, if it sees the `FENCED_LEADER_EPOCH` error, then it will not retry the request and will simply wait until the next leader epoch arrives. The ISR state gets suck in a pending state, which can lead to persistent URPs until the leader epoch gets bumped.

In this patch, we want to fix the issues with this handling, but we don't want to restore the buggy idempotent check. The approach is straightforward. If the controller sees a partition/leader epoch which is larger than what it has in the controller context, then it assumes that has become a zombie and returns `NOT_CONTROLLER` to the leader. This will cause the leader to attempt to reset the controller from its local metadata cache and retry the `AlterPartition` request.

Reviewers: David Jacot <djacot@confluent.io>, José Armando García Sancio <jsancio@users.noreply.github.com>
ijuma added a commit to franz1981/kafka that referenced this pull request Aug 12, 2022
* apache-github/trunk: (447 commits)
  KAFKA-13959: Controller should unfence Broker with busy metadata log (apache#12274)
  KAFKA-10199: Expose read only task from state updater (apache#12497)
  KAFKA-14154; Return NOT_CONTROLLER from AlterPartition if leader is ahead of controller (apache#12506)
  KAFKA-13986; Brokers should include node.id in fetches to metadata quorum (apache#12498)
  KAFKA-14163; Retry compilation after zinc compile cache error (apache#12507)
  Remove duplicate common.message.* from clients:test jar file (apache#12407)
  KAFKA-13060: Replace EasyMock and PowerMock with Mockito in WorkerGroupMemberTest.java (apache#12484)
  Fix the rate window size calculation for edge cases (apache#12184)
  MINOR: Upgrade gradle to 7.5.1 and bump other build/test dependencies (apache#12495)
  KAFKA-14140: Ensure an offline or in-controlled-shutdown replica is not eligible to join ISR in ZK mode (apache#12487)
  KAFKA-14114: Add Metadata Error Related Metrics
  MINOR: BrokerMetadataSnapshotter must avoid exceeding batch size (apache#12486)
  MINOR: Upgrade mockito test dependencies (apache#12460)
  KAFKA-14144:; Compare AlterPartition LeaderAndIsr before fencing partition epoch (apache#12489)
  KAFKA-14134: Replace EasyMock with Mockito for WorkerConnectorTest (apache#12472)
  MINOR: Update scala version in bin scripts to 2.13.8 (apache#12477)
  KAFKA-14104; Add CRC validation when iterating over Metadata Log Records (apache#12457)
  MINOR: add :server-common test dependency to :storage (apache#12488)
  KAFKA-14107: Upgrade Jetty version for CVE fixes (apache#12440)
  KAFKA-14124: improve quorum controller fault handling (apache#12447)
  ...
hachikuji added a commit that referenced this pull request Aug 15, 2022
… epoch is ahead (#12514)

Similar to #12506. For the Kraft controller, we should return NOT_CONTROLLER if the leader/partition epoch in the request is ahead of the controller. 

Reviewers: José Armando García Sancio <jsancio@users.noreply.github.com>
hachikuji added a commit that referenced this pull request Aug 15, 2022
… epoch is ahead (#12514)

Similar to #12506. For the Kraft controller, we should return NOT_CONTROLLER if the leader/partition epoch in the request is ahead of the controller. 

Reviewers: José Armando García Sancio <jsancio@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Kafka Broker
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants