
KAFKA-13916; Fenced replicas should not be allowed to join the ISR in KRaft (KIP-841, Part 2)#12181

Merged
dajac merged 15 commits into apache:trunk from dajac:KAFKA-13916 on Jun 14, 2022

Conversation

@dajac (Member) commented May 19, 2022

This PR implements KIP-841. Specifically, it implements the following:

  • It introduces INELIGIBLE_REPLICA and NEW_LEADER_ELECTED error codes.
  • The KRaft controller validates the new ISR provided in the AlterPartition request and rejects the call if any replica in the new ISR is not eligible to join the ISR - e.g. when fenced or shutting down. The leader reverts to the last committed ISR when its request is rejected for this reason.
  • The partition leader also verifies that a replica is not fenced before trying to add it back to the ISR. If it is not eligible, the ISR expansion is not triggered at all.
  • Updates the AlterPartition API to use topic ids. Updates the AlterPartitionManager to handle topic names/ids. Updates the ZK controller and the KRaft controller to handle topic names/ids depending on the version of the request used.
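The controller-side validation described above can be sketched as follows. This is a minimal illustration, not Kafka's actual implementation; `BrokerState`, `validateProposedIsr`, and the result names are hypothetical stand-ins for the real controller types and the new `INELIGIBLE_REPLICA` error code.

```java
// Hedged sketch of the KIP-841 eligibility check: a replica that is
// unknown, fenced, or in controlled shutdown may not join the ISR.
import java.util.List;
import java.util.Map;

class IsrValidationSketch {
    record BrokerState(boolean fenced, boolean shuttingDown) {}

    enum Result { OK, INELIGIBLE_REPLICA }

    static Result validateProposedIsr(List<Integer> proposedIsr,
                                      Map<Integer, BrokerState> brokers) {
        for (int replicaId : proposedIsr) {
            BrokerState state = brokers.get(replicaId);
            // Unknown, fenced, or shutting-down replicas are ineligible.
            if (state == null || state.fenced() || state.shuttingDown()) {
                return Result.INELIGIBLE_REPLICA;
            }
        }
        return Result.OK;
    }
}
```

On rejection with this error, the leader would revert to its last committed ISR rather than retrying the same proposal.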

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@dajac dajac changed the title KAFKA-13916; Fenced replicas should not be allowed to join the ISR in KRaft (KIP-841) KAFKA-13916; Fenced replicas should not be allowed to join the ISR in KRaft (KIP-841, Part 2) Jun 2, 2022
@dajac dajac marked this pull request as ready for review June 3, 2022 14:19
```scala
eventManager.put(
  AlterPartitionReceived(alterPartitionRequest.brokerId, alterPartitionRequest.brokerEpoch, partitionsToAlter, responseCallback)
)

def alterPartitions(
```
Contributor

Just checking my understanding. It looks like we have not modified this logic to use INELIGIBLE_REPLICA. Is that right? Should we?

Member Author

That's right. Do you think we need it?

I suppose that we could have a similar race condition, especially if the shutting-down replica is not in the ISR at the time of shutting down. In this case, we don't bump the leader epoch, so the replica could make it back into the ISR before receiving the StopReplica request. We could prevent shutting-down replicas from joining the ISR. One issue is that the leaders would never learn about this state, so they have no way to prevent unnecessary retries. This is similar to the discussion we had for KRaft.

Given that we explicitly stop replicas, I tend to believe that this race condition is less likely in ZK mode. I wonder whether it is worth fixing. What do you think?

Contributor

This is somewhat related to this issue: #12271. I guess once we fix this, then relying on StopReplica and the leader epoch bump may be good enough.

Member

@jsancio left a comment

Thanks for the changes @dajac . Partial review. I need to look at this again tomorrow.

Comment on lines 91 to 102
```scala
// In ZK mode, if the deployed software of the controller uses version 2.8 or above
// but the IBP is below 2.8, the controller does not assign topic ids. In this case,
// it should not advertise AlterPartition API version 2 and above.
val alterPartitionApiVersion = response.apiVersion(ApiKeys.ALTER_PARTITION.id)
if (alterPartitionApiVersion != null) {
  alterPartitionApiVersion.setMaxVersion(
    if (metadataVersion.isTopicIdsSupported)
      alterPartitionApiVersion.maxVersion()
    else
      1.toShort
  )
}
```
Member Author

@hachikuji There is something that I missed in this PR. If the controller runs 2.8 software or above but does not use IBP 2.8 or above yet, topic ids are not assigned. In other words, it is not safe to use AlterPartition version 2 in this case even if the controller supports it. We don't really have a mechanism for this at the moment, so I have put the logic here. We should definitely think about a better approach. What do you think?

Member Author

I have filed a JIRA for tracking this: https://issues.apache.org/jira/browse/KAFKA-13975. It seems to me that it would be better to do it separately.

Member

I don't follow. Is this true for all RPCs dealing with topic ids? The sender has IBP 2.8 but the receiver doesn't support IBP 2.8. I would think that, in general, the RPC receiver needs to allow RPCs from IBP versions greater than the local IBP.

Contributor

@hachikuji commented Jun 10, 2022

I recall there were some tricky cases when doing the upgrade to use TopicIds. It is possible for the controller to initialize on one of the nodes with the updated IBP and create TopicIds for all topics, but then change to a new node with a lower IBP. How does the controller handle the existence of TopicIds in the zk metadata if the IBP is below 2.8? At a quick glance, it looks like it would still parse it and load it into the ControllerContext. It seems ok in this scenario for the controller to accept AlterPartition with TopicIds even if the local IBP is lower. This is how we usually deal with IBP upgrades. Are there any other scenarios we need to worry about? Perhaps upgrade_test.py is sufficient to test this scenario?

Member Author

The other scenario that I was considering is the following:

  • all brokers run software 2.8 or above
  • controller upgrades to IBP 2.8 or above
  • controller fails over to a node still on an IBP < 2.8; topics with ids keep them, as you pointed out
  • new topics are created - those won't have topic ids

Is it safe? It seems to be OK because those new topics won't have a topic id so the AlterPartitionManager will downgrade to using version 1 in this case.

OK. I have convinced myself that we don't need this check after all. The AlterPartitionManager's logic to downgrade is sufficient to handle both cases. Thanks for the clarification.
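The downgrade logic referred to above can be sketched roughly as follows. The names are illustrative, not the real `AlterPartitionManager` API; the all-zero UUID standing in for a missing topic id mirrors Kafka's zero-UUID sentinel convention, but the helper itself is hypothetical.

```java
// Hedged sketch: when a topic has no topic id, the sender falls back to
// AlterPartition version 1, which carries topic names instead of topic ids.
import java.util.UUID;

class AlterPartitionVersionSketch {
    static final UUID ZERO_TOPIC_ID = new UUID(0L, 0L);

    // Pick the request version from the topic id and the broker's max version.
    static short requestVersion(UUID topicId, short brokerMaxVersion) {
        boolean hasTopicId = topicId != null && !topicId.equals(ZERO_TOPIC_ID);
        return (hasTopicId && brokerMaxVersion >= 2) ? brokerMaxVersion : (short) 1;
    }
}
```

Under this scheme, new topics created by a controller on an older IBP (which never receive a topic id) are always altered via version 1, so no extra check on the receiving side is needed.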

Contributor

@artemlivshits left a comment

I'm wondering if the topic id change could be done separately? It has a lot of mechanical changes.

Comment on lines +882 to +890
```scala
metadataCache match {
  // In KRaft mode, only replicas which are neither fenced nor in controlled
  // shutdown are allowed to join the ISR. This does not apply to ZK mode.
  case kRaftMetadataCache: KRaftMetadataCache =>
    !kRaftMetadataCache.isBrokerFenced(followerReplicaId) &&
      !kRaftMetadataCache.isBrokerShuttingDown(followerReplicaId)

  case _ => true
}
```
Contributor

Would it be better to encapsulate the KRaft / ZK behavior difference in the metadataCache? Then this function would just call the metadataCache without explicitly checking the kind of cache.
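For illustration, the suggested encapsulation might look roughly like this. The interface and class names are hypothetical, not Kafka's actual types.

```java
// Hedged sketch of the suggested alternative: hide the eligibility check
// behind a common interface so callers need not match on the cache type.
import java.util.Set;

interface IsrEligibilityCache {
    boolean isReplicaIsrEligible(int replicaId);
}

class KRaftLikeCache implements IsrEligibilityCache {
    private final Set<Integer> fenced;
    private final Set<Integer> shuttingDown;

    KRaftLikeCache(Set<Integer> fenced, Set<Integer> shuttingDown) {
        this.fenced = fenced;
        this.shuttingDown = shuttingDown;
    }

    @Override
    public boolean isReplicaIsrEligible(int replicaId) {
        return !fenced.contains(replicaId) && !shuttingDown.contains(replicaId);
    }
}

class ZkLikeCache implements IsrEligibilityCache {
    // ZK mode tracks no fencing or shutdown state, so every replica is eligible.
    @Override
    public boolean isReplicaIsrEligible(int replicaId) {
        return true;
    }
}
```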

Member Author

This is what I did originally, but we moved to this solution based on reviewers' feedback. The rationale for doing it here is that ZK does not have such information, so reviewers felt that having unimplemented methods in the ZK metadata cache could be misleading. Personally, I am fine either way.

Contributor

ok

```scala
// and 2) that the request was not applied. Even if the controller that sent the response
// is stale, we are guaranteed from the monotonicity of the controller epoch that the
// request could not have been applied by any past or future controller.
partitionState = proposedIsrState.lastCommittedState
```
Contributor

In KRaft mode, could the state be updated via metadata and applied concurrently such that processing this would override a concurrently updated last state?

Member Author

We roll back the previous partition state here only if the partition state still matches our proposed partition state. If it does not, it means that the partition was updated via the metadata log in the meantime. This check is in submitAlterPartition before calling handleAlterPartitionError.
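The guard described above can be sketched as follows. The types and method names are illustrative, not the actual `Partition` code.

```java
// Hedged sketch: revert to the last committed state only when the in-memory
// state still equals the proposed (in-flight) state; otherwise the state was
// already updated from the metadata log and must not be overwritten.
import java.util.Set;

class IsrRollbackSketch {
    record PartitionState(Set<Integer> isr, int partitionEpoch) {}

    static PartitionState handleRejectedAlterPartition(PartitionState current,
                                                       PartitionState proposed,
                                                       PartitionState lastCommitted) {
        // Roll back only if nothing changed the state while the request was in flight.
        return current.equals(proposed) ? lastCommitted : current;
    }
}
```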

Contributor

Sounds good. This invariant isn't immediately visible from this code, so maybe a comment and/or assert would make it clearer.

Contributor

@hachikuji left a comment

Thanks for the patch. Left some small comments, but LGTM overall. Will leave you to merge once addressed.

@dajac (Member Author) commented Jun 14, 2022

I'm wondering if the topic id change could be done separately? It has a lot of mechanical changes.

It is too late to do this now, but I agree that this PR contained a lot of changes. It is always tricky when multiple changes are tied to bumping the version of an API. We usually prefer to do them together in order to avoid having partial updates for an API in trunk.

@dajac (Member Author) commented Jun 14, 2022

Failed test is not related:

Build / JDK 11 and Scala 2.13 / testReplication() – org.apache.kafka.connect.mirror.integration.MirrorConnectorsIntegrationBaseTest

Merging to trunk.

@dajac dajac merged commit f83d95d into apache:trunk Jun 14, 2022
@dajac dajac deleted the KAFKA-13916 branch June 14, 2022 11:12
@dajac (Member Author) commented Jun 14, 2022

@artemlivshits I merged the PR. If you have any followups, I will address them separately.
