
KAFKA-13916; Fenced replicas should not be allowed to join the ISR in KRaft (KIP-841, Part 2)#12181

Merged
dajac merged 15 commits into apache:trunk from dajac:KAFKA-13916 on Jun 14, 2022

Conversation

@dajac (Member) commented May 19, 2022

This PR implements KIP-841. Specifically, it implements the following:

  • It introduces INELIGIBLE_REPLICA and NEW_LEADER_ELECTED error codes.
  • The KRaft controller validates the new ISR provided in the AlterPartition request and rejects the call if any replica in the new ISR is not eligible to join the ISR - e.g. when fenced or shutting down. The leader reverts to the last committed ISR when its request is rejected for this reason.
  • The partition leader also verifies that a replica is not fenced before trying to add it back to the ISR. If it is not eligible, the ISR expansion is not triggered at all.
  • Updates the AlterPartition API to use topic ids. Updates the AlterPartitionManager to handle topic names/ids. Updates the ZK controller and the KRaft controller to handle topic names/ids depending on the version of the request used.
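The controller-side validation described above can be sketched as follows. This is a minimal illustration, not Kafka's actual implementation; `BrokerState`, `validateProposedIsr`, and the result names are hypothetical stand-ins for the real controller types and the new `INELIGIBLE_REPLICA` error code.

```java
// Hedged sketch of the KIP-841 eligibility check: a replica that is
// unknown, fenced, or in controlled shutdown may not join the ISR.
import java.util.List;
import java.util.Map;

class IsrValidationSketch {
    record BrokerState(boolean fenced, boolean shuttingDown) {}

    enum Result { OK, INELIGIBLE_REPLICA }

    static Result validateProposedIsr(List<Integer> proposedIsr,
                                      Map<Integer, BrokerState> brokers) {
        for (int replicaId : proposedIsr) {
            BrokerState state = brokers.get(replicaId);
            // Unknown, fenced, or shutting-down replicas are ineligible.
            if (state == null || state.fenced() || state.shuttingDown()) {
                return Result.INELIGIBLE_REPLICA;
            }
        }
        return Result.OK;
    }
}
```

On rejection with this error, the leader would revert to its last committed ISR rather than retrying the same proposal.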

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@dajac dajac changed the title KAFKA-13916; Fenced replicas should not be allowed to join the ISR in KRaft (KIP-841) KAFKA-13916; Fenced replicas should not be allowed to join the ISR in KRaft (KIP-841, Part 2) Jun 2, 2022
@dajac dajac marked this pull request as ready for review June 3, 2022 14:19
```scala
eventManager.put(
  AlterPartitionReceived(alterPartitionRequest.brokerId, alterPartitionRequest.brokerEpoch, partitionsToAlter, responseCallback)
)

def alterPartitions(
```
Contributor

Just checking my understanding. It looks like we have not modified this logic to use INELIGIBLE_REPLICA. Is that right? Should we?

Member Author

That's right. Do you think we need it?

I suppose that we could have a similar race condition, especially if the shutting-down replica is not in the ISR at the time of shutting down. In this case, we don't bump the leader epoch, so the replica could make it back into the ISR before receiving the StopReplica request. We could prevent shutting-down replicas from joining the ISR. One issue is that the leaders would never learn about this state, so they have no way to prevent unnecessary retries. This is similar to the discussion we had for KRaft.

Given that we explicitly stop replicas, I tend to believe that this race condition is less likely in ZK mode. I wonder whether it is worth fixing. What do you think?

Contributor

This is somewhat related to this issue: #12271. I guess once we fix this, then relying on StopReplica and the leader epoch bump may be good enough.

Member

@jsancio left a comment

Thanks for the changes @dajac . Partial review. I need to look at this again tomorrow.

Comment on lines 91 to 102
```scala
// In ZK mode, if the deployed software of the controller uses version 2.8 or above
// but the IBP is below 2.8, the controller does not assign topic ids. In this case,
// it should not advertise AlterPartition API version 2 and above.
val alterPartitionApiVersion = response.apiVersion(ApiKeys.ALTER_PARTITION.id)
if (alterPartitionApiVersion != null) {
  alterPartitionApiVersion.setMaxVersion(
    if (metadataVersion.isTopicIdsSupported)
      alterPartitionApiVersion.maxVersion()
    else
      1.toShort
  )
}
```
Member Author

@hachikuji There is something that I missed in this PR. If the controller runs 2.8 software or above but does not use IBP 2.8 or above yet, topic ids are not assigned. In other words, it is not safe to use AlterPartition version 2 in this case even if the controller supports it. We don't really have a mechanism for this at the moment, so I have put the logic here. We should definitely think about a better approach. What do you think?

Member Author

I have filed a JIRA for tracking this: https://issues.apache.org/jira/browse/KAFKA-13975. It seems to me that it would be better to do it separately.

Member

I don't follow. Is this true for all RPCs dealing with topic ids? The sender has IBP 2.8 but the receiver doesn't support IBP 2.8. I would think that, in general, the RPC receiver needs to allow RPCs from IBP versions greater than the local IBP.

Contributor

@hachikuji commented Jun 10, 2022

I recall there were some tricky cases when doing the upgrade to use TopicIds. It is possible for the controller to initialize on one of the nodes with the updated IBP and create TopicIds for all topics, but then change to a new node with a lower IBP. How does the controller handle the existence of TopicIds in the zk metadata if the IBP is below 2.8? At a quick glance, it looks like it would still parse it and load it into the ControllerContext. It seems ok in this scenario for the controller to accept AlterPartition with TopicIds even if the local IBP is lower. This is how we usually deal with IBP upgrades. Are there any other scenarios we need to worry about? Perhaps upgrade_test.py is sufficient to test this scenario?

Member Author

The other scenario that I was considering is the following:

  • all brokers run software 2.8 or above
  • controller upgrades to IBP 2.8 or above
  • controller fails over to a node still on an IBP < 2.8; topics with ids keep them, as you pointed out
  • new topics are created - those won't have topic ids

Is it safe? It seems to be OK because those new topics won't have a topic id so the AlterPartitionManager will downgrade to using version 1 in this case.

OK. I have convinced myself that we don't need this check after all. The AlterPartitionManager's logic to downgrade is sufficient to handle both cases. Thanks for the clarification.
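The downgrade logic referred to above can be sketched roughly as follows. The names are illustrative, not the real `AlterPartitionManager` API; the all-zero UUID standing in for a missing topic id mirrors Kafka's zero-UUID sentinel convention, but the helper itself is hypothetical.

```java
// Hedged sketch: when a topic has no topic id, the sender falls back to
// AlterPartition version 1, which carries topic names instead of topic ids.
import java.util.UUID;

class AlterPartitionVersionSketch {
    static final UUID ZERO_TOPIC_ID = new UUID(0L, 0L);

    // Pick the request version from the topic id and the broker's max version.
    static short requestVersion(UUID topicId, short brokerMaxVersion) {
        boolean hasTopicId = topicId != null && !topicId.equals(ZERO_TOPIC_ID);
        return (hasTopicId && brokerMaxVersion >= 2) ? brokerMaxVersion : (short) 1;
    }
}
```

Under this scheme, new topics created by a controller on an older IBP (which never receive a topic id) are always altered via version 1, so no extra check on the receiving side is needed.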

Contributor

@artemlivshits left a comment

I'm wondering if the topic id change could be done separately? It has a lot of mechanical changes.

Comment on lines +882 to +890
```scala
metadataCache match {
  // In KRaft mode, only replicas which are neither fenced nor in controlled
  // shutdown are allowed to join the ISR. This does not apply to ZK mode.
  case kRaftMetadataCache: KRaftMetadataCache =>
    !kRaftMetadataCache.isBrokerFenced(followerReplicaId) &&
      !kRaftMetadataCache.isBrokerShuttingDown(followerReplicaId)

  case _ => true
}
```
Contributor

Would it be better to encapsulate the KRaft / ZK behavior difference in the metadataCache? Then this function would just call the metadataCache without explicitly checking the kind of cache.
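For illustration, the suggested encapsulation might look roughly like this. The interface and class names are hypothetical, not Kafka's actual types.

```java
// Hedged sketch of the suggested alternative: hide the eligibility check
// behind a common interface so callers need not match on the cache type.
import java.util.Set;

interface IsrEligibilityCache {
    boolean isReplicaIsrEligible(int replicaId);
}

class KRaftLikeCache implements IsrEligibilityCache {
    private final Set<Integer> fenced;
    private final Set<Integer> shuttingDown;

    KRaftLikeCache(Set<Integer> fenced, Set<Integer> shuttingDown) {
        this.fenced = fenced;
        this.shuttingDown = shuttingDown;
    }

    @Override
    public boolean isReplicaIsrEligible(int replicaId) {
        return !fenced.contains(replicaId) && !shuttingDown.contains(replicaId);
    }
}

class ZkLikeCache implements IsrEligibilityCache {
    // ZK mode tracks no fencing or shutdown state, so every replica is eligible.
    @Override
    public boolean isReplicaIsrEligible(int replicaId) {
        return true;
    }
}
```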

Member Author

This is what I did originally, but we moved to this solution based on reviewers' feedback. The rationale for doing it here is that ZK does not have such information, so reviewers felt that having unimplemented methods in the ZK metadata cache could be misleading. Personally, I am fine either way.

Contributor

ok

```scala
// and 2) that the request was not applied. Even if the controller that sent the response
// is stale, we are guaranteed from the monotonicity of the controller epoch that the
// request could not have been applied by any past or future controller.
partitionState = proposedIsrState.lastCommittedState
```
Contributor

In KRaft mode, could the state be updated via metadata and applied concurrently such that processing this would override a concurrently updated last state?

Member Author

We roll back the previous partition state here only if the partition state still matches our proposed partition state. If it does not, it means that the partition was updated via the metadata log in the meantime. This check is in submitAlterPartition before calling handleAlterPartitionError.
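The guard described above can be sketched as follows. The types and method names are illustrative, not the actual `Partition` code.

```java
// Hedged sketch: revert to the last committed state only when the in-memory
// state still equals the proposed (in-flight) state; otherwise the state was
// already updated from the metadata log and must not be overwritten.
import java.util.Set;

class IsrRollbackSketch {
    record PartitionState(Set<Integer> isr, int partitionEpoch) {}

    static PartitionState handleRejectedAlterPartition(PartitionState current,
                                                       PartitionState proposed,
                                                       PartitionState lastCommitted) {
        // Roll back only if nothing changed the state while the request was in flight.
        return current.equals(proposed) ? lastCommitted : current;
    }
}
```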

Contributor

Sounds good. This invariant isn't immediately visible from this code, so maybe a comment and/or assert would make it clearer.

Contributor

@hachikuji left a comment

Thanks for the patch. Left some small comments, but LGTM overall. Will leave you to merge once addressed.

@dajac (Member Author) commented Jun 14, 2022

I'm wondering if the topic id change could be done separately? It has a lot of mechanical changes.

It is too late to do this now, but I agree that this PR contained a lot of changes. It is always tricky when multiple changes are tied to bumping the version of an API. We usually prefer to do them together in order to avoid having partial updates for an API in trunk.

@dajac (Member Author) commented Jun 14, 2022

Failed test is not related:

Build / JDK 11 and Scala 2.13 / testReplication() – org.apache.kafka.connect.mirror.integration.MirrorConnectorsIntegrationBaseTest

Merging to trunk.

@dajac dajac merged commit f83d95d into apache:trunk Jun 14, 2022
@dajac dajac deleted the KAFKA-13916 branch June 14, 2022 11:12
@dajac (Member Author) commented Jun 14, 2022

@artemlivshits I merged the PR. If you have any followups, I will address them separately.
