Skip to content

KAFKA-14296; Partition leaders are not demoted during kraft controlled shutdown#12741

Merged
hachikuji merged 1 commit intoapache:trunkfrom
dajac:KAFKA-14296
Oct 13, 2022
Merged

KAFKA-14296; Partition leaders are not demoted during kraft controlled shutdown#12741
hachikuji merged 1 commit intoapache:trunkfrom
dajac:KAFKA-14296

Conversation

@dajac
Copy link
Member

@dajac dajac commented Oct 13, 2022

When the BrokerServer starts its shutting down process, it transitions to SHUTTING_DOWN and sets isShuttingDown to true. With this state change, the follower state changes are short-cutted. This means that a broker which was serving as leader would remain acting as a leader until controlled shutdown completes. Instead, we want the leader and ISR state to be updated so that requests will return NOT_LEADER and the client can find the new leader.

We missed this case while implementing #12187.

This patch fixes the issue and updates an existing test to ensure that isShuttingDown has not effect. We should consider adding integration tests for this as well. We can do this separately.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

): Unit = {
stateChangeLogger.info(s"Transitioning ${localFollowers.size} partition(s) to " +
"local followers.")
val shuttingDown = isShuttingDown.get()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is a bit brittle, can we have something like the following as the field instead?

private Supplier brokerState = null;

Then we don't have to make sure to update isShuttingDown when the logic changes in KafkaServer.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose that you are saying that relying on isShuttingDown is brittle, right? If that is the case, I do agree that relying on the broker state is better. I think that we can refactor the remaining usages as a follow-up.

val state = info.partition.toLeaderAndIsrPartitionState(tp, isNew)
val isNewLeaderEpoch = partition.makeFollower(state, offsetCheckpoints, Some(info.topicId))

if (isInControlledShutdown && (info.partition.leader == NO_LEADER ||
Copy link
Member

@ijuma ijuma Oct 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, why do we need this special isInControlledShutdown field versus relying on the broker state as I mentioned above? This kind of approach is extremely brittle in my opinion.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want to do this only when we are in controlled shutdown and it could be disabled. If you look at the caller side, we aligned this on how we do it for the lifecycleManager as both go together. We could revise this though.

mockReplicaFetcherManager: Option[ReplicaFetcherManager] = None,
mockReplicaAlterLogDirsManager: Option[ReplicaAlterLogDirsManager] = None
mockReplicaAlterLogDirsManager: Option[ReplicaAlterLogDirsManager] = None,
isShuttingDown: AtomicBoolean = new AtomicBoolean(false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found the use of the field a bit confusing. We don't necessarily have to fix it here, but what do you think about letting ReplicaManager own its own shutdown? Basically create a private field which serves the same purpose and then expose a method to explicitly toggle it. It is not obvious when looking at BrokerServer that altering the local field has this remote effect.

Copy link
Contributor

@hachikuji hachikuji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, LGTM

Copy link
Member

@jsancio jsancio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for the changes.

@dajac
Copy link
Member Author

dajac commented Oct 13, 2022

Will merge it tomorrow to trunk and 3.3. I am waiting on getting a clean build.

@hachikuji hachikuji merged commit 5cff8f6 into apache:trunk Oct 13, 2022
hachikuji pushed a commit that referenced this pull request Oct 14, 2022
…d shutdown (#12741)

When the `BrokerServer` starts its shutting down process, it transitions to `SHUTTING_DOWN` and sets `isShuttingDown` to `true`. With this state change, the follower state changes are short-cutted. This means that a broker which was serving as leader would remain acting as a leader until controlled shutdown completes. Instead, we want the leader and ISR state to be updated so that requests will return NOT_LEADER and the client can find the new leader.

We missed this case while implementing #12187.

This patch fixes the issue and updates an existing test to ensure that `isShuttingDown` has not effect. We should consider adding integration tests for this as well. We can do this separately.

Reviewers: Ismael Juma <ismael@juma.me.uk>, José Armando García Sancio <jsancio@users.noreply.github.com>, Jason Gustafson <jason@confluent.io>
@dajac dajac deleted the KAFKA-14296 branch October 14, 2022 07:15
guozhangwang pushed a commit to guozhangwang/kafka that referenced this pull request Jan 25, 2023
…d shutdown (apache#12741)

When the `BrokerServer` starts its shutting down process, it transitions to `SHUTTING_DOWN` and sets `isShuttingDown` to `true`. With this state change, the follower state changes are short-cutted. This means that a broker which was serving as leader would remain acting as a leader until controlled shutdown completes. Instead, we want the leader and ISR state to be updated so that requests will return NOT_LEADER and the client can find the new leader.

We missed this case while implementing apache#12187.

This patch fixes the issue and updates an existing test to ensure that `isShuttingDown` has not effect. We should consider adding integration tests for this as well. We can do this separately.

Reviewers: Ismael Juma <ismael@juma.me.uk>, José Armando García Sancio <jsancio@users.noreply.github.com>, Jason Gustafson <jason@confluent.io>
rutvijmehta-harness pushed a commit to rutvijmehta-harness/kafka that referenced this pull request Feb 9, 2024
…d shutdown (apache#12741)

When the `BrokerServer` starts its shutting down process, it transitions to `SHUTTING_DOWN` and sets `isShuttingDown` to `true`. With this state change, the follower state changes are short-cutted. This means that a broker which was serving as leader would remain acting as a leader until controlled shutdown completes. Instead, we want the leader and ISR state to be updated so that requests will return NOT_LEADER and the client can find the new leader.

We missed this case while implementing apache#12187.

This patch fixes the issue and updates an existing test to ensure that `isShuttingDown` has not effect. We should consider adding integration tests for this as well. We can do this separately.

Reviewers: Ismael Juma <ismael@juma.me.uk>, José Armando García Sancio <jsancio@users.noreply.github.com>, Jason Gustafson <jason@confluent.io>
rutvijmehta-harness added a commit to rutvijmehta-harness/kafka that referenced this pull request Feb 9, 2024
…d shutdown (apache#12741) (#59)

When the `BrokerServer` starts its shutting down process, it transitions to `SHUTTING_DOWN` and sets `isShuttingDown` to `true`. With this state change, the follower state changes are short-cutted. This means that a broker which was serving as leader would remain acting as a leader until controlled shutdown completes. Instead, we want the leader and ISR state to be updated so that requests will return NOT_LEADER and the client can find the new leader.

We missed this case while implementing apache#12187.

This patch fixes the issue and updates an existing test to ensure that `isShuttingDown` has not effect. We should consider adding integration tests for this as well. We can do this separately.

Reviewers: Ismael Juma <ismael@juma.me.uk>, José Armando García Sancio <jsancio@users.noreply.github.com>, Jason Gustafson <jason@confluent.io>

Co-authored-by: David Jacot <djacot@confluent.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants

Comments