
KAFKA-13858; KRaft should not shut down metadata listener until controller shutdown is finished #12187

Merged
hachikuji merged 5 commits into apache:trunk from dajac:KAFKA-13858-2 on May 25, 2022

Conversation

dajac (Member) commented May 20, 2022

When the KRaft broker begins controlled shutdown, it immediately disables the metadata listener. This means that metadata changes made as part of the controlled shutdown do not get propagated to the respective components. For partitions that the broker follows, that is what we want: it prevents the follower from rejoining the ISR while it is still shutting down. But for partitions that the broker leads, it means the leader remains active until controlled shutdown finishes and the socket server is stopped. That delay can be as long as 5 seconds, and possibly worse.

This PR revises the controlled shutdown procedure as follows:

  • The broker signals to the replica manager that it is about to start the controlled shutdown.
  • The broker requests a controlled shutdown from the controller.
  • The controller moves leaders off the broker, removes the broker from any ISR that it is a member of, and writes those changes to the metadata log.
  • When the broker receives a partition metadata change, it checks whether it is in the ISR. If it is, it updates the partition as usual. If it is not, it stops the fetcher/replica. This effectively stops all partitions whose ISR included the broker.

When the broker is a replica of a partition but is not in the ISR, the controller does not do anything: the leader epoch is not bumped. In this particular case, the follower will continue to run until the replica manager shuts down. During this time, the replica could become in-sync and the leader could try to bring it back into the ISR. We rely on #12181 to ensure that does not happen.
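For illustration, here is a minimal self-contained sketch of the broker-side rule in the last step above; `PartitionState` and `onPartitionMetadataChange` are hypothetical stand-ins, not the actual `ReplicaManager` API:

```scala
object ControlledShutdownSketch {
  // Hypothetical stand-in for the partition state carried by a metadata delta.
  case class PartitionState(leader: Int, isr: Set[Int])

  def onPartitionMetadataChange(brokerId: Int,
                                isInControlledShutdown: Boolean,
                                state: PartitionState): String = {
    if (isInControlledShutdown && !state.isr.contains(brokerId)) {
      // The controller removed this broker from the ISR as part of its
      // controlled shutdown, so the broker stops the fetcher/replica.
      "stop-replica"
    } else {
      // Otherwise the broker applies the leader/ISR change as usual.
      "update-partition"
    }
  }

  def main(args: Array[String]): Unit = {
    val state = PartitionState(leader = 2, isr = Set(1, 2))
    // Broker 0 was dropped from the ISR while shutting down: stop the replica.
    println(onPartitionMetadataChange(0, isInControlledShutdown = true, state))
    // Broker 1 is still in the ISR: update the partition as usual.
    println(onPartitionMetadataChange(1, isInControlledShutdown = true, state))
  }
}
```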

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

} else {
val leader = info.partition.leader
if (newImage.cluster.broker(leader) == null) {
if (isInControlledShutdown && !info.partition.isr.contains(config.brokerId)) {
Contributor:

I wonder if we should also count the case when the replica is in the ISR, but no leader is defined. I think this is what we would see if the shutting down replica is the last member of the ISR.

Member Author:

That makes sense. I forgot about this case.
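A short sketch of the extended check being discussed, assuming a `NO_LEADER` sentinel of -1 (the real constant and field names may differ):

```scala
object StopReplicaCheckSketch {
  val NoLeader: Int = -1 // assumed sentinel for "no leader defined"

  // Stop the replica when the broker is in controlled shutdown and either it
  // has been removed from the ISR, or no leader is defined because the
  // shutting-down replica was the last ISR member.
  def shouldStopReplica(isInControlledShutdown: Boolean,
                        leader: Int,
                        isr: Set[Int],
                        brokerId: Int): Boolean =
    isInControlledShutdown && (leader == NoLeader || !isr.contains(brokerId))
}
```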

dajac marked this pull request as ready for review May 23, 2022 12:14
if (isInControlledShutdown && !info.partition.isr.contains(config.brokerId)) {
// If we are in controlled shutdown and the replica is not in the ISR,
// we stop the replica.
// We always update the follower state.
Contributor:

Good change here. I do like keeping the state up-to-date even if we cannot start fetching. Hard to know when a dependence on that state will emerge.

Member Author:

Yeah. That seems to be the right thing to do. I have looked at the dependencies on that state and I haven't identified any issues so far.
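A rough sketch of the pattern agreed on here, using stand-in types rather than Kafka's `Partition` and fetcher classes: follower state is always recorded, while fetchers only start for replicas that should keep fetching.

```scala
import scala.collection.mutable

object FollowerStateSketch {
  case class TopicPartition(topic: String, partition: Int)
  case class FollowerInfo(tp: TopicPartition, shouldFetch: Boolean)

  def applyFollowerDeltas(deltas: Seq[FollowerInfo]): (Set[TopicPartition], Set[TopicPartition]) = {
    val followerState = mutable.Set.empty[TopicPartition]
    val toStartFetching = mutable.Set.empty[TopicPartition]
    deltas.foreach { info =>
      // Always record the follower state, even for replicas about to be
      // stopped, so later consumers of that state see the truth.
      followerState += info.tp
      // Only start a fetcher for replicas that should keep fetching.
      if (info.shouldFetch) toStartFetching += info.tp
    }
    (followerState.toSet, toStartFetching.toSet)
  }
}
```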

val partitionsToMakeFollower = new mutable.HashMap[TopicPartition, Partition]
val partitionsToStart = new mutable.HashMap[TopicPartition, Partition]
val partitionsToStop = new mutable.HashMap[TopicPartition, Boolean]
val newFollowerTopicSet = new mutable.HashSet[String]
Contributor:

Not from this PR, but I find the naming of newLocalFollowers in the parameter list confusing given the presence of the isNew flag. I guess it's really representing new or updated followers.

Member Author:

That's right. How about localFollowers? newOrUpdatedLocalFollowers seems a bit long to me.

partitionsToStart.put(tp, partition)
}
}
changedPartitions.add(partition)
Contributor:

We're maintaining this collection for the start of the log dir fetchers. Similar to the replica fetchers, I guess we only want to update log dir fetchers when there is an epoch bump (leader or follower). Since we're not yet supporting JBOD anyway in KRaft, it seems tempting to get rid of this logic.

Member Author:

I don't have a strong opinion on this but removing unused logic seems appropriate. That will force us to think it through when we implement it. Do you mind if we do this separately? I would like to keep this PR focused on its primary goal.

Contributor:

Yeah, we can do it separately.

stateChangeLogger.info(s"Started fetchers as part of become-follower for ${partitionsToStart.size} partitions")

partitionsToMakeFollower.keySet.foreach(completeDelayedFetchOrProduceRequests)
partitionsToStart.keySet.foreach(completeDelayedFetchOrProduceRequests)
Contributor:

Borderline too verbose perhaps, but how about partitionsToStartFetching?

Member Author:

That seems like a reasonable ask.

dajac requested a review from hachikuji May 24, 2022 16:10
dajac (Member Author) commented May 24, 2022

@hachikuji Thanks for your review. I have updated the PR to address your comments.

hachikuji (Contributor) left a comment:

Thanks, LGTM.

hachikuji merged commit 76477ff into apache:trunk May 25, 2022
dajac deleted the KAFKA-13858-2 branch May 30, 2022 08:13
hachikuji pushed a commit that referenced this pull request Oct 13, 2022
…d shutdown (#12741)

When the `BrokerServer` begins its shutdown process, it transitions to `SHUTTING_DOWN` and sets `isShuttingDown` to `true`. With this state change, follower state changes are short-circuited. This means that a broker which was serving as leader would keep acting as leader until controlled shutdown completes. Instead, we want the leader and ISR state to be updated so that requests return NOT_LEADER and the client can find the new leader.

We missed this case while implementing #12187.

This patch fixes the issue and updates an existing test to ensure that `isShuttingDown` has no effect. We should consider adding integration tests for this as well; we can do that separately.

Reviewers: Ismael Juma <ismael@juma.me.uk>, José Armando García Sancio <jsancio@users.noreply.github.com>, Jason Gustafson <jason@confluent.io>
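A minimal sketch of the bug shape described in this commit message; the real logic lives in `ReplicaManager` and differs in detail, and `applyChange` here is a placeholder:

```scala
object ShutdownGuardSketch {
  // Buggy shape: short-circuiting follower transitions while shutting down
  // left the broker acting as leader until controlled shutdown completed.
  def applyDeltaBuggy(isShuttingDown: Boolean, applyChange: () => Unit): Unit =
    if (!isShuttingDown) applyChange()

  // Fixed shape: apply the leader and ISR change regardless of the shutdown
  // flag, so clients get NOT_LEADER and can discover the new leader promptly.
  def applyDeltaFixed(isShuttingDown: Boolean, applyChange: () => Unit): Unit =
    applyChange() // isShuttingDown intentionally ignored
}
```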
hachikuji pushed a commit that referenced this pull request Oct 14, 2022
…d shutdown (#12741)

guozhangwang pushed a commit to guozhangwang/kafka that referenced this pull request Jan 25, 2023
…d shutdown (apache#12741)


rutvijmehta-harness pushed a commit to rutvijmehta-harness/kafka that referenced this pull request Feb 9, 2024
…d shutdown (apache#12741)

rutvijmehta-harness added a commit to rutvijmehta-harness/kafka that referenced this pull request Feb 9, 2024
…d shutdown (apache#12741) (#59)
