Skip to content

Comments

MINOR: shutdown KafkaScheduler at appropriate time#10538

Merged
cmccabe merged 2 commits intoapache:trunkfrom
rondagostino:minor_scheduler_shutdown
Apr 15, 2021
Merged

MINOR: shutdown KafkaScheduler at appropriate time#10538
cmccabe merged 2 commits intoapache:trunkfrom
rondagostino:minor_scheduler_shutdown

Conversation

@rondagostino
Copy link
Contributor

Both the ZooKeeper-based and KRaft brokers invoke KafkaScheduler.shutdown() too early -- before LogManager.shutdown() is invoked. So it is possible for LogManager to try to use the scheduler after the scheduler has been shutdown, which results in an exception. This patch moves the shutdown of the scheduler to a point after the shutdown of LogManager.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

Copy link
Contributor

@cmccabe cmccabe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@cmccabe cmccabe added the kraft label Apr 14, 2021
@cmccabe cmccabe merged commit 2e25a1d into apache:trunk Apr 15, 2021
junrao pushed a commit that referenced this pull request Sep 23, 2021
…an shutdown (#11351)

This also fixes KAFKA-13070.

We have seen a problem caused by shutting down the scheduler before shutting down LogManager.

When LogManager was closing partitions one by one, the scheduler called to delete old segments due to retention. However, the old segments could have been closed by the LogManager, which caused an exception and subsequently marked logdir as offline. As a result, the broker didn't flush the remaining partitions and didn't write the clean shutdown marker. Ultimately the broker took hours to recover the log during restart.

This PR essentially reverts #10538

Reviewers: Ismael Juma <ismael@juma.me.uk>, Kowshik Prakasam <kprakasam@confluent.io>, Jun Rao <junrao@gmail.com>
junrao pushed a commit that referenced this pull request Sep 23, 2021
…an shutdown (#11351)

This also fixes KAFKA-13070.

We have seen a problem caused by shutting down the scheduler before shutting down LogManager.

When LogManager was closing partitions one by one, the scheduler called to delete old segments due to retention. However, the old segments could have been closed by the LogManager, which caused an exception and subsequently marked logdir as offline. As a result, the broker didn't flush the remaining partitions and didn't write the clean shutdown marker. Ultimately the broker took hours to recover the log during restart.

This PR essentially reverts #10538

Reviewers: Ismael Juma <ismael@juma.me.uk>, Kowshik Prakasam <kprakasam@confluent.io>, Jun Rao <junrao@gmail.com>
xdgrulez pushed a commit to xdgrulez/kafka that referenced this pull request Dec 22, 2021
…an shutdown (apache#11351)

This also fixes KAFKA-13070.

We have seen a problem caused by shutting down the scheduler before shutting down LogManager.

When LogManager was closing partitions one by one, the scheduler called to delete old segments due to retention. However, the old segments could have been closed by the LogManager, which caused an exception and subsequently marked logdir as offline. As a result, the broker didn't flush the remaining partitions and didn't write the clean shutdown marker. Ultimately the broker took hours to recover the log during restart.

This PR essentially reverts apache#10538

Reviewers: Ismael Juma <ismael@juma.me.uk>, Kowshik Prakasam <kprakasam@confluent.io>, Jun Rao <junrao@gmail.com>
lmr3796 pushed a commit to lmr3796/kafka that referenced this pull request Jun 2, 2022
…an shutdown (apache#11351)

This also fixes KAFKA-13070.

We have seen a problem caused by shutting down the scheduler before shutting down LogManager.

When LogManager was closing partitions one by one, the scheduler called to delete old segments due to retention. However, the old segments could have been closed by the LogManager, which caused an exception and subsequently marked logdir as offline. As a result, the broker didn't flush the remaining partitions and didn't write the clean shutdown marker. Ultimately the broker took hours to recover the log during restart.

This PR essentially reverts apache#10538

Reviewers: Ismael Juma <ismael@juma.me.uk>, Kowshik Prakasam <kprakasam@confluent.io>, Jun Rao <junrao@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants