KAFKA-13315: log layer exception during shutdown that caused an unclean shutdown #11351

Merged
junrao merged 12 commits into apache:trunk from ccding:scheduler on Sep 23, 2021

Conversation

Contributor

@ccding ccding commented Sep 21, 2021

This also fixes KAFKA-13070.

We have seen a problem caused by shutting down the scheduler before shutting down LogManager.

When LogManager was closing partitions one by one, the scheduler called to delete old segments due to retention. However, the old segments could have been closed by the LogManager, which caused an exception and subsequently marked logdir as offline. As a result, the broker didn't flush the remaining partitions and didn't write the clean shutdown marker. Ultimately the broker took hours to recover the log during restart.

This PR essentially reverts #10538

I believe the exception #10538 saw is at

deleteSegmentFiles(
List(seg),
asyncDelete = true,
dir,
topicPartition,
config,
scheduler,
logDirFailureChannel,
logPrefix)
which called the scheduler and crashed the compaction thread. The effect of this exception has been mitigated by #10763

cc @rondagostino @ijuma @cmccabe @junrao @dhruvilshah3 as authors/reviewers of the PRs mentioned above to make sure this change looks okay.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

Contributor

@junrao junrao left a comment

@ccding : Thanks for the PR. A few comments below.

  if (logManager != null)
    CoreUtils.swallow(logManager.shutdown(), this)
- // be sure to shutdown scheduler after log manager
+ // be sure to shutdown scheduler before log manager
Contributor

(1) Could we add some comment to explain why we want to shut down the log manager later?
(2) If we are shutting down, it makes sense to stop the scheduler first since there is no guarantee any asynchronously scheduled task will complete. If this is the case, should we shut down the scheduler before any other component (e.g. ReplicaManger also uses scheduler)?
(3) Once the scheduler is shut down, scheduling new tasks causes IllegalStateException. That's probably the exception that #10538 tries to fix. To avoid polluting the log, perhaps we could change KafkaScheduler such that it avoids throwing an exception once it's shutdown.

Member

Are all tasks scheduled of the async variety? It would be nice if we would stop issuing new async tasks when shutdown starts, but let the existing ones complete.

Contributor

It seems that none of the scheduled calls depends on the completion of the returned future. When shutting down KafkaScheduler, we call ScheduledThreadPoolExecutor.shutdown(), which doesn't wait for all existing tasks to complete. This seems ok since for important tasks (e.g. flushing the log), we make explicit calls during shutdown.

Contributor Author

Thanks for the code review. I moved kafkaScheduler.shutdown upfront, changed the exception to an info-level log, and added comments.

For the exception-to-log change: I looked at the code that calls startup and shutdown. It appears we always call startup right after creating the scheduler object, and call shutdown when calling the parent's shutdown or close. I think the callers make sure they won't call scheduler.schedule on a scheduler that has been shut down, unless we are shutting down the broker. So it should be okay to change the IllegalStateException to an info-level log.

Member

@junrao shutdown:

"Initiates an orderly shutdown in which previously submitted tasks are executed, but no new tasks will be accepted. Invocation has no additional effect if already shut down."

Maybe you were thinking of shutdownNow?
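The shutdown() semantics being quoted here are easy to check with a plain ScheduledThreadPoolExecutor outside Kafka. A minimal standalone sketch (class and method names are illustrative, not Kafka code): a task submitted before shutdown() still runs, while a submission after shutdown() is rejected.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class ShutdownSemantics {
    // Demonstrates the javadoc quoted above: shutdown() lets previously
    // submitted tasks run to completion but rejects new submissions.
    static String demo() throws InterruptedException {
        ScheduledExecutorService pool = Executors.newScheduledThreadPool(1);
        AtomicBoolean existingTaskRan = new AtomicBoolean(false);
        pool.submit(() -> existingTaskRan.set(true)); // submitted before shutdown
        pool.shutdown();                              // orderly shutdown begins
        boolean newTaskRejected = false;
        try {
            pool.submit(() -> {});                    // submitted after shutdown
        } catch (RejectedExecutionException e) {
            newTaskRejected = true;                   // default AbortPolicy rejects it
        }
        pool.awaitTermination(5, TimeUnit.SECONDS);
        return existingTaskRan.get() + " " + newTaskRejected;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(demo()); // true true
    }
}
```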

Contributor Author

IIUC, the code (1) lets running tasks complete, (2) waits for tasks scheduled with executor.schedule(runnable, delay, unit) to finish after the delay, and (3) cancels future executions of tasks scheduled by executor.scheduleAtFixedRate(runnable, delay, period, unit)

if (period >= 0)
executor.scheduleAtFixedRate(runnable, delay, period, unit)
else
executor.schedule(runnable, delay, unit)
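This matches ScheduledThreadPoolExecutor's default shutdown policies: pending one-shot delayed tasks still execute after shutdown(), while periodic tasks have their future executions cancelled. A standalone Java sketch of that behavior (illustrative only, not Kafka code):

```java
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ShutdownPolicies {
    // By default, ExecuteExistingDelayedTasksAfterShutdownPolicy is true and
    // ContinueExistingPeriodicTasksAfterShutdownPolicy is false, so shutdown()
    // keeps pending one-shot tasks but cancels periodic ones.
    static String demo() throws InterruptedException {
        ScheduledThreadPoolExecutor pool = new ScheduledThreadPoolExecutor(1);
        AtomicInteger delayedRuns = new AtomicInteger();
        AtomicInteger periodicRuns = new AtomicInteger();
        pool.schedule((Runnable) delayedRuns::incrementAndGet, 100, TimeUnit.MILLISECONDS);
        pool.scheduleAtFixedRate(periodicRuns::incrementAndGet, 100, 50, TimeUnit.MILLISECONDS);
        pool.shutdown();                            // cancels the periodic task...
        pool.awaitTermination(5, TimeUnit.SECONDS); // ...but waits for the delayed one
        return delayedRuns.get() + " " + periodicRuns.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(demo()); // 1 0
    }
}
```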

Contributor

@ijuma and @ccding : The description of ScheduledThreadPoolExecutor.shutdown() says

This method does not wait for previously submitted tasks to complete execution. Use awaitTermination to do that.

So, we probably want to call awaitTermination() with a timeout like 10secs to make sure all existing tasks complete before shutting down other components.

We probably can't use shutdownNow() since it interrupts the task and could cause IOException when blocking operations (e.g., force) are applied on a file channel. This will then lead to unclean shutdown.
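The interruption risk described above can be seen with a sleeping task standing in for a blocking file operation. A minimal sketch (not Kafka code; the sleep simulates a blocking force()/fsync call):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class ShutdownNowInterrupt {
    // shutdownNow() interrupts running worker threads; a task blocked in I/O
    // (here simulated by sleep) observes the interruption mid-operation.
    static boolean demo() throws InterruptedException {
        ScheduledExecutorService pool = Executors.newScheduledThreadPool(1);
        CountDownLatch started = new CountDownLatch(1);
        AtomicBoolean wasInterrupted = new AtomicBoolean(false);
        pool.submit(() -> {
            started.countDown();
            try {
                Thread.sleep(10_000);        // stands in for a blocking force()
            } catch (InterruptedException e) {
                wasInterrupted.set(true);    // interrupted by shutdownNow()
            }
        });
        started.await();                     // make sure the task is running
        pool.shutdownNow();                  // interrupts the in-flight task
        pool.awaitTermination(5, TimeUnit.SECONDS);
        return wasInterrupted.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(demo()); // true
    }
}
```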

Contributor Author

Yeah, it does not wait for previously submitted tasks to complete execution, also does not kill the tasks. KafkaScheduler does call awaitTermination with a 1-day timeout. Do you think we want a shorter timeout?

Contributor

@ccding : Missed that part. awaitTermination with a 1-day timeout is fine.

Contributor Author

ccding commented Sep 22, 2021

FYI: there is a side effect of not throwing IllegalStateException. The deleteSegment function will not throw and we will be able to rename the .swap files to regular files during shutdown. This is a good thing.

deleteSegmentFiles(
List(seg),
asyncDelete = true,
dir,
topicPartition,
config,
scheduler,
logDirFailureChannel,
logPrefix)
if (newSegmentBaseOffsets.contains(seg.baseOffset)) Option.empty else Some(seg)
}.filter(item => item.isDefined).map(item => item.get)
// okay we are safe now, remove the swap suffix
sortedNewSegments.foreach(_.changeFileSuffixes(SwapFileSuffix, ""))
Utils.flushDir(dir.toPath)

  this synchronized {
-   ensureRunning()
+   if (!isStarted) {
+     info("Kafka scheduler is not running at the time '%s' is scheduled.".format(name))
Contributor

Perhaps we could add that we are ignoring the named task.

- ensureRunning()
+ if (!isStarted) {
+   info("Kafka scheduler is not running at the time '%s' is scheduled.".format(name))
+   return null
Contributor

@junrao junrao Sep 22, 2021

In UnifiedLog, we have code that uses the returned future.

val producerExpireCheck = scheduler.schedule(name = "PeriodicProducerExpirationCheck", fun = () => {

With this change, perhaps we could return an Option and let the caller deal with it accordingly?

* For example, if we didn't shutdown the scheduler first, when LogManager was closing
* partitions one by one, the scheduler might concurrently delete old segments due to
* retention. However, the old segments could have been closed by the LogManager, which would
* cause an exception and subsequently mark logdir as offline. As a result, the broker would
Contributor

exception => IOException ?

* cause an exception and subsequently mark logdir as offline. As a result, the broker would
* not flush the remaining partitions or write the clean shutdown marker. Ultimately, the
* broker would have to take hours to recover the log during restart and are subject to
* potential data loss.
Contributor

@junrao junrao Sep 22, 2021

I don't think unclean shutdown will cause data loss if acks = all is used. So, we can just remove that statement.


Contributor

@junrao junrao left a comment

@ccding : Thanks for the updated PR. A few more comments below.

Contributor

junrao commented Sep 22, 2021

@ccding : For the side effect that you mentioned, a segment may be renamed to .delete but won't be physically deleted. This seems fine since on broker restart, we have the logic to delete all .delete files.

Contributor Author

ccding commented Sep 22, 2021

@junrao for the side effect, I am saying

sortedNewSegments.foreach(_.changeFileSuffixes(SwapFileSuffix, "")) 

will execute. #10763 mainly targets the case where this line doesn't run before shutting down.

Contributor Author

ccding commented Sep 22, 2021

Addressed all comments above. PTAL

  * @return A Future object to manage the task scheduled.
  */
- def schedule(name: String, fun: ()=>Unit, delay: Long = 0, period: Long = -1, unit: TimeUnit = TimeUnit.MILLISECONDS) : ScheduledFuture[_]
+ def schedule(name: String, fun: ()=>Unit, delay: Long = 0, period: Long = -1, unit: TimeUnit = TimeUnit.MILLISECONDS) : Option[ScheduledFuture[_]]
Contributor

Could we change the comment on return value accordingly?

Member

Why do we need an Option here?

Contributor Author

Because we decided not to throw an exception, to avoid polluting the log during broker shutdown. If we don't throw, we must return something: Option looks better than null.

Member

We can return a Future that does nothing too, right?

Contributor Author

The question is, do we need to let the caller know that the schedule call has failed? I am not sure if we will have this use case in the future, though the current codebase doesn't have such a case.

Please let me know if you still think we should return a Future that does nothing. I can update the PR.

Member

Yeah, it seems we are not using this functionality at the moment, so I'm not sure the additional complexity helps. Also, we can't guarantee that a submission won't go through just before the scheduler is closed, so it gives a misleading impression that we can count on this behavior.

}, period = producerIdExpirationCheckIntervalMs, delay = producerIdExpirationCheckIntervalMs, unit = TimeUnit.MILLISECONDS)

if (producerExpireCheckOption.isEmpty)
throw new IllegalStateException("Failed to schedule PeriodicProducerExpirationCheck with KafkaScheduler.")
Contributor

If we get here, it's because we are shutting down the broker. So, it doesn't seem it's an illegal state. Perhaps we could just let it go?


Contributor

@junrao junrao left a comment

@ccding : Thanks for the updated PR. A couple more comments.

Contributor

@junrao junrao left a comment

@ccding : Thanks for the updated PR. LGTM. It seems that tests didn't run?

Contributor Author

ccding commented Sep 22, 2021

I pushed twice with the two commits. It appears Jenkins is currently running on the first commit, and the second one is pending. Unfortunately, I don't have permission to stop it from running the first one.

Contributor

@kowshik kowshik left a comment

@ccding Thanks for the PR! It seems this PR also fixes KAFKA-13070. Can we keep just one of the jiras and refer to it in the PR description (KAFKA-13070 was the pre-existing jira)?

- ensureRunning()
+ if (!isStarted) {
+   info("Kafka scheduler is not running at the time task '%s' is scheduled. The task is ignored.".format(name))
+   return None
Contributor

With a little refactoring, it looks possible to avoid the return statement here.

Contributor Author

Thanks. Fixed it. I somehow got the habit from other languages of returning first in error cases.

Contributor

@junrao junrao left a comment

@ccding : Thanks for the updated PR. Just one comment below.

override def get(): Unit = {}
override def get(timeout: Long, unit: TimeUnit): Unit = {}
override def getDelay(unit: TimeUnit): Long = 0
override def compareTo(o: Delayed): Int = 0
Contributor

Should Delayed be ScheduledFuture? Also, instead of always returning 0, it seems that it's better to return 0 if the other instance is NoOpScheduledFutureTask and return -1 or 1 otherwise?

Contributor Author

@ccding ccding Sep 23, 2021

NoOpScheduledFutureTask extends ScheduledFuture, ScheduledFuture extends Delayed, and Delayed extends Comparable<Delayed>; therefore we should use Delayed here. It doesn't compile if I change it to ScheduledFuture.

Fixed the return value.
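A Java rendering of the no-op future under discussion might look like this (a hypothetical sketch; the PR's actual NoOpScheduledFutureTask is Scala, and this version compares by delay, which makes two no-op tasks equal while still ordering against real tasks):

```java
import java.util.concurrent.Delayed;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

public class NoOpScheduledFutureTask implements ScheduledFuture<Void> {
    @Override public boolean cancel(boolean mayInterruptIfRunning) { return true; }
    @Override public boolean isCancelled() { return true; }
    @Override public boolean isDone() { return true; }
    @Override public Void get() { return null; }
    @Override public Void get(long timeout, TimeUnit unit) { return null; }
    @Override public long getDelay(TimeUnit unit) { return 0; }
    // The parameter must be Delayed, not ScheduledFuture: Comparable<Delayed> is
    // inherited from Delayed, so a narrower parameter type would not compile.
    @Override public int compareTo(Delayed other) {
        long diff = getDelay(TimeUnit.NANOSECONDS) - other.getDelay(TimeUnit.NANOSECONDS);
        return diff < 0 ? -1 : (diff > 0 ? 1 : 0);
    }

    public static void main(String[] args) {
        NoOpScheduledFutureTask task = new NoOpScheduledFutureTask();
        System.out.println(task.isDone() + " " + task.getDelay(TimeUnit.MILLISECONDS)
                + " " + task.compareTo(new NoOpScheduledFutureTask())); // true 0 0
    }
}
```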

Contributor

@junrao junrao left a comment

@ccding : Thanks for the updated PR. LGTM

@junrao junrao merged commit d08e3ad into apache:trunk Sep 23, 2021
junrao pushed a commit that referenced this pull request Sep 23, 2021
…an shutdown (#11351)

Reviewers: Ismael Juma <ismael@juma.me.uk>, Kowshik Prakasam <kprakasam@confluent.io>, Jun Rao <junrao@gmail.com>
Contributor

junrao commented Sep 23, 2021

cherry-picked to 3.0 branch too.

xdgrulez pushed a commit to xdgrulez/kafka that referenced this pull request Dec 22, 2021
…an shutdown (apache#11351)

lmr3796 pushed a commit to lmr3796/kafka that referenced this pull request Jun 2, 2022
…an shutdown (apache#11351)
