KAFKA-12520: Ensure log loading does not truncate producer state unless required #10763
junrao merged 26 commits into apache:trunk
Conversation
    // Check whether swap index files exist: if not, the cleaned files must exist due to the
    // existence of swap log file. Therefore, we rename the cleaned files to swap files and continue.
    var recoverable = true
    val swapOffsetIndexFile = Log.offsetIndexFile(swapFile.getParentFile, baseOffset, Log.SwapFileSuffix)
Is it possible to write something like the following in Scala?
    Vector(Log.offsetIndexFile, Log.timeIndexFile, Log.transactionIndexFile).foreach { fn =>
      val swapIndexFile = fn(swapFile.getParentFile, baseOffset, Log.SwapFileSuffix)
      if (!swapIndexFile.exists()) {
        // ...
      }
      // other things
    }
You could perhaps define a method like this:
    def maybeCompleteInterruptedSwap(fn: (File, Long, String) => File): Boolean = {
      val swapIndexFile = fn(swapFile.getParentFile, baseOffset, Log.SwapFileSuffix)
      if (!swapIndexFile.exists()) {
        val cleanedIndexFile = fn(swapFile.getParentFile, baseOffset, Log.CleanedFileSuffix)
        if (cleanedIndexFile.exists()) {
          cleanedIndexFile.renameTo(swapIndexFile)
          true
        } else {
          false
        }
      } else {
        true
      }
    }
and then invoke it as
    var recoverable = maybeCompleteInterruptedSwap(Log.offsetIndexFile)
    if (recoverable)
      recoverable = maybeCompleteInterruptedSwap(Log.timeIndexFile)
    if (recoverable)
      recoverable = maybeCompleteInterruptedSwap(Log.transactionIndexFile)
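To make the suggested helper concrete, here is a rough, self-contained sketch of the same idea. The `indexFile` helper, the suffix constants, and the index-type names are simplified stand-ins for Kafka's `Log.offsetIndexFile`/`timeIndexFile`/`transactionIndexFile`, not the actual API:

```scala
import java.io.File

object SwapRenameSketch {
  // Hypothetical suffix constants mirroring Log.SwapFileSuffix / Log.CleanedFileSuffix.
  val SwapSuffix = ".swap"
  val CleanedSuffix = ".cleaned"

  // Builds an index file name like "00000000000000000000.index.swap".
  def indexFile(dir: File, baseOffset: Long, indexType: String, suffix: String): File =
    new File(dir, f"$baseOffset%020d.$indexType$suffix")

  // If the .swap index is missing, try to complete the interrupted swap by
  // renaming the corresponding .cleaned index; false means this index cannot
  // be completed by renaming and the segment needs full recovery.
  def maybeCompleteInterruptedSwap(dir: File, baseOffset: Long, indexType: String): Boolean = {
    val swapIndex = indexFile(dir, baseOffset, indexType, SwapSuffix)
    if (swapIndex.exists()) true
    else {
      val cleanedIndex = indexFile(dir, baseOffset, indexType, CleanedSuffix)
      cleanedIndex.exists() && cleanedIndex.renameTo(swapIndex)
    }
  }

  // A segment is recoverable by renaming only if every index can be completed.
  def recoverable(dir: File, baseOffset: Long): Boolean =
    Seq("index", "timeindex", "txnindex").forall(maybeCompleteInterruptedSwap(dir, baseOffset, _))
}
```

Folding the per-index check with `forall` also short-circuits on the first failure, which matches the chained `if (recoverable)` invocation above.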
      }
    }
    info(s"${params.logIdentifier}Found log file ${swapFile.getPath} from interrupted swap operation, which is not recoverable from ${Log.CleanedFileSuffix} files, repairing.")
    recoverSegment(swapSegment, params)
The main thing we want to avoid is running this recovery logic for scenarios where the rename operation was interrupted, as it rebuilds the producer state from scratch. Could we make this recovery conditional on whether we have all the relevant log files and indices?
    time = params.time,
    fileSuffix = Log.SwapFileSuffix)
    info(s"${params.logIdentifier}Found log file ${swapFile.getPath} from interrupted swap operation, repairing.")
    if (recoverable) {
Could you elaborate a bit on what this block of code is doing?
The whole logic is that, if the segment.swap file exists, then all index files should exist as .cleaned or .swap files. We find them and rename them to .swap [before this block of code], then do a sanity check and rename all the .swap files to non-suffixed log files [within this block of code].
This could fix the issue caused by the compaction as we discussed before.
For all other cases, I think it is in an inconsistent state and we will have to do the original recovery.
Does this make sense to you?
    swapSegment.sanityCheck(true)
    info(s"Found log file ${swapFile.getPath} from interrupted swap operation, which is recoverable from ${Log.CleanedFileSuffix} files.")
    swapSegment.changeFileSuffixes(Log.SwapFileSuffix, "")
    return
return might be wrong. Didn't realize it was within a for loop. Maybe continue instead.
Would be good to avoid return or continue and instead make the call to recoverSegment conditional so that the code is easier to read.
    // Check whether swap index files exist: if not, the cleaned files must exist due to the
    // existence of swap log file. Therefore, we rename the cleaned files to swap files and continue.
    val recoverable = maybeCompleteInterruptedSwap(Log.offsetIndexFile) &&
recoverable sounds a bit incorrect in this context, given that we actually end up calling recoverSegment when recoverable == false.
Perhaps we could call this something like needsRecovery which is set to true if we do not find the relevant index files or when the sanity check fails.
    @Test
    def testRecoveryRebuildsIndices(): Unit = {
Would be good to enumerate different cases that we could run into during recovery and ensure we have coverage for them, e.g.:
- All .swap files are present. We should validate that producer state is not rebuilt in this case.
- Some .swap and some .clean files are present.
- All .clean files are present.
- One of the index files was not found when completing the swap operation, which triggers a full recovery and rebuild of producer state.
@dhruvilshah3 addressed your comments. Will work on the tests later today. Please take a look.
@dhruvilshah3 @junrao This PR is ready for review. Please take a look. The above function tests all the possible cases, thus we don't need additional tests. One thing I am not sure about is how to test whether a certain recovery goes down the renaming path or the recovery path. The current test cases only validate that the results are correct. If you have any ideas, please let me know.
Added two tests: they should cover all the cases around file renaming during compaction.
    })

    completeSwapOperations(swapFiles, params)
    // Do the actual recovery for toRecoverSwapFiles, as discussed above.
Hmm, I am not sure why we need this step. We have processed all .swap files before and no new .swap files should be introduced if we get to here.
You are right, given that we don't need to do sanity checks here. Removed this step.
     *
     * @param params The parameters for the log being loaded from disk
     * @return Set of .swap files that are valid to be swapped in as segment files
     * @return Set of .swap files that are valid to be swapped in as segment files and index files
The PR description says "as a result, if at least one .swap file exists for a segment, all other files for the segment must exist as .cleaned files or .swap files. Therefore, we rename the .cleaned files to .swap files, then make them normal segment files.". Are we implementing the renaming of .clean files to .swap files?
No, we are not renaming .cleaned files to .swap files due to KAFKA-6264. I forgot to update the description of the PR. Just updated it: please see the updated one.
    val swapFiles = removeTempFilesAndCollectSwapFiles(params)

    // Now do a second pass and load all the log and index files.
    // The remaining valid swap files must come from compaction operation. We can simply rename them
It seems that those swap files could be the result of segment split too?
Are you concerned about the logic or the comment? If the comment only, I fixed it.
    time = params.time,
    fileSuffix = Log.SwapFileSuffix)
    try {
      segment.sanityCheck(false)
It doesn't seem we need this since we call segment.sanityCheck() on all segments later in loadSegmentFiles().
@junrao Thanks for the review. Addressed your comments.
    toRenameSwapFiles += f
    info(s"${params.logIdentifier}Found log file ${f.getPath} from interrupted swap operation, which is recoverable from ${Log.SwapFileSuffix} files by renaming.")
    minSwapFileOffset = Math.min(segment.baseOffset, minSwapFileOffset)
    maxSwapFileOffset = Math.max(segment.offsetIndex.lastOffset, maxSwapFileOffset)
This is an existing problem. Calculating the end offset that a segment covers can be tricky. The problem is that in compaction, we remove records in the .clean and .swap files. So, the offset of the last record in a segment doesn't tell us the true end offset of the original segment.
One possibility is to use the base offset of the next segment if present.
How can we get the next segment before finishing the recovery process?
I could be wrong, but I think if it is compaction, the last record will never be removed. The reason is that compaction always removes earlier records of each key, and the last record will never be an earlier one.
Split should be similar.
> I could be wrong, but I think if it is compaction, the last record will never be removed. The reason is that compaction always removes earlier records of each key, and the last record will never be an earlier one. Split should be similar.

It's true that we generally don't remove the last record during compaction. However, during a round of cleaning, we clean segments in groups and each group generates a single .clean file. The group is formed to make sure that offsets stay within a 2 billion offset gap and the .clean file won't exceed 2GB in size. If multiple groups are formed, it's possible that a group that's not the last doesn't preserve the last record.

> How can we get the next segment before finishing the recovery process?

We could potentially scan all .log files and sort them in offset order.
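For illustration, scanning the .log files and sorting them by base offset to derive each segment's exclusive upper bound from the next segment's base offset could look roughly like this. The `offsetFromFile` parser here is a simplified stand-in, not Kafka's actual helper:

```scala
import java.io.File

object SegmentOrderSketch {
  // Simplified stand-in: parse the base offset from a segment file name
  // like "00000000000000000042.log".
  def offsetFromFile(file: File): Long =
    file.getName.takeWhile(_ != '.').toLong

  // Sort the base offsets of all .log files; each segment's true upper bound
  // is the next segment's base offset (exclusive), or None for the last one.
  def segmentBounds(logFiles: Seq[File]): Seq[(Long, Option[Long])] = {
    val offsets = logFiles.map(offsetFromFile).sorted
    offsets.zipWithIndex.map { case (base, i) => (base, offsets.lift(i + 1)) }
  }
}
```

This avoids relying on the offset of the last surviving record in a compacted segment, which cleaning may have removed.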
segment.offsetIndex.lastOffset doesn't give the exact last offset in a segment since the index is sparse. We need to use segment.nextOffset().
    }

    // Second pass: delete segments that are between minSwapFileOffset and maxSwapFileOffset. As
    // discussed above, these segments were compacted but haven't been renamed to .delete before
The swap files can also be created during splitting.
    })

    completeSwapOperations(swapFiles, params)
    // Fourth pass: rename remaining index swap files. They must be left due to a broker crash when
Hmm, not sure why we still have swap files at this point. We have renamed all existing swap files and no new swap files are created.
We have renamed all .log.swap files and their corresponding index swap files. If there is a single .index.swap file, it is not renamed previously in the recovery process. A single .index.swap file could happen if it crashed in the middle of this line: kafka/core/src/main/scala/kafka/log/Log.scala, line 2381 (commit bd668e9).
Should we rename all .swap files in https://github.com/apache/kafka/pull/10763/files#diff-54b3df71b1e0697a211d23a9018a91aef773fca0b9cbd1abafbdca6c79664930R138 no matter whether the file is in toRenameSwapFiles or not? I am not sure if we have to put the retryOnOffsetOverflow call between renaming files in toRenameSwapFiles and renaming the remaining .swap files.
I guess we don't need to put the retryOnOffsetOverflow call in between. Removed the toRenameSwapFiles variable and combined the renaming in 76c197c.
    try {
      if (!file.getName.endsWith(SwapFileSuffix)) {
        val offset = offsetFromFile(file)
        if (offset >= minSwapFileOffset && offset <= maxSwapFileOffset) {
If we use segment.nextOffset() to calculate maxSwapFileOffset, it's exclusive.
    // Now do a second pass and load all the log and index files.
    // The remaining valid swap files must come from compaction or segment split operation. We can
    // simply rename them to regular segment files. But, before renaming, we should figure out which
    // segments are compacted and delete these segment files: this is done by calculating min/maxSwapFileOffset.
"which segments are compacted": .swap files are also generated from splitting.
        deleteIndicesIfExist(baseFile)
        swapFiles += file
      }
      swapFiles += file
It's possible that during renaming, we have only renamed the .log file to .swap, but not the corresponding index files. Should we find those .clean files with the same offset and rename them to .swap?
Due to KAFKA-6264, if there are any .cleaned files (no matter whether they are .index.cleaned or .log.cleaned), we delete all .cleaned files and all .swap files that have larger/equal base offsets. Basically, this reverts ongoing compaction/split operations. Therefore, we don't have any additional .index.cleaned files.
Is that fair?
Thanks for the explanation. Makes sense.
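The KAFKA-6264 revert rule discussed above can be sketched as a pure function over file names. This is a simplified model, not the actual Log code; the offset parsing is an assumption about the file-name layout:

```scala
import java.io.File

object CleanedRevertSketch {
  // Hypothetical suffix constants mirroring Log.CleanedFileSuffix / Log.SwapFileSuffix.
  val CleanedSuffix = ".cleaned"
  val SwapSuffix = ".swap"

  // Simplified stand-in: parse the base offset from a name like
  // "00000000000000000010.log.cleaned".
  def offsetFromFile(file: File): Long =
    file.getName.takeWhile(_ != '.').toLong

  // Per KAFKA-6264: if any .cleaned file exists, drop every .cleaned file and
  // every .swap file whose base offset is >= the smallest .cleaned offset,
  // which reverts the interrupted compaction/split. Returns the files to delete.
  def filesToDelete(files: Seq[File]): Seq[File] = {
    val cleaned = files.filter(_.getName.endsWith(CleanedSuffix))
    if (cleaned.isEmpty) Seq.empty
    else {
      val minCleanedOffset = cleaned.map(offsetFromFile).min
      cleaned ++ files.filter(f =>
        f.getName.endsWith(SwapSuffix) && offsetFromFile(f) >= minCleanedOffset)
    }
  }
}
```

Because the revert removes every .cleaned file, no stray .index.cleaned file can survive into the later renaming passes, which is the point being made above.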
@junrao Thanks for the review. I have addressed the comments. Please take a look.
    // 5) Simulate recovery after a subset of swap files are renamed to regular files and old segment files are renamed
    // to .deleted. Clean operation is resumed during recovery.
    log.logSegments.head.timeIndex.file.renameTo(new File(CoreUtils.replaceSuffix(log.logSegments.head.timeIndex.file.getPath, "", Log.SwapFileSuffix)))
    // .changeFileSuffixes("", Log.SwapFileSuffix)
My bad. Forgot to delete this line.
…ss required (apache#10763)

When we find a .swap file on startup, we typically want to rename and replace it as .log, .index, .timeindex, etc. as a way to complete any ongoing replace operations. These swap files are usually known to have been flushed to disk before the replace operation begins. One flaw in the current logic is that we recover these swap files on startup and, as part of that, end up truncating the producer state and rebuilding it from scratch. This is unneeded, as the replace operation does not mutate the producer state by itself; it is only meant to replace the .log file along with corresponding indices. Because of this unneeded producer state rebuild operation, we have seen multi-hour startup times for clusters that have large compacted topics. This patch fixes the issue.

With ext4 ordered mode, the metadata operations are ordered, no matter whether it is a clean or unclean shutdown. As a result, we rework the recovery workflow as follows.

1. If there are any .cleaned files, we delete all .swap files with higher/equal offsets due to KAFKA-6264. We also delete the .cleaned files. If there is no .cleaned file, do nothing for this step.
2. If there are any .log.swap files left after step 1, they, together with their index files, must be renamed from .cleaned and are complete (renaming from .cleaned to .swap is in reverse offset order). We rename these .log.swap files and their corresponding index files to regular files, while deleting the original files from compaction or segment split if they haven't been deleted.
3. Do log splitting for legacy log segments with offset overflow (KAFKA-6264).
4. If there are any other index swap files left, they must come from partial renaming from .swap files to regular files. We can simply rename them to regular files.

credit: some code is copied from @dhruvilshah3 's PR: apache#10388

Reviewers: Dhruvil Shah <dhruvil@confluent.io>, Jun Rao <junrao@gmail.com>
When we find a .swap file on startup, we typically want to rename and replace it as .log, .index, .timeindex, etc. as a way to complete any ongoing replace operations. These swap files are usually known to have been flushed to disk before the replace operation begins.
One flaw in the current logic is that we recover these swap files on startup and, as part of that, end up truncating the producer state and rebuilding it from scratch. This is unneeded, as the replace operation does not mutate the producer state by itself. It is only meant to replace the .log file along with corresponding indices. Because of this unneeded producer state rebuild operation, we have seen multi-hour startup times for clusters that have large compacted topics.
This patch fixes the issue. With ext4 ordered mode, the metadata operations are ordered, no matter whether it is a clean or unclean shutdown. As a result, we rework the recovery workflow as follows.
1. If there are any .cleaned files, we delete all .swap files with higher/equal offsets due to KAFKA-6264. We also delete the .cleaned files. If there is no .cleaned file, do nothing for this step.
2. If there are any .log.swap files left after step 1, they, together with their index files, must be renamed from .cleaned and are complete (renaming from .cleaned to .swap is in reverse offset order). We rename these .log.swap files and their corresponding index files to regular files, while deleting the original files from compaction or segment split if they haven't been deleted.
3. Do log splitting for legacy log segments with offset overflow (KAFKA-6264).
4. If there are any other index swap files left, they must come from partial renaming from .swap files to regular files. We can simply rename them to regular files.

credit: some code is copied from @dhruvilshah3 's PR: #10388
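As a rough, name-level simulation of the first two steps of this workflow (steps 3 and 4 are omitted; the file-name parsing is a simplified assumption, not the actual Log code):

```scala
object RecoverySketch {
  // A hypothetical pure model over file names of steps 1 and 2 of the
  // reworked recovery workflow.
  def recoverFileNames(files: Set[String]): Set[String] = {
    def offset(n: String): Long = n.takeWhile(_ != '.').toLong
    // Step 1: any .cleaned file reverts in-flight work at >= its min offset
    // (KAFKA-6264): drop all .cleaned files and all .swap files at or above it.
    val cleaned = files.filter(_.endsWith(".cleaned"))
    val afterStep1 =
      if (cleaned.isEmpty) files
      else {
        val minCleaned = cleaned.map(offset).min
        files -- cleaned -- files.filter(f => f.endsWith(".swap") && offset(f) >= minCleaned)
      }
    // Step 2: the remaining .swap files are complete; rename them to regular
    // files by stripping the suffix.
    afterStep1.map(_.stripSuffix(".swap"))
  }
}
```

With a .cleaned file present, the interrupted operation is reverted and only the old regular files survive; without one, the .swap files are simply promoted to regular files.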
Committer Checklist (excluded from commit message)