Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CORE-6835 Fix first time compacted segment upload #22812

Merged
merged 3 commits into from
Aug 17, 2024

Conversation

Lazin
Copy link
Contributor

@Lazin Lazin commented Aug 9, 2024

This PR fixes the issue which manifests if the TS is enabled for pre-existing compacted topic. Normally, the TS prevents compaction from compacting any offset range which is not uploaded to the cloud storage yet. Our upload path always deals with non-compacted data without gaps. The only exception is when the compaction is enabled for existing compacted topic. In this case the data is already compacted but has to be uploaded to the cloud storage anyway. And the upload is performed by the normal upload path (not compacted reupload).

The problem is that there is a bug in the archival_policy::find_segment which makes it skip compacted segment. If the segment that the policy have found is compacted it just increments the iterator once and creates a gap. The segment can't be added to the archival STM. New segment does not line up with previous segment error is added to the log and the operation is retried later. The uploads for the topic can't progress.

This PR fixes the bug and adds a ducktape test that reproduces the problem.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.2.x
  • v24.1.x
  • v23.3.x

Release Notes

Bug Fixes

  • Fix New segment does not line up with previous segment error for compacted topics.

The test creates compacted topic without TS. It produces the data and
waits until the data is compacted. After that it enables TS for the
topic. The archiver has to be able to upload the data.
@Lazin Lazin force-pushed the fix/upload-compacted-range branch from c8bb6ed to fe97fcf Compare August 9, 2024 11:49
Comment on lines 188 to 199
if (it == set.end() || eligible_for_compacted_reupload(**it)) {
// Skip forward if we hit a gap or compacted segment
for (auto i = set.begin(); i != set.end(); i++) {
const auto& sg = *i;
if (start_offset < sg->offsets().get_base_offset()) {
// Move last offset forward
it = i;
start_offset = sg->offsets().get_base_offset();
break;
}
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This clause used to take care of these conditions:

  • if the start_offset lies in a gap, then make it point to the segment after the gap
  • if it is compacted, then point to the segment starting after start_offset (which might also be compacted)

From the cover letter it is not possible for this code path to encounter a compacted segment because of the condition imposed on compaction (no compaction before upload). This path is also not used for compacted re-upload.

That still leaves the case where the start_offset lies in a gap. Do we no longer expect this to happen? From what I can see now the progress will stop if the offset lies in a gap between two segments.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the start_offset lies in the gap the lower_bound will find the next segment in the list (which will have base-offset >= start_offset).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looked to me (the code is a little convoluted in segment set) like it will return end because from https://github.com/redpanda-data/redpanda/blob/dev/src/v/storage/segment_set.cc#L94-L113 both needle_in_range checks after the segment after the offset is selected, will fail because the offset is in neither the segment right after the gap nor in the segment before the gap.

But if it's known that the next segment will be returned and not end then that sounds fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the alternative implementation for gap handling. But the code now doesn't care if the segment compacted or not.

@Lazin Lazin force-pushed the fix/upload-compacted-range branch from fe97fcf to 5b0ed9e Compare August 9, 2024 18:31
@Lazin Lazin requested a review from abhijat August 9, 2024 18:31
The archival_policy::find_segment method has a bug which causes it to
skip compacted segments unconditionally. The code checks if the segment
that it found is compacted and if this is the case it advances the
iterator to the next segment. This operation causes ntp_archiver to skip
one segment. The uploaded segment doesn't line up with the previous one
so the validation error is triggered.

The commit also implements handling of gaps in a safe way. We're not
supposed to have gaps in the log (even in compacted logs) so the
implementation has linear complexity. It's not supposed to be used and
is good enough to hadnle unexpected unique glitch.
The test validates logic which was removed. Previousy, the uploader was
skipping single compacted segment and now it will upload compacted
segments. So the upload in the test should start from 0.
@Lazin Lazin force-pushed the fix/upload-compacted-range branch from 5b0ed9e to e2fd980 Compare August 9, 2024 21:53
@Lazin
Copy link
Contributor Author

Lazin commented Aug 10, 2024

/ci-repeat 10

@vbotbuildovich
Copy link
Collaborator

new failures in https://buildkite.com/redpanda/redpanda/builds/52758#01913cd4-cdef-4073-a78c-732caa5ca574:

"rptest.tests.e2e_iam_role_test.AWSRoleFetchTests.test_write"

@piyushredpanda piyushredpanda merged commit 0fe6230 into redpanda-data:dev Aug 17, 2024
16 of 20 checks passed
@vbotbuildovich
Copy link
Collaborator

/backport v24.2.x

@vbotbuildovich
Copy link
Collaborator

/backport v24.1.x

@vbotbuildovich
Copy link
Collaborator

/backport v23.3.x

@vbotbuildovich
Copy link
Collaborator

Failed to create a backport PR to v23.3.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-22812-v23.3.x-84 remotes/upstream/v23.3.x
git cherry-pick -x 9bbe429c73f52da28a8749cc7b2462f85f6c17be de013217e1887ca87330144695bfd6cb94dbb589 e2fd9806a13c916cb7a090847754976b18018437

Workflow run logs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants