cloud_storage: fix segment materialisation race #13796

VladLazar · 2023-09-29T09:52:48Z

In order to create a new materialised segment, one needs to grab units
from materialized_resources first. This is an async operation. By the
time units are acquired, said segment might have already been
materialised via a different code path, resulting in the assertion in
remote_partition::materialize_segment triggering.
remote_partition::aborted_transactions was particularly susceptible to
this.

This patch fixes the issue by checking for the existence of the segment
and creating a segment (if needed) in the same scheduling task.
Functionally, for the read path, nothing should change.
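
The check-then-create pattern described above can be illustrated with a minimal sketch. It is not the code from this patch: `partition_sketch`, `segment_units`, `materialized_segment`, and the `long` offset key are hypothetical stand-ins, and the asynchronous unit acquisition is only described in comments rather than modelled with Seastar futures. The sketch assumes a cooperative scheduler, where code between two suspension points runs without interleaving.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-ins; these are not the real Redpanda types.
struct segment_units {}; // units obtained (asynchronously) before materialising

struct materialized_segment {
    std::string path;
    segment_units units;
};

class partition_sketch {
public:
    using segment_map = std::map<long, std::unique_ptr<materialized_segment>>;
    using iterator = segment_map::iterator;

    // Strict variant: the caller must guarantee that the segment is absent.
    // If the caller checked for absence *before* waiting for units, another
    // task may have inserted the segment in the meantime and this asserts.
    iterator materialize_segment(
      long base_offset, std::string path, segment_units units) {
        assert(_segments.find(base_offset) == _segments.end());
        auto seg = std::make_unique<materialized_segment>(
          materialized_segment{std::move(path), units});
        return _segments.emplace(base_offset, std::move(seg)).first;
    }

    // Tolerant variant: the existence check and the insertion run back to
    // back, with no suspension point in between, so no other task can
    // insert the same segment between the two steps.
    iterator get_or_materialize_segment(
      long base_offset, std::string path, segment_units units) {
        if (auto it = _segments.find(base_offset); it != _segments.end()) {
            // Already materialised elsewhere; the unused units are dropped
            // when this function returns.
            return it;
        }
        return materialize_segment(base_offset, std::move(path), units);
    }

    std::size_t size() const { return _segments.size(); }

private:
    segment_map _segments;
};
```

In this sketch, two callers that both acquired units for the same base offset and then call `get_or_materialize_segment` end up with the same iterator, whereas calling `materialize_segment` twice for the same offset would trip the assertion.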

Backports Required

Release Notes

Bug Fixes

Fix race leading to assertion on materialisation of cloud storage segments

src/v/cloud_storage/remote_partition.cc

vbotbuildovich · 2023-09-29T13:20:11Z

ducktape was retried in job https://buildkite.com/redpanda/redpanda/builds/37935#018ae0d0-e1bc-492d-bd8b-628d7ef1c1c4

vbotbuildovich · 2023-09-29T13:43:20Z

ducktape was retried in job https://buildkite.com/redpanda/redpanda/builds/37935#018ae0d0-e1bf-456c-b7d1-ebee89e6d16b

Lazin

LGTM

andrwng

Overall LGTM, but might be nice to avoid changes to the non-transactional path if we can avoid it

andrwng · 2023-09-29T15:52:20Z

src/v/cloud_storage/remote_partition.cc

@@ -163,7 +180,7 @@ remote_partition::borrow_result_t remote_partition::borrow_next_segment_reader(
    }
    if (iter == _segments.end()) {
        auto path = manifest.generate_segment_path(*mit);
-        iter = materialize_segment(path, *mit, std::move(segment_unit));
+        iter = get_or_materialize_segment(path, *mit, std::move(segment_unit));


Maybe we should have a separate call that expects callers to have made the segment check in the same task. Or maybe we can tweak the above find() to use get_or_materialize_segment()?

Just noting that this is seeking twice for the same offset with no scheduling points in between.

I think the offloading logic above should move into get_or_materialize_segment, but I chose not to do it in this PR to keep the change less intrusive.
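
For reference, the double seek mentioned in this thread can in principle be avoided by reusing the position from a single lookup. The following is only a sketch under assumptions: it uses a plain `std::map` keyed by a `long` offset with string values, and `get_or_insert_once` is a hypothetical helper, not something from this PR or from `remote_partition`.

```cpp
#include <map>
#include <string>
#include <utility>

// Hypothetical container: base offset -> segment path.
using segment_map = std::map<long, std::string>;

// Looks up the key once and reuses the resulting position as an insertion
// hint, so the tree is not searched a second time when the key is missing.
segment_map::iterator
get_or_insert_once(segment_map& segments, long base_offset, std::string path) {
    auto it = segments.lower_bound(base_offset);
    if (it != segments.end() && it->first == base_offset) {
        return it; // already present; the new path is discarded
    }
    return segments.emplace_hint(it, base_offset, std::move(path));
}
```

Whether this is worth doing here depends on the container actually used and on how cheap the second lookup is, so the trade-off made in the PR (keeping the change small) remains reasonable.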

vbotbuildovich · 2023-09-29T15:59:44Z

ducktape was retried in job https://buildkite.com/redpanda/redpanda/builds/37961#018ae173-4b0a-4ec4-945c-7129e98068e1

vbotbuildovich · 2023-09-29T16:06:04Z

ducktape was retried in job https://buildkite.com/redpanda/redpanda/builds/37961#018ae17e-9ec4-48a7-94d5-85ea1b264cfc

vbotbuildovich · 2023-09-29T16:11:18Z

ducktape was retried in job https://buildkite.com/redpanda/redpanda/builds/37961#018ae173-4b05-4356-b2bf-851ca1428c9a

VladLazar · 2023-09-29T17:04:53Z

Kinda hard to interpret with all the retries but failures are:

CI Failure (critical check offset_anomalies.size() == 3 has failed [2 != 3]) in cloud_storage_rpfixture.test_metadata_anomalies #13760
CI Failure (BadLogLines) in CloudStorageScrubberTest.cloud_storage_scrubber_test #13732

VladLazar · 2023-09-29T17:06:28Z

Will trigger another run to see if the instability is caused by this change or if we got unlucky.

VladLazar · 2023-09-29T17:06:34Z

/ci-repeat

vbotbuildovich · 2023-09-29T18:24:45Z

/backport v23.2.x

vbotbuildovich · 2023-09-29T19:54:20Z

ducktape was retried in job https://buildkite.com/redpanda/redpanda/builds/37985#018ae246-dee9-4316-9c80-b67619bb0cd8

vbotbuildovich · 2023-09-29T19:54:22Z

ducktape was retried in job https://buildkite.com/redpanda/redpanda/builds/37985#018ae246-def1-4340-9215-5af5ca51f8eb

vbotbuildovich · 2023-09-29T20:46:34Z

ducktape was retried in job https://buildkite.com/redpanda/redpanda/builds/37985#018ae254-800a-46a3-8307-64a6adee4249

dotnwat

👍

github-actions bot added the area/redpanda label Sep 29, 2023

VladLazar force-pushed the fix-segment-materialisation-race branch from dde1d87 to 1656069 September 29, 2023 09:53

VladLazar requested review from Lazin and abhijat September 29, 2023 09:53

abhijat reviewed Sep 29, 2023

abhijat previously approved these changes Sep 29, 2023

VladLazar dismissed abhijat’s stale review via fc1c08a September 29, 2023 10:35

VladLazar force-pushed the fix-segment-materialisation-race branch from 1656069 to fc1c08a September 29, 2023 10:35

VladLazar requested a review from abhijat September 29, 2023 10:36

abhijat previously approved these changes Sep 29, 2023

Lazin reviewed Sep 29, 2023

VladLazar dismissed abhijat’s stale review via 3100f40 September 29, 2023 14:00

VladLazar force-pushed the fix-segment-materialisation-race branch from fc1c08a to 3100f40 September 29, 2023 14:00

VladLazar requested a review from Lazin September 29, 2023 14:17

Lazin approved these changes Sep 29, 2023

andrwng reviewed Sep 29, 2023

piyushredpanda merged commit ed38970 into redpanda-data:dev Sep 29, 2023
9 checks passed

vbotbuildovich mentioned this pull request Sep 29, 2023

[v23.2.x] cloud_storage: fix segment materialisation race #13823

Merged

dotnwat reviewed Oct 6, 2023
