storage: fix windowed compaction error handling #16928
Conversation
The staging file was missing a '.'. Note that these files are transient, so there isn't a compactibility concern here.
An upcoming test will inject an exception to simulate a failure case seen in the wild that is otherwise hard to reproduce. I'd considered using the file_sanitizer-based error injection, but it seems that infra only really shines at making Redpanda hit asserts, so it's not quite what I'm looking for (my upcoming test will assert that we _don't_ crash in a particular error scenario).
This will be useful to test methods that need a feature table (e.g. deduplicate_segments() has a feature table as an argument).
We were previously using the l-value variant of model::record_batch_reader::consume(), which doesn't have a built-in call to finally(). For a log_reader, this meant that in some scenarios (e.g. if the reducer were to fail) we could end up not calling finally(), and therefore wouldn't clear the underlying segment reader. This is exactly what happened: we hit

std::runtime_error (lz4f_compressend:ERROR_dstMaxSize_tooSmall)

errors as we compressed/decompressed batches, and then ultimately crashed because the log_reader was not closed:

Assert failure: (.../redpanda/redpanda/src/v/storage/log_reader.h:141) '!_iterator.reader' log reader destroyed with live reader
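The distinction the fix relies on can be sketched in plain C++. The types below are illustrative stand-ins, not the real model::record_batch_reader or seastar APIs: the l-value consume() overload runs the consumer with no cleanup of its own, while the r-value overload always releases the underlying reader, even when the consumer throws (as the lz4 error did).

```cpp
#include <cassert>
#include <stdexcept>
#include <utility>

// Hypothetical simplified model of the bug described above.
struct mock_reader {
    bool underlying_reader_live = true;

    // l-value overload: no built-in cleanup, mirroring consume()&.
    template <typename Consumer>
    int consume(Consumer c) & {
        return c();
    }

    // r-value overload: cleanup always runs, mirroring the
    // consume()&& path that chains finally().
    template <typename Consumer>
    int consume(Consumer c) && {
        try {
            int result = c();
            underlying_reader_live = false;
            return result;
        } catch (...) {
            underlying_reader_live = false; // released even on failure
            throw;
        }
    }
};

// A consumer that fails mid-stream, like the lz4 decompression error.
inline int failing_consumer() {
    throw std::runtime_error("simulated lz4 failure");
}
```

With the l-value call, an exception leaves the underlying reader live, which is the state that later trips the `'!_iterator.reader'` assert on destruction; writing `std::move(rdr).consume(...)` selects the r-value overload, whose cleanup runs on every path.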
@@ -227,7 +227,7 @@ ss::future<index_state> deduplicate_segment(
       &cmp_idx_writer,
       inject_reader_failure);

-    auto new_idx = co_await rdr.consume(
+    auto new_idx = co_await std::move(rdr).consume(
nice. i just lost my foot looking at this.
I think this is fine, but I would leave a ticket to fix record_batch_reader::impl::consume, by chaining a then_wrapped() inside consume()& to clean up in case of exceptions.
we have append_entries_request::serde_async_write that does a consume like this, maybe it warrants a similar treatment
or at least a comment around this in reader::consume about the exception safety
Yeah, I wasn't sure what the rationale was, but I assumed it was intentional that one doesn't call the other. At least I see that some other call-sites of the l-value variant explicitly include a finally() call.
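The call-site pattern mentioned above, pairing the l-value consume() with an explicit cleanup step, can be sketched without seastar by using a scope guard in place of chaining .finally() on the returned future. All names here are hypothetical; this only illustrates the shape of the workaround, not Redpanda's actual code.

```cpp
#include <utility>

// Minimal scope guard: runs its callback on every exit path, which is the
// role .finally([&] { ... }) plays on a seastar future chain.
template <typename F>
struct scope_guard {
    F fn;
    ~scope_guard() { fn(); }
};
template <typename F>
scope_guard<F> make_guard(F fn) {
    return scope_guard<F>{std::move(fn)};
}

// Illustrative stand-in for a reader whose l-value consume() has no
// built-in cleanup.
struct lvalue_reader {
    bool live = true;
    void release() { live = false; }

    template <typename Consumer>
    int consume(Consumer c) & { return c(); } // no cleanup of its own
};

// Consume through the l-value overload, but guarantee release() runs
// whether the consumer returns or throws.
inline int consume_with_cleanup(lvalue_reader& rdr) {
    auto guard = make_guard([&] { rdr.release(); });
    return rdr.consume([] { return 42; });
}
```

The guard makes the l-value path as exception-safe as the r-value one, at the cost of every call-site having to remember it, which is the maintenance concern the review thread is raising.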
/backport v23.3.x
Failed to create a backport PR to v23.3.x branch. I tried:
Backports Required
Release Notes
Bug Fixes