[CORE-7229] storage: add tombstone deletion implementation to local storage compaction #23231
Conversation
@@ -275,6 +275,8 @@ log_manager::housekeeping_scan(model::timestamp collection_threshold) {
      collection_threshold,
      _config.retention_bytes(),
      current_log.handle->stm_manager()->max_collectible_offset(),
      /*TODO: current_log.handle->config().tombstone_retention_ms()*/
      std::nullopt,
This parameter is intentionally left as std::nullopt to ensure the log_manager does not execute any tombstone deletion during housekeeping.
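As a rough illustration of the gating this comment describes (names here are hypothetical, not Redpanda's actual API), an optional retention value can act as the on/off switch for tombstone deletion:

```cpp
#include <cassert>
#include <chrono>
#include <optional>

// Hypothetical sketch: leaving the retention duration as std::nullopt
// disables tombstone deletion entirely, which is the effect of passing
// std::nullopt from log_manager::housekeeping_scan above.
bool tombstone_deletion_enabled(
  const std::optional<std::chrono::milliseconds>& tombstone_retention_ms) {
    return tombstone_retention_ms.has_value();
}
```

With this shape, housekeeping can stay entirely unaware of tombstones until a real retention value is wired through in a later PR.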
@@ -360,7 +361,7 @@ class record_batch_attributes final {
  record_batch_attributes& operator|=(model::compression c) {
      // clang-format off
      _attributes |=
-       static_cast<std::underlying_type_t<model::compression>>(c) 
+       static_cast<std::underlying_type_t<model::compression>>(c)
trailing whitespace removal
Empty strings/byte buffers vs null values
With real Kafka, "d" appeared because I produced some other value, thinking it might help reach tombstone deletion faster.
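The thread title points at a subtle Kafka detail: only a record with a *null* value is a tombstone; an empty string or byte buffer is an ordinary value. A minimal sketch of the distinction, using `std::optional` as a stand-in for a nullable record value:

```cpp
#include <cassert>
#include <optional>
#include <string>

// Sketch of the distinction: a tombstone is a record whose value is null.
// An empty string/byte buffer is a real (if empty) value, and compaction
// will never treat it as a deletion marker.
bool is_tombstone(const std::optional<std::string>& value) {
    return !value.has_value();
}
```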
At a high level looks good!
// If set, the timestamp at which every record up to and including
// those in this segment were first compacted via sliding window.
// If not yet set, sliding window compaction has not yet been applied to
// every previous record in the log.
Food for thought:
Are we guaranteed to eventually get some clean segments? What if we keep getting new segments, could we starve out sliding window compaction? Wondering if we need to update the policy for handling new segments by always finishing the current sliding window range.
This is a very good point. I would think we would need a relatively high key cardinality/ingress rate/small segment size to encounter this starvation behavior, but potentially changing the behavior around `_last_compaction_window_start_offset` in the presence of new segments could be beneficial (driving `_last_compaction_window_start_offset` to the first segment's `base_offset()` before considering new segments in the window, in order to ensure we are constantly producing clean segments?)
Right, we'd see this behavior when the log has a cardinality higher than what fits in a single offset map, and we have new segments being rolled. Agreed that cleaning down to the log start before proceeding makes sense.
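A minimal sketch of the policy discussed in this thread, with illustrative names (not Redpanda's actual API): newly rolled segments are only admitted into the window once the previous pass has cleaned down to the first segment's base offset.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical policy sketch: only admit newly rolled segments into the
// sliding window once the previous pass has driven the window start down
// to the first segment's base offset, so clean segments keep being
// produced even under constant ingress.
bool may_admit_new_segments(
  int64_t last_compaction_window_start_offset,
  int64_t first_segment_base_offset) {
    return last_compaction_window_start_offset <= first_segment_base_offset;
}
```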
Sounds great, will have this change in a follow-up PR.
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/54319#0191e23f-e4af-4df4-b932-9f0f912acc43
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/54602#019200f3-a70b-460f-929a-9ba70d2305f3
Improves readability.
Consider the case in which we have 3 segments:

S: [S1] [S2] [ S3 ]
K: |K1| |K2| | K1 |
V: |V1| |V2| |null|

The current condition of `num_compactible_records > 1` in `may_have_compactible_records()` would result in these segments being removed from the range used for window compaction, and prevent the tombstone value for `K1` in `S3` from being applied to `K1` in `S1`.

This condition is mostly due to historical reasons, in which we didn't want to have completely empty segments post-compaction. This issue is solved by the placeholder feature. Adjust it to `num_compactible_records > 0` to allow the above case to work as expected. This change should not have any other dramatic effects on the process of compaction.

Also modify tests that use `may_have_compactible_records()` to reflect the updated behavior.
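The scenario above can be modeled with a toy key-value deduplication pass (a sketch, not Redpanda's compaction code): once the window covers all three single-record segments, the later tombstone for `K1` overrides `V1`.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Toy model of the 3-segment example: each record is (key, optional value),
// and a window pass keeps only the newest entry per key, tombstones
// included. With the old `> 1` condition, single-record segments would be
// excluded from the window and K1's tombstone could never reach S1.
using record = std::pair<std::string, std::optional<std::string>>;

std::map<std::string, std::optional<std::string>>
deduplicate(const std::vector<record>& log) {
    std::map<std::string, std::optional<std::string>> latest;
    for (const auto& [key, value] : log) {
        latest[key] = value; // later records win, including null (tombstone)
    }
    return latest;
}
```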
Persist the timestamp at which every record up to and including those in this segment were first compacted via sliding window in the `index_state`. This will indicate whether or not a segment can be considered "clean" or "dirty" still during compaction.
...in the `segment_index`.
We use `seg->mark_as_finished_window_compaction()` to indicate that a segment has been through a full round of window compaction, whether it is completely de-duplicated ("clean") or only partially indexed (still "dirty"). Add `mark_segment_as_finished_window_compaction()` to `segment_utils` as a helper function to help mark a segment as completed window compaction, and whether it is "clean" (in which case we mark the `clean_compact_timestamp` in the `segment_index`).
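A toy model of this helper's behavior, with illustrative types and names (not Redpanda's actual `segment` or `segment_index` classes):

```cpp
#include <cassert>
#include <chrono>
#include <optional>

// Toy model: finishing a window-compaction round always marks the segment,
// but clean_compact_timestamp is only recorded when the segment came out
// fully de-duplicated ("clean").
struct segment_state {
    bool finished_window_compaction = false;
    std::optional<std::chrono::milliseconds> clean_compact_timestamp;
};

void mark_segment_as_finished_window_compaction(
  segment_state& seg, bool is_clean, std::chrono::milliseconds now) {
    seg.finished_window_compaction = true;
    if (is_clean) {
        seg.clean_compact_timestamp = now;
    }
}
```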
For use during the self-compaction and window compaction process in order to tell whether a record should be retained or not (in the case that it is a tombstone record, with a value set for `tombstone_delete_horizon`).
Utility function for getting the optional `timestamp` past which tombstones can be removed. This returns a value iff the segment `s` has been marked as cleanly compacted, and the compaction_config has a value assigned for `tombstone_retention_ms`. In all other cases, `std::nullopt` is returned, indicating that tombstone records will not be removed if encountered.
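A sketch of that contract, using plain `std::chrono` durations in place of Redpanda's timestamp type:

```cpp
#include <cassert>
#include <chrono>
#include <optional>

using timestamp = std::chrono::milliseconds;

// Sketch of the described contract: a horizon exists iff the segment has a
// clean_compact_timestamp AND the config sets tombstone_retention_ms;
// otherwise std::nullopt means "never remove tombstones".
std::optional<timestamp> tombstone_delete_horizon(
  std::optional<timestamp> clean_compact_timestamp,
  std::optional<timestamp> tombstone_retention_ms) {
    if (clean_compact_timestamp && tombstone_retention_ms) {
        return *clean_compact_timestamp + *tombstone_retention_ms;
    }
    return std::nullopt;
}
```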
During the copying process in self compaction, we can check for any tombstone record that has been marked clean by the sliding window compaction process. If it has been marked clean, and the current timestamp is past the tombstone delete horizon defined by `clean_compact_timestamp + tombstone.retention.ms`, it is eligible for deletion. Add logic to the `should_keep()` function used in the `copy_reducer` which removes tombstones during the copy process.
During the deduplication process in sliding window compaction, if a tombstone record has already been seen and is past the tombstone horizon set by the `clean_compact_timestamp + tombstone.retention.ms`, it is eligible for deletion. Add logic to the `copy_reducer` which removes tombstones during the deduplication process.
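Combining the two descriptions above, the retention decision might be sketched as follows (an illustrative `should_keep` signature, not the actual `copy_reducer` code):

```cpp
#include <cassert>
#include <chrono>
#include <optional>

using timestamp = std::chrono::milliseconds;

// Illustrative decision function: a record is dropped only when it is a
// tombstone AND a delete horizon exists AND the current time is past that
// horizon; everything else is always kept.
bool should_keep(
  bool is_tombstone, timestamp now, std::optional<timestamp> delete_horizon) {
    if (!is_tombstone || !delete_horizon) {
        return true;
    }
    return now <= *delete_horizon;
}
```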
Overall looks good, I think the consumer test util should be changed though?
// to compact the tombstone records will be eligible for deletion.
ss::sleep(tombstone_retention_ms).get();

// Generate one record, so that sliding window compaction can occur.
Maybe as a follow-up, it seems like a reasonable improvement to have compaction examine the deletion horizon against any of the segments, and whether there are tombstones in a given segment. Then we wouldn't need to rely on this
Sounds good, will have this enhancement in the aforementioned follow-up PR.
For ease of adding tombstone records to a partition in fixture tests.
bool all_window_compacted = true;
for (const auto& seg : segments) {
    if (!seg->finished_windowed_compaction()) {
        all_window_compacted = false;
        break;
    }
}

becomes

const bool all_window_compacted = std::ranges::all_of(
  segments, &segment::finished_windowed_compaction);

and

auto all_segments_self_compacted = std::ranges::all_of(

becomes

const bool all_segments_self_compacted = std::ranges::all_of(
Changes like this are great, but they don't need to sit indefinitely in a PR. You can pluck them out and get them merged separately much faster, keeping the PR leaner.
// The retention time for tombstones. Tombstone removal occurs only for
// "clean" compacted segments past the tombstone deletion horizon timestamp,
// which is a segment's clean_compact_timestamp + tombstone_retention_ms.
// This means it takes at least two rounds of compaction to remove a
// tombstone: at least one pass to make a segment clean, and another pass
// some time after tombstone.retention.ms to remove tombstones.
//
// Tombstone removal is only supported for topics with remote writes
// disabled. As a result, this field will only have a value for compaction
// run on non-archival topics.
👍
@@ -897,7 +897,7 @@ bool segment::may_have_compactible_records() const {
      // that there were no data records, so err on the side of caution.
      return true;
  }
  return num_compactible_records.value() > 1;
This condition is mostly due to historical reasons, in which we didn't
want to have completely empty segments post compaction. This issue is solved
by the placeholder feature.
What is the "placeholder" feature?
This PR adds the underlying logic for tombstone removal to the local storage compaction subsystem. `tombstone.retention.ms` is added as a field in `storage::compaction_config`, and can be used to remove tombstone records during or past the second time they are "seen" by the compaction subsystem.

A tombstone record is first considered "seen" when the owning segment is fully indexed during sliding window compaction (therefore, the owning segment is fully de-duplicated, and thus "clean", i.e. no keys in that segment exist as potential duplicates in the log up to that point). This is the only time a segment can be considered "cleaned" by compaction.

A tombstone record can be considered "seen" for the second time either in self-compaction or again in sliding window compaction. At this point, it is safe to remove the tombstone record completely from the segment, if `timestamp::now() > clean_compact_timestamp + tombstone.retention.ms`.

This PR does NOT add user-facing configuration options for `tombstone.retention.ms` or any way to enable this feature yet, as this is coming in future PRs (along with more end-to-end testing of tombstone removal). This parameter is intentionally left as `std::nullopt` to ensure the `log_manager` does not execute any tombstone deletion during housekeeping.

Backports Required

Release Notes