Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CORE-7229] storage: add tombstone deletion implementation to local storage compaction #23231

Merged
merged 13 commits into from
Sep 18, 2024

Conversation

WillemKauf
Copy link
Contributor

@WillemKauf WillemKauf commented Sep 6, 2024

This PR adds the underlying logic for tombstone removal to the local storage compaction subsystem.

tombstone.retention.ms is added as a field in storage::compaction_config, and can be used to remove tombstone records during or past the second time they are "seen" by the compaction subsystem.

A tombstone record is first considered "seen" when the owning segment is fully indexed during sliding window compaction (therefore, the owning segment is fully de-duplicated, and thus "clean"- i.e, no keys in that segment exist as potential duplicates in the log up to that point). This is the only time a segment can be considered "cleaned" by compaction.

A tombstone record can be considered "seen" for the second time either in self-compaction or again in sliding window compaction. At this point, it is safe to remove the tombstone record completely from the segment, if timestamp::now() > clean_compact_timestamp + tombstone.retention.ms.

This PR does NOT add user facing configuration options for tombstone.retention.ms or any way to enable this feature yet, as this is coming in future PRs (along with more end to end testing of tombstone removal). This parameter is intentionally left as std::nullopt to ensure the log_manager does not execute any tombstone deletion during housekeeping.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.2.x
  • v24.1.x
  • v23.3.x

Release Notes

  • none

@@ -275,6 +275,8 @@ log_manager::housekeeping_scan(model::timestamp collection_threshold) {
collection_threshold,
_config.retention_bytes(),
current_log.handle->stm_manager()->max_collectible_offset(),
/*TODO: current_log.handle->config().tombstone_retention_ms()*/
std::nullopt,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This parameter is intentionally left as std::nullopt to ensure the log_manager does not execute any tombstone deletion during housekeeping.

@@ -360,7 +361,7 @@ class record_batch_attributes final {
record_batch_attributes& operator|=(model::compression c) {
// clang-format off
_attributes |=
static_cast<std::underlying_type_t<model::compression>>(c)
static_cast<std::underlying_type_t<model::compression>>(c)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

trailing whitespace removal

@WillemKauf WillemKauf changed the title storage: add tombstone deletion implementation to local storage compaction [CORE-7229]: storage: add tombstone deletion implementation to local storage compaction Sep 8, 2024
@WillemKauf WillemKauf changed the title [CORE-7229]: storage: add tombstone deletion implementation to local storage compaction [CORE-7229] storage: add tombstone deletion implementation to local storage compaction Sep 8, 2024
src/v/model/record.h Outdated Show resolved Hide resolved
@nvartolomei
Copy link
Contributor

Empty strings/byte buffers vs null values

/opt/kafka/bin $ ./kafka-topics.sh --bootstrap-server localhost:9092 --create --topic test-topic --config cleanup.policy=compact --config max.compaction.lag.ms=10000 --config min
.cleanable.dirty.ratio=0.0 --config segment.ms=10000 --config delete.retention.ms=10000

/opt/kafka/bin $ ./kafka-console-producer.sh --bootstrap-server localhost:9092 --topic test-topic  --property "parse.key=true" --property "key.separator=:" --property "null.marke
r=foo"
>a:1
>a:2
>a:
>b:1
>b:foo
>c:foo
/opt/kafka/bin $ ./kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test-topic --from-beginning --property print.key=true
a	1
a	2
a
b	1
b	null
c	null
/opt/kafka/bin $ ./kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test-topic --from-beginning --property print.key=true
a
b	null
c	null
d	a
/opt/kafka/bin $ ./kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test-topic --from-beginning --property print.key=true
a
d	a
d	b

with real kafka, d appeared because i produced some other value thinking that it might help get to tombstone delition faster

Copy link
Contributor

@andrwng andrwng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At a high level looks good!

src/v/storage/types.h Outdated Show resolved Hide resolved
Comment on lines +135 to +138
// If set, the timestamp at which every record up to and including
// those in this segment were first compacted via sliding window.
// If not yet set, sliding window compaction has not yet been applied to
// every previous record in the log.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Food for thought:

Are we guaranteed to eventually get some clean segments? What if we keep getting new segments, could we starve out sliding window compaction? Wondering if we need to update the policy for handling new segments by always finishing the current sliding window range.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very good point. I would think we would need a relatively high key cardinality/ingress rate/small segment size to encounter this starvation behavior, but potentially changing the behavior around _last_compaction_window_start_offset in the presence of new segments could be beneficial (driving _last_compaction_window_start_offset to the first segment's base_offset() before considering new segments in the window in order to ensure we are constantly producing clean segments?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, we'd see this behavior when the log has a cardinality higher than what fits in a single offset map, and we have new segments being rolled.

Agreed that cleaning down to the log start before proceeding makes sense

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds great, will have this change in a follow-up PR.

src/v/storage/segment_utils.h Show resolved Hide resolved
src/v/storage/segment_utils.cc Show resolved Hide resolved
src/v/storage/tests/compaction_e2e_test.cc Outdated Show resolved Hide resolved
src/v/storage/tests/compaction_e2e_test.cc Outdated Show resolved Hide resolved
src/v/storage/tests/compaction_e2e_test.cc Outdated Show resolved Hide resolved
src/v/storage/tests/compaction_e2e_test.cc Show resolved Hide resolved
src/v/storage/tests/compaction_e2e_test.cc Outdated Show resolved Hide resolved
@WillemKauf WillemKauf force-pushed the tombstone_implementation branch from f95c28a to 7e18d2d Compare September 11, 2024 14:53
@WillemKauf
Copy link
Contributor Author

WillemKauf commented Sep 11, 2024

Force push to:

  • Revert changes to record::has_value() considering records with value size == 0 as tombstones.
  • Correct produce_consume_utils.h tombstone producers per the above reversion.
  • Use std::ranges::all_of in disk_log_impl::do_compact_adjacent_segments().
  • Improve code comments per Andrew's request
  • Move do_sliding_window_compact into CompactionFixtureTest as a method
  • Add tests for segments that are expected to have may_have_compactible_records() evaluate to false.
  • Remove accidentally duplicated code in segment_utils.h
  • Various test improvements
  • Add tests for sliding window compaction that require multiple passes.
  • Fix a bug in which segments at the front of a log marked as having finished sliding window compaction due to having no compactible records were not properly marked with a _clean_compact_timestamp.
  • Remove changes to test_offset_range_size2_compacted, pulled into PR storage: reduce num_test_cases in test_offset_range_size2_compacted #23276

@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented Sep 11, 2024

@WillemKauf WillemKauf force-pushed the tombstone_implementation branch from 27b7e7d to 457bd5b Compare September 17, 2024 15:43
@WillemKauf
Copy link
Contributor Author

Force push to:

Consider the case in which we have 3 segments:

S: [S1] [S2] [ S3 ]
K: |K1| |K2| | K1 |
V: |V1| |V2| |null|

The current condition for `num_compactible_records > 1` in
`may_have_compactible_records()` would result in these segments being removed
from the range used for window compaction, and prevent the tombstone value
for `K1` in `S3` from being applied to `K1` in `S1`.

This condition is mostly due to historical reasons, in which we didn't
want to have completely empty segments post compaction. This issue is solved
by the placeholder feature.

Adjust it to `num_compactible_records > 0` to allow the above case to work
as expected. This change should not have any other dramatic effects on the
process of compaction.

Also modify tests that use `may_have_compactible_records()` to reflect
the updated behavior.
Persist the timestamp at which every record up to and including
those in this segment were first compacted via sliding window in the
`index_state`.

This will indicate whether or not a segment can be considered "clean"
or "dirty" still during compaction.
We use `seg->mark_as_finished_window_compaction()` to indicate
that a segment has been through a full round of window compaction,
whether it is completely de-duplicated ("clean") or only partially
indexed (still "dirty").

Add `mark_segment_as_finished_window_compaction()` to `segment_utils`
as a helper function to help mark a segment as completed window compaction,
and whether it is "clean" (in which case we mark the `clean_compact_timestamp`
in the `segment_index`).
For use during the self-compaction and window compaction process in order to
tell whether a record should be retained or not (in the case that it is a tombstone
record, with a value set for `tombstone_delete_horizon`).
Utility function for getting the optional `timestamp` past which
tombstones can be removed.

This returns a value iff the segment `s` has been marked as cleanly
compacted, and the compaction_config has a value assigned for
`tombstone_retention_ms`. In all other cases, `std::nullopt` is returned,
indicating that tombstone records will not be removed if encountered.
During the copying process in self compaction, we can check for any tombstone
record that has been marked clean by the sliding window compaction process.

If it has been marked clean, and the current timestamp is past the tombstone
delete horizon defined by `clean_compact_timestamp + tombstone.retention.ms`,
it is eligible for deletion.

Add logic to the `should_keep()` function used in the `copy_reducer` which
removes tombstones during the copy process.
During the deduplication process in sliding window compaction,
if a tombstone record has already been seen and is past the tombstone
horizon set by the `clean_compact_timestamp + tombstone.retention.ms`,
it is eligible for deletion.

Add logic to the `copy_reducer` which removes tombstones during the
deduplication process.
@WillemKauf WillemKauf force-pushed the tombstone_implementation branch from 457bd5b to 8673444 Compare September 17, 2024 15:55
@WillemKauf
Copy link
Contributor Author

Force push to:

  • Correctly apply clang-format with updated version 18 via bazel run //tools:clang_format

Copy link
Contributor

@andrwng andrwng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good, I think the consumer test util should be changed though?

Comment on lines +135 to +138
// If set, the timestamp at which every record up to and including
// those in this segment were first compacted via sliding window.
// If not yet set, sliding window compaction has not yet been applied to
// every previous record in the log.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, we'd see this behavior when the log has a cardinality higher than what fits in a single offset map, and we have new segments being rolled.

Agreed that cleaning down to the log start before proceeding makes sense

src/v/storage/segment_utils.cc Show resolved Hide resolved
src/v/storage/segment_index.h Show resolved Hide resolved
src/v/storage/segment_utils.cc Show resolved Hide resolved
src/v/kafka/server/tests/produce_consume_utils.h Outdated Show resolved Hide resolved
// to compact the tombstone records will be eligible for deletion.
ss::sleep(tombstone_retention_ms).get();

// Generate one record, so that sliding window compaction can occur.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe as a follow-up, it seems like a reasonable improvement to have compaction examine the deletion horizon against any of the segments, and whether there are tombstones in a given segment. Then we wouldn't need to rely on this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, will have this enhancement in the aforementioned follow up PR.

src/v/storage/tests/compaction_e2e_test.cc Outdated Show resolved Hide resolved
src/v/storage/tests/compaction_e2e_test.cc Outdated Show resolved Hide resolved
src/v/storage/tests/compaction_e2e_test.cc Outdated Show resolved Hide resolved
src/v/storage/tests/compaction_e2e_test.cc Show resolved Hide resolved
For ease of adding tombstone records to a partition in fixture tests.
@WillemKauf
Copy link
Contributor Author

WillemKauf commented Sep 17, 2024

Force push to:

  • Make tests::kv_t::val an std::optional<ss::string> to give it proper tombstone semantics, and correct consumer utils behavior for tombstone records.
  • Alter tombstone tests in compaction_e2e_test to assert on kv_t.is_tombstone() where necessary.
  • Alter tombstone tests in compaction_e2e_test with restart() calls instead of producing an additional record in order to force additional rounds of sliding window compaction.
  • Remove parameterization of tombstone tests in compaction_e2e_test with additional rounds of generated TombstonesRandomArgs

@WillemKauf WillemKauf force-pushed the tombstone_implementation branch 2 times, most recently from 9234243 to 9e1456f Compare September 17, 2024 20:19
@WillemKauf WillemKauf requested a review from andrwng September 17, 2024 20:53
andrwng
andrwng previously approved these changes Sep 17, 2024
@WillemKauf WillemKauf merged commit 2e1cdb6 into redpanda-data:dev Sep 18, 2024
17 checks passed
Comment on lines -889 to +892
bool all_window_compacted = true;
for (const auto& seg : segments) {
if (!seg->finished_windowed_compaction()) {
all_window_compacted = false;
break;
}
}
const bool all_window_compacted = std::ranges::all_of(
segments, &segment::finished_windowed_compaction);

auto all_segments_self_compacted = std::ranges::all_of(
const bool all_segments_self_compacted = std::ranges::all_of(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changes like this are great, but don't feel like they need to sit indefinitely in a PR. you can pluck them out and get them merged separately much faster and keep the PR leaner.

Comment on lines +488 to +497
// The retention time for tombstones. Tombstone removal occurs only for
// "clean" compacted segments past the tombstone deletion horizon timestamp,
// which is a segment's clean_compact_timestamp + tombstone_retention_ms.
// This means tombstones take at least two rounds of compaction to remove a
// tombstone: at least one pass to make a segment clean, and another pass
// some time after tombstone.retention.ms to remove tombstones.
//
// Tombstone removal is only supported for topics with remote writes
// disabled. As a result, this field will only have a value for compaction
// ran on non-archival topics.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -897,7 +897,7 @@ bool segment::may_have_compactible_records() const {
// that there were no data records, so err on the side of caution.
return true;
}
return num_compactible_records.value() > 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This condition is mostly due to historical reasons, in which we didn't
want to have completely empty segments post compaction. This issue is solved
by the placeholder feature.

What is the "placeholder" feature?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants