[CORE-7229] storage: add tombstone deletion implementation to local storage compaction #23231
Conversation
@@ -275,6 +275,8 @@ log_manager::housekeeping_scan(model::timestamp collection_threshold) {
      collection_threshold,
      _config.retention_bytes(),
      current_log.handle->stm_manager()->max_collectible_offset(),
      /*TODO: current_log.handle->config().tombstone_retention_ms()*/
      std::nullopt,
This parameter is intentionally left as std::nullopt to ensure the log_manager does not execute any tombstone deletion during housekeeping.
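As a rough illustration of the gating this comment describes (names here are hypothetical, not Redpanda's actual API), an optional retention value can act as the on/off switch for tombstone deletion:

```cpp
#include <cassert>
#include <chrono>
#include <optional>

// Hypothetical sketch: leaving the retention duration as std::nullopt
// disables tombstone deletion entirely, which is the effect of passing
// std::nullopt from log_manager::housekeeping_scan above.
bool tombstone_deletion_enabled(
  const std::optional<std::chrono::milliseconds>& tombstone_retention_ms) {
    return tombstone_retention_ms.has_value();
}
```

With this shape, housekeeping can stay entirely unaware of tombstones until a real retention value is wired through in a later PR.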
@@ -360,7 +361,7 @@ class record_batch_attributes final {
  record_batch_attributes& operator|=(model::compression c) {
      // clang-format off
      _attributes |=
-       static_cast<std::underlying_type_t<model::compression>>(c) 
+       static_cast<std::underlying_type_t<model::compression>>(c)
trailing whitespace removal
Empty strings/byte buffers vs null values
With real Kafka, "d" appeared because I produced some other value, thinking it might help reach tombstone deletion faster.
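The thread title points at a subtle Kafka detail: only a record with a *null* value is a tombstone; an empty string or byte buffer is an ordinary value. A minimal sketch of the distinction, using `std::optional` as a stand-in for a nullable record value:

```cpp
#include <cassert>
#include <optional>
#include <string>

// Sketch of the distinction: a tombstone is a record whose value is null.
// An empty string/byte buffer is a real (if empty) value, and compaction
// will never treat it as a deletion marker.
bool is_tombstone(const std::optional<std::string>& value) {
    return !value.has_value();
}
```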
At a high level looks good!
// If set, the timestamp at which every record up to and including
// those in this segment were first compacted via sliding window.
// If not yet set, sliding window compaction has not yet been applied to
// every previous record in the log.
Food for thought:
Are we guaranteed to eventually get some clean segments? What if we keep getting new segments, could we starve out sliding window compaction? Wondering if we need to update the policy for handling new segments by always finishing the current sliding window range.
This is a very good point. I would think we would need a relatively high key cardinality/ingress rate/small segment size to encounter this starvation behavior, but potentially changing the behavior around `_last_compaction_window_start_offset` in the presence of new segments could be beneficial (driving `_last_compaction_window_start_offset` to the first segment's `base_offset()` before considering new segments in the window, in order to ensure we are constantly producing clean segments?)
Right, we'd see this behavior when the log has a cardinality higher than what fits in a single offset map, and we have new segments being rolled. Agreed that cleaning down to the log start before proceeding makes sense.
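A minimal sketch of the policy discussed in this thread, with illustrative names (not Redpanda's actual API): newly rolled segments are only admitted into the window once the previous pass has cleaned down to the first segment's base offset.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical policy sketch: only admit newly rolled segments into the
// sliding window once the previous pass has driven the window start down
// to the first segment's base offset, so clean segments keep being
// produced even under constant ingress.
bool may_admit_new_segments(
  int64_t last_compaction_window_start_offset,
  int64_t first_segment_base_offset) {
    return last_compaction_window_start_offset <= first_segment_base_offset;
}
```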
Sounds great, will have this change in a follow-up PR.
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/54319#0191e23f-e4af-4df4-b932-9f0f912acc43
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/54602#019200f3-a70b-460f-929a-9ba70d2305f3
Improves readability.
Consider the case in which we have 3 segments:

S: [S1] [S2] [ S3 ]
K: |K1| |K2| | K1 |
V: |V1| |V2| |null|

The current condition of `num_compactible_records > 1` in `may_have_compactible_records()` would result in these segments being removed from the range used for window compaction, and prevent the tombstone value for `K1` in `S3` from being applied to `K1` in `S1`.

This condition is mostly due to historical reasons, in which we didn't want to have completely empty segments post-compaction. This issue is solved by the placeholder feature. Adjust it to `num_compactible_records > 0` to allow the above case to work as expected. This change should not have any other dramatic effects on the process of compaction.

Also modify tests that use `may_have_compactible_records()` to reflect the updated behavior.
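The scenario above can be modeled with a toy key-value deduplication pass (a sketch, not Redpanda's compaction code): once the window covers all three single-record segments, the later tombstone for `K1` overrides `V1`.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Toy model of the 3-segment example: each record is (key, optional value),
// and a window pass keeps only the newest entry per key, tombstones
// included. With the old `> 1` condition, single-record segments would be
// excluded from the window and K1's tombstone could never reach S1.
using record = std::pair<std::string, std::optional<std::string>>;

std::map<std::string, std::optional<std::string>>
deduplicate(const std::vector<record>& log) {
    std::map<std::string, std::optional<std::string>> latest;
    for (const auto& [key, value] : log) {
        latest[key] = value; // later records win, including null (tombstone)
    }
    return latest;
}
```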
Persist the timestamp at which every record up to and including those in this segment were first compacted via sliding window in the `index_state`. This will indicate whether or not a segment can be considered "clean" or "dirty" still during compaction.
...in the `segment_index`.
We use `seg->mark_as_finished_window_compaction()` to indicate that a segment has been through a full round of window compaction, whether it is completely de-duplicated ("clean") or only partially indexed (still "dirty"). Add `mark_segment_as_finished_window_compaction()` to `segment_utils` as a helper function to help mark a segment as completed window compaction, and whether it is "clean" (in which case we mark the `clean_compact_timestamp` in the `segment_index`).
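A toy model of this helper's behavior, with illustrative types and names (not Redpanda's actual `segment` or `segment_index` classes):

```cpp
#include <cassert>
#include <chrono>
#include <optional>

// Toy model: finishing a window-compaction round always marks the segment,
// but clean_compact_timestamp is only recorded when the segment came out
// fully de-duplicated ("clean").
struct segment_state {
    bool finished_window_compaction = false;
    std::optional<std::chrono::milliseconds> clean_compact_timestamp;
};

void mark_segment_as_finished_window_compaction(
  segment_state& seg, bool is_clean, std::chrono::milliseconds now) {
    seg.finished_window_compaction = true;
    if (is_clean) {
        seg.clean_compact_timestamp = now;
    }
}
```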
For use during the self-compaction and window compaction process in order to tell whether a record should be retained or not (in the case that it is a tombstone record, with a value set for `tombstone_delete_horizon`).
Utility function for getting the optional `timestamp` past which tombstones can be removed. This returns a value iff the segment `s` has been marked as cleanly compacted, and the compaction_config has a value assigned for `tombstone_retention_ms`. In all other cases, `std::nullopt` is returned, indicating that tombstone records will not be removed if encountered.
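A sketch of that contract, using plain `std::chrono` durations in place of Redpanda's timestamp type:

```cpp
#include <cassert>
#include <chrono>
#include <optional>

using timestamp = std::chrono::milliseconds;

// Sketch of the described contract: a horizon exists iff the segment has a
// clean_compact_timestamp AND the config sets tombstone_retention_ms;
// otherwise std::nullopt means "never remove tombstones".
std::optional<timestamp> tombstone_delete_horizon(
  std::optional<timestamp> clean_compact_timestamp,
  std::optional<timestamp> tombstone_retention_ms) {
    if (clean_compact_timestamp && tombstone_retention_ms) {
        return *clean_compact_timestamp + *tombstone_retention_ms;
    }
    return std::nullopt;
}
```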
During the copying process in self compaction, we can check for any tombstone record that has been marked clean by the sliding window compaction process. If it has been marked clean, and the current timestamp is past the tombstone delete horizon defined by `clean_compact_timestamp + tombstone.retention.ms`, it is eligible for deletion. Add logic to the `should_keep()` function used in the `copy_reducer` which removes tombstones during the copy process.
During the deduplication process in sliding window compaction, if a tombstone record has already been seen and is past the tombstone horizon set by the `clean_compact_timestamp + tombstone.retention.ms`, it is eligible for deletion. Add logic to the `copy_reducer` which removes tombstones during the deduplication process.
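Combining the two descriptions above, the retention decision might be sketched as follows (an illustrative `should_keep` signature, not the actual `copy_reducer` code):

```cpp
#include <cassert>
#include <chrono>
#include <optional>

using timestamp = std::chrono::milliseconds;

// Illustrative decision function: a record is dropped only when it is a
// tombstone AND a delete horizon exists AND the current time is past that
// horizon; everything else is always kept.
bool should_keep(
  bool is_tombstone, timestamp now, std::optional<timestamp> delete_horizon) {
    if (!is_tombstone || !delete_horizon) {
        return true;
    }
    return now <= *delete_horizon;
}
```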
Overall looks good, I think the consumer test util should be changed though?
// to compact the tombstone records will be eligible for deletion.
ss::sleep(tombstone_retention_ms).get();

// Generate one record, so that sliding window compaction can occur.
Maybe as a follow-up, it seems like a reasonable improvement to have compaction examine the deletion horizon against any of the segments, and whether there are tombstones in a given segment. Then we wouldn't need to rely on this
Sounds good, will have this enhancement in the aforementioned follow-up PR.
For ease of adding tombstone records to a partition in fixture tests.
bool all_window_compacted = true;
for (const auto& seg : segments) {
    if (!seg->finished_windowed_compaction()) {
        all_window_compacted = false;
        break;
    }
}

becomes

const bool all_window_compacted = std::ranges::all_of(
  segments, &segment::finished_windowed_compaction);

and

auto all_segments_self_compacted = std::ranges::all_of(

becomes

const bool all_segments_self_compacted = std::ranges::all_of(
Changes like this are great, but they don't need to sit indefinitely in a PR. You can pluck them out and get them merged separately much faster, keeping the PR leaner.
// The retention time for tombstones. Tombstone removal occurs only for
// "clean" compacted segments past the tombstone deletion horizon timestamp,
// which is a segment's clean_compact_timestamp + tombstone_retention_ms.
// This means it takes at least two rounds of compaction to remove a
// tombstone: at least one pass to make a segment clean, and another pass
// some time after tombstone.retention.ms to remove tombstones.
//
// Tombstone removal is only supported for topics with remote writes
// disabled. As a result, this field will only have a value for compaction
// run on non-archival topics.
👍
@@ -897,7 +897,7 @@ bool segment::may_have_compactible_records() const {
      // that there were no data records, so err on the side of caution.
      return true;
  }
  return num_compactible_records.value() > 1;
This condition is mostly due to historical reasons, in which we didn't
want to have completely empty segments post compaction. This issue is solved
by the placeholder feature.
What is the "placeholder" feature?
This PR adds the underlying logic for tombstone removal to the local storage compaction subsystem. `tombstone.retention.ms` is added as a field in `storage::compaction_config`, and can be used to remove tombstone records during or past the second time they are "seen" by the compaction subsystem.

A tombstone record is first considered "seen" when the owning segment is fully indexed during sliding window compaction (therefore, the owning segment is fully de-duplicated, and thus "clean", i.e. no keys in that segment exist as potential duplicates in the log up to that point). This is the only time a segment can be considered "cleaned" by compaction.

A tombstone record can be considered "seen" for the second time either in self-compaction or again in sliding window compaction. At this point, it is safe to remove the tombstone record completely from the segment, if `timestamp::now() > clean_compact_timestamp + tombstone.retention.ms`.

This PR does NOT add user-facing configuration options for `tombstone.retention.ms` or any way to enable this feature yet, as this is coming in future PRs (along with more end-to-end testing of tombstone removal). This parameter is intentionally left as `std::nullopt` to ensure the `log_manager` does not execute any tombstone deletion during housekeeping.

Backports Required

Release Notes