Report minimum storage threshold targets #10710
I'm not following why we're not applying the cloud overrides here. Aren't they honored when applying retention, even on a compacted topic? (I could be wrong; would appreciate a pointer.)
Let me explain, and if it makes sense to you, I'll update this comment to clarify the case in the code:

Compacted topics aren't subject to local retention (they always remain whole on local storage), so the only way retention comes into play for a compacted topic is if the policy is `compact,delete`. However, the `override_retention_config` helper doesn't take compaction into account. So if we were to apply the cloud overrides here for a compacted topic, then the retention policy we used in calculating "nice to have" wouldn't reflect what would actually happen when housekeeping ran.
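For what it's worth, here's a minimal self-contained sketch of that rule (the type and function names are illustrative stand-ins, not the PR's actual code, and only `override_retention_config` is mentioned in the thread): skip the cloud overrides for compacted topics so the policy used for the "nice to have" calculation matches what housekeeping will actually enforce.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical, simplified stand-ins for the real config types.
struct retention_config {
    std::optional<std::uint64_t> max_bytes;
    std::optional<std::uint64_t> max_age_ms;
};

struct topic_properties {
    bool compacted{false}; // compact or compact,delete cleanup policy
    std::optional<retention_config> cloud_overrides;
};

// Stand-in for the override_retention_config helper: layers the cloud
// (tiered-storage) retention settings on top of the topic's own settings.
retention_config apply_overrides(retention_config base, const retention_config& o) {
    if (o.max_bytes) { base.max_bytes = o.max_bytes; }
    if (o.max_age_ms) { base.max_age_ms = o.max_age_ms; }
    return base;
}

retention_config effective_retention(const topic_properties& t, retention_config base) {
    // Compacted topics always remain whole on local storage, and the override
    // helper doesn't account for compaction, so use the un-overridden policy
    // for them when estimating "nice to have" space.
    if (t.compacted || !t.cloud_overrides) {
        return base;
    }
    return apply_overrides(base, *t.cloud_overrides);
}
```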
I think I understand: it wasn't clicking that retention settings applied to a `compact,delete` topic follow the overarching retention settings rather than the local settings for tiered storage. If that's the case, I think this policy makes sense.
I think this is saying "I expect to see the same throughput that ingested the latest segments for the entire remainder of the retention period". This seems reasonable, but it makes me wonder about bursty workloads, where a large number of records are written in the span of a few dozen seconds and then stop. In such a case, we could end up with a small `duration` and a large `missing`, and end up with a very large value for `missing_bytes`.

Maybe it makes sense to compute the expected throughput with `now() - start_timestamp` instead of `end_timestamp - start_timestamp`, though given these are user-input timestamps, it's probably best to avoid using `now()`.

I suppose regardless it might be better to overestimate the wanted bytes anyway, for the goal of avoiding out-of-space issues.
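For reference, a self-contained sketch of the estimate being discussed, with hypothetical names (`estimate_missing_bytes`, `local_bytes`, etc. are stand-ins, not the PR's actual code); it shows how a short, bursty `duration` inflates the observed rate and therefore the extrapolated `missing_bytes`:

```cpp
#include <chrono>
#include <cstdint>

// Rough "bytes we don't have yet" figure for one partition, extrapolated
// from the throughput observed over the data currently on disk.
std::uint64_t estimate_missing_bytes(
  std::uint64_t local_bytes,                  // bytes currently on local disk
  std::chrono::milliseconds start_timestamp,  // timestamp of the oldest local record
  std::chrono::milliseconds end_timestamp,    // timestamp of the newest local record
  std::chrono::milliseconds retention) {      // configured retention period
    const auto duration = end_timestamp - start_timestamp;
    if (duration.count() <= 0 || retention <= duration) {
        return 0; // nothing to extrapolate, or retention is already covered
    }
    const auto missing = retention - duration;
    // Observed ingest rate over the data we have; a burst written in a few
    // seconds yields a tiny `duration`, a huge rate, and a huge estimate.
    const double bytes_per_ms
      = static_cast<double>(local_bytes) / static_cast<double>(duration.count());
    return static_cast<std::uint64_t>(bytes_per_ms * static_cast<double>(missing.count()));
}
```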
I attempted to capture this issue with a heuristic, like not making any estimate unless we have, say, 10 seconds' worth of data. But I'm afraid this is just going to be a game of whack-a-mole. Perhaps we would also do well to add a cap here, such as double or triple the amount of data currently on disk. This may work well since the goal isn't necessarily to observe the system once and know how much space we need; rather, this will be monitored periodically. So capping would let the estimate continue to portray the growth requirement, while also eliminating huge bursts.
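A rough sketch of those two guard rails together (the 10-second window and the 3x cap are illustrative values, and the names are not the PR's actual code):

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>

constexpr std::chrono::seconds min_observation_window{10}; // don't extrapolate from less
constexpr std::uint64_t cap_multiplier = 3;                // e.g. double or triple on-disk size

std::uint64_t capped_estimate(
  std::uint64_t local_bytes,                 // bytes currently on local disk
  std::chrono::milliseconds observed_window, // span of data the raw estimate saw
  std::uint64_t raw_estimate) {              // output of the throughput extrapolation
    if (observed_window < min_observation_window) {
        return 0; // too little data to say anything useful yet
    }
    // Because this runs periodically, the cap can keep tracking genuine
    // growth over time, while a one-off burst can never blow the estimate
    // up unboundedly in a single observation.
    return std::min(raw_estimate, cap_multiplier * local_bytes);
}
```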
Yeah, picking something that's good in all cases is hard. I'd lean harder away from this if it significantly impacted workloads (e.g. if we were to base rejection of produce messages on it). But given that the ultimate goal here is just to apply retention more aggressively, this is probably fine.

Might be worth calling out in a header comment that this is an estimate and is subject to being way off, so don't use it too aggressively (like rejecting writes based on it).
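As a sketch, such a header comment might read something like this (the wording and the `wanted_bytes_estimate` name are illustrative only, not the PR's actual declaration):

```cpp
#include <cstdint>

/// Best-effort estimate of the additional bytes this partition may need in
/// order to hold its full retention period. Derived from recent ingest
/// throughput, so bursty workloads can make it wildly pessimistic. Intended
/// only as a hint for applying retention more aggressively; do NOT use it
/// for hard decisions such as rejecting writes.
std::uint64_t wanted_bytes_estimate();
```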