feat(pageserver): add automatic trigger for gc-compaction #10798
Conversation
Force-pushed from 9d09ed7 to cdabd73. Signed-off-by: Alex Chi Z <chi@neon.tech>
Force-pushed from cdabd73 to f1c6a61. Signed-off-by: Alex Chi Z <chi@neon.tech>
7480 tests run: 7099 passed, 0 failed, 381 skipped (full report). Flaky tests (2): Postgres 17, Postgres 14. Code coverage collected from Rust tests only (full report). Last updated: f1c6a61 at 2025-02-12T23:03:11.233Z.
```
@@ -381,6 +381,8 @@ def test_pageserver_gc_compaction_interrupt(neon_env_builder: NeonEnvBuilder):
        "pitr_interval": "0s",
        "gc_horizon": "1024",
        "lsn_lease_length": "0s",
        # Do not generate image layers with create_image_layers
        "image_layer_creation_check_threshold": "100",
```
Should move this to L495, will fix tomorrow.
Overall idea of the trigger is quite concise, I like that.
Did you check how the trigger code interacts with branches? Will it only run on the root branch?
I can't judge anything pertaining to the management of the gc_block stuff and the general invariants of the queue structures.
A bunch of my comments are probably due to the fact that I missed reviews for most compaction changes in the last month, so apply your own judgement on whether they warrant changes, a follow-up, or can be discarded.
Since this is still not enabled in prod, I'm fine with merging early.
And I'm off next week, so, to unblock you, I'm approving right away.
But please work through the review comments and queue follow-ups as needed. I likely won't remember the context after vacation.
```
@@ -2493,6 +2496,31 @@ impl Timeline {
        )
    }

    fn get_gc_compaction_settings(&self) -> (bool, u64, u64) {
```
Would prefer this to be a named struct.
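A minimal sketch of what the named struct could look like; the struct, field names, and helper are illustrative, not the actual pageserver API:

```rust
/// Illustrative only: a named struct instead of the (bool, u64, u64) tuple,
/// so call sites read `settings.gc_compaction_ratio_percent` instead of `settings.2`.
pub struct GcCompactionSettings {
    pub gc_compaction_enabled: bool,
    pub gc_compaction_initial_threshold_kb: u64,
    pub gc_compaction_ratio_percent: u64,
}

/// Stand-in for `Timeline::get_gc_compaction_settings`; the real method would
/// read these values from the tenant/timeline config.
pub fn get_gc_compaction_settings_example(
    enabled: bool,
    initial_threshold_kb: u64,
    ratio_percent: u64,
) -> GcCompactionSettings {
    GcCompactionSettings {
        gc_compaction_enabled: enabled,
        gc_compaction_initial_threshold_kb: initial_threshold_kb,
        gc_compaction_ratio_percent: ratio_percent,
    }
}
```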
```
guard
    .entry(timeline.timeline_id)
```
Side question: do we clean entries up properly when a timeline gets deleted?
```
    }
}) else {
    self.trigger_auto_compaction(timeline).await;
    // Always yield after triggering auto-compaction
```
... why? Would be helpful to know for future readers.
```
@@ -267,27 +415,35 @@ impl GcCompactionQueue {
    ) -> Result<CompactionOutcome, CompactionError> {
        let _one_op_at_a_time_guard = self.consumer_lock.lock().await;
```
This lock should never be held at this point, right? Because there's only one compaction task per tenant?
(I guess the caveat is compact mgmt API calls in testing?)
I generally would prefer returning an error if it's taken, like CompactionError::AlreadyRunning or whatever. That avoids deadlock risk, which is hard to debug with tokio mutexes.
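A standalone sketch of the fail-fast alternative, assuming the consumer lock is a tokio::sync::Mutex; the AlreadyRunning variant and function name are hypothetical:

```rust
use tokio::sync::Mutex;

/// Illustrative error type; the real code would add a variant such as
/// `CompactionError::AlreadyRunning` (the name is an assumption).
#[derive(Debug)]
pub enum CompactionError {
    AlreadyRunning,
}

/// Fail fast instead of awaiting the consumer lock: `try_lock` returns Err
/// immediately if another task holds the mutex, so a concurrent caller gets
/// an error instead of queuing up behind a tokio mutex (and risking
/// hard-to-debug deadlocks).
pub async fn run_one_compaction_op(consumer_lock: &Mutex<()>) -> Result<(), CompactionError> {
    let _one_op_at_a_time_guard = consumer_lock
        .try_lock()
        .map_err(|_| CompactionError::AlreadyRunning)?;
    // ... perform the single compaction iteration while holding the guard ...
    Ok(())
}
```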
```
let Ok(permit) = CONCURRENT_GC_COMPACTION_TASKS.clone().try_acquire_owned() else {
    // Only allow one compaction run at a time
    return;
};
```
Bailing here prevents fairness among timelines & tenants.
```
let layers = {
    let guard = timeline.layers.read().await;
    let layer_map = guard.layer_map().unwrap();
    layer_map.iter_historic_layers().collect_vec()
};
let mut l2_size: u64 = 0;
let mut l1_size = 0;
let gc_cutoff = *timeline.get_latest_gc_cutoff_lsn();
for layer in layers {
    if layer.lsn_range.start <= l2_lsn {
        l2_size += layer.file_size();
    } else if layer.lsn_range.start <= gc_cutoff {
        l1_size += layer.file_size();
    }
}
```
We will do this for every timeline every 20s (the default compaction_period), even for ones that haven't changed.
Granted, the global compaction semaphore limits concurrency, so it's not like this will overwhelm us CPU-wise. But it's quite wasteful.
Should queue a follow-up to compute the trigger incrementally; it seems quite simple to do if we keep track of l1_size and l2_size incrementally inside the layer map, as in the sketch below.
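A rough sketch of what that incremental bookkeeping could look like. This is not the existing LayerMap API; Lsn is replaced by u64 to keep the example self-contained:

```rust
/// Hypothetical running totals kept alongside the layer map so the trigger
/// doesn't need to rescan every layer each compaction_period.
pub struct GcCompactionSizeTracker {
    pub l2_lsn: u64,    // stand-in for Lsn
    pub gc_cutoff: u64, // stand-in for Lsn
    pub l1_size: u64,
    pub l2_size: u64,
}

impl GcCompactionSizeTracker {
    /// Call whenever a layer is inserted into the layer map.
    pub fn on_layer_insert(&mut self, lsn_range_start: u64, file_size: u64) {
        if lsn_range_start <= self.l2_lsn {
            self.l2_size += file_size;
        } else if lsn_range_start <= self.gc_cutoff {
            self.l1_size += file_size;
        }
    }

    /// Call whenever a layer is removed from the layer map.
    pub fn on_layer_remove(&mut self, lsn_range_start: u64, file_size: u64) {
        if lsn_range_start <= self.l2_lsn {
            self.l2_size = self.l2_size.saturating_sub(file_size);
        } else if lsn_range_start <= self.gc_cutoff {
            self.l1_size = self.l1_size.saturating_sub(file_size);
        }
    }

    // Caveat: when l2_lsn or gc_cutoff advances, layers can change buckets,
    // so those updates would need an additional reclassification step.
}
```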
```
self.schedule_auto_compaction(
    CompactOptions {
        flags: {
            let mut flags = EnumSet::new();
            flags |= CompactFlags::EnhancedGcBottomMostCompaction;
            flags
        },
        sub_compaction: true,
        compact_key_range: None,
        compact_lsn_range: None,
        sub_compaction_max_job_size_mb: None,
    },
    permit,
);
```
To compute the trigger condition, we just walked the layer map, did lsn range checks, etc.
Basically it's half the filtering work that the schedule code does.
Now the schedule code will do most of that work again.
And then I think each job that is queued will do the work again?
Also, if we don't specify the compact_lsn_range here, then gc_cutoff_horizon can move between scheduling the MetaJob here and executing the MetaJob later.
If the pageserver is too busy and can't keep up with compactions, the turnaround between scheduling & executing the MetaJob may be many minutes.
The layer map can drift substantially in the meantime, and thus the announcements about l1_size / l2_size etc. in the log messages below will be out of date.
Another source of drift is if the timeline starts ingesting a lot in between scheduling the MetaJob and executing the MetaJob.
L0 compaction will take priority, delaying execution of the MetaJob indefinitely.
That again results in very stale info! log lines below by the time we finally start executing the MetaJob.
So, maybe specify compact_lsn_range: Some(l2_lsn..gc_cutoff) right here, so that the parameters are fixed?
I don't know if l2_lsn..gc_cutoff is correct, or whether it needs to be ..=, or 0..(=?)gc_cutoff, or ..gc_cutoff. Hope you get the idea of the suggestion though.
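A self-contained sketch of that idea; the types here are illustrative stand-ins, not the actual CompactOptions, and the bound semantics are left open as noted above:

```rust
use std::ops::Range;

/// Illustrative only: capture the LSN range at scheduling time so that later
/// movement of the gc cutoff cannot change what the queued MetaJob compacts.
pub struct ScheduledGcCompaction {
    pub compact_lsn_range: Option<Range<u64>>, // u64 stands in for Lsn
}

pub fn schedule_with_pinned_range(l2_lsn: u64, gc_cutoff: u64) -> ScheduledGcCompaction {
    ScheduledGcCompaction {
        // Whether the upper bound should be inclusive (..=) is undecided in the review.
        compact_lsn_range: Some(l2_lsn..gc_cutoff),
    }
}
```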
```
// TODO: error handling, clear the queue if any task fails?
let _ = timeline
    .compact_with_options(cancel, options, ctx)
    .instrument(info_span!("scheduled_compact_timeline", %timeline.timeline_id))
    .await?;
```
We pulled this failing task out of the queue, so, if we don't enqueue it again, what happens? What's the behavior of the next
- trigger and
- schedule_auto_compaction invocation
in the case where some sub-jobs have finished and others have failed?
Will the job accept a layer map in that shape?
Will the trigger trigger again?
Will we re-do the work of the successfully completed sub-jobs?
If we decide to clear the queue, what are the implications for Notify? (Still need to understand that one.)
```
if l2_lsn >= timeline.get_l2_lsn() {
    info!("l2_lsn updated to {}", l2_lsn);
```
When can `<` happen? Feels like we should at least make noise?
```
timeline
    .update_l2_lsn(l2_lsn)
    .map_err(CompactionError::Other)?;
```
The most common case where this will error is during shutdown, right?
IIUC the work itself isn't lost (we updated the layer map with the new bottommost image layer).
But the next trigger evaluation after restart will
- still see the old l2_lsn and
- see a slightly advanced gc_horizon_cutoff
Will the trigger fire again? That'd be bad.
My understanding is that it will fire again immediately because we're considering all layers, image & delta alike, for the trigger.
Is that really the right thing to do? Should we not be considering deltas only?
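A hedged sketch of what a deltas-only trigger computation could look like; the tuple layout is purely illustrative, and whether the layer descriptor exposes an is_delta-style predicate is an assumption here:

```rust
/// Illustrative only: accumulate L1/L2 sizes from delta layers only, so that
/// the image layers produced by a previous gc-compaction don't immediately
/// re-trigger the next run. Each layer is represented here as
/// (lsn_range_start, file_size, is_delta) to keep the sketch self-contained.
pub fn accumulate_delta_sizes(
    layers: &[(u64, u64, bool)],
    l2_lsn: u64,
    gc_cutoff: u64,
) -> (u64, u64) {
    let (mut l1_size, mut l2_size) = (0u64, 0u64);
    for &(lsn_range_start, file_size, is_delta) in layers {
        if !is_delta {
            continue; // skip image layers
        }
        if lsn_range_start <= l2_lsn {
            l2_size += file_size;
        } else if lsn_range_start <= gc_cutoff {
            l1_size += file_size;
        }
    }
    (l1_size, l2_size)
}
```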
Hacky solution if we should consider both images and deltas: make the l2_lsn stateful:

```rust
enum BottomMostCompactionStatus {
    NotRunning { l2_lsn: Lsn },
    InProgress {
        old_l2_lsn: Lsn,
        new_l2_lsn: Lsn,
    },
}
```

Trigger then checks:

```rust
match status {
    BottomMostCompactionStatus::NotRunning { .. } => {
        // do what we do today, cap lsn range at gc_cutoff
    }
    BottomMostCompactionStatus::InProgress { old_l2_lsn, new_l2_lsn } => {
        // cap lsn range at min(gc_cutoff, new_l2_lsn); probably min is redundant
        // because the code that picked new_l2_lsn already used gc_cutoff, and
        // gc_cutoff can only ever advance (?)
    }
}
```
IIUC this is the first piece of code that will set IndexPart::l2_lsn to a Some() value.
I think it's a suboptimal name and we'll have to live with it for a long time, so now is the chance to fix it.
Why is it a bad name?
There's a reasonable chance we'll have multi-level compaction in the future. Bottom-most compaction will then not be L2, but, well, L_bottom.
Also, another aspect that I think we talked about 1-2 months ago: the state that we store in the index part here is not some generic state of the compaction algorithm you developed. It is specific to the usage of the algorithm for bottommost compaction. (Whereas the algorithm in general can do much more than just bottommost compaction, i.e., arbitrary rectangles, and in branches.)
So, all in all, gc_compaction feels like a way more appropriate field name to me.
Also, I think we should plan ahead for evolution of that state field.
So, here's my proposal, along with some free doc comment that took me way too long to write:
```rust
pub struct IndexPart {
    // ...

    /// State for the garbage-collecting compaction pass.
    ///
    /// Garbage-collecting compaction (gc-compaction) prunes `Value`s that are outside
    /// the PITR window and not needed by child timelines.
    ///
    /// A commonly used synonym for this compaction pass is
    /// "bottommost-compaction" because the affected LSN range
    /// is the "bottom" of the (key,lsn) map.
    ///
    /// Gc-compaction is a quite expensive operation; that's why we use
    /// a trigger condition.
    /// This field here holds the state pertaining to that trigger condition
    /// and (in the future) to the progress of the gc-compaction, so that it's
    /// resumable across restarts & migrations.
    ///
    /// Note that the underlying algorithm is _also_ called `gc-compaction`
    /// in most places & design docs; but in fact it is more flexible than
    /// just the specific use case here; it needs a new name.
    gc_compaction: GcCompactionPersistence,
}

pub struct GcCompactionPersistence {
    /// The upper bound of the last completed garbage-collecting compaction.
    last_completed_lsn: Lsn,
}
```
Problem
Part of #9114.
Summary of changes
Add the automatic trigger for gc-compaction. It computes two values: the L1 size and the L2 size. When the L1 size >= the initial trigger threshold, we trigger an initial gc-compaction. When l1_size / l2_size >= gc_compaction_ratio_percent, we trigger the "tiered" gc-compaction.
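A minimal sketch of that trigger decision, assuming the thresholds are plain byte/percent values; the helper name and the way the two conditions are combined (initial trigger only when there is no L2 yet) are assumptions, not the PR's exact implementation:

```rust
/// Returns true when gc-compaction should be scheduled.
/// - initial trigger: no L2 yet and l1_size >= initial_threshold
/// - "tiered" trigger: l1_size / l2_size >= gc_compaction_ratio_percent
pub fn should_trigger_gc_compaction(
    l1_size: u64,
    l2_size: u64,
    initial_threshold: u64,
    gc_compaction_ratio_percent: u64,
) -> bool {
    if l2_size == 0 {
        l1_size >= initial_threshold
    } else {
        // Compare the ratio in percent without floating point:
        // l1_size / l2_size >= ratio%  <=>  l1_size * 100 >= l2_size * ratio%.
        (l1_size as u128) * 100 >= (l2_size as u128) * (gc_compaction_ratio_percent as u128)
    }
}
```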