
feat(pageserver): add automatic trigger for gc-compaction #10798

Open · wants to merge 3 commits into main from skyzh/auto-trigger

Conversation

@skyzh (Member) commented Feb 12, 2025

Problem

part of #9114

Summary of changes

Add the auto trigger for gc-compaction. It computes two values: L1 size and L2 size. When L1 size >= initial trigger threshold, we will trigger an initial gc-compaction. When l1_size / l2_size >= gc_compaction_ratio_percent, we will trigger the "tiered" gc-compaction.
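
A minimal sketch of that decision logic; `initial_threshold` is a hypothetical name for the initial trigger threshold, and the guard distinguishing the initial from the tiered case is an illustrative assumption:

// Sketch of the auto-trigger decision described above.
fn should_trigger_gc_compaction(
    l1_size: u64,
    l2_size: u64,
    initial_threshold: u64,
    gc_compaction_ratio_percent: u64,
) -> bool {
    if l2_size == 0 {
        // Nothing has been compacted below the l2_lsn yet: trigger the
        // initial gc-compaction once L1 crosses the initial threshold.
        l1_size >= initial_threshold
    } else {
        // Afterwards, trigger the "tiered" gc-compaction once
        // l1_size / l2_size >= gc_compaction_ratio_percent / 100.
        l1_size * 100 >= l2_size * gc_compaction_ratio_percent
    }
}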

@skyzh requested review from problame and arpad-m on February 12, 2025 21:34
@skyzh requested a review from a team as a code owner on February 12, 2025 21:34
Signed-off-by: Alex Chi Z <chi@neon.tech>
Signed-off-by: Alex Chi Z <chi@neon.tech>
@skyzh force-pushed the skyzh/auto-trigger branch 2 times, most recently from 9d09ed7 to cdabd73 on February 12, 2025 22:04
Signed-off-by: Alex Chi Z <chi@neon.tech>
@skyzh force-pushed the skyzh/auto-trigger branch from cdabd73 to f1c6a61 on February 12, 2025 22:05

7480 tests run: 7099 passed, 0 failed, 381 skipped (full report)

Flaky tests (2)

  • Postgres 17
  • Postgres 14

Code coverage* (full report)

  • functions: 33.2% (8580 of 25880 functions)
  • lines: 49.0% (72299 of 147556 lines)

* collected from Rust tests only

The comment gets automatically updated with the latest test results: f1c6a61 at 2025-02-12T23:03:11.233Z ♻️

@@ -381,6 +381,8 @@ def test_pageserver_gc_compaction_interrupt(neon_env_builder: NeonEnvBuilder):
"pitr_interval": "0s",
"gc_horizon": "1024",
"lsn_lease_length": "0s",
# Do not generate image layers with create_image_layers
"image_layer_creation_check_threshold": "100",
skyzh (Member Author):
Should move this to L495; will fix tomorrow.

@problame (Contributor) left a comment:

The overall idea of the trigger is quite concise; I like that.


Did you check how the trigger code interacts with branches? Will it only run on the root branch?


I can't judge anything pertaining to the management of the gc_block stuff and the general invariants of the queue structures.

A bunch of my comments are probably due to the fact that I missed reviews for most compaction changes in the last month, so apply your own judgement as to whether they warrant changes, a follow-up, or can be discarded.

Since this is still not enabled in prod, I'm fine with merging early.
And I'm off next week, so, to unblock you, I'm approving right away.

But please work through the review comments and queue follow-ups as needed. I likely won't remember the context after vacation.

@@ -2493,6 +2496,31 @@ impl Timeline {
)
}

fn get_gc_compaction_settings(&self) -> (bool, u64, u64) {
problame (Contributor):

Would prefer this to be a named struct.
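
A sketch of the named-struct suggestion; the field names are inferred from the trigger settings discussed in this PR, not taken from the actual code:

// Hypothetical replacement for the (bool, u64, u64) tuple.
struct GcCompactionSettings {
    gc_compaction_enabled: bool,
    gc_compaction_initial_threshold_kb: u64,
    gc_compaction_ratio_percent: u64,
}

fn get_gc_compaction_settings(&self) -> GcCompactionSettings { /* ... */ }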

Comment on lines +3032 to +3033
guard
.entry(timeline.timeline_id)
problame (Contributor):

Side question: do we clean entries up properly when a timeline gets deleted?

}
}) else {
self.trigger_auto_compaction(timeline).await;
// Always yield after triggering auto-compaction
problame (Contributor):

... why? Would be helpful to know for future readers.

@@ -267,27 +415,35 @@ impl GcCompactionQueue {
) -> Result<CompactionOutcome, CompactionError> {
let _one_op_at_a_time_guard = self.consumer_lock.lock().await;
problame (Contributor):

This lock should never be held at this point, right? Because there's only one compaction task per tenant?
(I guess the caveat is compact mgmt API calls in testing?)

I generally would prefer returning an error if it's taken, like CompactionError::AlreadyRunning or whatever.
That avoids deadlock risk, which is hard to debug with tokio mutexes.
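
A minimal sketch of the fail-fast alternative; AlreadyRunning is the placeholder variant name from the comment above, not an existing error variant:

// Fail fast instead of waiting on the tokio mutex.
let Ok(_one_op_at_a_time_guard) = self.consumer_lock.try_lock() else {
    // Placeholder variant; avoids the deadlock risk of awaiting the lock.
    return Err(CompactionError::AlreadyRunning(
        "gc-compaction consumer already running",
    ));
};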

Comment on lines +228 to +231
let Ok(permit) = CONCURRENT_GC_COMPACTION_TASKS.clone().try_acquire_owned() else {
// Only allow one compaction run at a time
return;
};
problame (Contributor):

Bailing out here prevents fairness among timelines & tenants.

Comment on lines +235 to +249
let layers = {
let guard = timeline.layers.read().await;
let layer_map = guard.layer_map().unwrap();
layer_map.iter_historic_layers().collect_vec()
};
let mut l2_size: u64 = 0;
let mut l1_size = 0;
let gc_cutoff = *timeline.get_latest_gc_cutoff_lsn();
for layer in layers {
if layer.lsn_range.start <= l2_lsn {
l2_size += layer.file_size();
} else if layer.lsn_range.start <= gc_cutoff {
l1_size += layer.file_size();
}
}
problame (Contributor):

We will do this for every timeline every 20s (default compaction_period).
Even ones that haven't changed.

Granted, with the global compaction semaphore concurrency limit.
So it's not like this will overwhelm us CPU-wise.
But it's quite wasteful.

Should queue a follow-up to compute the trigger incrementally; that seems quite simple to do if we keep track of l1_size and l2_size incrementally inside the layer map, as sketched below.
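
A rough sketch of that incremental bookkeeping; the accumulator type and its hookup into the layer map are hypothetical:

// Hypothetical accumulator maintained alongside the layer map, updated on
// layer insert/remove instead of re-walking all layers every 20s.
struct GcCompactionSizeAccumulator {
    l1_size: u64, // layers with l2_lsn < lsn_range.start <= gc_cutoff
    l2_size: u64, // layers with lsn_range.start <= l2_lsn
}

impl GcCompactionSizeAccumulator {
    fn on_layer_added(&mut self, layer: &PersistentLayerDesc, l2_lsn: Lsn, gc_cutoff: Lsn) {
        if layer.lsn_range.start <= l2_lsn {
            self.l2_size += layer.file_size();
        } else if layer.lsn_range.start <= gc_cutoff {
            self.l1_size += layer.file_size();
        }
    }

    fn on_layer_removed(&mut self, layer: &PersistentLayerDesc, l2_lsn: Lsn, gc_cutoff: Lsn) {
        // Symmetric subtraction; movement of l2_lsn/gc_cutoff still requires
        // reclassifying layers, which this sketch glosses over.
        if layer.lsn_range.start <= l2_lsn {
            self.l2_size -= layer.file_size();
        } else if layer.lsn_range.start <= gc_cutoff {
            self.l1_size -= layer.file_size();
        }
    }
}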

Comment on lines +293 to +306
self.schedule_auto_compaction(
CompactOptions {
flags: {
let mut flags = EnumSet::new();
flags |= CompactFlags::EnhancedGcBottomMostCompaction;
flags
},
sub_compaction: true,
compact_key_range: None,
compact_lsn_range: None,
sub_compaction_max_job_size_mb: None,
},
permit,
);
problame (Contributor):

To compute the trigger condition, we just walked the layer map, did lsn range checks, etc.
Basically it's half the filtering work that the schedule code does.

Now the schedule code will do most of that work again.

And then I think each job that is queued will do the work again?

problame (Contributor):

Also, if we don't specify the compact_lsn_range here, then gc_cutoff_horizon can move between scheduling the MetaJob here and executing the MetaJob later.

If the pageserver is too busy and can't keep up with compactions, the turnaround between scheduling & executing the MetaJob may be many minutes.
The layer map can drift substantially in the meantime, and thus the announcements about l1_size / l2_size etc. in the log messages below will be out of date.

problame (Contributor):

Another source of drift is if the timeline starts ingesting a lot in between scheduling the MetaJob and executing the MetaJob.
L0 compaction will take priority, delaying execution of MetaJob indefinitely.
That again results in very stale info! log lines below by the time we finally start executing MetaJob.

problame (Contributor):

So, maybe specify compact_lsn_range: Some((l2_lsn..gc_cutoff)) right here, so that parameters are fixed?

I don't know if l2_lsn..gc_cutoff is correct or whether it needs to be ..= or 0..(=?)gc_cutoff or ..gc_cutoff. Hope you get the idea of the suggestion though; see the sketch below.
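
Concretely, that would turn the scheduling call above into something like this; whether the bounds must be inclusive, and whether the range needs a conversion into the compact_lsn_range type, are exactly the open questions:

self.schedule_auto_compaction(
    CompactOptions {
        flags: {
            let mut flags = EnumSet::new();
            flags |= CompactFlags::EnhancedGcBottomMostCompaction;
            flags
        },
        sub_compaction: true,
        compact_key_range: None,
        // Pin the LSN range at trigger time so that gc_cutoff movement
        // between scheduling and execution can't change the job.
        compact_lsn_range: Some(l2_lsn..gc_cutoff),
        sub_compaction_max_job_size_mb: None,
    },
    permit,
);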

Comment on lines +468 to 472
// TODO: error handling, clear the queue if any task fails?
let _ = timeline
.compact_with_options(cancel, options, ctx)
.instrument(info_span!("scheduled_compact_timeline", %timeline.timeline_id))
.await?;
problame (Contributor):

We pulled this failing task out of the queue, so if we don't enqueue it again, what happens? What's the behavior of the next

  1. trigger and
  2. schedule_auto_compaction invocation

in the case where some sub-jobs have finished and others have failed?

Will the job accept a layer map in that shape?

Will the trigger trigger again?

Will we re-do the work of the successfully completed sub-jobs?

problame (Contributor):

If we decide to clear the queue, what are the implications for Notify? (Still need to understand that one)

Comment on lines +477 to +478
if l2_lsn >= timeline.get_l2_lsn() {
info!("l2_lsn updated to {}", l2_lsn);
problame (Contributor):

When can < happen? Feels like we should at least make noise?
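
A sketch of what "making noise" could look like here; the warn! branch is an illustrative addition, not code from the PR:

let current_l2_lsn = timeline.get_l2_lsn();
if l2_lsn >= current_l2_lsn {
    info!("l2_lsn updated to {}", l2_lsn);
    // ... persist the new l2_lsn as in the next hunk ...
} else {
    // Illustrative: surface the unexpected regression instead of
    // silently skipping the update.
    warn!("not updating l2_lsn: new {} < current {}", l2_lsn, current_l2_lsn);
}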

Comment on lines +479 to +481
timeline
.update_l2_lsn(l2_lsn)
.map_err(CompactionError::Other)?;
problame (Contributor):

The most common case where this will error is during shutdown, right?

IIUC the work itself isn't lost (we updated the layer map with the new bottommost image layer).
But the next trigger evaluation after restart will

  1. still see the old l2_lsn and
  2. see a slightly advanced gc_horizon_cutoff

Will the trigger fire again? That'd be bad.

problame (Contributor):

My understanding is that it will fire again immediately because we're considering all layers, image & delta alike, for the trigger.

Is that really the right thing to do? Should we not be considering deltas only?
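
For illustration, a deltas-only variant of the size computation in the trigger, assuming the layer descriptor exposes an is_delta() check:

for layer in layers {
    // Illustrative change: only count delta layers toward the trigger, so
    // the image layers produced by gc-compaction don't immediately re-fire it.
    if !layer.is_delta() {
        continue;
    }
    if layer.lsn_range.start <= l2_lsn {
        l2_size += layer.file_size();
    } else if layer.lsn_range.start <= gc_cutoff {
        l1_size += layer.file_size();
    }
}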

problame (Contributor):

Hacky solution if we should consider both images and deltas: make the l2_lsn stateful:

enum BottomMostCompactionStatus {
    NotRunning { l2_lsn: Lsn },
    InProgress {
        old_l2_lsn: Lsn,
        new_l2_lsn: Lsn,
    },
}

Trigger then checks

match status {
    BottomMostCompactionStatus::NotRunning { l2_lsn } => {
        // do what we do today, cap lsn range at gc_cutoff
    }
    BottomMostCompactionStatus::InProgress { old_l2_lsn, new_l2_lsn } => {
        // cap lsn range at min(gc_cutoff, new_l2_lsn); probably min is
        // redundant because the code that picked new_l2_lsn already used
        // gc_cutoff, and gc_cutoff can only ever advance (?)
    }
}

@problame (Contributor) left a comment:

IIUC this is the first piece of code that will set the IndexPart::l2_lsn to a Some() value.

I think it's a suboptimal name and we'll have to live with it for a long time, so now is the chance to fix it.

Why is it a bad name?
There's a reasonable chance we'll have multi-level compaction in the future.
Bottom-most compaction will then not be L2, but, well, L_bottom.

Another aspect, which I think we talked about 1-2 months ago:
the state that we store in the index part here
is not some generic state of the compaction algorithm you developed.
It is specific to the usage of the algorithm for bottommost compaction.
(Whereas the algorithm in general can do much more than just bottommost compaction, e.g., arbitrary rectangles, and on branches.)

So, all in all, gc_compaction feels like a way more appropriate field name to me.

Also, I think we should plan ahead for evolution of that state field.

So, here's my proposal, along with some free doc comment that took me way too long to write:

pub struct IndexPart {
    // ...

    /// State for the garbage-collecting compaction pass.
    ///
    /// Garbage-collecting compaction (gc-compaction) prunes `Value`s that are outside
    /// the PITR window and not needed by child timelines.
    ///
    /// A commonly used synonym for this compaction pass is
    /// "bottommost-compaction" because the affected LSN range
    /// is the "bottom" of the (key,lsn) map.
    ///
    /// Gc-compaction is a quite expensive operation; that's why we use
    /// a trigger condition.
    /// This field here holds the state pertaining to that trigger condition
    /// and (in future) to the progress of the gc-compaction, so that it's
    /// resumable across restarts & migrations.
    ///
    /// Note that the underlying algorithm is _also_ called `gc-compaction`
    /// in most places & design docs; but in fact it is more flexible than
    /// just the specific use case here; it needs a new name.
    gc_compaction: GcCompactionPersistence,
}

pub struct GcCompactionPersistence {
    /// The upper bound of the last completed garbage-collecting compaction.
    last_completed_lsn: Lsn,
}
