feat(pageserver): add automatic trigger for gc-compaction #10798
Conversation
Force-pushed from 9d09ed7 to cdabd73. Signed-off-by: Alex Chi Z <chi@neon.tech>
Force-pushed from cdabd73 to f1c6a61. Signed-off-by: Alex Chi Z <chi@neon.tech>
7480 tests run: 7099 passed, 0 failed, 381 skipped (full report). Flaky tests (2): Postgres 17, Postgres 14. Code coverage collected from Rust tests only (full report). Last updated: f1c6a61 at 2025-02-12T23:03:11.233Z.
```
@@ -381,6 +381,8 @@ def test_pageserver_gc_compaction_interrupt(neon_env_builder: NeonEnvBuilder):
        "pitr_interval": "0s",
        "gc_horizon": "1024",
        "lsn_lease_length": "0s",
        # Do not generate image layers with create_image_layers
        "image_layer_creation_check_threshold": "100",
```
Should move this to L495, will fix tomorrow.
Overall idea of the trigger is quite concise, I like that.
Did you check how the trigger code interacts with branches? Will it only run on the root branch?
I can't judge anything pertaining to the management of the gc_block stuff and the general invariants of the queue structures.
A bunch of my comments are probably due to the fact that I missed reviews for most compaction changes in the last month, so apply your own judgement on whether they warrant changes, a follow-up, or can be discarded.
Since this is still not enabled in prod, I'm fine with merging early.
And I'm off next week, so, to unblock you, I'm approving right away.
But please work through the review comments and queue follow-ups as needed. I likely won't remember the context after vacation.
```
@@ -2493,6 +2496,31 @@ impl Timeline {
        )
    }

    fn get_gc_compaction_settings(&self) -> (bool, u64, u64) {
```
Would prefer this to be a named struct.
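A minimal sketch of what the named struct could look like; the struct, field names, and helper are illustrative, not the actual pageserver API:

```rust
/// Illustrative only: a named struct instead of the (bool, u64, u64) tuple,
/// so call sites read `settings.gc_compaction_ratio_percent` instead of `settings.2`.
pub struct GcCompactionSettings {
    pub gc_compaction_enabled: bool,
    pub gc_compaction_initial_threshold_kb: u64,
    pub gc_compaction_ratio_percent: u64,
}

/// Stand-in for `Timeline::get_gc_compaction_settings`; the real method would
/// read these values from the tenant/timeline config.
pub fn get_gc_compaction_settings_example(
    enabled: bool,
    initial_threshold_kb: u64,
    ratio_percent: u64,
) -> GcCompactionSettings {
    GcCompactionSettings {
        gc_compaction_enabled: enabled,
        gc_compaction_initial_threshold_kb: initial_threshold_kb,
        gc_compaction_ratio_percent: ratio_percent,
    }
}
```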
```
guard
    .entry(timeline.timeline_id)
```
Side question: do we clean entries up properly when a timeline gets deleted?
```
    }
}) else {
    self.trigger_auto_compaction(timeline).await;
    // Always yield after triggering auto-compaction
```
... why? Would be helpful to know for future readers.
```
@@ -267,27 +415,35 @@ impl GcCompactionQueue {
    ) -> Result<CompactionOutcome, CompactionError> {
        let _one_op_at_a_time_guard = self.consumer_lock.lock().await;
```
This lock should never be held at this point, right? Because there's only one compaction task per tenant?
(I guess the caveat is compact mgmt API calls in testing?)
I generally would prefer returning an error if it's taken, like CompactionError::AlreadyRunning or whatever. That avoids deadlock risk, which is hard to debug with tokio mutexes.
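A standalone sketch of the fail-fast alternative, assuming the consumer lock is a tokio::sync::Mutex; the AlreadyRunning variant and function name are hypothetical:

```rust
use tokio::sync::Mutex;

/// Illustrative error type; the real code would add a variant such as
/// `CompactionError::AlreadyRunning` (the name is an assumption).
#[derive(Debug)]
pub enum CompactionError {
    AlreadyRunning,
}

/// Fail fast instead of awaiting the consumer lock: `try_lock` returns Err
/// immediately if another task holds the mutex, so a concurrent caller gets
/// an error instead of queuing up behind a tokio mutex (and risking
/// hard-to-debug deadlocks).
pub async fn run_one_compaction_op(consumer_lock: &Mutex<()>) -> Result<(), CompactionError> {
    let _one_op_at_a_time_guard = consumer_lock
        .try_lock()
        .map_err(|_| CompactionError::AlreadyRunning)?;
    // ... perform the single compaction iteration while holding the guard ...
    Ok(())
}
```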
```
let Ok(permit) = CONCURRENT_GC_COMPACTION_TASKS.clone().try_acquire_owned() else {
    // Only allow one compaction run at a time
    return;
};
```
Bailing here prevents fairness among timelines & tenants.
```
let layers = {
    let guard = timeline.layers.read().await;
    let layer_map = guard.layer_map().unwrap();
    layer_map.iter_historic_layers().collect_vec()
};
let mut l2_size: u64 = 0;
let mut l1_size = 0;
let gc_cutoff = *timeline.get_latest_gc_cutoff_lsn();
for layer in layers {
    if layer.lsn_range.start <= l2_lsn {
        l2_size += layer.file_size();
    } else if layer.lsn_range.start <= gc_cutoff {
        l1_size += layer.file_size();
    }
}
```
We will do this for every timeline every 20s (the default compaction_period), even for ones that haven't changed.
Granted, the global compaction semaphore limits concurrency, so it's not like this will overwhelm us CPU-wise. But it's quite wasteful.
Should queue a follow-up to compute the trigger incrementally; it seems quite simple to do if we keep track of l1_size and l2_size incrementally inside the layer map, as in the sketch below.
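A rough sketch of what that incremental bookkeeping could look like. This is not the existing LayerMap API; Lsn is replaced by u64 to keep the example self-contained:

```rust
/// Hypothetical running totals kept alongside the layer map so the trigger
/// doesn't need to rescan every layer each compaction_period.
pub struct GcCompactionSizeTracker {
    pub l2_lsn: u64,    // stand-in for Lsn
    pub gc_cutoff: u64, // stand-in for Lsn
    pub l1_size: u64,
    pub l2_size: u64,
}

impl GcCompactionSizeTracker {
    /// Call whenever a layer is inserted into the layer map.
    pub fn on_layer_insert(&mut self, lsn_range_start: u64, file_size: u64) {
        if lsn_range_start <= self.l2_lsn {
            self.l2_size += file_size;
        } else if lsn_range_start <= self.gc_cutoff {
            self.l1_size += file_size;
        }
    }

    /// Call whenever a layer is removed from the layer map.
    pub fn on_layer_remove(&mut self, lsn_range_start: u64, file_size: u64) {
        if lsn_range_start <= self.l2_lsn {
            self.l2_size = self.l2_size.saturating_sub(file_size);
        } else if lsn_range_start <= self.gc_cutoff {
            self.l1_size = self.l1_size.saturating_sub(file_size);
        }
    }

    // Caveat: when l2_lsn or gc_cutoff advances, layers can change buckets,
    // so those updates would need an additional reclassification step.
}
```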
```
self.schedule_auto_compaction(
    CompactOptions {
        flags: {
            let mut flags = EnumSet::new();
            flags |= CompactFlags::EnhancedGcBottomMostCompaction;
            flags
        },
        sub_compaction: true,
        compact_key_range: None,
        compact_lsn_range: None,
        sub_compaction_max_job_size_mb: None,
    },
    permit,
);
```
To compute the trigger condition, we just walked the layer map, did lsn range checks, etc.
Basically it's half the filtering work that the schedule code does.
Now the schedule code will do most of that work again.
And then I think each job that is queued will do the work again?
Also, if we don't specify the compact_lsn_range here, then gc_cutoff_horizon can move between scheduling the MetaJob here and executing the MetaJob later.
If the pageserver is too busy and can't keep up with compactions, the turnaround between scheduling & executing the MetaJob may be many minutes.
The layer map can drift substantially in the meantime, and thus the announcements about l1_size / l2_size etc. in the log messages below will be out of date.
Another source of drift is if the timeline starts ingesting a lot in between scheduling the MetaJob and executing the MetaJob.
L0 compaction will take priority, delaying execution of the MetaJob indefinitely.
That again results in very stale info! log lines below by the time we finally start executing the MetaJob.
So, maybe specify compact_lsn_range: Some(l2_lsn..gc_cutoff) right here, so that the parameters are fixed?
I don't know if l2_lsn..gc_cutoff is correct, or whether it needs to be ..=, or 0..(=?)gc_cutoff, or ..gc_cutoff. Hope you get the idea of the suggestion though.
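A self-contained sketch of that idea; the types here are illustrative stand-ins, not the actual CompactOptions, and the bound semantics are left open as noted above:

```rust
use std::ops::Range;

/// Illustrative only: capture the LSN range at scheduling time so that later
/// movement of the gc cutoff cannot change what the queued MetaJob compacts.
pub struct ScheduledGcCompaction {
    pub compact_lsn_range: Option<Range<u64>>, // u64 stands in for Lsn
}

pub fn schedule_with_pinned_range(l2_lsn: u64, gc_cutoff: u64) -> ScheduledGcCompaction {
    ScheduledGcCompaction {
        // Whether the upper bound should be inclusive (..=) is undecided in the review.
        compact_lsn_range: Some(l2_lsn..gc_cutoff),
    }
}
```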
```
// TODO: error handling, clear the queue if any task fails?
let _ = timeline
    .compact_with_options(cancel, options, ctx)
    .instrument(info_span!("scheduled_compact_timeline", %timeline.timeline_id))
    .await?;
```
We pulled this failing task out of the queue, so, if we don't enqueue it again, what happens? What's the behavior of the next
- trigger and
- schedule_auto_compaction invocation
in the case where some sub-jobs have finished and others have failed?
Will the job accept a layer map in that shape?
Will the trigger trigger again?
Will we re-do the work of the successfully completed sub-jobs?
If we decide to clear the queue, what are the implications for Notify? (Still need to understand that one.)
```
if l2_lsn >= timeline.get_l2_lsn() {
    info!("l2_lsn updated to {}", l2_lsn);
```
When can `<` happen? Feels like we should at least make noise?
```
timeline
    .update_l2_lsn(l2_lsn)
    .map_err(CompactionError::Other)?;
```
The most common case where this will error is during shutdown, right?
IIUC the work itself isn't lost (we updated the layer map with the new bottommost image layer).
But the next trigger evaluation after restart will
- still see the old l2_lsn and
- see a slightly advanced gc_horizon_cutoff
Will the trigger fire again? That'd be bad.
My understanding is that it will fire again immediately because we're considering all layers, image & delta alike, for the trigger.
Is that really the right thing to do? Should we not be considering deltas only?
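A hedged sketch of what a deltas-only trigger computation could look like; the tuple layout is purely illustrative, and whether the layer descriptor exposes an is_delta-style predicate is an assumption here:

```rust
/// Illustrative only: accumulate L1/L2 sizes from delta layers only, so that
/// the image layers produced by a previous gc-compaction don't immediately
/// re-trigger the next run. Each layer is represented here as
/// (lsn_range_start, file_size, is_delta) to keep the sketch self-contained.
pub fn accumulate_delta_sizes(
    layers: &[(u64, u64, bool)],
    l2_lsn: u64,
    gc_cutoff: u64,
) -> (u64, u64) {
    let (mut l1_size, mut l2_size) = (0u64, 0u64);
    for &(lsn_range_start, file_size, is_delta) in layers {
        if !is_delta {
            continue; // skip image layers
        }
        if lsn_range_start <= l2_lsn {
            l2_size += file_size;
        } else if lsn_range_start <= gc_cutoff {
            l1_size += file_size;
        }
    }
    (l1_size, l2_size)
}
```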
Hacky solution if we should consider both images and deltas: make the l2_lsn stateful:

```rust
enum BottomMostCompactionStatus {
    NotRunning { l2_lsn: Lsn },
    InProgress {
        old_l2_lsn: Lsn,
        new_l2_lsn: Lsn,
    },
}
```

Trigger then checks:

```rust
match status {
    BottomMostCompactionStatus::NotRunning { .. } => {
        // do what we do today, cap lsn range at gc_cutoff
    }
    BottomMostCompactionStatus::InProgress { old_l2_lsn, new_l2_lsn } => {
        // cap lsn range at min(gc_cutoff, new_l2_lsn); probably min is redundant
        // because the code that picked new_l2_lsn already used gc_cutoff, and
        // gc_cutoff can only ever advance (?)
    }
}
```
IIUC this is the first piece of code that will set IndexPart::l2_lsn to a Some() value.
I think it's a suboptimal name and we'll have to live with it for a long time, so now is the chance to fix it.
Why is it a bad name?
There's a reasonable chance we'll have multi-level compaction in the future. Bottom-most compaction will then not be L2, but, well, L_bottom.
Also, another aspect that I think we talked about 1-2 months ago: the state that we store in the index part here is not some generic state of the compaction algorithm you developed. It is specific to the usage of the algorithm for bottommost compaction. (Whereas the algorithm in general can do much more than just bottommost compaction, i.e., arbitrary rectangles, and in branches.)
So, all in all, gc_compaction feels like a way more appropriate field name to me.
Also, I think we should plan ahead for evolution of that state field.
So, here's my proposal, along with some free doc comment that took me way too long to write:
```rust
pub struct IndexPart {
    // ...

    /// State for the garbage-collecting compaction pass.
    ///
    /// Garbage-collecting compaction (gc-compaction) prunes `Value`s that are outside
    /// the PITR window and not needed by child timelines.
    ///
    /// A commonly used synonym for this compaction pass is
    /// "bottommost-compaction" because the affected LSN range
    /// is the "bottom" of the (key,lsn) map.
    ///
    /// Gc-compaction is a quite expensive operation; that's why we use
    /// a trigger condition.
    /// This field here holds the state pertaining to that trigger condition
    /// and (in the future) to the progress of the gc-compaction, so that it's
    /// resumable across restarts & migrations.
    ///
    /// Note that the underlying algorithm is _also_ called `gc-compaction`
    /// in most places & design docs; but in fact it is more flexible than
    /// just the specific use case here; it needs a new name.
    gc_compaction: GcCompactionPersistence,
}

pub struct GcCompactionPersistence {
    /// The upper bound of the last completed garbage-collecting compaction.
    last_completed_lsn: Lsn,
}
```
Problem
Part of #9114.
Summary of changes
Add the automatic trigger for gc-compaction. It computes two values: the L1 size and the L2 size. When the L1 size >= the initial trigger threshold, we trigger an initial gc-compaction. When l1_size / l2_size >= gc_compaction_ratio_percent, we trigger the "tiered" gc-compaction.
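A minimal sketch of that trigger decision, assuming the thresholds are plain byte/percent values; the helper name and the way the two conditions are combined (initial trigger only when there is no L2 yet) are assumptions, not the PR's exact implementation:

```rust
/// Returns true when gc-compaction should be scheduled.
/// - initial trigger: no L2 yet and l1_size >= initial_threshold
/// - "tiered" trigger: l1_size / l2_size >= gc_compaction_ratio_percent
pub fn should_trigger_gc_compaction(
    l1_size: u64,
    l2_size: u64,
    initial_threshold: u64,
    gc_compaction_ratio_percent: u64,
) -> bool {
    if l2_size == 0 {
        l1_size >= initial_threshold
    } else {
        // Compare the ratio in percent without floating point:
        // l1_size / l2_size >= ratio%  <=>  l1_size * 100 >= l2_size * ratio%.
        (l1_size as u128) * 100 >= (l2_size as u128) * (gc_compaction_ratio_percent as u128)
    }
}
```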