Optimize branch creation #2101

Merged 16 commits into main on Jul 19, 2022

Conversation

aome510 commented Jul 14, 2022

Resolves #2054

Context: branch creation needs to take the gc_cs lock, which GC holds to prevent new timelines from being created during a GC run. However, because each timeline's GC iteration also requires the compaction_cs lock, branch creation may also end up waiting for compactions of multiple timelines. This results in large latency when creating a new branch, an operation we advertise as "instant".

This PR reduces the latency of branch creation by separating GC into two phases:

  1. Collect GC data (branching points, cutoff LSNs, etc.)
  2. Perform GC for each timeline

The GC bottleneck comes from step 2, which must wait for compaction of multiple timelines. This PR modifies the branch creation and GC functions so that GC holds the GC lock only during step 1. As a result, branch creation doesn't need to wait for compaction to finish; it only waits for the GC data collection step, which is fast.
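
As a rough illustration of the new locking structure, here is a minimal, self-contained sketch. The names (gc_cs, update_gc_info, gc) follow the discussion in this PR, but the types and signatures are simplified stand-ins rather than the real pageserver API:

```rust
use std::sync::Mutex;

// Simplified stand-ins for the pageserver types; the signatures are assumptions.
struct Timeline;

impl Timeline {
    /// Phase 1: record branch points and cutoff LSNs in the timeline's GC info.
    /// Fast: touches only in-memory data structures.
    fn update_gc_info(&self) {}

    /// Phase 2: scan the layer files and delete the unneeded ones.
    /// Slow: must wait for any in-progress compaction.
    fn gc(&self) {}
}

struct Tenant {
    gc_cs: Mutex<()>,
    timelines: Vec<Timeline>,
}

impl Tenant {
    fn gc_iteration_internal(&self) {
        // Phase 1: hold gc_cs only while collecting GC data, so a concurrent
        // branch creation is blocked only for this short step.
        {
            let _gc_cs = self.gc_cs.lock().unwrap();
            for timeline in &self.timelines {
                timeline.update_gc_info();
            }
        } // gc_cs is released here

        // Phase 2: the slow per-timeline layer removal runs without gc_cs,
        // so branch creation no longer waits for compaction to finish.
        for timeline in &self.timelines {
            timeline.gc();
        }
    }
}

fn main() {
    let tenant = Tenant {
        gc_cs: Mutex::new(()),
        timelines: vec![Timeline, Timeline],
    };
    tenant.gc_iteration_internal();
}
```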

aome510 commented Jul 14, 2022

The speedup can be measured by running the test_branch_creation_heavy_write test against this PR and main. At least based on runs on my laptop, the maximum latency with this PR should be 3-5x smaller than that of main.

hlinnaka commented:

This doesn't look safe to me. This can happen:

In the beginning, there is only one branch, 'main'. Thread A calls branch_timeline('main', 'new', '1/11112222') to create a new branch called 'new', at LSN '1/10000000'. At the same time, thread B calls gc_iteration_internal().

  1. A acquires layer_removal_cs
  2. A reads latest_gc_cutoff_lsn as 1/11110000.
  3. B scans the list of timelines. The new timeline creation hasn't finished yet, so it only sees 'main'
  4. B calls gc() on the source timeline. It blocks on layer_removal_cs
  5. A finishes creating the new timeline
  6. B goes ahead with the garbage collection, updates latest_gc_cutoff_lsn to 1/22220000, and removes layers older than that.

Step 6 will remove data that is still needed by the new timeline.

In a nutshell, GC needs to be careful to not remove data that is still needed by child branches. There's a race condition between new branch creation and GC. If a new branch is created after GC has collected its list of branches and their branch-points (retain_lsns), but before it has updated latest_gc_cutoff_lsn on the source timeline, it can remove data that is still needed by the new timeline.

Comment on lines 886 to 887
// grab mutex to prevent new timelines from being created here.
let _gc_cs = self.gc_cs.lock().unwrap();

Contributor:

This is the part that prevented the race condition I mentioned in my comment.

hlinnaka commented Jul 15, 2022

So I think we'll still need gc_cs to prevent a new timeline from being created while GC is running. But I think we can split GC into two phases:

  1. Collect the list of timelines and their branchpoints and gc-horizons.
  2. Scan the timeline directories to remove files.

The first phase is pretty fast. It only needs to look at some in-memory data structures, and it can run concurrently with compaction.
The second phase is slow. It must not run at the same time as compaction. But it can run concurrently with branch creation.

We almost have that split already. gc_iteration_internal first calls timeline.update_gc_info, and then it calls timeline.gc. The update_gc_info is the first phase and gc is the second phase. We can change gc_iteration_internal so that it first calls update_gc_info for all timelines. Then it can release gc_cs lock, and then call gc on all the timelines. Probably need to move some of the code from gc to update_gc_info, though. But the core idea is to split GC into two phases: first determine which LSN ranges need to be retained, then release gc_cs lock, and only then scan the directories to remove files.
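
For illustration, here is a minimal, self-contained sketch of what moving the cutoff calculation from gc into update_gc_info could look like. The Lsn, GcInfo, and Timeline types below are simplified stand-ins, and lsn_covering_time is a hypothetical placeholder for the real time-to-LSN lookup:

```rust
use std::time::{Duration, SystemTime};

// Simplified stand-ins; the real Lsn and GcInfo types in the pageserver differ.
#[derive(Clone, Copy, Debug, Default, PartialEq, PartialOrd)]
struct Lsn(u64);

#[derive(Debug, Default)]
struct GcInfo {
    /// Branch points of child timelines that must be retained.
    retain_lsns: Vec<Lsn>,
    /// Keep everything newer than last_record_lsn minus the GC horizon.
    horizon_cutoff: Lsn,
    /// Keep everything needed to serve the PITR window.
    pitr_cutoff: Lsn,
}

struct Timeline {
    last_record_lsn: Lsn,
    pitr_interval: Duration,
    gc_info: GcInfo,
}

impl Timeline {
    /// Phase 1: compute everything that decides what GC must retain.
    /// Because the pitr_cutoff calculation now happens here rather than in
    /// gc(), phase 2 is purely "delete files" and can run without gc_cs.
    fn update_gc_info(&mut self, retain_lsns: Vec<Lsn>, horizon: u64, now: SystemTime) {
        self.gc_info.horizon_cutoff = Lsn(self.last_record_lsn.0.saturating_sub(horizon));
        self.gc_info.pitr_cutoff = self.lsn_covering_time(now - self.pitr_interval);
        self.gc_info.retain_lsns = retain_lsns;
    }

    /// Placeholder for the real "find the LSN that covers this wall-clock
    /// time" lookup, which the actual code derives from WAL metadata.
    fn lsn_covering_time(&self, _t: SystemTime) -> Lsn {
        Lsn(self.last_record_lsn.0 / 2)
    }
}

fn main() {
    let mut tl = Timeline {
        last_record_lsn: Lsn(0x2222_0000),
        pitr_interval: Duration::from_secs(7 * 24 * 3600),
        gc_info: GcInfo::default(),
    };
    tl.update_gc_info(vec![Lsn(0x1111_0000)], 0x1000_0000, SystemTime::now());
    println!("{:?}", tl.gc_info);
}
```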

aome510 commented Jul 15, 2022

> This doesn't look safe to me. This can happen: […]

Thanks for taking a look at this. I have separated the GC code into two phases as you described in your later comment. I also added a test, test_branch_creation_before_gc, simulating the above scenario.

Still need to do some cleanup, such as adding more comments and updating the documentation.

.ok_or_else(|| anyhow::anyhow!("unknown timeline id: {}", &src))?
};

let layer_removal_cs = src_timeline.layer_removal_cs.lock().unwrap();
Contributor:

It's OK if layers are removed during branch creation; I don't think this lock is needed here.

}
gc_info.pitr_cutoff = pitr_cutoff_lsn;

Contributor:

I think you need to also update the timeline's latest_gc_cutoff_lsn here. Otherwise essentially the same race condition can still happen:

  1. B starts GC and scans the list of timelines. The new timeline doesn't exist yet, so it only sees 'main'
  2. B calls update_gc_info() on the source timeline, setting pitr/horizon_cutoff to 1/22220000
  3. B releases gc_cs
  4. A acquires gc_cs
  5. A reads latest_gc_cutoff_lsn as 1/11110000.
  6. A finishes creating the new timeline
  7. B runs second phase of GC, updates latest_gc_cutoff_lsn to 1/22220000, and removes layers older than that.

aome510 commented Jul 18, 2022

I don't update the GC cutoff there because it's possible that GC doesn't run even after the GC info is updated. For example, that can happen when we get a pageserver shutdown request: https://github.com/neondatabase/neon/pull/2101/files#diff-2dee9987054422607c1ab7369e102927d8477dea6d75c4c7f3891e3d65b15d1eR971-R975

The situation above should not happen because I also added a check for GC info in branch_timeline:
https://github.com/neondatabase/neon/pull/2101/files#diff-2dee9987054422607c1ab7369e102927d8477dea6d75c4c7f3891e3d65b15d1eR306-R312

Also the new test simulates that situation, so it should fail if there is a race condition.
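
For illustration, a minimal sketch of the kind of check described above, with hypothetical simplified types (the real branch_timeline signature and fields differ):

```rust
use std::sync::Mutex;

#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct Lsn(u64);

struct GcInfo {
    horizon_cutoff: Lsn,
    pitr_cutoff: Lsn,
}

struct SourceTimeline {
    latest_gc_cutoff_lsn: Lsn,
    gc_info: GcInfo,
}

struct Tenant {
    gc_cs: Mutex<()>,
    src: SourceTimeline,
}

impl Tenant {
    fn branch_timeline(&self, start_lsn: Lsn) -> Result<(), String> {
        // Hold gc_cs so GC cannot recompute the GC info (phase 1) while the
        // new branch point is being validated and recorded.
        let _gc_cs = self.gc_cs.lock().unwrap();

        // Reject start points already removed by past GC iterations...
        if start_lsn < self.src.latest_gc_cutoff_lsn {
            return Err(format!("start LSN {:?} is below the GC cutoff", start_lsn));
        }
        // ...and start points that an already-planned GC iteration (one that
        // finished phase 1 but has not run phase 2 yet) is about to remove.
        let planned_cutoff = self.src.gc_info.horizon_cutoff.min(self.src.gc_info.pitr_cutoff);
        if start_lsn < planned_cutoff {
            return Err(format!("start LSN {:?} would be garbage-collected", start_lsn));
        }

        // Safe: create the child timeline; its branch point is recorded
        // before gc_cs is released, so the next GC phase 1 will retain it.
        Ok(())
    }
}

fn main() {
    let tenant = Tenant {
        gc_cs: Mutex::new(()),
        src: SourceTimeline {
            latest_gc_cutoff_lsn: Lsn(0x1111_0000),
            gc_info: GcInfo { horizon_cutoff: Lsn(0x2222_0000), pitr_cutoff: Lsn(0x2222_0000) },
        },
    };
    assert!(tenant.branch_timeline(Lsn(0x1000_0000)).is_err()); // below the cutoff
    assert!(tenant.branch_timeline(Lsn(0x3333_0000)).is_ok());
}
```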

Contributor:

> The situation above should not happen because I also added a check for GC info in branch_timeline: […]

Ah, gotcha, I missed that

aome510 force-pushed the add-branch-perf-tests branch from 87973e3 to ac973ed on July 18, 2022 18:30
-    let mut pg_install_dir: PathBuf;
-    if let Some(postgres_install_dir) = env::var_os("POSTGRES_INSTALL_DIR") {
-        pg_install_dir = postgres_install_dir.into();
+    let mut pg_install_dir = if let Some(postgres_install_dir) = env::var_os("POSTGRES_INSTALL_DIR")
aome510 (author):

Not related to this PR; this change fixes a clippy error. I can revert it and move it to another PR if needed.

aome510 force-pushed the add-branch-perf-tests branch from 7af02e4 to 7050eb4 on July 18, 2022 20:41
@@ -110,8 +110,8 @@ jobs:
target/
# Fall back to older versions of the key, if no cache for current Cargo.lock was found
key: |
-          v2-${{ runner.os }}-${{ matrix.build_type }}-cargo-${{ matrix.rust_toolchain }}-${{ hashFiles('Cargo.lock') }}
-          v2-${{ runner.os }}-${{ matrix.build_type }}-cargo-${{ matrix.rust_toolchain }}-
+          v3-${{ runner.os }}-${{ matrix.build_type }}-cargo-${{ matrix.rust_toolchain }}-${{ hashFiles('Cargo.lock') }}
aome510 (author):

Update the cache key to fix the incremental compilation error. See https://neondb.slack.com/archives/C0277TKAJCA/p1658169404632249

Contributor:

I purged the cache, so this hopefully won't happen again.

aome510 (author):

I guess it's safe to revert the CI changes for now.

aome510 commented Jul 19, 2022

Update: my bad, it's not safe to revert the changes.

@@ -101,7 +101,7 @@ jobs:
!~/.cargo/registry/src
~/.cargo/git
target
-        key: ${{ runner.os }}-cargo-${{ hashFiles('./Cargo.lock') }}-rust-${{ matrix.rust_toolchain }}
+        key: v1-${{ runner.os }}-cargo-${{ hashFiles('./Cargo.lock') }}-rust-${{ matrix.rust_toolchain }}
aome510 (author):

Update cache key to fix the incremental compilation error.

aome510 requested review from hlinnaka and knizhnik on July 18, 2022 21:58
Comment on lines +1123 to 1131
     horizon_cutoff: Lsn,

-    /// In addition to 'retain_lsns', keep everything newer than 'SystemTime::now()'
-    /// minus 'pitr_interval'
-    pitr: Duration,
+    /// In addition to 'retain_lsns' and 'horizon_cutoff', keep everything newer than this
+    /// point.
+    ///
+    /// This is calculated by finding a number such that a record is needed for PITR
+    /// if and only if its LSN is larger than 'pitr_cutoff'.
+    pitr_cutoff: Lsn,
 }
Contributor:

You can combine horizon_cutoff and pitr_cutoff into one value. The code in gc prints different debug messages, and uses separate counters depending on which one was smaller, but I don't think that distinction is needed. Instead, you could just print a debug message in update_gc_info, to indicate which was dominant.
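
A minimal sketch of the suggested simplification, with assumed names (println! standing in for the real debug logging):

```rust
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct Lsn(u64);

/// Combine the two cutoffs into a single value and report which bound was
/// dominant, so gc() itself no longer needs to distinguish them.
fn combined_cutoff(horizon_cutoff: Lsn, pitr_cutoff: Lsn) -> Lsn {
    // GC may only remove data older than both bounds, so the effective
    // cutoff is the smaller of the two.
    let cutoff = horizon_cutoff.min(pitr_cutoff);
    if pitr_cutoff < horizon_cutoff {
        println!("GC cutoff {:?} determined by the PITR interval", cutoff);
    } else {
        println!("GC cutoff {:?} determined by the GC horizon", cutoff);
    }
    cutoff
}

fn main() {
    assert_eq!(combined_cutoff(Lsn(0x2222_0000), Lsn(0x1111_0000)), Lsn(0x1111_0000));
}
```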

aome510 commented Jul 19, 2022

Yeah, I agree. However, it's not a trivial change. I think it's better to leave that to a separate follow-up PR.

Comment on lines 284 to 292
// Besides GC lock, branch creation task doesn't need to hold the `layer_removal_cs` lock,
// hence doesn't need to wait for compaction/GC because it ensures that the starting LSN
// of the child branch is not out of scope in the middle of the creation task by
// 1. holding the GC lock to prevent overwritting timeline GC data
// 2. checking both the latest GC cutoff LSN and latest GC info of the source timeline
// to avoid initializing the new branch using data removed by past GC iterations
// or in-queue GC iterations.

Contributor:

It's a bit weird to explain in great detail why this doesn't need to hold layer_removal_cs lock. Also, this seems like a run-on sentence.

aome510 force-pushed the add-branch-perf-tests branch from cd37135 to eb2a2d7 on July 19, 2022 16:21
aome510 force-pushed the add-branch-perf-tests branch from eb2a2d7 to a933902 on July 19, 2022 16:22
aome510 commented Jul 19, 2022

One example result from running the new branch creation perf tests, comparing this patch with the latest main:

This patch:

test_branch_creation_heavy_write[20].branch_creation_duration_max: 3.961 s
test_branch_creation_heavy_write[20].branch_creation_duration_avg: 1.185 s
test_branch_creation_heavy_write[20].branch_creation_duration_stdev: 1.222 s
test_branch_creation_many[1024].branch_creation_duration_max: 0.273 s
test_branch_creation_many[1024].branch_creation_duration_avg: 0.102 s
test_branch_creation_many[1024].branch_creation_duration_stdev: 0.037 s

main:

test_branch_creation_heavy_write[20].branch_creation_duration_max: 21.256 s
test_branch_creation_heavy_write[20].branch_creation_duration_avg: 2.072 s
test_branch_creation_heavy_write[20].branch_creation_duration_stdev: 5.348 s
test_branch_creation_many[1024].branch_creation_duration_max: 3.733 s
test_branch_creation_many[1024].branch_creation_duration_avg: 0.106 s
test_branch_creation_many[1024].branch_creation_duration_stdev: 0.119 s

In general, this looks like a decent improvement. Will merge once tests pass.

aome510 mentioned this pull request on Jul 19, 2022
aome510 merged commit 160e52e into main on Jul 19, 2022
aome510 deleted the add-branch-perf-tests branch on July 19, 2022 18:56
Closes: Branch creation waits for compaction to finish (#2054)