Optimize branch creation #2101

Merged 16 commits into main on Jul 19, 2022

Conversation

aome510 commented Jul 14, 2022

Resolves #2054

Context: branch creation needs to take the gc_cs lock, which GC holds to prevent new timelines from being created during a GC run. However, because each timeline's GC iteration also requires the compaction_cs lock, branch creation may also end up waiting for compactions of multiple timelines. This results in large latency when creating a new branch, an operation we advertise as "instant".

This PR reduces the latency of branch creation by separating GC into two phases:

  1. Collect GC data (branching points, cutoff LSNs, etc.)
  2. Perform GC for each timeline

The GC bottleneck comes from step 2, which must wait for compaction of multiple timelines. This PR modifies the branch creation and GC functions so that GC holds the GC lock only during step 1. As a result, branch creation doesn't need to wait for compaction to finish; it only waits for the GC data collection step, which is fast.
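
As a rough illustration of the new locking structure, here is a minimal, self-contained sketch. The names (gc_cs, update_gc_info, gc) follow the discussion in this PR, but the types and signatures are simplified stand-ins rather than the real pageserver API:

```rust
use std::sync::Mutex;

// Simplified stand-ins for the pageserver types; the signatures are assumptions.
struct Timeline;

impl Timeline {
    /// Phase 1: record branch points and cutoff LSNs in the timeline's GC info.
    /// Fast: touches only in-memory data structures.
    fn update_gc_info(&self) {}

    /// Phase 2: scan the layer files and delete the unneeded ones.
    /// Slow: must wait for any in-progress compaction.
    fn gc(&self) {}
}

struct Tenant {
    gc_cs: Mutex<()>,
    timelines: Vec<Timeline>,
}

impl Tenant {
    fn gc_iteration_internal(&self) {
        // Phase 1: hold gc_cs only while collecting GC data, so a concurrent
        // branch creation is blocked only for this short step.
        {
            let _gc_cs = self.gc_cs.lock().unwrap();
            for timeline in &self.timelines {
                timeline.update_gc_info();
            }
        } // gc_cs is released here

        // Phase 2: the slow per-timeline layer removal runs without gc_cs,
        // so branch creation no longer waits for compaction to finish.
        for timeline in &self.timelines {
            timeline.gc();
        }
    }
}

fn main() {
    let tenant = Tenant {
        gc_cs: Mutex::new(()),
        timelines: vec![Timeline, Timeline],
    };
    tenant.gc_iteration_internal();
}
```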

aome510 commented Jul 14, 2022

The speedup can be measured by running the test_branch_creation_heavy_write test against this PR and main. At least based on runs on my laptop, the maximum latency with this PR should be 3-5x smaller than that of main.

hlinnaka commented:

This doesn't look safe to me. This can happen:

In the beginning, there is only one branch, 'main'. Thread A calls branch_timeline('main', 'new', '1/11112222') to create a new branch called 'new', at LSN '1/10000000'. At the same time, thread B calls gc_iteration_internal().

  1. A acquires layer_removal_cs
  2. A reads latest_gc_cutoff_lsn as 1/11110000.
  3. B scans the list of timelines. The new timeline creation hasn't finished yet, so it only sees 'main'
  4. B calls gc() on the source timeline. It blocks on layer_removal_cs
  5. A finishes creating the new timeline
  6. B goes ahead with the garbage collection, updates latest_gc_cutoff_lsn to 1/22220000, and removes layers older than that.

Step 6 will remove data that is still needed by the new timeline.

In a nutshell, GC needs to be careful to not remove data that is still needed by child branches. There's a race condition between new branch creation and GC. If a new branch is created after GC has collected its list of branches and their branch-points (retain_lsns), but before it has updated latest_gc_cutoff_lsn on the source timeline, it can remove data that is still needed by the new timeline.

Comment on lines 886 to 887
// grab mutex to prevent new timelines from being created here.
let _gc_cs = self.gc_cs.lock().unwrap();

Contributor:

This is the part that prevented the race condition I mentioned in my comment.

hlinnaka commented Jul 15, 2022

So I think we'll still need gc_cs to prevent a new timeline from being created while GC is running. But I think we can split GC into two phases:

  1. Collect the list of timelines and their branchpoints and gc-horizons.
  2. Scan the timeline directories to remove files.

The first phase is pretty fast. It only needs to look at some in-memory data structures, and it can run concurrently with compaction.
The second phase is slow. It must not run at the same time as compaction. But it can run concurrently with branch creation.

We almost have that split already. gc_iteration_internal first calls timeline.update_gc_info, and then it calls timeline.gc. The update_gc_info is the first phase and gc is the second phase. We can change gc_iteration_internal so that it first calls update_gc_info for all timelines. Then it can release gc_cs lock, and then call gc on all the timelines. Probably need to move some of the code from gc to update_gc_info, though. But the core idea is to split GC into two phases: first determine which LSN ranges need to be retained, then release gc_cs lock, and only then scan the directories to remove files.
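
For illustration, here is a minimal, self-contained sketch of what moving the cutoff calculation from gc into update_gc_info could look like. The Lsn, GcInfo, and Timeline types below are simplified stand-ins, and lsn_covering_time is a hypothetical placeholder for the real time-to-LSN lookup:

```rust
use std::time::{Duration, SystemTime};

// Simplified stand-ins; the real Lsn and GcInfo types in the pageserver differ.
#[derive(Clone, Copy, Debug, Default, PartialEq, PartialOrd)]
struct Lsn(u64);

#[derive(Debug, Default)]
struct GcInfo {
    /// Branch points of child timelines that must be retained.
    retain_lsns: Vec<Lsn>,
    /// Keep everything newer than last_record_lsn minus the GC horizon.
    horizon_cutoff: Lsn,
    /// Keep everything needed to serve the PITR window.
    pitr_cutoff: Lsn,
}

struct Timeline {
    last_record_lsn: Lsn,
    pitr_interval: Duration,
    gc_info: GcInfo,
}

impl Timeline {
    /// Phase 1: compute everything that decides what GC must retain.
    /// Because the pitr_cutoff calculation now happens here rather than in
    /// gc(), phase 2 is purely "delete files" and can run without gc_cs.
    fn update_gc_info(&mut self, retain_lsns: Vec<Lsn>, horizon: u64, now: SystemTime) {
        self.gc_info.horizon_cutoff = Lsn(self.last_record_lsn.0.saturating_sub(horizon));
        self.gc_info.pitr_cutoff = self.lsn_covering_time(now - self.pitr_interval);
        self.gc_info.retain_lsns = retain_lsns;
    }

    /// Placeholder for the real "find the LSN that covers this wall-clock
    /// time" lookup, which the actual code derives from WAL metadata.
    fn lsn_covering_time(&self, _t: SystemTime) -> Lsn {
        Lsn(self.last_record_lsn.0 / 2)
    }
}

fn main() {
    let mut tl = Timeline {
        last_record_lsn: Lsn(0x2222_0000),
        pitr_interval: Duration::from_secs(7 * 24 * 3600),
        gc_info: GcInfo::default(),
    };
    tl.update_gc_info(vec![Lsn(0x1111_0000)], 0x1000_0000, SystemTime::now());
    println!("{:?}", tl.gc_info);
}
```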

aome510 commented Jul 15, 2022

> This doesn't look safe to me. This can happen: […]

Thanks for taking a look at this. I have separated the GC code into two phases as you described in your later comment. I also added a test, test_branch_creation_before_gc, simulating the above scenario.

Still need to do some cleanup, such as adding more comments and updating the documentation.

.ok_or_else(|| anyhow::anyhow!("unknown timeline id: {}", &src))?
};

let layer_removal_cs = src_timeline.layer_removal_cs.lock().unwrap();
Contributor:

It's OK if layers are removed during branch creation; I don't think this lock is needed here.

}
gc_info.pitr_cutoff = pitr_cutoff_lsn;

Contributor:

I think you need to also update the timeline's latest_gc_cutoff_lsn here. Otherwise essentially the same race condition can still happen:

  1. B starts GC and scans the list of timelines. The new timeline doesn't exist yet, so it only sees 'main'
  2. B calls update_gc_info() on the source timeline, setting pitr/horizon_cutoff to 1/22220000
  3. B releases gc_cs
  4. A acquires gc_cs
  5. A reads latest_gc_cutoff_lsn as 1/11110000.
  6. A finishes creating the new timeline
  7. B runs second phase of GC, updates latest_gc_cutoff_lsn to 1/22220000, and removes layers older than that.

aome510 commented Jul 18, 2022

I don't update the GC cutoff there because it's possible that GC doesn't run even after the GC info is updated. For example, that can happen when we get a pageserver shutdown request: https://github.com/neondatabase/neon/pull/2101/files#diff-2dee9987054422607c1ab7369e102927d8477dea6d75c4c7f3891e3d65b15d1eR971-R975

The situation above should not happen because I also added a check for GC info in branch_timeline:
https://github.com/neondatabase/neon/pull/2101/files#diff-2dee9987054422607c1ab7369e102927d8477dea6d75c4c7f3891e3d65b15d1eR306-R312

Also the new test simulates that situation, so it should fail if there is a race condition.
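
For illustration, a minimal sketch of the kind of check described above, with hypothetical simplified types (the real branch_timeline signature and fields differ):

```rust
use std::sync::Mutex;

#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct Lsn(u64);

struct GcInfo {
    horizon_cutoff: Lsn,
    pitr_cutoff: Lsn,
}

struct SourceTimeline {
    latest_gc_cutoff_lsn: Lsn,
    gc_info: GcInfo,
}

struct Tenant {
    gc_cs: Mutex<()>,
    src: SourceTimeline,
}

impl Tenant {
    fn branch_timeline(&self, start_lsn: Lsn) -> Result<(), String> {
        // Hold gc_cs so GC cannot recompute the GC info (phase 1) while the
        // new branch point is being validated and recorded.
        let _gc_cs = self.gc_cs.lock().unwrap();

        // Reject start points already removed by past GC iterations...
        if start_lsn < self.src.latest_gc_cutoff_lsn {
            return Err(format!("start LSN {:?} is below the GC cutoff", start_lsn));
        }
        // ...and start points that an already-planned GC iteration (one that
        // finished phase 1 but has not run phase 2 yet) is about to remove.
        let planned_cutoff = self.src.gc_info.horizon_cutoff.min(self.src.gc_info.pitr_cutoff);
        if start_lsn < planned_cutoff {
            return Err(format!("start LSN {:?} would be garbage-collected", start_lsn));
        }

        // Safe: create the child timeline; its branch point is recorded
        // before gc_cs is released, so the next GC phase 1 will retain it.
        Ok(())
    }
}

fn main() {
    let tenant = Tenant {
        gc_cs: Mutex::new(()),
        src: SourceTimeline {
            latest_gc_cutoff_lsn: Lsn(0x1111_0000),
            gc_info: GcInfo { horizon_cutoff: Lsn(0x2222_0000), pitr_cutoff: Lsn(0x2222_0000) },
        },
    };
    assert!(tenant.branch_timeline(Lsn(0x1000_0000)).is_err()); // below the cutoff
    assert!(tenant.branch_timeline(Lsn(0x3333_0000)).is_ok());
}
```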

Contributor:

> The situation above should not happen because I also added a check for GC info in branch_timeline: […]

Ah, gotcha, I missed that

aome510 force-pushed the add-branch-perf-tests branch from 87973e3 to ac973ed on July 18, 2022 18:30
-    let mut pg_install_dir: PathBuf;
-    if let Some(postgres_install_dir) = env::var_os("POSTGRES_INSTALL_DIR") {
-        pg_install_dir = postgres_install_dir.into();
+    let mut pg_install_dir = if let Some(postgres_install_dir) = env::var_os("POSTGRES_INSTALL_DIR")
aome510 (author):

Not related to this PR; this change fixes a clippy error. I can revert it and move it to another PR if needed.

aome510 force-pushed the add-branch-perf-tests branch from 7af02e4 to 7050eb4 on July 18, 2022 20:41
@@ -110,8 +110,8 @@ jobs:
target/
# Fall back to older versions of the key, if no cache for current Cargo.lock was found
key: |
-          v2-${{ runner.os }}-${{ matrix.build_type }}-cargo-${{ matrix.rust_toolchain }}-${{ hashFiles('Cargo.lock') }}
-          v2-${{ runner.os }}-${{ matrix.build_type }}-cargo-${{ matrix.rust_toolchain }}-
+          v3-${{ runner.os }}-${{ matrix.build_type }}-cargo-${{ matrix.rust_toolchain }}-${{ hashFiles('Cargo.lock') }}
aome510 (author):

Update the cache key to fix the incremental compilation error. See https://neondb.slack.com/archives/C0277TKAJCA/p1658169404632249

Contributor:

I purged the cache, so this hopefully won't happen again.

aome510 (author):

I guess it's safe to revert the CI changes for now.

aome510 commented Jul 19, 2022

Update: my bad, it's not safe to revert the changes.

@@ -101,7 +101,7 @@ jobs:
!~/.cargo/registry/src
~/.cargo/git
target
-        key: ${{ runner.os }}-cargo-${{ hashFiles('./Cargo.lock') }}-rust-${{ matrix.rust_toolchain }}
+        key: v1-${{ runner.os }}-cargo-${{ hashFiles('./Cargo.lock') }}-rust-${{ matrix.rust_toolchain }}
aome510 (author):

Update cache key to fix the incremental compilation error.

aome510 requested review from hlinnaka and knizhnik on July 18, 2022 21:58
Comment on lines +1123 to 1131
     horizon_cutoff: Lsn,

-    /// In addition to 'retain_lsns', keep everything newer than 'SystemTime::now()'
-    /// minus 'pitr_interval'
-    pitr: Duration,
+    /// In addition to 'retain_lsns' and 'horizon_cutoff', keep everything newer than this
+    /// point.
+    ///
+    /// This is calculated by finding a number such that a record is needed for PITR
+    /// if and only if its LSN is larger than 'pitr_cutoff'.
+    pitr_cutoff: Lsn,
 }
Contributor:

You can combine horizon_cutoff and pitr_cutoff into one value. The code in gc prints different debug messages, and uses separate counters depending on which one was smaller, but I don't think that distinction is needed. Instead, you could just print a debug message in update_gc_info, to indicate which was dominant.
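
A minimal sketch of the suggested simplification, with assumed names (println! standing in for the real debug logging):

```rust
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct Lsn(u64);

/// Combine the two cutoffs into a single value and report which bound was
/// dominant, so gc() itself no longer needs to distinguish them.
fn combined_cutoff(horizon_cutoff: Lsn, pitr_cutoff: Lsn) -> Lsn {
    // GC may only remove data older than both bounds, so the effective
    // cutoff is the smaller of the two.
    let cutoff = horizon_cutoff.min(pitr_cutoff);
    if pitr_cutoff < horizon_cutoff {
        println!("GC cutoff {:?} determined by the PITR interval", cutoff);
    } else {
        println!("GC cutoff {:?} determined by the GC horizon", cutoff);
    }
    cutoff
}

fn main() {
    assert_eq!(combined_cutoff(Lsn(0x2222_0000), Lsn(0x1111_0000)), Lsn(0x1111_0000));
}
```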

aome510 commented Jul 19, 2022

Yeah, I agree. However, it's not a trivial change. I think it's better to leave that to a separate follow-up PR.

Comment on lines 284 to 292
// Besides GC lock, branch creation task doesn't need to hold the `layer_removal_cs` lock,
// hence doesn't need to wait for compaction/GC because it ensures that the starting LSN
// of the child branch is not out of scope in the middle of the creation task by
// 1. holding the GC lock to prevent overwritting timeline GC data
// 2. checking both the latest GC cutoff LSN and latest GC info of the source timeline
// to avoid initializing the new branch using data removed by past GC iterations
// or in-queue GC iterations.

Contributor:

It's a bit weird to explain in great detail why this doesn't need to hold layer_removal_cs lock. Also, this seems like a run-on sentence.

aome510 force-pushed the add-branch-perf-tests branch from cd37135 to eb2a2d7 on July 19, 2022 16:21
aome510 force-pushed the add-branch-perf-tests branch from eb2a2d7 to a933902 on July 19, 2022 16:22
aome510 commented Jul 19, 2022

One example result from running the new branch creation perf tests, comparing this patch with the latest main:

This patch:

test_branch_creation_heavy_write[20].branch_creation_duration_max: 3.961 s
test_branch_creation_heavy_write[20].branch_creation_duration_avg: 1.185 s
test_branch_creation_heavy_write[20].branch_creation_duration_stdev: 1.222 s
test_branch_creation_many[1024].branch_creation_duration_max: 0.273 s
test_branch_creation_many[1024].branch_creation_duration_avg: 0.102 s
test_branch_creation_many[1024].branch_creation_duration_stdev: 0.037 s

main:

test_branch_creation_heavy_write[20].branch_creation_duration_max: 21.256 s
test_branch_creation_heavy_write[20].branch_creation_duration_avg: 2.072 s
test_branch_creation_heavy_write[20].branch_creation_duration_stdev: 5.348 s
test_branch_creation_many[1024].branch_creation_duration_max: 3.733 s
test_branch_creation_many[1024].branch_creation_duration_avg: 0.106 s
test_branch_creation_many[1024].branch_creation_duration_stdev: 0.119 s

In general, this looks like a decent improvement. Will merge once tests pass.

aome510 mentioned this pull request on Jul 19, 2022
aome510 merged commit 160e52e into main on Jul 19, 2022
aome510 deleted the add-branch-perf-tests branch on July 19, 2022 18:56
Closes: Branch creation waits for compaction to finish (#2054)