Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Delay calculating the starting LSN when doing timeline branching #2053

Merged
merged 2 commits into from
Jul 8, 2022

Conversation

aome510
Copy link
Contributor

@aome510 aome510 commented Jul 7, 2022

Previously, upon branching, if no starting LSN is specified, we
determine the start LSN based on the source timeline's last record LSN
in timelines::create_timeline function, which then calls Repository::branch_timeline
to create the timeline.

Inside the LayeredRepository::branch_timeline function, to start branching,
we try to acquire a GC lock to prevent GC from removing data needed
for the new timeline. However, a GC iteration takes time, so the GC lock
can be held for a long period of time. As a result, the previously determined
starting LSN can become invalid because of GC.

This PR fixes the above issue by delaying the LSN calculation part and moving it to be
inside LayeredRepository::branch_timeline function.

Previously, upon branching, if no starting LSN is specified, we
determine the start LSN based on the source timeline's last record LSN
in `timelines::create_timeline` function, which then calls `Repository::branch_timeline`.

Inside the `LayeredRepository::branch_timeline`, to start branching,
we try to acquire a GC lock to prevent GC from removing data needed
for the new timeline. However, a GC iteration takes time, so
the GC lock can be held for a long period of time. As a result,
the previously determined starting LSN can become invalid because of GC
and the long wait time.
@aome510 aome510 requested a review from knizhnik July 7, 2022 21:09
@aome510 aome510 requested a review from LizardWizzard July 7, 2022 21:14
@aome510
Copy link
Contributor Author

aome510 commented Jul 7, 2022

Related discussion: #1569 (comment)

Copy link
Contributor

@knizhnik knizhnik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have correctly explained the reason of the problem: GC lock in branch_timeline can block creation of timeline for a long time - till end of GC. At this moment chosen branchong point can be beyond GC horizon which cause "invalid branch LSN" error despite to thefact that it was not explicitly specified.

Your fix definitely solve this issue. And it is obviously better than it was before. This is why I approved this PR. But my concern is that if GC really can take arbitrary long time, then blocking branch creation till end of GC may be problematic. Yes, branches re not expected to be created frequently. But if it takes let say one minute and "create branch" command is blocked for one minute, ... then user can just think that Neon hangs.

@knizhnik
Copy link
Contributor

knizhnik commented Jul 8, 2022

I tried to investigate this issue more. First of all I find out that I have used slightly deteriorated version of neon - when I checkout the latest sources the problem really reproduced quite frequently. So something is recently changed...

Then I tried to understand why GC takes so long time, although it is expected to be cheap operation (reading only layers metadata, do not read or write layer file content - just delete deteriorated layers). And the reason is compaction_cs which is obtained not only by compaction but also by GC. And compaction can really take very long time. Is it acceptabe to delay creation of new branches and GC until compaction is complete?

@hlinnaka
Copy link
Contributor

hlinnaka commented Jul 8, 2022

Then I tried to understand why GC takes so long time, although it is expected to be cheap operation (reading only layers metadata, do not read or write layer file content - just delete deteriorated layers). And the reason is compaction_cs which is obtained not only by compaction but also by GC. And compaction can really take very long time. Is it acceptabe to delay creation of new branches and GC until compaction is complete?

Yeah, that's not great... Created a new issue to track that.

@aome510
Copy link
Contributor Author

aome510 commented Jul 8, 2022

Then I tried to understand why GC takes so long time, although it is expected to be cheap operation (reading only layers metadata, do not read or write layer file content - just delete deteriorated layers). And the reason is compaction_cs which is obtained not only by compaction but also by GC. And compaction can really take very long time. Is it acceptabe to delay creation of new branches and GC until compaction is complete?

I also realize the same thing. Initially about to comment in the original discussion but forgot to do so. It's a bad thing that there is a delay during branch creation because of GC as it contradicts with our description in the website:

Neon allows to instantly branch your Postgres database to support a modern development workflow

Waiting 16s is not instant :)

@aome510 aome510 merged commit 1f5918b into main Jul 8, 2022
@aome510 aome510 deleted the delay-branch-start-lsn-calc branch July 8, 2022 14:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants