-
Notifications
You must be signed in to change notification settings - Fork 497
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Delay calculating the starting LSN when doing timeline branching #2053
Conversation
Previously, upon branching, if no starting LSN is specified, we determine the start LSN based on the source timeline's last record LSN in `timelines::create_timeline` function, which then calls `Repository::branch_timeline`. Inside the `LayeredRepository::branch_timeline`, to start branching, we try to acquire a GC lock to prevent GC from removing data needed for the new timeline. However, a GC iteration takes time, so the GC lock can be held for a long period of time. As a result, the previously determined starting LSN can become invalid because of GC and the long wait time.
Related discussion: #1569 (comment) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You have correctly explained the reason of the problem: GC lock in branch_timeline can block creation of timeline for a long time - till end of GC. At this moment chosen branchong point can be beyond GC horizon which cause "invalid branch LSN" error despite to thefact that it was not explicitly specified.
Your fix definitely solve this issue. And it is obviously better than it was before. This is why I approved this PR. But my concern is that if GC really can take arbitrary long time, then blocking branch creation till end of GC may be problematic. Yes, branches re not expected to be created frequently. But if it takes let say one minute and "create branch" command is blocked for one minute, ... then user can just think that Neon hangs.
I tried to investigate this issue more. First of all I find out that I have used slightly deteriorated version of neon - when I checkout the latest sources the problem really reproduced quite frequently. So something is recently changed... Then I tried to understand why GC takes so long time, although it is expected to be cheap operation (reading only layers metadata, do not read or write layer file content - just delete deteriorated layers). And the reason is compaction_cs which is obtained not only by compaction but also by GC. And compaction can really take very long time. Is it acceptabe to delay creation of new branches and GC until compaction is complete? |
Yeah, that's not great... Created a new issue to track that. |
I also realize the same thing. Initially about to comment in the original discussion but forgot to do so. It's a bad thing that there is a delay during branch creation because of GC as it contradicts with our description in the website:
Waiting 16s is not instant :) |
Previously, upon branching, if no starting LSN is specified, we
determine the start LSN based on the source timeline's last record LSN
in
timelines::create_timeline
function, which then callsRepository::branch_timeline
to create the timeline.
Inside the
LayeredRepository::branch_timeline
function, to start branching,we try to acquire a GC lock to prevent GC from removing data needed
for the new timeline. However, a GC iteration takes time, so the GC lock
can be held for a long period of time. As a result, the previously determined
starting LSN can become invalid because of GC.
This PR fixes the above issue by delaying the LSN calculation part and moving it to be
inside
LayeredRepository::branch_timeline
function.