Set last written lsn for created relation #2398
Seems reasonable, but I'm curious if the lack of this was causing some user-visible issue? Can you write a test case? |
This is what I have been trying to create all day, without any success :( It is not so difficult to force this scenario: just insert some sleep in Also I wonder why test_parallel_copy may cause reading relation pages while copying data to it. |
I think I just hit this bug on my PR: https://github.com/neondatabase/neon/actions/runs/3002515510 |
I investigated the log more closely and am almost sure that the problem is caused by the last written LSN cache: it is not updated after relation creation (and that is fixed by this PR).
and it is the LSN of branch creation:
So this LSN precedes the moment when the table is created, and that is why the key is not found. |
The question remains, what exactly is the sequence of events here? I would've thought it goes like this:
So even if we're missing a SetLastWrittenLSNForRelation() call when the relation is created, the COPY should do it before any GetPage requests on the table are issued. What am I missing? |
Hmm, I think this is what actually happens:
I think the "bulk extension" code in RelationAddExtraBlocks() needs to be hit for this to lead to an error. RelationAddExtraBlocks() extends the relation with empty pages without WAL-logging them. |
This is what I do not understand myself:( |
I just realized that actually So there is no magic here, but still all my attempts to reproduce the bug by inserting delays didn't succeed. |
The strangest thing here is that this PR sets the last written LSN for the relation metadata (i.e. REL_METADATA_PSEUDO_BLOCKNO). But according to the log, the failure happens in GetPage, so the last written LSN for the corresponding chunk should be used instead. It is unclear how setting the LSN for relation metadata may affect it. But it is a fact that there are no CI failures in this branch, although I have restarted the tests more than ten times. |
Three pieces of news:
So, as @hlinnaka expected, the problem is related to bulk relation extension. In this case a batch of new pages is allocated using smgr_extend with a zero buffer, so they are not WAL-logged and the last written LSN is not updated for them. If such a page is swapped out and then accessed before the SMGR_CREATE record is replayed by the pageserver, and some other page from this chunk is updated, then we get this "key not found" error, because we try to retrieve a page of a relation which does not yet exist at the pageserver. But reproducing all these conditions is very non-trivial, and I have spent a couple of days trying to simulate a situation which rarely happens on CI and never in local runs. So what I have to do:
and
Backend is stopped at breakpoint.
|
I have not checked, but it looks like the problem is not caused by my last written LSN cache. It just increases the probability of such an error. |
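To make the failure mode discussed above easier to follow, here is a toy model in Python. It is only a sketch under simplifying assumptions, not the actual compute or pageserver code: the class and method names (`ToyPageserver`, `ToyCompute`, `set_lsn_on_create`) and the LSN values are invented for illustration. The model assumes the pageserver answers a GetPage request as of the requested LSN, and that the compute node falls back to the branch creation LSN when it has no last written LSN recorded for the relation.

```python
class KeyNotFound(Exception):
    """Toy stand-in for the pageserver's "key not found" error."""


class ToyPageserver:
    def __init__(self):
        # Hypothetical bookkeeping: relation name -> LSN of its creation record.
        self.created_at = {}

    def get_page(self, rel, blkno, request_lsn):
        create_lsn = self.created_at.get(rel)
        # A page can only be served at an LSN where the relation already exists.
        if create_lsn is None or request_lsn < create_lsn:
            raise KeyNotFound(f"{rel} block {blkno} at LSN {request_lsn}")
        return bytes(8192)  # an all-zero page


class ToyCompute:
    def __init__(self, pageserver, branch_lsn, set_lsn_on_create):
        self.pageserver = pageserver
        self.branch_lsn = branch_lsn
        self.current_lsn = branch_lsn
        self.set_lsn_on_create = set_lsn_on_create  # models the fix in this PR
        self.last_written = {}  # relation name -> last written LSN

    def create_relation(self, rel):
        # Creating the relation is WAL-logged, so the LSN advances.
        self.current_lsn += 10
        self.pageserver.created_at[rel] = self.current_lsn
        if self.set_lsn_on_create:
            self.last_written[rel] = self.current_lsn

    def bulk_extend(self, rel, nblocks):
        # Zero pages are added without WAL, so no LSN is recorded for them.
        pass

    def read_page(self, rel, blkno):
        # Without a recorded LSN, fall back to the branch creation LSN.
        request_lsn = self.last_written.get(rel, self.branch_lsn)
        return self.pageserver.get_page(rel, blkno, request_lsn)
```

With `set_lsn_on_create=False`, creating a relation, bulk-extending it, and reading one of its pages raises `KeyNotFound`, because the request carries the branch creation LSN. With `set_lsn_on_create=True` the same sequence returns a zero page.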
c8839f8
to
24e9713
I wonder if I should continue trying to create a test reproducing the problem? It looks like we need something like a failpoints mechanism, but for C (for the Postgres code). There are actually two mechanisms used to reproduce race condition bugs:
In any case, the problem seems to be clear and this PR fixes it (I hope: the problem does not reproduce with my manual scenario). |
Ok, let's get this in.
Careful with vendor/postgres-v15! I think this PR is about to make the same mistake as commit f44afba, and change vendor/postgres-v15 to actually be v14 again.
I created 2 PRs for core part of this patch: |
Approved neondatabase/postgres#209 and opened new PR for the v15 changes at neondatabase/postgres#212, with REL_15_STABLE_neon as the base. |
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
eda89e9
to
84d3fb2
Postgres core part of this PR is merged, neon part still waiting for review. Not ACID:) |