Set last written lsn for created relation #2398

Merged: 4 commits merged into main on Sep 19, 2022

Conversation

@knizhnik (Contributor) commented Sep 6, 2022

No description provided.

@hlinnaka (Contributor) commented Sep 6, 2022

Seems reasonable, but I'm curious if the lack of this was causing some user-visible issue? Can you write a test case?

@knizhnik (Contributor, Author) commented Sep 6, 2022

> Seems reasonable, but I'm curious if the lack of this was causing some user-visible issue? Can you write a test case?

This is what I have been trying to create all day, without any success :(
This change is motivated by the failure of the test_parallel_copy test (#2384), which failed to find the key with the relation size in get_page_at_lsn.
I tried to understand how this is possible (assuming there is no data corruption or race condition in the pageserver). I inspected the contents of the WAL after executing the "CREATE TABLE" command. There is only one record associated with the created relation: SMGR_CREATE. But we do not update the last-written LSN cache, so if we send any request for this relation to the pageserver, we will use an LSN preceding that of the SMGR_CREATE record. The compute node will specify the "latest" flag, but it can happen that the request LSN is still smaller than the LSN of the SMGR_CREATE record.
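For illustration, a minimal sketch of the idea behind the fix (this is not the actual patch; the call site and the exact signature of SetLastWrittenLSNForRelation, which comes from the neon patch to Postgres core, are assumed here):

#include "postgres.h"
#include "access/xlog.h"
#include "catalog/storage.h"
#include "storage/smgr.h"

/*
 * Sketch only: after the XLOG_SMGR_CREATE record for a new relation is
 * emitted, remember its LSN in the last-written LSN cache so that later
 * requests for this relation are not sent to the pageserver with an older LSN.
 */
static void
remember_rel_creation_lsn(SMgrRelation reln, ForkNumber forknum)
{
	XLogRecPtr	lsn;

	log_smgrcreate(&reln->smgr_rnode.node, forknum); /* emits XLOG_SMGR_CREATE */
	lsn = GetXLogInsertRecPtr();                     /* insert position after that record */

	/* assumed signature: (lsn, relfilenode, forknum) */
	SetLastWrittenLSNForRelation(lsn, reln->smgr_rnode.node, forknum);
}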

It is not so difficult to force this scenario: just insert some sleep in ingest_xlog_smgr_create.
What I failed to do is make the compute request a page of this relation from the page server: we have a relation size cache at the compute, so we know that the relation is empty and never try to read any pages from it.

I also wonder why test_parallel_copy should need to read pages of the relation while copying data into it.

@hlinnaka (Contributor) commented Sep 7, 2022

I think I just hit this bug on my PR: https://github.com/neondatabase/neon/actions/runs/3002515510

@knizhnik (Contributor, Author) commented Sep 7, 2022

> I think I just hit this bug on my PR: https://github.com/neondatabase/neon/actions/runs/3002515510

I investigated the log more precisely and am almost sure that the problem is caused by the last-written LSN cache: it is not updated after relation creation (and that is what this PR fixes).
As you can see, the request LSN is 0/016960E8:

asyncpg.exceptions.PostgresIOError: could not read block 3 in rel 1663/12972/16384.0 from page server at lsn 0/016960E8

and that is the LSN of the branch creation:

2022-09-06 21:32:27.756 INFO [neon_fixtures.py:1151] Run success: Created timeline '55a645087623b4cffacbb3336d4d2825' at Lsn 0/16960E8 for tenant: a310952299e6717aa7b9bab68c685eba. Ancestor timeline: 'empty'

So this LSN precedes the moment when the table was created, and that is why the key is not found.

@hlinnaka (Contributor) commented Sep 7, 2022

The question remains, what exactly is the sequence of events here? I would've thought it goes like this:

  1. CREATE TABLE, creates SMGR_CREATE record. Without this PR, it does not update last-written LSN
  2. COPY extends the table from 0 to 1 pages. It calls smgrextend(), which updates the relsize cache, and also updates the last-written LSN for the block range, and for the relation (i.e. for REL_METADATA_PSEUDO_BLOCKNO).

So even if we're missing a SetLastWrittenLSNForRelation() call when the relation is created, the COPY should do it before any GetPage requests on the table are issued. What am I missing?
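In code, the bookkeeping described in step 2 would look roughly like the sketch below (the signatures of the SetLastWrittenLSN* functions are assumed; this is not the actual neon smgr implementation):

#include "postgres.h"
#include "access/xlog.h"
#include "storage/bufpage.h"
#include "storage/smgr.h"

/*
 * Sketch: the last-written LSN bookkeeping that smgrextend() is expected to
 * perform, per the sequence above.  The LSN is taken from the page being
 * written out.
 */
static void
record_extend_lsn(SMgrRelation reln, ForkNumber forknum,
                  BlockNumber blkno, char *buffer)
{
	XLogRecPtr	lsn = PageGetLSN((Page) buffer);

	/* per-block entry, consulted by GetPage requests for this block */
	SetLastWrittenLSNForBlock(lsn, reln->smgr_rnode.node, forknum, blkno);

	/* per-relation entry (REL_METADATA_PSEUDO_BLOCKNO), consulted for relation size */
	SetLastWrittenLSNForRelation(lsn, reln->smgr_rnode.node, forknum);
}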

@hlinnaka (Contributor) commented Sep 7, 2022

Hmm, I think this is what actually happens:

  1. CREATE TABLE, creates SMGR_CREATE record. Without this PR, it does not update last-written LSN
  2. COPY extends the table from 0 to 1 pages. It calls smgrextend(), which calls SetLastWrittenLSNForBlock() and SetLastWrittenLSNForRelation(). However: when the heap is extended, it extends the relation with empty pages, with 0/0 LSN. The SetLastWrittenLSN calls do nothing, when the LSN is invalid.

I think the bulk extension code in RelationAddExtraBlocks() needs to be hit for this to lead to an error. RelationAddExtraBlocks() extends the relation with empty pages without WAL-logging them.
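The crucial detail is that an all-zeros page carries no LSN, so there is nothing to record for it. A tiny self-contained illustration (the early return on an invalid LSN reflects the behaviour described above, not the literal implementation):

#include "postgres.h"
#include "access/xlogdefs.h"
#include "storage/bufpage.h"

/*
 * RelationAddExtraBlocks() extends the heap with all-zero pages that are not
 * WAL-logged, so PageGetLSN() on such a page is InvalidXLogRecPtr (0/0).  If
 * the SetLastWrittenLSN* functions skip invalid LSNs, these extensions leave
 * the last-written LSN cache untouched.
 */
static void
illustrate_zero_page_lsn(void)
{
	PGAlignedBlock zbuf;
	XLogRecPtr	lsn;

	memset(zbuf.data, 0, BLCKSZ);      /* a freshly extended, not yet WAL-logged page */
	lsn = PageGetLSN((Page) zbuf.data);

	Assert(XLogRecPtrIsInvalid(lsn));  /* pd_lsn of a zeroed page is 0/0 */

	if (XLogRecPtrIsInvalid(lsn))
		return;                        /* assumed: the cache update is simply skipped */

	/* SetLastWrittenLSNForBlock(lsn, ...) would only run for a valid LSN */
}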

@knizhnik (Contributor, Author) commented Sep 7, 2022

> So even if we're missing a SetLastWrittenLSNForRelation() call when the relation is created, the COPY should do it before any GetPage requests on the table are issued. What am I missing?

This is what I do not understand myself :(
As we can see, the pageserver failed in the get_page_at_lsn call, which means that one of the parallel copy tasks tried to load some page of this relation. But that means the relation must have been considered non-empty, i.e. it had been extended.
And if it was extended, smgr_extend was called, so SetLastWrittenLsn should have been called for this relation, and the fact that it was not called when the relation was created should not matter any more.

@knizhnik (Contributor, Author) commented Sep 7, 2022

I just realized that smgr_extend can indeed be called and in turn call SetLastWrittenLSNForRelation.
But if there is no LSN stored in the page image (PageGetLSN), then SetLastWrittenLSNForRelation does nothing and does not update the last-written LSN for this relation.
And calling ReadBuffer* with P_NEW actually initializes the page with zeros.

So there is no magic here, but still none of my attempts to reproduce the bug by inserting delays has succeeded.

@knizhnik (Contributor, Author) commented Sep 7, 2022

The strangest thing here is that this PR sets the last-written LSN for the relation metadata (i.e. REL_METADATA_PSEUDO_BLOCKNO). But according to the log, the failure happens in GetPage, so the last-written LSN for the corresponding chunk should be used instead. It is unclear how setting the LSN for the relation metadata could affect it. But it is a fact that there are no CI failures in this branch, although I have restarted the tests more than ten times.
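For readers following the puzzle: the two request types consult different entries of the last-written LSN cache, roughly as in the sketch below (the GetLastWrittenLSN name, its signature, and the use of REL_METADATA_PSEUDO_BLOCKNO are assumptions based on this discussion, not verified against the sources):

#include "postgres.h"
#include "access/xlog.h"
#include "storage/block.h"
#include "storage/relfilenode.h"

/*
 * Sketch only: a GetPage request would use the per-block (per-chunk) entry,
 * while a relation-size request would use the per-relation pseudo-block entry
 * that this PR updates at relation creation.
 */
static XLogRecPtr
request_lsn_for_page(RelFileNode rnode, ForkNumber forknum, BlockNumber blkno)
{
	return GetLastWrittenLSN(rnode, forknum, blkno);   /* per-chunk entry */
}

static XLogRecPtr
request_lsn_for_relsize(RelFileNode rnode, ForkNumber forknum)
{
	/* per-relation entry; REL_METADATA_PSEUDO_BLOCKNO is assumed here */
	return GetLastWrittenLSN(rnode, forknum, REL_METADATA_PSEUDO_BLOCKNO);
}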

@knizhnik (Contributor, Author) commented Sep 9, 2022

Three pieces of news:

  1. (good) I managed to reliably reproduce this error.
  2. (bad) My patches in this PR do not actually prevent it.
  3. It is hard to write a test for it.

So, as @hlinnaka expected, the problem is related to bulk relation extension. In this case a batch of new pages is allocated using smgr_extend with a zeroed buffer, so these pages are not WAL-logged and the last-written LSN is not updated for them. If such a page is swapped out and then accessed before the SMGR_CREATE record has been replayed by the pageserver, and some other page from this chunk is updated, then we get this "key not found" error, because we try to retrieve a page of a relation which does not yet exist at the pageserver. But reproducing all these conditions is very non-trivial: I have spent a couple of days trying to simulate a situation which rarely happens on CI and never in local runs.

So here is what I had to do:

  1. Insert a large sleep before self.put_rel_creation in walingest. This simulates a lagging pageserver that has not yet replayed this record when a page of this relation is accessed:
	std::thread::sleep(std::time::Duration::from_millis(30_000));
        self.put_rel_creation(modification, rel)?;
  2. Bulk relation extension (RelationAddExtraBlocks) is used only when there are multiple waiters, i.e. several backends trying to extend the relation. Comment out this logic and let RelationAddExtraBlocks always be used:
	/*
	 * If we need the lock but are not able to acquire it immediately, we'll
	 * consider extending the relation by multiple blocks at a time to manage
	 * contention on the relation extension lock.  However, this only makes
	 * sense if we're using the FSM; otherwise, there's no point.
	 */
	if (needLock)
	{
		#if 0
		if (!use_fsm)
			LockRelationForExtension(relation, ExclusiveLock);
		else if (!ConditionalLockRelationForExtension(relation, ExclusiveLock))
		#endif

and

	/* Use the length of the lock wait queue to judge how much to extend. */
	#if 0
	lockWaiters = RelationExtensionLockWaiterCount(relation);
	if (lockWaiters <= 0)
		return;
	#endif
  3. Then set a breakpoint in GDB at the end of RelationAddExtraBlocks and let it extend the relation and mark the pages as free in the free space map, but not fill them.

  4. Create a relation and initiate a copy into it:

create table t1(x integer);
copy t1 from '/tmp/t.csv';

The backend is now stopped at the breakpoint.

  5. In another session, initiate another copy and get the "key not found" error:
copy t1 from '/tmp/t.csv';
ERROR:  could not read block 0 in rel 1663/13010/16384.0 from page server at lsn 0/0169A3F8
DETAIL:  page server returned error: could not find data for key 000000067F000032D20000400000FFFFFFFF at LSN 0/169A380, for request at LSN 0/169A488
CONTEXT:  COPY t1, line 1000

@knizhnik (Contributor, Author) commented Sep 9, 2022

I have not checked, but it looks like the problem is not caused by my last-written LSN cache; it just increases the probability of such an error.

@knizhnik (Contributor, Author) commented

I wonder whether I should continue my attempts to create a test reproducing the problem. It looks like we need something like a failpoints mechanism, but for C (for Postgres code); a rough sketch of that idea follows at the end of this comment. There are actually two mechanisms used to reproduce race-condition bugs:

  1. Specify a particular schedule using sleeps or some other mechanism.
  2. Repeat the test many times, hoping that any synchronization bug will sooner or later show up by itself.

In this particular case approach 2 does not work well, because the problem can only happen at the very beginning of work with a relation. So we need multiple backends trying to append data to an empty (just created) relation, i.e. we need to organize concurrent access to the relation and contention from the very beginning. That seems difficult to achieve on a normal desktop or laptop, which is why the problem never reproduced locally. I am not sure what is more critical for reproducing it: a larger number of cores, or some background activity which delays backends and acts as a kind of barrier.

In any case, the problem seems to be clear and this PR fixes it (I hope: the problem does not reproduce with my manual scenario).
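For illustration, here is a rough sketch of what a minimal C-level failpoint of the kind mentioned above could look like. Everything in it (the macro name, the environment-variable convention, the fixed 30-second sleep) is hypothetical; nothing like this exists in the Postgres or neon sources:

#include "postgres.h"
#include <unistd.h>

/*
 * Hypothetical illustration only: a "failpoint" that sleeps when an
 * environment variable names it, so a test can widen a race window at a
 * chosen point in C code without recompiling with hard-coded sleeps.
 */
#define TEST_FAILPOINT(name) \
	do { \
		const char *fp_ = getenv("NEON_TEST_FAILPOINT"); \
		if (fp_ != NULL && strcmp(fp_, (name)) == 0) \
			sleep(30); /* plain POSIX sleep; pg_usleep() would be the Postgres idiom */ \
	} while (0)

/*
 * Usage (hypothetical): place it just before the interesting call, e.g.
 *
 *     TEST_FAILPOINT("before-smgr-create");
 *     log_smgrcreate(&rnode, MAIN_FORKNUM);
 */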

@hlinnaka (Contributor) left a comment

Ok, let's get this in.

Careful with vendor/postgres-v15! I think this PR is about to make the same mistake as commit f44afba, and change vendor/postgres-v15 to actually be v14 again.

@knizhnik (Contributor, Author) commented Sep 15, 2022

> Ok, let's get this in.
>
> Careful with vendor/postgres-v15! I think this PR is about to make the same mistake as commit f44afba, and change vendor/postgres-v15 to actually be v14 again.

I created two PRs for the core part of this patch:
neondatabase/postgres#209
neondatabase/postgres#211
They require approval before they can be committed.
Once they are committed, I will update the submodule references in the neon repository.

@hlinnaka (Contributor) commented

>> Ok, let's get this in.
>> Careful with vendor/postgres-v15! I think this PR is about to make the same mistake as commit f44afba, and change vendor/postgres-v15 to actually be v14 again.
>
> I created two PRs for the core part of this patch: neondatabase/postgres#209 neondatabase/postgres#211. They require approval before they can be committed. Once they are committed, I will update the submodule references in the neon repository.

Approved neondatabase/postgres#209 and opened a new PR for the v15 changes at neondatabase/postgres#212, with REL_15_STABLE_neon as the base.

@knizhnik (Contributor, Author) commented

The Postgres core part of this PR is merged; the neon part is still waiting for review. Not ACID :)

@knizhnik merged commit 846d126 into main on Sep 19, 2022
@knizhnik deleted the create_rel_lsn branch on September 19, 2022 at 09:56