Store prefetch results in LFC cache once, as soon as they are received #10442
base: main
Conversation
7403 tests run: 7016 passed, 0 failed, 387 skipped (full report)
Flaky tests (5): Postgres 17, Postgres 16, Postgres 15, Postgres 14
Code coverage* (full report)
* collected from Rust tests only. The comment gets automatically updated with the latest test results.
779dff3 at 2025-01-27T18:03:54.760Z
Thanks! I've only done a high-level review, but the approach makes sense to me. I expect the synchronization overhead would be negligible since this is on an IO path, but would be interested to see how benchmarks are affected.
Force-pushed from 4c3b6ca to 071142f
I'm missing the documentation as to why it's OK to side-load the data here without holding a buffer lock.
Could you add that info?
pgxn/neon/pagestore_smgr.c (Outdated)

if (response->tag == T_NeonGetPageResponse)
{
	/* store prefetched result in LFC */
	lfc_prefetch(BufTagGetNRelFileInfo(buftag), buftag.forkNum, buftag.blockNum,
				 ((NeonGetPageResponse *) response)->page,
				 slot->request_lsns.not_modified_since);
}
I'm a bit concerned about the ad-hoc nature of this.
I think we would do better to do this only at the end of an smgr operation, so that we can do it in bulk and not worry about e.g. O(128) lfc_prefetch calls when many pages of the same chunk have been prefetched.
Furthermore, this will significantly skew our metric towards prefetch discards: most "prefetch hits" we would have had earlier will now become "prefetch discards", as this new code doesn't distinguish between those types of hits/misses.
Furthermore, I also think this can cause double the number of writes to LFC: the read-from-pageserver path already includes writing the page to LFC, leading to duplicated effort.
I added more explanations to lfc_prefetch, explaining its correctness.
Sorry, I do not completely understand your concerns about the "ad-hoc nature of this".
Actually, I realised that I missed one place where lfc_prefetch has to be called: prefetch_pump_state, recently added by you.
My idea was to store the prefetched page in LFC ASAP to minimise the probability that some other backend will request the same page, which causes duplicated getpage requests to PS. Yes, this cannot completely eliminate duplicates; to achieve that we would need a global prefetch queue, which is very undesirable. But it allows us to significantly reduce this probability.
Why can multiple backends try to prefetch the same pages? It seems quite natural that different concurrent backends execute the same queries and retrieve almost the same data. In the case of seqscan, Postgres can use synchronised scans, which eliminate redundant reads. But there is nothing similar for indexscan and prefetch of referenced heap pages. We have actually observed such behaviour, and the main idea of this PR is to address this problem. prefetch_pump_state is called mostly "at the end of any smgr operation", so it really retrieves received prefetch results ASAP (only a separate background worker could do it faster, but that seems to be overkill), and the next step is to place them in LFC to let other backends use these pages (see the sketch below).
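Below is a minimal sketch (not the PR's actual code) of the idea described above: when prefetch_pump_state() runs, drain the responses that have already arrived and immediately side-load GetPage results into the LFC via lfc_prefetch(). The helper prefetch_response_ready() and the exact slot layout are assumptions made for illustration.

static void
prefetch_pump_state(void)
{
	/* Consume responses that have already arrived, without blocking. */
	while (prefetch_response_ready())	/* hypothetical readiness check */
	{
		PrefetchSlot *slot = prefetch_read();
		NeonResponse *response = slot->response;

		if (response->tag == T_NeonGetPageResponse)
		{
			/* Publish the page ASAP so other backends don't re-request it. */
			lfc_prefetch(BufTagGetNRelFileInfo(slot->buftag),
						 slot->buftag.forkNum,
						 slot->buftag.blockNum,
						 ((NeonGetPageResponse *) response)->page,
						 slot->request_lsns.not_modified_since);
		}
	}
}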
Sorry, I do not completely understand your concerns about the "ad-hoc nature of this".
The "ad-hoc" refers to "load the data unconditionally at exactly the moment we received the data".
I think there is no good reason to load the page here: It is very possible that we process requests which shouldn't end up in the LFC (e.g. through the get_page_at_lsn function of the neon_test_utils extension), and IMO we should prefer bulk loading the responses into the LFC over loading individual pages.
My idea was to store the prefetched page in LFC ASAP to minimise the probability that some other backend will request the same page
I think that makes sense - but only if everything else we can do is already done and we'd otherwise be waiting on IO.
I just think it is quite wasteful that we don't batch these operations: now we lock the LFC 2-3x for every page we try to put into the LFC from prefetch, while IMO that should be 2-3x for a batch of pages at most. See the idea behind #10492 - it sideloads pages in bulk at the end of any IO operation (at least, the pages not yet sideloaded) and discards those pages that were successfully sideloaded or already in the LFC, so that we immediately reduce backend-local memory usage rather than keeping up to 1MB of backend memory for prefetched results that are never used because they're already loaded into the LFC.
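A sketch of the bulk side-loading approach referenced from #10492 (lfc_try_sideload(), prefetch_release_slot() and the receive-buffer fields are hypothetical names, not the actual API): at the end of an smgr IO operation, store every already-received prefetched page into the LFC in one pass, and release the slots that were stored or were already cached, freeing backend-local prefetch memory early.

static void
sideload_received_prefetch_results(void)
{
	for (int i = 0; i < MyPState->n_received; i++)
	{
		PrefetchSlot *slot = &MyPState->received[i];

		/* Returns true if the page is now (or was already) in the LFC. */
		if (lfc_try_sideload(slot->buftag, slot->page,
							 slot->not_modified_since))
			prefetch_release_slot(slot);	/* free backend-local memory */
	}
}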
I think there is no good reason to load the page here:
I think there is a very good reason for it: once we have received something from PS, make it available to other backends to avoid duplicated getpage requests. Certainly, the prefetched page will hopefully sooner or later be placed in a shared buffer and so become available through it. But:
- It can happen that prefetched pages will never actually be requested (in case of a LIMIT clause).
- It may happen too late, when the page has already been requested by other backends.
and IMO we should prefer bulk loading the responses into the LFC over loading individual pages.
It is definitely preferable for seqscan and may be rewarding there. But prefetching of referenced heap pages is more or less random (and page-based), and this may be why it provides the best improvement (larger than for seqscan).
I think that makes sense - but only if everything else we can do is already done and we'd otherwise be waiting on IO.
What kind of IO? Writing the page to the LFC file? It is written to memory and only sometimes later may be flushed to disk. I do not think that local SSD write speed will become a bottleneck.
while IMO that should be 2-3x for a batch of pages at most
Mostly agree with you. But:
a) for batch prefetch we need to change the protocol;
b) there are still non-batched requests (index scans, ...), prefetching of which may be even more important. At least we are still positioning Neon as an OLTP rather than an OLAP DBMS.
See the idea behind #10492 - it sideloads pages in bulk at the end of any IO operation
I do not think locking can be a bottleneck here... Originally I had concerns that a single global LFC lock could cause huge contention and some degradation of performance. This is why I tried to partition the hash with multiple locks... and didn't observe any noticeable positive impact on performance. Certainly, maybe my test was not good enough, or there was some problem in the implementation... Some other test may show different results.
Generally, I prefer the "extreme programming" approach: do not try to solve a problem unless you are sure that it is really a problem. Why do you think that double locking (I assume 3 or more locks will be very rare) is the problem?
Is there some test/workload which illustrates it? With this PR I am trying to address a real problem we have observed with v0 and some other clients: a lot of duplicated getpage requests. Certainly there are other factors which cause such behaviour (a wrong policy for prefetch distance for indexscan, or a too-large default prefetch distance, for example). But it still seems to be quite an obvious drawback of the current local prefetch design. I still think that local prefetch is better than having some global pool of prefetch workers. But it doesn't mean that we should not try to minimise the issues with this approach.
I once again want to note that prefetch shows the best results for heap page prefetch during indexscan. So prefetching individual pages is very important, maybe even more important than bulk (chunk) preloading.
I do not see much problem with writing to LFC what we have already received from PS. Neither synchronisation (LFC locks), neither write to LFC file shouldn't be a bottleneck.
But instead of "tasseography" I prefer to investigate real use cases or create some benchmarks to prove or disprove our assumptions and expectations.
There are several things I have observed and want to share:
- seqscan prefetch provides ~2x improvement
- indexscan prefetch for a non-clustered index provides ~3x improvement
- Efficiency of prefetch is determined not only by the client and latencies, but also by PS. Its speed depends on many factors, including layer layout and the presence of heavy background tasks such as compaction. Random accesses are much more expensive for PS, because they are most likely not cached and require access to many different layers. In some cases prefetch doesn't provide any advantage for parallel seqscan at all (just because several parallel workers can consume all the capacity of PS).
What conclusions can we make from these facts? My opinion:
- Care more about the efficiency of prefetch of individual pages than about bulk prefetch.
- Try to minimise the load on PS, at least by avoiding duplicated getpage requests.
I think that makes sense - but only if everything else we can do is already done and we'd otherwise be waiting on IO.
What kind of IO?
In prefetch_wait_for() we loop on prefetch_read(), which can block waiting for IO from page_server->receive(). I think we should first receive as much data as possible in that loop, then (once we would block) load whatever data we have into LFC, and only then start waiting on IO from PS.
Mostly agree with you. But
a) for batch prefetch we need to change the protocol
You misunderstand my point. I don't (necessarily) want to batch prefetching to PS, but I want to batch whatever operations we do on LFC, so that we don't have unnecessary locking on that data.
Example: we have prefetched 10 pages which are yet to be pageserver->receive()-d. We now need the 10th page of those (the one that was most recently smgr_prefetch()-ed).
We can iterate nine times between pageserver->receive() and lfc_prefetch(), or alternatively we can pageserver->try_receive() whichever blocks are already available, and then store them with lfc_prefetch in one batch call, allowing LFC to limit its locking, write()s, and lookups to a minimum.
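An illustrative restructuring of the wait loop along these lines; page_server->try_receive(), lfc_prefetch_batch(), prefetch_slot_received() and PREFETCH_BATCH_SIZE are assumptions for the sketch, not existing APIs:

	while (!prefetch_slot_received(target_ring_index))
	{
		NeonResponse *batch[PREFETCH_BATCH_SIZE];
		int			n = 0;

		/* 1. Receive everything that is available without blocking. */
		while (n < PREFETCH_BATCH_SIZE &&
			   (batch[n] = page_server->try_receive(shard_no)) != NULL)
			n++;

		/* 2. Side-load the whole batch into the LFC in one locking cycle. */
		if (n > 0)
			lfc_prefetch_batch(batch, n);

		/* 3. Only now block on IO from the pageserver, if still needed. */
		if (!prefetch_slot_received(target_ring_index))
			prefetch_read();	/* blocking receive of the next response */
	}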
I do not think locking can be a bottleneck here
With sufficient readahead and many (1000s of) backends, I don't see why it couldn't become a bottleneck. Sure, there are many other systems that are slow, but we use the LFC a lot, and only the "is this page in the LFC" lock is shared: all IO operations (both read and write) take an exclusive lock, even if only temporarily. So, if each backend is doing 100s of reads per minute, I wouldn't be surprised if the lfc_ctl lock showed up in the wait events.
Generally, I prefer "extreme programming" approach: do not try to solve the problem, unless you are sure that it is really a problem.
I can agree with that, yet I prefer to design my code so that it doesn't become an issue at all, or at least reduces the risk of becoming an issue. There are 2 orders of magnitude difference in "number of times we take an exclusive lock" between the optimal performance of bulk loading data into the LFC and page-by-page loading of that same data into the LFC, and I don't see how that difference would be ~"free" performance-wise.
IMV, the lack of batching here is newly added technical and performance debt, and I don't like that.
Neither synchronisation (LFC locks), neither write to LFC file shouldn't be a bottleneck.
I think your double negative here implies that both synchronization and writing to LFC file should be a bottleneck?
But regardless, I think LFC locking can definitely become a bottleneck here, so I would prefer to limit that to a minimum.
typedef struct FileCacheEntry
{
	BufferTag	key;
	uint32		hash;
	uint32		offset;
	uint32		access_count;
	uint32		bitmap[CHUNK_BITMAP_SIZE];
	uint32		state[(BLOCKS_PER_CHUNK + 31) / 32 * 2];	/* two bits per block */
} FileCacheEntry;
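One possible encoding of the two-bit per-block state[] shown above (the accessor functions and the interleaved layout are assumptions for illustration, not code from the PR): 2 adjacent bits per block, 16 blocks per uint32 word.

#define BLOCK_STATE_BITS	2
#define BLOCKS_PER_WORD		(32 / BLOCK_STATE_BITS)

static inline uint32
lfc_get_block_state(FileCacheEntry *entry, int chunk_offs)
{
	uint32		word = entry->state[chunk_offs / BLOCKS_PER_WORD];
	int			shift = (chunk_offs % BLOCKS_PER_WORD) * BLOCK_STATE_BITS;

	return (word >> shift) & 3;
}

static inline void
lfc_set_block_state(FileCacheEntry *entry, int chunk_offs, uint32 new_state)
{
	int			shift = (chunk_offs % BLOCKS_PER_WORD) * BLOCK_STATE_BITS;
	uint32	   *word = &entry->state[chunk_offs / BLOCKS_PER_WORD];

	*word = (*word & ~(3u << shift)) | (new_state << shift);
}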
I'm a bit concerned about increasing the size of LFC by ~20% (was: 64 bytes FCEntry, is now 80 bytes FCEntry (+ 24 bytes hash table overheads), result = 18% increase in size). Sure, it isn't that large, but I think it's still a significant fraction of our shared memory.
Is there a specific reason why this needs block-level locking, rather than the chunk-level locking I introduced in #10312 ?
Well, I also do not want to increase FileCacheEntry much, because it will increase the memory footprint of LFC.
This is why I tried to add just one extra bit (per page). Why not per chunk? I want to avoid false collisions caused by access to different pages of the same chunk: prefetch of one page of the chunk could block readers of all other pages. This is IMHO very undesirable, especially for seqscan, where such collisions are expected to happen very frequently.
Also please notice that there are many ways to decrease the size of FileCacheEntry which are not yet taken: for example, the list_node used for LRU can be reduced from 16 bytes to just 8 if we replace pointers with integers.
In any case, IMHO 80 bytes is still acceptable, and preventing false sharing is more critical.
the list_node used for LRU can be reduced from 16 bytes to just 8 if we replace pointers with integers
That's not impossible, but it would require us to change the structure of the LFC: right now we store our data in hashmap entries, which are not exactly arbitrarily allocated, but certainly not laid out as a single array of the entry type. If we wanted to use indexing, we'd need to store references that map into an array of FCEntries instead.
While that is not impossible, it is certainly more work than just replacing dlist with a homegrown solution, and it would add some overhead for the increased indirection.
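A sketch of the index-based linking discussed here (the compact struct and helper are hypothetical): if all FileCacheEntry structs lived in one shared array, the 16-byte dlist_node could be replaced by two 4-byte array indexes.

#define INVALID_ENTRY_INDEX	0xFFFFFFFFu

typedef struct FileCacheEntryCompact
{
	BufferTag	key;
	uint32		hash;
	uint32		offset;
	uint32		access_count;
	uint32		lru_prev;		/* index into the shared entries[] array */
	uint32		lru_next;		/* INVALID_ENTRY_INDEX terminates the list */
	/* ... bitmap/state fields as before ... */
} FileCacheEntryCompact;

/* Unlink an entry from the LRU list, given the shared entries array. */
static void
lru_unlink(FileCacheEntryCompact *entries, uint32 idx)
{
	FileCacheEntryCompact *e = &entries[idx];

	if (e->lru_prev != INVALID_ENTRY_INDEX)
		entries[e->lru_prev].lru_next = e->lru_next;
	if (e->lru_next != INVALID_ENTRY_INDEX)
		entries[e->lru_next].lru_prev = e->lru_prev;
	e->lru_prev = e->lru_next = INVALID_ENTRY_INDEX;
}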
Force-pushed from 071142f to 582953e
Problem
Prefetch is performed locally, so different backends can request the same pages from PS. Such duplicated requests increase the load on the page server and the network traffic.
Making prefetch global seems to be very difficult and undesirable, because different queries can access chunks at different speeds. Storing prefetched chunks in LFC will not completely eliminate duplicates, but can minimise such requests.
The problem with storing prefetch results in LFC is that in this case the page is not protected by a shared buffer lock, so we have to perform extra synchronisation on the LFC side.
See:
https://neondb.slack.com/archives/C0875PUD0LC/p1736772890602029?thread_ts=1736762541.116949&cid=C0875PUD0LC
@MMeent implementation of prewarm:
See #10312
Summary of changes
Use condition variables to synchronize access to LFC entries.
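A minimal sketch of this synchronization, built on PostgreSQL's ConditionVariable API (the block-state values, the wait-event ID and the single LFC-wide CV are assumptions for illustration; the locking around state checks is elided):

#include "storage/condition_variable.h"

ConditionVariable lfc_cv;	/* assumed: one CV in the shared LFC control struct */

/* Reader: wait until a concurrently side-loaded block becomes valid. */
static void
lfc_wait_for_block(FileCacheEntry *entry, int chunk_offs)
{
	ConditionVariablePrepareToSleep(&lfc_cv);
	while (lfc_get_block_state(entry, chunk_offs) == BLOCK_STATE_WRITE_PENDING)
		ConditionVariableSleep(&lfc_cv, WAIT_EVENT_NEON_LFC);	/* assumed event ID */
	ConditionVariableCancelSleep();
}

/* Writer: publish a prefetched block after copying it into the LFC file. */
static void
lfc_publish_block(FileCacheEntry *entry, int chunk_offs)
{
	lfc_set_block_state(entry, chunk_offs, BLOCK_STATE_VALID);
	ConditionVariableBroadcast(&lfc_cv);
}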