Store prefetch results in LFC cache once, as soon as they are received #10442
base: main
Conversation
7403 tests run: 7016 passed, 0 failed, 387 skipped (full report)
Flaky tests (5): Postgres 17, Postgres 16, Postgres 15, Postgres 14
Code coverage* (full report)
* collected from Rust tests only. The comment gets automatically updated with the latest test results.
779dff3 at 2025-01-27T18:03:54.760Z
Thanks! I've only done a high-level review, but the approach makes sense to me. I expect the synchronization overhead would be negligible since this is on an IO path, but would be interested to see how benchmarks are affected.
Force-pushed from 4c3b6ca to 071142f
I'm missing the documentation as to why it's OK to side-load the data here without holding a buffer lock.
Could you add that info?
pgxn/neon/pagestore_smgr.c (Outdated)

if (response->tag == T_NeonGetPageResponse)
{
	/* store prefetched result in LFC */
	lfc_prefetch(BufTagGetNRelFileInfo(buftag), buftag.forkNum, buftag.blockNum,
				 ((NeonGetPageResponse *) response)->page,
				 slot->request_lsns.not_modified_since);
}
I'm a bit concerned about the ad-hoc nature of this.
I think we would do better to do this only at the end of an smgr operation, so that we can do it in bulk and not worry about e.g. O(128) lfc_prefetch calls when many pages of the same chunk have been prefetched.
Furthermore, this will significantly skew our metric towards prefetch discards: most "prefetch hits" we would have had earlier will now become "prefetch discards", as this new code doesn't distinguish between those types of hits/misses.
Furthermore, I also think this can cause double the number of writes to LFC: the read-from-pageserver path already includes writing the page to LFC, leading to duplicated effort.
I added more explanations to lfc_prefetch, explaining its correctness.
Sorry, I do not completely understand your concerns about the "ad-hoc nature of this".
Actually, I realised that I missed one place where lfc_prefetch has to be called: prefetch_pump_state, recently added by you.
My idea was to store the prefetched page in LFC ASAP to minimise the probability that some other backend will request the same page, which causes duplicated getpage requests to PS. Yes, this cannot completely eliminate duplicates; to achieve that we would need a global prefetch queue, which is very undesirable. But it allows us to significantly reduce this probability.
Why can multiple backends try to prefetch the same pages? It seems quite natural that different concurrent backends execute the same queries and retrieve almost the same data. In the case of seqscan, Postgres can use synchronised scans, which eliminate redundant reads. But there is nothing similar for indexscan and prefetch of referenced heap pages. We have actually observed such behaviour, and the main idea of this PR is to address this problem. prefetch_pump_state is called mostly "at the end of any smgr operation", so it really retrieves received prefetch results ASAP (only a separate background worker could do it faster, but that seems to be overkill), and the next step is to place them in LFC to let other backends use these pages (see the sketch below).
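Below is a minimal sketch (not the PR's actual code) of the idea described above: when prefetch_pump_state() runs, drain the responses that have already arrived and immediately side-load GetPage results into the LFC via lfc_prefetch(). The helper prefetch_response_ready() and the exact slot layout are assumptions made for illustration.

static void
prefetch_pump_state(void)
{
	/* Consume responses that have already arrived, without blocking. */
	while (prefetch_response_ready())	/* hypothetical readiness check */
	{
		PrefetchSlot *slot = prefetch_read();
		NeonResponse *response = slot->response;

		if (response->tag == T_NeonGetPageResponse)
		{
			/* Publish the page ASAP so other backends don't re-request it. */
			lfc_prefetch(BufTagGetNRelFileInfo(slot->buftag),
						 slot->buftag.forkNum,
						 slot->buftag.blockNum,
						 ((NeonGetPageResponse *) response)->page,
						 slot->request_lsns.not_modified_since);
		}
	}
}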
Sorry, I do not completely understand your concerns about the "ad-hoc nature of this".
The "ad-hoc" refers to "load the data unconditionally at exactly the moment we received the data".
I think there is no good reason to load the page here: It is very possible that we process requests which shouldn't end up in the LFC (e.g. through the get_page_at_lsn function of the neon_test_utils extension), and IMO we should prefer bulk loading the responses into the LFC over loading individual pages.
My idea was to store the prefetched page in LFC ASAP to minimise the probability that some other backend will request the same page
I think that makes sense - but only if everything else we can do is already done and we'd otherwise be waiting on IO.
I just think it is quite wasteful that we don't batch these operations: now we lock the LFC 2-3x for every page we try to put into the LFC from prefetch, while IMO that should be 2-3x for a batch of pages at most. See the idea behind #10492 - it sideloads pages in bulk at the end of any IO operation (at least, the pages not yet sideloaded) and discards those pages that were successfully sideloaded or already in the LFC, so that we immediately reduce backend-local memory usage rather than keeping up to 1MB of backend memory for prefetched results that are never used because they're already loaded into the LFC.
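A sketch of the bulk side-loading approach referenced from #10492 (lfc_try_sideload(), prefetch_release_slot() and the receive-buffer fields are hypothetical names, not the actual API): at the end of an smgr IO operation, store every already-received prefetched page into the LFC in one pass, and release the slots that were stored or were already cached, freeing backend-local prefetch memory early.

static void
sideload_received_prefetch_results(void)
{
	for (int i = 0; i < MyPState->n_received; i++)
	{
		PrefetchSlot *slot = &MyPState->received[i];

		/* Returns true if the page is now (or was already) in the LFC. */
		if (lfc_try_sideload(slot->buftag, slot->page,
							 slot->not_modified_since))
			prefetch_release_slot(slot);	/* free backend-local memory */
	}
}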
I think there is no good reason to load the page here:
I think there is a very good reason for it: once we have received something from PS, make it available to other backends to avoid duplicated getpage requests. Certainly, the prefetched page will hopefully sooner or later be placed in a shared buffer and so become available through it. But:
- It can happen that prefetched pages will never actually be requested (in case of a LIMIT clause).
- It may happen too late, when the page has already been requested by other backends.
and IMO we should prefer bulk loading the responses into the LFC over loading individual pages.
It is definitely preferable for seqscan and may be rewarding there. But prefetching of referenced heap pages is more or less random (and page-based), and this may be why it provides the best improvement (larger than for seqscan).
I think that makes sense - but only if everything else we can do is already done and we'd otherwise be waiting on IO.
What kind of IO? Writing the page to the LFC file? It is written to memory and only sometimes later may be flushed to disk. I do not think that local SSD write speed will become a bottleneck.
while IMO that should be 2-3x for a batch of pages at most
Mostly agree with you. But:
a) for batch prefetch we need to change the protocol;
b) there are still non-batched requests (index scans, ...), prefetching of which may be even more important. At least we are still positioning Neon as an OLTP rather than an OLAP DBMS.
See the idea behind #10492 - it sideloads pages in bulk at the end of any IO operation
I do not think locking can be a bottleneck here... Originally I had concerns that a single global LFC lock could cause huge contention and some degradation of performance. This is why I tried to partition the hash with multiple locks... and didn't observe any noticeable positive impact on performance. Certainly, maybe my test was not good enough, or there was some problem in the implementation... Some other test may show different results.
Generally, I prefer the "extreme programming" approach: do not try to solve a problem unless you are sure that it is really a problem. Why do you think that double locking (I assume 3 or more locks will be very rare) is the problem?
Is there some test/workload which illustrates it? With this PR I am trying to address a real problem we have observed with v0 and some other clients: a lot of duplicated getpage requests. Certainly there are other factors which cause such behaviour (a wrong policy for prefetch distance for indexscan, or a too-large default prefetch distance, for example). But it still seems to be quite an obvious drawback of the current local prefetch design. I still think that local prefetch is better than having some global pool of prefetch workers. But it doesn't mean that we should not try to minimise the issues with this approach.
I once again want to note that prefetch shows the best results for heap page prefetch during indexscan. So prefetching individual pages is very important, maybe even more important than bulk (chunk) preloading.
I do not see much problem with writing to LFC what we have already received from PS. Neither synchronisation (LFC locks), neither write to LFC file shouldn't be a bottleneck.
But instead of "tasseography" I prefer to investigate real use cases or create some benchmarks to prove or disprove our assumptions and expectations.
There are several things I have observed and want to share:
- seqscan prefetch provides ~2x improvement
- indexscan prefetch for a non-clustered index provides ~3x improvement
- Efficiency of prefetch is determined not only by the client and latencies, but also by PS. Its speed depends on many factors, including layer layout and the presence of heavy background tasks such as compaction. Random accesses are much more expensive for PS, because they are most likely not cached and require access to many different layers. In some cases prefetch doesn't provide any advantage for parallel seqscan at all (just because several parallel workers can consume all the capacity of PS).
What conclusions can we make from these facts? My opinion:
- Care more about the efficiency of prefetch of individual pages than about bulk prefetch.
- Try to minimise the load on PS, at least by avoiding duplicated getpage requests.
I think that makes sense - but only if everything else we can do is already done and we'd otherwise be waiting on IO.
What kind of IO?
In prefetch_wait_for() we loop on prefetch_read(), which can block waiting for IO from page_server->receive(). I think we should first receive as much data as possible in that loop, then (once we would block) load whatever data we have into LFC, and only then start waiting on IO from PS.
Mostly agree with you. But
a) for batch prefetch we need to change the protocol
You misunderstand my point. I don't (necessarily) want to batch prefetching to PS, but I want to batch whatever operations we do on LFC, so that we don't have unnecessary locking on that data.
Example: we have prefetched 10 pages which are yet to be pageserver->receive()-d. We now need the 10th page of those (the one that was most recently smgr_prefetch()-ed).
We can iterate nine times between pageserver->receive() and lfc_prefetch(), or alternatively we can pageserver->try_receive() whichever blocks are already available, and then store them with lfc_prefetch in one batch call, allowing LFC to limit its locking, write()s, and lookups to a minimum.
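An illustrative restructuring of the wait loop along these lines; page_server->try_receive(), lfc_prefetch_batch(), prefetch_slot_received() and PREFETCH_BATCH_SIZE are assumptions for the sketch, not existing APIs:

	while (!prefetch_slot_received(target_ring_index))
	{
		NeonResponse *batch[PREFETCH_BATCH_SIZE];
		int			n = 0;

		/* 1. Receive everything that is available without blocking. */
		while (n < PREFETCH_BATCH_SIZE &&
			   (batch[n] = page_server->try_receive(shard_no)) != NULL)
			n++;

		/* 2. Side-load the whole batch into the LFC in one locking cycle. */
		if (n > 0)
			lfc_prefetch_batch(batch, n);

		/* 3. Only now block on IO from the pageserver, if still needed. */
		if (!prefetch_slot_received(target_ring_index))
			prefetch_read();	/* blocking receive of the next response */
	}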
I do not think locking can be a bottleneck here
With sufficient readahead and many (1000s of) backends, I don't see why it couldn't become a bottleneck. Sure, there are many other systems that are slow, but we use the LFC a lot, and only the "is this page in the LFC" lock is shared: all IO operations (both read and write) take an exclusive lock, even if only temporarily. So, if each backend is doing 100s of reads per minute, I wouldn't be surprised if the lfc_ctl lock showed up in the wait events.
Generally, I prefer "extreme programming" approach: do not try to solve the problem, unless you are sure that it is really a problem.
I can agree with that, yet I prefer to design my code so that it doesn't become an issue at all, or at least reduces the risk of becoming an issue. There are 2 orders of magnitude difference in "number of times we take an exclusive lock" between the optimal performance of bulk loading data into the LFC and page-by-page loading of that same data into the LFC, and I don't see how that difference would be ~"free" performance-wise.
IMV, the lack of batching here is newly added technical and performance debt, and I don't like that.
Neither synchronisation (LFC locks), neither write to LFC file shouldn't be a bottleneck.
I think your double negative here implies that both synchronization and writing to LFC file should be a bottleneck?
But regardless, I think LFC locking can definitely become a bottleneck here, so I would prefer to limit that to a minimum.
typedef struct FileCacheEntry
{
	BufferTag	key;
	uint32		hash;
	uint32		offset;
	uint32		access_count;
	uint32		bitmap[CHUNK_BITMAP_SIZE];
	uint32		state[(BLOCKS_PER_CHUNK + 31) / 32 * 2];	/* two bits per block */
} FileCacheEntry;
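One possible encoding of the two-bit per-block state[] shown above (the accessor functions and the interleaved layout are assumptions for illustration, not code from the PR): 2 adjacent bits per block, 16 blocks per uint32 word.

#define BLOCK_STATE_BITS	2
#define BLOCKS_PER_WORD		(32 / BLOCK_STATE_BITS)

static inline uint32
lfc_get_block_state(FileCacheEntry *entry, int chunk_offs)
{
	uint32		word = entry->state[chunk_offs / BLOCKS_PER_WORD];
	int			shift = (chunk_offs % BLOCKS_PER_WORD) * BLOCK_STATE_BITS;

	return (word >> shift) & 3;
}

static inline void
lfc_set_block_state(FileCacheEntry *entry, int chunk_offs, uint32 new_state)
{
	int			shift = (chunk_offs % BLOCKS_PER_WORD) * BLOCK_STATE_BITS;
	uint32	   *word = &entry->state[chunk_offs / BLOCKS_PER_WORD];

	*word = (*word & ~(3u << shift)) | (new_state << shift);
}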
I'm a bit concerned about increasing the size of LFC by ~20% (was: 64 bytes FCEntry, is now 80 bytes FCEntry (+ 24 bytes hash table overheads), result = 18% increase in size). Sure, it isn't that large, but I think it's still a significant fraction of our shared memory.
Is there a specific reason why this needs block-level locking, rather than the chunk-level locking I introduced in #10312 ?
Well, I also do not want to increase FileCacheEntry much, because it will increase the memory footprint of LFC.
This is why I tried to add just one extra bit (per page). Why not per chunk? I want to avoid false collisions caused by access to different pages of the same chunk: prefetch of one page of the chunk could block readers of all other pages. This is IMHO very undesirable, especially for seqscan, where such collisions are expected to happen very frequently.
Also please notice that there are many ways to decrease the size of FileCacheEntry which are not yet taken: for example, the list_node used for LRU can be reduced from 16 bytes to just 8 if we replace pointers with integers.
In any case, IMHO 80 bytes is still acceptable, and preventing false sharing is more critical.
the list_node used for LRU can be reduced from 16 bytes to just 8 if we replace pointers with integers
That's not impossible, but it would require us to change the structure of the LFC: right now we store our data in hashmap entries, which are not exactly arbitrarily allocated, but certainly not laid out as a single array of the entry type. If we wanted to use indexing, we'd need to store references that map into an array of FCEntries instead.
While that is not impossible, it is certainly more work than just replacing dlist with a homegrown solution, and it would add some overhead for the increased indirection.
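A sketch of the index-based linking discussed here (the compact struct and helper are hypothetical): if all FileCacheEntry structs lived in one shared array, the 16-byte dlist_node could be replaced by two 4-byte array indexes.

#define INVALID_ENTRY_INDEX	0xFFFFFFFFu

typedef struct FileCacheEntryCompact
{
	BufferTag	key;
	uint32		hash;
	uint32		offset;
	uint32		access_count;
	uint32		lru_prev;		/* index into the shared entries[] array */
	uint32		lru_next;		/* INVALID_ENTRY_INDEX terminates the list */
	/* ... bitmap/state fields as before ... */
} FileCacheEntryCompact;

/* Unlink an entry from the LRU list, given the shared entries array. */
static void
lru_unlink(FileCacheEntryCompact *entries, uint32 idx)
{
	FileCacheEntryCompact *e = &entries[idx];

	if (e->lru_prev != INVALID_ENTRY_INDEX)
		entries[e->lru_prev].lru_next = e->lru_next;
	if (e->lru_next != INVALID_ENTRY_INDEX)
		entries[e->lru_next].lru_prev = e->lru_prev;
	e->lru_prev = e->lru_next = INVALID_ENTRY_INDEX;
}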
Force-pushed from 071142f to 582953e
Problem
Prefetch is performed locally, so different backends can request the same pages from PS. Such duplicated requests increase the load on the page server and the network traffic.
Making prefetch global seems to be very difficult and undesirable, because different queries can access chunks at different speeds. Storing prefetched chunks in LFC will not completely eliminate duplicates, but can minimise such requests.
The problem with storing prefetch results in LFC is that in this case the page is not protected by a shared buffer lock, so we have to perform extra synchronisation on the LFC side.
See:
https://neondb.slack.com/archives/C0875PUD0LC/p1736772890602029?thread_ts=1736762541.116949&cid=C0875PUD0LC
@MMeent implementation of prewarm:
See #10312
Summary of changes
Use condition variables to synchronize access to LFC entries.
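A minimal sketch of this synchronization, built on PostgreSQL's ConditionVariable API (the block-state values, the wait-event ID and the single LFC-wide CV are assumptions for illustration; the locking around state checks is elided):

#include "storage/condition_variable.h"

ConditionVariable lfc_cv;	/* assumed: one CV in the shared LFC control struct */

/* Reader: wait until a concurrently side-loaded block becomes valid. */
static void
lfc_wait_for_block(FileCacheEntry *entry, int chunk_offs)
{
	ConditionVariablePrepareToSleep(&lfc_cv);
	while (lfc_get_block_state(entry, chunk_offs) == BLOCK_STATE_WRITE_PENDING)
		ConditionVariableSleep(&lfc_cv, WAIT_EVENT_NEON_LFC);	/* assumed event ID */
	ConditionVariableCancelSleep();
}

/* Writer: publish a prefetched block after copying it into the LFC file. */
static void
lfc_publish_block(FileCacheEntry *entry, int chunk_offs)
{
	lfc_set_block_state(entry, chunk_offs, BLOCK_STATE_VALID);
	ConditionVariableBroadcast(&lfc_cv);
}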