
Epic: Storage perf improvements #3401

Closed · 1 of 2 tasks
shanyp opened this issue Jan 23, 2023 · 5 comments
Labels: c/compute (Component: compute, excluding postgres itself), c/storage (Component: storage), t/Epic (Issue type: Epic)

shanyp commented Jan 23, 2023

Motivation

See https://docs.google.com/document/d/1GcmSW9_DXHou3tuezhjKyL2-ZNuxdRTAaKoi1BC08ns/edit#heading=h.1r2wls36zj0n

DoD

Implementation ideas

Tasks

Other related tasks and Epics

knizhnik commented:

Lemmas:

  1. In most cases Neon accesses the most recent layers. Exceptions are read-only replicas (not currently supported) and branches (not widely used).
  2. We have to store delta layers for quite a long time because of PiTR (currently 7 days).
  3. Accessing a page image is about two times faster than page reconstruction.

So what can we do, based on these lemmas, to improve performance and support large databases?
The first obvious thing we should try is to improve the layer map and make it possible to efficiently locate the layer containing the required page data without traversing all layers. #3368 is the first step in this direction.
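To make the idea concrete, here is a minimal sketch of such a lookup (the types and names are illustrative, not the actual pageserver `LayerMap` API): index layers by the start of their key range, so a lookup only inspects layers whose range can contain the key instead of scanning every layer of the timeline.

```rust
use std::collections::BTreeMap;
use std::ops::Range;

// Illustrative only: a layer described by its key range and LSN range.
#[derive(Clone, Debug)]
struct LayerDesc {
    key_range: Range<u64>,
    lsn_range: Range<u64>,
    is_image: bool,
}

#[derive(Default)]
struct LayerIndex {
    // Keyed by the start of each layer's key range; several layers can share
    // the same start key, hence the Vec.
    by_key_start: BTreeMap<u64, Vec<LayerDesc>>,
}

impl LayerIndex {
    fn insert(&mut self, layer: LayerDesc) {
        self.by_key_start
            .entry(layer.key_range.start)
            .or_default()
            .push(layer);
    }

    /// Among layers whose key range contains `key` and whose LSN range starts
    /// at or below `lsn`, return the one with the highest LSN range end.
    /// Only layers whose key range starts at or before `key` are visited.
    fn lookup(&self, key: u64, lsn: u64) -> Option<&LayerDesc> {
        self.by_key_start
            .range(..=key)
            .flat_map(|(_, layers)| layers.iter())
            .filter(|l| l.key_range.contains(&key) && l.lsn_range.start <= lsn)
            .max_by_key(|l| l.lsn_range.end)
    }
}
```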

The second important question is whether we need compaction (reshuffling) and at which level.
The first obvious solution is to replace reshuffling with page reconstruction, i.e. instead of transforming L0 delta layers into L1 delta layers, we can produce partial image layers instead (#2563). We can produce partial image layers ASAP (when an L0 layer is saved to disk) or with some delay, combining several L0 delta layers (as compaction currently does). So the storage will consist of full-key-range L0 deltas, partial image layers, and full image layers, which can be generated rarely (for example after every 10-20 delta layers).

Please notice that with this approach WAL records are stored only in full-key-range L0 delta layers. So the layer map cannot really help us locate the WAL records associated with a page in this case. But the assumption is that this is rarely needed, and that the last written LSN cache and a layer map with information about holes can speed up the search among partial image layers.

If reshuffling is still considered important, then it should be performed over more than 6 L0 layers. Because in this case, after doing a lot of expensive disk IO, we still get only 6 delta layers for the whole key range. For large databases this is definitely not enough and does not seem to differ significantly from having just one (L0) layer for the whole key range. We also ran into the problem of a too-small last delta layer produced during reshuffling (#3393). So from my point of view we should either make reshuffling more flexible and adaptive, or eliminate it altogether and generate partial image layers instead.


knizhnik commented Jan 26, 2023

More thoughts about compaction/reshuffling. Right now it is performed only for 6 L0 layers. The result of compaction also consists of 6 layers, but split vertically (by key range) rather than horizontally (by LSN).
This constant 6 looks really strange (like any other constant N). It means that no matter how big the database is, we always split it into 6 segments by key range. From a theoretical point of view this N is no different from 1.

So reshuffling a fixed number of L0 layers seems to be quite a strange idea (unless we want to optimize our storage for some particular size). What are the alternatives?

  1. Do not perform compaction at all. Instead, generate partial image layers, or even full image layers, assuming we are able to ignore holes in sparse delta layers and avoid generating excess image layers.
  2. Let N depend on database size, so that for larger databases we accumulate more L0 layers.
  3. Hierarchical merge of layers (as in a classical LSM).
  4. Somehow estimate the sparsity of a layer and, if it exceeds some threshold, add the layer to the compaction list. Something like this is what @bojanserafimov tried to implement (a rough sketch of such a check follows below).
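To illustrate option 4, here is a rough sketch of what such a sparsity check could look like (the struct, names, and threshold are made up for illustration):

```rust
// Illustrative only: summarize how densely a delta layer covers its key range
// and queue it for compaction only when the coverage drops below a threshold.
struct DeltaLayerSummary {
    key_range_start: u64,
    key_range_end: u64,
    /// Number of distinct keys that actually have entries in this layer.
    distinct_keys: u64,
}

/// Fraction of the layer's nominal key range that actually contains data.
fn density(layer: &DeltaLayerSummary) -> f64 {
    let span = (layer.key_range_end - layer.key_range_start).max(1);
    layer.distinct_keys as f64 / span as f64
}

/// Add the layer to the compaction list only if it is sparser than
/// `min_density` (e.g. 0.1 = "less than 10% of the key range is populated").
fn needs_compaction(layer: &DeltaLayerSummary, min_density: f64) -> bool {
    density(layer) < min_density
}
```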

Any other ideas?


bojanserafimov commented Jan 26, 2023

The second important question is whether we need compaction (reshuffling) and at which level ... The first obvious solution is to replace reshuffling with page reconstruction, i.e. instead of transforming L0 delta layers into L1 delta layers, we can produce partial image layers instead

Compaction only helps with pageserver recovery, tenant migration, and possibly unlimited PITR (PITR from S3). If we don't do any compaction and the pageserver's disk dies, we'll need to download 7 days' worth of data before we can start serving reads (or, depending on how we do partial images, it could be better). But with our current algorithm, we just need 1 image, 3 L1 deltas and 6 L0 deltas to serve the first read, and probably any reads in the same key region. So we can start serving and lazily recover. Obviously compaction is not the only solution, but if we're dropping it we need to consider this problem.

But yes, compaction is probably irrelevant for read performance. Without compaction we can just use L0s and partial images. As you noted, in this case we need a more accurate layer map, but that's a solvable problem (more on this later).

Back when we were debating LSM vs alternatives, this L0-only strategy was my preferred approach. Heikki called it "Index+WAL scheme". I didn't have a good solution for recovery, so I accepted the LSM tree approach. Both approaches needed L0 layers at least, so that was a good move.

Please notice that with this approach WAL records are stored only in full-key-range L0 delta layers. So the layer map cannot really help us locate the WAL records associated with a page in this case. But the assumption is that this is rarely needed, and that the last written LSN cache and a layer map with information about holes can speed up the search among partial image layers.

Yes. Another approach I had in mind is to store inside the delta layer some hints about where to find previous entries for that page. So we can look up the last written LSN cache to find the latest entry, and from there load the hints and look up the historical entries. What are these hints precisely? Maybe a pointer to the previous entry. Maybe also pointers to the 2nd, 4th, 8th, etc. previous entries so that we can binary search in the LSN dimension and answer non-latest queries. The specifics don't matter; we have options.
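To make the hint idea a bit more concrete, here is a rough sketch (none of these types exist in the pageserver today; the names and the `load` callback are hypothetical):

```rust
/// Location of a delta entry: (layer file id, byte offset within that layer).
type EntryRef = (u64, u64);

struct DeltaEntry {
    lsn: u64,
    /// back_pointers[i] refers to the 2^i-th previous entry for this page.
    back_pointers: Vec<EntryRef>,
}

/// Find the newest entry whose LSN is <= `target_lsn`, starting from the
/// latest entry for the page (found via the last written LSN cache) and
/// jumping backwards skip-list style. `load` is a hypothetical callback that
/// reads one entry from its layer file.
fn find_version(mut at: EntryRef, target_lsn: u64, load: &dyn Fn(EntryRef) -> DeltaEntry) -> EntryRef {
    loop {
        let entry = load(at);
        if entry.lsn <= target_lsn {
            return at; // already at or below the requested LSN
        }
        // Take the farthest back-pointer that does not jump below the target;
        // otherwise fall back to the immediate predecessor (index 0).
        let next = entry
            .back_pointers
            .iter()
            .rev()
            .find(|&&p| load(p).lsn >= target_lsn)
            .or_else(|| entry.back_pointers.first());
        match next {
            Some(&p) => at = p,
            None => return at, // no older entries exist for this page
        }
    }
}
```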

If reshuffling is still considered important, then it should be performed over more than 6 L0 layers. Because in this case, after doing a lot of expensive disk IO, we still get only 6 delta layers for the whole key range. For large databases this is definitely not enough and does not seem to differ significantly from having just one (L0) layer for the whole key range.

Yes, 6 feels arbitrary. It should be a constant though. Why it is different from the 3 we use for L1 compaction, I'm not sure. Both have the same effect of limiting recovery difficulty, and I'd expect them to be equal, the way a BTree has the same number of items per node on each level.

Btw have you tried digging holes in the L0 layers too? And then only doing L1 compaction when we have 6 (or whatever) overlapping L0s? If that has any effect, it's a free win.
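What I mean by digging holes, roughly (the function name and gap threshold are illustrative only): scan a layer's sorted keys and record every gap wider than some threshold, so the layer map can skip that layer for keys that fall into a gap.

```rust
use std::ops::Range;

/// Given the sorted list of keys present in a layer, return the key ranges
/// wider than `min_gap` that contain no entries at all ("holes").
fn find_holes(sorted_keys: &[u64], min_gap: u64) -> Vec<Range<u64>> {
    sorted_keys
        .windows(2)
        .filter(|w| w[1] - w[0] > min_gap)
        .map(|w| (w[0] + 1)..w[1])
        .collect()
}
```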

I suspect these constants (3 and 6) have a lot to do with how we do GC. We create too many images on purpose so that we don't need to complicate the GC code, and we can rely on the fact that eventually things get covered by images. With some more GC nuance we can start relaxing these numbers. For example, when a layer needs GC but it's not covered, we should compare the cost (in money) of keeping it vs covering it, and go with whatever is cheaper.
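A toy version of that cost comparison (all names are made up; real S3 pricing, layer sizes, and retention estimates would have to be plugged in):

```rust
// Illustrative cost model: when a layer is past the GC horizon but not yet
// covered by an image layer, compare the cost of keeping it in object storage
// for its expected remaining lifetime against the one-time cost of producing
// an image layer that covers it.
struct CostModel {
    /// $ per GiB per month of object storage.
    storage_per_gib_month: f64,
    /// Estimated one-time $ cost of generating and uploading a covering image.
    image_generation_cost: f64,
    /// Months we expect to retain the layer if we keep it.
    retention_months: f64,
}

enum GcDecision {
    KeepLayer,
    CoverWithImage,
}

fn decide(model: &CostModel, layer_size_gib: f64) -> GcDecision {
    let keep_cost = layer_size_gib * model.storage_per_gib_month * model.retention_months;
    if keep_cost <= model.image_generation_cost {
        GcDecision::KeepLayer
    } else {
        GcDecision::CoverWithImage
    }
}
```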

  2. Let N depend on database size, so that for larger databases we accumulate more L0 layers.

Not a fan of this idea. It means pageserver recovery startup time would depend on database size. Also, one more alternative is to continue with L2, L3 compaction, etc. Not advocating for it, just including it for completeness.


Let's find some small steps we can take. I'm optimistic about the hole-digging approach for two reasons:

  1. Compaction only optimizes for creating ~equal file sizes, so it doesn't end up creating holes between the L1s. Having a separate mechanism for it is a good idea.
  2. We chose to use large layer files to optimize S3 costs. But with these holes we can work around that and effectively use a larger number of "layers" if necessary. We have more flexibility.


bojanserafimov commented Jan 26, 2023

Another problem with "no compaction" maximalism is that we have no defense against uniform random writes. The best thing we can do is create extremely scattered partial image layers. But the lack of any locality here seems like a problem. It means that if the pageserver gets a sequential scan during recovery, it won't be able to keep up with it. It will get stuck trying to download 7 days of layers. How real is this problem in practice? Not sure. Maybe the first step here is to test pageserver recovery. It's a good idea to do that anyway.

knizhnik commented:

Compaction only helps with pageserver recovery, tenant migration, and possibly unlimited PITR (PITR from S3)

Hmmm... It seems to me that the main goal of compaction is to reduce the number of layers that have to be accessed for page reconstruction.
Assume that some page is frequently updated, so we will find WAL records affecting this page in all L0 layers. Without compaction, to reconstruct this page we need to read all 6 (or however many) of those layers. After compaction, all WAL records related to this page will be located in one layer file. So if we compact N L0 layers, then in the worst case we need to fetch N times fewer delta layers to perform page reconstruction. But that is only if we do not take image layers into account. If we produce a full or partial image layer after every K delta layers, then the number of accessed layers will be limited by K, no matter which value of N we use. So with N=1, K=6 (which means no compaction) the worst case is fetching 6 layers, the same as with N=6, K=6, or even N=6, K=3 (the current default settings).

I do not see much difference with "pageserver recovery, tenant migration, and PiTR". In all these cases layers are accessed to perform page reconstruction. Yes, when a layer is not available locally at the pageserver (due to a pageserver crash, local storage corruption, or migration), we have to download it from S3 and the price of a layer access is significantly higher.
But in any case, the number of accessed layer files matters. As I wrote above, it depends not only on compaction (reshuffling) but also on the materialization policy. PiTR does the same, except that reconstruction starts not from the most recent LSN but from some LSN in the past. In any case we need to traverse some number of delta layers before we reach an image layer.

But with our current algorithm, we just need 1 image, 3 L1 deltas and 6 L0 deltas to serve the first read, and probably any reads in the same key region.

Yes, that is true. But if we do not perform compaction (reshuffling) and just produce a new image layer after every 6 L0 layers, then to serve the first request we need 1 image and 6 L0 deltas: that is even fewer layers. So it contradicts your statement that compaction is needed for recovery.

But yes, compaction is probably irrelevant for read performance.

And once again, quite the opposite. If we have to read 1 L1 delta layer instead of 6 L0 layers, then we speed up page reconstruction almost 6 times, and overall performance also increases up to 6 times.

So the question is whether it makes sense to produce L1 delta layers (do reshuffling) or immediately generate image layers instead (dense or sparse).

Yes. Another approach I had in mind is to store inside the delta layer some hints about where to find previous entries for that page. So we can look up the last written LSN cache to find the latest entry, and from there load the hints and look up the historical entries. What are these hints precisely? Maybe a pointer to the previous entry. Maybe also pointers to the 2nd, 4th, 8th, etc. previous entries so that we can binary search in the LSN dimension and answer non-latest queries. The specifics don't matter; we have options.

Maybe it makes sense, maybe not... Instead of trying to maintain a chain of layers, we may prefer to simply cut this chain by generating a page image. Ideally we should read no more than one delta layer to reconstruct a page.
Also, I do not completely understand how to efficiently implement your proposal: maintain yet another last written LSN cache at the pageserver, or somehow propagate this value from compute?

There are two polar cases. One is initial table population, where there is a large number of consecutive WAL records belonging to the same page whose total size exceeds the size of a page image. In this case we should try to store a page image for this page ASAP. Most likely these WAL records will not be needed at all, even in case of PiTR, because nobody needs a partly initialized page.
The opposite case is random updates: here we will have just one or a few WAL records per page, and even reshuffling will not help much. Should we generate a page image if we have just 1 or 6 WAL records? We can use some threshold and generate partial (sparse) image layers, but it is still unclear what is better: traversing 6 delta layers or generating a page image for just 6 updates.
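One way to express this as a policy, as a rough sketch only (8 KiB is the Postgres page size; the rest of the names and the second threshold are hypothetical):

```rust
/// Postgres page (and hence page image) size.
const PAGE_IMAGE_SIZE: u64 = 8 * 1024;

/// Per-page bookkeeping while ingesting WAL.
struct PageWalStats {
    /// Total size in bytes of WAL records accumulated for this page since the
    /// last page image was written.
    pending_wal_bytes: u64,
    /// Number of delta layers those records are spread across.
    delta_layers_touched: u32,
}

/// Materialize a page image as soon as the queued WAL for the page outweighs
/// a full image (the bulk-load case), or once reconstruction would have to
/// visit too many delta layers (the random-update case).
fn should_materialize(stats: &PageWalStats, max_delta_layers: u32) -> bool {
    stats.pending_wal_bytes >= PAGE_IMAGE_SIZE
        || stats.delta_layers_touched >= max_delta_layers
}
```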

Yes, 6 feels arbitrary. It should be a constant though.

Why should it be a constant? I am not sure, but it seems more natural to make it depend on database size, if an L1 layer represents a 1/N-th part of the database keyspace.

Btw have you tried digging holes in the L0 layers too? And then only doing L1 compaction when we have 6 (or whatever) overlapping L0s? If that has any effect, it's a free win.

As you know yourself, there are frequently updated keys at both ends of the key dimension. Yes, maintaining holes allows us not to generate excess image layers, but most likely there will be overlaps between L0 layers in any case. So we will always have to perform compaction, unless we have some threshold for the number (or size?) of overlapping regions.
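Such a threshold could, for example, look at how much the populated (non-hole) key ranges of the accumulated L0 layers actually overlap. A rough sketch with made-up names:

```rust
use std::ops::Range;

/// Total width of the pairwise overlaps between the populated key ranges of
/// the accumulated L0 layers (each layer summarized by its non-hole ranges).
fn total_overlap(layers: &[Vec<Range<u64>>]) -> u64 {
    let mut overlap = 0u64;
    for (i, a) in layers.iter().enumerate() {
        for b in &layers[i + 1..] {
            for ra in a {
                for rb in b {
                    let lo = ra.start.max(rb.start);
                    let hi = ra.end.min(rb.end);
                    overlap += hi.saturating_sub(lo);
                }
            }
        }
    }
    overlap
}

/// Only trigger compaction of the accumulated L0 layers once their overlap
/// exceeds some budget.
fn should_compact(layers: &[Vec<Range<u64>>], overlap_budget: u64) -> bool {
    total_overlap(layers) > overlap_budget
}
```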

Not a fan of this idea. It means pageserver recovery startup time would depend on database size.

Doubtful. If I do not take image layers into account, then the larger the database is, the larger the probability that N updates of the same page will be stored in N different delta layers and so require loading N layers for page reconstruction. Increasing the compaction threshold may help to group more WAL records belonging to the same page together.

Another problem with "no compaction" maximalism is that we have no defense against uniform random writes.

Looks like no reshuffling or page reconstruction policy can be efficient for random updates of a huge database.
There will be just one or a few WAL records belonging to a particular page, but locating them may require traversing a large number of layers. Maintaining information about holes will not help much in this case: there will be a large number of ~same-size holes.
