
Possible regression: small L1s on main #3393

Closed
bojanserafimov opened this issue Jan 20, 2023 · 14 comments

Comments

@bojanserafimov
Contributor

After running pgbench 10gb I get some 24KB L1 deltas. Here are some of their filenames:

 000000067F0000334E0000400C01FFFFFFFF-000000067F0000334E0000400C0200000000__00000000BF793F68-00000000CC3E5D21
 000000067F0000334E0000400C01FFFFFFFF-000000067F0000334E0000400C0200000000__00000000CDB966A0-000000017B1A9298
 000000067F0000334E0000400C01FFFFFFFF-000000067F0000334E0000400C0200000000__00000001A5DACFD0-00000002559BFD10
 000000067F0000334E0000400C01FFFFFFFF-000000067F0000334E0000400C0200000000__000000017B1A9298-00000001A58CAFA1
 000000067F0000334E0000400C01FFFFFFFF-000000067F0000334E0000400C0200000000__00000002559BFD10-00000002706403E9
 000000067F0000334E0000400C01FFFFFFFF-000000067F0000334E0000400C0200000000__00000000017729F8-00000000BF793F68

More context:

Originally posted by @bojanserafimov in #3348 (comment)

@knizhnik
Contributor

Isn't this the expected result of the "greedy" compaction algorithm? It just tries to produce layers of the specified size (compaction_target), so the last layer can be arbitrarily small.
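
Purely for illustration (a sketch of the idea, not the pageserver's actual compaction code): greedy size-based splitting fills every chunk up to compaction_target and leaves whatever remains in the final chunk, which is why the tail layer can be tiny.

```rust
/// Illustration only: greedily cut a stream of value sizes into layers of at
/// most `target_bytes`. Every layer except the last is "full", so the last
/// one can be arbitrarily small (e.g. the 24KB files above).
fn greedy_split(sizes: &[u64], target_bytes: u64) -> Vec<Vec<u64>> {
    let mut layers = Vec::new();
    let mut current: Vec<u64> = Vec::new();
    let mut current_bytes = 0u64;
    for &sz in sizes {
        if !current.is_empty() && current_bytes + sz > target_bytes {
            layers.push(std::mem::take(&mut current));
            current_bytes = 0;
        }
        current.push(sz);
        current_bytes += sz;
    }
    if !current.is_empty() {
        layers.push(current);
    }
    layers
}

fn main() {
    // 11 values of 3 "bytes" with a target of 8: five full layers of 6 bytes
    // and one trailing layer of only 3 bytes -- the arbitrarily small tail.
    for layer in greedy_split(&[3; 11], 8) {
        println!("layer bytes = {}", layer.iter().sum::<u64>());
    }
}
```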

@bojanserafimov
Contributor Author

Maybe. But it reproduces pretty consistently: I used to never see it, and now I always see it. And it comes with a few other anomalies:

  1. Usually L1 deltas come in batches, all with the same LSN range. But these layers have unique LSN ranges. They look more like L0s with a strangely narrow key range
  2. It looks like the majority of the updates from the random-update phase are in a different key range from the init phase. And then most of the random-update L1s get covered by a single image layer
  3. The total number of deltas increased

Not sure how to debug this, other than git bisect

@bojanserafimov
Contributor Author

bojanserafimov commented Jan 20, 2023

Correction: If it exists, the regression is between faf1d20 (good) and fe8cef3 (bad)

@bojanserafimov
Contributor Author

The first bad commit is 12e6f44

So there's no regression, just the test changed. Is it expected that server-side generation during init should create 50% more deltas (among all the other differences in behavior)? I'd expect the pageserver workload to be the same.

@knizhnik
Contributor

There are three possible cases where delta layers may be smaller than the specified size:

  1. It is the last segment produced by compaction. The last layer may be arbitrarily small.
  2. There are a lot of updates to the same page, so one key occupies more than one segment. In this case we split the range of such a key by LSN, and such "exclusive" layers may contain only values of this key. The last layer may be arbitrarily small (see the sketch after this list).
  3. A partially filled in-memory layer is flushed to disk because of a forced checkpoint or a pageserver shutdown request.

Only 1) is applicable to L1 layers; 2) and 3) apply to L0 layers.
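
As a hedged illustration of case 2 (again a sketch, not the actual code): when a single hot key has more deltas than fit in one layer, its records are cut along the LSN axis into target-sized pieces, and the trailing piece keeps only the leftovers.

```rust
/// Illustration of "case 2": one hot key has so many WAL records that they are
/// cut into several per-key delta layers along the LSN axis; the cut happens
/// roughly every `target_bytes`, so the final piece can be tiny.
fn split_hot_key_by_lsn(record_lsns: &[u64], record_bytes: u64, target_bytes: u64) -> Vec<(u64, u64)> {
    let per_layer = (target_bytes / record_bytes).max(1) as usize;
    record_lsns
        .chunks(per_layer)
        .map(|chunk| (chunk[0], *chunk.last().unwrap()))
        .collect()
}

fn main() {
    // 11 records of 3 "bytes" for one key with an 8-byte target: five LSN
    // ranges of 2 records each, plus one tiny range holding the leftover record.
    let lsns: Vec<u64> = (0..11u64).map(|i| 0x1000 + i * 0x10).collect();
    println!("{:?}", split_hot_key_by_lsn(&lsns, 3, 8));
}
```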

In the case of pgbench we have case 1):

$ ls -l --sort=size | tail -n 20 
-rw-r--r-- 1 admin admin 134422528 Jan 18 06:26 000000067F000032AC0000400C0000024000-000000067F000032AC0000400C0000028000__000000040052C0A0
-rw-r--r-- 1 admin admin 134422528 Jan 18 06:20 000000067F000032AC000040140000000000-000000067F000032AC000040140000004000__0000000223643E60
-rw-r--r-- 1 admin admin 134422528 Jan 18 06:26 000000067F000032AC000040140000000000-000000067F000032AC000040140000004000__000000040052C0A0
-rw-r--r-- 1 admin admin  91774976 Jan 18 06:26 000000067F000032AC000040140000004000-030000000000000000000000000000000002__000000040052C0A0
-rw-r--r-- 1 admin admin  90578944 Jan 18 06:20 000000067F000032AC000040140000004000-000000067F000032AC000040140100000000__0000000223643E60
-rw-r--r-- 1 admin admin  78757888 Jan 18 06:20 000000067F000032AC000040160000000000-030000000000000000000000000000000002__0000000223643E60
-rw-r--r-- 1 admin admin  70778880 Jan 18 06:15 000000067F000032AC0000401600000002AA-030000000000000000000000000000000002__0000000001696070-000000009BFADCB1
-rw-r--r-- 1 admin admin  26279936 Jan 18 06:26 000000067F000032AC0000400C0000028000-000000067F000032AC000040120100000000__000000040052C0A0
-rw-r--r-- 1 admin admin  25108480 Jan 18 06:20 000000067F000032AC0000400C0000028000-000000067F000032AC000040120100000000__0000000223643E60
-rw-r--r-- 1 admin admin  24969216 Jan 18 06:19 000000000000000000000000000000000000-000000067F000032AC000032090100000000__0000000223643E60
-rw-r--r-- 1 admin admin  24969216 Jan 18 06:25 000000000000000000000000000000000000-000000067F000032AC000032090100000000__000000040052C0A0
-rw-r--r-- 1 admin admin  24166400 Jan 18 06:20 000000067F000032AC000040160000002B1C-030000000000000000000000000000000002__00000001EADBC421-000000029A4EECB1
-rw-r--r-- 1 admin admin  24150016 Jan 18 06:18 000000067F000032AC000040160000002045-030000000000000000000000000000000002__000000013B80F789-00000001EADBC421
-rw-r--r-- 1 admin admin  22085632 Jan 18 06:22 000000067F000032AC00004016000000354A-030000000000000000000000000000000002__000000029A4EECB1-0000000339D36F11
-rw-r--r-- 1 admin admin  21143552 Jan 18 06:16 000000067F000032AC00004016000000159E-030000000000000000000000000000000002__000000009BFADCB1-000000013B80F789
-rw-r--r-- 1 admin admin  18407424 Jan 18 06:24 000000067F000032AC000040160000003D1D-030000000000000000000000000000000002__0000000339D36F11-00000003D96C7369
-rw-r--r-- 1 admin admin     49152 Jan 18 07:25 000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000407730F01-0000000407735701
-rw-r--r-- 1 admin admin     40960 Jan 18 07:01 000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000004077288D1-000000040772CB61
-rw-r--r-- 1 admin admin     40960 Jan 18 07:15 000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__000000040772CB61-0000000407730F01
-rw-r--r-- 1 admin admin       512 Jan 18 07:25 metadata

Let's look at one of the delta layers with size less than 128MB and search for its LSN range:

$ ls -l *0000000339D36F11-00000003D96C7369
-rw-r--r-- 1 admin admin 268451840 Jan 18 06:24 000000067F000032AC000000000000000001-000000067F000032AC0000400C000000435F__0000000339D36F11-00000003D96C7369
-rw-r--r-- 1 admin admin 268443648 Jan 18 06:24 000000067F000032AC0000400C000000435F-000000067F000032AC0000400C00000085FF__0000000339D36F11-00000003D96C7369
-rw-r--r-- 1 admin admin 268451840 Jan 18 06:24 000000067F000032AC0000400C00000085FF-000000067F000032AC0000400C000000C8B5__0000000339D36F11-00000003D96C7369
-rw-r--r-- 1 admin admin 268451840 Jan 18 06:24 000000067F000032AC0000400C000000C8B5-000000067F000032AC0000400C0000010B96__0000000339D36F11-00000003D96C7369
-rw-r--r-- 1 admin admin 268435456 Jan 18 06:24 000000067F000032AC0000400C0000010B96-000000067F000032AC0000400C0000014EC0__0000000339D36F11-00000003D96C7369
-rw-r--r-- 1 admin admin 268451840 Jan 18 06:24 000000067F000032AC0000400C0000014EC0-000000067F000032AC0000400C00000191AC__0000000339D36F11-00000003D96C7369
-rw-r--r-- 1 admin admin 268451840 Jan 18 06:24 000000067F000032AC0000400C00000191AD-000000067F000032AC0000400C000001D47E__0000000339D36F11-00000003D96C7369
-rw-r--r-- 1 admin admin 268443648 Jan 18 06:24 000000067F000032AC0000400C000001D47E-000000067F000032AC0000400C000002173E__0000000339D36F11-00000003D96C7369
-rw-r--r-- 1 admin admin 268451840 Jan 18 06:24 000000067F000032AC0000400C000002173E-000000067F000032AC0000400C0000025A85__0000000339D36F11-00000003D96C7369
-rw-r--r-- 1 admin admin 268451840 Jan 18 06:24 000000067F000032AC0000400C0000025A85-000000067F000032AC000040160000003D1D__0000000339D36F11-00000003D96C7369
-rw-r--r-- 1 admin admin  18407424 Jan 18 06:24 000000067F000032AC000040160000003D1D-030000000000000000000000000000000002__0000000339D36F11-00000003D96C7369

As you can see, it is the last segment in the key range.

So it is one more argument against compaction: right now compaction may produce arbitrarily small L1 layers, which is not good.

@bojanserafimov
Contributor Author

So there's no regression, just the test changed.

@bojanserafimov
Contributor Author

In the case of pgbench we have case 1):

In my case (the filenames I posted in the issue) it looks more like case 2. The layers are at the end of the key range, but their short LSN ranges and the way they stack on top of each other can only be explained by case 2. Picture (see the 3 layers on the bottom right):
https://user-images.githubusercontent.com/8680233/213776617-0cd5189e-3db6-4e94-9ce9-0be63c9cce43.png

I wonder when we split by LSN for case 2, what's our file size threshold? Hopefully not 24KB. That would be too small. But at the same time, if it's larger, then we won't make image layers for the hot page, and reads will get very expensive. We need to inject these tall layers with page images for the hot page.
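
Purely as a hypothetical illustration of that heuristic (not existing pageserver behavior; the function name and threshold are made up):

```rust
/// Hypothetical sketch only: decide whether a hot key deserves a fresh image,
/// based on how many deltas have piled up since the last image was written.
/// `threshold` is an assumed tuning knob, not an existing pageserver setting.
fn should_materialize_image(deltas_since_last_image: usize, threshold: usize) -> bool {
    deltas_since_last_image >= threshold
}

fn main() {
    // With a threshold of 100, a hot page whose chain has reached 120 deltas
    // would get a fresh image even though the covering layers are tiny.
    assert!(should_materialize_image(120, 100));
    assert!(!should_materialize_image(40, 100));
    println!("heuristic ok");
}
```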

@bojanserafimov
Contributor Author

We need to inject these tall layers with page images for the hot page.

A very useful number to have for such heuristics: For what number X is the latency of redo IPC equal to the cost of redoing X records?
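
To make the question concrete, a toy calculation with made-up numbers (both figures below are assumptions, not measurements): X is just the IPC round-trip latency divided by the marginal cost of replaying one more record.

```rust
/// Hypothetical numbers only: the break-even record count X at which one
/// walredo IPC round trip costs as much as replaying X extra WAL records.
fn break_even_records(ipc_round_trip_us: f64, redo_per_record_us: f64) -> f64 {
    ipc_round_trip_us / redo_per_record_us
}

fn main() {
    // e.g. a 50us round trip and 0.5us per record would give X = 100,
    // i.e. chains shorter than ~100 records are dominated by IPC overhead.
    println!("X = {}", break_even_records(50.0, 0.5));
}
```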

@knizhnik
Contributor

knizhnik commented Feb 8, 2023

I wonder when we split by LSN for case 2, what's our file size threshold? Hopefully not 24KB.

The threshold is the same as for any other delta layer: 256MB. But the last layer with duplicates can be arbitrarily small, because we cannot place any other keys in this layer.

@knizhnik
Contributor

knizhnik commented Feb 8, 2023

A very useful number to have for such heuristics: For what number X is the latency of redo IPC equal to the cost of redoing X records?

Sorry, what does IPC stand for?

@bojanserafimov
Contributor Author

The threshold is the same as for any other delta layer: 256MB.

Hmm, then maybe this is not case 2. But it also doesn't look like case 1, based on the LSN range (usually layers come in batches, but these three are not part of a batch).

Sorry, what does IPC stand for?

Interprocess communication. Or is that no longer important after async walredo (I haven't reviewed that PR)?

@knizhnik
Contributor

knizhnik commented Feb 8, 2023

I do not think that comparing redo time with IPC latency makes a lot of sense.
Redo is very fast. I do not remember the precise numbers right now, but it is fast.
If we just replay WAL records from a file (grab the output of the pageserver and then feed that file as input to the walredo process), then the elapsed time is about 5x smaller than the elapsed time spent in walredo in the pageserver.
The current async pipe, as well as the shmem pipe, can reduce communication latency when there are several parallel requests. In this case it is possible to eliminate the request-response loop and improve speed about 2x.

The question "to redo or not to redo" is really very difficult. I think that success of Neon will greatly depends on whether we find answer for this question. There are lot of different aspects.

Assume that some page is frequently updated, so there are a large number of WAL records associated with this page. Should we force reconstruction of this page? If the page is frequently updated, then most likely it is kept in shared buffers, so compute will not retrieve it from the pageserver; redoing it can then be considered useless work. Also, since the page is frequently updated, the reconstructed image will deteriorate very fast, and we will still need to perform WAL redo. I do not remember the precise numbers, but it looks like applying 100 WAL records takes almost the same time as applying 1, so there is not much sense in reconstructing a page too frequently. On the other side, if the chain of WAL records becomes too long, then fetching and applying it may take a lot of time.
And a final note: Postgres periodically performs checkpoints (although they are not so needed for the Neon architecture). With full_page_writes enabled (and it is enabled by default), the first update after a checkpoint causes a full page image (FPI) to be written to the WAL. This efficiently breaks the redo chain without the pageserver needing to perform page reconstruction.

Ideally, a page should be reconstructed right before it is requested by compute. But that is hard to predict. If a page is thrown away from shared buffers, it doesn't mean that it will be accessed in the near future and that compute will need to send a get_page_at_lsn request to the pageserver.
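
A rough model of the tradeoff described above (a simplification for illustration, not the pageserver's actual read path): serving a page means walking back to the newest base image, whether that is an image layer entry or a post-checkpoint FPI, and replaying every delta recorded after it.

```rust
/// Simplified model of the read path discussed above: the cost of serving a
/// page is finding the newest base image (image layer entry or FPI) and
/// replaying every delta after it. A full-page write after a checkpoint
/// inserts a new base image "for free", truncating the redo chain.
enum Record {
    FullPageImage, // FPI or materialized image layer entry
    Delta,         // incremental WAL record
}

fn records_to_replay(history: &[Record]) -> usize {
    // Count deltas that lie after the most recent full page image.
    history
        .iter()
        .rev()
        .take_while(|r| matches!(r, Record::Delta))
        .count()
}

fn main() {
    use Record::*;
    // 3 deltas, then an FPI (e.g. the first update after a checkpoint), then
    // 2 more deltas: only the last 2 deltas need to be replayed.
    let history = [Delta, Delta, Delta, FullPageImage, Delta, Delta];
    println!("deltas to replay = {}", records_to_replay(&history));
}
```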

@hlinnaka
Contributor

hlinnaka commented Feb 9, 2023

So there's no regression, just the test changed. Is it expected that server-side generation during init should create 50% more deltas (among all the other differences in behavior)? I'd expect the pageserver workload to be the same.

A-ha, yes, there is a big difference between pgbench's client-side and server-side generation. With client-side generation, the biggest table (pgbench_accounts) is loaded with COPY, and with server-side generation it's loaded with INSERT ... SELECT. COPY is more optimized in PostgreSQL: it buffers the data and inserts it into the WAL in larger batches.

tl;dr: client-side generation produces a lot less WAL.

@knizhnik
Contributor

knizhnik commented Feb 9, 2023

Now I better understand why we get small layers in pgbench -s 1000: I just noticed that the on-disk B-tree index occupies quite a significant part of the layer file. I looked at some database I have (not the one with pgbench -s 1000) and found that the size of the index is 40MB, which is 15% of 256MB. The size of the index is proportional to the number of keys. Since pgbench updates keys randomly, we can expect that the set of keys in all compacted layers is almost the same (equal to the total number of keys). When we merge these layers, the total number of keys does not change, but we split them between N layers, so the index size in each layer is reduced N times. According to the layer map we are compacting 11 layers, so we free up to 10*40MB = 400MB. That is larger than the layer size.
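
Spelling out the arithmetic in that observation (the 40MB index size is the measurement quoted above; the rest follows from it):

```rust
fn main() {
    // Numbers from the comment above: ~40MB of per-layer B-tree index,
    // 256MB target layer size, 11 input layers being compacted.
    let index_mb = 40.0_f64;
    let target_mb = 256.0_f64;
    let input_layers = 11.0_f64;

    // Index share of a single layer: 40 / 256, roughly 15%.
    println!("index share: {:.0}%", 100.0 * index_mb / target_mb);

    // Before compaction each of the 11 inputs indexes (almost) the same key
    // set; after compaction the key set is indexed roughly once overall, so
    // up to (11 - 1) * 40MB = 400MB is freed -- more than one 256MB layer.
    println!("freed by index dedup: {}MB", (input_layers - 1.0) * index_mb);
}
```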

What does this mean? We should either not perform compaction at all (partial image layers), or treat these small L1 layers as L0 layers and use them for further compaction. @bojanserafimov tried to implement something like this (using a layer sparsity criterion).
As far as I remember there were some problems, but I am sure they can be fixed.
