
Limit key range of layers generated by compaction to 16k relations to separate rel/non-rel and sys/user relation entries #2995

Closed
knizhnik wants to merge 4 commits

Conversation

@knizhnik (Contributor) commented Dec 2, 2022

refer #2948

…separate rel/non-rel and sys/user relation entries

refer #2948
/// dimension range. It means that layers generated after compaction tend to cover the whole database key space,
/// which causes image layer generation for the whole database, leading to huge write amplification.
/// Catalog tables (like pg_class) also tend to be updated frequently (for example with the estimated number of relation rows/size).
/// Even if we have an append-only table, the generated delta layers will still cover the whole table, despite the fact that only its tail is updated.
///
pub fn partition(&self, target_size: u64) -> KeyPartitioning {
Review comment (Member):

It would be a good idea to have a #[test] example for this fn, even though the implementation is not too complicated yet. Though, I cannot see an easy example right away.
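For illustration, here is a minimal, self-contained sketch of what such a test might check. The types are simplified (plain integer key ranges, not the pageserver's actual KeySpace/KeyPartitioning), so this is only an assumption about the intended splitting behavior, not the real implementation:

```rust
// Simplified stand-in for KeySpace::partition: split a list of key ranges into
// parts whose accumulated size does not exceed `target_size` (hypothetical,
// illustrative logic only).
fn partition(ranges: &[std::ops::Range<u64>], target_size: u64) -> Vec<Vec<std::ops::Range<u64>>> {
    let mut parts = Vec::new();
    let mut current: Vec<std::ops::Range<u64>> = Vec::new();
    let mut current_size = 0;
    for r in ranges {
        let len = r.end - r.start;
        if current_size + len > target_size && !current.is_empty() {
            parts.push(std::mem::take(&mut current));
            current_size = 0;
        }
        current.push(r.clone());
        current_size += len;
    }
    if !current.is_empty() {
        parts.push(current);
    }
    parts
}

#[test]
fn partition_splits_at_target_size() {
    // Three ranges of sizes 10, 10 and 5 with a target of 15:
    // the second range does not fit with the first, so a new part starts.
    let parts = partition(&[0..10, 10..20, 20..25], 15);
    assert_eq!(parts, vec![vec![0..10], vec![10..20, 20..25]]);
}
```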

@bojanserafimov (Contributor):

My only concern is that now a random workload will create tiny L0 layers. What's the downside of my proposal? Quoted here:

Instead of creating images, include these long and sparse delta layers in the next compaction round. Basically treat them as L0 layers. Redefine the definition of L1 to be "a sufficiently dense layer, regardless of how many times it's been compacted".

Yes it means the layer gets compacted multiple times. But compaction of these sparse layers is orders of magnitude cheaper than image generation so it's a strict improvement.

@bojanserafimov (Contributor):

Also #2563 doesn't have this problem, as an alternative solution

@knizhnik (Contributor, Author):

My only concern is that now a random workload will create tiny L0 layers. What's the downside of my proposal?

I think you mean L1 layers? L0 layers cover the whole key range in any case, and this fix affects only layers created after compaction (i.e. L1 layers).
We are limiting the L1 layer key range to 16k relations, and it is unlikely that a random workload will randomly update 16k relations. Most likely there will be just one big relation (such as pgbench_accounts) being randomly updated. In that case this PR will not change the size of most layers.

What's the downside of my proposal? Quoted here:

Frankly speaking, I have not thought much about your proposal.
It seems to be something much more complex that requires many more changes to the current implementation.
I am not saying that it is bad - maybe it really is the right way to change how we work with layers. But it looks like it requires a storage format change. Are we ready to do that a few weeks before launch?

Also #2563 doesn't have this problem, as an alternative solution

I do not think that this approach is the best and only possible alternative.
It has its pros and cons - for example write amplification, because image layers will be generated more frequently.
But at least it doesn't require storage format changes, because these partial image layers are no different from delta layers containing FPIs.

@bojanserafimov (Contributor):

Ok so we might get one small L1 per batch (in the non-rel part of the key space), but that doesn't seem like a problem on average. Seems worth it.

Have you checked this code has the desired effect? I thought we were changing L1 layer bounds, but I see changes to the repartitioning code, which is only used in image layer generation (I think).

// TODO: this actually divides the layers into fixed-size chunks, not

@knizhnik (Contributor, Author):

Have you checked this code has the desired effect? I thought we were changing L1 layer bounds, but I see changes to the repartitioning code, which is only used in image layer generation (I think).

As far as I understand, image generation for the whole database happens because of layers with a large key range generated by compaction. Only these layers are taken into account when we check whether to perform image generation or not.
Assume that we just append to one table. Such an append also affects system tables and SLRUs.
This is why, instead of producing a small L1 layer which contains only the range of the appended blocks [first-appended-block..last-appended-block], we get two "wide" delta layers: [MIN_KEY...first-appended-block], [last_appended_block...MAX_KEY].
Repeat it 3 times and we get three delta layers covering most of the database key space, which forces image layer generation for the whole database and not only for [first-appended-block..last-appended-block].
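As a toy illustration of the effect described above (simplified u64 keys with hypothetical values, not the pageserver's real Key encoding): a couple of system-catalog/SLRU keys in the same batch stretch the resulting layer's key range from a narrow block interval to almost the entire key space.

```rust
// Toy illustration only: the key range of a layer is the min..max of the keys
// it contains, so a couple of "stray" low/high keys widen it dramatically.
fn key_range(keys: &[u64]) -> Option<std::ops::Range<u64>> {
    Some(*keys.iter().min()?..*keys.iter().max()? + 1)
}

fn main() {
    // Appended blocks of one table only: a narrow range.
    let appended: Vec<u64> = (1_000..1_100).collect();
    println!("{:?}", key_range(&appended)); // Some(1000..1100)

    // Same batch plus a catalog update (low key) and an SLRU update (high key):
    // the layer now spans almost the whole key space.
    let mut with_sys = appended.clone();
    with_sys.push(5);            // hypothetical system-catalog key
    with_sys.push(u64::MAX - 1); // hypothetical SLRU/non-rel key
    println!("{:?}", key_range(&with_sys)); // Some(5..18446744073709551615)
}
```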

@bojanserafimov (Contributor):

Yes that's how it works. You misunderstood my question: how does changing KeySpace::partition affect L1 generation? The function is not called at all from compact_level0.

Have you tried running this on pgbench init 10GB and drawing the layers to see that it works? (or any other test)

@knizhnik (Contributor, Author):

how does changing KeySpace::partition affect L1 generation? The function is not called at all from compact_level0.

Yeah, you are right. The layer key dimension should be limited not in KeySpace::partition but in the compact_level0 function. Redoing it now.

@bojanserafimov (Contributor):

[Screenshot from 2023-01-13 14-18-47]

Timeline after pgbench 10gb python test. Doesn't seem fixed. Maybe there are factors other than non-rel files that contribute to extremely sparse L1s.

To make your own graph, run `ls test_output/test_pgbench\[neon-45-684\]/repo/tenants/b81c45370fa23e38d4f71e5ccd4f1bf8/timelines/6576e861feaaf82aeaae21bee6dd8220 | grep "__" | cargo run --release --bin draw_timeline_dir > out.svg` and then `firefox out.svg`.

@bojanserafimov (Contributor):

NOTE: When reading these pictures remember that both the x and y axes are compressed. I only preserve the relative order of rectangle endpoints without preserving size (otherwise we won't be able to see anything). So when I say a layer is "sparse", I mean "I see that later we have 10 layers in the same key space, so it's at least 10x sparser than those layers".

@knizhnik (Contributor, Author):

Sorry, can you explain how to interpret this graph?
What do these empty rectangles represent? L1 layers?
And the "dotted" bold line?

@bojanserafimov (Contributor):

empty rectangle = delta. We have 2 L0s at the top, and below 4 batches of L1 layers.

Images are black rectangles with no height, so they look like dots. We often reimage a big range at once so we get a dotted line of images. They're mostly the same width, but some look wider because of coordinate compression (I only preserve the order of rectangle endpoints, not relative size).

Images sometimes overlap in lsn range with an L1 delta layer because we take images at latest LSN, which includes WAL from L0 and inmem layers, which later disappear.

All layers are taken with ls from the timeline directory directly, so I guess there's no gc kicking in yet.

So the story in this picture is:

  1. Pgbench init creates 3 batches of L1s. For unknown reasons batches 1 and 3 cover an aggressive amount of key space.
  2. All 3 batches cover the middle-ish part of the key range, so that gets covered by images.
  3. The simple-update phase adds a 4th batch of L1s which (for good reason) covers the entire keyspace.
  4. Now the rest of the keyspace gets image layers because L1 batches 1, 3 and 4 together cover the entire keyspace.

@bojanserafimov (Contributor):

I think the problem is larger and this approach only goes so far. Nonrel pages are one obvious cause of the problem, but even without them the problem exists.

Take for example an average OLTP workload. Let's squint and model this hypothetical workload as follows: Write random pages, with a certain part of the keyspace being a hotspot, and the hotspot moving every few minutes to a different part.

Not only will this workload produce enough sparse deltas to trigger the problem, but actually most deltas will be very sparse and cover the entire keyspace. Only the hotspot will have deltas worth reimaging over, and by definition most of the keyspace is not in a hotspot. If we add layer barriers like you do in this PR, that would only create tiny gaps between the L1 layers and eliminate less than 1% of the images.

We actually need to address the fact that currently the definition of L1 is very broad. We either need to introduce L2, and reimage only after a certain number of L2s (I haven't thought about this, and I'm not sure if just L2 will be enough), or go with my approach (classify layers based on density, as defined by "number of partitions inside the layer"), or something else.

@knizhnik (Contributor, Author):

Sorry, I still do not understand how the coordinate compression is performed.
So if I have delta layers

00_00000000_00000000_00004001_00_000000000 .. 00_00000000_00000000_00004001_00_000000100
00_00000000_00000000_00004002_00_000000000 .. 00_00000000_00000000_00004004_00_000000100

what will be the ratio of the lengths of the corresponding rectangles?

If we perform random updates, then it is no wonder that image layers need to be generated for the whole database range even if the percentage of updated pages is relatively small. But let's estimate. An L0 layer contains 256MB of WAL records.
For simplicity let's assume that they are FPIs of 8kB each (if a WAL record is smaller, it will only increase my estimate).
So an L0 layer with random updates modifies 32k random pages. Let's multiply that by the image layer size, 128MB:
it is 4TB! It means that only for databases larger than 4TB do the delta layers produced by compact_level0 cause generation of
"useless" image layers - image layers whose pages were not actually changed.

The problem I am trying to address is not related to random/sparse updates - there it is really hard to improve anything.
But assume that we just append to one very large table (i.e. 1TB in size). So we add 256MB of data to this table and it produces a 256MB L0 layer.
Then we reshuffle it and ... get a delta layer something like:

00_00000000_00000000_00000000_00_000000000 .. 00_00000000_00000000_00004001_00_08008000

So it actually covers this whole huge table. Three such L1 layers and we will generate 1TB of image layers, while we actually need image layers only for 256*3 = 768MB. Limiting the layer key range can avoid such behavior.

@bojanserafimov (Contributor):

what will be the ratio of the lengths of the corresponding rectangles?

I take those 4 numbers, sort them, deduplicate them, and replace them with their rank.
So the layers now become:

layer1: 0..1
layer2: 2..3

If there are no other layers, then their width would be the same. But
if there was a third layer 00_00000000_00000000_00004001_00_000000000..00_00000000_00000000_00004001_00_000000001
then we'd have

layer1: 0..2
layer2: 3..4
layer3: 0..1

Here's more on "coordinate compression" (competitive programming slang) https://medium.com/algorithms-digest/coordinate-compression-2fff95326fb I'm sure you'll hear the word again.

So if you're looking at a picture and want to estimate the actual width, maybe count how many image layers it gets covered by later.
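Here is a small sketch of the coordinate compression described above (endpoints as plain u64 here; the real keys are wider, but the idea is the same): collect all range endpoints, sort, deduplicate, and replace each endpoint with its rank.

```rust
use std::collections::BTreeMap;

/// Map each layer's (start, end) endpoints to their rank among all endpoints.
fn compress(layers: &[(u64, u64)]) -> Vec<(usize, usize)> {
    let mut endpoints: Vec<u64> = layers.iter().flat_map(|&(a, b)| [a, b]).collect();
    endpoints.sort_unstable();
    endpoints.dedup();
    let rank: BTreeMap<u64, usize> =
        endpoints.iter().enumerate().map(|(i, &v)| (v, i)).collect();
    layers.iter().map(|&(a, b)| (rank[&a], rank[&b])).collect()
}

fn main() {
    // Two layers -> drawn as 0..1 and 2..3; adding a third, tiny layer changes
    // the first layer's drawn width to 0..2, as in the example above.
    println!("{:?}", compress(&[(100, 200), (300, 500)]));             // [(0, 1), (2, 3)]
    println!("{:?}", compress(&[(100, 200), (300, 500), (100, 101)])); // [(0, 2), (3, 4), (0, 1)]
}
```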

@bojanserafimov (Contributor):

The problem I am trying to address is not related with random/sparse updates ...

Yeah we can commit this and solve a special case, but that doesn't close #2948

It means that only for databases with size larger than 4TB delta layers produced by compact_level0 cause generation of "useless" image layers

True, to see extreme problems we need extreme scale. But it's fairly common to see mostly useless images at any scale.

Look at this project https://neondb.slack.com/archives/C03UVBDM40J/p1669655812511299

num_images: 24070
num_deltas: 2620

That's an image:delta ratio of 12. Since (TIL) deltas are 2x larger, that means our bloat factor is 7 (1 delta + 6 times its size in images). IMO we can get the bloat down to 2, and that will make a big difference for pricing.

If we perform random updates, then not wonder that image layers need to be generated for the whole database range even if percent of updated pages is relatively small.

I think it's a solvable problem. We just compact those sparse deltas instead of reimaging them. The cost of compaction is smaller, and the effect is the same (to decrease the max difficulty below 3). We should have a limit on key_range size such that any deltas larger than it are compacted rather than reimaged.
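A hedged sketch of the kind of rule proposed here (the names, key type, and threshold are hypothetical, not the pageserver's actual API): a delta whose key range is much wider than the set of keys it actually contains is treated as "still L0-like" and fed back into compaction instead of counting toward image generation.

```rust
/// Illustrative only: classify a delta layer by how sparse it is.
struct DeltaLayerInfo {
    key_range_width: u64, // width of the layer's key range
    distinct_keys: u64,   // number of distinct keys it actually contains
}

/// If the layer covers far more key space than it has entries for, recompact
/// it in the next round rather than counting it toward image generation.
fn should_recompact(layer: &DeltaLayerInfo, max_sparsity: u64) -> bool {
    layer.distinct_keys == 0 || layer.key_range_width / layer.distinct_keys > max_sparsity
}

fn main() {
    let sparse = DeltaLayerInfo { key_range_width: 1_000_000, distinct_keys: 100 };
    let dense = DeltaLayerInfo { key_range_width: 1_000, distinct_keys: 900 };
    assert!(should_recompact(&sparse, 100));  // 10_000 keys of range per entry
    assert!(!should_recompact(&dense, 100));  // ~1 key of range per entry
    println!("ok");
}
```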

If we really need to reimage because it's time to GC that sparse delta, then we could (ignoring it is also a fine choice). But if we're reimaging because of GC, then we can immediately delete the old images, so we're not causing any actual pageserver bloat. Only write amplification, s3 bloat. In all cases we've seen so far, pageserver bloat is the problem. Also for total cost of services, pageserver bloat will probably be a bigger problem.

All visualizations I've made so far come directly from pageserver files that haven't been GCed yet. See for example this pathological case https://neondb.slack.com/archives/C03UVBDM40J/p1669736655098039?thread_ts=1669655812.511299&cid=C03UVBDM40J

@knizhnik (Contributor, Author):

Hmmm...
The first layer contains 0x100 blocks of relation 4001, the second - 3 relations 4002..4004 with at least 0x100 blocks in 4004, and your third - just one block of relation 4001. And you say that the first layer will be represented by a box of width 2 and the second and third by boxes of width 1? That seems to be either incorrect or confusing.

But in any case, since this diagram doesn't represent the real key range, it seems impossible to say, just by looking at it, that some long layer will cause generation of page images for the whole key range.
Actually, the fact that some delta layer in the diagram is several times wider than another doesn't indicate a problem - it can happen with a random distribution.

Yeah we can commit this and solve a special case, but that doesn't close #2948

First of all, I do not agree that it is a "special case". Append-only tables are quite a frequent use case. Moreover, almost any import of data into the database corresponds to this use case, so images created during such an import will suffer from this problem.

Second, I once again do not completely understand the picture in #2984, and it is not clear to me why this PR will not fix this problem.

We just compact those sparse deltas instead of reimaging them.

Let's return to the beginning: why do we ever need to produce image layers?

  1. To reduce the number of layers which need to be inspected to perform page reconstruction, and so reduce reconstruction time.
  2. To be able to perform GC (which we perform at image layers).

With delta layers or with partial image layers you cannot say whether a layer contains WAL records for a particular page without fetching the layer and performing a B-Tree search. We can use a bloom filter or something else to avoid fetching all layers, but it will require additional memory for the layer map. Also, a precise last-written-LSN cache can point us to the right LSN and so avoid scanning a large number of layers.

Sorry, but I currently do not understand how your suggestion can address the case of random updates, and I am not sure whether anything can really be done in this case within the current layering model.

But it's fairly common to see mostly useless images at any scale. That's an image:delta ratio of 12.

I believe (but cannot prove) that it is really caused by a "corner effect" - when we create delta layers which cover a large range of the key space and so require multiple image layers. If we have some test to reproduce such a layer layout, it will be quite trivial to check whether this patch addresses this problem or not.

@knizhnik (Contributor, Author):

I have performed some experiments.
You are absolutely right - this "trick" with limiting the key range does not work well even for a single table.
There are several reasons for it: autovacuum, the placement of forknum after relnum in Key, ...
So we really need some more generic algorithm for image layer generation.
I thought about your idea of sparse delta layers.
Actually our delta layers are already sparse: they contain only the updated keys.
So we can add a more precise check that a delta layer really overlaps with an image layer (i.e. contains some entries in its key range).
I have created new PR #3348 where I added a Layer::overlaps method and use it in LayerMap::count_deltas.
Certainly it involves the on-disk B-Tree, so it requires reading layer data from disk. It may increase compaction time.
Alternatives: a bloom filter or BRIN.
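A rough sketch of the idea (hypothetical types and plain u64 keys for illustration; the actual #3348 code works against the layer's on-disk B-Tree): only delta layers that actually contain entries in the key range being considered are counted toward the image-generation threshold.

```rust
use std::ops::Range;

/// Hypothetical trait standing in for the proposed Layer::overlaps check.
trait KeyOverlap {
    /// True if the layer contains at least one entry with a key in `range`.
    fn overlaps(&self, range: &Range<u64>) -> bool;
}

/// Toy delta layer that just stores its keys in a sorted Vec.
struct ToyDelta {
    keys: Vec<u64>, // sorted
}

impl KeyOverlap for ToyDelta {
    fn overlaps(&self, range: &Range<u64>) -> bool {
        // Find the first key >= range.start; overlap if it is also < range.end.
        let idx = self.keys.partition_point(|&k| k < range.start);
        self.keys.get(idx).map_or(false, |&k| k < range.end)
    }
}

/// Count only the deltas with real entries in `range` (cf. LayerMap::count_deltas).
fn count_overlapping_deltas<L: KeyOverlap>(layers: &[L], range: &Range<u64>) -> usize {
    layers.iter().filter(|l| l.overlaps(range)).count()
}

fn main() {
    let layers = vec![
        ToyDelta { keys: vec![1, 2, 900] }, // wide key range, has an entry in 800..1000
        ToyDelta { keys: vec![1, 2, 3] },   // no entries in 800..1000
    ];
    assert_eq!(count_overlapping_deltas(&layers, &(800..1000)), 1);
    println!("ok");
}
```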

@knizhnik knizhnik closed this Jan 15, 2023
@bojanserafimov (Contributor):

And you say that the first layer will be represented by a box of width 2 and the second and third by boxes of width 1? Seems to be either incorrect or confusing.

It's not incorrect, just confusing. You're welcome to improve on the visualization method. Without any "coordinate compression" most rectangles will have 0 width and be invisible. Maybe a better method is to just use collect_keyspace and exclude unused keys. I haven't tried that.

But in any case, since this diagram doesn't represent the real key range, it seems impossible to say, just by looking at it, that some long layer will cause generation of page images for the whole key range.

Yes. Only after the images are created can you see in retrospect that one delta caused many images. But for some reason I found it hard to reproduce this locally (to get the useless images to actually appear). Maybe it takes too long and the test finishes before that? Not sure. But in the specific workload of pgbench init no delta layer will be denser than an image layer, so if a delta spans 10 other deltas it probably spans 20 images too.

You are absolutely right - this "trick" with limiting key range is not working well even for single table. There are several reasons for it: autovacuum, placing forknum after relnum in Key,...

I'm sure there are other cases too. If most updates occasionally update the root of some BTree index, etc. (I don't know pg internals, just guessing that something like that would happen, or at least we can't rule it out)

@bayandin bayandin deleted the limit_layer_key_range branch May 19, 2023 13:06