
Limit key range of layers generated by compaction to 16k relations to separate rel/non-rel and sys/user relation entries #2995

Closed
knizhnik wants to merge 4 commits

Conversation

@knizhnik (Contributor) commented Dec 2, 2022

refer #2948

…separate rel/non-rel and sys/user relation entries

refer #2948
/// dimension range. It means that layers generated after compaction tend to cover the whole database key space,
/// which causes image layer generation for the whole database, leading to huge write amplification.
/// Catalog tables (like pg_class) also tend to be updated frequently (for example with the estimated number of relation rows/size).
/// Even if we have an append-only table, the generated delta layers will still cover the whole table, despite the fact that only its tail is updated.
///
pub fn partition(&self, target_size: u64) -> KeyPartitioning {
Review comment (Member):

It would be a good idea to have a #[test] example for this fn, even though the implementation is not too complicated yet. Though, I cannot see an easy example right away.
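For illustration, here is a minimal, self-contained sketch of what such a test might check. The types are simplified (plain integer key ranges, not the pageserver's actual KeySpace/KeyPartitioning), so this is only an assumption about the intended splitting behavior, not the real implementation:

```rust
// Simplified stand-in for KeySpace::partition: split a list of key ranges into
// parts whose accumulated size does not exceed `target_size` (hypothetical,
// illustrative logic only).
fn partition(ranges: &[std::ops::Range<u64>], target_size: u64) -> Vec<Vec<std::ops::Range<u64>>> {
    let mut parts = Vec::new();
    let mut current: Vec<std::ops::Range<u64>> = Vec::new();
    let mut current_size = 0;
    for r in ranges {
        let len = r.end - r.start;
        if current_size + len > target_size && !current.is_empty() {
            parts.push(std::mem::take(&mut current));
            current_size = 0;
        }
        current.push(r.clone());
        current_size += len;
    }
    if !current.is_empty() {
        parts.push(current);
    }
    parts
}

#[test]
fn partition_splits_at_target_size() {
    // Three ranges of sizes 10, 10 and 5 with a target of 15:
    // the second range does not fit with the first, so a new part starts.
    let parts = partition(&[0..10, 10..20, 20..25], 15);
    assert_eq!(parts, vec![vec![0..10], vec![10..20, 20..25]]);
}
```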

@bojanserafimov (Contributor):

My only concern is that now a random workload will create tiny L0 layers. What's the downside of my proposal? Quoted here:

Instead of creating images, include these long and sparse delta layers in the next compaction round. Basically treat them as L0 layers. Redefine the definition of L1 to be "a sufficiently dense layer, regardless of how many times it's been compacted".

Yes it means the layer gets compacted multiple times. But compaction of these sparse layers is orders of magnitude cheaper than image generation so it's a strict improvement.

@bojanserafimov (Contributor):

Also #2563 doesn't have this problem, as an alternative solution

@knizhnik (Contributor, Author):

My only concern is that now a random workload will create tiny L0 layers. What's the downside of my proposal?

I think you mean L1 layers? L0 layers cover the whole key range in any case, and this fix affects only layers created after compaction (i.e. L1 layers).
We are limiting the L1 layer key range to 16k relations, and it is unlikely that a random workload will randomly update 16k relations. Most likely there will be just one big relation (such as pgbench_accounts) being randomly updated. In that case this PR will not change the size of most layers.

What's the downside of my proposal? Quoted here:

Frankly speaking, I have not thought much about your proposal.
It seems to be something much more complex that requires many more changes to the current implementation.
I am not saying that it is bad - maybe it really is the right way to change how we work with layers. But it looks like it requires a storage format change. Are we ready to do that a few weeks before launch?

Also #2563 doesn't have this problem, as an alternative solution

I do not think that this approach is the best and only possible alternative.
It has its pros and cons - for example write amplification, because image layers will be generated more frequently.
But at least it doesn't require storage format changes, because these partial image layers are no different from delta layers containing FPIs.

@bojanserafimov (Contributor):

Ok so we might get one small L1 per batch (in the non-rel part of the key space), but that doesn't seem like a problem on average. Seems worth it.

Have you checked this code has the desired effect? I thought we were changing L1 layer bounds, but I see changes to the repartitioning code, which is only used in image layer generation (I think).

// TODO: this actually divides the layers into fixed-size chunks, not

@knizhnik (Contributor, Author):

Have you checked this code has the desired effect? I thought we were changing L1 layer bounds, but I see changes to the repartitioning code, which is only used in image layer generation (I think).

As far as I understand, image generation for the whole database happens because of layers with a large key range generated by compaction. Only these layers are taken into account when we check whether to perform image generation or not.
Assume that we just append to one table. Such an append also affects system tables and SLRUs.
This is why, instead of producing a small L1 layer which contains only the range of the appended blocks [first-appended-block..last-appended-block], we get two "wide" delta layers: [MIN_KEY...first-appended-block], [last_appended_block...MAX_KEY].
Repeat it 3 times and we get three delta layers covering most of the database key space, which forces image layer generation for the whole database and not only for [first-appended-block..last-appended-block].
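As a toy illustration of the effect described above (simplified u64 keys with hypothetical values, not the pageserver's real Key encoding): a couple of system-catalog/SLRU keys in the same batch stretch the resulting layer's key range from a narrow block interval to almost the entire key space.

```rust
// Toy illustration only: the key range of a layer is the min..max of the keys
// it contains, so a couple of "stray" low/high keys widen it dramatically.
fn key_range(keys: &[u64]) -> Option<std::ops::Range<u64>> {
    Some(*keys.iter().min()?..*keys.iter().max()? + 1)
}

fn main() {
    // Appended blocks of one table only: a narrow range.
    let appended: Vec<u64> = (1_000..1_100).collect();
    println!("{:?}", key_range(&appended)); // Some(1000..1100)

    // Same batch plus a catalog update (low key) and an SLRU update (high key):
    // the layer now spans almost the whole key space.
    let mut with_sys = appended.clone();
    with_sys.push(5);            // hypothetical system-catalog key
    with_sys.push(u64::MAX - 1); // hypothetical SLRU/non-rel key
    println!("{:?}", key_range(&with_sys)); // Some(5..18446744073709551615)
}
```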

@bojanserafimov (Contributor):

Yes that's how it works. You misunderstood my question: how does changing KeySpace::partition affect L1 generation? The function is not called at all from compact_level0.

Have you tried running this on pgbench init 10GB and drawing the layers to see that it works? (or any other test)

@knizhnik (Contributor, Author):

how does changing KeySpace::partition affect L1 generation? The function is not called at all from compact_level0.

Yeah, you are right. The layer key dimension should be limited not in KeySpace::partition but in the compact_level0 function. Redoing it now.

@bojanserafimov (Contributor):

[Screenshot from 2023-01-13 14-18-47]

Timeline after pgbench 10gb python test. Doesn't seem fixed. Maybe there are factors other than non-rel files that contribute to extremely sparse L1s.

To make your own graph, run `ls test_output/test_pgbench\[neon-45-684\]/repo/tenants/b81c45370fa23e38d4f71e5ccd4f1bf8/timelines/6576e861feaaf82aeaae21bee6dd8220 | grep "__" | cargo run --release --bin draw_timeline_dir > out.svg` and then `firefox out.svg`.

@bojanserafimov (Contributor):

NOTE: When reading these pictures remember that both the x and y axes are compressed. I only preserve the relative order of rectangle endpoints without preserving size (otherwise we won't be able to see anything). So when I say a layer is "sparse", I mean "I see that later we have 10 layers in the same key space, so it's at least 10x sparser than those layers".

@knizhnik (Contributor, Author):

Sorry, can you explain how to interpret this graph?
What do these empty rectangles represent? L1 layers?
And the "dotted" bold line?

@bojanserafimov (Contributor):

empty rectangle = delta. We have 2 L0s at the top, and below 4 batches of L1 layers.

Images are black rectangles with no height, so they look like dots. We often reimage a big range at once so we get a dotted line of images. They're mostly the same width, but some look wider because of coordinate compression (I only preserve the order of rectangle endpoints, not relative size).

Images sometimes overlap in lsn range with an L1 delta layer because we take images at latest LSN, which includes WAL from L0 and inmem layers, which later disappear.

All layers are taken with ls from the timeline directory directly, so I guess there's no gc kicking in yet.

So the story in this picture is:

  1. Pgbench init creates 3 batches of L1s. For unknown reasons batches 1 and 3 cover an aggressive amount of key space.
  2. All 3 batches cover the middle-ish part of the key range, so that gets covered by images.
  3. The simple-update phase adds a 4th batch of L1s which (for good reason) covers the entire keyspace.
  4. Now the rest of the keyspace gets image layers because L1 batches 1, 3 and 4 together cover the entire keyspace.

@bojanserafimov (Contributor):

I think the problem is larger and this approach only goes so far. Nonrel pages are one obvious cause of the problem, but even without them the problem exists.

Take for example an average OLTP workload. Let's squint and model this hypothetical workload as follows: Write random pages, with a certain part of the keyspace being a hotspot, and the hotspot moving every few minutes to a different part.

Not only will this workload produce enough sparse deltas to trigger the problem, but actually most deltas will be very sparse and cover the entire keyspace. Only the hotspot will have deltas worth reimaging over, and by definition most of the keyspace is not in a hotspot. If we add layer barriers like you do in this PR, that would only create tiny gaps between the L1 layers and eliminate less than 1% of the images.

We actually need to address the fact that currently the definition of L1 is very broad. We either need to introduce L2, and reimage only after a certain number of L2s (I haven't thought about this, and I'm not sure if just L2 will be enough), or go with my approach (classify layers based on density, as defined by "number of partitions inside the layer"), or something else.

@knizhnik (Contributor, Author):

Sorry, I still do not understand how the coordinate compression is performed.
So if I have delta layers

00_00000000_00000000_00004001_00_000000000 .. 00_00000000_00000000_00004001_00_000000100
00_00000000_00000000_00004002_00_000000000 .. 00_00000000_00000000_00004004_00_000000100

what will be the ratio of the lengths of the corresponding rectangles?

If we perform random updates, then it is no wonder that image layers need to be generated for the whole database range even if the percentage of updated pages is relatively small. But let's estimate. An L0 layer contains 256MB of WAL records.
For simplicity let's assume that they are FPIs of 8kB each (if a WAL record is smaller, it will only increase my estimate).
So an L0 layer with random updates modifies 32k random pages. Let's multiply that by the image layer size, 128MB:
it is 4TB! It means that only for databases larger than 4TB do the delta layers produced by compact_level0 cause generation of
"useless" image layers - image layers whose pages were not actually changed.

The problem I am trying to address is not related to random/sparse updates - there it is really hard to improve anything.
But assume that we just append to one very large table (i.e. 1TB in size). So we add 256MB of data to this table and it produces a 256MB L0 layer.
Then we reshuffle it and ... get a delta layer something like:

00_00000000_00000000_00000000_00_000000000 .. 00_00000000_00000000_00004001_00_08008000

So it actually covers this whole huge table. Three such L1 layers and we will generate 1TB of image layers, while we actually need image layers only for 256*3 = 768MB. Limiting the layer key range can avoid such behavior.

@bojanserafimov (Contributor):

what will be the ratio of the lengths of the corresponding rectangles?

I take those 4 numbers, sort them, deduplicate them, and replace them with their rank.
So the layers now become:

layer1: 0..1
layer2: 2..3

If there are no other layers, then their width would be the same. But
if there was a third layer 00_00000000_00000000_00004001_00_000000000..00_00000000_00000000_00004001_00_000000001
then we'd have

layer1: 0..2
layer2: 3..4
layer3: 0..1

Here's more on "coordinate compression" (competitive programming slang) https://medium.com/algorithms-digest/coordinate-compression-2fff95326fb I'm sure you'll hear the word again.

So if you're looking at a picture and want to estimate the actual width, maybe count how many image layers it gets covered by later.
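Here is a small sketch of the coordinate compression described above (endpoints as plain u64 here; the real keys are wider, but the idea is the same): collect all range endpoints, sort, deduplicate, and replace each endpoint with its rank.

```rust
use std::collections::BTreeMap;

/// Map each layer's (start, end) endpoints to their rank among all endpoints.
fn compress(layers: &[(u64, u64)]) -> Vec<(usize, usize)> {
    let mut endpoints: Vec<u64> = layers.iter().flat_map(|&(a, b)| [a, b]).collect();
    endpoints.sort_unstable();
    endpoints.dedup();
    let rank: BTreeMap<u64, usize> =
        endpoints.iter().enumerate().map(|(i, &v)| (v, i)).collect();
    layers.iter().map(|&(a, b)| (rank[&a], rank[&b])).collect()
}

fn main() {
    // Two layers -> drawn as 0..1 and 2..3; adding a third, tiny layer changes
    // the first layer's drawn width to 0..2, as in the example above.
    println!("{:?}", compress(&[(100, 200), (300, 500)]));             // [(0, 1), (2, 3)]
    println!("{:?}", compress(&[(100, 200), (300, 500), (100, 101)])); // [(0, 2), (3, 4), (0, 1)]
}
```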

@bojanserafimov (Contributor):

The problem I am trying to address is not related with random/sparse updates ...

Yeah we can commit this and solve a special case, but that doesn't close #2948

It means that only for databases with size larger than 4TB delta layers produced by compact_level0 cause generation of "useless" image layers

True, to see extreme problems we need extreme scale. But it's fairly common to see mostly useless images at any scale.

Look at this project https://neondb.slack.com/archives/C03UVBDM40J/p1669655812511299

num_images: 24070
num_deltas: 2620

That's an image:delta ratio of 12. Since (TIL) deltas are 2x larger, that means our bloat factor is 7 (1 delta + 6 times its size in images). IMO we can get the bloat down to 2, and that will make a big difference for pricing.

If we perform random updates, then not wonder that image layers need to be generated for the whole database range even if percent of updated pages is relatively small.

I think it's a solvable problem. We just compact those sparse deltas instead of reimaging them. The cost of compaction is smaller, and the effect is the same (to decrease the max difficulty below 3). We should have a limit on key_range size such that any deltas larger than it are compacted rather than reimaged.
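A hedged sketch of the kind of rule proposed here (the names, key type, and threshold are hypothetical, not the pageserver's actual API): a delta whose key range is much wider than the set of keys it actually contains is treated as "still L0-like" and fed back into compaction instead of counting toward image generation.

```rust
/// Illustrative only: classify a delta layer by how sparse it is.
struct DeltaLayerInfo {
    key_range_width: u64, // width of the layer's key range
    distinct_keys: u64,   // number of distinct keys it actually contains
}

/// If the layer covers far more key space than it has entries for, recompact
/// it in the next round rather than counting it toward image generation.
fn should_recompact(layer: &DeltaLayerInfo, max_sparsity: u64) -> bool {
    layer.distinct_keys == 0 || layer.key_range_width / layer.distinct_keys > max_sparsity
}

fn main() {
    let sparse = DeltaLayerInfo { key_range_width: 1_000_000, distinct_keys: 100 };
    let dense = DeltaLayerInfo { key_range_width: 1_000, distinct_keys: 900 };
    assert!(should_recompact(&sparse, 100));  // 10_000 keys of range per entry
    assert!(!should_recompact(&dense, 100));  // ~1 key of range per entry
    println!("ok");
}
```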

If we really need to reimage because it's time to GC that sparse delta, then we could (ignoring it is also a fine choice). But if we're reimaging because of GC, then we can immediately delete the old images, so we're not causing any actual pageserver bloat. Only write amplification, s3 bloat. In all cases we've seen so far, pageserver bloat is the problem. Also for total cost of services, pageserver bloat will probably be a bigger problem.

All visualizations I've made so far come directly from pageserver files that haven't been GCed yet. See for example this pathological case https://neondb.slack.com/archives/C03UVBDM40J/p1669736655098039?thread_ts=1669655812.511299&cid=C03UVBDM40J

@knizhnik (Contributor, Author):

Hmmm...
The first layer contains 0x100 blocks of relation 4001, the second - 3 relations 4002..4004 with at least 0x100 blocks in 4004, and your third - just one block of relation 4001. And you say that the first layer will be represented by a box of width 2 and the second and third by boxes of width 1? That seems to be either incorrect or confusing.

But in any case, since this diagram doesn't represent the real key range, it seems impossible to say, just by looking at it, that some long layer will cause generation of page images for the whole key range.
Actually, the fact that some delta layer in the diagram is several times wider than another doesn't indicate a problem - it can happen with a random distribution.

Yeah we can commit this and solve a special case, but that doesn't close #2948

First of all, I do not agree that it is a "special case". Append-only tables are quite a frequent use case. Moreover, almost any import of data into the database corresponds to this use case, so images created during such an import will suffer from this problem.

Second, I once again do not completely understand the picture in #2984, and it is not clear to me why this PR will not fix this problem.

We just compact those sparse deltas instead of reimaging them.

Let's return to the beginning: why do we ever need to produce image layers?

  1. To reduce the number of layers which need to be inspected to perform page reconstruction, and so reduce reconstruction time.
  2. To be able to perform GC (which we perform at image layers).

With delta layers or with partial image layers you cannot say whether a layer contains WAL records for a particular page without fetching the layer and performing a B-Tree search. We can use a bloom filter or something else to avoid fetching all layers, but it will require additional memory for the layer map. Also, a precise last-written-LSN cache can point us to the right LSN and so avoid scanning a large number of layers.

Sorry, but I currently do not understand how your suggestion can address the case of random updates, and I am not sure whether anything can really be done in this case within the current layering model.

But it's fairly common to see mostly useless images at any scale. That's an image:delta ratio of 12.

I believe (but cannot prove) that it is really caused by a "corner effect" - when we create delta layers which cover a large range of the key space and so require multiple image layers. If we have some test to reproduce such a layer layout, it will be quite trivial to check whether this patch addresses this problem or not.

@knizhnik (Contributor, Author):

I have performed some experiments.
You are absolutely right - this "trick" with limiting the key range does not work well even for a single table.
There are several reasons for it: autovacuum, the placement of forknum after relnum in Key, ...
So we really need some more generic algorithm for image layer generation.
I thought about your idea of sparse delta layers.
Actually our delta layers are already sparse: they contain only the updated keys.
So we can add a more precise check that a delta layer really overlaps with an image layer (i.e. contains some entries in its key range).
I have created new PR #3348 where I added a Layer::overlaps method and use it in LayerMap::count_deltas.
Certainly it involves the on-disk B-Tree, so it requires reading layer data from disk. It may increase compaction time.
Alternatives: a bloom filter or BRIN.
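A rough sketch of the idea (hypothetical types and plain u64 keys for illustration; the actual #3348 code works against the layer's on-disk B-Tree): only delta layers that actually contain entries in the key range being considered are counted toward the image-generation threshold.

```rust
use std::ops::Range;

/// Hypothetical trait standing in for the proposed Layer::overlaps check.
trait KeyOverlap {
    /// True if the layer contains at least one entry with a key in `range`.
    fn overlaps(&self, range: &Range<u64>) -> bool;
}

/// Toy delta layer that just stores its keys in a sorted Vec.
struct ToyDelta {
    keys: Vec<u64>, // sorted
}

impl KeyOverlap for ToyDelta {
    fn overlaps(&self, range: &Range<u64>) -> bool {
        // Find the first key >= range.start; overlap if it is also < range.end.
        let idx = self.keys.partition_point(|&k| k < range.start);
        self.keys.get(idx).map_or(false, |&k| k < range.end)
    }
}

/// Count only the deltas with real entries in `range` (cf. LayerMap::count_deltas).
fn count_overlapping_deltas<L: KeyOverlap>(layers: &[L], range: &Range<u64>) -> usize {
    layers.iter().filter(|l| l.overlaps(range)).count()
}

fn main() {
    let layers = vec![
        ToyDelta { keys: vec![1, 2, 900] }, // wide key range, has an entry in 800..1000
        ToyDelta { keys: vec![1, 2, 3] },   // no entries in 800..1000
    ];
    assert_eq!(count_overlapping_deltas(&layers, &(800..1000)), 1);
    println!("ok");
}
```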

@knizhnik knizhnik closed this Jan 15, 2023
@bojanserafimov (Contributor):

And you say that the first layer will be represented by a box of width 2 and the second and third by boxes of width 1? Seems to be either incorrect or confusing.

It's not incorrect, just confusing. You're welcome to improve on the visualization method. Without any "coordinate compression" most rectangles will have 0 width and be invisible. Maybe a better method is to just use collect_keyspace and exclude unused keys. I haven't tried that.

But in any case, since this diagram doesn't represent the real key range, it seems impossible to say, just by looking at it, that some long layer will cause generation of page images for the whole key range.

Yes. Only after the images are created can you see in retrospect that one delta caused many images. But for some reason I found it hard to reproduce this locally (to get the useless images to actually appear). Maybe it takes too long and the test finishes before that? Not sure. But in the specific workload of pgbench init no delta layer will be denser than an image layer, so if a delta spans 10 other deltas it probably spans 20 images too.

You are absolutely right - this "trick" with limiting key range is not working well even for single table. There are several reasons for it: autovacuum, placing forknum after relnum in Key,...

I'm sure there are other cases too. If most updates occasionally update the root of some BTree index, etc. (I don't know pg internals, just guessing that something like that would happen, or at least we can't rule it out)

@bayandin bayandin deleted the limit_layer_key_range branch May 19, 2023 13:06