Limit key range of layers generated by compaction to 16k relations to separate rel/non-rel and sys/user relation entries #2995
Conversation
…separate rel/non-rel and sys/user relation entries refer #2948
pageserver/src/keyspace.rs
Outdated
/// dimension range. It means that layers generated after compaction are used to cover all database space,
/// which causes image layer generation for the whole database, leading to huge write amplification.
/// Catalog tables (like pg_class) also tend to be updated frequently (for example with the estimated number of relation rows/size).
/// Even if we have an append-only table, the generated delta layers will still cover the whole table, despite the fact that only its tail is updated.
///
pub fn partition(&self, target_size: u64) -> KeyPartitioning {
It would be a good idea to have a #[test] example for this fn, even though the implementation is not too complicated yet. Though, I cannot see an easy example right away (see the sketch after this comment).
My only concern is that now a random workload will create tiny L0 layers. What's the downside of my proposal? Quoted here:
Yes it means the layer gets compacted multiple times. But compaction of these sparse layers is orders of magnitude cheaper than image generation so it's a strict improvement.
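Regarding the #[test] suggestion above, here is a minimal sketch of what such a test could look like. The types and the greedy packing logic are simplified stand-ins, not the actual KeySpace/KeyPartitioning from pageserver/src/keyspace.rs; only the asserted properties (coverage is preserved, parts respect the target size) are meant to carry over.

```rust
// Simplified stand-ins for the real KeySpace/KeyPartitioning types; the
// greedy packing below is only an assumption about how partitioning works.
use std::ops::Range;

struct KeySpace {
    ranges: Vec<Range<u64>>,
}

struct KeyPartitioning {
    parts: Vec<Vec<Range<u64>>>,
}

impl KeySpace {
    fn partition(&self, target_size: u64) -> KeyPartitioning {
        let mut parts = Vec::new();
        let mut current: Vec<Range<u64>> = Vec::new();
        let mut current_size = 0u64;
        for range in &self.ranges {
            let len = range.end - range.start;
            // Start a new part once adding this range would exceed the target.
            if !current.is_empty() && current_size + len > target_size {
                parts.push(std::mem::take(&mut current));
                current_size = 0;
            }
            current.push(range.clone());
            current_size += len;
        }
        if !current.is_empty() {
            parts.push(current);
        }
        KeyPartitioning { parts }
    }
}

#[test]
fn partition_covers_keyspace_and_respects_target_size() {
    let ks = KeySpace {
        ranges: vec![0..10, 20..30, 100..140],
    };
    let partitioning = ks.partition(20);

    // No key is lost or duplicated: total length is preserved.
    let total: u64 = partitioning
        .parts
        .iter()
        .flat_map(|part| part.iter())
        .map(|r| r.end - r.start)
        .sum();
    assert_eq!(total, 60);

    // The oversized 100..140 range is forced into its own part,
    // so we expect at least two parts for target_size = 20.
    assert!(partitioning.parts.len() >= 2);
}
```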
Also #2563 doesn't have this problem, as an alternative solution.
I think you mean L1 layers? Because L0 layers in any case cover the whole key range.
Frankly speaking, I have not thought much about your proposal. I do not think that this approach is the best and only possible alternative.
Ok so we might get one small L1 per batch (in the non-rel part of the key space), but that doesn't seem like a problem on average. Seems worth it. Have you checked this code has the desired effect? I thought we were changing L1 layer bounds, but I see changes to repartitioning code, which is only used in image layer generation (I think). See neon/pageserver/src/tenant/timeline.rs, line 2529 in 87c3e55.
As far as I understand, image generation for the whole database happens because of layers with a larger key range generated by recompaction. Only these layers are taken into account when we check whether to perform image generation or not.
Yes, that's how it works. You misunderstood my question: how does changing the repartitioning code achieve that? Have you tried running this on pgbench init 10GB and drawing the layers to see that it works? (Or any other test.)
Yeah, you are right. The layer key range should be limited not in the repartitioning code, but during compaction.
Timeline after the pgbench 10GB python test. Doesn't seem fixed. Maybe there are factors other than non-rel files that contribute to extremely sparse L1s. To make your own graph, run
NOTE: When reading these pictures, remember that both the x and y axes are compressed. I only preserve the relative order of rectangle endpoints without preserving size (otherwise we wouldn't be able to see anything). So when I say a layer is "sparse", I mean "I see that later we have 10 layers in the same key space, so it's at least 10x sparser than those layers".
Sorry, can you explain how to interpret this graph?
Empty rectangle = delta. We have 2 L0s at the top, and below them 4 batches of L1 layers. Images are black rectangles with no height, so they look like dots. We often reimage a big range at once, so we get a dotted line of images. They're mostly the same width, but some look wider because of coordinate compression (I only preserve the order of rectangle endpoints, not relative size). Images sometimes overlap in LSN range with an L1 delta layer because we take images at the latest LSN, which includes WAL from L0 and in-memory layers, which later disappear. All layers are taken with … So the story in this picture is:
I think the problem is larger and this approach only goes so far. Non-rel pages are one obvious cause of the problem, but even without them the problem exists.

Take for example an average OLTP workload. Let's squint and model this hypothetical workload as follows: write random pages, with a certain part of the keyspace being a hotspot, and the hotspot moving every few minutes to a different part. Not only will this workload produce enough sparse deltas to trigger the problem, but actually most deltas will be very sparse and cover the entire keyspace. Only the hotspot will have deltas worth reimaging over, and by definition most of the keyspace is not in a hotspot. If we add layer barriers like you do in this PR, that would only create tiny gaps between the L1 layers and eliminate less than 1% of the images.

We actually need to address the fact that currently the definition of L1 is very broad. We either need to introduce L2, and reimage only after a certain number of L2s (I haven't thought about this, and I'm not sure if just L2 will be enough), or go with my approach (classify layers based on density, as defined by "number of partitions inside the layer"), or something else.
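For concreteness, a rough sketch of the "classify by density" idea, where density is approximated by how many repartitioning parts a delta layer's key range spans. All names here (Layer, partition_starts, SPARSE_THRESHOLD) are hypothetical, not existing pageserver APIs.

```rust
use std::ops::Range;

// Assumed tuning knob: how many partitions a delta may span before we call it
// "sparse" and prefer compacting it over covering it with images.
const SPARSE_THRESHOLD: usize = 8;

struct Layer {
    key_range: Range<u64>, // simplified: keys as plain u64
}

/// Count how many partition start points fall inside the layer's key range.
/// A layer spanning many partitions holds little data per partition, so
/// reimaging every partition it touches is mostly wasted work.
fn partitions_spanned(layer: &Layer, partition_starts: &[u64]) -> usize {
    partition_starts
        .iter()
        .filter(|&&start| layer.key_range.contains(&start))
        .count()
}

fn should_compact_instead_of_reimage(layer: &Layer, partition_starts: &[u64]) -> bool {
    partitions_spanned(layer, partition_starts) > SPARSE_THRESHOLD
}
```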
Sorry, I still do not understand how coordinate compression is performed. Take, for example, the key range 00_00000000_00000000_00004001_00_000000000 .. 00_00000000_00000000_00004001_00_000000100: what will be the ratio of the lengths of the corresponding rectangles?

If we perform random updates, then it is no wonder that image layers need to be generated for the whole database range, even if the percentage of updated pages is relatively small. But let's estimate: an L0 layer contains 256MB of WAL records. The problem I am trying to address is not related to random/sparse updates; here it is really hard to improve anything.

Consider instead an L1 layer with key range 00_00000000_00000000_00000000_00_000000000 .. 00_00000000_00000000_00004001_00_08008000. It actually covers all of this huge table. With three such L1 layers we will generate 1TB of image layers, while we actually need image layers only for 256 * 3 = 768MB. Limiting the layer key range can avoid such behavior.
I take those 4 numbers, sort them, deduplicate them, and replace them with their rank.
If there are no other layers, then their widths would be the same. But as soon as other layers have endpoints inside those ranges, the drawn widths no longer reflect the real ones.
Here's more on "coordinate compression" (competitive programming slang); I'm sure you'll hear the word again: https://medium.com/algorithms-digest/coordinate-compression-2fff95326fb
So if you're looking at a picture and want to estimate the actual width, maybe count how many image layers it gets covered by later.
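To make the compression step concrete, a tiny self-contained sketch (the u64 endpoints are made-up stand-ins for the real keys):

```rust
// Coordinate compression as used for the layer plots: collect all rectangle
// endpoints, sort, deduplicate, and replace each endpoint with its rank.
// Only the relative order of endpoints survives, not their distances.
fn compress(coords: &[u64]) -> std::collections::BTreeMap<u64, usize> {
    let mut sorted: Vec<u64> = coords.to_vec();
    sorted.sort_unstable();
    sorted.dedup();
    sorted.into_iter().enumerate().map(|(rank, c)| (c, rank)).collect()
}

fn main() {
    // Endpoints of two ranges with very different real widths:
    // a tiny one (0x4001_000 .. 0x4001_100) and a huge one (0x0 .. 0x8008_000).
    let endpoints = [0x4001_000u64, 0x4001_100, 0x0, 0x8008_000];
    let ranks = compress(&endpoints);
    // After compression the tiny range spans 1 unit and the huge one 3 units,
    // so drawn widths say nothing about real widths.
    for (coord, rank) in &ranks {
        println!("{coord:#x} -> {rank}");
    }
}
```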
Yeah we can commit this and solve a special case, but that doesn't close #2948
True, to see extreme problems we need extreme scale. But it's fairly common to see mostly useless images at any scale. Look at this project https://neondb.slack.com/archives/C03UVBDM40J/p1669655812511299
That's an image:delta ratio of 12. Since (TIL) deltas are 2x larger, that means our bloat factor is 7 (1 delta + 6 times its size in images). IMO we can get the bloat down to 2, and that will make a big difference for pricing.
I think it's a solvable problem. We just compact those sparse deltas instead of reimaging them. The cost of compaction is smaller, and the effect is the same (to decrease the max difficulty below 3). We should have a limit on key_range size such that any deltas larger than it are compacted rather than reimaged.

If we really need to reimage because it's time to GC that sparse delta, then we could (ignoring it is also a fine choice). But if we're reimaging because of GC, then we can immediately delete the old images, so we're not causing any actual pageserver bloat, only write amplification and S3 bloat. In all cases we've seen so far, pageserver bloat is the problem, and for total cost of services, pageserver bloat will probably be a bigger problem anyway. All visualizations I've made so far come directly from pageserver files that haven't been GCed yet. See for example this pathological case: https://neondb.slack.com/archives/C03UVBDM40J/p1669736655098039?thread_ts=1669655812.511299&cid=C03UVBDM40J
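As a sketch of that rule (the names and the limit value are placeholders, not actual pageserver config):

```rust
use std::ops::Range;

// Hypothetical knob: deltas whose key range is wider than this limit get
// compacted again instead of being covered with new images.
const MAX_REIMAGE_KEY_WIDTH: u64 = 16 * 1024; // arbitrary placeholder value

enum Action {
    Reimage,
    Compact,
}

fn choose_action(delta_key_range: &Range<u64>) -> Action {
    if delta_key_range.end - delta_key_range.start > MAX_REIMAGE_KEY_WIDTH {
        // Too sparse to be worth reimaging: fold it into narrower deltas first.
        Action::Compact
    } else {
        Action::Reimage
    }
}
```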
Hmmm... But in any case, since this diagram doesn't represent the real key range, it seems impossible to tell just by looking at it that some long layer will cause generation of page images for the whole key range.
First of all, I do not agree that it is a "special case". Append-only tables are a quite frequent use case. Moreover, almost any import of data into the database corresponds to such a use case, so images created during such an import will suffer from this problem. Second, I once again do not completely understand the picture in #2984, and it is not clear to me why this PR will not fix this problem.
Let's return to the beginning: why do we ever need to produce image layers?
With delta layers or with partial image layers you cannot say whether a layer contains WAL records for a particular page without fetching it and performing a B-Tree search. We can use a Bloom filter or something else to avoid fetching all layers (a toy sketch of that idea follows below), but it will require additional memory for the layer map. Also, a precise last-written-LSN cache can point us to the right LSN and so avoid scanning a large number of layers. Sorry, but I do not currently understand how your suggestion can address the case of random updates, and I am not sure that anything can really be done in this case within the current layering model.
I believe (but cannot prove) that it is really caused by a "corner effect": we create delta layers which cover a large range of the key space and so require multiple image layers. If we have some test that reproduces such a layer layout, it will be quite trivial to check whether this patch addresses the problem or not.
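As a toy illustration of the Bloom-filter idea mentioned above: keep a small per-layer filter of the page numbers a layer contains, so most layers can be ruled out without fetching them. Everything here is hypothetical; the real layer map does not work this way, and the memory cost noted above is exactly the bits vector kept per layer.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

struct PageBloom {
    bits: Vec<u64>,
    num_hashes: u32,
}

impl PageBloom {
    fn new(num_bits: usize, num_hashes: u32) -> Self {
        PageBloom { bits: vec![0; (num_bits + 63) / 64], num_hashes }
    }

    fn bit_index(&self, page: u64, seed: u32) -> usize {
        let mut h = DefaultHasher::new();
        (page, seed).hash(&mut h);
        (h.finish() as usize) % (self.bits.len() * 64)
    }

    /// Record that this layer contains WAL for `page`.
    fn insert(&mut self, page: u64) {
        for seed in 0..self.num_hashes {
            let i = self.bit_index(page, seed);
            self.bits[i / 64] |= 1 << (i % 64);
        }
    }

    /// False means "definitely not in this layer"; true means "maybe".
    fn may_contain(&self, page: u64) -> bool {
        (0..self.num_hashes).all(|seed| {
            let i = self.bit_index(page, seed);
            (self.bits[i / 64] & (1 << (i % 64))) != 0
        })
    }
}
```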
I have performed some experiments.
It's not incorrect, just confusing. You're welcome to improve on the visualization method. Without any "coordinate compression" most rectangles will have 0 width and be invisible. Maybe a better method is to just use
Yes. Only after the images are created can you see in retrospect that one delta caused many images. But for some reason I found it hard to reproduce this locally (to get the useless images to actually appear). Maybe it takes too long and the test finishes before? Not sure. But in the specific workload of
I'm sure there are other cases too, e.g. if most updates occasionally update the root of some B-Tree index, etc. (I don't know pg internals, just guessing that something like that would happen, or at least we can't rule it out.)