Add Layer::overlaps method and use it in count_deltas to avoid unnecessary image layer generation #3348

Closed
wants to merge 44 commits

Conversation

@knizhnik (Contributor)

Refers to #2948.

@knizhnik (Contributor Author)

For pgbench -i 1000 with a subsequent 1000 seconds of updates, the difference in storage size is not that large:

  • main: 67GB
  • PR: 56GB

@knizhnik (Contributor Author)

It is not so easy to reproduce the behavior reported in #2948 (where the number of image layers is greater than the number of delta layers).
I was able to get about a 2x difference from main using the following test:

pgbench -i -s 1000
pgbench -s 1000 -c 10 -M prepared -T 300 -P 10 -f update.sql

update.sql contains the following:

update pgbench_accounts set abalance = abalance + 1 where aid=1;
update pgbench_accounts set abalance = abalance + 1 where aid=:scale * 100000;

  • main: 22GB
  • this PR: 14GB

@bojanserafimov (Contributor)

Now we can pile up an unlimited number of deltas. The last-LSN cache will help avoid false positives and locate the first relevant layer, but it won't help find the second relevant layer. It's a solvable layer map problem with some effort, but IMO the bigger problem is if we need to download any of these layers on demand. It doesn't feel good to have no bound on how many layers need to be downloaded to serve a get_page if the layers are not on the pageserver. This was the whole point of compaction from the start: to make on-demand download feasible. If we don't care about this, we shouldn't be creating L1 layers at all, just use L0.

@knizhnik (Contributor Author)

Now we can pile up an unlimited number of deltas. The last-LSN cache will help avoid false positives and locate the first relevant layer, but it won't help find the second relevant layer. It's a solvable layer map problem with some effort, but IMO the bigger problem is if we need to download any of these layers on demand. It doesn't feel good to have no bound on how many layers need to be downloaded to serve a get_page if the layers are not on the pageserver. This was the whole point of compaction from the start: to make on-demand download feasible. If we don't care about this, we shouldn't be creating L1 layers at all, just use L0.

Yes, I have also thought about it,
but I do not know of a good solution at the moment.

@bojanserafimov (Contributor)

I do not know of a good solution at the moment.

I tried my method of changing the definition of L1 to mean "sufficiently dense layer", such that those 2 out of 10 sparse layers would get re-compacted. I somewhat prefer this approach to adding L2 layers (which was probably the original plan) because at least the L1 layers that come out dense from the start don't need to be compacted.

It was a 2-line change in LayerMap::get_level0_deltas, but it caused some panics that I haven't investigated yet.
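As a rough illustration of that idea (not the actual change: LayerDesc, its field names, and the 0.5 threshold are all assumptions for this sketch), the density-based classification could look like this:

    /// Hypothetical sketch: treat any sparse delta layer as "level 0" so it is
    /// fed back into compaction, instead of classifying purely by key range.
    struct LayerDesc {
        key_range_len: u64, // width of the layer's key range
        occupied_keys: u64, // number of keys actually present in the layer
    }

    fn level0_deltas(all_deltas: &[LayerDesc]) -> Vec<&LayerDesc> {
        all_deltas
            .iter()
            .filter(|l| {
                // Layers with less than half of their key range occupied are
                // re-compacted; the 0.5 threshold is purely illustrative.
                (l.occupied_keys as f64) < 0.5 * (l.key_range_len as f64)
            })
            .collect()
    }

The point is that a layer is re-fed into compaction based on how much of its key range is actually occupied, not on which compaction pass produced it.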

@bojanserafimov (Contributor)

it caused some panics that I haven't investigated yet.

Here's a branch if you'd like to help debug. I'm getting:

    0: Failed to open file '/home/bojan/src/neondatabase/neon/test_output/test_pgbench[neon-45-684]/repo/tenants/eaaec3d04cf8c30a6e48ef1bf37da1c5/timelines/917ef6fff2e33c86b9b03819c4457b17/000000067F0000334E0000400C01FFFFFFFF-030000000000000000000000000000000002__00000000016A9231-00000000A0C81501'
    1: No such file or directory (os error 2)

https://github.com/neondatabase/neon/compare/density-based-l0?expand=1

(after the bug is found I'll have to tweak the threshold and then we can see if it's effective)

@knizhnik (Contributor Author)

I do not think that this approach will really work.
First of all, it may lead to an infinite compaction loop: layers produced by the compactor will never meet the "dense" criteria and will be added to the L0 layer list again and again, until compacting them takes an unacceptable amount of time.

Also, as I found myself with this PR when trying to limit a layer's key range, it is hard, if at all possible, to find a reasonable key range threshold. There can be just one big relation (e.g. 1TB), and a layer's key range may be just 256MB (so it does not satisfy your criterion for a sparse layer) yet contain only one page of that relation. On the other hand, there can be thousands of relatively small relations, and splitting them into separate layers will lead to a large number of very small layers. Also not good.

Yesterday we discussed these problems with @hlinnaka and @shanyp.
The followup is here:
https://docs.google.com/document/d/1GcmSW9_DXHou3tuezhjKyL2-ZNuxdRTAaKoi1BC08ns/edit
We decided that a first step could be to add information about key ranges to the layer map, so that it will be possible to perform fast lookups without reading layer contents from disk. For our purposes, storing ranges seems to be more efficient than a bloom filter.
But it is still unclear how many ranges we can keep in memory without making the memory footprint of the layer map too large.

So, summarizing all of the above: right now we can propose four different approaches to solving the problems with image layers:

  1. Reduce the frequency with which image layers are created. For example, create them not after 3 but after 30 delta layers. An improved layer map can be used to efficiently skip irrelevant delta layers. To reduce reconstruction cost we can either store partial images or rely on the reconstructed page cache.
  2. Force compaction of sparse layers. I think your approach can be improved with a more precise and sophisticated criterion for layer density (it may rely on the number of ranges stored in the layer map for this layer).
  3. When counting delta layers as the criterion for image layer generation, ignore layers that do not intersect the target image layer's key range (this PR; see the sketch after this list). Once again, instead of accessing the B-Tree to check overlaps we can use ranges stored in memory.
  4. Avoid creating sparse deltas at all. Key range does not look like a good criterion; detecting large "holes" seems better. But I think we should take into account not only the content of the delta layer, but also image coverage. I.e., if we have a delta layer with a few pages belonging to two relations, we should consider how large these relations are. If they are small, then there is no sense in trying to split this delta layer.
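A minimal sketch of approach 3, assuming simplified stand-in types (the real Key, layer, and count_deltas signatures differ):

    use std::ops::Range;

    type Key = u64;

    struct DeltaLayerDesc {
        key_range: Range<Key>,
    }

    impl DeltaLayerDesc {
        /// True if this delta's key range intersects the target partition.
        fn overlaps(&self, partition: &Range<Key>) -> bool {
            self.key_range.start < partition.end && partition.start < self.key_range.end
        }
    }

    /// Count only the deltas that can contribute to the target image layer;
    /// deltas elsewhere in the key space no longer trigger re-imaging here.
    fn count_deltas(deltas: &[DeltaLayerDesc], partition: &Range<Key>) -> usize {
        deltas.iter().filter(|d| d.overlaps(partition)).count()
    }

With this, a handful of narrow deltas touching only a couple of pages no longer pushes every partition of a large database toward re-imaging.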

@bojanserafimov (Contributor)

Here's the effect on pgbench 10gb:

[Screenshot from 2023-01-20 12-48-01]

It's not the best workload to test this on, because I don't think this change has much effect on the random updates phase (top of the diagram). But at least we can see that layers are being created later than usual, and all at the same time. So the two long L1s at the bottom probably have large holes and are not affecting the image heuristic.

I don't know if there's a better way to test this than to deploy and see what happens. The cost is not that big (only a log_2(10) increase in pressure on the new layer map; not sure about the current layer map).

Off topic: I don't know how to interpret all the layers at the right end of the key space. Something changed on the main branch since fe8cef3. See for example #2995 (comment), which is a PR branched from an older version of main and doesn't have those layers. I also see some 24KB L1 deltas now. There might be some regression on main.

@hlinnaka (Contributor)

What would be the worst case workload to demonstrate the problem?

@bojanserafimov (Contributor)

What would be the worst case workload to demonstrate the problem?

Worst case is probably synthetic uniform random updates. The top 10 holes wouldn't take up a big percentage of the range. But this seems adversarial, and in real workloads I'd expect that the top 10 holes would take up a significant percentage of the range.

@hlinnaka (Contributor)

What would be the worst case workload to demonstrate the problem?

Worst case is probably synthetic uniform random updates. The top 10 holes wouldn't take up a big percentage of the range. But this seems adversarial, and in real workloads I'd expect that the top 10 holes would take up a significant percentage of the range.

I meant, worst case to demonstrate the original problem that this fixes.

@bojanserafimov (Contributor)

I meant, worst case to demonstrate the original problem that this fixes.

The bigger the database, the worse it is. If we have a 1TB database, we will spontaneously create 1TB of images once in a while, even if the compute is doing nothing but vacuum. We routinely see databases with 5-10x their size in useless images.

It's hard to reproduce this because of nondeterminism. But generally, because we do a full reimage of the database once we get 3 batches of L1s that cover the entire key space, and most L1 batches cover the entire key space, 30-50GB of WAL is (my guess) enough to trigger reimaging. But it can also trigger just from the passage of time: we can periodically flush tiny L0s, then compact them into tiny L1s and reimage. We've also seen this in prod.

@bojanserafimov (Contributor)

We could write a binary that we execute against a timeline on prod to download layer metadata: layer range + 100 largest holes.

This gives us enough information to run the count_deltas function (locally!) for each image, and compute how many of those would not have existed with this PR. We can then fine-tune the number of splits to create (IMO 10 is the right number, and 100 is also affordable after we pay off the tech debt on layer map rebuilds).

Yes, all this testing is extra effort compared to just merging this, waiting a week and then looking at prod metrics, but IMO that local reimaging simulation would make a nice unit test anyway. We should be testing compaction/reimaging/gc/branching/upload/download in unit tests, just using metadata. There's an old issue for this assigned to me that I somehow got distracted from :) #2031

@knizhnik (Contributor Author)

knizhnik commented Jan 26, 2023

I have implemented the proposed binary.
I have not yet run it on prod, just tried it on Kontor and Ketteq data with different numbers of stored holes:

    data set   max holes   delta layers   image layers   excess image layers
    Kontor     5           151            2184           1321
    Kontor     10          151            2184           1322
    Kontor     100         151            2184           1322
    Ketteq     5           50             112            54
    Ketteq     10          50             112            61
    Ketteq     100         50             112            69

So it looks like this hole optimization should really help to reduce the number of extra image layers.
And keeping information about the 10 largest holes is enough.

@bojanserafimov (Contributor) left a comment

I left a few nits.

I'm curious about results from at least one of these databases: https://neondb.slack.com/archives/C03UVBDM40J/p1674749418292749

    let image_exact_match = img_lsn + 1 == end_lsn;
    if image_is_newer || image_exact_match {
    pub fn search(&self, key: Key, mut end_lsn: Lsn) -> Option<SearchResult<L>> {
    loop {
Contributor:

It's more efficient to change the insert method instead of search (insert the occupied ranges separately). Maybe leave a TODO if that's not planned for this PR.
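A sketch of what that suggestion might look like, with a placeholder Coverage type rather than the real layer map structure: split the layer's key range around its known holes at insert time, so search never has to consult hole metadata.

    use std::ops::Range;

    type Key = u64;

    /// Placeholder for the real coverage structure inside the layer map.
    struct Coverage {
        ranges: Vec<(Range<Key>, usize)>, // (occupied key range, layer id)
    }

    impl Coverage {
        /// Register only the occupied sub-ranges of a layer, skipping its holes
        /// (holes are assumed sorted and contained in `key_range`).
        fn insert_layer(&mut self, layer_id: usize, key_range: Range<Key>, holes: &[Range<Key>]) {
            let mut start = key_range.start;
            for hole in holes {
                if hole.start > start {
                    self.ranges.push((start..hole.start, layer_id));
                }
                start = hole.end;
            }
            if start < key_range.end {
                self.ranges.push((start..key_range.end, layer_id));
            }
        }
    }

A consequence is that the same layer may appear under several coverage ranges, which is exactly the question raised below.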

Contributor Author:

Will it be OK to have multiple references to the same layer?
Can it cause the same layer to be returned twice when iterating through layers?

Contributor:

When iterating the coverage it's good to return the same layer multiple times. But iter_historic should return each layer only once.

    @@ -417,18 +433,21 @@ where
    /// TODO The optimal number should probably be slightly higher than 1, but to
    /// implement that we need to plumb a lot more context into this function
    /// than just the current partition_range.
    - pub fn is_reimage_worthy(layer: &L, partition_range: &Range<Key>) -> bool {
    + pub fn is_reimage_worthy(layer: &L, partition_range: &Range<Key>) -> Result<bool> {
Contributor:

nit: It never returns Err. Is there a reason for this change?

Contributor Author:

overlaps may return an error.
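Roughly, the fallibility comes from the possibility that computing a precise overlap needs the layer's hole information, which may have to be loaded from disk. A hedged sketch (the lazy-loading detail and all names are assumptions based on the discussion above; anyhow is used because the pageserver already depends on it):

    use std::ops::Range;

    use anyhow::Result;

    type Key = u64;

    struct DeltaLayerDesc {
        key_range: Range<Key>,
        holes: Option<Vec<Range<Key>>>, // None until loaded, e.g. for on-demand layers
    }

    impl DeltaLayerDesc {
        fn load_holes(&self) -> Result<Vec<Range<Key>>> {
            // In reality this would read the layer index from disk and can fail,
            // which is why overlaps (and hence is_reimage_worthy) is fallible.
            Ok(Vec::new())
        }

        fn overlaps(&self, partition: &Range<Key>) -> Result<bool> {
            if self.key_range.end <= partition.start || partition.end <= self.key_range.start {
                return Ok(false);
            }
            let holes = match &self.holes {
                Some(h) => h.clone(),
                None => self.load_holes()?,
            };
            // The key ranges intersect, but ignore the match if the whole
            // partition falls inside one of this layer's holes.
            Ok(!holes
                .iter()
                .any(|h| h.start <= partition.start && partition.end <= h.end))
        }
    }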

knizhnik added a commit that referenced this pull request Jan 31, 2023
@knizhnik (Contributor Author)

knizhnik commented Feb 1, 2023

Ugh...
Propagating RequestContext everywhere is really terrible.

I will be pleased if @problame or @hlinnaka can review whether I did everything as expected.
Also, this PR fixes two remote tests added by @hlinnaka - can you look at them, please?

@bojanserafimov (Contributor)

Just so the comment doesn't get lost in history:

I'm curious about results from at least one of these databases: https://neondb.slack.com/archives/C03UVBDM40J/p1674749418292749

These projects have the potential for more than 50% of their images disappearing, and I'm wondering if that will happen.

@koivunej (Member) left a comment

A quick comment while trying to understand the on-demand calculation of holes while not actually holding any locks. Are there intentional changes for vendor/postgres-v1{4,5} as well? I cannot see how these changes could require postgres changes.

    anyhow::bail!("replacing downloaded layer into layermap failed because layer was not found");
    }
    Replacement::RemovalBuffered => {
    unreachable!("current implementation does not remove anything")
Member:

This cannot be pulled into replace_historic_noflush, because this was true only for the one call site.

Contributor Author:

But I didn't find any other call where Replacement::RemovalBuffered is handled differently, did you?

@koivunej (Member) commented Feb 13, 2023:

The buffered updates API allows you to do remove_historic and replace, which would make this path viable.

@knizhnik (Contributor Author)

A quick comment while trying to understand the on-demand calculation of holes while not actually holding any locks.

Layers are read-only. So why do we need any locks here?

Are there intentional changes for vendor/postgres-v1{4,5} as well? I cannot see how these changes could require postgres changes.

Sorry, they are completely unrelated.
I will remove them. Just ignore them for now.

knizhnik added a commit that referenced this pull request Feb 22, 2023
## Describe your changes

This is yet another attempt to address the problem with storage size
ballooning (#2948).
The previous PR #3348 tried to address this problem by maintaining a list of
holes for each layer.
The problem with that approach is that we have to load all layers on
pageserver start; lazy loading of layers is no longer possible.

This PR instead collects information about the N largest holes at compaction
time and excludes these holes from the produced layers.
It can cause generation of a larger number of layers (up to 2 times more) and
produce smaller layers, but it requires minimal changes in the code and
doesn't affect the storage format.

For a graphical explanation, please see the thread:
#3597 (comment)
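A minimal sketch of the hole-collection idea described above: while scanning keys in sorted order during compaction, keep only the N largest gaps between consecutive keys (the key representation and helper are illustrative, not the actual pageserver code).

    use std::cmp::Reverse;
    use std::collections::BinaryHeap;

    /// Keep the `max_holes` largest gaps between consecutive keys seen during
    /// a compaction pass. Keys are simplified to u64 for this sketch.
    fn largest_holes(sorted_keys: &[u64], max_holes: usize) -> Vec<(u64, u64)> {
        // Min-heap ordered by gap size, so the smallest retained hole is
        // evicted whenever we already hold `max_holes` entries.
        let mut heap: BinaryHeap<Reverse<(u64, u64, u64)>> = BinaryHeap::new();
        for w in sorted_keys.windows(2) {
            let (prev, next) = (w[0], w[1]);
            let gap = next.saturating_sub(prev + 1);
            if gap == 0 {
                continue; // contiguous keys, no hole here
            }
            heap.push(Reverse((gap, prev + 1, next)));
            if heap.len() > max_holes {
                heap.pop(); // drop the smallest hole collected so far
            }
        }
        heap.into_iter()
            .map(|Reverse((_gap, start, end))| (start, end))
            .collect()
    }

Each collected hole can then serve as a cut point, so the compactor emits layers on either side of the gap instead of one sparse layer spanning it.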

## Issue ticket number and link

#2948
#3348

## Checklist before requesting a review
- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? If so, did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
@knizhnik knizhnik closed this Mar 22, 2023
@bayandin bayandin deleted the delta_layer_overlaps branch May 19, 2023 13:06