Add Layer::overlaps method and use it in count_deltas to avoid unnecessary image layer generation #3348

Closed
wants to merge 44 commits

Conversation

@knizhnik (Contributor)

Refers to #2948.

@knizhnik (Contributor Author)

For pgbench -i 1000 with a subsequent 1000 seconds of updates, the difference in storage size is not that large:

  • main: 67GB
  • PR: 56GB

@knizhnik (Contributor Author)

It is not so easy to reproduce the behavior reported in #2948 (where the number of image layers is greater than the number of delta layers).
I was able to get about a 2x difference from main using the following test:

pgbench -i -s 1000
pgbench -s 1000 -c 10 -M prepared -T 300 -P 10 -f update.sql

update.sql contains the following:

update pgbench_accounts set abalance = abalance + 1 where aid=1;
update pgbench_accounts set abalance = abalance + 1 where aid=:scale * 100000;

  • main: 22GB
  • this PR: 14GB

@bojanserafimov (Contributor)

Now we can pile up an unlimited number of deltas. The last-LSN cache will help avoid false positives and locate the first relevant layer, but it won't help find the second relevant layer. It's a solvable layer map problem with some effort, but IMO the bigger problem is if we need to download any of these layers on demand. It doesn't feel good to have no bound on how many layers need to be downloaded to serve a get_page if the layers are not on the pageserver. This was the whole point of compaction from the start: to make on-demand download feasible. If we don't care about this, we shouldn't be creating L1 layers at all, just use L0.

@knizhnik (Contributor Author)

Now we can pile up an unlimited number of deltas. The last-LSN cache will help avoid false positives and locate the first relevant layer, but it won't help find the second relevant layer. It's a solvable layer map problem with some effort, but IMO the bigger problem is if we need to download any of these layers on demand. It doesn't feel good to have no bound on how many layers need to be downloaded to serve a get_page if the layers are not on the pageserver. This was the whole point of compaction from the start: to make on-demand download feasible. If we don't care about this, we shouldn't be creating L1 layers at all, just use L0.

Yes, I have also thought about it,
but I do not know of a good solution at the moment.

@bojanserafimov (Contributor)

I do not know of a good solution at the moment.

I tried my method of changing the definition of L1 to mean "sufficiently dense layer", such that those 2 out of 10 sparse layers would get re-compacted. I somewhat prefer this approach to adding L2 layers (which was probably the original plan) because at least the L1 layers that come out dense from the start don't need to be compacted.

It was a 2-line change in LayerMap::get_level0_deltas, but it caused some panics that I haven't investigated yet.
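As a rough illustration of that idea (not the actual change: LayerDesc, its field names, and the 0.5 threshold are all assumptions for this sketch), the density-based classification could look like this:

    /// Hypothetical sketch: treat any sparse delta layer as "level 0" so it is
    /// fed back into compaction, instead of classifying purely by key range.
    struct LayerDesc {
        key_range_len: u64, // width of the layer's key range
        occupied_keys: u64, // number of keys actually present in the layer
    }

    fn level0_deltas(all_deltas: &[LayerDesc]) -> Vec<&LayerDesc> {
        all_deltas
            .iter()
            .filter(|l| {
                // Layers with less than half of their key range occupied are
                // re-compacted; the 0.5 threshold is purely illustrative.
                (l.occupied_keys as f64) < 0.5 * (l.key_range_len as f64)
            })
            .collect()
    }

The point is that a layer is re-fed into compaction based on how much of its key range is actually occupied, not on which compaction pass produced it.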

@bojanserafimov (Contributor)

it caused some panics that I haven't investigated yet.

Here's a branch if you'd like to help debug. I'm getting:

    0: Failed to open file '/home/bojan/src/neondatabase/neon/test_output/test_pgbench[neon-45-684]/repo/tenants/eaaec3d04cf8c30a6e48ef1bf37da1c5/timelines/917ef6fff2e33c86b9b03819c4457b17/000000067F0000334E0000400C01FFFFFFFF-030000000000000000000000000000000002__00000000016A9231-00000000A0C81501'
    1: No such file or directory (os error 2)

https://github.com/neondatabase/neon/compare/density-based-l0?expand=1

(after the bug is found I'll have to tweak the threshold and then we can see if it's effective)

@knizhnik (Contributor Author)

I do not think that this approach will really work.
First of all, it may lead to an infinite compaction loop: layers produced by the compactor will never meet the "dense" criteria and will be added to the L0 layer list again and again, until compacting them takes an unacceptable amount of time.

Also, as I found myself with this PR when trying to limit a layer's key range, it is hard, if at all possible, to find a reasonable key range threshold. There can be just one big relation (e.g. 1TB), and a layer's key range may be just 256MB (so it does not satisfy your criterion for a sparse layer) yet contain only one page of that relation. On the other hand, there can be thousands of relatively small relations, and splitting them into separate layers will lead to a large number of very small layers. Also not good.

Yesterday we discussed these problems with @hlinnaka and @shanyp.
The followup is here:
https://docs.google.com/document/d/1GcmSW9_DXHou3tuezhjKyL2-ZNuxdRTAaKoi1BC08ns/edit
We decided that a first step could be to add information about key ranges to the layer map, so that it will be possible to perform fast lookups without reading layer contents from disk. For our purposes, storing ranges seems to be more efficient than a bloom filter.
But it is still unclear how many ranges we can keep in memory without making the memory footprint of the layer map too large.

So, summarizing all of the above: right now we can propose four different approaches to solving the problems with image layers:

  1. Reduce the frequency with which image layers are created. For example, create them not after 3 but after 30 delta layers. An improved layer map can be used to efficiently skip irrelevant delta layers. To reduce reconstruction cost we can either store partial images or rely on the reconstructed page cache.
  2. Force compaction of sparse layers. I think your approach can be improved with a more precise and sophisticated criterion for layer density (it may rely on the number of ranges stored in the layer map for this layer).
  3. When counting delta layers as the criterion for image layer generation, ignore layers that do not intersect the target image layer's key range (this PR; see the sketch after this list). Once again, instead of accessing the B-Tree to check overlaps we can use ranges stored in memory.
  4. Avoid creating sparse deltas at all. Key range does not look like a good criterion; detecting large "holes" seems better. But I think we should take into account not only the content of the delta layer, but also image coverage. I.e., if we have a delta layer with a few pages belonging to two relations, we should consider how large these relations are. If they are small, then there is no sense in trying to split this delta layer.
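A minimal sketch of approach 3, assuming simplified stand-in types (the real Key, layer, and count_deltas signatures differ):

    use std::ops::Range;

    type Key = u64;

    struct DeltaLayerDesc {
        key_range: Range<Key>,
    }

    impl DeltaLayerDesc {
        /// True if this delta's key range intersects the target partition.
        fn overlaps(&self, partition: &Range<Key>) -> bool {
            self.key_range.start < partition.end && partition.start < self.key_range.end
        }
    }

    /// Count only the deltas that can contribute to the target image layer;
    /// deltas elsewhere in the key space no longer trigger re-imaging here.
    fn count_deltas(deltas: &[DeltaLayerDesc], partition: &Range<Key>) -> usize {
        deltas.iter().filter(|d| d.overlaps(partition)).count()
    }

With this, a handful of narrow deltas touching only a couple of pages no longer pushes every partition of a large database toward re-imaging.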

@bojanserafimov (Contributor)

Here's the effect on pgbench 10gb:

[Screenshot from 2023-01-20 12-48-01]

It's not the best workload to test this on, because I don't think this change has much effect on the random updates phase (top of the diagram). But at least we can see that layers are being created later than usual, and all at the same time. So the two long L1s at the bottom probably have large holes and are not affecting the image heuristic.

I don't know if there's a better way to test this than to deploy and see what happens. The cost is not that big (only a log_2(10) increase in pressure on the new layer map; not sure about the current layer map).

Off topic: I don't know how to interpret all the layers at the right end of the key space. Something changed on the main branch since fe8cef3. See for example #2995 (comment), which is a PR branched from an older version of main and doesn't have those layers. I also see some 24KB L1 deltas now. There might be some regression on main.

@hlinnaka (Contributor)

What would be the worst case workload to demonstrate the problem?

@bojanserafimov (Contributor)

What would be the worst case workload to demonstrate the problem?

Worst case is probably synthetic uniform random updates. The top 10 holes wouldn't take up a big percentage of the range. But this seems adversarial, and in real workloads I'd expect that the top 10 holes would take up a significant percentage of the range.

@hlinnaka (Contributor)

What would be the worst case workload to demonstrate the problem?

Worst case is probably synthetic uniform random updates. The top 10 holes wouldn't take up a big percentage of the range. But this seems adversarial, and in real workloads I'd expect that the top 10 holes would take up a significant percentage of the range.

I meant, worst case to demonstrate the original problem that this fixes.

@bojanserafimov (Contributor)

I meant, worst case to demonstrate the original problem that this fixes.

The bigger the database, the worse it is. If we have a 1TB database, we will spontaneously create 1TB of images once in a while, even if the compute is doing nothing but vacuum. We routinely see databases with 5-10x their size in useless images.

It's hard to reproduce this because of nondeterminism. But generally, because we do a full reimage of the database once we get 3 batches of L1s that cover the entire key space, and most L1 batches cover the entire key space, 30-50GB of WAL is (my guess) enough to trigger reimaging. But it can also trigger just from the passage of time: we can periodically flush tiny L0s, then compact them into tiny L1s and reimage. We've also seen this in prod.

@bojanserafimov (Contributor)

We could write a binary that we execute against a timeline on prod to download layer metadata: layer range + 100 largest holes.

This gives us enough information to run the count_deltas function (locally!) for each image, and compute how many of those would not have existed with this PR. We can then fine-tune the number of splits to create (IMO 10 is the right number, and 100 is also affordable after we pay off the tech debt on layer map rebuilds).

Yes, all this testing is extra effort compared to just merging this, waiting a week and then looking at prod metrics, but IMO that local reimaging simulation would make a nice unit test anyway. We should be testing compaction/reimaging/gc/branching/upload/download in unit tests, just using metadata. There's an old issue for this assigned to me that I somehow got distracted from :) #2031

@knizhnik (Contributor Author)

knizhnik commented Jan 26, 2023

I have implemented the proposed binary.
I have not yet run it on prod, just tried it on Kontor and Ketteq data with different numbers of stored holes:

    data set   max holes   delta layers   image layers   excess image layers
    Kontor     5           151            2184           1321
    Kontor     10          151            2184           1322
    Kontor     100         151            2184           1322
    Ketteq     5           50             112            54
    Ketteq     10          50             112            61
    Ketteq     100         50             112            69

So it looks like this hole optimization should really help to reduce the number of extra image layers.
And keeping information about the 10 largest holes is enough.

@bojanserafimov (Contributor) left a comment

I left a few nits.

I'm curious about results from at least one of these databases: https://neondb.slack.com/archives/C03UVBDM40J/p1674749418292749

    let image_exact_match = img_lsn + 1 == end_lsn;
    if image_is_newer || image_exact_match {
    pub fn search(&self, key: Key, mut end_lsn: Lsn) -> Option<SearchResult<L>> {
    loop {
Contributor:

It's more efficient to change the insert method instead of search (insert the occupied ranges separately). Maybe leave a TODO if that's not planned for this PR.
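A sketch of what that suggestion might look like, with a placeholder Coverage type rather than the real layer map structure: split the layer's key range around its known holes at insert time, so search never has to consult hole metadata.

    use std::ops::Range;

    type Key = u64;

    /// Placeholder for the real coverage structure inside the layer map.
    struct Coverage {
        ranges: Vec<(Range<Key>, usize)>, // (occupied key range, layer id)
    }

    impl Coverage {
        /// Register only the occupied sub-ranges of a layer, skipping its holes
        /// (holes are assumed sorted and contained in `key_range`).
        fn insert_layer(&mut self, layer_id: usize, key_range: Range<Key>, holes: &[Range<Key>]) {
            let mut start = key_range.start;
            for hole in holes {
                if hole.start > start {
                    self.ranges.push((start..hole.start, layer_id));
                }
                start = hole.end;
            }
            if start < key_range.end {
                self.ranges.push((start..key_range.end, layer_id));
            }
        }
    }

A consequence is that the same layer may appear under several coverage ranges, which is exactly the question raised below.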

Contributor Author:

Will it be OK to have multiple references to the same layer?
Can it cause the same layer to be returned twice when iterating through layers?

Contributor:

When iterating the coverage it's good to return the same layer multiple times. But iter_historic should return each layer only once.

    @@ -417,18 +433,21 @@ where
    /// TODO The optimal number should probably be slightly higher than 1, but to
    /// implement that we need to plumb a lot more context into this function
    /// than just the current partition_range.
    - pub fn is_reimage_worthy(layer: &L, partition_range: &Range<Key>) -> bool {
    + pub fn is_reimage_worthy(layer: &L, partition_range: &Range<Key>) -> Result<bool> {
Contributor:

nit: It never returns Err. Is there a reason for this change?

Contributor Author:

overlaps may return an error.
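Roughly, the fallibility comes from the possibility that computing a precise overlap needs the layer's hole information, which may have to be loaded from disk. A hedged sketch (the lazy-loading detail and all names are assumptions based on the discussion above; anyhow is used because the pageserver already depends on it):

    use std::ops::Range;

    use anyhow::Result;

    type Key = u64;

    struct DeltaLayerDesc {
        key_range: Range<Key>,
        holes: Option<Vec<Range<Key>>>, // None until loaded, e.g. for on-demand layers
    }

    impl DeltaLayerDesc {
        fn load_holes(&self) -> Result<Vec<Range<Key>>> {
            // In reality this would read the layer index from disk and can fail,
            // which is why overlaps (and hence is_reimage_worthy) is fallible.
            Ok(Vec::new())
        }

        fn overlaps(&self, partition: &Range<Key>) -> Result<bool> {
            if self.key_range.end <= partition.start || partition.end <= self.key_range.start {
                return Ok(false);
            }
            let holes = match &self.holes {
                Some(h) => h.clone(),
                None => self.load_holes()?,
            };
            // The key ranges intersect, but ignore the match if the whole
            // partition falls inside one of this layer's holes.
            Ok(!holes
                .iter()
                .any(|h| h.start <= partition.start && partition.end <= h.end))
        }
    }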

knizhnik added a commit that referenced this pull request Jan 31, 2023
@knizhnik (Contributor Author)

knizhnik commented Feb 1, 2023

Ugh...
Propagating RequestContext everywhere is really terrible.

I will be pleased if @problame or @hlinnaka can review whether I did everything as expected.
Also, this PR fixes two remote tests added by @hlinnaka - can you look at them, please?

@bojanserafimov (Contributor)

Just so the comment doesn't get lost in history:

I'm curious about results from at least one of these databases: https://neondb.slack.com/archives/C03UVBDM40J/p1674749418292749

These projects have the potential for more than 50% of their images disappearing, and I'm wondering if that will happen.

@koivunej (Member) left a comment

A quick comment while trying to understand the on-demand calculation of holes while not actually holding any locks. Are there intentional changes for vendor/postgres-v1{4,5} as well? I cannot see how these changes could require postgres changes.

    anyhow::bail!("replacing downloaded layer into layermap failed because layer was not found");
    }
    Replacement::RemovalBuffered => {
    unreachable!("current implementation does not remove anything")
Member:

This cannot be pulled into replace_historic_noflush, because this was true only for the one call site.

Contributor Author:

But I didn't find any other call where Replacement::RemovalBuffered is handled differently, did you?

@koivunej (Member) commented Feb 13, 2023:

The buffered updates API allows you to do remove_historic and replace, which would make this path viable.

@knizhnik (Contributor Author)

A quick comment while trying to understand the on-demand calculation of holes while not actually holding any locks.

Layers are read-only. So why do we need any locks here?

Are there intentional changes for vendor/postgres-v1{4,5} as well? I cannot see how these changes could require postgres changes.

Sorry, they are completely unrelated.
I will remove them. Just ignore them for now.

knizhnik added a commit that referenced this pull request Feb 22, 2023
## Describe your changes

This is yet another attempt to address the problem with storage size
ballooning (#2948).
The previous PR #3348 tried to address this problem by maintaining a list of
holes for each layer.
The problem with that approach is that we have to load all layers on
pageserver start; lazy loading of layers is no longer possible.

This PR instead collects information about the N largest holes at compaction
time and excludes these holes from the produced layers.
It can cause generation of a larger number of layers (up to 2 times more) and
produce smaller layers, but it requires minimal changes in the code and
doesn't affect the storage format.

For a graphical explanation, please see the thread:
#3597 (comment)
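A minimal sketch of the hole-collection idea described above: while scanning keys in sorted order during compaction, keep only the N largest gaps between consecutive keys (the key representation and helper are illustrative, not the actual pageserver code).

    use std::cmp::Reverse;
    use std::collections::BinaryHeap;

    /// Keep the `max_holes` largest gaps between consecutive keys seen during
    /// a compaction pass. Keys are simplified to u64 for this sketch.
    fn largest_holes(sorted_keys: &[u64], max_holes: usize) -> Vec<(u64, u64)> {
        // Min-heap ordered by gap size, so the smallest retained hole is
        // evicted whenever we already hold `max_holes` entries.
        let mut heap: BinaryHeap<Reverse<(u64, u64, u64)>> = BinaryHeap::new();
        for w in sorted_keys.windows(2) {
            let (prev, next) = (w[0], w[1]);
            let gap = next.saturating_sub(prev + 1);
            if gap == 0 {
                continue; // contiguous keys, no hole here
            }
            heap.push(Reverse((gap, prev + 1, next)));
            if heap.len() > max_holes {
                heap.pop(); // drop the smallest hole collected so far
            }
        }
        heap.into_iter()
            .map(|Reverse((_gap, start, end))| (start, end))
            .collect()
    }

Each collected hole can then serve as a cut point, so the compactor emits layers on either side of the gap instead of one sparse layer spanning it.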

## Issue ticket number and link

#2948
#3348

## Checklist before requesting a review
- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? If so, did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
@knizhnik knizhnik closed this Mar 22, 2023
@bayandin bayandin deleted the delta_layer_overlaps branch May 19, 2023 13:06