Possible regression: small L1s on main #3393
Isn't this the expected result of the "greedy" compaction algorithm? It just tries to produce layers of the specified size.
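For illustration, here is a minimal sketch of such a greedy packer (not the pageserver's actual compaction code; the names and the 128MB target are placeholders). Cutting a layer whenever the accumulated size reaches the target means the tail of the key range always lands in a layer of leftover size:

```rust
// Minimal sketch of greedy size-based layer packing (not Neon's actual
// compaction code). Keys are walked in order and a new layer is cut as
// soon as the accumulated size reaches the target; whatever is left at
// the end of the key range becomes the final layer, however small.

const TARGET_LAYER_SIZE: u64 = 128 * 1024 * 1024; // hypothetical 128MB target

/// One key's worth of delta records, with its total on-disk size in bytes.
struct KeyDeltas {
    key: u64,
    size: u64,
}

fn pack_into_layers(keys: &[KeyDeltas]) -> Vec<Vec<u64>> {
    let mut layers = Vec::new();
    let mut current = Vec::new();
    let mut current_size = 0u64;

    for kd in keys {
        current.push(kd.key);
        current_size += kd.size;
        if current_size >= TARGET_LAYER_SIZE {
            layers.push(std::mem::take(&mut current));
            current_size = 0;
        }
    }
    // The tail of the key range: it can be arbitrarily small, e.g. 24KB.
    if !current.is_empty() {
        layers.push(current);
    }
    layers
}
```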
Maybe. But it reproduces pretty reliably: I used to never see it, and now I always see it. And it comes with a few other anomalies:
Not sure how to debug this, other than git bisect.
The first bad commit is 12e6f44. So there's no regression, just the test changed. Is it expected that server-side generation during init should create 50% more deltas (among all the other differences in behavior)? I'd expect the pageserver workload to be the same.
There are three possible cases when delta layers may end up smaller than the specified size. Only 1) is applicable to L1 layers; 2) and 3) apply to L0 layers. In the pgbench case we have case 1). Let's look at some of the delta layers smaller than 128MB and search this LSN range:
As you can see, it is the last segment in the key range. So this is one more argument against the current compaction algorithm: right now compaction may produce arbitrarily small L1 layers, which is not good.
In my case (the filenames I posted in the issue) it looks more like case 2. The layers are at the end of the key range, but their short LSN ranges and the way they stack on top of each other can only be explained by case 2. Picture (see the 3 layers on the bottom right):

I wonder, when we split by LSN for case 2, what's our file size threshold? Hopefully not 24KB; that would be too small. But at the same time, if it's larger, then we won't make image layers for the hot page, and reads will get very expensive. We need to inject these tall layers with page images for the hot page.
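A hedged sketch of what such image injection could look like (the struct, names, and threshold below are hypothetical, not Neon's API): compaction would emit a page image for a key once too many deltas have stacked up above the last image:

```rust
// Sketch of the "inject page images for the hot page" idea from the comment
// above; names and the threshold are hypothetical, not the actual pageserver
// design.

const MAX_DELTAS_ABOVE_IMAGE: usize = 100; // assumed redo-chain length limit

struct KeyHistory {
    /// LSN of the most recent materialized page image, if any.
    last_image_lsn: Option<u64>,
    /// LSNs of delta (WAL) records for this key, ascending.
    delta_lsns: Vec<u64>,
}

/// Decide whether compaction should emit a page image for this key: if too
/// many deltas have stacked up since the last image, every read of the page
/// would have to replay that whole chain.
fn should_materialize_image(h: &KeyHistory) -> bool {
    let deltas_since_image = match h.last_image_lsn {
        Some(img) => h.delta_lsns.iter().filter(|&&lsn| lsn > img).count(),
        None => h.delta_lsns.len(),
    };
    deltas_since_image > MAX_DELTAS_ABOVE_IMAGE
}
```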
A very useful number to have for such heuristics: For what number X is the latency of redo IPC equal to the cost of redoing X records? |
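As a worked example of that break-even number (both constants are made-up placeholders to be measured, not real Neon figures):

```rust
// The break-even number asked for above, as arithmetic: with a fixed
// per-request IPC latency and a per-record redo cost, X is the chain length
// at which one IPC round-trip costs the same as redoing X records.

const IPC_LATENCY_US: f64 = 50.0; // assumed walredo round-trip latency
const REDO_COST_PER_RECORD_US: f64 = 0.5; // assumed cost to apply one record

fn break_even_chain_length() -> f64 {
    // X = IPC latency / per-record redo cost
    IPC_LATENCY_US / REDO_COST_PER_RECORD_US
}

fn main() {
    // With the assumed numbers: 50 / 0.5 = 100 records per IPC round-trip.
    println!("break-even X = {}", break_even_chain_length());
}
```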
The threshold is the same as for any other delta layer: 256MB. But the last layer with duplicates can be arbitrarily small, because we cannot place any other keys in this layer.
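Case 2 is essentially the same greedy cut as in the earlier sketch, but along the LSN axis: one key's records are sliced into target-sized chunks, and since no other key may share the layer, the last slice holds whatever remains. A minimal sketch, with hypothetical names:

```rust
// Sketch only: slicing one hot key's delta records by LSN into ~256MB layers.
const TARGET: u64 = 256 * 1024 * 1024; // 256MB, as stated above

/// `records` are (lsn, size_in_bytes) for a single key, LSN-ascending.
/// Returns the byte size of each resulting layer slice.
fn lsn_slice_sizes(records: &[(u64, u64)]) -> Vec<u64> {
    let mut sizes = Vec::new();
    let mut acc = 0u64;
    for &(_lsn, size) in records {
        acc += size;
        if acc >= TARGET {
            sizes.push(acc);
            acc = 0;
        }
    }
    if acc > 0 {
        sizes.push(acc); // the arbitrarily small final slice
    }
    sizes
}
```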
Sorry, what does IPC stand for?
Hmm, then maybe this is not case 2. But it doesn't look like case 1 either, based on the LSN range (usually layers come in batches, but these three are not part of a batch).
Interprocess communication. Or is that no longer important after async walredo (I haven't reviewed that PR)?
I do not think that comparing redo time with IPC latency makes a lot of sense. The question "to redo or not to redo" is really very difficult. I think that the success of Neon will greatly depend on whether we find an answer to this question. There are lots of different aspects.

Assume that some page is frequently updated, so there is a large number of WAL records associated with this page. Should we force reconstruction of this page? If the page is frequently updated, then most likely it is kept in shared buffers, so compute will not retrieve it from the pageserver, and redoing it can be considered useless work. Also, as long as the page is frequently updated, the reconstructed image will deteriorate very fast, and we still need to perform WAL redo. I do not remember the precise numbers, but it looks like applying 100 WAL records takes almost the same time as applying 1. So there is not much sense in reconstructing a page too frequently.

On the other hand, if the chain of WAL records becomes too long, then fetching and applying it may take a lot of time. Ideally, a page should be reconstructed right before it is requested by compute, but that is hard to predict. If a page is thrown away from shared buffers, it doesn't mean that it will be accessed in the near future and that compute will need to send a request for it.
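One possible way to encode those trade-offs as a heuristic (purely a sketch with made-up thresholds, not a proposed design):

```rust
// Sketch of a "to redo or not to redo" heuristic combining the factors named
// above: update frequency (a proxy for shared-buffer residency) and the
// length of the WAL record chain. All thresholds are assumptions.

struct PageState {
    /// Number of WAL records since the last materialized image.
    chain_len: usize,
    /// Recent update rate; a hot page is probably in compute's shared
    /// buffers, and a fresh image would deteriorate quickly anyway.
    updates_per_minute: f64,
}

const HARD_CHAIN_LIMIT: usize = 1000; // assumed cap on redo chain length
const HOT_PAGE_UPDATES_PER_MINUTE: f64 = 10.0; // assumed "hot page" cutoff

fn should_reconstruct(p: &PageState) -> bool {
    if p.chain_len > HARD_CHAIN_LIMIT {
        // Fetching and applying a very long chain would dominate read latency.
        return true;
    }
    // Skip hot pages: the image would go stale almost immediately. For cold
    // pages, reconstruct past the assumed break-even chain length of 100.
    p.updates_per_minute < HOT_PAGE_UPDATES_PER_MINUTE && p.chain_len > 100
}
```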
A-ha, yes, there is a big difference between pgbench's client-side and server-side generation. With client-side generation, the biggest table (…) tl;dr: client-side generation produces a lot less WAL.
Now I better understand why we get small layers in this case. What does it mean? We should either not perform compaction at all (partial image layers), or treat these small L1 layers as L0 layers and use them for further compaction. @bojanserafimov is trying to implement something like this (but using a layer sparsity criterion).
After running pgbench 10gb I get some 24KB L1 deltas. Here are some of their filenames:
More context:
Originally posted by @bojanserafimov in #3348 (comment)