Epic: Storage perf improvements #3401
Comments
Lemmas:
So what can we do to improve performance and support large databases based on these lemmas? The second important question is whether we need compaction (reshuffling) at all, and at which level. Please notice that with this approach WAL records are stored only in full-key-range L0 delta layers, so the layer map cannot by itself help us locate the WAL records associated with a page. But the assumption is that this is rarely needed, and that the last-written-LSN cache plus a layer map with information about holes can speed up the search among partial image layers. If reshuffling is still considered important, then it should not be performed for only 6 L0 layers, because in that case, after doing a lot of expensive disk IO, we still get only 6 delta layers for the whole key range. For large databases that is definitely not enough and does not differ significantly from having just one (L0) layer for the whole key range. We also faced the problem of a too-small last delta layer produced during reshuffling (#3393). So from my point of view we should either make reshuffling more flexible and adaptive, or eliminate it altogether and generate partial image layers instead.
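To make the proposed read path concrete, here is a minimal sketch of the lookup order described above: consult the last-written-LSN cache first, and fall back to scanning partial image layers whose key range covers the page, skipping known holes. All types and names here are hypothetical illustrations, not the actual pageserver API.

```rust
use std::collections::BTreeMap;
use std::ops::Range;

type Key = u64;
type Lsn = u64;

/// Hypothetical descriptor of a partial image layer: the key range it covers
/// and the holes (sub-ranges that contain no images) inside it.
struct PartialImageLayer {
    key_range: Range<Key>,
    lsn: Lsn,
    holes: Vec<Range<Key>>,
}

impl PartialImageLayer {
    /// True if this layer can actually contain an image for `key`,
    /// i.e. the key is inside the layer's range and not inside a hole.
    fn may_contain(&self, key: Key) -> bool {
        self.key_range.contains(&key) && !self.holes.iter().any(|h| h.contains(&key))
    }
}

/// Hypothetical read path: check the last-written-LSN cache first; the WAL
/// records themselves live in full-range L0 deltas, so we only need to find
/// the newest partial image layer that can serve as a base for the page.
fn find_base_image<'a>(
    key: Key,
    last_written_lsn: &BTreeMap<Key, Lsn>,
    layers: &'a [PartialImageLayer], // assumed sorted by LSN, newest first
) -> Option<&'a PartialImageLayer> {
    if let Some(&lsn) = last_written_lsn.get(&key) {
        // Recently written page: pick the newest image at or below that LSN.
        return layers.iter().find(|l| l.lsn <= lsn && l.may_contain(key));
    }
    layers.iter().find(|l| l.may_contain(key))
}
```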
More thoughts about compaction-reshuffling. Right now it is performed only for 6 L0 layers. The result of compaction also consists of 6 layers, but split vertically (by key range) rather than horizontally (by LSN). So reshuffling a fixed number of L0 layers seems like quite a strange idea (unless we want to optimize our storage for some particular size). What are the alternatives:
Any other ideas?
Compaction only helps with pageserver recovery, tenant migration, and possibly unlimited PITR (PITR from S3). If we don't do any compaction and a pageserver's disk dies, we'll need to download 7 days' worth of data before we can start serving reads (or, depending on how we do partial images, it could be better). But with our current algorithm, we just need 1 image, 3 L1 deltas and 6 L0 deltas to serve the first read, and probably any reads in the same key region. So we can start serving and lazily recover. Obviously compaction is not the only solution, but if we're dropping it we need to consider this problem. But yes, compaction is probably irrelevant for read performance. Without compaction we can just use L0s and partial images. As you noted, in this case we need a more accurate layer map, but that's a solvable problem (more on this later). Back when we were debating LSM vs alternatives, this L0-only strategy was my preferred approach. Heikki called it the "Index+WAL scheme". I didn't have a good solution for recovery, so I accepted the LSM tree approach. Both approaches needed L0 layers at least, so that was a good move.
Yes. Another approach I had in mind is to store inside the delta layer some hints about where to find previous entries for that page. Then we can look up the last-written-LSN cache to find the latest entry, and from there load the hints and look up the historical entries. What are these hints precisely? Maybe a pointer to the previous entry. Maybe also pointers to the 2nd, 4th, 8th, etc. previous entries so that we can binary search in the LSN dimension and answer non-latest queries. Specifics don't matter; we have options.
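For illustration only, the "pointers to the 2nd, 4th, 8th, ... previous entries" idea amounts to an exponential back-pointer list per page, which lets a lookup at an arbitrary LSN skip backwards in roughly logarithmically many hops. The encoding below (indices into a flat entry array) is a made-up sketch, not the delta layer format.

```rust
type Lsn = u64;

/// Hypothetical delta-layer entry for one page. `prev[k]` points to the
/// 2^k-th previous entry for the same page (prev[0] = previous, prev[1] =
/// 2nd previous, prev[2] = 4th previous, ...).
struct DeltaEntry {
    lsn: Lsn,
    prev: Vec<usize>, // indices into `entries`; hypothetical encoding
}

/// Starting from the latest entry for the page (found via the last-written-LSN
/// cache), return the newest entry at or below `target_lsn`. Each step takes
/// the farthest back-pointer that still lands on an entry newer than the
/// target, so skipped entries can never be the answer.
fn find_entry_at(entries: &[DeltaEntry], latest: usize, target_lsn: Lsn) -> Option<usize> {
    let mut cur = latest;
    while entries[cur].lsn > target_lsn {
        let next = entries[cur]
            .prev
            .iter()
            .rev() // largest jump first
            .copied()
            .find(|&p| entries[p].lsn > target_lsn)
            // If even the 1-step jump is at or below the target, it is the answer.
            .or_else(|| entries[cur].prev.first().copied());
        match next {
            Some(p) => cur = p,
            None => return None, // no older entry for this page
        }
    }
    Some(cur)
}
```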
Yes, 6 feels arbitrary. It should be a constant, though. Why is it different than the … Btw, have you tried digging holes in the L0 layers too? And then only doing L1 compaction when we have 6 (or whatever) overlapping L0s? If that has any effect, it's a free win. I suspect these constants (3 and 6) have a lot to do with how we do GC. We create too many images on purpose so that we don't need to complicate the GC code, and we can rely on the fact that eventually things get covered by images. With some more GC nuance we can start relaxing these numbers. For example, when a layer needs GC but it's not covered, we should compare the cost (in money) of keeping it vs covering it, and go with whatever is cheaper.
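A back-of-the-envelope version of that "keep it vs cover it" comparison might look like the sketch below. The function name and every constant (S3 prices, the materialization cost) are assumptions for illustration, not measured values.

```rust
/// Hypothetical monthly cost model: either keep the old delta layers around
/// in S3, or write an image layer that covers them so they can be GC'ed.
/// All constants are made up for illustration.
fn cheaper_to_cover(delta_bytes_retained: u64, image_bytes: u64) -> bool {
    const S3_STORAGE_PER_GIB_MONTH: f64 = 0.023; // $/GiB-month, assumed
    const S3_PUT_COST: f64 = 0.000_005;          // $/request, assumed
    const IMAGE_MATERIALIZATION_COST: f64 = 0.001; // one-time compute cost, assumed

    let gib = |bytes: u64| bytes as f64 / (1024.0 * 1024.0 * 1024.0);

    // Ongoing cost of keeping the uncovered deltas vs. the cost of covering them.
    let keep = gib(delta_bytes_retained) * S3_STORAGE_PER_GIB_MONTH;
    let cover = gib(image_bytes) * S3_STORAGE_PER_GIB_MONTH
        + S3_PUT_COST
        + IMAGE_MATERIALIZATION_COST;
    cover < keep
}
```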
Not a fan of this idea. It means pageserver recovery startup time would depend on database size. Also, the 4th approach is to continue with L2, L3 compaction, etc. Not advocating for it, just including it for completeness. Let's find some small steps we can take. I'm optimistic about the hole-digging approach for two reasons:
Another problem with "no compaction" maximalism is that we have no defense against uniform random writes. The best thing we can do is create extremely scattered partial image layers. But the lack of any locality here seems like a problem. It means that if the pageserver gets a sequential scan during recovery, it won't be able to keep up with it. It will get stuck trying to download 7 days of layers. How real is this problem in practice? Not sure. Maybe the first step here is to test pageserver recovery. It's a good idea to do that anyway.
Hmmm... It seems to me that the main goal of compaction is to reduce the number of layers that need to be accessed for page reconstruction. I do not see much difference with "pageserver recovery, tenant migration, and PITR": in all these cases layers are accessed to perform page reconstruction. Yes, when a layer is not available locally at the pageserver (due to a pageserver crash, local storage corruption, or migration), then we have to download it from S3 and the price of layer access becomes significantly larger.
Yes, it is true. But if we do not perform compaction (reshuffling) and just produce a new image layer after 6 L0 layers, then to serve the first request we need 1 image and 6 L0 deltas: that is even fewer. So it contradicts your statement that compaction is needed for recovery.
And once again, quite the opposite. If we have to read 1 L1 delta layer instead of 6 L0 layers, then we increase page reconstruction speed almost 6 times, and overall performance also increases almost 6 times. So the question is whether it makes sense to produce L1 delta layers (do reshuffling) or to immediately generate image layers instead (dense or sparse).
Maybe it makes sense, maybe not... Instead of trying to maintain a chain of layers, we may just prefer to cut this chain by generating a page image. Ideally we should read no more than one delta layer to reconstruct a page. There are two polar cases: initial table population, when there are a large number of subsequent WAL records belonging to the same page whose total size exceeds the size of a page image. In this case we should try to store a page image for this page ASAP. Most likely these WAL records will not be needed at all, even in case of PITR, because nobody needs partially initialized pages.
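One way to express that first case as an ingest-time heuristic is sketched below: once the accumulated WAL bytes for a page exceed one page image, replaying them can never be cheaper than storing the image, so materialize it eagerly. This is a sketch under assumed names, not the current pageserver logic; the 8 KiB page size is the standard Postgres block size.

```rust
const PAGE_SIZE: usize = 8192; // Postgres block size

/// Hypothetical per-page accumulator used while ingesting WAL during bulk
/// loads such as initial table population.
struct PageDeltaTracker {
    accumulated_wal_bytes: usize,
}

impl PageDeltaTracker {
    /// Record one WAL record for this page; returns true when the caller
    /// should emit a page image instead of keeping more deltas.
    fn record_wal(&mut self, record_len: usize) -> bool {
        self.accumulated_wal_bytes += record_len;
        self.accumulated_wal_bytes >= PAGE_SIZE
    }

    /// Reset after the page image has been materialized.
    fn reset_after_image(&mut self) {
        self.accumulated_wal_bytes = 0;
    }
}
```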
Why should it be a constant? I am not sure, but it seems more natural to make it depend on database size if an L1 layer represents a 1/N-th part of the database keyspace.
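As a rough sketch of that dependency (function names, the 128 MiB target, and the bounds are all assumptions), the threshold could be derived from the keyspace size rather than hard-coded: if each L1 layer should cover 1/N of the keyspace at a fixed target size, then the number of L0 layers to accumulate before reshuffling falls out of the arithmetic.

```rust
/// Hypothetical alternative to the hard-coded "6": accumulate enough L0 bytes
/// that splitting them into N key slices (N = database_size / target_l1_size)
/// yields L1 layers of roughly the target size.
fn l0_compaction_threshold(database_size_bytes: u64, avg_l0_layer_bytes: u64) -> u64 {
    const TARGET_L1_LAYER_BYTES: u64 = 128 * 1024 * 1024; // assumed target

    // Number of keyspace slices an L1 "generation" would be split into.
    let n = (database_size_bytes / TARGET_L1_LAYER_BYTES).max(1);

    // K L0 layers contribute ~K * avg_l0 / N bytes per slice, so
    // K ≈ N * TARGET / avg_l0 to hit the target L1 size.
    ((n * TARGET_L1_LAYER_BYTES) / avg_l0_layer_bytes.max(1)).max(2)
}
```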
As you know yourself, there are widely updated keys at both ends of the key dimension. Yes, maintaining holes allows us to avoid generating excess image layers, but most likely there will be overlaps between L0 layers in any case, so we will always have to perform compaction eventually, unless we have some threshold for the number (or size?) of overlapping regions.
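Such a threshold could look something like the sketch below, which only triggers compaction of a key region once enough L0 layers actually overlap it. The types, the per-layer list of covered ranges, and the reuse of 6 as the limit are purely illustrative assumptions.

```rust
use std::ops::Range;

type Key = u64;

/// Hypothetical trigger: with hole tracking, L0 layers only partially overlap,
/// so instead of compacting after a fixed layer count, compact a key region
/// once more than MAX_OVERLAPPING_L0 layers cover it.
fn needs_compaction(l0_key_ranges: &[Vec<Range<Key>>], region: &Range<Key>) -> bool {
    const MAX_OVERLAPPING_L0: usize = 6; // existing constant, reused for illustration

    let overlapping = l0_key_ranges
        .iter()
        .filter(|ranges| {
            // An L0 layer overlaps the region if any of its covered
            // (non-hole) key ranges intersects it.
            ranges
                .iter()
                .any(|r| r.start < region.end && region.start < r.end)
        })
        .count();
    overlapping > MAX_OVERLAPPING_L0
}
```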
Doubtful. If we do not take image layers into account, then the larger the database size is, the larger the probability that N updates of the same page will be stored in N delta layers and so require loading N layers for page reconstruction. Increasing the compaction threshold may help to group more WAL records belonging to the same page together.
Looks like no reshuffling or page reconstruction policy can be efficient for random updates of a huge database.
Motivation
See https://docs.google.com/document/d/1GcmSW9_DXHou3tuezhjKyL2-ZNuxdRTAaKoi1BC08ns/edit#heading=h.1r2wls36zj0n
DoD
Implementation ideas
Tasks
Other related tasks and Epics