[Draft] A cache sweeper for kubo (go-ipfs) #428
@Stebalien please take a look when you've got some time. :)
We need to start much smaller. The first step will be tracking reference counts on-disk instead of having to walk the entire data set each time. From this proposal, it's unclear how this is all supposed to work in practice, especially how you plan to do this atomically and reliably without losing data. That's the very first part that needs to be fleshed out before we even try to tackle the problem of keeping the most used data.
Well, I've got some more notes on that. I'll try to outline the pipeline here:
This obviously requires some sniffing of pin/MFS operations, so pin/MFS operations will send a list of blocks to the cache cleaner from time to time.
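A minimal Go sketch of how that hand-off could look. All names here (Sweeper, BlockBatch, Notify) are invented for illustration and don't exist in kubo: pin/MFS code paths push batches of touched CIDs over a channel, and the cache cleaner consumes them in its own loop.

```go
// Hypothetical hand-off between pin/MFS operations and the cache cleaner.
package sweeper

import cid "github.com/ipfs/go-cid"

// BlockBatch is one notification sent by a pin or MFS operation.
type BlockBatch struct {
	Pinned bool      // blocks are now referenced by a pin / MFS entry
	Blocks []cid.Cid // blocks touched by the operation
}

// Sweeper consumes batches and keeps its per-block bookkeeping current.
type Sweeper struct {
	notifications chan BlockBatch
}

func NewSweeper() *Sweeper {
	return &Sweeper{notifications: make(chan BlockBatch, 64)}
}

// Notify is called from the pin/MFS code paths from time to time.
func (s *Sweeper) Notify(b BlockBatch) { s.notifications <- b }

// Run is the cache cleaner's main loop.
func (s *Sweeper) Run() {
	for batch := range s.notifications {
		for _, c := range batch.Blocks {
			s.update(c, batch.Pinned) // bump reference / usage counters (omitted)
		}
	}
}

func (s *Sweeper) update(c cid.Cid, pinned bool) { /* ... */ }
```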
This note (if fully implemented) will likely supersede, implement, or fix the following ideas/tickets:
Update

Currently looking at a planned memory footprint of 12 bytes per block in total: no pointer overhead, no hash-table overhead, and no need to store the CID in memory. This way a database of 20M blocks would take just 240 MB of memory; however, there's a 12-byte overhead per hash collision, so maybe 1 GB is a conservative, realistic value here.

The whole database would be written out to disk periodically to recover the cache side of things (age, size, usage counter). However, if the database has no clean-shutdown flag, the whole pin tree would need to be rescanned on startup to rebuild the reference-counter side of things.

@iand you've asked for this feature for your use case (ipfs/kubo#8870), so here's the current status.

Blockers: After completing ipfs/ipfs-docs#1176 and ipfs/go-ipfs-chunker#31 and finding a workaround or a fix for ipfs/kubo#8694 to get pacman.store back up and running, I will work full time on this.

Caveat: No idea on the time to completion yet; I still need to learn a bit of Go on the way.
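For illustration, here is one way the 12 bytes per block could be laid out in Go without pointer or struct-padding overhead. The exact layout isn't decided; the split into an 8-byte CID fingerprint plus the 4-byte packed state word described further down is an assumption.

```go
// Hypothetical in-memory table: two parallel slices avoid per-entry pointers
// and struct padding, so each block really costs 12 bytes.
package sweeper

type blockTable struct {
	fingerprints []uint64 // truncated hash of the CID, used only to detect collisions
	states       []uint32 // packed flags + size class + usage/age counter
}

// 20,000,000 blocks * 12 bytes = 240 MB, matching the estimate above;
// every hash collision costs one extra 12-byte entry (e.g. in an overflow list).
```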
@RubenKelevra thanks, you might want to look at my in-progress implementation as part of ipfs/kubo#8870, although at the moment I have punted on all the harder decisions to focus on getting metrics that can be used to assess viability (phase 1 in the linked issue). See ipfs/kubo@master...iand:block-accounting
Had a look at it and was thinking about it. I think the amount of information you gather is too large. I mean, if you save the fetch time of the last 3 fetches, do you expect to keep that information indefinitely for all discarded blocks?

The idea of saving the fetch time is quite interesting, but I think the metric will be pretty much useless in most cases. Here's why: fetching the first block always takes a lot of time (time to first byte), so this will dominate the metric on the upper end, and you end up keeping just the first blocks of all requests if you weight it too heavily. Also, if you transfer many blocks at the same time, the average receive time will be much higher than the optimal fetch time, because the bandwidth is shared. So I really don't know if that metric will be of much help.

However, I think it's fair to add a flag if a block was hard to fetch, as long as it's saved - for example, if the dial had multiple backoffs, or the connection was unreliable for all sources, etc. What do you think about that? :)
Certainly worth thinking about whether we can record certain difficulties with fetches and factor that in. The end result needs to be a score that we can use to remove low-value blocks. Note that the objective here is to reduce the overall time-to-first-byte metric of the gateway, so recording those first-block fetch times is high signal for us. There are more notes on the actual project here in Notion: prodeng project 1. Did you happen to read the caching paper I linked to in ipfs/kubo#8870?
I think the metric is worthless. If you record the time to fetch the first block and retain that block, fetching the second block will simply become the new bottleneck. As the first block is basically just references to data blocks, the time to first byte will stay the same. If you want to improve the response time, I see a couple of options:
I could not access it. Can you share a CID of it?
This may be relevant: ipfs/boxo#364
Added the CID in the original comment: ipfs/kubo#8870 (comment)
Not sure I'm going to need that for my implementation. The idea was to just count the space requirements of newly added blocks in the storage and do a tail drop at the same rate. So there's no need to explicitly invoke garbage collection, as it is running all the time to keep the storage within the set boundaries.

We will just "flag" blocks which are stale, but not remove them from the storage (yet), and only decide whether they should be removed if there's pressure (stale cache). So between sweeps of the database, 5% of the storage can be rewritten.

Short status update: I'm digging into Go right now to see how to implement this "best", meaning low complexity but also very high performance.
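A rough sketch of that accounting idea (names invented, nothing here exists in kubo yet): count the bytes of newly added blocks and, on each sweep, flag roughly the same number of bytes of the lowest-scored unpinned blocks as stale.

```go
// Illustrative tail-drop accounting: actual deletion only happens later,
// under real cache pressure.
package sweeper

type accountant struct {
	bytesAdded  int64 // bytes of blocks added since the last sweep
	bytesStaled int64 // bytes currently flagged as stale
}

// onAdd is called whenever a new block lands in the blockstore.
func (a *accountant) onAdd(size int64) { a.bytesAdded += size }

// sweep flags about as many bytes as were added since the last sweep,
// so the repo stays within its bounds without a stop-the-world GC run.
// flagColdest marks the lowest-valued unpinned blocks as stale and
// returns how many bytes it actually flagged.
func (a *accountant) sweep(flagColdest func(bytes int64) int64) {
	a.bytesStaled += flagColdest(a.bytesAdded)
	a.bytesAdded = 0
}
```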
New data structure
The described algorithm seems to be tailored specifically to mutable data under a static URI. IPFS has immutable data under a static URI, so u_i and c_i don't make any sense in our context.

I feel like s_i is similarly useless: while it makes sense to take the size of a document into account on a web server, where you may have 1 KB HTML files and multi-gigabyte archives or videos, it doesn't make sense at the block level. Sure, IPFS blocks vary in size (that's why I need to keep some track of the size here as well), but I don't use the size as a metric the way LNC-R-W-U does - for obvious reasons: even a 1 MB block is transferred much more quickly than a multi-GB file, and latency, DHT discovery and the slow start of the TCP/QUIC connection have a much greater impact on the transfer speed than the size.

What fits IPFS block-level storage better is the multi-generational LRU (that's what Google engineers recently built for an upcoming version of the Linux kernel [1][2]), and that's basically what this proposal is doing, just with a very large number of generations, while keeping the space requirements as low as possible.

[1] https://lwn.net/Articles/856931/
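To make the analogy concrete, here is a tiny hypothetical sketch: the high bits of the per-block counter can be read as a generation number, giving many generations essentially for free. The 25-bit counter width is from this draft; the shift width is an arbitrary pick for illustration.

```go
// Reading the usage/age counter as a generation index, multi-generational
// LRU style: blocks in the lowest generations are flagged/evicted first.
package sweeper

const counterBits = 25 // counter width proposed in this draft

// generation maps a 25-bit counter to one of 32 coarse generations
// (2^20 counter ticks per generation).
func generation(counter uint32) uint32 {
	return counter >> 20
}
```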
Status
Blockers (to do first):
Todo list
Update draft to include
Introduction
Hey guys,
I'm pretty new to this project, but I see an opportunity to contribute a little and learn some Go along the way. I learned C a long time ago, and I'm currently only really fluent in Python and shell scripting - so make sure to check the contribution thoroughly.
Automatic cache cleanup
Current approach
There's a garbage collector that is designed to sweep the whole block storage and remove everything which is not part of the MFS and not part of the pins. The garbage collector can be run manually offline, manually online, or enabled to run automatically when the repo reaches a certain percentage of its configured maximum size.
There's no tracking of how often a block is requested, its age, or how often a block is referenced. A garbage collector run is pretty slow, since it has to traverse the whole local database of pins and their recursive references, as well as the MFS, to make sure only unreferenced objects are dropped.
New approach
A background task will be added which will use a database to keep track of references, the number of requests, age, and size of each block.
On first initialization (or with a flag) it will traverse the whole pin/MFS database to create its database (or make sure it's consistent).
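A minimal sketch of that first-run traversal, assuming a made-up DAGWalker interface (in kubo this would be backed by the node's DAG service): it walks every pin root plus the MFS root and records each reachable block once, so the sweeper's database starts out consistent with the pinset.

```go
// First-run pass: seed the reference side of the sweeper's database.
package sweeper

import (
	"context"

	cid "github.com/ipfs/go-cid"
)

// DAGWalker abstracts "give me the children of this block".
type DAGWalker interface {
	Links(ctx context.Context, c cid.Cid) ([]cid.Cid, error)
}

// seed walks all roots (pin roots plus the MFS root) and records a reference
// for every reachable block, visiting each block only once.
func seed(ctx context.Context, w DAGWalker, roots []cid.Cid, addRef func(cid.Cid)) error {
	seen := cid.NewSet()
	var walk func(cid.Cid) error
	walk = func(c cid.Cid) error {
		if !seen.Visit(c) { // already recorded on this pass
			return nil
		}
		addRef(c)
		children, err := w.Links(ctx, c)
		if err != nil {
			return err
		}
		for _, child := range children {
			if err := walk(child); err != nil {
				return err
			}
		}
		return nil
	}
	for _, r := range roots {
		if err := walk(r); err != nil {
			return err
		}
	}
	return nil
}
```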
Goals
(in priority order)
Limitations
(in no particular order)
2^25 / (10 * (0.1 + 0.01) / 2) / 60 / 60 / 24
≈ 706 days - if zero hits occur, and given that the storage is large enough to hold the block for this timeframe. Objects which exceed this value will be placed in the stale section of the cache, which is by default capped at 10% relative size. Blocks in the stale section will be moved back to the main section when a request hits them - or dropped at any time under cache pressure. (The time limit can be raised by switching to a 27-bit counter or by changing the clock interval - default is 10 seconds.)
Proposed Features
(in no particular order)
Potential configuration knobs
(all sizes are estimates)
(relative: excludes pinned/new blocks)
Potential data structure
32-bit integer value:
= 0: New block (no reference information yet)

The dirty flag marks incomplete writes, so they can be recovered from without analyzing the whole data store.
The seldom flag marks whether a cache entry is part of the seldom cache. If the reference bit is set, this bit has no function.
The precaching flag marks whether a cache entry is part of the precaching cache. If the reference bit is set, this bit has no function.
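Since the exact bit allocation isn't spelled out above, here is one packing that fits: 4 flag bits, a 3-bit size class, and the 25-bit counter used in the lifetime estimate add up to exactly 32 bits. Treat the concrete bit positions as an assumption, not part of the proposal.

```go
// One possible packing of the 32-bit per-block state word.
package sweeper

const (
	flagDirty    uint32 = 1 << 31 // incomplete write, needs recovery
	flagRef      uint32 = 1 << 30 // block is referenced by a pin / MFS
	flagSeldom   uint32 = 1 << 29 // entry lives in the seldom (stale) section
	flagPrecache uint32 = 1 << 28 // entry belongs to the precaching cache

	sizeShift        = 25
	sizeMask  uint32 = 0b111 << sizeShift // 3-bit size-class index

	counterMask uint32 = 1<<25 - 1 // 25-bit usage/age counter
)

// A state word of 0 is a new block with no reference information yet,
// matching the table above.
func counter(state uint32) uint32  { return state & counterMask }
func sizeClass(state uint32) uint8 { return uint8((state & sizeMask) >> sizeShift) }
```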
Size estimate values
The size bits store the block size rounded to the nearest value in this list:
Caching algorithm
A newly added block will be added to the new-block list and get a timestamp. If the timestamp is older than the latest completed pin/write operation and the block wasn't referenced, it will be moved to the cache with the default value; the timestamp is then discarded.
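A sketch of that promotion step under the stated rule. The map keys, the defaultCounter value, and the callback are invented for illustration; the draft doesn't fix any of them.

```go
// Promotion of new blocks into the cache once they can no longer be
// referenced by an in-flight pin/MFS write.
package sweeper

import "time"

type newBlock struct {
	referenced bool      // set if a pin/MFS write referenced the block
	added      time.Time // when the block first appeared
}

// defaultCounter is the assumed starting usage/age value; not specified in the draft.
const defaultCounter uint32 = 1 << 24

// promote moves unreferenced new blocks into the cache. lastCompletedOp is
// the start time of the most recent pin/MFS write that has fully completed.
func promote(pending map[string]*newBlock, lastCompletedOp time.Time, toCache func(key string, counter uint32)) {
	for key, nb := range pending {
		switch {
		case nb.referenced:
			// became pinned/MFS data, no cache entry needed
			delete(pending, key)
		case nb.added.Before(lastCompletedOp):
			// every write that could still reference it has finished
			toCache(key, defaultCounter) // the timestamp is discarded here
			delete(pending, key)
		}
	}
}
```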
Probability curve
- 0.1
- 0.01 at 0
- 0.2 at 2^25 / 2 + 2^25 / 4 (25,165,824)

Exact curve TBD.

Left side: