-
-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tracking issue for UnixFS automatic unsharding #8148
Comments
Given the sharding implementation based on size already landed (though not bubbled to Size estimation. Here and in the code size is a loose reference to the estimation the byte size the directory has when serialized. The current sharding implementation does this by aggregation of link (directory entry) size (more precisely, its name and CID byte lengths). The exact formula of the estimation is not relevant for the decision needed here, we just note that we estimate size based on links. This means the size changes only when we add or remove an entry in the directory (both entry points are already identified in the current implementation).
The broad ends of the spectrum to do the size computation can be described as:
My recommendation without much idea of the underlying use case is to keep the
|
|
I'm fine going this way but just to be totally sure we have a shared understanding let me ask a few basic questions. First some clarifications I think were missing from my previous writeup:
By this we mean enumerating all links each time an entry is removed, right? The
Could you expand on this trade-off please? |
On motivations: I'd like to suggest a much simpler and stronger lens, and a piece of vocab, for framing and understanding the value of this: Does the data structure have Hysteresis or not? In other words, does the structure vary based on its history, or is it determined purely by its content at the current point in time? This is relevant because one of the key selling points of HAMTs is that they do not have hysteresis. Therefore it's extremely odd if we were to make any choice that's directly adjacent to how the HAMT is used which throws away one of its primary virtues. (If we throw away hysteresis, then there's a lot of other algorithms that could potentially outperform a HAMT anyway. We've been betting on HAMTs for a long time largely because of their non-hysteresis.) Working through the user stories and aiming to minimize the number of surprises in end user experiences when re-importing of data does also produce pretty similar justifications. I find this "hysteresis" keyword also helps a lot in clarity coming from the engineering side up: it's a property, and it's one that we can lose. Compositions of algorithms which do not lose this property are valueable because they remain more composable and because they minimize user surprise. |
Yes, we've been using the 'Hysteresis' term informally but feel free to push any doc in the code to reflect that (as it seems the UnixFS/MFS API implicitly promises this). Because of my personal background I'm very in favor of including this term as mainstream. Note that hysteresis needs to be properly qualified though. We do not care so much about the determinacy through time of the data structure implementing the |
Done. |
Tracking implementation of #7022
Continuation of #8106
Per the linked issue(s) in order to enable sharded directories by default we want to shard the directories when required. However, we don't just want to automatically shard directories that would be too big, but also unshard directories that would be too small.
The reasoning here is that it is more intuitive to users if the CID of a directory is the same irrespective of how elements were added/removed from it. While it is almost always a bad idea to try and recreate a CID of a DAG from the data that made it up (as I mentioned in #7022 (comment)), there are some use cases where people may want to do something like https://github.com/ipfs-shipyard/ipfs-pack and therefore want a way to specify exactly how a DAG was created from some other format (e.g. files + directories). If we didn't support automatic unsharding at the same cutoff as the automatic sharding then the MFS (
ipfs files
) commands would be unusable for those users.A possible implementation strategy is suggested by @Stebalien is in #8106 (comment).
cc @schomatis
The text was updated successfully, but these errors were encountered: