-
Notifications
You must be signed in to change notification settings - Fork 34
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat!: auto-shard based on node size #171
Conversation
js counterpart to ipfs/kubo#8114 Changes the `shardSplitThreshold` parameter to mean the size of the final DAGNode (including link names, sizes, etc) instead of the number of entries in the directory. Fixes: #149 BREAKING CHANGE: `shardSplitThreshold` now refers to node size, not number of entries
Consider changing the config key option as the meaning of the key is changing, to make it easier to throw a sensible error message to users who miss the memo |
e.g use |
We won't ship until the go counterpart (ipfs/kubo#8114) ships. |
Per 2022-01-21: this is no longer blocked on go-ipfs (just needs maintainer to be able to finish it off). |
Hey, so I was just fixing the impl for this in go-unixfsnode to match what go-unixfs is doing for go-ipfs (it's already being used in the wild, such as in go-car, and currently doesn't do the right thing with sharding); I could tweak that test case to find the exact edge where it switches from plain to sharded and generate two test cases for the switch. Then we'd have root CIDs that we could duplicate across implementations in our tests to ensure we're doing the exact same thing and generating the exact same data. Worthwhile? |
I went ahead and did it @ ipfs/go-unixfsnode#21 The tests pack up a directory called
|
2022-03-18 conversation: thanks for finishing @rvagg . We'll get to reviewing after the js-libp2p typescript conversion. |
2022-05-27 conversation: need to tidyup now that go-ipfs has shipped the feature. This will involve a change in the interop test suite. |
@rvagg can you remember what the shard split threshold used by go-unixfsnode in these tests is? With 1343 entries I can generate a tree with the expected CID of With 1344 entries and the shard limit at 256KiB I get the expected CID Both tests use raw leaves and CIDv1. |
256KiB, like the kubo link but more explicit is where it's checked; which looks a lot like what you have here |
* @returns {number} | ||
*/ | ||
calculateNodeSize () { | ||
return 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this be doing something more than 0
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the abstract superclass of dir-flat
and dir-sharded
- this returns 0 to appease TypeScript, the implementation in the concrete subclasses actually do the work.
This will be refactored in a future PR to be an interface so it'll look less weird.
Ah, right the estimateDirSize function in go-unixfsnode only counts DAGLink Tsize and Name length. The version here calculates the actual block size by serializing the DAGNode. So consequently the block size go-unixfsnode produces for the version with 1343 links is 276895 which is over the sharding threshold (256KiB = 262144b). Do we want that or should it be accurate? |
It's somewhat arbitrary, the actual block size will be bigger but doesn't have a way of busting block size limits. I figure that we just should roll with historical calculations to make implementations match, it's good enough and having consistency is worth more than achieving a perfect block size (documentation should set the expectation right for the user though). We could go with accuracy, but it'd have to be a coordinated effort and probably be a breaking change for Kubo since people will find their hashes changing for the same data. |
Ok, I'll make it consistent. What happens when people set the chunk size bigger than the sharding threshold? Seems a bit weird to have a maybe-256KiB limit on the root block but then files with 512KB leaves, for example. |
My guess is that whoever made this decision in the first place thought it'd be the most efficient because it has to be calculated so frequently and serialising is expensive. But there are cheap ways of calculating pb encoded size, particularly with these blocks, so it wouldn't be too expensive (our codecs do it internally as they write so we could even expose that). |
Well that's just it, it's really cheap to calculate the protobuf overhead on each field, plus things like CIDs are fixed length so you don't need to hash anything so you could make it fast to get an accurate size calculation 🤷 |
All PR comments have been addressed, I'm going to merge this to unblock ipfs/helia-unixfs#1 |
## [ipfs-unixfs-v10.0.0](ipfs-unixfs-v9.0.1...ipfs-unixfs-v10.0.0) (2023-02-09) ### ⚠ BREAKING CHANGES * the `shardSplitThreshold` option has changed to `shardSplitThresholdBytes` and reflects a DAGNode size where sharding might kick in ### Features * auto-shard based on node size ([#171](#171)) ([6ef187f](6ef187f)), closes [#149](#149)
🎉 This PR is included in version ipfs-unixfs-v10.0.0 🎉 The release is available on: Your semantic-release bot 📦🚀 |
## [ipfs-unixfs-importer-v13.0.0](ipfs-unixfs-importer-v12.0.1...ipfs-unixfs-importer-v13.0.0) (2023-02-09) ### ⚠ BREAKING CHANGES * the `shardSplitThreshold` option has changed to `shardSplitThresholdBytes` and reflects a DAGNode size where sharding might kick in ### Features * auto-shard based on node size ([#171](#171)) ([6ef187f](6ef187f)), closes [#149](#149) ### Dependencies * update sibling dependencies ([1a37705](1a37705))
🎉 This PR is included in version ipfs-unixfs-importer-v13.0.0 🎉 The release is available on: Your semantic-release bot 📦🚀 |
## [ipfs-unixfs-exporter-v11.0.0](ipfs-unixfs-exporter-v10.0.1...ipfs-unixfs-exporter-v11.0.0) (2023-02-09) ### ⚠ BREAKING CHANGES * the `shardSplitThreshold` option has changed to `shardSplitThresholdBytes` and reflects a DAGNode size where sharding might kick in ### Features * auto-shard based on node size ([#171](#171)) ([6ef187f](6ef187f)), closes [#149](#149) ### Dependencies * update sibling dependencies ([54018a1](54018a1)) * update sibling dependencies ([1a37705](1a37705))
🎉 This PR is included in version ipfs-unixfs-exporter-v11.0.0 🎉 The release is available on: Your semantic-release bot 📦🚀 |
this is very 🥳 🎆 having the 3 implementations agree on this, have the same test fixtures, and get the same CIDs for a complex operation, is awesome! |
js counterpart to ipfs/kubo#8114
Changes the
shardSplitThreshold
parameter to mean the size of the final DAGNode (including link names, sizes, etc) instead of the number of entries in the directory.Fixes: #149
BREAKING CHANGE: the
shardSplitThreshold
option has changed toshardSplitThresholdBytes
and reflects a DAGNode size where sharding might kick in