Fix `Tsize` values, fix HAMT and make recursive builder output match go-unixfs #21

rvagg · 2022-01-21T06:46:06Z

OK, this started as something entirely different, I'll get to that in another PR later. But while using this I noticed that Tsize isn't getting set properly on directories. Then while fixing that, I noticed that the HAMT doesn't quite work properly, so I fixed that.

There's two main tests in here, one for a smaller directory structure and one for a very large one that breaks out the sharding. I've generated the CIDs from go-ipfs by ipfs adding them. (and final Tsize values calculated by looking at link sizes in root blocks and adding the root block size to the total: ipfs dag get bafybei... | jq '[.Links[].Tsize | tonumber] | add' & ipfs block get bafybei... | wc -c).

It's a breaking change because all of the builder BuildX() functions return datamodel.Link, unit64, error now so the size is included (mainly because it needs to be for recursive internal collection, but the reported size is an accurate representation of the structure's weight so should be useful in some situations too).

While in here, I replaced the block size calculation hack (load saved block as raw and use byte length) with a new, even more exotic hack, to count the bytes as they are pushed out of the encoder by wrapping the LinkSystem, EncoderChooser and Encoder at each save point. Not trivial, but saves the round-trip out of the block store. @warpfork @mvdan I think the need for block size at the Store() call for well formed dag-pb graphs makes a pretty strong case that this should be easier to retrieve out of the API—perhaps just Store() should return 3 values?

willscott · 2022-01-21T10:55:37Z

I did not find good documentation the last time i went through this on what the size of directories should be set as. Is that something we can have a spec document on and point to in comments?

ribasushi · 2022-01-21T11:09:22Z

I did not find good documentation the last time i went through this on what the size of directories should be set as. Is that something we can have a spec document on

Note that there is no agreement on what Tsize ought to represent, or whether it is required in the first place. A retroactive spec would be... difficult.

cc @achingbrain

rvagg · 2022-01-21T11:09:49Z

Kind of lacking on the spec front - there's this vague area above the encoding format layer (specs I was responsible for) https://ipld.io/specs/codecs/dag-pb/spec/ and the unixfs layer https://github.com/ipfs/specs/blob/master/UNIXFS.md. These changes get in the middle there. When writing the dag-pb spec, I didn't consider it necessary to go into detail about the proper contents of the various fields, I probably thought that was an IPFS spec concern. But it looks like IPFS speccing was higher level still. So our specs for this are in the form of code for now.

At one stage there was talk of deprecating Tsize since it really only can be advisory, but it's become an important tool in the meantime (e.g.) and having consistency across our libraries means we get CID convergence. I can imagine complaints if/when this library replaces go-unixfs in go-ipfs and starts giving different CIDs for the same data just because we were relaxed about Tsize.

rvagg · 2022-01-24T03:38:31Z

I've changed the test for the sharded directory and added another one such that we now test the boundary condition where it flips from unsharded to sharded by adding an extra file. Again I constructed the data using go-ipfs and found the boundary condition with it. The CIDs and final reported size are calculated from go-ipfs outputs too. No extra fixes required, this lib is getting the boundary right.

willscott

good to know - I certainly saw maintaining absolute byte compatibility with things like switchover point as a somewhat lower priority for new implementations. As long as either style can decode properly, I'm less worried about them generating and switching over at the same point.

I know there are some edge cases that would like it, but it's equally true that in setting that expectation of 'these are the only bytes that should be generated for this input' we make it harder to evolve the format and move to updated versions of unixfs.

rvagg added 2 commits January 21, 2022 15:39

fix: encode Tsize correctly everywhere (using wraped LinkSystem)

1f99e18

fix: more Tsize fixes, fix HAMT and make it match go-unixfs output

fcafb64

rvagg requested review from willscott, warpfork and mvdan January 21, 2022 06:46

rvagg changed the title ~~Fix Tsize values, fix HAMT and match recursive builder output match go-unixfs~~ Fix Tsize values, fix HAMT and make recursive builder output match go-unixfs Jan 21, 2022

willscott approved these changes Jan 21, 2022

View reviewed changes

rvagg mentioned this pull request Jan 24, 2022

feat!: auto-shard based on node size ipfs/js-ipfs-unixfs#171

Merged

fix: add extra test to span the shard/no-shard boundary

ab6e4a3

willscott approved these changes Jan 24, 2022

View reviewed changes

rvagg merged commit d3fccee into main Jan 25, 2022

rvagg deleted the rvagg/tsize-all-the-things branch January 25, 2022 04:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix `Tsize` values, fix HAMT and make recursive builder output match go-unixfs #21

Fix `Tsize` values, fix HAMT and make recursive builder output match go-unixfs #21

rvagg commented Jan 21, 2022

willscott commented Jan 21, 2022

ribasushi commented Jan 21, 2022

rvagg commented Jan 21, 2022

rvagg commented Jan 24, 2022

willscott left a comment

Fix Tsize values, fix HAMT and make recursive builder output match go-unixfs #21

Fix Tsize values, fix HAMT and make recursive builder output match go-unixfs #21

Conversation

rvagg commented Jan 21, 2022

willscott commented Jan 21, 2022

ribasushi commented Jan 21, 2022

rvagg commented Jan 21, 2022

rvagg commented Jan 24, 2022

willscott left a comment

Choose a reason for hiding this comment

Fix `Tsize` values, fix HAMT and make recursive builder output match go-unixfs #21

Fix `Tsize` values, fix HAMT and make recursive builder output match go-unixfs #21