What does it mean to be a directory in the UnixFS layer? #5157

schomatis · 2018-06-26T20:59:26Z

The Directory type of a UnixFS object (contrary to its name) doesn't actually indicate a directory, or at least it's not the only possible type of a directory, the HAMTShard type also indicates a directory, leading to confusing comparisons like:

https://github.com/ipfs/go-ipfs/blob/ecf7d157a6d5e525b122079367f4b6c2ba25e951/mfs/dir.go#L159-L161

So Directory is only the "plain implementation" type of a directory, it's not the directory type (a very similar confusion arises with the File and Raw types). What are the requisites of a UnixFS directory? It seems that there is no interface that would define it, at this point the closest that can be find is the Directory from the unixfs.io package,

https://github.com/ipfs/go-ipfs/blob/ecf7d157a6d5e525b122079367f4b6c2ba25e951/unixfs/io/dirbuilder.go#L28-L36

but this structure emerged just to accommodate the HAMT implementation (#3042, it evolved from a previous directoryBuilder structure), not as a clear and documented definition of a directory.

The biggest consequence of all this (IMO) is that a new reader needs to go through the *hamt.Shard structure pointer to understand what's happening when content is added to a directory (with the major danger of diving in the really complex code of the hamt package).

So the main objective is to provide the user a clear explanation of what it can expect from a directory, possibly presenting a documented interface, while also using it to hide as much as possible the HAMT directory variant, to avoid confusing code lines when the user is following the execution path of how are files added to a directory in the MFS hierarchy,

https://github.com/ipfs/go-ipfs/blob/ecf7d157a6d5e525b122079367f4b6c2ba25e951/unixfs/io/dirbuilder.go#L101-L107

For example, looking for common characteristics of the plain and the HAMT (which I mostly do not understand) implementations I'm observing that its children are still referenced as DAG links (HAMT contains a map for a quick access to them but the links remain at the DAG layer); on the contrary, whereas the plain implementation does not hold information in the Data field the HAMT stores the bit field there.

The text was updated successfully, but these errors were encountered:

Stebalien · 2018-06-26T22:54:26Z

So Directory is only the "plain implementation" type of a directory, it's not the directory type (a very similar confusion arises with the File and Raw types)

Yes. Sharded directories came later so they got a different type.

For example, looking for common characteristics of the plain and the HAMT (which I mostly do not understand) implementations I'm observing that its children are still referenced as DAG links (HAMT contains a map for a quick access to them but the links remain at the DAG layer).

So, all links must go in the Links section of the PBNode. This is just a restriction of the PBNode IPLD format (and the primary motivation to move away from it).

However, sharded directories span multiple nodes. Therefore, for better performance, we cache these shards when we load them. That's why we have the children map.

schomatis · 2018-06-27T11:21:09Z

However, sharded directories span multiple nodes.

Oh, could you point me to that part of the code? I thought a shard was encoded in a single Data of a DAG node,

https://github.com/ipfs/go-ipfs/blob/ecf7d157a6d5e525b122079367f4b6c2ba25e951/unixfs/hamt/hamt.go#L142-L143

schomatis · 2018-06-27T13:40:57Z

So, while working on #5160 I realized that MFS is the single consumer of the UnixFS directory abstraction, the definition of the interface is basically what the MFS layer needs from UnixFS.

schomatis · 2018-06-27T15:06:44Z

@Stebalien

So, all links must go in the Links section of the PBNode. This is just a restriction of the PBNode IPLD format

I'm not sure what kind of restriction is that. I could just save the links in the Data field, as long as the UnixFS methods called by MFS know how to operate on it I don't need to use the links of the DAG layer, I should, of course, but as the API is just that set of functions defined in the dirbuilder.go file, I don't see any requirement (and there is also no spec) that actually indicates that. That's what I'm mean with the question of this issue, it seems to me that a UnixFS directory is whatever UnixFS wants it to be as long as it delivers to MFS, and that mental model is hard to transmit to a new user, that's why I want to have a concrete list of "things a UnixFS directory should comply with" (besides a set of methods), to be able to provide more information to the new user.

Stebalien · 2018-06-27T17:49:24Z

Oh, could you point me to that part of the code? I thought a shard was encoded in a single Data of a DAG node,

Take a look at getChild and loadChild (on Shard).

I'm not sure what kind of restriction is that. I could just save the links in the Data field, as long as the UnixFS methods called by MFS know how to operate on it I don't need to use the links of the DAG layer, I should, of course, but as the API is just that set of functions defined in the dirbuilder.go file, I don't see any requirement (and there is also no spec) that actually indicates that.

You could. However, tools like pin, refs, etc. wouldn't work properly.

That's what I'm mean with the question of this issue, it seems to me that a UnixFS directory is whatever UnixFS wants it to be as long as it delivers to MFS, and that mental model is hard to transmit to a new user, that's why I want to have a concrete list of "things a UnixFS directory should comply with" (besides a set of methods), to be able to provide more information to the new user.

So, there are multiple types of unixfs directories. At the end of the day, they should all expose a standard interface (really, iterate and lookup). However, we're always going to have multiple very different representations. Currently, we have:

The Directory type for small directories.
The HAMTShard type for shards of larger directories.

We actually have the same split in files. We have files that are raw bytes and sharded files.

So, for context, HAMT stands for hash array mapped trie. To insert a key/value (name/file) into a HAMT, you first hash the key. Then, roughly,

Pop off a prefix from the hash of the key (we take 1 byte giving us 256 entries per shard) and use this prefix as an index into the children array of the current node.
Lookup the child node at that index.
If it's a value:
1. If the keys are equal, replace it, return.
2. Otherwise, replace it with a new shard and continue to step 4 (inserting both the displaced value and the new value).
If it's a shard, recurse back to step 1 (popping a new byte off the hash of the key).
If there's nothing at that index, insert the value.

However, instead of storing sparsely populated arrays as suggested by this algorithm, we use a bitfield. To lookup a value in the array, you:

Lookup the corresponding bit in the bitfield. If it's 1, there's a value at this position.
Count the number of 1 bits before that bit. This count is the index in the Links array.

schomatis · 2018-06-28T02:19:58Z

Extra 👍 for the detailed HAMT explanation.

schomatis · 2018-09-24T12:46:40Z

This issue has been addressed in a couple of PRs now (the last one being ipfs/go-unixfs#16) we can close it.

Frando · 2022-08-15T13:08:29Z

It would be great to add this information on how the UnixFS HAMT works to the UnixFS spec which currently doesn't have any info on how the HAMTShards are actually encoded and supposed to be used. This is the best explanation here that I found when googling for a spec of the UnixFS HAMT.

schomatis added topic/docs-ipfs Topic docs-ipfs topic/files Topic files topic/UnixFS Topic UnixFS labels Jun 26, 2018

schomatis added this to the Files API Documentation milestone Jun 26, 2018

schomatis self-assigned this Jun 26, 2018

schomatis mentioned this issue Jun 26, 2018

[WIP] unixfs: document protocol buffer definition #5153

Closed

schomatis changed the title ~~What means to be a directory in the UnixFS layer?~~ What does it mean to be a directory in the UnixFS layer? Jun 27, 2018

schomatis mentioned this issue Jun 27, 2018

unixfs: add a directory interface #5160

Merged

schomatis mentioned this issue Jul 17, 2018

Fetch Hamt Children Nodes in Parallel #4979

Closed

schomatis closed this as completed Sep 24, 2018

ribasushi mentioned this issue Mar 20, 2020

Object/put truncates objects without error #7018

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

What does it mean to be a directory in the UnixFS layer? #5157

What does it mean to be a directory in the UnixFS layer? #5157

schomatis commented Jun 26, 2018

Stebalien commented Jun 26, 2018

schomatis commented Jun 27, 2018

schomatis commented Jun 27, 2018

schomatis commented Jun 27, 2018

Stebalien commented Jun 27, 2018

schomatis commented Jun 28, 2018

schomatis commented Sep 24, 2018

Frando commented Aug 15, 2022 •

edited

Loading

What does it mean to be a directory in the UnixFS layer? #5157

What does it mean to be a directory in the UnixFS layer? #5157

Comments

schomatis commented Jun 26, 2018

Stebalien commented Jun 26, 2018

schomatis commented Jun 27, 2018

schomatis commented Jun 27, 2018

schomatis commented Jun 27, 2018

Stebalien commented Jun 27, 2018

schomatis commented Jun 28, 2018

schomatis commented Sep 24, 2018

Frando commented Aug 15, 2022 • edited Loading

Frando commented Aug 15, 2022 •

edited

Loading