Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

What does it mean to be a directory in the UnixFS layer? #5157

Closed
schomatis opened this issue Jun 26, 2018 · 8 comments
Closed

What does it mean to be a directory in the UnixFS layer? #5157

schomatis opened this issue Jun 26, 2018 · 8 comments
Assignees
Labels
topic/docs-ipfs Topic docs-ipfs topic/files Topic files topic/UnixFS Topic UnixFS

Comments

@schomatis
Copy link
Contributor

The Directory type of a UnixFS object (contrary to its name) doesn't actually indicate a directory, or at least it's not the only possible type of a directory, the HAMTShard type also indicates a directory, leading to confusing comparisons like:

https://github.com/ipfs/go-ipfs/blob/ecf7d157a6d5e525b122079367f4b6c2ba25e951/mfs/dir.go#L159-L161

So Directory is only the "plain implementation" type of a directory, it's not the directory type (a very similar confusion arises with the File and Raw types). What are the requisites of a UnixFS directory? It seems that there is no interface that would define it, at this point the closest that can be find is the Directory from the unixfs.io package,

https://github.com/ipfs/go-ipfs/blob/ecf7d157a6d5e525b122079367f4b6c2ba25e951/unixfs/io/dirbuilder.go#L28-L36

but this structure emerged just to accommodate the HAMT implementation (#3042, it evolved from a previous directoryBuilder structure), not as a clear and documented definition of a directory.

The biggest consequence of all this (IMO) is that a new reader needs to go through the *hamt.Shard structure pointer to understand what's happening when content is added to a directory (with the major danger of diving in the really complex code of the hamt package).

So the main objective is to provide the user a clear explanation of what it can expect from a directory, possibly presenting a documented interface, while also using it to hide as much as possible the HAMT directory variant, to avoid confusing code lines when the user is following the execution path of how are files added to a directory in the MFS hierarchy,

https://github.com/ipfs/go-ipfs/blob/ecf7d157a6d5e525b122079367f4b6c2ba25e951/unixfs/io/dirbuilder.go#L101-L107

For example, looking for common characteristics of the plain and the HAMT (which I mostly do not understand) implementations I'm observing that its children are still referenced as DAG links (HAMT contains a map for a quick access to them but the links remain at the DAG layer); on the contrary, whereas the plain implementation does not hold information in the Data field the HAMT stores the bit field there.

@schomatis schomatis added topic/docs-ipfs Topic docs-ipfs topic/files Topic files topic/UnixFS Topic UnixFS labels Jun 26, 2018
@schomatis schomatis added this to the Files API Documentation milestone Jun 26, 2018
@schomatis schomatis self-assigned this Jun 26, 2018
@Stebalien
Copy link
Member

So Directory is only the "plain implementation" type of a directory, it's not the directory type (a very similar confusion arises with the File and Raw types)

Yes. Sharded directories came later so they got a different type.

For example, looking for common characteristics of the plain and the HAMT (which I mostly do not understand) implementations I'm observing that its children are still referenced as DAG links (HAMT contains a map for a quick access to them but the links remain at the DAG layer).

So, all links must go in the Links section of the PBNode. This is just a restriction of the PBNode IPLD format (and the primary motivation to move away from it).

However, sharded directories span multiple nodes. Therefore, for better performance, we cache these shards when we load them. That's why we have the children map.

@schomatis
Copy link
Contributor Author

However, sharded directories span multiple nodes.

Oh, could you point me to that part of the code? I thought a shard was encoded in a single Data of a DAG node,

https://github.com/ipfs/go-ipfs/blob/ecf7d157a6d5e525b122079367f4b6c2ba25e951/unixfs/hamt/hamt.go#L142-L143

@schomatis schomatis changed the title What means to be a directory in the UnixFS layer? What does it mean to be a directory in the UnixFS layer? Jun 27, 2018
@schomatis
Copy link
Contributor Author

So, while working on #5160 I realized that MFS is the single consumer of the UnixFS directory abstraction, the definition of the interface is basically what the MFS layer needs from UnixFS.

@schomatis
Copy link
Contributor Author

@Stebalien

So, all links must go in the Links section of the PBNode. This is just a restriction of the PBNode IPLD format

I'm not sure what kind of restriction is that. I could just save the links in the Data field, as long as the UnixFS methods called by MFS know how to operate on it I don't need to use the links of the DAG layer, I should, of course, but as the API is just that set of functions defined in the dirbuilder.go file, I don't see any requirement (and there is also no spec) that actually indicates that. That's what I'm mean with the question of this issue, it seems to me that a UnixFS directory is whatever UnixFS wants it to be as long as it delivers to MFS, and that mental model is hard to transmit to a new user, that's why I want to have a concrete list of "things a UnixFS directory should comply with" (besides a set of methods), to be able to provide more information to the new user.

@Stebalien
Copy link
Member

Oh, could you point me to that part of the code? I thought a shard was encoded in a single Data of a DAG node,

Take a look at getChild and loadChild (on Shard).

I'm not sure what kind of restriction is that. I could just save the links in the Data field, as long as the UnixFS methods called by MFS know how to operate on it I don't need to use the links of the DAG layer, I should, of course, but as the API is just that set of functions defined in the dirbuilder.go file, I don't see any requirement (and there is also no spec) that actually indicates that.

You could. However, tools like pin, refs, etc. wouldn't work properly.

That's what I'm mean with the question of this issue, it seems to me that a UnixFS directory is whatever UnixFS wants it to be as long as it delivers to MFS, and that mental model is hard to transmit to a new user, that's why I want to have a concrete list of "things a UnixFS directory should comply with" (besides a set of methods), to be able to provide more information to the new user.

So, there are multiple types of unixfs directories. At the end of the day, they should all expose a standard interface (really, iterate and lookup). However, we're always going to have multiple very different representations. Currently, we have:

  1. The Directory type for small directories.
  2. The HAMTShard type for shards of larger directories.

We actually have the same split in files. We have files that are raw bytes and sharded files.


So, for context, HAMT stands for hash array mapped trie. To insert a key/value (name/file) into a HAMT, you first hash the key. Then, roughly,

  1. Pop off a prefix from the hash of the key (we take 1 byte giving us 256 entries per shard) and use this prefix as an index into the children array of the current node.
  2. Lookup the child node at that index.
  3. If it's a value:
    1. If the keys are equal, replace it, return.
    2. Otherwise, replace it with a new shard and continue to step 4 (inserting both the displaced value and the new value).
  4. If it's a shard, recurse back to step 1 (popping a new byte off the hash of the key).
  5. If there's nothing at that index, insert the value.

However, instead of storing sparsely populated arrays as suggested by this algorithm, we use a bitfield. To lookup a value in the array, you:

  1. Lookup the corresponding bit in the bitfield. If it's 1, there's a value at this position.
  2. Count the number of 1 bits before that bit. This count is the index in the Links array.

@schomatis
Copy link
Contributor Author

Extra 👍 for the detailed HAMT explanation.

@schomatis
Copy link
Contributor Author

This issue has been addressed in a couple of PRs now (the last one being ipfs/go-unixfs#16) we can close it.

@Frando
Copy link

Frando commented Aug 15, 2022

It would be great to add this information on how the UnixFS HAMT works to the UnixFS spec which currently doesn't have any info on how the HAMTShards are actually encoded and supposed to be used. This is the best explanation here that I found when googling for a spec of the UnixFS HAMT.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic/docs-ipfs Topic docs-ipfs topic/files Topic files topic/UnixFS Topic UnixFS
Projects
None yet
Development

No branches or pull requests

3 participants