This repository has been archived by the owner on Aug 9, 2018. It is now read-only.

Implement missing features to include IPLD in IPFS #14

Open
mildred opened this issue Oct 26, 2015 · 14 comments

Comments

@mildred
Contributor

mildred commented Oct 26, 2015

Perhaps we don't need to do anything, but I think we may be missing a few things to integrate IPLD in IPFS.

If I'm not mistaken, the go-ipld package should replace the github.com/ipfs/go-ipfs/merkledag package. Is that right?

If so, I'd think we should modify the old merkledag package to look a bit more like go-ipld (remove external access to struct members, rename a few things to match go-ipld function names if needed). That will also require modifying the core of ipfs.

Then, we must implement any missing functions in go-ipld, and make the switch.

Does that sound like a good idea? @jbenet, anyone?

@mildred
Contributor Author

mildred commented Oct 26, 2015

The protobuf links all have a field called Size, which represents the size of the linked data.

  • Do we want to make it mandatory for every link in IPLD? In which case we must update IPLD to check this information when checking for links.
  • Do we want to have this information as an optional field?

In IPFS, the Size field is currently used in a few places:

  • ls command: to display the item size
  • object command, when creating an object
  • unixfs ls command: to display the size
  • HTTP gateway, when listing a directory
  • in the merkledag implementation itself

If the size is optional, we will have to modify IPFS to handle the case where it is absent. And we should remind ourselves that even when it is present or mandatory, it may not be correct, so we should only rely on the advertised size as a hint. That's why I would prefer the size information to be optional.
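
For illustration, an optional size could look something like this on the Go side (the Link type and its method here are hypothetical, not the current merkledag API):

    // Hypothetical link type: Size is a pointer so "absent" is distinct from zero.
    type Link struct {
        Name string
        Hash []byte
        Size *uint64 // nil when the object carries no size hint
    }

    // AdvertisedSize returns the size hint if present. Even when present it may be
    // wrong, so callers that need an exact number must fetch the object itself.
    func (l *Link) AdvertisedSize() (uint64, bool) {
        if l.Size == nil {
            return 0, false
        }
        return *l.Size, true
    }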

@mildred
Contributor Author

mildred commented Oct 26, 2015

There is another issue with the Node field in ipfs/go-ipfs:merkledag/node.go. We can't store it in our IPLD Node object because it is not serializable. It is a pointer to the node object pointed to by the merkle link hash (which can be nil if the object hasn't been fetched).

It is basically used in the functions AddRecursive(), Remove() and Link.GetNode() of the merkledag package. Uses in other packages don't seem problematic to me (most of them could be replaced by a variable on the stack).

This Node pointer makes it possible to have a merkle graph in memory, using direct pointers to link merkle objects to one another. This will never be possible with the Node object as it currently is in IPLD, without cheating.

  • Should we try to remove this Node pointer in the ipfs code base? This will make it impossible to express a merkle tree in memory as a tree. It seems risky to me and not desirable.

  • We should perhaps update the Node type in IPLD to be something like:

    type Node struct {
        Data map[string]interface{} // the decoded document, as go-ipld stores today
        Node *Node                  // Linked node if this is a link, or nil
    }
    

    We would need to teach the encoders not to encode the Node pointer.

  • Perhaps we might want to add this pointer to the map in some way, but I'm strongly against this. We shouldn't mess up the user data just for our own algorithms; in particular, we mustn't add another directive. But we could perhaps replace the existing "mlink" keys, updating the value from a plain string to a struct { hash string, node *Node }. We will of course need to teach the encoders not to use the node pointer, or to sanitize the node objects before encoding (see the sketch below).
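
For illustration, such a link value could look roughly like this (names are hypothetical; the struct tag shown only covers JSON, and the CBOR/protobuf encoders would need the same treatment):

    // go-ipld's document type today, roughly.
    type Node map[string]interface{}

    // Hypothetical replacement for the plain "mlink" string value.
    type Link struct {
        Hash string `json:"mlink"` // the hash stays on the wire, as today
        Node *Node  `json:"-"`     // in-memory cache only; every encoder must skip it
    }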

@mildred
Contributor Author

mildred commented Oct 26, 2015

Well, actually the AddRecursive() function doesn't seem to be used. I think we could do away with that Node pointer. That would probably be more elegant.

edit: or perhaps not. I was using golang tooling that hid a few call sites.

edit again: but instead of constructing a merkle tree in memory, we should perhaps imagine a streaming interface to allow adding huge trees that cannot fit in memory.

@jbenet
Contributor

jbenet commented Oct 28, 2015

(first responses, more coming)

the go-ipld package should replace the github.com/ipfs/go-ipfs/merkledag package. Is that right?

that's right.

If so, I'd think we should modify the old merkledag package to look a bit more like go-ipld (remove external access to struct members, rename a few things to match go-ipld function names if needed). That will also require modifying the core of ipfs.

Then, we must implement any missing functions in go-ipld, and make the switch.

Does that sound like a good idea? @jbenet, anyone?

this sounds good. all of this should happen over the dev0.4.0 branch of go-ipfs.

so the steps should be:

  • rip out merkledag pkg into own repo. (could even be housed in this repo as a subdir)
  • first implement any changes needed to merkledag as you suggest
  • then implement any changes needed to ipld
  • swap to ipld.

the ripping out to own repo may make things easier or harder, not sure. whatever is seen as easiest.

@jbenet
Contributor

jbenet commented Oct 28, 2015

@mildred i think your analysis re the Node pointer is correct. and agree with your thoughts.

the pointers are used all over importing, i believe, to make mutations of whole graphs easier. and we will want to make this a possibility anyway to enable users to write algorithms running over the graph natively.

in this, i'd like to borrow from how implementations of ORMs and graph OMs do it. in those, the link object includes a pointer which is never serialized, and which may or may not be nil. thus, if it's there, we use the node; if it's not, we must fetch it, etc.

this may also be doable in the current go-ipld package by making the Link object hold a pointer. this fits into how people will have to use the Links function anyway. though, i am not super happy about that in the first place, as it doesn't make for a nice, single-Node-type traversable graph. one option may be that the Node object should actually hang on to the Link objects too... another is that we try to make this work with a single type again (i tried for a while but had no good ideas).
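
for illustration, roughly the shape i mean (names invented; the Getter is just whatever resolves a hash for us):

    import "context"

    type Node map[string]interface{}

    // Getter is whatever can resolve a hash to a node (a dag service, a datastore, ...).
    type Getter interface {
        Get(ctx context.Context, hash string) (Node, error)
    }

    type Link struct {
        Hash string
        node Node // never serialized; nil until resolved
    }

    // GetNode uses the cached node if present, otherwise fetches and remembers it.
    func (l *Link) GetNode(ctx context.Context, g Getter) (Node, error) {
        if l.node != nil {
            return l.node, nil
        }
        n, err := g.Get(ctx, l.Hash)
        if err != nil {
            return nil, err
        }
        l.node = n
        return n, nil
    }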

but, note that there is not an easy way around all of this in go that preserves the ease of serialization the type Node map[string]interface{} provides. the current type makes it trivial to go from the wire to the data structure without any further copy operations. this allows us to use native parsers (e.g. cbor), instead of having to parse it once into go, then parse it again into our Node types.

we could give up on that for the sake of nice programmatic use. not sure. it may be that it is ok to have two packages, or promote experimentation defining them for a while, until we find a great one.


by the way, it's good to bikeshed all our concerns here, as "very nice programmatic interfaces" are notoriously hard to get right, and can enable tons of very nice use cases and client programs. the Go team routinely points out they had long discussions in defining the core types of the stdlib.


Updated, so please re-read if reading from email

@mildred
Contributor Author

mildred commented Oct 29, 2015

I was also wondering whether we have to deserialize everything from JSON/CBOR/ProtoBuf, or if we could deserialize just what we are interested in. In that case, we must keep the original bytes around in case we need to do something more with the object.

What I am saying here is that IPFS is only interested in the links and the data section of an object. Other keys might be interesting, but only to some parts of the system: the HTTP gateway might want to read content type information, and IPNS might want to read a few more things.

This way, instead of having a very generic (and inefficient) data structure (a map of maps of maps), we could have more efficient data structures constructed only when necessary, and always by deserializing the original bytestream. These data structures can easily hold anything of interest (like the Node pointer).

We would of course have to make sure that we don't serialize them back to an object directly, as that would miss important data that wasn't deserialized. But because objects are immutable, I don't think we will run into this issue much. The only use case where we serialize objects is when we construct them anew.
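
Just to make the idea concrete, a possible shape (all names made up):

    // Hypothetical partially-decoded object. The original bytes stay authoritative;
    // the decoded fields are only views extracted from them.
    type LazyNode struct {
        Raw   []byte            // original serialized block (CBOR, protobuf, ...)
        Links map[string]string // link name -> hash, decoded because everyone needs it
        Data  []byte            // data section, also decoded eagerly
    }

    // Anything else (content type for the gateway, IPNS details, ...) is extracted
    // from Raw by the subsystem that cares. We never re-serialize a LazyNode:
    // new objects are built and encoded from scratch, so nothing is lost.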

@jbenet
Contributor

jbenet commented Oct 29, 2015

i do like the idea of keeping the thing serialized and only pulling out what you need. this in general makes programs faster, because sometimes we can avoid deserializing, or can even just update the relevant pieces. ((btw, we can make capnp-like support for go-ipld (or really just cbor docs) which only pulls out the data when you need it. so R/W is a bit slower, but by not reading everything we win.))

anyway, what you describe might make for a nice implementation of go-ipld. I'm not sure though -- there is a lot of benefit that comes from supporting maps natively and giving users that type of easy access (json-like). I think it would be nice to be able to have both. the tricky part comes when users use it.

i wonder then if maybe we should be implementing this as a type Node interface{} and allowing multiple implementations.
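
something like this, just to show the shape (the method set is entirely invented):

    // hypothetical interface; both the map-backed node and a lazily-decoded
    // one could implement it.
    type Node interface {
        // Links returns the named merkle-links of this node.
        Links() map[string]string
        // Get looks up a key in the document.
        Get(key string) (interface{}, bool)
        // RawData returns the serialized form of the node.
        RawData() []byte
    }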

btw:

IPFS is only interested in links and the data section of an object.

well, not anymore in that the data section is now "everything else except the links".

@jbenet
Contributor

jbenet commented Oct 29, 2015

cc @whyrusleeping to take a look and provide feedback, as he manipulates nodes a lot.

@whyrusleeping -- do you prefer the "convert to map" approach, or the "keep it all serialized and only read out (duplicated, or on every access) what you need"

@whyrusleeping
Contributor

I'm not super sure what's happening here, but some feedback i can provide:

The 'Node' field in the link objects is very useful, but isn't something we want serialized; it's just a cache to make working on the trees easier. Removing it would be fine, but might make certain operations on dags a little slower.

In regards to moving merkledag out into its own package, gx is getting a lot nicer, i can probably try it out for extracting that package soon

@mildred
Contributor Author

mildred commented Oct 29, 2015

What is preventing us from accessing the Node through a hash map that associates a link hash with the Node pointer? The performance penalty wouldn't be much worse than that of the ipld package compared to merkledag.

Because, as the ipld package stands now, the Node object is just a hash map that points to other hash maps. This will degrade performance compared to the merkledag Node object.

But as we discussed just before, we can keep the encoded data in its encoded form and decode just what we need for efficient reading and traversal operations. We keep the data structures simple (structs with pointers instead of hash maps). Node modification would be a little more difficult, as we mustn't lose the data that we didn't bother to decode.
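
Concretely, the side table could be as simple as this (hypothetical sketch, not an existing type):

    type Node map[string]interface{}

    // NodeCache maps a link hash to an already-fetched node, replacing the
    // in-struct Node pointer from merkledag.
    type NodeCache map[string]Node

    func (c NodeCache) Resolve(hash string) (Node, bool) {
        n, ok := c[hash]
        return n, ok
    }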

@jbenet
Contributor

jbenet commented Nov 1, 2015

i think having the pointers on the object is very useful for writing algorithms that traverse the graphs. it makes thinking about them much easier to be able to do curr.GetLink(nextName).GetNode(). i would probably end up writing an abstraction that does that on top of the IPLD package then -- not sure what that would look like yet.

also, we could write a DagStore (new name for the DagService thing) that has an internal cache and a pointer to another DagStore. we could construct one for the run of each algo and cache the necessary nodes then (like, say, on add, with a dageditor). though a complication is that mutating the nodes now introduces a big cache invalidation pain, where before we only had to "traverse in to change, and recompute hashes on the way out".
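
roughly this kind of layering (DagStore and cachingStore here are invented, just a sketch):

    import "context"

    type Node map[string]interface{}

    type DagStore interface {
        Get(ctx context.Context, hash string) (Node, error)
    }

    // one of these per algorithm run, wrapping the real store.
    type cachingStore struct {
        cache map[string]Node
        next  DagStore
    }

    func (s *cachingStore) Get(ctx context.Context, hash string) (Node, error) {
        if n, ok := s.cache[hash]; ok {
            return n, nil
        }
        n, err := s.next.Get(ctx, hash)
        if err != nil {
            return nil, err
        }
        s.cache[hash] = n
        return n, nil
    }

    // note: mutating nodes means invalidating cache entries, which is the
    // cache-invalidation pain mentioned above.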

@hackergrrl

Late commenter here!

I'm really interested in @mildred's comment about size. Without it, we lose out on a big UX feature: progress bars. Without these hints we can't tell the user how their downloads (for example) are going.

Some examples in the spec include a size property as a sibling of @link properties, but it's unclear whether this is canonical.

@davidar

davidar commented Feb 20, 2016

@noffle I think creating ipfs/unixfs objects should remain the same (adding size information whenever it makes sense to), but when you're reading a general IPLD object, you shouldn't assume there is a size attached (and even if there is, it's not necessarily accurate).

@hackergrrl

@davidar: I think that makes sense. Baking size into the format seems like an overly stringent requirement. Not needing to know size is a nice property: it means you can finally build objects without needing to actually look up the things you link to -- multihash is sufficient.

The best we can do, then, is to ensure that our tooling tries to include sizes whenever it's reasonable to do so (much more important for binary blobs than for metadata, which is likely small).
