
Major API changes #50

Closed
wants to merge 8 commits into from

Conversation

vmx
Member

@vmx vmx commented Nov 22, 2018

The whole IPLD APIs get a review to make them more consistent and
easier to use.

The biggest change is that there's no `resolve` API anymore. From
now on you access the properties of the JavaScript objects directly.

Issues that were taken into account:

 - #31
   - [x] Remove `isLink()` method from formats: `isLink()` is no
         longer needed as all links will be CID objects you can easily
         check for
 - #44
   - [x] Proposal: Move resolver to use CID instances for links: Not
         applicable anymore as `resolve.resolve()` is removed
 - #46
   - [x] properties: align spec with implementation: Covered by spec
 - #49
   - [x] Define how `toJSON()` should look like: Binary and CIDs are
         defined with example
 - #34
   - [x] Implementation of nested objects: Won’t fix as `tree()` is
         not part of the API anymore

Closes #48.
@vmx vmx requested review from mikeal and daviddias November 22, 2018 16:59
@ghost ghost assigned vmx Nov 22, 2018
@ghost ghost added the in progress label Nov 22, 2018
`toJSON()` is out of scope for this spec and can be implemented
outside of it.
Renaming properties back to their original names.

defaultHashCode -> defaultHashAlg
codec -> format

The intention was to make clear that the type is a multicodec. But
this shouldn't be done by the name of the property, but by a type
description.

Generally we should always use the codes and not the string identifiers
of the multicodecs.
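
For example (illustrative only; 0x71 is the dag-cbor entry in the multicodec table):

```js
// Refer to a format by its multicodec code rather than its string identifier
const format = 0x71          // proposed: the code from the multicodec table
// const format = 'dag-cbor' // instead of the string identifier
```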
@vmx
Member Author

vmx commented Nov 28, 2018

@mikeal I've removed `toJSON()`. I'd like your feedback especially on using the multicodec byte values (the codes) and not their string identifiers. I think we should do that in all of our APIs. What do you think?

@mikeal
Contributor

mikeal commented Nov 28, 2018

@vmx I like that change, I think it has some other implications I've noted inline.

Instead of requiring them to be in the table, make it a recommendation.
So it's still possible to have your own custom ones.
@vmx
Member Author

vmx commented Nov 28, 2018

@mikeal I also changed the hash codec requirement to a "should".

Once this change is approved, I'd like to merge it myself as I'd like to get a single clean commit and I need to edit the commit message a bit :)

@mikeal
Contributor

mikeal commented Nov 29, 2018

@vmx approved :)

@vmx
Member Author

vmx commented Nov 29, 2018

@daviddias As this is a major change I'd like to get your approval before I merge it too.

- [`util.cid(binaryBlob[, options], callback)`](#utilcidbinaryblob-options-callback)
- [Local resolver methods](#local-resolver-methods)
- [`resolver.resolve(binaryBlob, path, callback)`](#resolverresolvebinaryblob-path-callback)
- [`resolver.tree(binaryBlob, callback)`](#resolvertreebinaryblob-callback)
Member

Where did these methods go?

Member

I would like to see the new js-ipld with the formats updated and resolving through them with all of these updates. It seems that we are preemptively removing things, but I might be missing something.

Member Author

From the commit message:

The biggest change is that there's no `resolve` API anymore. From
now on you access the properties of the JavaScript objects directly.

So instead of `resolve.resolve(binaryBlob, '/deeply/nested/path')` you deserialize the node first (which `resolve` also needs to do) and then, on the resolved node, let's call it `ipldNode`, you do `ipldNode.deeply.nested.path`.

For `tree()`-like functionality (if needed) you do what dag-json is doing.
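
As a rough sketch of the difference (the exact signatures here are assumed for illustration, not taken from this PR):

```js
// before: resolve an IPLD Path through the format's resolver
resolver.resolve(binaryBlob, '/deeply/nested/path', (err, result) => {
  if (err) throw err
  console.log(result.value)
})

// after: deserialize once, then use plain JavaScript property access
// (assumes an async context and a promise-returning deserialize)
const ipldNode = await format.deserialize(binaryBlob)
console.log(ipldNode.deeply.nested.path)
```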

Member

So now it is the user that needs to convert an IPLD Path into JS notation?

Member
@daviddias daviddias Nov 29, 2018

Note that dag-pb supported resolving paths like `cid/name-of-the-link-to-the-other-node` as well as `cid/links/0/hash/`. It was basically packing a transformation inside the format. It will be an extra breaking change.

Member

That's the same behaviour as the current implementation

Hmm, could we make it better?

You could implement the same semantics with JavaScript Proxies if needed.

What's the browser support for that like? I found babel-plugin-proxy but the README says it's got really bad performance and they don't recommend using it in production environments.
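
A minimal sketch of the Proxy idea mentioned above, assuming a synchronous `decode` helper (hypothetical) that fully decodes the blob on first access:

```js
// The blob is only decoded the first time any property is read,
// giving per-node laziness without changing the property-access API.
function lazyNode (binaryBlob, decode) {
  let cache = null
  const materialize = () => {
    if (cache === null) cache = decode(binaryBlob)
    return cache
  }
  return new Proxy({}, {
    get (_target, prop) {
      return materialize()[prop]
    },
    has (_target, prop) {
      return prop in materialize()
    },
    ownKeys () {
      return Reflect.ownKeys(materialize())
    },
    getOwnPropertyDescriptor (_target, prop) {
      return Object.getOwnPropertyDescriptor(materialize(), prop)
    }
  })
}
```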

Member Author

Hmm, could we make it better?

I would get rid of the named links traversal (I call it "IPFS traversal") or put it under a field (e.g. ipfs) so it won't collide. Though I think being just as broken as we are currently is good enough for now, especially since this is again dag-pb specific and shouldn't be a problem with other formats.

What's the browser support for that like?

Pretty good: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Proxy#Browser_compatibility

Member
@achingbrain achingbrain Dec 11, 2018

I would get rid of the named links traversal

This is not a crazy idea. It's only of limited use for IPFS because there isn't a one-to-one mapping between IPFS paths and IPLD paths for HAMT shards.

E.g.

resolve.resolve(binaryBlob, '/deeply/nested/path.txt')

Might end up needing to be

resolve.resolve(binaryBlob, '/deeply/A0/01/FC-nested/path.txt')

...and you don't know what the prefix for each segment is going to be until you descend into that level of the trie, as it's dependent on the siblings.

Member Author

I mostly kept the named links to show that such things are still possible.

@Gozala Gozala Mar 18, 2019

I personally do think that the move to JS-based traversal has the following disadvantages:

  1. As @achingbrain pointed out, it implies that the whole node is deserialized in one step. Sure, JS Proxies could be utilized, but they are very slow and really difficult to get right, as in ensuring you don't end up creating memory leaks (we used them a lot for cross-compartment boundaries in Gecko & in my opinion, for non-trivial cases, e.g. process.env, they're not worth it).

    I have been wanting to implement an IPLD resolver for FlatBuffers, whose primary advantage over protobufs is laziness - only the relevant fields are decoded, without instantiating objects for intermediaries. It seems like with this move, taking advantage of that would be impossible.

    Proxies only work if you can decode everything synchronously; if that is not the case then you wind up with a bunch of promises you have to await to get to the nested data, which IMO is less convenient than providing a path.

@vmx
Member Author

vmx commented Dec 6, 2018

After implementing this spec for dag-pb I learned about enumerable properties. Hence I'd like to add this paragraph to the `deserialize()` section:

All enumerable properties (the ones that are returned by an `Object.keys()` call) of the deserialized object are considered for resolving IPLD Paths. They must only return values whose type is one of the IPLD Data Model types.

What do others think about this? /cc @mikeal @daviddias
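
A sketch of what that rule would mean for path resolution (the helper below is hypothetical and not part of the proposed spec text):

```js
// Only enumerable own properties (what Object.keys() returns) take part
// in resolving an IPLD Path against a deserialized node.
function resolvePath (ipldNode, path) {
  return path.split('/').filter(Boolean).reduce((node, segment) => {
    if (!Object.keys(node).includes(segment)) {
      throw new Error(`Path segment not found: ${segment}`)
    }
    return node[segment]
  }, ipldNode)
}

// resolvePath(ipldNode, '/deeply/nested/path')
```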

@achingbrain
Member

Another thing to consider - the performance analysis done by @mcollina and the NearForm team has revealed we spend a lot of time waiting for artificially induced asynchronous behaviour to resolve (basically setImmediate and friends).

If the serialization/deserialization can be done synchronously we should consider making these synchronous methods instead of taking callbacks or returning promises.

The alternative is to invoke callbacks synchronously or use process.nextTick but we run the risk of releasing Zalgo - better to just examine whether async is necessary at all in this case.
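
A sketch of the contrast being suggested, assuming a synchronous CBOR encoder (the `cbor` package's `encode` is used purely for illustration):

```js
const cbor = require('cbor')

// Callback-style API that defers work which is actually synchronous
function serialize (node, callback) {
  setImmediate(() => callback(null, cbor.encode(node)))
}

// Plain synchronous API: no event-loop round trip, no Zalgo risk,
// and callers that need async can still wrap it themselves
function serializeSync (node) {
  return cbor.encode(node)
}
```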

@mikeal
Contributor

mikeal commented Dec 10, 2018

FYI, I wrote a synchronous version of dag-cbor in JS. All of our async code in that module is self-induced :) https://github.com/mikeal/dag-cbor-sync

@vmx
Member Author

vmx commented Dec 11, 2018

If the serialization/deserialization can be done synchronously we should consider making these synchronous methods instead of taking callbacks or returning promises.

Are already resolved promises a big bottleneck? Promises seem like a good trade-off to make async stuff possible.

@mcollina

Are already resolved promises a big bottleneck? Promises seem like a good trade-off to make async stuff possible.

That will cause the value to be deferred in the microtask queue. In practice, the impact will be similar to process.nextTick. It is better than a full setImmediate as that is triggered after a full run of the event loop.

Note that await Promise.resolve(42) causes multiple tasks (4 in Node v10) to be scheduled in the microtask queue. See https://www.youtube.com/watch?v=DFP5DKDQfOc for more details.
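
For illustration, the deferral being described (already-resolved promises still run their continuations on the microtask queue):

```js
console.log('A')
Promise.resolve(42).then(value => console.log('C', value))
console.log('B')
// Logs: A, B, C 42. The .then callback runs only after the current
// synchronous work finishes, much like process.nextTick.
```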

@daviddias
Member

@vmx following your request for the review of this proposal.

This is such a massive change that I don't believe it is wise to merge this one without having sibling PRs to all of its dependents in IPFS land. Can you take ownership, as a builder of IPLD, of testing the assumptions made by this PR by creating PRs to all the IPFS modules that will require changes? This will give you a complete understanding of how this will impact usage and also get us ready to merge everything in one sweep.

With these PRs, the Lead Maintainers of each module will also be able to observe how this affects the rest of the code and give direct feedback accordingly.

Create an awesome-endeavour to track this work.

@vmx
Member Author

vmx commented Dec 12, 2018

@daviddias Of course there will still be changes once things get implemented. For me, approval means "looks good to me, I don't see any major problems", and I think that's what I got from you now.

@daviddias
Member

@vmx you have my 👍 of "LGTM, but do try it first with all the other repos before merging and creating a massive breaking change" :)

@vmx
Member Author

vmx commented Mar 18, 2019

@Gozala I'd like your feedback on this one :)

- [`util.cid(binaryBlob[, options], callback)`](#utilcidbinaryblob-options-callback)
- [Local resolver methods](#local-resolver-methods)
- [`resolver.resolve(binaryBlob, path, callback)`](#resolverresolvebinaryblob-path-callback)
- [`resolver.tree(binaryBlob, callback)`](#resolvertreebinaryblob-callback)

Large naming changes might be out of scope here, but if they aren't I would favor the following terminology:

- [Local resolver methods](#local-resolver-methods)
- [`resolver.resolve(binaryBlob, path, callback)`](#resolverresolvebinaryblob-path-callback)
- [`resolver.tree(binaryBlob, callback)`](#resolvertreebinaryblob-callback)
- [`serialize(IpldNode)`](#serializeipldnode)

encode(node)

- [`resolver.resolve(binaryBlob, path, callback)`](#resolverresolvebinaryblob-path-callback)
- [`resolver.tree(binaryBlob, callback)`](#resolvertreebinaryblob-callback)
- [`serialize(IpldNode)`](#serializeipldnode)
- [`deserialize(binaryBlob)`](#deserializebinaryblob)

decode(node)

- [Properties](#properties)
- [`defaultHashAlg`](#defaulthashalg)
- [`multicodec`](#multicodec)
- [`format`](#format)

codec


> serializes a dagNode of an IPLD format into its binary format
### `serialize(IpldNode)`

Maybe `serialize(IpldNode): Uint8Array | Promise<Uint8Array>`?


> deserializes a binary blob into the instance
Returns a Promise containing a `Buffer` with the serialized version of the given IPLD Node.

I would really encourage removing the requirement to return a `Buffer` in favor of `Uint8Array`. This requirement makes the browser use case unnecessarily complicated, and in practice isn't really necessary. Other IPFS libraries can do the casting where necessary.

P.S. A Node `Buffer` is also a valid `Uint8Array`, so the Node case will require no change.
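
To illustrate why the Node case needs no change (and how a consumer could still get a `Buffer` without copying):

```js
// A Node Buffer is a Uint8Array subclass, so it already satisfies the contract
const buf = Buffer.from('hello')
console.log(buf instanceof Uint8Array) // true

// Wrapping a Uint8Array as a Buffer when one is really needed, without a copy
const bytes = new Uint8Array([1, 2, 3])
const asBuffer = Buffer.from(bytes.buffer, bytes.byteOffset, bytes.byteLength)
```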


I would love it if there were some words about errors as well - specifically how a decoder can return an error and how that error propagates.

Member Author

Moving from Node.js primitives to browser-based primitives is certainly on my todo list, but it's out of scope for these API changes. I don't want to do everything at once.


> resolves a path in block, returns the value and or a link and the partial missing path. This way the IPLD Resolver can fetch the link and continue to resolve.
> Return the CID of the binary blob.

This is more of a question.

From what I can tell, the implementation of `cid` is pretty much the same across the board. Is there a reason to require an implementation of `cid` in the first place? Could this be made optional, so only implementations that need custom behavior provide it?

Member Author

An IPLD Format has all the information it needs to calculate the CID: the multicodec code as well as the hash algorithm. So you can call `cid()` from layers that build on top of IPLD Formats without knowing those details.
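
A minimal sketch of how that could look, assuming the `cids` and promise-based `multihashing-async` packages (the codec name and default hash below are placeholders for the format's own values):

```js
const CID = require('cids')
const multihashing = require('multihashing-async')

async function cid (binaryBlob, options = {}) {
  const version = options.version !== undefined ? options.version : 1
  const hashAlg = options.hashAlg || 'sha2-256'   // the format's default hash
  const hash = await multihashing(binaryBlob, hashAlg)
  return new CID(version, 'dag-cbor', hash)       // 'dag-cbor' stands in for the format's codec
}
```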

Options include:
- version - the CID version to be used (defaults to 1)
- hashAlg - the hash algorithm to be used (defaults to the one set by the format)
The result is a JavaScript object, which may also be a Proxy object in case the data shouldn’t be deserialized as a whole. Its fields are the public API that can be resolved through. It’s up to the format to add convenient methods for manipulating the data.

I did write more about this above, but I want to add one more note here - requiring an implementation to provide a Proxy just to avoid full decoding sets a high bar; in fact I would argue that a custom DataView-style class with getters is probably a far better alternative. However, that still implies 5 allocations when accessing something like `/deeply/A0/01/FC-nested/path.txt`, which may seem insignificant but in some cases (e.g. representing Virtual DOM changelists in the dominion library) makes the difference between fast and unusable performance.

Member

It's not a requirement, "may also be a Proxy object", but I'm very negative on Proxies regardless. Aside from performance concerns, I think we've learned from many previous attempts in other languages that things like Proxies should be reserved for metaprogramming and other reflection use-cases rather than general magical data structures. Pitfalls abound.
Perhaps the way forward is to remove the reference to proxies completely and leave it up to each format and allow experimentation.

Alternatively, go the container route, which is much nicer--I'm not sure DataView is the right container, maybe a bit too low-level; something that feels more like a Map might be more accessible. In fact, a whole custom Map-like that you could `.toObject()` to fully instantiate as a non-lazy object representation might provide the best of both worlds?

```js
let data = await format.deserialize(blob)
data.get('foo') // could be lazy
data.has('/deeply/A0/01/FC-nested/path.txt') // could be lazy
data.toObject() // skip laziness and recursively instantiate
// { foo: 'bar', deeply: { A0: { '01': { 'FC-nested': { 'path.txt': Uint8Array<00, 00, 00, 00> } } } } }
```

It would require specification, and even provision of a custom class, maybe just a simple extension of Map.

For deserialization where you know the result is cheap and you want it all you could just instantiate it in the same call:

```js
let copy = Object.assign((await format.deserialize(blob)).toObject(), { lastUpdate: new Date() })
```

```
Map {
  get (key/path)
  has (key/path)
  toObject()
  put() // throws? is mutability helpful here? maybe until it starts to hit awkward edges
}
```

Are there possible use-cases where getters might need to be async or is it safe to frontload that in decode()?


I think you may have misunderstood what I meant by DataView - just a container that wraps the underlying buffer and provides getters to access/decode fields on demand (`get key() { … }`). So pretty much what you described as Map, except with named getters on the prototype instead of `get(key)`.
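
A small sketch of that shape, with hypothetical `decodeLinks`/`decodeData` helpers standing in for format-specific partial decoding:

```js
// A plain class wrapping the raw bytes; each named getter decodes
// only the field it needs, on demand, with no Proxy involved.
class NodeView {
  constructor (binaryBlob) {
    this._blob = binaryBlob
  }

  get links () {
    return decodeLinks(this._blob)   // reads only the links section
  }

  get data () {
    return decodeData(this._blob)    // reads only the data section
  }
}
```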


Perhaps the way forward is to remove the reference to proxies completely and leave it up to each format and allow experimentation.

The whole point of IPLD is to provide a common interface for traversing linked data, no? If each implementation is free to design its own, the common interface is gone & I think that’s a step in the wrong direction - I'd much rather have a common, imperfect interface that can be used across arbitrary formats than have to learn each one.

Member Author

just a container that wraps the underlying buffer and provides getters to access/decode fields on demand

But this then won't be lazy for nested objects, will it?


(2) usability for implementer (lower complexity)

For implementing the eager case this will make no difference. For the lazy case, avoiding proxies and lazy getters would be significantly simpler than needing to have them.

Member

```js
let a = await deserialize(blob).nested
let b = await a.b
let c = await b.c
```

When I wrote my initial response in this thread I was trying to imagine a use-case where the properties would need to be asynchronously accessed, and I couldn't see an obvious one except for very large blobs that need to be stored on disk--but those wouldn't be passed in to a deserialize(blob) call in the first place. Maybe there's a case if there was some really heavy crypto applied to sections of a blob, but even then it's questionable. Async computational tasks are usually made pointless by the boundary-crossing problems, which is why async methods in Node's 'crypto' module have mostly been resisted; computationally intensive thread offloading can't be shown to pay off except for very heavy work. The browser is a slightly different case because you're not maximising core use as with a server and responsiveness is important, maybe that's a worthy argument? I'm doubtful.

So again, it looks like premature optimisation in search of a use-case.

(2) usability for implementer (lower complexity)

For implementing the eager case this will make no difference. For the lazy case, avoiding proxies and lazy getters would be significantly simpler than needing to have them.

Agreed, and it'll help make quicker progress on this PR!


I care about the JS implementation in a browser context (on the server Go is likely a better choice anyway). In the browser, crypto APIs are async (for good reasons); ignoring that seems strange to me.

And honestly it’s not about optimizations, just about not making currently well-supported use cases second class.

P.S. I am working on a crypto-heavy use case for browsers right now and this API change will require me to switch to a hand-rolled solution.

Member

@Gozala ok this is good context, excuse my ignorance. In your use case, is it necessary that the crypto be handled at the format layer or is it reasonable to push it up to ipld itself? I was imagining ipld/ipld#64 being something in ipld and the blocks just providing the encrypted binary data. Perhaps an encrypted block coming from a format gets transposed into an unencrypted ipld block where the key is provided but this happens at the ipld layer, not in the format. Or are you seeing a different mechanism? Again, I'm fairly new to thinking about this stuff so additional insight is appreciated!

Member Author

I've created #51 which hopefully accommodates all the points brought up here. It's a smaller change to the current API. Let's have a discussion over there.

@Gozala Gozala mentioned this pull request Mar 21, 2019
@vmx
Member Author

vmx commented Mar 25, 2019

I'm closing this one to prevent confusion. The current discussion on an improved API is at #51.
