IPIP-49: CIDv2 “fat pointers” #49

@mikeal mikeal commented Jul 25, 2022

After numerous discussions at IPFS Thing I decided it’s time to pull the trigger on CIDv2.

This isn’t the only PR we’ll need to do, but this should serve as a way to resolve any objections or concerns.

We (DAG House) have a pressing need for these in the short term and will be implementing them rather quickly.

We’ve floated a lot of different solutions to this problem and the one that everyone seems to disagree the least on is the simplest, which is what I’ve proposed: two cids.

The first is the data pointer, the second is the context. If you want inline context, use identity multihash.

This also makes CIDv2 a valid CIDv1 codec, which can be used for backward compatibility when necessary (although we should do the work of supporting them natively in the codecs).

Since CIDv2 can be viewed as a tuple of CIDs it’s possible to add support across the existing interfaces representing CIDv2 as a simple list of two CIDs in the existing IPLD Data Model.
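A minimal sketch of the "two CIDs" idea, assuming binary CIDv1 encoding (`<version><codec><multihash>`) and using only the standard library. `FatPointer`, `cid_v1` and `with_inline_context` are hypothetical names, not an existing API, and the choice of the raw codec (0x55) for the inline context is an assumption made for illustration.

```rust
/// Multicodec code for raw bytes and the identity multihash code.
const RAW: u64 = 0x55;
const IDENTITY: u64 = 0x00;

/// Append `value` as an unsigned varint (LEB128), as multiformats uses.
fn encode_varint(mut value: u64, out: &mut Vec<u8>) {
    while value >= 0x80 {
        out.push(((value & 0x7f) as u8) | 0x80);
        value >>= 7;
    }
    out.push(value as u8);
}

/// Binary CIDv1: <version = 1><codec><hash code><digest length><digest>.
fn cid_v1(codec: u64, hash_code: u64, digest: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    encode_varint(1, &mut out);
    encode_varint(codec, &mut out);
    encode_varint(hash_code, &mut out);
    encode_varint(digest.len() as u64, &mut out);
    out.extend_from_slice(digest);
    out
}

/// Hypothetical type: a fat pointer as a plain pair of binary CIDs.
struct FatPointer {
    data: Vec<u8>,
    context: Vec<u8>,
}

impl FatPointer {
    /// Inline the context bytes with an identity multihash,
    /// so the context needs no separate block.
    fn with_inline_context(data: Vec<u8>, context_payload: &[u8]) -> Self {
        FatPointer {
            data,
            context: cid_v1(RAW, IDENTITY, context_payload),
        }
    }

    /// The two CIDs back to back, data CID first.
    fn to_bytes(&self) -> Vec<u8> {
        [self.data.as_slice(), self.context.as_slice()].concat()
    }
}
```

For example, `FatPointer::with_inline_context(cid_v1(0x71, 0x12, &digest), b"hi")` pairs a dag-cbor/sha2-256 data CID with a six-byte inline context CID; the digest itself would have to come from a real SHA-256 implementation, which this sketch leaves out.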

There were a lot of discussions about this in-person, so there’s plenty of details I’m sure I’m leaving out, but it’s time to discuss.

@Winterhuman

Winterhuman commented Jul 25, 2022

Leaving a comment to state my support for this. The project I'm part of (IPNS-Link: https://github.com/ipns-link/ipns-link) could definitely benefit from this, especially the fact that IPFS Gateways can shorten the CIDs for subdomains.

I had originally written https://github.com/Winterhuman/ipns-link-alt/wiki/IPNS-Link-V2.1 as a way to use half the digest for a 128-bit hash and the rest for encoding the Host: header needed for the connection. However, that left only 13 bytes for the header, which was really restrictive.

But with this, the context CID can inline the Host header instead (and possibly other info too), and an IPNS-Link Gateway can just take the CIDv1 multihash of the CIDv2 OriginID.

@rvagg
Member

rvagg commented Jul 26, 2022

Shouldn't it get a v2 prefix in it? aka multicodec-cidv2? In the language of the CIDv1 spec above it, in multibase encoded form, wouldn't it be this?

<cidv2> ::= <multibase-prefix><multicodec-cidv2><data-cid><context-cid>

(with no multibase-prefix in byte form)

Or .. are you expecting all our decoders to read a CID and then optionally expect to possibly read a second one straight after it? I can imagine places where that would work (I think it would come cleanly through dag-cbor, although backward compatibility would be a problem). But other places where it probably wouldn't (dag-json might be a problem? CAR sections would certainly be a problem).

If we have a 2 in there then at least we get "unknown version: 2" from "legacy" CID parsers and their dependent decoders.

  • Not having a length up-front is mildly annoying, since you have to jump around to figure out how long this beasty is (you can't just build a varint-style "read the first few bytes and tell me how long this CID is going to be" which we currently can). But I don't imagine that's a blocker.
  • Is it worth constraining one or both of the sub-CIDs to CIDv1 so we can skip some CIDv0 hassle? Perhaps context-cid at least could be constrained to a v1?

@Ericson2314

I don't know what conversation led here, but this doesn't look like a good design.

Throwing more "who knows what this means!" metadata at a problem just weakens CIDs having a clear semantics distinct from specific implementations.

This seems like a classic case of https://wiki.c2.com/?OneMoreLevelOfIndirection.

@mikeal
Author

mikeal commented Jul 26, 2022

@rvagg it’s “any CID of any [current or future] version.”

Thinking of it like a codec, it always decodes to a tuple of two CID’s. Could be v0, v1, or v2, but as far as the codec is concerned it just returns two Link value types.

So ya, you can nest them indefinitely, but if you think about it long enough you’ll realize you could do the same thing today with identity CID’s :)

@mikeal
Author

mikeal commented Jul 26, 2022

> Throwing more "who knows what this means!" metadata at a problem just weakens CIDs having a clear semantics distinct from specific implementations.

That’s exactly the opposite of what is happening here.

Today, there’s a bunch of “who knows what this means” data in the network, the context for which exists only in the applications reading and writing the data. That context, currently, does not live in the network and the data lacks sufficient self-description. CIDv2 gives us the ability to write that context into the link layer with existing and future multiformat protocols.

The reason there isn’t a hyper-specific definition of exactly what “context” means is because it’s meant to extend to encompass the totality of all applications.

@mikeal
Author

mikeal commented Jul 26, 2022

> If we have a 2 in there then at least we get "unknown version: 2" from "legacy" CID parsers and their dependent decoders.

ya, there’s no getting around that if we want to upgrade though. at least there’s the CIDv1 encoding that applications can use if they need to make sure they work with systems that have older parsers.

> Not having a length up-front is mildly annoying, since you have to jump around to figure out how long this beasty is (you can't just build a varint-style "read the first few bytes and tell me how long this CID is going to be" which we currently can). But I don't imagine that's a blocker.

By the time you’re in the multihash you’ve got a length, so even with just jamming the CID’s next to each other you’ve never gotta parse more than a few bytes before you’ve got the end of the first CID. So I don’t see a super compelling reason to put the length a few bytes earlier, but maybe there’s a good reason to have a static guarantee of which byte the length is at? Having a couple varints in front of it will make it vary.
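To make the "you've got a length once you're in the multihash" point concrete, here is a rough sketch that finds the end of the first CID in a concatenated pair by reading four varints. It assumes both CIDs are binary CIDv1 (CIDv0 and nested versions are out of scope here), and `read_varint`, `cid_v1_len` and `split_pair` are hypothetical helper names.

```rust
/// Read one unsigned varint; returns (value, bytes consumed).
/// Multiformats varints are at most 9 bytes long.
fn read_varint(bytes: &[u8]) -> Option<(u64, usize)> {
    let mut value = 0u64;
    for (i, &byte) in bytes.iter().enumerate().take(9) {
        value |= u64::from(byte & 0x7f) << (7 * i);
        if (byte & 0x80) == 0 {
            return Some((value, i + 1));
        }
    }
    None
}

/// Length in bytes of the binary CIDv1 at the start of `bytes`.
fn cid_v1_len(bytes: &[u8]) -> Option<usize> {
    let (version, mut pos) = read_varint(bytes)?;
    if version != 1 {
        return None; // other versions are not handled in this sketch
    }
    // Codec, hash code, digest length: only the last value matters here.
    let mut digest_len = 0u64;
    for _ in 0..3 {
        let (value, used) = read_varint(&bytes[pos..])?;
        digest_len = value;
        pos += used;
    }
    if digest_len > (bytes.len() - pos) as u64 {
        return None; // digest would run past the end of the input
    }
    Some(pos + digest_len as usize)
}

/// Split `<data-cid><context-cid>` into its two halves.
fn split_pair(bytes: &[u8]) -> Option<(&[u8], &[u8])> {
    let first = cid_v1_len(bytes)?;
    let (data, rest) = bytes.split_at(first);
    if cid_v1_len(rest)? == rest.len() {
        Some((data, rest))
    } else {
        None // trailing bytes after the second CID
    }
}
```

The total length is only known after the fourth varint, which is the mild annoyance @rvagg describes: the offset of the length byte varies with the width of the version, codec and hash-code varints.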

> Is it worth constraining one or both of the sub-CIDs to CIDv1 so we can skip some CIDv0 hassle? Perhaps context-cid at least could be constrained to a v1?

I don’t think so at all. There’s going to be a block limit you’re inside of when you encode them most of the time, and you’re free to break them up into separate blocks in order to control the size of any nesting.

That’s the beauty of this whole approach, encoders have a lot of flexibility in how they choose to encode the pointers, and you kinda need that because fat pointers are, by definition “fatter” than average. You won’t find a single encoding that solves all use cases, so having the flexibility of encoding as a separate block lets the encoders get whatever they need.

@Winterhuman

What about constructing CIDv2 like this:

Example CIDv1:

multibase: base32
    multicodec: CIDv1
    multicodec: DAG-PB
        multihash: sha2-256 256

Alternate CIDv2:

multibase: base32
    multicodec: CIDv2
    multicodec: DAG-CBOR
        multihash: identity 256
    multicodec: CIDv2
    multicodec: DAG-CBOR
        multihash: identity 256
    multicodec: CIDv1
    multicodec: DAG-PB
        multihash: sha2-256 256
  • Requiring multibase to be shared for data and context CIDs avoids having a string with multiple bases in it, which could make the base encoding of the whole string a mess otherwise, and saves having multiple multibase bytes.
  • Putting the context CID first means CIDv1 parsers can immediately fail upon reading "CIDv2" and exit, this avoids applications which don't properly handle bytes appended after CIDv1.
  • For parsers, if you read multicodec: CIDv2, then you know there is at least one more CID after it. If you read multicodec: CIDv1, then you know there are no more CIDs after this one.
  • Since "CAR sections would certainly be a problem", the relevant applications can be modified to skip the bytes between multicodec: CIDv2 and multihash: identity 256 thus creating CIDv1 again (though that means CIDv2 wouldn't be supported in CAR, at least initially).

Not saying my idea is any good, just putting this out there to continue the discussion.

@aschmahmann

aschmahmann commented Jul 26, 2022

This spec proposal is IMO missing a few things, unrelated to the technical aspects, that would make it easier to reason about.

While AFAIK the multiformats repo does not have as formalized a spec proposal process as others in our ecosystem (e.g. off the top of my head IPFS, libp2p, and Filecoin), it's still important to run through some process here. In fact it's probably more necessary here since part of the reason there's no specs process in multiformats is that the specs have largely not changed in years.

Some things that IMO are missing as illustrated by the comments and questions above mine:

  • What is the actual spec change being proposed?
    • There seem to be two, one is a special type of CIDv1 with identity multihash and the other is two concatenated CIDv1s
    • Perhaps writing out some examples would help here
  • Why is it required?
    • You mention DAG House needing it and some in person discussions, but nothing concrete to hold onto. Spec changes should have rationale behind them supporting the change that can serve as an artifact.
      • Lots of the community who cares about CIDs was not at the event you mentioned and are unaware of any discussions
      • I was even at that event and don't recall having conversations with these conclusions, so how could this be expected of others 😄
  • What were the alternatives evaluated?
  • What are the ramifications of the change?
    • This spec is at the bottom of a lot of other specs and there may be a lot of fallout here, understanding which ones you've thought of allows people to ask questions about those or the ones you have not.

The linked spec processes above may give some more insight as to other details to add here.

I have some thoughts on the proposal, but I suspect my comments will be more helpful once some of the above is written up. Otherwise I'm trying to make guesses without sufficient context - which is rarely a good idea 😅

@mikeal
Author

mikeal commented Jul 26, 2022

> There seem to be two, one is a special type of CIDv1 with identity multihash and the other is two concatenated CIDv1s

Nothing being proposed here in CIDv1 is outside of what CIDv1 already does. I think what is missing is a more straightforward definition of “CIDv2 as a CIDv1 codec.”

> What were the alternatives evaluated?

We’ve been discussing “fat pointers”/CIDv2 for years, some of those discussions exist in issues and PR’s, many don’t. I’m not going to go and dig all of those up just to go back through the process of shooting them down in a new forum.

If people want to bring up alternatives I’m happy to discuss them, but most of the other approaches we’ve looked at were focused on addressing more specific features like “i want a link with a path” and “i want a link that is both an immutable address (CIDv1) and a reference to a mutable pointer” all of which are accomplishable within this approach but I really don’t want to start proposing what those will look like for fear of being pulled into an endless bikeshed.

> While AFAIK the multiformats repo does not have as formalized a spec proposal process as others in our ecosystem (e.g. off the top of my head IPFS, libp2p, and Filecoin), it's still important to run through some process here. In fact it's probably more necessary here since part of the reason there's no specs process in multiformats is that the specs have largely not changed in years.

Is there any reason why we shouldn’t just use the IPIP process? I’m happy to agree on this there and then come back to this repo for an update after it’s agreed upon.

The lack of formal governance in multiformats makes me pretty hesitant to invest a lot of time in formalizing any of these arguments as there is no mechanism for resolving disputes or calling the discussion to a close.

@mikeal
Author

mikeal commented Jul 26, 2022

> You mention DAG House needing it and some in person discussions, but nothing concrete to hold onto.

Our use case is “all the context our services build about the data we receive and transport.” Knowing that “this block is a UCAN capability” and “this block is a UCAN invocation” is entirely based on signaling we do outside the data.

Since this signal is invisible to the link layer, none of what we put into the network can be properly leveraged by other actors in the network because the data lacks sufficient self-description for it to be useful in-and-of itself. In order to produce cross compatible applications we have to build additional protocols in the transport or discovery layer, which means we don’t get the “open innovation” we all want to see.

This is not a blocker for us building a service, it’s a blocker for building a real ecosystem around what we do. Frankly, it’s a little frustrating after all these years that people don’t consider this a serious problem, and the response when it’s brought up is usually a lot of moaning about the work we’ll have to do in upgrading some of our tools.

This is a substantial barrier to realizing growth of the network. When data lack self-description, applications and services become the arbiters of how the data in the network can be leveraged, and this represents a substantial barrier to the network effects you would expect to realize in a network of publicly addressable data.

@RangerMauve

Would it be possible to add some examples of what this looks like with existing codecs like dag-json or dag-cbor?

As well, would it be possible to have examples of how this could change IPLD schemas (https://ipld.io/docs/schemas/)? I think it would be cool to make sure we could include expected data in fat pointers when defining a schema.

@RangerMauve

> If people want to bring up alternatives I’m happy to discuss them, but most of the other approaches we’ve looked at were focused on addressing more specific features like “i want a link with a path” and “i want a link that is both an immutable address (CIDv1) and a reference to a mutable pointer” all of which are accomplishable within this approach

I really like this! This could lead to some useful specs built on top for better interoperability between some existing systems like wnfs.

Member

@lidel lidel left a comment

> Is there any reason why we shouldn’t just use the IPIP process?
> I’m happy to agree on this there and then come back to this repo for an update after it’s agreed upon.

Agree, following the IPIP process will be best. Filed #51 to write this down as a policy for this repo.

Comment on lines +135 to +137
For instance, IPFS HTTP Gateways redirect to CID based subdomains which introduced a byte limit on the size
of the link. In this case, IPFS HTTP Gateways would create a single block for the CIDv2 link
with a 256b multihash encoded into CIDv1 for any redirect subdomain.
Member

⚠️
Just flagging this may not be the best example.
When subdomain gateways were created, we chose to not do this block generation because asking Gateways to create blocks on the fly is a can of worms (complexity, link rot):

  • what happens when I copy the CID created by the gateway and share it with someone? who is providing the root block now?
    • are gateways expected to cache/pin these artificial blocks and provide them to the network?
    • or is it up to every IPFS client to double-publish both CIDs on the DHT?

@Ericson2314

> The reason there isn’t a hyper-specific definition of exactly what “context” means is because it’s meant to extend to encompass the totality of all applications.

I.e. we're going to be right back where we started! Data that we don't know what it means, because we don't know what the metadata CID means! Will we need a meta-metadata CID too?

CIDv2 is, quite literally, two CIDs.

```
<cidv2> ::= <data-cid><context-cid>
```
Member

I hope with cid you mean how CIDs are used today and not what the spec says, i.e. these won't contain a multibase prefix.

@vmx
Member

vmx commented Aug 2, 2022

> Shouldn't it get a v2 prefix in it?

I agree with @rvagg here. I think it should be a CIDv2, I think that would make implementations easier. If you don't want to fully support CIDv2, you could just add the v2 to your case statement and decode it as v1 and ignore the trailing bytes.
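A rough sketch of that "add v2 to your case statement" idea. It assumes, which this thread does not settle, that the bytes after a version-2 varint start with a codec and multihash laid out exactly as in CIDv1, followed by the context part. `DataPart` and `decode_data_part` are hypothetical names.

```rust
/// Read one unsigned varint; returns (value, bytes consumed).
fn read_varint(bytes: &[u8]) -> Option<(u64, usize)> {
    let mut value = 0u64;
    for (i, &byte) in bytes.iter().enumerate().take(9) {
        value |= u64::from(byte & 0x7f) << (7 * i);
        if (byte & 0x80) == 0 {
            return Some((value, i + 1));
        }
    }
    None
}

/// The CIDv1-shaped part of a CID: codec plus multihash.
#[derive(Debug, PartialEq)]
struct DataPart {
    version: u64,
    codec: u64,
    hash_code: u64,
    digest: Vec<u8>,
}

/// Decode the leading codec and multihash, ignoring any trailing bytes.
fn decode_data_part(bytes: &[u8]) -> Result<DataPart, String> {
    let (version, mut pos) = read_varint(bytes).ok_or("truncated version")?;
    match version {
        // v2 is read like v1; whatever follows the digest is left alone.
        1 | 2 => {}
        other => return Err(format!("unknown version: {}", other)),
    }
    let mut fields = [0u64; 3]; // codec, hash code, digest length
    for field in fields.iter_mut() {
        let (value, used) = read_varint(&bytes[pos..]).ok_or("truncated header")?;
        *field = value;
        pos += used;
    }
    let [codec, hash_code, digest_len] = fields;
    if digest_len > (bytes.len() - pos) as u64 {
        return Err("truncated digest".to_string());
    }
    let end = pos + digest_len as usize;
    Ok(DataPart { version, codec, hash_code, digest: bytes[pos..end].to_vec() })
}
```

A parser without the `2` arm falls through to the `unknown version` error, which is the failure mode @rvagg wanted legacy decoders to hit instead of silently misreading a bare concatenation.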

> So ya, you can nest them indefinitely, but if you think about it long enough you’ll realize you could do the same thing today with identity CID’s :)

I think I lean towards what @rvagg suggested and making the context CID a v1 only. So that you don't have nesting. Unlimited nesting sounds like a big can of worms, especially thinking about codecs that encode CIDs.

@johnchandlerburnham

johnchandlerburnham commented Aug 2, 2022

Just wanted to chime in to support this proposal on behalf of the Yatima team! Our work with Lurk-Lang would hugely benefit from (and in some ways requires) having the ability to add additional context or metadata to CIDs. One concept we played around with a couple of months ago to do this was to remove the length limit on the multicodec field, which we did a short write-up on when designing Lurk's IPLD content-addressing.

I definitely prefer the proposal here of using a tuple of CIDs though, since it seems a more minimal/compatible change with how CIDs are used. I also think @Ericson2314 comment about not creating recursive nesting CIDs is really important, and any "fat-pointer" CIDv2 proposal should be as constrained as possible, while still achieving the goal of allowing for more expressive metadata beyond just the multicodec in a CID.

Here's my interpretation in Rust of how I understood @rvagg's concept of CidV2 as a pair of CidV1s:

pub struct CidV2<const S: usize> {
    /// The data multicodec.
    data_codec: u64,
    /// The data multihash.
    data_hash: Multihash<S>,
    /// The metadata multicodec.
    meta_codec: u64,
    /// The metadata multihash.
    meta_hash: Multihash<S>
}

This would serialize as:

<cidv2> ::= <multicodec-cidv2><multicodec-metadata-content-type><multihash-data><multicodec-data-content-type><multihash-metadata>

(with the prepended multibase prefix when represented in text)

As an example, suppose you wanted a CID that pointed to both an IPLD data structure and its IPLD schema. Let's say you have the schema Trit, with a particular integer representation:

type Trit enum {
  | True ("1")
  | False ("2")
  | Unknown ("0")
} representation int

which corresponds to the Ipld data: Ipld::Num(1), Ipld::Num(2), Ipld::Num(0).

While you could in principle propose a new multicodec for Trit, this might not be suitable if Trit is a temporary or ephemeral structure, or if you have a large number of different schemas (for instance, in Lurk-lang's content-addressing we would need to reserve 16 bits of the multicodec table, or 2^16 distinct multicodecs).

However, since IPLD schemas can be represented as JSON (https://ipld.io/specs/schemas/#dsl-vs-dmt) and hashed, with a CIDv2 we could reserve a single IPLD schema multicodec, along with the codec for the data representation (such as dag-cbor)

| name | tag | code |
|---|---|---|
| dag-cbor | ipld | 0x71 |
| ... | ... | ... |
| IPLD schema | ipld | 0x3e7ada7a |

We could then use the above CidV2 definition to create a pointer to any Schema+Data pair:

CidV2 { 
  data_codec: 0x71,
  data_hash: <data_multihash>,
  meta_codec : 0x3e7ada7a, 
  meta_hash: <schema-multihash> 
}

And thus we could then create an unambiguous hash to Trit::True with

CidV2 {
  data_codec: 0x71, 
  data_hash: Ipld::Num(1).hash(),
  meta_hash: trit_schema.hash(), 
  meta_codec : 0x3e7ada7a,
}

without having to reserve anything new on the multicodec table.

For backwards compatibility, CidV2's could be embedded inside CidV1s by using the cidv2 codec and the identity multihash:

CidV1 {codec: 0x02, hash: <identity-multihash-of-cidv2-serialization> }

(In fact you can already nest CidV1's inside themselves with the identity multihash in this way)
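As a hedged illustration of that embedding, the sketch below wraps an already serialized CIDv2 in a binary CIDv1 whose codec is 0x02 and whose multihash is the identity hash (code 0x00) of those bytes. The CIDv2 serialization itself is treated as opaque input, since its layout is still under discussion; `wrap_in_cid_v1` is a hypothetical helper name.

```rust
/// Codec used for the embedded CIDv2 (0x02, as in the comment above)
/// and the identity multihash code.
const CIDV2_CODEC: u64 = 0x02;
const IDENTITY: u64 = 0x00;

/// Append `value` as an unsigned varint (LEB128).
fn encode_varint(mut value: u64, out: &mut Vec<u8>) {
    while value >= 0x80 {
        out.push(((value & 0x7f) as u8) | 0x80);
        value >>= 7;
    }
    out.push(value as u8);
}

/// Wrap opaque CIDv2 bytes as
/// <version = 1><codec = 0x02><identity><length><cidv2 bytes>.
fn wrap_in_cid_v1(cidv2_bytes: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    encode_varint(1, &mut out);
    encode_varint(CIDV2_CODEC, &mut out);
    encode_varint(IDENTITY, &mut out);
    encode_varint(cidv2_bytes.len() as u64, &mut out);
    out.extend_from_slice(cidv2_bytes);
    out
}
```

A CIDv1-only parser sees this as an ordinary identity CID with a codec it may not recognize, so it can carry and re-encode the value without understanding the pair inside. The cost is size: the whole CIDv2 travels inside the digest field.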

Would love to hear feedback on whether the above idea seems like a reasonable direction to go in. I and the Yatima team would absolutely love to collaborate on this proposal, whether that's working on an IPIP, writing a Rust implementation, etc.

@softwareplumber

I very much support the idea of implementing some kind of 'fat pointer'. Are there potential vulnerabilities associated with expressing the relation between content and metadata as a tuple which is not, in itself, hashed? I think a lot of people have been toying with similar ideas; mine was similar, but the metadata hash was prepended to the content when hashing it. I'm no cryptographer, so it's quite possible that this is unnecessary, but it seemed like a sensible precaution at the time.

@BigLep
Contributor

BigLep commented Aug 4, 2022

There's good discussion happening here. Given the push to have this follow the IPIP process (see FAQ), can someone create the IPIP so the comments happen there?

@rvagg
Member

rvagg commented Aug 8, 2022

Thanks @johnchandlerburnham for taking this further with ipfs/specs#305, we should probably move most discussion over there so we can get specific about it.

@ivan386

ivan386 commented Sep 14, 2022

UnixFS has a metadata block type. If you need to have metadata in a CID, then just inline that block in it.

@rvagg
Member

rvagg commented Sep 20, 2022

> UnixFS has a metadata block type. If you need to have metadata in a CID, then just inline that block in it.

Except that the ask is for a link and metadata, not just metadata. Although that is certainly a valid way to encode metadata when you have a place to put it—it's just that unixfs is a format on top of a format (dag-pb) so maybe not the most optimal form?

@ivan386

ivan386 commented Sep 22, 2022

The metadata block contains a link. Example of identity link with metadata.

@vmx
Member

vmx commented Jan 17, 2023

I've spent a lot of time talking about CIDv2 at IPFSCamp/LabWeek in Lisbon. I now found the time to write things down a bit. The result is at https://hackmd.io/@vmx/SygxnMmso (it still needs work). What I realized after writing it down is that my proposal is basically what @mikeal originally suggested: just having two CIDs, one for the context, one for the content.

The difference from this PR is that it really is that tuple and not a CIDv2. The reason is that this way CIDs won't change, and neither will the IPLD Data Model. This is a huge win: "fat pointers" (or what I call "Application Context") become a layer on top of that.

Nonetheless I still have one more idea to write down that floated around, once done I'll link it from the HackMD mentioned above.

I'm closing this issue as I'm convinced (after talking with so many folks about this) that it should not be a CIDv2, but something built on top of those primitives.

@vmx vmx closed this Jan 17, 2023
@lidel changed the title from feat: CIDv2 “fat pointers” to IPIP-49: CIDv2 “fat pointers” on Apr 19, 2023