Skip to content
This repository has been archived by the owner on Jun 29, 2022. It is now read-only.

Identifying the critical decisions in IPLD advanced data structure design #130

Closed
warpfork opened this issue Jun 11, 2019 · 6 comments · Fixed by #177
Closed

Identifying the critical decisions in IPLD advanced data structure design #130

warpfork opened this issue Jun 11, 2019 · 6 comments · Fixed by #177

Comments

@warpfork
Copy link
Contributor

(Forward: this is a collection of a lot of stuff, and I'm trying to be comprehensive, but will inevitably fail. If you want to read this and visualize semi-continuous injections of "TODO: expand" and "TODO: more alternatives?", that will probably be appropriate!)


There are approximately three distinct and critical decisions we need to regard:

  1. Signaling
  2. Referencing
  3. Having the implementation

Signaling

"Signaling" refers to deciding / detecting that we need any advanced layout
engagement at all, rather than simply having another regular map, list, or
byte sequence.

Note that "signaling" refers only to the conditional branch that decides
we're going to do something -- it doesn't define what the something *is;
that's the "referencing" decision.

(It's useful to regard signaling as a distinct issue because we should be
able to point on a flow control map to the exact position where signaling
occurs -- it's one point.)

Referencing

(alternate titles: "rendezvous", "implementation selection")

"Referencing" refers to the need to pick a particular implementation of code
we're about to use to understand the data layout.

Roughly, "referencing" is about how something says "I'm a hamt!" or
"i'm a rabin chunked byte sequence!".

Note that referencing is about which of those things a data structure is;
the "signaling" decision already handled whether or not something special
is about to happen. (Referencing can involve more complex decision trees,
whereas signaling is a pure boolean.)

Having the implementation

"Having the implementation" refers to the need to have actual executable code
which understands and manipulates the data layout. This might refer to a
plugin registration system, or a bytecode interpreting system, or... etc.

Other

Some advanced layout algorithms have additional parameters than merely their reference;
for example, HAMTs can have a 'bitwidth' parameter.

These additional parameters may be worthy of special consideration, because
they have the interesting property of being variable without needing new code
in the client libraries acting on the data. (This also means that parameters
as recorded in serial data can vary in ways that don't affect whether or not
we "have the implementation", which may be useful; however, by the other side
of the same coin, it poses further interesting questions.)

We could also choose to remove these from the field of play entirely and
simply state that the must be part of the Referencing data overall and
opaque to the rest of the system.

Known (so far) solution vectors

(These are selected vectors in the solution space; they might not be exhaustive.)

Signaling

defacto signaling

We can implement signaling by leaving it completely unspecified in the data
and completely unstandardized in general, instead having applications "know"
what advanced layouts they expect where.

This is defacto what applications like unixfsv1 already do.
(Take heed: It does work!)

The thing we don't like about defacto signaling is that since it eschews
standard conventions and has no support from library APIs, it involves a large
amount of work per new application. It also does not make it trivial to
link from one kind of application data into another (e.g., gateways currently
defacto engage unixfsv1, but we don't want to embed gateway http links in
other documents; that would kind of miss the point.)

However, perhaps by systematizing some patterns that defacto systems already
employ, we could reach a happier place. (This turns into "explicit signaling".)

explicit signaling

We can implement signaling by standardizing a system for doing it with some coupled data.
That coupled data may or may not contain the Reference; the Reference could also remain in the data itself.

Schemas seem very likely suitable for a mechanism to carry explicit signaling for several reasons;
but we can also define explicit signaling that's independent of schemas.

Explicit signaling has several nice features, such as allowing the same raw data
to be perceived in different ways depending on explicit signal.

in-band signaling

We can implement signaling by reserving a special key string in maps,
and having it contain the Reference.

N.b. this doesn't actually help produce new data; another solution to signaling
is still required when producing new data.

in-band signaling with reserved keys

(I think we're just done with this one right? Blueboxing. Not doing it.)

in-band signaling with multi-codecs

Multicodecs already give us a place to stash an extra 'bit' of information,
which means we can use it to do in-band signaling without blueboxing problems.

Whether or not we want to use more multicodecs is a topic of debate.
(TODO: expand)

Referencing

Forward: Everyone wants an obvious slam dunk here, but there isn't one.

behavioral specification

Behavioral specification refers to using some human-readable strings
to define what we're doing.

For example, we might have a string property in an object with a format
shaped roughly like "ipld/experimental/hamt/v1"; we would then try to
translate that string into a lookup of an implementation which will do
the right thing.

CID linking

CID linking is the idea that we can reference a particular implementation
of an algorithm by a CID.

(CID linking tends to be suggested with the specific implementation of "WASM" in
mind, but there's nothing particularly special about WASM; any linkable code
which can be dynamically interpreted would suffice.)

There are unresolved (possibly unresolvable?) issues with CID linking that
make some of the hopes and dreams we typically associate with merkle designs
not hold: namely, having a CID of some particular hash of a particular bytecode
isn't necessarily a canonical thing. (What's a 'canonical' compiler for a
'canonical' expression of a behavior?)

A CID of some bytecode specifies implementation details rather than semantics
and contract, and that's not necessarily the kind of useful that we're looking for.

A CID of some bytecode also does not itself address any issues of versioning
the interpreter of that bytecode itself, nor versioning any APIs for getting
data into and out of that bytecode's environment.

Having the implementation

(There aren't that many options here,
so just skip down to the "Probable Solutions" section.)

A note about context-freedom

We don't need to put the advanced layout Reference data in every layer
of internal data structure: it would be redundant.

It's already the case that many multi-block data structures are unusable
if you're linked directly into the middle of them (e.g., when traversing
the internal nodes of a HAMT, if you don't have information "on the stack"
which recalls which prefix bytes we've already munched, navigation is not
possible): therefore it should be noted that total context-freedom is not
generally possible. It follows that solutions to our various questions
which lean on some degree of "on the stack" context are potentially viable,
and warrant discussion.

Probable outcomes

Signaling

Vigorous debate.

Referencing

We seem to have a rough consensus that this should be a human-readable string.
CID-linking approaches aren't even defined unless we have a blocking dependency
on WASM, which is no fun.

It should be some reasonably prefixy string pattern -- e.g., "ipld/experimental/hamt/v1" --
and it does not need to be particularly short because it's not frequently appearing.

Having the Implementation

Surprisingly, this is the easy part?

If using non-WASM plugins, they have to be registered in your library of choice,
indexed by the Reference string.

If using WASM plugins, we can refer to their bytecode by CID.
(However, do note there are at least two sub-problems this doesn't address:
first, that the WASM interpreter itself and APIs will need versioning;
second, that users may not want to download and run non-time-bounded code.)

@warpfork
Copy link
Contributor Author

Some previous discussion (I haven't yet checked for complete dedup with this; it has a lot of discussion): #118

@warpfork
Copy link
Contributor Author

Related issue in an implementation poc repo: ipld/js-composites#3

@warpfork
Copy link
Contributor Author

Earlier (but completely superceded) draft of this: https://gist.github.com/warpfork/b315f3518207075b7044f35a0a15d17a

@mikeal
Copy link
Contributor

mikeal commented Jun 11, 2019

Related: #126 (comment)

An issue I plan to resolve in the current spec is that the signaling and the definition are tightly coupled. That will be broken apart, so we’ll have a clear idea of what a “Definition” for an IPLD Generic is, and how we associate that with the data (signaling) will be its own problem space.

Also, the operation descriptions are probably going to move to being defined for WASM rather than generically. The JS implementation will try to closely match this, but the API signatures for WASM need to be well defined in a spec in a way other languages do not and it’s not really a requirement in other languages that they closely match any API given that they can choose to associate the string identifier with any implementation using any API they wish.

With WASM, I think we avoid the “referencing” and “having an implementation” problems (for the most part) but there could be some aspects of the system I don’t see yet given that we don’t yet have a working implementation.

@rvagg
Copy link
Member

rvagg commented Jun 12, 2019

Re WASM as the utopia used for aiming purposes -- even if we don't get there I still quite like the idea that we split out these algorithms into their functional components. "Here's how you GET on this thing, and here's how you KEYS on it" as entirely independent pieces of code. What I'm finding is that it enforces a nice rigour on the algorithm implementations to get the abstractions right in terms of their relationship to blocks of data and pieces of the data model. A stylistic thing but in the same way that good code formatting can aid good code design and a good module system can aid .. modularity.

Re schemas -- it's worth registering that schemas are going to be a useful, perhaps mandatory, tool in the advanced layout / generics flow. I want an algorithm to be able to assert that it's getting the block shapes (or at least data shapes) that it expects at each stage of traversal and avoid lose assumptions. Schemas are also likely going to be important on the write side, particularly where codegen comes in to play, but also as a translation tool for the more awkward layouts (like unions and the various non-plain representations like stringjoin).

So, even though we're not using schemas for this yet, keep in mind that they are going to be in play. So it's not unreasonable to insert them as a mandatory part of this process. It could work in any of the signalling mechanisms cases outlined above - defacto = "expect univfsv2 blocks that look like this", explicit = "here's your link and it will be a hamt that looks like this", in-band = "i'm a hamt and I conform to this".

For referencing, I still like associating serialised schema CIDs with implementations in spite of the naysaying. But maybe we can come back to that later because I think objections are overblown and using strings has the URL-mutability problem and the namespacing problem and I like to think we're trying to move away from both of those problems as much as possible not entangling more of them.

@warpfork
Copy link
Contributor Author

warpfork commented Jun 20, 2019

For referencing, I still like associating serialised schema CIDs with implementations

I should maybe try to clarify my position on this a bit. I'm fine with this. I just think that if trying to check the boxes in the phrase "{necessary} {and|but not} {sufficient}", it's "nice" rather than "necessary", and also not "sufficient" (more than one behaviorally distinct piece of code that assembles the final view of the data can exist and have the same topological structure of serial data).

But "nice and not sufficient" can still be traits of an overall good idea.

And maybe a schema that has sufficiently high entropy (e.g. some fields with intentionally unusual names, such that it effectively becomes "version"-detection) can actually make "sufficient" for-all-practical-purposes true. Hrm.

warpfork added a commit that referenced this issue Jun 20, 2019
(This is opening up a whole new directory tree, the idea of which is
roughly that "it would be nice to collect some experience reports and
user stories, *before* writing final specs"; and we might still want to
be able to easily refer and link to them and accumulate more of them
while we work towards being ready to make final specs proposals.)

(There's correspondingly a series of "in the opinion of this author"
text in this file.  The idea is that other folks might write very
different explorations, and we might merge and retain both of them.)

(I'd actually like to re-issue #130
as a file in this sort of a format.  One can't link to headings in
a github issue; one *can* link to headings in a markdown file, like
the one this diff is adding.  Maybe other organizational formats could
make even better trades, but this deficiency of github issues is
worth noting.)
vmx added a commit that referenced this issue Aug 20, 2019
rvagg pushed a commit that referenced this issue Sep 30, 2019
rvagg pushed a commit that referenced this issue Sep 30, 2019
prataprc pushed a commit to iprs-dev/ipld-specs that referenced this issue Oct 13, 2020
(This is opening up a whole new directory tree, the idea of which is
roughly that "it would be nice to collect some experience reports and
user stories, *before* writing final specs"; and we might still want to
be able to easily refer and link to them and accumulate more of them
while we work towards being ready to make final specs proposals.)

(There's correspondingly a series of "in the opinion of this author"
text in this file.  The idea is that other folks might write very
different explorations, and we might merge and retain both of them.)

(I'd actually like to re-issue ipld#130
as a file in this sort of a format.  One can't link to headings in
a github issue; one *can* link to headings in a markdown file, like
the one this diff is adding.  Maybe other organizational formats could
make even better trades, but this deficiency of github issues is
worth noting.)
prataprc pushed a commit to iprs-dev/ipld-specs that referenced this issue Oct 13, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants