API Design Discussion #338

Open · Gozala opened this issue Jul 22, 2020 · 15 comments

Gozala commented Jul 22, 2020

As per ipfs/notes#436 I would like to use ipfs-lite as a foundation for providing an IPLD storage & replication system. At the same time I would like to use this as an opportunity to explore / design an API that best fits that system, instead of trying to provide backwards compatibility with the ipfs.dag and ipfs.block APIs, which were designed for a different set of constraints and use cases. Here is my short (constraint / wish) list:

  • Interops with the (new) IPLD stack instead of mimicking or wrapping it.

    E.g. ipfs.dag under the hood creates IPLD Blocks, serializes them and stores them. Instead we should just take a serializable IPLD Block and store its bytes.

  • Interop layer should be extremely minimal, to reduce coordination to a bare minimum.

    E.g. the Block API could change and may grow many bells and whistles; instead of assuming Block instances we should have a minimal Block interface that can be serialized, identified, etc. (a rough sketch follows this list).

  • Interop layer should enable case-specific Block implementations

    This should follow from the above goal, but I wanted to call it out explicitly.

  • Designed for multi-threaded use.

    Copying things across threads in JS can be costly. The API should take that into account and leave room for optimizations. That means instead of blocks dictating how their data is consumed, they should give the consumer the flexibility to choose the best way to consume it. E.g. the Web Response is a good example, allowing the consumer to read it as a stream, blob, text, etc., while AsyncIterable<Uint8Array> is not, because:

    • It can be read only in one way.
    • It is impossible to tell if the underlying content is in memory or being allocated as it's being consumed.
    • It provides no information about the amount of content.
    • It provides no context regarding whether individual chunks can be consumed (transferred) or need to be copied.
    • It isn't peekable.
  • First-class batching

    • js-ipfs uses AsyncIterables to put the consumer in control (pull interface).
    • The new CAR format is used to pack multiple blocks and push them out in one go (push interface).

    What we need is a first-class primitive for representing multiple blocks which:

    • Allows the consumer to pull on its own schedule.
    • Allows the producer to push multiple blocks on its own schedule.
    • Allows the application author to decide how tightly the two should coordinate.

    This may seem familiar
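
A rough TypeScript sketch of the kind of minimal Block interface and batch primitive the list above is gesturing at. It is purely illustrative: the names (cid, size, bytes(), stream(), BlockBatch) are assumptions for the sake of discussion, not an existing IPLD API.

```ts
// Hypothetical sketch only — names and shapes are assumptions, not an existing API.

interface CID {
  readonly bytes: Uint8Array
  toString(): string
}

// A block only needs to be identifiable and serializable; how it holds its
// bytes (in memory, on disk, in a transferable buffer) is up to the implementation.
interface Block {
  readonly cid: CID
  // Total size in bytes, known up front, so consumers can plan allocation.
  readonly size: number
  // Multiple ways to consume, Response-style: the consumer picks the cheapest.
  bytes(): Promise<Uint8Array>
  stream(): ReadableStream<Uint8Array>
}

// A batch of blocks that supports both pull (async iteration) and push
// (the producer writes on its own schedule), letting the application decide
// how tightly the two sides coordinate.
interface BlockBatch extends AsyncIterable<Block> {
  push(block: Block): void
  close(): void
}
```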

Gozala commented Jul 22, 2020

/cc @mikeal @lidel

mikeal commented Jul 22, 2020

Love it 😁

carsonfarmer (Collaborator) commented:

Agreed. This is clearly a well thought out "upgrade" path for whatever this library becomes. I'm 100% on board, and in particular, am excited about the first class batching proposal and the freedom that this API rethink affords us!

Gozala commented Jul 22, 2020

In my conversation with @mikeal he stated additional constraints that the API would have to meet to enable efficient replication for https://github.com/mikeal/dagdb/. I would like those constraints to be captured here, even if we ultimately choose not to address them in the first iteration. So @mikeal, if you get a chance to do so, I would highly encourage you to.

mikeal commented Jul 22, 2020

It’s probably best to look at one of the storage implementations to understand the requirements. https://github.com/mikeal/dagdb/blob/master/src/stores/kv.js

It’s pretty simple. A store needs to maintain certain information about what it’s storing so that it can be queried without parsing every block in a graph that is being replicated.

  • The links from/to every block CID.
  • Whether or not the entire graph for a given CID is already stored.

If a store is keeping track of this then it can present a graph() method that can respond with the missing blocks in that graph for a given depth. Once a query has run, the store can cache which graphs it already holds in full, making another walk of even the full index unnecessary.

For many use cases, like DagDB, you simply cannot send the relevant cache state over the protocol in the form of a query or a list of blocks. The potential for overlap is too high and the information about each graph that each peer has is, by definition, incomplete.

Graphsync works well for blockchains because a blockchain is an append-only log structure and it’s highly unlikely (sometimes impossible) that there will be a large amount of duplicate data in the chain between an old HEAD and a new HEAD.

For structures with more de-duplication and a less straightforward append-only log structure that entire approach just won’t work. The minimal indexing I mentioned above is the bare minimum a store needs to keep in order to be able to tell a peer “this is what I’m missing in that graph” and request each block it does not have in its store.
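
Very roughly, the store shape being described might look like the following TypeScript sketch. This is a hedged illustration, not dagdb's actual interface (see the kv.js store linked above for the real thing); the method names and return shapes are assumptions.

```ts
// Hedged sketch of an indexed block store with a graph() query — names and
// shapes are assumptions for illustration, not dagdb's actual API.

interface GraphQueryResult {
  // true when the entire graph under `root` is already in the store
  complete: boolean
  // CIDs known to be referenced (up to `depth`) but not yet stored
  missing: string[]
}

interface IndexedStore {
  // Outbound links are recorded at write time so that queries never need
  // to re-parse stored blocks.
  put(cid: string, bytes: Uint8Array, links: string[]): Promise<void>
  get(cid: string): Promise<Uint8Array | undefined>
  links(cid: string): Promise<string[]>
  // Answer "what am I missing under this root, down to this depth?";
  // completeness can be cached once a graph is known to be fully present.
  graph(root: string, depth?: number): Promise<GraphQueryResult>
}
```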

mikeal commented Jul 22, 2020

I should also mention that the above indexing can be leveraged for opportunistic garbage collection. Knowing the to/from links of a block means that when you mutate a graph you can walk the diff, find the orphaned blocks and remove their references in the storage engine, because you know whether there are any other outstanding links to each block.

This actually conflicts a little with requirements around storing by multihash, because it’s possible to have block data referenced by different codecs. There are ways around it, but it would require a pretty specific indexing structure, and the layers don’t separate cleanly. The “block store” will need to think in terms of CIDs but store by multihash and index by [multihash + codec].
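
One way to picture that layering, as a purely illustrative sketch (the key encoding below is an assumption, not a prescribed schema):

```ts
// Illustrative key layout only. Block bytes are stored once per multihash,
// while link/GC indexes are kept per (multihash, codec), since the same bytes
// can decode to different links under different codecs.

const blockKey = (multihash: string) => `blocks/${multihash}`
const linksKey = (multihash: string, codec: string) => `links/${multihash}/${codec}`
const backlinksKey = (multihash: string, codec: string) => `backlinks/${multihash}/${codec}`
```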

Gozala commented Jul 23, 2020

It’s pretty simple. A store needs to maintain certain information about what it’s storing so that it can be queried without parsing every block in a graph that is being replicated.

  • The links from/to every block CID.
  • Whether or not the entire graph for a given CID is already stored.

Would it be fair to say that this is an (important) implementation detail? In other words, could there be a naive but compatible implementation that walks and parses blocks on each query, even if the performance is impractical?

The reason I am asking is that if the answer is yes, then this does not necessarily affect the end user API.

If a store is keeping track of this then it can present a graph() method that can respond with the missing blocks in that graph for a given depth.

  • Which actor in the system is calling this method?
    • Is the idea here that a peer wishing to sync will call graph() to identify what it needs to provide / request?
  • Is graph() supposed to identify missing blocks outwards of the given CID or inwards of it?
    • If it's outwards I think I'm starting to understand the motivation.
    • If it is inwards then I'm lost.
  • I'm getting the impression that this is useful for syncing replicas.
    • However, when I think of the API for users, I'm thinking of ways to read & write blocks into the local replica.
      • Is this a mistake on my end?
      • I'm inclined to think that replication is a system-layer API, not an application-layer one.

you simply cannot send the relevant cache state over the protocol in the form of a query or a list of blocks. The potential for overlap is too high and the information about each graph that each peer has is, by definition, incomplete

So is the idea here that the local replica can query others to identify new blocks that link to the known graph?

  • I'm not sure I grasp how that would apply to polytrees like the one shown below. E.g. when I query by B I may also be aware of A and H.

    [polytree diagram]

    • Is the idea that those would be provided in a skiplist?
      • If so, doesn't that imply that for a complex enough graph the skiplist is going to be too large?

mikeal commented Jul 23, 2020

Which actor in the system is calling this method?

Any actor, it’s actually best not to think about it that way.

Consider this: the store has a state, it has complete or incomplete graphs of various CIDs, and that state is being queried.

If I want to move data from one store to another I need to ask about this state. In a push situation that’s the client, but you could also structure replication as a pull from the remote end. Either way, the store receiving data needs to provide this for efficient replication and even the client store may want to use it for some local optimizations if it doesn’t have all the blocks already either.

Would it be fair to say that this is an (important) implementation detail

The whole point of the graph() method is to enable efficient replication. If you don’t have an efficient way to provide the API it’s probably better not to have it at all and fall back to a replication strategy that doesn’t use it. Sure, it’s possible to implement it “the slow way” but any replication strategy using it is going to assume it’s fast and would prefer to not use it at all if it’s going to be slow.

mikeal commented Jul 23, 2020

If it's outwards I think I'm starting to understand the motivation.

Outwards. The inbound links are primarily there for GC and aren’t strictly necessary for this; it’s just that, if you’re doing the indexing, you might as well do both.

mikeal commented Jul 23, 2020

I'm inclined to think that replication is a system layer API not the application layer.

It was actually tough for me to come around on this as well. Conceptually, you should be able to do replication at a low level without a lot of information because of the hash-linked primitives we have. And to some extent you can: at least you can build more efficient replication than if you didn’t have those primitives. But to do something very fast you just need more information.

With a little more information about the state of the store we can get pretty fast; that’s what this is. It is most useful for cases like DagDB, where we expect a fair amount of de-duplication and where, without just diffing the tree, there is little predictability about which paths a mutation affects. Blockchains don’t have either of those problems, which is why we didn’t need any of this for Graphsync.

Gozala commented Jul 23, 2020

Off topic: not relevant to the API discussion, but it is related to the discussion we're having, so I want to bring it up, as I suspect @mikeal may have considered this and I would love to learn his conclusions.

Blockchains don’t have either of those problems, which is why we didn’t need any of this for Graphsync.

The way I have been thinking about graph replication, first with IPDF and then with Textile Threads, is: you could have an arbitrarily shaped graph alongside a git reflog style chain per replica. Assuming each replica maintains reflogs of every participant, syncing effectively becomes fairly straightforward: replicas just need to exchange the added entries in their reflogs and then pack up the blocks the other doesn't have. This is conceptually similar to how op-based CRDTs work.

Replicas will likely want to maintain indexes so "block packs" can be computed on queries, but those become an internal optimization detail. I should also point out that even though there are multiple "reflogs" at play, they can be projected as a single log with a partial order.

If you do allow editing then you wind up with a full-blown CRDT.

mikeal commented Jul 23, 2020

Assuming each replica maintains reflogs of every participant, syncing effectively becomes fairly straightforward: replicas just need to exchange the added entries in their reflogs and then pack up the blocks the other doesn't have. This is conceptually similar to how op-based CRDTs work.

Not quite!

You’re forgetting about de-duplication. When data is inserted it updates the log, yes, but if it references other data from prior threads you can’t just say “give me the whole graph for the changes” because you may already have a lot of that graph in cache if it references threads you had already replicated.

That’s why this doesn’t end up looking like straightforward log replication. Once you have the list of logs you need to merge, you still need to work your way down the graph requesting only new data, otherwise you could end up requesting very large amounts of data you already have in your local store.

Gozala commented Jul 24, 2020

You’re forgetting about de-duplication. When data is inserted it updates the log, yes, but if it references other data from prior threads you can’t just say “give me the whole graph for the changes” because you may already have a lot of that graph in cache if it references threads you had already replicated.

No, I was not forgetting about it. What I was trying to say is that once two replicas exchange reflog updates, each has enough information to pack up all the blocks the other does not have, because it has all that info locally. And yes, to do that efficiently replicas would need to maintain an index, but the point is that this becomes an implementation detail that doesn't need a public API.

As for indexing, on each local write a replica can, with very little overhead, create a flat list of new blocks and have the new head link to it. This way a replica asked for updates just needs to walk from the head to the last known head and concat all the new block lists together (see the sketch below). In practice it's a bit more complicated than that, because there are more than two participants, but if you can project a merged reflog (which is more or less what Textile Threads do) the same idea applies.
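
A toy TypeScript sketch of that walk, under stated assumptions: the reflog entry shape and the loadEntry helper are hypothetical, just enough to show the idea.

```ts
// Hypothetical reflog entry: links to the previous entry and carries the
// flat list of blocks added by that write.
interface ReflogEntry {
  prev: string | null   // CID of the previous reflog entry, null at the start
  newBlocks: string[]   // CIDs of blocks added by this write
}

type LoadEntry = (cid: string) => Promise<ReflogEntry>

// Walk from `head` back to `lastKnownHead`, concatenating the per-write block
// lists; the result is everything the other replica is missing.
async function blocksSince(
  head: string,
  lastKnownHead: string | null,
  loadEntry: LoadEntry
): Promise<string[]> {
  const missing: string[] = []
  let cursor: string | null = head
  while (cursor !== null && cursor !== lastKnownHead) {
    const entry = await loadEntry(cursor)
    missing.push(...entry.newBlocks)
    cursor = entry.prev
  }
  return missing
}
```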

mikeal commented Jul 24, 2020

It needs to be a public API for the “push” case but sure, for “pull” it doesn’t need to be public.

Gozala commented Oct 11, 2020

It needs to be a public API for the “push” case but sure, for “pull” it doesn’t need to be public.

In fact, in the described model there is no distinction between push and pull: replicas wishing to sync exchange heads. From there each side can derive locally what it has that the other doesn't, pack it all up and send it. Once both do that, they're in sync.
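
To make the symmetry concrete, a small hedged sketch reusing the hypothetical ReflogEntry / LoadEntry / blocksSince() definitions from the earlier sketch; both replicas run the same routine with the roles swapped, so there is no push/pull distinction.

```ts
// Hypothetical symmetric sync step: derive the pack locally from the
// exchanged heads and send it. The remote does exactly the same in the
// other direction.
async function syncWith(
  localHead: string,
  remoteHead: string | null,
  loadEntry: LoadEntry,
  sendPack: (cids: string[]) => Promise<void>
): Promise<void> {
  const pack = await blocksSince(localHead, remoteHead, loadEntry)
  await sendPack(pack)
}
```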
