Skip to content
This repository has been archived by the owner on Apr 29, 2020. It is now read-only.

API review #7

Closed
achingbrain opened this issue Nov 29, 2018 · 22 comments
Closed

API review #7

achingbrain opened this issue Nov 29, 2018 · 22 comments

Comments

@achingbrain
Copy link
Collaborator

achingbrain commented Nov 29, 2018

Proposal

To change the return type and codify the behaviour of this module.

Current API

const exporter = require('ipfs-unixfs-exporter')
const pull = require('pull-stream')
const through = require('pull-stream/throughs/through')
const onEnd = require('pull-stream/sinks/on-end')

pull(
  exporter('QmFoo.../bar/baz.txt', ipld)
  through(file => {
    console.info(file)

  // {
  //   depth: 2,
  //   name: 'baz.txt',
  //   path: 'QmFoo.../bar/baz.txt',
  //   size: ...
  //   hash: Buffer
  //   content: <Pull stream>
  //   type: 'file'
  // }

  }),
  onEnd(...)
)

Observations

  1. If the user wants any extra information about the node, they must turn the hash into a CID then pass it to an IPLD resolver, but we already had the CID and DAGNodes during the graph traversal so it's just unnecessary work
  2. Using external libraries dictates how the user should structure their code and their choice of dependencies. They came here for the files, just give them the files

Proposal

  1. Emit entries with object versions of CIDs and nodes
  2. Use core JS types, e.g. async iterators
const exporter = require('ipfs-unixfs-exporter')

for await (const entry of exporter('QmFoo.../bar/baz.txt', ipld)) {
  console.info(entry)

  // {
  //   depth: 2,
  //   name: 'baz.txt',
  //   path: 'QmFoo.../bar/baz.txt',
  //   cid: CID
  //   node: DAGNode (cid.codec === 'dag-pb') ||  Object (cid.codec === 'dag-cbor') || Buffer (cid.codec === 'raw')
  //   content: Function // returns an async iterable of file or directory contents
  //   type: 'file' // or 'directory'
  // }

  for await (const content of entry.content()) {
    // content is either a buffer (when entry.type === 'file') or an entry (when entry.type === 'directory')
  }
}

Options

{
  offset: Number,  // If a file is at the end of the path `.content()` will start at this offset
  length: Number, // If a file is at the end of the path `.content()` will only return this many bytes
  fullPath: Boolean, // Emit entries for every path component
  maxDepth: Boolean // Only descend this number of components into the path
}

These would remain the same.

Sharding

Sharded directories are treated like normal directories. That is, they have a type of directory.

Behaviour

When requesting a path: /ipfs/Qmfoo/bar/baz, if baz resolves to a directory, the contents of the directory will be emitted, otherwise baz itself will be emitted.

If maxDepth is specified and matches the zero-indexed number of path components (excluding /ipfs), baz will be emitted even if it is a directory.

If fullPath is specified, each path component encountered will be emitted (excluding /ipfs).

Examples

When baz is a directory

exporter('/ipfs/Qmfoo/bar/baz')

// emits the files contained in `baz`
exporter('/ipfs/Qmfoo/bar/baz', {
  fullPath: true
})

// emits `Qmfoo`, `bar`, `baz`, then the files contained in `baz`
exporter('/ipfs/Qmfoo/bar/baz', {
  maxDepth: 2
})

// emits`baz`
exporter('/ipfs/Qmfoo/bar/baz', {
  fullPath: true,
  maxDepth: 2
})

// emits `Qmfoo`, `bar`, `baz`

When baz is a file

exporter('/ipfs/Qmfoo/bar/baz')

// emits `baz`
exporter('/ipfs/Qmfoo/bar/baz', {
  fullPath: true
})

// emits `Qmfoo`, `bar`, `baz`
exporter('/ipfs/Qmfoo/bar/baz', {
  maxDepth: 2
})

// emits`baz`
exporter('/ipfs/Qmfoo/bar/baz', {
  fullPath: true,
  maxDepth: 2
})

// emits `Qmfoo`, `bar`, `baz`
@achingbrain
Copy link
Collaborator Author

cc @ipfs/javascript-team

@alanshaw
Copy link
Contributor

When entry is a directory, does content() get the files in the directory?

@achingbrain
Copy link
Collaborator Author

I haven't come across a situation where I needed that, but I can see how it'd be useful, will add that in.

@achingbrain
Copy link
Collaborator Author

achingbrain commented Nov 29, 2018

offset and length in that case might mean slices of files?

Maybe they should be passed into the content() function?

E.g.

File:

const exporter = require('ipfs-unixfs-exporter')

for await (const entry of exporter('QmFoo.../bar/baz.txt', ipld)) {
  for await (const content of entry.content({
    offset: 0,
    length: 10
  })) {
    // content is a buffer with the first 10 bytes of `baz.txt`
  }
}

Dir:

const exporter = require('ipfs-unixfs-exporter')

for await (const entry of exporter('QmFoo.../bar', ipld)) {
  for await (const entry of entry.content({
    offset: 0,
    length: 10
  })) {
    // entry will be one of (up to) the first 10 files in the directory 
  }
}

@vmx
Copy link
Contributor

vmx commented Nov 29, 2018

When baz is a directory

What happens if the maxDepth is greater than the depth?

exporter('/ipfs/Qmfoo/bar/baz', {
  maxDepth: 3
})

Will it "emit baz" or "the files contained in baz"?

If it "emit baz"s, then you could use maxDepth: Infinity to always "emit baz".

@lidel
Copy link
Contributor

lidel commented Nov 30, 2018

Sharded directories are treated like normal directories. That is, they have a type of directory.

Just to ensure I got this right: the idea is to hide HAMT structures and always expose flattened one, indistinguishable from regular directory, right? ("They came here for the files, just give them the files")

@achingbrain
Copy link
Collaborator Author

What happens if the maxDepth is greater than the depth?

The intention behind maxDepth is to allow halting the graph traversal early. My thinking is if maxDepth is greater than the possible depth, it'll just do what it would have done if you hadn't specified maxDepth, which is the same as the current behaviour. From your example it would emit the files contained in baz.

The idea is to hide HAMT structures and always expose flattened one, indistinguishable from regular directory, right?

Yes, I just meant that the user wouldn't have to do anything different to extract files from a sharded directory than from a normal one, similar to what it does now, see dir-hamt-sharded.js and dir-flat.js.

If people think it's useful we can include a sharded flag or similar, alternatively you can do ipfs.files.stat(path) and it'll come back with a type of hamt-sharded-directory (works for both MFS and IPFS paths).

@achingbrain
Copy link
Collaborator Author

Actually, you also have the node property, which if it's a dag-pb node, the data from which can be passed through UnixFS.unmarshal(buf) and it'll tell you if it's a sharded directory or not.

I guess we could include the unmarshaled UnixFS entry too, since we've unmarshaled it during our traversal already?

@achingbrain
Copy link
Collaborator Author

I guess actually if we include the unmarshaled UnixFS entry we can omit some of the other fields like type and size 🤔

@alanshaw
Copy link
Contributor

Is maxDepth necessary? You can just break out of the loop after the nth iteration right?

@achingbrain
Copy link
Collaborator Author

achingbrain commented Nov 30, 2018

Good point. Thinking a bit more about this:

FSEntry

{
  depth: 2,
  name: 'baz.txt',
  path: 'QmFoo.../bar/baz.txt',
  cid: CID
  node: DAGNode (cid.codec === 'dag-pb') ||  Object (cid.codec === 'dag-cbor') || Buffer (cid.codec === 'raw')
  content: Function // returns an async iterable of file or directory contents
  entry: UnixFS // or null if cid.codec === 'raw' https://github.com/ipfs/js-ipfs-unixfs
}

API

Export a node

const exporter = require('ipfs-unixfs-exporter')
const result = await exporter('/ipfs/Qmfoo/bar', ipld)

console.info(`Exported a ${result.entry.type}`)

for await (const content of result.content({ offset, length })) {
  // if `entry` was a file, `content` is a buffer, if it was a directory, `content` is an FSEntry
}

Export all nodes on the path to a node

const exporter = require('ipfs-unixfs-exporter')

for await (const fsEntry of exporter.path('/ipfs/Qmfoo/bar', ipld)) {
  // ...
}

🤖 Future tech zone

Wildcards?

const exporter = require('ipfs-unixfs-exporter')

for await (const fsEntry of exporter('/ipfs/Qmfoo/bar/**/*.js', ipld)) {
  // ...
}

Search?

const exporter = require('ipfs-unixfs-exporter')

for await (const fsEntry of exporter('/ipfs/Qmfoo/bar', {
  query: 'some search string'
}, ipld)) {
  // ... can be combined with globs, above
}

@vmx
Copy link
Contributor

vmx commented Nov 30, 2018

I like symmetric/consistent APIs. Having

const result = await exporter('/ipfs/Qmfoo/bar', ipld)
for await (const content of result.content({ offset, length })) {}

but on the other hand

for await (const fsEntry of exporter.path('/ipfs/Qmfoo/bar', ipld)) {}

seems strange. Without thinking about how to implement it, I'd rather do either

const result = await exporter('/ipfs/Qmfoo/bar', ipld)
for await (const content of result.content({ offset, length })) {}
for await (const fsEntry of result.path() {}

or

for await (const content of exporter.content('/ipfs/Qmfoo/bar', ipld, { offset, length })) {}
for await (const fsEntry of exporter.path('/ipfs/Qmfoo/bar', ipld)) {}

or something similar. I guess you got the point.

@achingbrain
Copy link
Collaborator Author

I think the second option is ok:

for await (const content of exporter.content('/ipfs/Qmfoo/bar', ipld, { offset, length })) {}
for await (const fsEntry of exporter.path('/ipfs/Qmfoo/bar', ipld)) {}

This will be expensive as you'd have to do the traversal twice, once to return the initial result and once to return the path, or you store it in memory as you go which doesn't sound great either:

const result = await exporter('/ipfs/Qmfoo/bar', ipld)
for await (const content of result.content({ offset, length })) {}
for await (const fsEntry of result.path() {}

I think you'd end up doing both:

const result = await exporter('/ipfs/Qmfoo/bar', ipld) // allows inspection of `result` properties
for await (const content of result.content({ offset, length })) {}

// shortcut for the above, just give me the content!
for await (const content of exporter.content('/ipfs/Qmfoo/bar', ipld, { offset, length })) {}

// also this, i want to know how i got to '/ipfs/Qmfoo/bar'
for await (const fsEntry of result.path('/ipfs/Qmfoo/bar', ipld) {}

@vmx
Copy link
Contributor

vmx commented Nov 30, 2018

This will be expensive as you'd have to do the traversal twice, once to return the initial result and once to return the path, or you store it in memory as you go which doesn't sound great either:

That's not what I meant. This shouldn't be read as one program, but as two cases. I want either the content or the path.

// also this, i want to know how i got to '/ipfs/Qmfoo/bar'
for await (const fsEntry of result.path('/ipfs/Qmfoo/bar', ipld) {}

Wouldn't this be just result.path()?

@alanshaw
Copy link
Contributor

FSEntry

{
  depth: 2,
  name: 'baz.txt',
  path: 'QmFoo.../bar/baz.txt',
  cid: CID
  node: DAGNode (cid.codec === 'dag-pb') ||  Object (cid.codec === 'dag-cbor') || Buffer (cid.codec === 'raw')
  content: Function // returns an async iterable of file or directory contents
  entry: UnixFS // or null if cid.codec === 'raw' https://github.com/ipfs/js-ipfs-unixfs
}

API

Export a node

const exporter = require('ipfs-unixfs-exporter')
const result = await exporter('/ipfs/Qmfoo/bar', ipld)

console.info(`Exported a ${result.entry.type}`)

for await (const content of result.content({ offset, length })) {
  // if `entry` was a file, `content` is a buffer, if it was a directory, `content` is an FSEntry
}

Export all nodes on the path to a node

const exporter = require('ipfs-unixfs-exporter')

for await (const fsEntry of exporter.path('/ipfs/Qmfoo/bar', ipld)) {
  // ...
}

I really like all of this.

Would the exporter.content function provide any significant performance benefit (or any other benefit) other than saving 1 line of code? If not I'd probably drop it.

My only worry is your future tech section, where exporter() is returning an iterable not a promise - how would we handle that in the future? Right now you'll always get one entry, so it makes sense to return a promise not an iterable.

@alanshaw
Copy link
Contributor

@vmx

Wouldn't this be just result.path()?

Yeah I think so I'm assuming that's a copy paste error. However Alex's point on it being expensive still stands.

@achingbrain
Copy link
Collaborator Author

This shouldn't be read as one program, but as two cases
Wouldn't this be just result.path()?

Yes, I also meant that as two cases and not one program. I think we're on the same page here 😄

@vmx
Copy link
Contributor

vmx commented Nov 30, 2018

I don't understand the expensiveness problem of

for await (const content of result.content({ offset, length })) {}
for await (const fsEntry of result.path() {}

Couldn't the exporter be lazy and just initialise without doing any actual work?

@achingbrain
Copy link
Collaborator Author

Calling .content() on a FSEntry is cheap as we've already resolved the FSEntry's CID during the initial traversal and use that to export the content.

Calling .path() on an FSEntry will mean either making the traversal again or caching it in memory just in case the user calls .path().

I think if we put .path() on FSEntry it encourages the user to call it in a loop which may result in surprise when their program does not perform as expected.

For that reason I think we should stick .path() on the exporter root only:

for await (const fsEntry of exporter.path('/ipfs/Qmfoo/bar', ipld)) {}

They can obviously still call it in a loop, but it's not a natural thing to do as it doesn't get given to you by an iterable.

@achingbrain
Copy link
Collaborator Author

Would the exporter.content function provide any significant performance benefit?

No, it's just syntactic sugar.

My only worry is your future tech section, where exporter() is returning an iterable not a promise - how would we handle that in the future?

I was thinking to look at the string and return a promise that resolved to an iterable if a wildcard character was seen or a search query was passed.

We might end up having them as proper methods on the exporter root instead. The other solution would be making it be an iterator all the time which is kind of what we have now (in that it's always a pull stream) which doesn't feel very ergonomic.

Anyway, it's all future tech, we can solve that problem on a separate issue.

@vmx
Copy link
Contributor

vmx commented Nov 30, 2018

@achingbrain thanks for making things more explicit. At #7 (comment) I meant that exporter('/ipfs/Qmfoo/bar', ipld) wouldn't return an FSEntry, but some "Exporter object". You would then call either content() or path() on it.

@achingbrain
Copy link
Collaborator Author

This has been implemented.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants