Proposing some tooling for datasets (ipfs-pack and stuff) #205
Comments
Question about BagIt: how does it do hashes of directories / the tree? I didn't see that when looking at the spec, but I may just have missed it.
the db thing would be super useful for large things, it could just be a local
@edsu can we entice you to comment? Your input might make all the difference.
Lol. "ipfs-pack your bags" -- that's very clever, @jbenet.
@jbenet it's true, empty directories are not present in the BagIt manifest. In practice, some folks who have wanted to preserve the presence of empty directories have created an empty .keep file in the directory, or documented the directories' presence in the bag-info.txt.
I'm really interested to learn more about what the IPLD representation would look like. I am definitely not up on all the IPFS features/functionality. Would the putDescriptor allow people to add metadata about their packages, such as a name for the dataset, who created it, etc.? My understanding is that IPFS is largely file oriented. Is it fair to say that this proposal adds the notion of sets of files and tools for working with them?
You are probably familiar with them already, but this makes me think of two other points of reference for work in this area that you might be interested in:
I suspect both would be interested in the work you are proposing.
@jbenet I don't know how familiar you are with the Frictionless Data stuff, and especially Data Package - http://specs.frictionlessdata.io/data-packages/? I know you commented a bit a couple of years ago on the Frictionless Data stuff when you were looking at data package managers, and we discussed it at some length in London last year. In general, I'd say that if you are looking for a simple structure on disk for describing a "package" of data, it would be a good fit.
I should say I did not immediately grok what exactly you are up to here from the description above, e.g. what is an .ipfs repository (and how does it relate to the overall design of ipfs)?
Another important reference: pairtrees -- these are commonly used in the archives space. They're the tool that many digital preservation teams reach for first when they want to preserve a lot of bits.
(redirected from ipfs-inactive/archives#96 (comment)) If I read the draft proposal correctly:
I also made some sample packaged files, each of which contains a build script and an ipld-based file list for the manifest (where details like the ones in json-schema can be added, e.g. mime-type). Ref: ipfs-inactive/archives#101 (where I assume the default handler for software packages is gx). I have made sure to find the generic common ground among the metadata fields and specs from dat, frictionlessdata, and json-schema (I have yet to fold in bagit, .torrent files, warc, and pairtree -- imo it is clearest if there is a comparison matrix table among all of these standards).
The messaging and the name IPFS mislead people into thinking IPFS is just for files. It's actually a content-addressed protocol for distributing Merkle DAGs, which you can use to represent anything. That's why IPLD is so important -- it gives us a basic structure for representing any data structure as a DAG that can be written directly to IPFS and addressed using IPLD paths.
ipfs-pack is a move in that direction. For the first pass, ipfs-pack will help us support the use case where users Use Manifest Files to Track Directory Structure & Contents, which allows us to Track a Directory and Serve it on IPFS without making duplicate local copies of the data. This will eventually allow us to Round-trip whole directories through IPFS and Mount directories by auto-detecting their ipfs-pack manifests or prebuilt object databases. That gives a very strong starting point for using IPLD to properly represent sets of files, but there will certainly be more work to establish the best metadata patterns.
The draft proposal was made after ipld's existence, but in the example, in the content of the
@flyingzumwalt have you taken a look at the Data Package specs -- it seems the basic Data Package could act as a reasonable match for the manifest file here.
This proposes some tooling for large datasets. Warning! As soon as I wrote it, I already want to change it. In particular, I want to change the db thing to just be a normal ipfs repo; it would help with the serving, too. We just need to land making ipfs repos super fast with swappable datastores (right now we can't quite do that).
Proposal posted at https://gist.github.com/jbenet/deda429fae2e5af9a86a01b0cbb614f7 and reproduced below for those getting email. I will update it with the db -> repo thoughts, and update the gist and the comment below. I will comment when I update it so people get a notification, at least.
IPFS Tooling for datasets
Background
We need some tooling for a certain set of use cases around archival and dataset management. This tooling should fit how people actually work with large files and large datasets.
Grounding Assumptions
Basic grounding assumptions here:
Why current IPFS tooling is not enough
The current ipfs tooling assumes we can import all data into a .ipfs repository directory. There are ongoing efforts to build filestore to allow referencing content outside of that directory, but this is not yet finalized, and all metadata is stored in the .ipfs repository, not with the directory in question.
We have often discussed Certified ARchives (.car) as a replacement for tar. This could be a future replacement, along with a reliable way to mount the .cars, but this is not yet here either.
Other tooling examples
.torrent file
Tools for archiving websites:
Proposed Tooling Additions
This document proposes the addition or adjustment of the following tools:
dagger/dagify (or whatever is decided here) - a standalone tool that reads in a file or directory and outputs an (in-order) ipld graph, according to a given format string.
ipfs-pack - a standalone tool that creates an "ipfs pack" (similar to WARCs, BagIt, and .torrent files, but with IPLD and importers magic).
datadex or maybe gx-dataset - a tool to prepare and publish a dataset (as an ipfs-pack; guides the user to add dataset metadata and license info, and publishes to a registry).
car (still only a proposed tool) - creates certified archives (single-file hash-linked archives, like a hash-linked .tar); will work closely with ipfs-pack.
ipfs repo filestore - its abstractions can leverage ipfs-packs to understand what is being tracked.
dagger/dagify
This tool (name discussion here) reads in a file or directory and outputs an (in-order) ipld graph, according to a given format string.
Where <fmt-string> is a format string that uniquely determines (forever) the whole dag structure, including chunking scheme, index layout, what is tracked in the index, what is left as raw nodes, etc. The idea is that this string (which ideally will be short) can uniquely describe a strategy for representing the source content as the output ipld graph, and that it can do so repeatably. Meaning: once a given fmt string produces one output, it should never change (lest there is a major bug). This is because people must retain the ability to verify their content, and they need some primitive to do so.
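To pin down what "repeatable" means here, a minimal property-style sketch in Go. Everything in it is hypothetical: BuildDAG stands in for whatever programmatic entry point dagger/dagify ends up exposing, and "<fmt-string>" is a placeholder, not real syntax.

```go
// dag_repeatability_test.go -- hypothetical sketch, not a real API.
// BuildDAG stands in for whatever dagger/dagify exposes programmatically,
// returning the root CID of the IPLD graph built from path per fmtStr.
package dagger

import "testing"

func BuildDAG(fmtStr, path string) (rootCID string, err error) {
	// placeholder: the real tool would chunk the input, build the dag
	// according to fmtStr, and return the root cid
	return "", nil
}

// The property every fmt string must guarantee: same fmt string + same
// content => same graph (and so same root cid), forever. Otherwise users
// lose the ability to re-verify content they hashed in the past.
func TestFmtStringIsRepeatable(t *testing.T) {
	first, err := BuildDAG("<fmt-string>", "./dataset")
	if err != nil {
		t.Fatal(err)
	}
	second, err := BuildDAG("<fmt-string>", "./dataset")
	if err != nil {
		t.Fatal(err)
	}
	if first != second {
		t.Fatalf("fmt string is not repeatable: %q vs %q", first, second)
	}
}
```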
dagger/dagify --only-cid --only-root
This tool will have an --only-cid flag that outputs only the cids, and an --only-root flag that returns only the last (root) object or cid.
ipfs-pack filesystem packing tool
The idea is that ipfs-pack is a filesystem packing tool that establishes the notion of a bundle, bag, or "pack" of files. We use pack to avoid confusing it with a Bag from BagIt, a very similar format (that ipfs-pack is compatible with). The way "packs" work is this:
A pack is rooted at a directory (<path-to-pack-root>/). It contains all the pack contents and represents the pack in a filesystem.
A pack is described by a manifest file (<pack-root>/PackManifest).
Subcommands
Usage Example
ipfs-pack make - create (or update) a pack manifest
This command creates (or updates) the pack's manifest file.
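The PackManifest format itself is not pinned down above. Purely as an illustration of the kind of work make would do, here is a hedged Go sketch that assumes a simple one-line-per-file manifest and uses sha256 as a stand-in for the IPLD CIDs that dagger/dagify would actually produce:

```go
// Illustrative only: assumes a "<hash> <relative-path>" line per file and
// uses sha256 in place of real IPLD CIDs. The actual PackManifest format
// is up to the proposal/implementation.
package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"io/fs"
	"os"
	"path/filepath"
)

func main() {
	root := "." // <path-to-pack-root>/
	out, err := os.Create(filepath.Join(root, "PackManifest"))
	if err != nil {
		panic(err)
	}
	defer out.Close()

	err = filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		rel, _ := filepath.Rel(root, path)
		if d.IsDir() {
			if rel == ".ipfs-pack" { // skip the pack's own local db/repo
				return filepath.SkipDir
			}
			return nil
		}
		if rel == "PackManifest" { // don't list the manifest itself
			return nil
		}
		f, err := os.Open(path)
		if err != nil {
			return err
		}
		defer f.Close()
		h := sha256.New()
		if _, err := io.Copy(h, f); err != nil {
			return err
		}
		_, err = fmt.Fprintf(out, "%x %s\n", h.Sum(nil), filepath.ToSlash(rel))
		return err
	})
	if err != nil {
		panic(err)
	}
}
```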
ipfs-pack verify - checks whether a pack matches its manifest
This command checks whether a pack matches its PackManifest.
ipfs-pack db - creates (or updates) a temporary ipfs object database
This command creates (or updates) a temporary ipfs object database (e.g. at .ipfs-pack/db). This object database contains positional metadata for all IPLD objects contained in the pack. (It follows the ipfs repo filestore metadata concerns.) It MAY be a different, simpler object-db format, or be a full-fledged ipfs node repo using filestore.
The db is a simple key-value store that supports:
{ <ipld-cid> : <filestore-descriptor> }
list() []<ipld-cid> - to show all cids in the db
put(<ipld-object>) <ipld-cid>
get(<ipld-cid>) <ipld-object>
putDescriptor(<ipld-cid>, <filestore-descriptor>)
getDescriptor(<ipld-cid>) <filestore-descriptor>
delete() - to remove itself from disk
Notes:
<filestore-descriptor> is the metadata necessary to reconstruct the entire object from data in the pack.
{get,put} should be able to add or retrieve the objects from the db or from the data in the pack.
{get,put}Descriptor should be able to add or retrieve file descriptors for objects stored in the pack.
This database basically implements:
And does so either through a programmatic interface (some go package) or via cli tooling.
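A minimal Go sketch of what that programmatic interface could look like (the type and field names are assumptions for illustration, not part of the proposal):

```go
// Hypothetical sketch of the pack db interface described above; names and
// field layout are illustrative only.
package packdb

// CID is an IPLD content identifier in string form.
type CID string

// FilestoreDescriptor is the metadata necessary to reconstruct an object
// from the data laid out in the pack (assumed here to be path + offset + length).
type FilestoreDescriptor struct {
	Path   string // relative to <path-to-pack-root>/
	Offset uint64 // byte offset of the block within the file
	Length uint64 // byte length of the block
}

// DB is the { <ipld-cid> : <filestore-descriptor> } key-value store.
type DB interface {
	List() ([]CID, error)                               // all cids in the db
	Put(object []byte) (CID, error)                     // add an ipld object
	Get(cid CID) ([]byte, error)                        // retrieve an object, from the db or from pack data
	PutDescriptor(cid CID, d FilestoreDescriptor) error // record where the object lives in the pack
	GetDescriptor(cid CID) (FilestoreDescriptor, error)
	Delete() error                                      // remove the db from disk
}
```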
ipfs-pack serve - starts an ipfs node serving the pack's contents (to IPFS and/or HTTP)
This command starts an ipfs node serving the pack's contents (to IPFS and/or HTTP). It MAY require a full go-ipfs installation to exist. It MAY be a standalone binary (ipfs-pack-serve). It MUST use an ephemeral node, or a one-off node whose id would be stored locally in the pack, at <pack-root>/.ipfs-pack/repo.
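To show why the positional descriptors are enough to serve a pack without copying its data, here is a rough HTTP-side sketch in Go. The route, types, and lookup function are all hypothetical; a real serve command would speak bitswap/IPFS as well.

```go
// Hypothetical sketch: serve raw block bytes straight out of the pack files,
// using descriptors like the ones kept in .ipfs-pack/db. Not a real ipfs-pack API.
package main

import (
	"io"
	"net/http"
	"os"
	"path/filepath"
)

// descriptor mirrors the <filestore-descriptor> idea (assumed fields).
type descriptor struct {
	Path   string
	Offset uint64
	Length uint64
}

// blockHandler answers /block?cid=... by seeking into the pack file named in
// the descriptor and copying exactly that block's bytes to the client.
func blockHandler(packRoot string, lookup func(cid string) (descriptor, bool)) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		d, ok := lookup(r.URL.Query().Get("cid"))
		if !ok {
			http.Error(w, "unknown cid", http.StatusNotFound)
			return
		}
		f, err := os.Open(filepath.Join(packRoot, d.Path))
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		defer f.Close()
		if _, err := f.Seek(int64(d.Offset), io.SeekStart); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		_, _ = io.CopyN(w, f, int64(d.Length))
	}
}

func main() {
	// single hard-coded entry standing in for a getDescriptor lookup on the pack db
	lookup := func(cid string) (descriptor, bool) {
		if cid == "zExampleCID" {
			return descriptor{Path: "data.bin", Offset: 0, Length: 1024}, true
		}
		return descriptor{}, false
	}
	http.Handle("/block", blockHandler(".", lookup))
	_ = http.ListenAndServe(":8080", nil)
}
```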
ipfs-pack bag - convert to and from BagIt (spec-compliant) bags
This command converts between packs and BagIt (spec-compliant) bags, a commonly used archiving format very similar to ipfs-pack. It works like this:
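The original gist's example of the conversion isn't reproduced above. Purely to illustrate the BagIt side of such a round trip (a bag is a directory with a bagit.txt declaration, a data/ payload directory, and a payload manifest such as manifest-sha256.txt), here is a hedged Go sketch; the pack-specific details are assumptions:

```go
// Illustrative only: this is not how ipfs-pack bag is specified to work; it
// just shows a pack-to-BagIt conversion using the standard bag layout
// (bagit.txt, data/ payload, manifest-sha256.txt).
package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"io/fs"
	"os"
	"path/filepath"
)

func packToBag(packRoot, bagRoot string) error {
	if err := os.MkdirAll(filepath.Join(bagRoot, "data"), 0o755); err != nil {
		return err
	}
	// bag declaration required by the BagIt spec
	decl := "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n"
	if err := os.WriteFile(filepath.Join(bagRoot, "bagit.txt"), []byte(decl), 0o644); err != nil {
		return err
	}
	manifest, err := os.Create(filepath.Join(bagRoot, "manifest-sha256.txt"))
	if err != nil {
		return err
	}
	defer manifest.Close()

	return filepath.WalkDir(packRoot, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		rel, _ := filepath.Rel(packRoot, path)
		if d.IsDir() {
			if rel == ".ipfs-pack" { // pack-local metadata stays out of the payload
				return filepath.SkipDir
			}
			return nil
		}
		if rel == "PackManifest" { // could instead be carried along as a tag file
			return nil
		}
		src, err := os.Open(path)
		if err != nil {
			return err
		}
		defer src.Close()
		dstPath := filepath.Join(bagRoot, "data", rel)
		if err := os.MkdirAll(filepath.Dir(dstPath), 0o755); err != nil {
			return err
		}
		dst, err := os.Create(dstPath)
		if err != nil {
			return err
		}
		defer dst.Close()
		// copy the payload file into data/ and hash it in one pass
		h := sha256.New()
		if _, err := io.Copy(io.MultiWriter(dst, h), src); err != nil {
			return err
		}
		_, err = fmt.Fprintf(manifest, "%x  data/%s\n", h.Sum(nil), filepath.ToSlash(rel))
		return err
	})
}

func main() {
	if err := packToBag(".", "../my-bag"); err != nil {
		panic(err)
	}
}
```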
ipfs-pack car - convert to and from a car (certified archive)
This command converts between packs and cars (certified archives). It works like this:
datadex or maybe gx-dataset
WIP
A tool to prepare and publish a dataset (as an ipfs-pack; guides the user to add dataset metadata and license info, and publishes to a registry).
car - certified archives
WIP
cars would interop with packs.
The ipfs repo filestore
WIP
Maybe the ipfs repo filestore abstractions can leverage ipfs-packs to understand what is being tracked in a given directory, particularly if those packs have up-to-date local dbs of all their objects.