Manifest Format #321

casey · 2020-04-04T12:30:19Z

Many features are gated on the basic design of the Intermodal manifest. So let's get started on it right away.

Desiderata

Allow integrity checking. Given a manifest, an accompanying release can be checked for integrity using the manifest. This will require including secure hashes of accompanying files in the manifest.
Hashing the manifest should give a secure hash that uniquely identifies the contents of the release.
Multi-level manifest. A lower-level manifest should commit to the contents of a release. A higher-level manifest should commit to both the lower-level manifest and any files containing signatures over the lower-level manifest. It would be nice to have only one manifest, but since data can't sign itself or contain its own hash, it seems necessary to have at least a two-level manifest so that we can produce a hash that uniquely identifies a collection of files, as well as signatures and other commitments to that collection of files.

I'm thinking about calling the lower-level manifest the "content manifest" and the higher-level manifest the "bundle manifest". I'm definitely open to naming suggestions though. Other ideas are "file manifest" and "root manifest".
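
To make the two-level idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: BLAKE2b from Python's standard library stands in for whichever hash is chosen, sorted-key JSON stands in for the binary encoding, and the `content`/`extras` field names and the placeholder signature are hypothetical.

```python
import hashlib
import json

def digest(data: bytes) -> str:
    # BLAKE2b-256 stands in for whichever hash function is eventually chosen.
    return hashlib.blake2b(data, digest_size=32).hexdigest()

def encode(manifest: dict) -> bytes:
    # Sorted-key JSON stands in for the real encoding (CBOR, MessagePack, ...).
    return json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode()

def content_manifest(files: dict[str, bytes]) -> dict:
    # Lower level: commits only to the contents of the release.
    return {path: digest(data) for path, data in files.items()}

def bundle_manifest(content: dict, extras: dict[str, bytes]) -> dict:
    # Higher level: commits to the content manifest and to extra files,
    # such as signatures made over the content manifest's hash.
    return {
        "content": digest(encode(content)),
        "extras": {name: digest(data) for name, data in extras.items()},
    }

files = {"foo/bar": b"bar contents", "foo/baz": b"baz contents"}
content = content_manifest(files)
content_id = digest(encode(content))  # identifies the files alone

# Not a real signature, just bytes that depend on the content hash.
signature = b"placeholder signature over " + content_id.encode()
bundle = bundle_manifest(content, {"signatures": signature})
bundle_id = digest(encode(bundle))    # identifies files plus commitments
```

The point of the split is that `content_id` stays stable when signatures are added or changed, while `bundle_id` covers both.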

Why not use BitTorrent metainfo?

BitTorrent v1 uses SHA1, which is insecure.
BitTorrent v2 uses a custom tree hash that is vulnerable to attack if the length of the content is not included.
Bencoding is not a particularly popular encoding format.
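
On the tree hash point above, a rough illustration of why a bare tree hash wants the content length committed alongside it. This is a generic, naive Merkle tree written for this example, not BitTorrent v2's actual construction, and `naive_root` and `root_with_length` are hypothetical names.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def naive_root(leaves: list[bytes]) -> bytes:
    # Hash each leaf, then hash pairs of nodes until one remains.
    # Assumes a power-of-two number of leaves to keep the sketch short.
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def root_with_length(leaves: list[bytes]) -> bytes:
    # Binding the total content length into the final hash separates
    # inputs that differ only in where the leaf boundaries fall.
    total = sum(len(leaf) for leaf in leaves)
    return h(total.to_bytes(8, "big") + naive_root(leaves))

four_leaves = [b"a", b"b", b"c", b"d"]
# Two leaves whose bytes are the inner nodes of the four-leaf tree.
two_leaves = [h(b"a") + h(b"b"), h(b"c") + h(b"d")]

same_root = naive_root(four_leaves) == naive_root(two_leaves)
same_with_length = root_with_length(four_leaves) == root_with_length(two_leaves)
```

Without the length, two different contents share a root (`same_root` is true); with it, they no longer do.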

Why not use the web packaging format?

The web bundle format is a single-file format, so it would be impossible to use natively with BitTorrent, which is an important transport.

Out of Scope

To keep things simple, it would be a good idea to limit the scope of the initial manifest design as much as possible. Things we should keep in mind during the design, without working out the details yet:

Metadata. Structured metadata can be included in a file that the content manifest commits to.
Signatures, timestamps, and related functionality. A two-level manifest leaves open the ability to include files that are signatures over the hash of the content manifest, which are committed to by the bundle manifest.

In scope

Manifest format. I'm thinking either CBOR or messagepack. They are both lightweight, binary, schemaless formats with an object model that is similar to JSON. The web packaging format uses CBOR, so that's what I'm leaning towards. Keybase's saltpack, however, uses messagepack, so that's a contender too.
The hash function. Since manifests will have to include secure hashes, we should pick a hash function. I'm leaning towards BLAKE3, although someone could probably talk me out of it. BLAKE3 is extremely fast, supports random access and streaming verification, and has a strong Rust implementation, all of which are nice features. On the downside, it is very new, and uses a reduced-strength construction. However, that reduced strength is argued to be safe from attack both now and in the future.
Whether the manifest should be flat or a tree. A tree will be more compact when there are long directory names with many entries, but is more complex. A tree doesn't explicitly encode path separators, which I think is a bonus.
```
flat: {"foo/bar": "BAR_HASH", "foo/baz": "BAZ_HASH"}
tree: {"foo": {"bar": "BAR_HASH", "baz": "BAZ_HASH"}}
```
BitTorrent V2 uses a tree, so that's what I'm leaning towards.
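
The two shapes carry the same information, so converting between them is mechanical. A small sketch, assuming `/` as the separator in the flat form and hypothetical helper names:

```python
def flat_to_tree(flat: dict[str, str]) -> dict:
    # Split each path on "/" and nest one map per directory component.
    tree: dict = {}
    for path, file_hash in flat.items():
        *dirs, name = path.split("/")
        node = tree
        for directory in dirs:
            node = node.setdefault(directory, {})
        node[name] = file_hash
    return tree

def tree_to_flat(tree: dict, prefix: str = "") -> dict[str, str]:
    # Walk the nested maps, joining names with "/" to rebuild full paths.
    flat: dict[str, str] = {}
    for name, value in tree.items():
        path = f"{prefix}/{name}" if prefix else name
        if isinstance(value, dict):
            flat.update(tree_to_flat(value, path))
        else:
            flat[path] = value
    return flat

flat = {"foo/bar": "BAR_HASH", "foo/baz": "BAZ_HASH"}
tree = flat_to_tree(flat)
```

Note that the separator only shows up in the conversion functions, never in the tree itself, which is the bonus mentioned above.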

Where to put the manifest in a release. Since it seems likely that we'll eventually want multiple files, I'm thinking that putting everything into a subdirectory is a good idea, either imdl/ or intermodal/.

For example, if we go with CBOR, the structure could be:

intermodal/root.cbor     # root manifest
intermodal/content.cbor  # content manifest
intermodal/signatures    # signatures over hash of the content manifest
intermodal/timestamp.ots # OpenTimestamps timestamp
intermodal/metadata.cbor # metadata
intermodal/README.txt    # human readable info about intermodal
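
With a layout like this, integrity checking could look roughly like the sketch below. It assumes a tree-shaped manifest, that the `intermodal/` subdirectory itself is excluded from the content manifest, and BLAKE2b as a stand-in hash; `build_tree` and `verify` are hypothetical names.

```python
import hashlib
import tempfile
from pathlib import Path

MANIFEST_DIR = "intermodal"  # assumed name; "imdl" is the other candidate

def hash_file(path: Path) -> str:
    # Stream the file so large files are never fully loaded into memory.
    hasher = hashlib.blake2b(digest_size=32)
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            hasher.update(chunk)
    return hasher.hexdigest()

def build_tree(directory: Path, top_level: bool = True) -> dict:
    # Map names to hashes (files) or nested maps (directories),
    # skipping the manifest subdirectory at the root of the release.
    tree = {}
    for entry in sorted(directory.iterdir()):
        if top_level and entry.name == MANIFEST_DIR:
            continue
        if entry.is_dir():
            tree[entry.name] = build_tree(entry, top_level=False)
        else:
            tree[entry.name] = hash_file(entry)
    return tree

def verify(release: Path, manifest: dict) -> bool:
    # A release is intact if re-hashing it reproduces the manifest exactly.
    return build_tree(release) == manifest

with tempfile.TemporaryDirectory() as tmp:
    release = Path(tmp)
    (release / "foo").mkdir()
    (release / "foo" / "bar").write_bytes(b"bar contents")
    (release / MANIFEST_DIR).mkdir()
    (release / MANIFEST_DIR / "README.txt").write_text("info")

    manifest = build_tree(release)
    ok_before = verify(release, manifest)
    (release / "foo" / "bar").write_bytes(b"tampered")
    ok_after = verify(release, manifest)
```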

Postscript

A very weird but nonetheless interesting choice of format would be FIDL, Fuchsia's IPC system:

Modular compiler and language bindings.
Message encoding is canonical — there is exactly one encoding for a given message.
Supports both fixed-size messages where space is important, and extensible tables and unions where schema evolution is important.
Defined to be little endian and uses natural alignment, so no portability issues.
Zero copy and zero parse, by storing variable sized members out-of-line. Can be memory mapped with LMDB or mmap(2) for extremely fast access.
Certain types can be introspected without access to a schema, namely tables and unions.
Could make it fully introspectable where needed by including hash of schema as first element of messages.

Another less wacky choice would be flatbuffers. Flatbuffers also support zero copy and zero parse deserialization.

Misc

Would like to compare the encoding choices of flatbuffer and FIDL.
Would like to add some kind of link attribute, that would indirect through a hash.
Unsure of how to approach "canonical encoding". Forcing the format to output a buffer in a canonical format is inflexible. A more flexible approach would be to calculate hashes over logical traversals of the physical data, so the physical data may be in multiple forms, but hash to the same value, as long as the traversal doesn't change.
Flatbuffers binary format documentation
fleetfs flatbuffers schema
proc macro attribute parser
syn helper
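
To sketch the "hash over a logical traversal" idea from the canonical encoding note above: the hasher is fed a sequence of type-tagged, length-prefixed events produced by walking the data, so the physical bytes can vary as long as the walk is the same. The tags, the length prefixes, the sorted-key ordering, and SHA-256 are all assumptions made for this example.

```python
import hashlib
import json

def traverse(value, hasher) -> None:
    # Feed a type tag and a length before each item so that different
    # structures can never produce the same byte stream.
    if isinstance(value, dict):
        hasher.update(b"m" + len(value).to_bytes(8, "big"))
        for key in sorted(value):  # fixed order, independent of storage order
            traverse(key, hasher)
            traverse(value[key], hasher)
    elif isinstance(value, list):
        hasher.update(b"l" + len(value).to_bytes(8, "big"))
        for item in value:
            traverse(item, hasher)
    elif isinstance(value, str):
        data = value.encode("utf-8")
        hasher.update(b"s" + len(data).to_bytes(8, "big") + data)
    else:
        raise TypeError(f"unsupported type: {type(value).__name__}")

def logical_hash(value) -> str:
    hasher = hashlib.sha256()
    traverse(value, hasher)
    return hasher.hexdigest()

# Two different physical forms of the same logical data.
physical_a = '{"foo": {"bar": "BAR_HASH", "baz": "BAZ_HASH"}}'
physical_b = '{ "foo": {"baz": "BAZ_HASH",\n  "bar": "BAR_HASH"} }'

hash_a = logical_hash(json.loads(physical_a))
hash_b = logical_hash(json.loads(physical_b))
```

Here `hash_a` and `hash_b` are equal even though the input bytes differ, which is the flexibility a forced canonical buffer would not give.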


nijynot · 2020-04-04T14:49:22Z

Looks pretty good to me! Here are some of my initial thoughts.

On the manifest format, CBOR seems to have many options to choose from, while MessagePack is more straightforward. As they seem to achieve the same thing at the end of the day, maybe MessagePack would be a better choice since it's simpler? For what it's worth, MessagePack's implementations seem to be a little more up-to-date than CBOR's, from a quick glance. And given CBOR's number of options, I don't trust that all implementations have the same set of features implemented.

On the hash function, I think BLAKE3 would def make a lot more sense if its security is good enough. I haven't read the paper, but given BLAKE3's speed, it'd be a great fit for this manifest, as you'd probably hash a lot of data.

> BitTorrent V2 uses a tree, so that's what I'm leaning towards.

Yeah, tree is probably the way to go.

casey · 2020-04-04T14:56:17Z

Thanks for the feedback!

I definitely agree that going with something simpler is better. I should look at CBOR and see if any of the bells and whistles it provides make any sense for intermodal.

alethiophile · 2020-04-13T16:50:34Z

Is there a particular reason for preferring binary formats over text?

casey · 2020-04-14T00:02:13Z

I think a few reasons:

The files are likely to contain a lot of 256-bit binary hashes, which will be more compact in a binary format than in hex.
I'd like to eventually be able to embed binary files directly into manifests, and if those are large, hex/base64 encoding will mean a lot of overhead.
Some binary formats, like FIDL and flatbuffers, can be memory mapped directly, and then accessed with very low overhead. This is very hard to do well with text-based formats. I'd like to leave this option open for downstream applications, since some applications might need to deal with huge numbers of manifests, and this would speed those applications up enormously.
Eventually, intermodal might develop one or more protocols, for example a mainline DHT/tracker/data exchange protocol alternative. If the binary format used in the manifest can be used for the protocol, that would be a great bonus.
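
To put rough numbers on the size point in the first two reasons, here is the per-hash cost of each representation for a single 256-bit hash (random bytes stand in for a real digest):

```python
import base64
import os

raw = os.urandom(32)  # stands in for one 256-bit hash

sizes = {
    "binary": len(raw),
    "hex": len(raw.hex()),
    "base64": len(base64.b64encode(raw)),
}
# sizes == {"binary": 32, "hex": 64, "base64": 44}
```

So hex doubles the size of every hash and base64 adds about a third, before counting any quoting or delimiters a text format needs.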

Text formats definitely have the benefit of being human readable, and losing that is unfortunate. To make up for that, one of the things we could do is auto-generate a human readable readme or .nfo file that would contain all the information that would be useful for a human, and put that in the root of intermodal-created releases.

alethiophile · 2020-04-14T00:13:59Z

I think for the use case of a content manifest, which is designed as basically just a bunch of file hashes, binary is probably reasonable -- the space savings are nontrivial, and the human-readable information in this file is relatively limited anyway. This should probably also be as stupid a format as possible; it strikes me that something as simple as the text output from sha256sum would be functional here, and anything more complicated would need good returns on the extra complexity.

What use case do you envision for data files embedded in manifests? I'm somewhat confused as to what the benefit would be there.

For the other file components, the calculus is potentially different. I think there's a better case for metadata being text-format, maybe TOML with a specified schema or similar, since it's information that's fundamentally designed for humans (rather than crypto algorithms) to consume.

casey · 2020-04-15T00:27:57Z

> I think for the use case of a content manifest, which is designed as basically just a bunch of file hashes, binary is probably reasonable -- the space savings are nontrivial, and the human-readable information in this file is relatively limited anyway. This should probably also be as stupid a format as possible; it strikes me that something as simple as the text output from sha256sum would be functional here, and anything more complicated would need good returns on the extra complexity.

I'd like to have recursive maps and lists, otherwise extensions will be hard. I'd like to keep things simple, but still leave the door open for future extensions, and a flat list of hashes wouldn't leave room for extensions.

> What use case do you envision for data files embedded in manifests? I'm somewhat confused as to what the benefit would be there.

Digital signatures are one example. Another is the inner nodes of a Merkle tree, which would allow fast, secure random access into large data files.
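
As a rough sketch of what the embedded inner nodes buy: with them, a single chunk can be checked against the root hash without reading the rest of the file. The chunking, the `0x00`/`0x01` domain-separation prefixes, and SHA-256 are assumptions for the example (BLAKE3 has its own tree mode), and the function names are hypothetical.

```python
import hashlib

def leaf_hash(chunk: bytes) -> bytes:
    # Distinct prefixes keep leaf hashes and inner-node hashes apart.
    return hashlib.sha256(b"\x00" + chunk).digest()

def node_hash(left: bytes, right: bytes) -> bytes:
    return hashlib.sha256(b"\x01" + left + right).digest()

def build_levels(chunks: list[bytes]) -> list[list[bytes]]:
    # levels[0] holds leaf hashes and levels[-1] holds the single root.
    # Assumes a power-of-two number of chunks to keep the sketch short.
    levels = [[leaf_hash(c) for c in chunks]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([node_hash(prev[i], prev[i + 1])
                       for i in range(0, len(prev), 2)])
    return levels

def proof_for(levels: list[list[bytes]], index: int) -> list[bytes]:
    # Collect the sibling hash at each level on the path to the root.
    proof = []
    for level in levels[:-1]:
        proof.append(level[index ^ 1])
        index //= 2
    return proof

def verify_chunk(root: bytes, chunk: bytes, index: int,
                 proof: list[bytes]) -> bool:
    # Recompute the path from the chunk up to the root.
    node = leaf_hash(chunk)
    for sibling in proof:
        if index % 2 == 0:
            node = node_hash(node, sibling)
        else:
            node = node_hash(sibling, node)
        index //= 2
    return node == root

chunks = [b"chunk0", b"chunk1", b"chunk2", b"chunk3"]
levels = build_levels(chunks)
root = levels[-1][0]
proof = proof_for(levels, 2)
```

Verifying chunk 2 here needs only two sibling hashes, and in general the proof grows with the logarithm of the number of chunks.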

> For the other file components, the calculus is potentially different. I think there's a better case for metadata being text-format, maybe TOML with a specified schema or similar, since it's information that's fundamentally designed for humans (rather than crypto algorithms) to consume.

I think the metadata manifest will be primarily produced and consumed by computer programs. For example, a program might build an index over a bunch of manifests, and then a human could search it for individual files.

But that doesn't preclude also generating a .nfo file or index.html in the root, containing a human readable copy of all the metadata in the manifest.

casey changed the title ~~Basic Manifest Design~~ Manifest Format Apr 4, 2020

casey added the design label Apr 4, 2020

casey added this to the eventually milestone Apr 4, 2020

This was referenced Apr 4, 2020

Timestamping #323

Open

Integrity checking #324

Open

Release signing and verification #325

Open

metadata standardization #327

Closed

casey modified the milestone: eventually Apr 11, 2020

casey mentioned this issue May 29, 2020

Release Feature Priorities #450

Open

11 tasks
