Manifest Format #321

Open
casey opened this issue Apr 4, 2020 · 6 comments

@casey
Owner

casey commented Apr 4, 2020

Many features are gated on the basic design of the Intermodal manifest. So let's get started on it right away.

Desiderata

  • Allow integrity checking. Given a manifest, an accompanying release can be checked for integrity using the manifest. This will require including secure hashes of accompanying files in the manifest.

  • Hashing the manifest should give a secure hash that uniquely identifies the contents of the release.

  • Multi-level manifest. A lower-level manifest should commit to the contents of a release. A higher-level manifest should commit to both the lower-level manifest and any files containing signatures over the lower-level manifest. It would be nice to have only one manifest, but since data can't sign itself or contain its own hash, it seems necessary to have at least a two-level manifest, so that we can produce a hash that uniquely identifies a collection of files, as well as signatures and other commitments to that collection of files.

    I'm thinking about calling the lower-level manifest the "content manifest" and the higher-level manifest the "bundle manifest". I'm definitely open to naming suggestions though. Other ideas are "file manifest" and "root manifest".
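
A minimal sketch of the two-level idea, to make the commitments concrete. Everything here is an assumption for illustration: SHA-256 stands in for whatever hash is chosen (BLAKE3 is not in Python's standard library), the `encode` helper is a hypothetical toy encoding rather than CBOR or MessagePack, and the "signature" is placeholder bytes, not a real signature.

```python
import hashlib


def digest(data: bytes) -> str:
    # SHA-256 is a stand-in; the hash function is still an open question.
    return hashlib.sha256(data).hexdigest()


def encode(mapping: dict) -> bytes:
    # Hypothetical deterministic encoding: sorted "name<TAB>hash" lines.
    # A toy only; names containing tabs or newlines would break it.
    return "".join(f"{k}\t{v}\n" for k, v in sorted(mapping.items())).encode()


# Release files (in-memory stand-ins for files on disk).
release = {"foo/bar": b"bar contents", "foo/baz": b"baz contents"}

# Lower level: the content manifest commits to the release files.
content_manifest = encode({path: digest(data) for path, data in release.items()})
content_id = digest(content_manifest)  # identifies the release contents

# A signature is made over content_id, so it cannot live inside the
# content manifest itself. Placeholder bytes, not a real signature.
signature_file = b"signature-over:" + content_id.encode()

# Upper level: the bundle manifest commits to the content manifest
# and to the signature file.
bundle_manifest = encode({
    "content": content_id,
    "signatures": digest(signature_file),
})
bundle_id = digest(bundle_manifest)


def verify(files: dict, manifest: bytes) -> bool:
    # Recompute the content manifest from the files and compare.
    return encode({p: digest(d) for p, d in files.items()}) == manifest
```

Adding or replacing a signature file changes `bundle_id` but leaves `content_id` untouched, which is the property the two levels are meant to provide.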

Why not use BitTorrent metainfo?

  • BitTorrent v1 uses SHA1, which is insecure.

  • BitTorrent v2 uses a custom tree hash that is vulnerable to attack if the length of the content is not included.

  • Bencoding is not a particularly popular encoding format.
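
To illustrate why a tree hash can need the content length, here is a generic textbook example. It is not BitTorrent v2's actual construction, which this sketch does not attempt to reproduce; the block size, the `naive_root` function, and the use of SHA-256 are all assumptions made for the demonstration.

```python
import hashlib

BLOCK = 64  # toy block size in bytes (assumption for the demo)


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def naive_root(data: bytes) -> bytes:
    # Hash each block, then pair up hashes until one remains.
    # No length and no leaf/inner distinction: this is the weakness.
    level = [h(data[i:i + BLOCK]) for i in range(0, len(data), BLOCK)] or [h(b"")]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last hash to pair it up
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


def root_with_length(data: bytes) -> bytes:
    # Binding the content length into the final hash removes the ambiguity.
    return h(len(data).to_bytes(8, "big") + naive_root(data))


original = b"A" * BLOCK + b"B" * BLOCK      # two 64-byte blocks
forged = h(b"A" * BLOCK) + h(b"B" * BLOCK)  # one 64-byte block of leaf hashes

# Different contents, same naive root; the length-bound roots differ.
same_naive_root = naive_root(original) == naive_root(forged)
same_bound_root = root_with_length(original) == root_with_length(forged)
```

Here `same_naive_root` is true because the forged single block is exactly the input to the original's top-level hash, while `same_bound_root` is false because the two contents have different lengths.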

Why not use the web packaging format?

The web bundle format is a single-file format, so it would be impossible to use natively with BitTorrent, which is an important transport.

Out of Scope

To keep things simple, it would be a good idea to limit the scope of the initial manifest design as much as possible. Things that we should keep in mind for the design, without working out the details yet:

  • Metadata. Structured metadata can be included in a file that the content manifest commits to.

  • Signatures, timestamps, and related functionality. A two-level manifest leaves open the ability to include files that are signatures over the hash of the content manifest, which are committed to by the bundle manifest.

In scope

  • Manifest format. I'm thinking either CBOR or messagepack. They are both lightweight, binary, schemaless formats with an object model that is similar to JSON. The web packaging format uses CBOR, so that's what I'm leaning towards. Keybase's saltpack, however, uses messagepack, so that's a contender too.

  • The hash function. Since manifests will have to include secure hashes, we should pick a hash function. I'm leaning towards BLAKE3, although someone could probably talk me out of it. BLAKE3 is extremely fast, supports random access and streaming verification, and has a strong Rust implementation, all of which are nice features. On the downside, it is very new and uses a reduced-strength construction. However, that reduced strength is argued not to be vulnerable, now or in the future.

  • Whether the manifest should be flat or a tree. A tree will be more compact when there are long directory names with many entries, but is more complex. A tree doesn't explicitly encode path separators, which I think is a bonus.

    flat: {"foo/bar": "BAR_HASH", "foo/baz": "BAZ_HASH"}
    tree: {"foo": {"bar": "BAR_HASH", "baz": "BAZ_HASH"}}
    

    BitTorrent V2 uses a tree, so that's what I'm leaning towards.

  • Where to put the manifest in a release. Since it seems likely that we'll eventually want multiple files, I'm thinking that putting everything into a subdirectory is a good idea, either imdl/ or intermodal/.

    For example, if we go with CBOR, the structure could be:

    intermodal/root.cbor     # root manifest
    intermodal/content.cbor  # content manifest
    intermodal/signatures    # signatures over hash of the content manifest
    intermodal/timestamp.ots # open timestamps timestamp
    intermodal/metadata.cbor # metadata
    intermodal/README.txt    # human readable info about intermodal
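
For the flat-versus-tree question above, a small sketch of converting between the two shapes and comparing their serialized sizes. The function names are hypothetical, JSON is used only as a convenient way to measure size (the thread is considering CBOR or MessagePack), and conflicts such as a file and a directory sharing a name are not handled.

```python
import json


def flat_to_tree(flat: dict) -> dict:
    # Split each "a/b/c" path and nest one map per directory component.
    tree = {}
    for path, file_hash in flat.items():
        *dirs, name = path.split("/")
        node = tree
        for d in dirs:
            node = node.setdefault(d, {})
        node[name] = file_hash
    return tree


def tree_to_flat(tree: dict, prefix: str = "") -> dict:
    # Walk the nested maps and rebuild "/"-joined paths.
    flat = {}
    for name, value in tree.items():
        path = f"{prefix}/{name}" if prefix else name
        if isinstance(value, dict):
            flat.update(tree_to_flat(value, path))
        else:
            flat[path] = value
    return flat


flat = {"foo/bar": "BAR_HASH", "foo/baz": "BAZ_HASH"}
tree = flat_to_tree(flat)

# Many entries under one long directory name: the flat form repeats
# the prefix for every entry, the tree stores it once.
deep_flat = {f"a/very/long/directory/name/file{i}": "H" for i in range(100)}
flat_size = len(json.dumps(deep_flat))
tree_size = len(json.dumps(flat_to_tree(deep_flat)))
```

With the two-file example the difference is negligible; it only shows up once many entries share a long prefix, as in `deep_flat`.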
    

Postscript

A very weird but nonetheless interesting choice of format would be FIDL, Fuchsia's IPC system:

  • Modular compiler and language bindings.
  • Message encoding is canonical — there is exactly one encoding for a given message.
  • Supports both fixed-size messages where space is important, and extensible tables and unions where schema evolution is important.
  • Defined to be little endian and uses natural alignment, so no portability issues.
  • Zero copy and zero parse, by storing variable sized members out-of-line. Can be memory mapped with LMDB or mmap(2) for extremely fast access.
  • Certain types can be introspected without access to a schema, namely tables and unions.
  • Could make it fully introspectable where needed by including hash of schema as first element of messages.

Another less wacky choice would be flatbuffers. Flatbuffers also support zero copy and zero parse deserialization.

Misc

  • Would like to compare the encoding choices of flatbuffer and FIDL.
  • Would like to add some kind of link attribute, that would indirect through a hash.
  • Unsure of how to approach "canonical encoding". Forcing the format to output a buffer in a canonical format is inflexible. A more flexible approach would be to calculate hashes over logical traversals of the physical data, so the physical data may be in multiple forms, but hash to the same value, as long as the traversal doesn't change.
  • Flatbuffers binary format documentation
  • fleetfs flatbuffers schema
  • proc macro attribute parser
  • syn helper
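
A rough sketch of the "hash over a logical traversal" idea from the list above, assuming a manifest is a map of names to either byte strings or nested maps. The function names, the type tags, and the use of SHA-256 are assumptions for illustration. Two physical layouts (a dict, or a list of key/value pairs in any order) hash to the same value because the traversal order is fixed by sorting keys.

```python
import hashlib


def entries(node):
    # Two physical layouts for the same logical map:
    # a dict, or a list of (key, value) pairs in arbitrary order.
    pairs = node.items() if isinstance(node, dict) else node
    return sorted(pairs, key=lambda kv: kv[0])


def logical_hash(node) -> bytes:
    h = hashlib.sha256()
    if isinstance(node, bytes):
        # Leaf: type tag, length, then the bytes themselves.
        h.update(b"bin" + len(node).to_bytes(8, "big") + node)
    elif isinstance(node, (dict, list)):
        items = entries(node)
        # Map: type tag, entry count, then each key and child hash.
        h.update(b"map" + len(items).to_bytes(8, "big"))
        for key, value in items:
            k = key.encode()
            h.update(len(k).to_bytes(8, "big") + k)
            h.update(logical_hash(value))
    else:
        raise TypeError(f"unsupported node type: {type(node).__name__}")
    return h.digest()


as_dict = {"foo": {"bar": b"1", "baz": b"2"}}
as_pairs = [("foo", [("baz", b"2"), ("bar", b"1")])]
changed = {"foo": {"bar": b"1", "baz": b"3"}}
```

`logical_hash(as_dict)` and `logical_hash(as_pairs)` agree even though the physical data differs, while `changed` produces a different hash.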
casey changed the title from "Basic Manifest Design" to "Manifest Format" on Apr 4, 2020
casey added the design label on Apr 4, 2020
casey added this to the eventually milestone on Apr 4, 2020
@nijynot

nijynot commented Apr 4, 2020

Looks pretty good to me! Here are some of my initial thoughts.

With manifest format, CBOR seems to have many options to choose from, while MessagePack is more straightforward. As they seem to achieve the same thing at the end of the day, maybe MessagePack would be a better choice as it's simpler? For what it's worth, MessagePack's implementations seem to be a little more up-to-date than CBOR's from a quick glance. And given CBOR's number of options, I don't trust that all implementations support the same set of features.

On the hash function, I think BLAKE3 would def make a lot more sense if its security is good enough. Haven't read the paper, but given BLAKE3's speed, it'd be a great fit for this manifest as you'd probably hash a lot of data.

> BitTorrent V2 uses a tree, so that's what I'm leaning towards.

Yeah, tree is probably the way to go.

@casey
Owner Author

casey commented Apr 4, 2020

Thanks for the feedback!

I definitely agree that going with something simpler is better. I should look at CBOR and see if any of the bells and whistles it provides make any sense for intermodal.

casey modified the milestone: eventually (Apr 11, 2020)
@alethiophile

Is there a particular reason for preferring binary formats over text?

@casey
Owner Author

casey commented Apr 14, 2020

I think a few reasons:

  • The files are likely to contain a lot of 256-bit binary hashes, which will be more compact in a binary format than in hex.

  • I'd like to eventually be able to embed binary files directly into manifests, and if those are large, hex/base64 encoding will mean a lot of overhead.

  • Some binary formats, like FIDL and flatbuffers, can be memory mapped directly, and then accessed with very low overhead. This is very hard to do well with text-based formats. I'd like to leave this option open for downstream applications, since some applications might need to deal with huge numbers of manifests, and this would speed those applications up enormously.

  • Eventually, intermodal might develop one or more protocols, for example a mainline DHT/tracker/data exchange protocol alternative. If the binary format used in the manifest can be used for the protocol, that would be a great bonus.
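
To put rough numbers on the compactness point in the list above, a quick sketch comparing how many bytes one 256-bit hash takes in raw, hex, and base64 form. SHA-256 is used here only because it produces a 256-bit digest and ships with Python; it is not a proposal for the manifest.

```python
import base64
import hashlib

digest = hashlib.sha256(b"example file contents").digest()

raw_len = len(digest)                    # 32 bytes in a binary format
hex_len = len(digest.hex())              # 64 characters as hex
b64_len = len(base64.b64encode(digest))  # 44 characters as padded base64

# Overhead relative to the raw bytes.
hex_overhead = hex_len / raw_len
b64_overhead = b64_len / raw_len
```

Hex doubles the size of every hash, and base64 adds a bit more than a third; the same ratios apply to any binary file embedded in a text-based manifest.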

Text formats definitely have the benefit of being human readable, and losing that is unfortunate. To make up for that, one of the things we could do is auto-generate a human readable readme or .nfo file that would contain all the information that would be useful for a human, and put that in the root of intermodal-created releases.

@alethiophile

I think for the use case of a content manifest, which is designed as basically just a bunch of file hashes, binary is probably reasonable -- the space savings are nontrivial, and the human-readable information in this file is relatively limited anyway. This should probably also be as stupid a format as possible; it strikes me that something as simple as the text output from sha256sum would be functional here, and anything more complicated would need good returns on the extra complexity.

What use case do you envision for data files embedded in manifests? I'm somewhat confused as to what the benefit would be there.

For the other file components, the calculus is potentially different. I think there's a better case for metadata being text-format, maybe TOML with a specified schema or similar, since it's information that's fundamentally designed for humans (rather than crypto algorithms) to consume.

@casey
Owner Author

casey commented Apr 15, 2020

> I think for the use case of a content manifest, which is designed as basically just a bunch of file hashes, binary is probably reasonable -- the space savings are nontrivial, and the human-readable information in this file is relatively limited anyway. This should probably also be as stupid a format as possible; it strikes me that something as simple as the text output from sha256sum would be functional here, and anything more complicated would need good returns on the extra complexity.

I'd like to have recursive maps and lists, otherwise extensions will be hard. I'd like to keep things simple, but still leave the door open for future extensions, and a flat list of hashes wouldn't leave room for extensions.

> What use case do you envision for data files embedded in manifests? I'm somewhat confused as to what the benefit would be there.

Digital signatures are one example; another is the inner nodes of a Merkle tree, which would allow fast, secure random access into large data files.
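
A minimal sketch of how stored inner nodes enable that kind of random access: one block can be checked against the root using only its sibling hashes, without reading the rest of the file. The function names, the SHA-256 stand-in, the leaf/inner prefix bytes, and the duplicate-last-node padding are assumptions for illustration, not a proposed design.

```python
import hashlib


def leaf_hash(block: bytes) -> bytes:
    # Distinct prefixes keep leaves and inner nodes from being confused.
    return hashlib.sha256(b"\x00" + block).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    return hashlib.sha256(b"\x01" + left + right).digest()


def build_levels(blocks: list) -> list:
    # levels[0] holds leaf hashes, levels[-1] holds the single root.
    level = [leaf_hash(b) for b in blocks]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]  # toy padding: duplicate last node
            levels[-1] = level
        level = [node_hash(level[i], level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def proof_for(levels: list, index: int) -> list:
    # Collect the sibling hash at each level on the path to the root.
    proof = []
    for level in levels[:-1]:
        proof.append(level[index ^ 1])
        index //= 2
    return proof


def verify_block(block: bytes, index: int, proof: list, root: bytes) -> bool:
    # Rebuild the path to the root from one block and its siblings.
    current = leaf_hash(block)
    for sibling in proof:
        if index % 2 == 0:
            current = node_hash(current, sibling)
        else:
            current = node_hash(sibling, current)
        index //= 2
    return current == root


blocks = [bytes([i]) * 4 for i in range(5)]
levels = build_levels(blocks)
root = levels[-1][0]
ok = verify_block(blocks[3], 3, proof_for(levels, 3), root)
```

The proof for one block grows with the depth of the tree rather than the size of the file, which is why embedding the inner nodes in a manifest would be useful for large files.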

> For the other file components, the calculus is potentially different. I think there's a better case for metadata being text-format, maybe TOML with a specified schema or similar, since it's information that's fundamentally designed for humans (rather than crypto algorithms) to consume.

I think the metadata manifest will be primarily produced and consumed by computer programs. For example, a program might build an index over a bunch of manifests, and then a human could search it for individual files.

But, that doesn't preclude also generating a .nfo file or index.html in the root which contains a human readable copy of all the metadata in the manifest.

casey mentioned this issue on May 29, 2020