This spec has been deprecated in favor of IPLD. It offers a clearer description of how to link different kinds of hash-based structures (e.g. linking a file in IPFS to a commit in Git), has a more generalized and flexible format, and uses a JSON-compatible representation, among other improvements.
Authors(s):
Abstract
The ipfs merkledag is a directed acyclic graph whose edges are merkle-links. This means that links to objects can authenticate the objects themselves, and that every object contains a secure representation of its children.
This is a powerful primitive for distributed systems computations. The merkledag simplifies distributed protocols by providing an append-only authenticated data structure. Parties can communicate and exchange secure references (merkle-links) to objects. The references are enough to verify the correctness of the object at a later time, which allows the objects themselves to be served over untrusted channels. Merkledags also allow the branching of a data structure and subsequent merging, as in the version control system git. More generally, merkledags simplify the construction of Secure CRDTs, which enable distributed, convergent, commutative computation in an authenticated, secure way.
TODO
hash
- throughout this document, the wordhash
refers specifically to cryptographic hash functions, such as sha3.dag
- directed acyclic graphmerkle-link
- a link (graph edge) between two objects, which is (a) represented by the hash of the target object, and (b) embedded in the source object. merkle-links construct graphs (dags) whose links are content-addressed, and authenticated.merkledag
- the merkledag is a directed acyclic graph whose links are merkle-links (hashes of the content). It is a hash tree, and (under a very loose definition) a Merkle tree. Alternative names: the merkle-web, the merkle-forest, the merkle-chain.multihash
- the multihash format / protocol.ipfs object
- a node in the ipfs merkledag. It represents a singular entity.merkledag format
- the format that ipfs objects are expressed with.link segment
orlink table
- the part of the merkledag format that expresses links to other objects.data segment
- the part of the merkledag format that expresses non-link object data.protobuf
- protocol buffers, a serialization encoding.multicodec
- a self-describing, generalized serialization format.
The IPFS merkledag format is very simple. It serves as a thin waist for more complex applications and data structure transports. Therefore, it aims to be as simple and small as possible.
The format has two parts, the logical format, and the serialized format.
The merkledag format defines two parts, Nodes
and Links
between nodes. Nodes
embed Links
in their Link Segment
(or link table).
A node is divided in two parts:
- a
Link Segment
which contains all the links. - a
Data Segment
which contains the object data.
Instead of following previous approaches to merkledags, which place data mostly at the edges, the IPFS merkledag adapts the format of the HTTP web: every path endpoint is an object with both links and data. (this is fundamentally different from UNIX files, in which objects have either links (directories) or data (files).).
The logical format -- in protobuf -- looks like this:
// An IPFS MerkleDAG Link
message MerkleLink {
bytes Hash = 1; // multihash of the target object
string Name = 2; // utf string name
uint64 Tsize = 3; // cumulative size of target object
// user extensions start at 50
}
// An IPFS MerkleDAG Node
message MerkleNode {
repeated MerkleLink Links = 2; // refs to other objects
bytes Data = 1; // opaque user data
// user extensions start at 50
}
(TODO remove this? use only protobuf?)
The logical representation is serialized into raw bytes using multicodec
, a self-describing format that abstracts between serialization frameworks. That way, we can use the ipfs merkledag with various marshaling and serialization formats.
The logical representation is serialized into raw bytes using protocol buffers, a serialization format.
Many successful distributed systems employ specialized merkledags at their core:
- Merkle trees -- a special case of the merkledag -- are a well known cryptographic technique used to authenticate large amounts of data. The original use case involved one-time lamport signatures.
- SFS-RO turns a unix filesystem into a merkledag, constructing a secure, distributed filesystem.
- git uses a merkledag to enable distributed version control and source code management. Other DVCSes, such as mercurial and monotone, also feature a merkledag.
- plan9 uses a merkledag to construct its snapshotting filesystems -- Fossil and Venti.
- bittorrent uses a merkledag to provide secure and short infohash links to its downloadable torrents.
- Google Wave -- a distributed communications platform -- used a merkledag to construct its commutative operational transforms, and enable convergent distributed collaboration. This functionality has since been folded into Google Docs.
- bitcoin uses a merkledag to construct the blockchain, a shared append-only ledger with convergent distributed consensus.
- Tahoe-LAFS uses a merkledag to construct a secure, distributed, capability filesystem based on the least-authority principle.
(NOTE: please suggest other systems to reference here.)
At its core, IPFS provides the merkledag as a primitive (or "internet layer") to build sophisticated applications easily. It is a "thin-waist" for secure, distributed applications, which -- by agreeing to follow the common format -- can then run across any replication, routing, and transport protocols. To draw an analogy, this is like the "thin-waist" IP provided to connect hosts across medium-specific networks.
This kind of modularity enables complicated and powerful applications to be built with little effort on top of a common base. All the complexity of authentication, distribution, replication, routing, and transport can be pulled in from other protocols and tools. This type of modularity is what made the layered internet -- the TCP/IP stack -- so tremendously powerful.
In a sense, IPFS is a "web of data-structures", with the merkledag as the common denominator. Agreeing upon a format allows linking different authenticated data structures to each other, enabling sophisticated distributed applications to easily construct, distribute, and link their data.
The merkledag is a type of Linked-Data. The links do not follow the standard URI format, and instead opt for a more general and flexible UNIX filesystem path format, but the power is all there. One can trivially map formats like JSON-LD directly onto IPFS (IPFS-LD), making IPFS applications capable of using the full-power of the semantic web.
A powerful result of content (and identity) addressing is that linked data definitions can be distributed directly with the content itself, and do not need to be served from the original location. This enables the creation of Linked Data definitions, specs, and applications which can operate faster (no need to fetch it over the network), disconnected, or even completely offline.
To facilitate the definition of other data structures and protocols, we define a notation to express merkledag data structures. This defines their logical representation, and also a format specification (when using the ipfs merkledag format).
tree node {
links {
}
data {
}
}
commit node {
"parent" repeated link; // links to the parent commit
"author" link; // link to the author of commit
""
}