ADR-040: Storage and SMT State Commitments #8430
# ADR 040: Storage and SMT State Commitments

## Changelog

- 2020-01-15: Draft

## Status

DRAFT Not Implemented

## Abstract

Sparse Merkle Tree ([SMT](https://osf.io/8mcnh/)) is a version of a Merkle Tree with various storage and performance optimizations. This ADR defines a separation of state commitments from data storage and the SDK transition from IAVL to SMT.

## Context

Currently, Cosmos SDK uses IAVL for both state [commitments](https://cryptography.fandom.com/wiki/Commitment_scheme) and data storage.

IAVL has effectively become an orphaned project within the Cosmos ecosystem, and it has proven to be an inefficient state commitment data structure.
In the current design, IAVL is used for both data storage and as a Merkle Tree for state commitments. IAVL is meant to be a standalone Merkleized key/value database; however, it uses a KV DB engine to store all tree nodes, so each node is stored in a separate record in the KV DB. This causes many inefficiencies and problems:

+ Each object query requires a tree traversal from the root. Subsequent queries for the same object are cached on the SDK level.
+ Each edge traversal requires a DB query.
+ Creating snapshots is [expensive](https://github.com/cosmos/cosmos-sdk/issues/7215#issuecomment-684804950). It takes about 30 seconds to export less than 100 MB of state (as of March 2020).
+ Updates in IAVL may trigger tree reorganization and possibly O(log(n)) hash re-computations, which can become a CPU bottleneck.
+ The node structure is expensive - it contains the standard tree node elements (key, value, left and right element) and additional metadata such as height, version (which is not required by the SDK). The entire node is hashed, and that hash is used as the key in the underlying database ([ref](https://github.com/cosmos/iavl/blob/master/docs/node/node.md)).

Moreover, the IAVL project lacks support and a maintainer, and we already see better and well-established alternatives. Instead of optimizing IAVL, we are looking into other solutions for both storage and state commitments.

## Decision

We propose to separate the concerns of state commitment (**SC**), needed for consensus, and state storage (**SS**), needed for the state machine. Finally, we replace IAVL with [LazyLedger's SMT](https://github.com/lazyledger/smt). LazyLedger SMT is based on the Diem (jellyfish) design [*] - it uses a compute-optimised SMT by replacing subtrees that contain only default values with a single node (the same approach is used by Ethereum2) and implements compact proofs.

The storage model presented here doesn't deal with data structure or serialization. It's a key-value database, where both key and value are binaries. The storage user is responsible for data serialization.

### Decouple state commitment from storage

Separation of storage and commitment (by the SMT) will allow the optimization of different components according to their usage and access patterns.

`SC` (SMT) is used to commit to data and compute Merkle proofs. `SS` is used to directly access data. To avoid collisions, `SS` and `SC` will each use a separate storage namespace (they could use the same database underneath). `SS` will store each `(key, value)` pair directly (map key -> value).

SMT is a Merkle tree structure: we don't store keys directly. For every `(key, value)` pair, `hash(key)` is used as the path (we hash the key to distribute keys evenly in the tree) and `hash(key, value)` is stored in the leaf. Since we don't know the structure of a value (in particular, whether it contains the key), we hash both the key and the value in the `SC` leaf.

For data access we propose 2 additional KV buckets (namespaces for the key-value pairs, sometimes called [column family](https://github.com/facebook/rocksdb/wiki/Terminology)):
1. B1: `key → value`: the principal object storage, used by a state machine, behind the SDK `KVStore` interface: provides direct access by key and allows prefix iteration (the KV DB backend must support it).
2. B2: `hash(key, value) → key`: a reverse index needed to extract a value (through: SMT → B2 → B1) having only a Merkle Path. Recall that SMT will store `hash(key, value)` in its leaves.
3. We could use more buckets to optimize the app usage if needed.

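The bucket layout above can be sketched with two in-memory maps. This is a minimal illustration, not the proposed implementation: the `Buckets` type, its methods, the use of SHA-256, and the length-prefixed encoding inside `leafHash` are all assumptions made for the example, since the ADR does not fix a hash function or an encoding.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// leafHash computes hash(key, value), the value kept in an SMT leaf.
// Assumption: SHA-256 over a length-prefixed key, so that ("ab", "c") and
// ("a", "bc") hash differently. The ADR does not specify the encoding.
func leafHash(key, value []byte) [32]byte {
	var n [4]byte
	binary.BigEndian.PutUint32(n[:], uint32(len(key)))
	h := sha256.New()
	h.Write(n[:])
	h.Write(key)
	h.Write(value)
	var out [32]byte
	copy(out[:], h.Sum(nil))
	return out
}

// Buckets is a hypothetical in-memory stand-in for the two KV namespaces.
type Buckets struct {
	b1 map[string][]byte   // B1: key -> value
	b2 map[[32]byte][]byte // B2: hash(key, value) -> key
}

func NewBuckets() *Buckets {
	return &Buckets{b1: map[string][]byte{}, b2: map[[32]byte][]byte{}}
}

// Set writes both buckets and returns what the SMT would receive:
// the path hash(key) and the leaf hash(key, value).
func (b *Buckets) Set(key, value []byte) (path, leaf [32]byte) {
	if old, ok := b.b1[string(key)]; ok {
		delete(b.b2, leafHash(key, old)) // drop the index entry of the old value
	}
	b.b1[string(key)] = append([]byte(nil), value...)
	leaf = leafHash(key, value)
	b.b2[leaf] = append([]byte(nil), key...)
	return sha256.Sum256(key), leaf
}

// ValueByLeaf resolves an SMT leaf to the stored value: SMT -> B2 -> B1.
func (b *Buckets) ValueByLeaf(leaf [32]byte) ([]byte, bool) {
	key, ok := b.b2[leaf]
	if !ok {
		return nil, false
	}
	value, ok := b.b1[string(key)]
	return value, ok
}

func main() {
	b := NewBuckets()
	path, leaf := b.Set([]byte("balance/alice"), []byte("100"))
	value, _ := b.ValueByLeaf(leaf)
	fmt.Printf("path=%x...\nleaf=%x...\nvalue=%s\n", path[:4], leaf[:4], value)
}
```

The sketch removes the stale B2 entry on overwrite; in the proposed design older versions would remain reachable through database snapshots rather than through this map.
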
Above, we propose to use a KV DB. However, for the state machine, we could use an RDBMS, which we discuss below.

### Requirements

State Storage requirements:
+ range queries
+ quick (key, value) access
+ creating a snapshot
+ pruning (garbage collection)

State Commitment requirements:
+ fast updates
+ tree path should be short
+ creating a snapshot
+ pruning (garbage collection)

### LazyLedger SMT for State Commitment

A Sparse Merkle tree is based on the idea of a complete Merkle tree of an intractable size. The assumption here is that, because the size of the tree is intractable, only a few leaf nodes hold valid data blocks relative to the tree size, which makes the tree sparse.

### Snapshots for storage sync and versioning

One of the Stargate core features is snapshots and state sync, delivered in the `/snapshot` package. This feature is implemented in the SDK and requires storage support. Currently IAVL is the only supported backend.

A database snapshot is a view of the DB state at a certain time or transaction. It's not a full copy of a database (that would be too big). Usually a snapshot mechanism is based on _copy on write_, which allows it to efficiently deliver the DB state at a certain stage.
Some DB engines support snapshotting. Hence, we propose to reuse that functionality for state sync and versioning (described below). It will limit the supported DB engines to those which efficiently implement snapshots. In a final section we will discuss the evaluated DBs.

Review comments on lines +81 to +82:
+ Suggested change
+ I'm defining database snapshot here, so I prefer to use snapshot mechanism here, so I prefer to keep the original language.
+ Okay I may have misunderstood. This is what some DBs call snapshots, and distinct from state sync snapshots as used in the ABCI, right? (although it can be used to implement ABCI snapshots)
+ Yes, here we are talking about a database engine mechanism.

A new snapshot will be created in every `EndBlocker`. The `rootmulti.Store` keeps track of the version number and implements the `MultiStore` interface. `MultiStore` encapsulates a `Committer` interface, which has the `Commit`, `SetPruning`, and `GetPruning` functions, which will be used for creating and removing snapshots. The `Store.Commit` function increments the version on each call and checks if it needs to remove old versions. We will need to update the SMT interface to implement the `Committer` interface.

Review comments:
+ Why is snapshot creation part of the state-machine process? Also, if you just take a direct DB snapshot, how do you perform verification?
+ Because the App has a knowledge when to create a snapshot. Storage doesn't have that knowledge. We could assume that it can create a snapshot on each commit, but it will make the design more constrained, and the library less robust.
+ What about verification and the time it takes to create a snapshot?
+ It's very efficient - the DB is using copy-on-write.
+ For clarification, copy-on-write is used to maintain historical versions, but the state sync snapshot still involves copying the entire state store at the time of creation (at least, that is how it's currently implemented).

NOTE: `Commit` must be called exactly once per block. Otherwise we risk the version number and the block height going out of sync.

NOTE: For the SDK storage, we may consider splitting that interface into `Committer` and `PruningCommitter` - only the multiroot should implement `PruningCommitter` (cache and prefix store don't need pruning).

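The interface split mentioned in the second note could look roughly like the sketch below. The signatures are deliberately simplified and are assumptions for illustration (plain `int64` values instead of the SDK's own commit and pruning types), and `rootStore` is a hypothetical toy that tracks snapshot versions in a slice.

```go
package main

import "fmt"

// Committer is the minimal interface: persist pending state as a new version.
// Simplified for illustration; the SDK's real interface uses richer types.
type Committer interface {
	Commit() (version int64)
}

// PruningCommitter adds the pruning configuration; in the ADR's proposal
// only the root store would implement it.
type PruningCommitter interface {
	Committer
	SetPruning(keepRecent int64)
	GetPruning() (keepRecent int64)
}

// rootStore is a toy root store that records one snapshot per version.
type rootStore struct {
	version    int64
	keepRecent int64   // 0 means keep every snapshot
	snapshots  []int64 // versions that still have a snapshot, oldest first
}

func (s *rootStore) SetPruning(keepRecent int64) { s.keepRecent = keepRecent }
func (s *rootStore) GetPruning() int64           { return s.keepRecent }

// Commit must be called exactly once per block so that the version
// stays aligned with the block height.
func (s *rootStore) Commit() int64 {
	s.version++
	s.snapshots = append(s.snapshots, s.version)
	if s.keepRecent > 0 && int64(len(s.snapshots)) > s.keepRecent {
		s.snapshots = s.snapshots[int64(len(s.snapshots))-s.keepRecent:]
	}
	return s.version
}

// Compile-time check that rootStore satisfies the wider interface.
var _ PruningCommitter = (*rootStore)(nil)

func main() {
	s := &rootStore{}
	s.SetPruning(2)
	for i := 0; i < 5; i++ {
		s.Commit()
	}
	fmt.Println(s.version, s.snapshots) // 5 [4 5]
}
```

A cache or prefix store in this sketch would implement only `Committer`, which keeps pruning concerns out of stores that never own snapshots.
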
The number of historical versions (snapshots) for `abci.Query` and fast sync is part of a node configuration, not a chain configuration (configuration implied by the blockchain consensus). A configuration should allow specifying the number of past blocks to keep and the number of past blocks to keep modulo some number (e.g. all of the past 100 blocks, plus one snapshot every 100 blocks for the past 2000 blocks). Archival nodes can keep all snapshots.

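The retention rule from the example (100 past blocks, plus one snapshot every 100 blocks for the past 2000 blocks) can be expressed as a small predicate. The `Retention` type and its field names are hypothetical; the ADR only describes the behaviour, not a configuration format.

```go
package main

import "fmt"

// Retention is a hypothetical node-level snapshot retention setting.
type Retention struct {
	KeepRecent uint64 // keep every snapshot for this many most recent blocks
	KeepEvery  uint64 // beyond that, keep heights divisible by this (0 disables)
	KeepWindow uint64 // sparse snapshots are kept only within this many blocks
}

// Keep reports whether the snapshot at height should be retained
// when the chain is at height latest.
func (r Retention) Keep(height, latest uint64) bool {
	if height > latest {
		return false
	}
	age := latest - height
	if age < r.KeepRecent {
		return true
	}
	return r.KeepEvery > 0 && age < r.KeepWindow && height%r.KeepEvery == 0
}

func main() {
	r := Retention{KeepRecent: 100, KeepEvery: 100, KeepWindow: 2000}
	const latest = 5000
	for _, h := range []uint64{4950, 4900, 4850, 3100, 3000} {
		fmt.Println(h, r.Keep(h, latest))
	}
}
```

With the chain at height 5000, heights 4950, 4900 and 3100 are kept, while 4850 (not a multiple of 100) and 3000 (outside the 2000-block window) are pruned. An archival node would correspond to a policy that always returns true.
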
Pruning old snapshots is effectively done by a database. Whenever we update a record in `SC`, SMT won't update nodes - instead it creates new nodes on the update path, without removing the old ones. Since we are snapshotting each block, we need to update that mechanism to immediately remove orphaned nodes from the storage. This is a safe operation - snapshots will keep track of the records which should be available for past versions.

To manage the active snapshots we will either use a DB _max number of snapshots_ option (if available), or we will remove snapshots in the `EndBlocker`. The latter option can be done efficiently by identifying snapshots with block height.

Review comments:
+ This seems a bit confusing to me. Pruning of Snapshots and pruning of application states, currently, are two separate configurable parameters. Are we merging these two? If so can it worded this way. What is the impact to disk size with this design?
+ How do you define application state pruning? For me, it is removing not needed records by a module (eg removing zero balances).
+ I am talking about how we currently prune application states or versions. You are talking about pruning versions or snapshots which are used for versions. This is application state pruning.
+ I'm not sure I understand your concern. ADR-40 is not about pruning application state. Old SS (state storage) versions (a version of the whole state) are covered by snapshots. If we want to remove an old version we remove a snapshot.

#### Accessing old state versions

One of the functional requirements is to access old state. This is done through the `abci.Query` structure. The version is specified by a block height (so we query for an object by a key `K` at block height `H`). The number of old versions supported for `abci.Query` is configurable. Accessing an old state is done by using available snapshots.

Review comments:
+ This isn't specific to abci.Query. Might make more sense to reword in the sense of querying.
+ Why it's not specific for
+ Users want to query old state as well.. Many dont want to go through abci.Query.
+ How they do it now? Are you talking about a new feature?
+ Suggested change
+ same here - I prefer to be consistent and use snapshot.

`abci.Query` doesn't need old state of `SC`. So, for efficiency, we should keep `SC` and `SS` in different databases (however using the same DB engine).

Moreover, the SDK could provide a way to directly access the state. However, a state machine shouldn't do that - since the number of snapshots is configurable, it would lead to nondeterministic execution.

We positively [validated](https://github.com/cosmos/cosmos-sdk/discussions/8297) a snapshot mechanism for querying old state with regards to the databases we evaluated.

### State Proofs

For any object stored in State Storage (`SS`), we have a corresponding object in `SC`. A proof for an object `V` identified by a key `K` is a branch of `SC`, where the path corresponds to the key `hash(K)`, and the leaf is `hash(K, V)`.

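To make the branch check concrete, here is a minimal sketch of verifying a plain (uncompressed) Merkle branch whose direction at each level is taken from the bits of `hash(K)`. It is not the LazyLedger SMT proof format, which uses compact proofs. SHA-256, the all-zero default leaf, and every function name below are assumptions for illustration.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// bit returns the i-th bit of path, counting from the most significant bit.
func bit(path [32]byte, i int) byte {
	return (path[i/8] >> (7 - uint(i%8))) & 1
}

// hashNode combines two child hashes into their parent hash.
func hashNode(left, right [32]byte) [32]byte {
	return sha256.Sum256(append(left[:], right[:]...))
}

// ComputeRoot walks from the leaf up to the root. siblings[0] is the
// sibling of the leaf, and the tree depth is len(siblings).
func ComputeRoot(path, leaf [32]byte, siblings [][32]byte) [32]byte {
	cur := leaf
	depth := len(siblings)
	for i, sib := range siblings {
		if bit(path, depth-1-i) == 1 { // cur is the right child at this level
			cur = hashNode(sib, cur)
		} else {
			cur = hashNode(cur, sib)
		}
	}
	return cur
}

// VerifyBranch checks that leaf sits at path under the given root.
func VerifyBranch(root, path, leaf [32]byte, siblings [][32]byte) bool {
	return ComputeRoot(path, leaf, siblings) == root
}

// emptySiblings returns the siblings for a tree holding a single leaf:
// every sibling is the root of an all-default subtree.
func emptySiblings(depth int) [][32]byte {
	sibs := make([][32]byte, depth)
	var def [32]byte // assumed default (empty) leaf value
	for i := range sibs {
		sibs[i] = def
		def = hashNode(def, def)
	}
	return sibs
}

func main() {
	key, value := []byte("K"), []byte("V")
	path := sha256.Sum256(key)
	leaf := sha256.Sum256(append(key, value...)) // stand-in for hash(K, V)
	sibs := emptySiblings(256)
	root := ComputeRoot(path, leaf, sibs)

	forged := sha256.Sum256([]byte("K" + "other"))
	fmt.Println(VerifyBranch(root, path, leaf, sibs))   // true
	fmt.Println(VerifyBranch(root, path, forged, sibs)) // false
}
```

Because all siblings in a sparse tree with one leaf are default-subtree roots, they can be precomputed once per level, which is the intuition behind replacing default subtrees with a single node.
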
### Rollbacks

We need to be able to process transactions and roll back state updates if a transaction fails. This can be done in the following way: during transaction processing, we keep all state change requests (writes) in a `CacheWrapper` abstraction (as it's done today). Once we finish the block processing, in the `EndBlocker`, we commit a root store - at that time, all changes are written to the SMT and to the `SS` and a snapshot is created.

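The write-buffering idea behind the rollback can be illustrated with a toy cache layered over a map. `cacheKV`, `runTx`, and `errFailed` are hypothetical names for this sketch and do not reproduce the SDK's `CacheWrapper` API.

```go
package main

import (
	"errors"
	"fmt"
)

// errFailed stands in for any transaction failure.
var errFailed = errors.New("tx failed")

// cacheKV buffers writes on top of a parent store (a plain map here).
type cacheKV struct {
	parent map[string]string
	writes map[string]string
}

func newCacheKV(parent map[string]string) *cacheKV {
	return &cacheKV{parent: parent, writes: map[string]string{}}
}

// Get reads buffered writes first and falls back to the parent.
func (c *cacheKV) Get(k string) (string, bool) {
	if v, ok := c.writes[k]; ok {
		return v, true
	}
	v, ok := c.parent[k]
	return v, ok
}

// Set records a write in the buffer without touching the parent.
func (c *cacheKV) Set(k, v string) { c.writes[k] = v }

// Write flushes the buffered writes to the parent.
func (c *cacheKV) Write() {
	for k, v := range c.writes {
		c.parent[k] = v
	}
	c.writes = map[string]string{}
}

// runTx executes tx against a fresh cache; writes reach the parent
// only if tx succeeds.
func runTx(parent map[string]string, tx func(kv *cacheKV) error) error {
	cache := newCacheKV(parent)
	if err := tx(cache); err != nil {
		return err // the cache is dropped, so the parent is untouched
	}
	cache.Write()
	return nil
}

func main() {
	block := map[string]string{"alice": "100"}
	_ = runTx(block, func(kv *cacheKV) error {
		kv.Set("alice", "0")
		return errFailed
	})
	fmt.Println(block["alice"]) // 100
	_ = runTx(block, func(kv *cacheKV) error {
		kv.Set("alice", "90")
		return nil
	})
	fmt.Println(block["alice"]) // 90
}
```

In the ADR's flow the `block` map would itself be a cache over the root store, flushed to the SMT and `SS` once per block.
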
### Committing to an object without saving it

We identified use-cases where modules will need to save an object commitment without storing the object itself. Sometimes clients receive complex objects and have no way to prove the correctness of that object without knowing the storage layout. For those use cases it would be easier to commit to the object without storing it directly.

## Consequences

### Backwards Compatibility

This ADR doesn't introduce any SDK level API changes.

We change the storage layout of the state machine, so a storage hard fork and network upgrade are required to incorporate these changes. SMT provides Merkle proof functionality; however, it is not compatible with ICS23. Updating the proofs for ICS23 compatibility is required.

### Positive

+ Decoupling state from state commitment introduces better engineering opportunities for further optimizations and better storage patterns.
+ Performance improvements.
+ Joining the SMT-based camp, which has wider and proven adoption than IAVL. Example projects which decided on SMT: Ethereum2, Diem (Libra), Trillian, Tezos, LazyLedger.

### Negative

+ Storage migration
+ LazyLedger SMT doesn't support pruning - we will need to add and test that functionality.

### Neutral

+ Deprecating IAVL, which is one of the core proposals of Cosmos Whitepaper.

## Alternative designs

Most of the alternative designs were evaluated in the [state commitments and storage report](https://paper.dropbox.com/published/State-commitments-and-storage-review--BDvA1MLwRtOx55KRihJ5xxLbBw-KeEB7eOd11pNrZvVtqUgL3h).

Ethereum research published [Verkle Trie](https://notes.ethereum.org/_N1mutVERDKtqGIEYc-Flw#fnref1) - an idea of combining polynomial commitments with a Merkle tree in order to reduce the tree height. This concept has very good potential, but we think it's too early to implement it. The current, SMT-based design could be easily updated to the Verkle Trie once other research efforts implement all the necessary libraries. The main advantage of the design described in this ADR is the separation of state commitments from the data storage and designing a more powerful interface.

## Further Discussions

### Evaluated KV Databases

We evaluated existing KV databases for snapshot support. The following databases provide an efficient snapshot mechanism: Badger, RocksDB, [Pebble](https://github.com/cockroachdb/pebble). Databases which don't provide such support or are not production ready: boltdb, leveldb, goleveldb, memdb, lmdb.

### RDBMS

Use of an RDBMS instead of a simple KV store for state. Use of an RDBMS will require an SDK API breaking change (`KVStore` interface) and will allow better data extraction and indexing solutions. Instead of saving an object as a single blob of bytes, we could save it as a record in a table in the state storage layer, and as a `hash(key, protobuf(object))` in the SMT as outlined above. To verify that an object registered in the RDBMS is the same as the one committed to SMT, one will need to load it from the RDBMS, marshal it using protobuf, hash it and do an SMT search.

### Off Chain Store

We were discussing a use case where modules can use a support database, which is not automatically committed. The module will be responsible for having a sound storage model and can optionally use the feature discussed in the _Committing to an object without saving it_ section.

## References

+ [IAVL What's Next?](https://github.com/cosmos/cosmos-sdk/issues/7100)
+ [IAVL overview](https://docs.google.com/document/d/16Z_hW2rSAmoyMENO-RlAhQjAG3mSNKsQueMnKpmcBv0/edit#heading=h.yd2th7x3o1iv) of its state v0.15
+ [State commitments and storage report](https://paper.dropbox.com/published/State-commitments-and-storage-review--BDvA1MLwRtOx55KRihJ5xxLbBw-KeEB7eOd11pNrZvVtqUgL3h)
+ [LazyLedger SMT](https://github.com/lazyledger/smt)
+ Facebook Diem (Libra) SMT [design](https://developers.diem.com/papers/jellyfish-merkle-tree/2021-01-14.pdf)
+ [Trillian Revocation Transparency](https://github.com/google/trillian/blob/master/docs/papers/RevocationTransparency.pdf), [Trillian Verifiable Data Structures](https://github.com/google/trillian/blob/master/docs/papers/VerifiableDataStructures.pdf).
+ Design and implementation [discussion](https://github.com/cosmos/cosmos-sdk/discussions/8297).


Review comments:
+ I don't follow this.
+ You can't get a data using SMT data. SMT only stores hashes. So, if you read a value from SMT, and you want to get a data out, you need to recover the key.
+ So sort of like an inverted index then. Can you rewrite this sentence like you just explained to make it clearer please?
+ updated