From a651a7d18eb9aa915c7042bdf9a328ce47a3d62a Mon Sep 17 00:00:00 2001 From: Min Zhang Date: Thu, 29 Sep 2022 16:27:57 -0400 Subject: [PATCH 01/24] add flat storage nep draft --- neps/nep-9999.md | 246 +++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 246 insertions(+) create mode 100644 neps/nep-9999.md diff --git a/neps/nep-9999.md b/neps/nep-9999.md new file mode 100644 index 000000000..0166306fc --- /dev/null +++ b/neps/nep-9999.md @@ -0,0 +1,246 @@ +--- +NEP: 0 +Title: Flat Storage +Author: Min Zhang Aleksandr Logunov +DiscussionsTo: https://github.com/nearprotocol/neps/pull/0000 +Status: Draft +Type: Protocol Track +Category: Chain +Created: 07-Sep-2022 +--- + +## Summary + +Currently, the state of blockchain is stored in our storage in the format of persistent merkelized tries. +Although the trie structure is needed to compute state roots and prove the validity of states, it is expensive +to read from the trie structure because a traversal from the trie root to the leaf that contains the key +value pair could require 2 * key_length of disk access in the worst case. + +In addition, we charge receipts by the number of trie nodes they touched (TTN cost), +which is confusing to developers and unpredictable. This NEp proposes the idea of FlatStorage, +which stores a flattened key/value pairs of the current state on disk. This way, any storage read requires at most +2 disk reads. As a result, we can make storage reads faster, decrease the fees, and get rid of the TTN +cost. + +## Motivation + +The motivation of this project is to increase performance of storage reads and remove TTN cost. + +## Rationale and alternatives + +- Why is this design the best in the space of possible designs? + +- What other designs have been considered and what is the rationale for not choosing them? + +- What is the impact of not doing this? + +## Specification + +FlatStorage will store key value pairs from trie keys to the value refs (the rocksdb key of where the value of the trie item is stored) on disk. Let’s call this block the head of flat storage. To look up a trie value from flat storage, we will need at most 2 disk reads, once to get the value reference, once to get the value. + +Since there could be forks, flat storage must also support lookups for other blocks. +To achieve that, we also store block deltas in memory, and use the deltas to compute state +at other blocks. We call these deltas FlatStorageDelta (FSD). Let’s say the flat storage head +is at block h, and we are applying transactions based on block h’. Then we need some FSDs to +access the state at h’ from the snapshot at h. All these FSDs must be able to fit in +memory, otherwise, each state key lookup will trigger more than 2 disk reads and we will +have to set storage key read fee higher. + +However, the consensus algorithm doesn’t provide any guarantees in the distance of blocks +that we need to process since it could be arbitrarily long for a block to be finalized. +To solve this problem, we make another proposal (TODO: attach link for the proposal) to +set gas limit to zero for blocks with height larger than the latest final block’s height + X. +This way, flat storage only needs to store FSDs for blocks with height less than the latest +final block’s height + X. And since there can be at most one valid blocks per height, +FlatStorage only needs to store at most X FSDs in memory. + +### FSD size estimation +To set the value of X, we need to see how many block deltas can fit in memory. 
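Before working through the numbers in prose, here is the same estimate as a small runnable sketch; the fee and size constants are assumptions matching the figures used in the paragraphs below, not authoritative protocol values:

```rust
// Rough upper bound on the size of one block's FSD, using the fee values below.
fn main() {
    let block_gas_limit: u64 = 1_300_000_000_000_000; // 1300 Tgas
    let write_base: u64 = 64_000_000_000;             // wasm_storage_write_base = 64 Ggas
    let write_key_byte: u64 = 70_000_000;             // wasm_storage_write_key_byte = 70 Mgas
    let max_key_len: u64 = 2048;                      // ~2 KiB contract data key limit
    let num_shards: u64 = 4;

    // Worst-case gas paid per key byte: base cost amortized over a maximal key.
    let gas_per_byte = write_base / max_key_len + write_key_byte; // ~102 Mgas

    // Total size of keys changed in one block, across all shards.
    let key_bytes = block_gas_limit / gas_per_byte * num_shards;
    println!("keys: ~{} MiB per block", key_bytes >> 20); // ~50 MiB

    // Number of changed entries, and total size of their value refs (40 B each).
    let entries = block_gas_limit / write_base * num_shards; // ~80K
    println!("value refs: ~{} MiB per block", entries * 40 >> 20); // ~3 MiB
}
```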
+ +We can estimate FSD size using protocol fees. +Assume that flat state stores a mapping from keys to value refs. +Maximal key length is ~2 KiB which is the limit of contract data key size. +During wasm execution, we pay `wasm_storage_write_base` = 64 Ggas per call and +`wasm_storage_write_key_byte` = 70 Mgas per key byte. +In the extreme case it means that we pay `(64_000 / 2 KiB + 70) Mgas ~= 102 Mgas` per byte. +Then the total size of keys changed in a block is at most +`block_gas_limit / gas_per_byte * num_shards = (1300 Tgas / 102 Mgas) * 4 ~= 50 MiB`. + +To estimate the sizes of value refs, there will be at most +`block_gas_limit / wasm_storage_write_base * num_shards += 1300 Tgas / 64 Ggas * 4 = 80K` changed entries in a block. +Since one value ref takes 40 bytes, limit of total size of changed value refs in a block +is then 3.2 MiB. + +To sum it up, we will have < 54 MiB for one block, and ~1.1 GiB for 20 blocks. + +Note that if we store a value instead of value ref, size of FSDs can potentially be much larger. +Because value limit is 4 MiB, we can’t apply previous argument about base cost. +Since `wasm_storage_write_value_byte` = 31 Mgas, one FSD size can be estimated as +`(1300 Tgas / storage_write_value_byte cost * num_shards)`, or ~170 MiB, +which is 3 times higher. + +// TODO: From the above calculation, if we store + +### Storage Writes + +## Reference Implementation +The following are the important structs that will be implemented in flat storage. + +`FlatState`: It provides an interface to get value or value references from flat storage. It + will be part of `Trie`, and all trie reads will be directed to the FlatState object. + A `FlatState` object is based on a block `block_hash`, and it provides key value lookups + on the state after the block `block_hash` is applied. + +`ShardFlatStates`: It provides an interface to construct `FlatState` for each shard. + +`FlatStorageState`: It stores some information about the state of the flat storage itself, + for example, all block deltas that are stored in flat storage and the flat + storage head. `FlatState` can access `FlatStorageState` to get the list of + deltas it needs to apply on top of state of current flat head in order to + compute state of a target block. + +`FlatStateDelta`: a HashMap that contains state changes introduced in a block. They can be applied +on top the state at flat head to compute state at another block. + +It may be noted that in this implementation, a separate `FlatState` and `FlatStorageState` +will be created for each shard. The reason is that there are two modes of block processing, +normal block processing and block catchups. +Since they are performed on different range of blocks, flat storage need to be able to support +different range of blocks on different shards. Therefore, we separate the flat storage objects +used for different shards. + +### DB columns +`DBCol::FlatState` stores a mapping from trie keys to the value corresponding to the trie keys, +based on the state of the block at flat storage head. +- *Rows*: trie key (`Vec`) +- *Column type*: `ValueOrValueRef` + +`DBCol::FlatStateDeltas` stores a mapping from `(shard_id, block_hash)` to the `FlatStateDelta` that stores +state changes introduced in the given shard of the given block. +- *Rows*: `{ shard_id, block_hash }` +- *Column type*: `FlatStateDelta` + +`DBCol::FlatStateHead` stores the flat head at different shards. 
+- *Rows*: `shard_id` +- *Column type*: `CryptoHash` + +### ```FlatState``` +```FlatState``` will be created for a shard `shard_id` and a block `block_hash`, and it can perform +key value lookup for the state of shard `shard_id` after block `block_hash` is applied. +```rust +pub struct FlatState { +/// Used to access flat state stored at the head of flat storage. +store: Store, +/// The block for which key-value pairs of its state will be retrieved. The flat state +/// will reflect the state AFTER the block is applied. +block_hash: CryptoHash, +/// In-memory cache for the key value pairs stored on disk. +#[allow(unused)] +cache: FlatStateCache, +/// Stores the state of the flat storage +#[allow(unused)] +flat_storage_state: FlatStorageState, +} +``` + +```FlatState``` will provide the following interface. +```rust +/// get_ref returns the value or value reference corresponding to the given `key` +/// for the state that this `FlatState` object represents, i.e., the state that after +/// block `self.block_hash` is applied. +pub fn get_ref( + &self, + key: &[u8], +) -> Result, StorageError> +``` + +###```ShardFlatStates``` +`ShardFlatStates` will be stored as part of `ShardTries`. Similar to how `ShardTries` is used to +construct new `Trie` objects given a state root and a shard id, `ShardFlatStates` is used to construct +a new `FlatState` object given a block hash and a shard id. + +```rust +pub fn new_flat_state_for_shard( + &self, + shard_id: ShardId, + block_hash: Option, +) -> FlatState +``` + +###```FlatStorageState``` +`FlatStorageState` is created per shard. It provides information to which blocks the flat storage +on the given shard currently supports and what block deltas need to be applied on top the stored +flat state on disk to get the state of the target block. + +```rust +fn get_deltas_between_blocks( + &self, + target_block_hash: &CryptoHash, +) -> Result>, FlatStorageError> +``` + +```rust +fn update_flat_head(&self, new_head: &CryptoHash) -> Result<(), FlatStorageError> +``` + +```rust +fn add_delta( + &self, + block_hash: &CryptoHash, + delta: FlatStateDelta, +) -> Result +``` + +#### Thread Safety +We should note that the implementation of `FlatStorageState` must be thread safe because it can +be concurrently accessed by multiple threads. A node can process multiple blocks at the same time +if they are on different forks. Therefore, `FlatStorageState` will be guarded by a `RwLock` so its +access can be shared safely. + +```rust +pub struct FlatStorageState(Arc>); +``` + +## Security Implications (Optional) + +If there are security concerns in relation to the NEP, those concerns should be explicitly written out to make sure reviewers of the NEP are aware of them. + +## Drawbacks (Optional) + +Why should we *not* do this? + +## Unresolved Issues (Optional) + +### Storage Writes +### Fees +### Migration Plan + +- What parts of the design do you expect to resolve through the NEP process before this gets merged? +- What parts of the design do you expect to resolve through the implementation of this feature before stabilization? +- What related issues do you consider out of scope for this NEP that could be addressed in the future independently of the solution that comes out of this NEP? + +## Future possibilities + +Think about what the natural extension and evolution of your proposal would +be and how it would affect the project as a whole in a holistic +way. Try to use this section as a tool to more fully consider all possible +interactions with the project in your proposal. 
+Also consider how the this all fits into the roadmap for the project +and of the relevant sub-team. + +This is also a good place to "dump ideas", if they are out of scope for the +NEP you are writing but otherwise related. + +If you have tried and cannot think of any future possibilities, +you may simply state that you cannot think of anything. + +Note that having something written down in the future-possibilities section +is not a reason to accept the current or a future NEP. Such notes should be +in the section on motivation or rationale in this or subsequent NEPs. +The section merely provides additional information. + +## Copyright +[copyright]: #copyright + +Copyright and related rights waived via [CC0](https://creativecommons.org/publicdomain/zero/1.0/). From 9004d026be2511ac8bc5ad0e2192c37b9fa9ec62 Mon Sep 17 00:00:00 2001 From: Min Zhang Date: Thu, 29 Sep 2022 16:38:54 -0400 Subject: [PATCH 02/24] add nep number --- README.md | 1 + neps/{nep-9999.md => nep-0399.md} | 6 +++--- 2 files changed, 4 insertions(+), 3 deletions(-) rename neps/{nep-9999.md => nep-0399.md} (99%) diff --git a/README.md b/README.md index 315643f23..38d06118b 100644 --- a/README.md +++ b/README.md @@ -25,6 +25,7 @@ Changes to the protocol specification and standards are called NEAR Enhancement |[0245](https://github.com/near/NEPs/blob/master/neps/nep-0245.md) | Multi Token Standard | @zcstarr @riqi @jriemann @marcos.sun | Review | |[0297](https://github.com/near/NEPs/blob/master/neps/nep-0297.md) | Events Standard | @telezhnaya | Final | |[0330](https://github.com/near/NEPs/blob/master/neps/nep-0330.md) | Source Metadata | @BenKurrek | Review | +|[0399](https://github.com/near/NEPs/blob/master/neps/nep-0399.md) | Flat Storage | @AleksandrLogunov @MinZhang | Draft | diff --git a/neps/nep-9999.md b/neps/nep-0399.md similarity index 99% rename from neps/nep-9999.md rename to neps/nep-0399.md index 0166306fc..4cb24164a 100644 --- a/neps/nep-9999.md +++ b/neps/nep-0399.md @@ -1,11 +1,11 @@ --- -NEP: 0 +NEP: 0399 Title: Flat Storage Author: Min Zhang Aleksandr Logunov -DiscussionsTo: https://github.com/nearprotocol/neps/pull/0000 +DiscussionsTo: https://github.com/nearprotocol/neps/pull/0399 Status: Draft Type: Protocol Track -Category: Chain +Category: Storage Created: 07-Sep-2022 --- From 34ae9942cbbd26fee407b2b741bff45ba0f078ce Mon Sep 17 00:00:00 2001 From: Min Zhang Date: Sun, 2 Oct 2022 17:37:22 -0400 Subject: [PATCH 03/24] add more conntent' --- neps/nep-0399.md | 154 +++++++++++++++++++++++++++++++++++++---------- 1 file changed, 122 insertions(+), 32 deletions(-) diff --git a/neps/nep-0399.md b/neps/nep-0399.md index 4cb24164a..dab606046 100644 --- a/neps/nep-0399.md +++ b/neps/nep-0399.md @@ -1,7 +1,7 @@ --- NEP: 0399 Title: Flat Storage -Author: Min Zhang Aleksandr Logunov +Author: Aleksandr Logunov Min Zhang DiscussionsTo: https://github.com/nearprotocol/neps/pull/0399 Status: Draft Type: Protocol Track @@ -13,45 +13,89 @@ Created: 07-Sep-2022 Currently, the state of blockchain is stored in our storage in the format of persistent merkelized tries. Although the trie structure is needed to compute state roots and prove the validity of states, it is expensive -to read from the trie structure because a traversal from the trie root to the leaf that contains the key -value pair could require 2 * key_length of disk access in the worst case. 
+to read from the trie structure because it requires a traversal from the trie root to the leaf that contains the key +value pair, which could mean 2 * key_length of disk access in the worst case. -In addition, we charge receipts by the number of trie nodes they touched (TTN cost), -which is confusing to developers and unpredictable. This NEp proposes the idea of FlatStorage, +In addition, we charge receipts by the number of trie nodes they touched (TTN cost), +which is both confusing to developers and unpredictable. This NEP proposes the idea of FlatStorage, which stores a flattened key/value pairs of the current state on disk. This way, any storage read requires at most 2 disk reads. As a result, we can make storage reads faster, decrease the fees, and get rid of the TTN cost. ## Motivation -The motivation of this project is to increase performance of storage reads and remove TTN cost. +The motivation of this proposal is to increase performance of storage reads, reduce storage read costs and +simplifies how storage fees are charged by getting rid of TTN cost. ## Rationale and alternatives - Why is this design the best in the space of possible designs? - +There are other ideas for how to improve storage performance, such as using + other database instead of rocksdb, or changing the representation of states + to achieve locality of data in the same account. Considering that these ideas + will likely require much more work than FlatStorage, FlatStorage is a good investment + of our effort to achieve better storage performances. In addition, the improvement + from FlatStorage can be combined with the improvement brought by these other ideas, + so the implementation of FlatStorage won't be rendered obsolete in the future. - What other designs have been considered and what is the rationale for not choosing them? - +Alternatively, we can still get rid of TTN cost by increasing the base fees for storage reads and writes. However, + this could require increasing the fees by quite a lot, which could end up breaking many contracts. - What is the impact of not doing this? +Storage reads will still be inefficiently implemented and cost more than it could be. ## Specification +The key idea of FlatStorage is to store a direct mapping from trie keys to values on disk. +Here the values of this mapping can be either the value corresponding to the trie key itself, +or the value ref, a hash that points to the address of the value. If the value itself is stored, +only one disk read is needed to look up a value from flat storage, otherwise two disk reads if the value +ref is stored. We will discuss more in the following section for whether we use values or value refs. +For the purpose of high level discussion, it suffices to say that with FlatStorage, +at most two disk reads are needed to perform a storage read. + +The simple design above won't work because there could be forks in the chain. In the following case, FlatStorage +must support key value lookups for states of the blocks on both forks. +``` + Block B1 - Block B2 - ... + / +block A + \ Block C1 - Block C2 - ... +``` -FlatStorage will store key value pairs from trie keys to the value refs (the rocksdb key of where the value of the trie item is stored) on disk. Let’s call this block the head of flat storage. To look up a trie value from flat storage, we will need at most 2 disk reads, once to get the value reference, once to get the value. - -Since there could be forks, flat storage must also support lookups for other blocks. 
-To achieve that, we also store block deltas in memory, and use the deltas to compute state -at other blocks. We call these deltas FlatStorageDelta (FSD). Let’s say the flat storage head -is at block h, and we are applying transactions based on block h’. Then we need some FSDs to -access the state at h’ from the snapshot at h. All these FSDs must be able to fit in -memory, otherwise, each state key lookup will trigger more than 2 disk reads and we will -have to set storage key read fee higher. +The handling of forks will be the main consideration of the following design. More specifically, +the design should satisfy the following requirements, +1) It should support concurrent block processing. Blocks on different forks are processed + concurrently in our client code, so the flat storage API must support that. +2) In case of long forks, block processing time should not be too much longer than the average case. + We don’t want this case to be exploitable. It is acceptable that block processing time is 200ms longer, + which may slow down block production, but probably won’t cause missing blocks and chunks. + It is not acceptable if block processing time is 10s, which may lead to more forks and instability in the network. +3) The design must be able to decrease storage access cost in all cases, + since we are going to change the storage read fees based on flat storage. + We can't conditionally enable FlatStorage for some blocks and disable it for other, because + the fees we charge must be consistent. + +The mapping of key value pairs FlatStorage stored on disk matches the state at some block. +We call this block the head of flat storage, or the flat head. There are two natural options +for which block should be the flat head, the chain head, or the last final block. Although +both options could work, the implementation we propose will use the last final block because +it is simpler. + +To support key value lookups for other blocks that are not the flat head, FlatStorage will +also store key value changes(deltas) per block for these blocks. +We call these deltas FlatStorageDelta (FSD). Let’s say the flat storage head is at block h, +and we are applying transactions based on block h’. Since h is the last final block, +h is an ancestor of h'. To access the state at block h', we need FSDs of all blocks between h and h'. +Note that all these FSDs must be stored in memory, otherwise, the access of FSDs will trigger +more disk reads and we will have to set storage key read fee higher. However, the consensus algorithm doesn’t provide any guarantees in the distance of blocks that we need to process since it could be arbitrarily long for a block to be finalized. To solve this problem, we make another proposal (TODO: attach link for the proposal) to set gas limit to zero for blocks with height larger than the latest final block’s height + X. -This way, flat storage only needs to store FSDs for blocks with height less than the latest -final block’s height + X. And since there can be at most one valid blocks per height, +If the gas limit is set to zero for a block, it won't contain any transactions or receipts, +and FlatStorage won't need to store the delta for this block. +With this change, FlatStorage only needs to store FSDs for blocks with height less than the latest +final block’s height + X. And since there can be at most one valid block per height, FlatStorage only needs to store at most X FSDs in memory. 
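To make the FSD mechanism concrete, here is a minimal sketch of how a lookup could combine the in-memory deltas with the on-disk flat state; the names (`FlatStorage`, `Delta`, `lookup`) are illustrative stand-ins, not the actual nearcore API:

```rust
use std::collections::HashMap;

type Key = Vec<u8>;
type Value = Vec<u8>;
type BlockHash = [u8; 32];

struct Delta {
    prev_block: BlockHash,
    /// `None` means the key was deleted in this block.
    changes: HashMap<Key, Option<Value>>,
}

struct FlatStorage {
    flat_head: BlockHash,
    /// In-memory deltas for every block between the flat head and the chain head.
    deltas: HashMap<BlockHash, Delta>,
}

impl FlatStorage {
    /// Resolve `key` as of `block` by walking deltas back towards the flat head.
    fn lookup(&self, mut block: BlockHash, key: &[u8]) -> Option<Value> {
        while block != self.flat_head {
            let delta = &self.deltas[&block];
            if let Some(change) = delta.changes.get(key) {
                // The most recent block that touched the key wins.
                return change.clone();
            }
            block = delta.prev_block;
        }
        // Key untouched since the flat head: at most 2 disk reads against flat state.
        self.read_from_flat_state_on_disk(key)
    }

    fn read_from_flat_state_on_disk(&self, _key: &[u8]) -> Option<Value> {
        None // placeholder for the on-disk read
    }
}
```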
### FSD size estimation @@ -77,36 +121,41 @@ To sum it up, we will have < 54 MiB for one block, and ~1.1 GiB for 20 blocks. Note that if we store a value instead of value ref, size of FSDs can potentially be much larger. Because value limit is 4 MiB, we can’t apply previous argument about base cost. Since `wasm_storage_write_value_byte` = 31 Mgas, one FSD size can be estimated as -`(1300 Tgas / storage_write_value_byte cost * num_shards)`, or ~170 MiB, +`(1300 Tgas / min(storage_write_value_byte, storage_write_key_byte) * num_shards)`, or ~170 MiB, which is 3 times higher. -// TODO: From the above calculation, if we store +The advantage of storing values instead of value refs is that it saves one disk read if the key has been +modified in the recent blocks. It may be beneficial if we get many transactions or receipts touching the same +trie keys in consecutive blocks, but it is hard to estimate the value of such benefits without more data. +Since storing values will cost much more memory than value refs, we will likely choose to store value refs +in FSDs and set X to a value between 10 and 20. ### Storage Writes +// TODO ## Reference Implementation -The following are the important structs that will be implemented in flat storage. +FlatStorage will implement the following structs. + +`FlatStateDelta`: a HashMap that contains state changes introduced in a block. They can be applied +on top the state at flat head to compute state at another block. -`FlatState`: It provides an interface to get value or value references from flat storage. It +`FlatState`: provides an interface to get value or value references from flat storage. It will be part of `Trie`, and all trie reads will be directed to the FlatState object. A `FlatState` object is based on a block `block_hash`, and it provides key value lookups on the state after the block `block_hash` is applied. -`ShardFlatStates`: It provides an interface to construct `FlatState` for each shard. +`ShardFlatStates`: provides an interface to construct `FlatState` for each shard. -`FlatStorageState`: It stores some information about the state of the flat storage itself, +`FlatStorageState`: stores information about the state of the flat storage itself, for example, all block deltas that are stored in flat storage and the flat storage head. `FlatState` can access `FlatStorageState` to get the list of deltas it needs to apply on top of state of current flat head in order to compute state of a target block. -`FlatStateDelta`: a HashMap that contains state changes introduced in a block. They can be applied -on top the state at flat head to compute state at another block. - It may be noted that in this implementation, a separate `FlatState` and `FlatStorageState` will be created for each shard. The reason is that there are two modes of block processing, normal block processing and block catchups. -Since they are performed on different range of blocks, flat storage need to be able to support +Since they are performed on different ranges of blocks, flat storage need to be able to support different range of blocks on different shards. Therefore, we separate the flat storage objects used for different shards. @@ -120,10 +169,28 @@ based on the state of the block at flat storage head. state changes introduced in the given shard of the given block. - *Rows*: `{ shard_id, block_hash }` - *Column type*: `FlatStateDelta` +Note that `FlatStateDelta`s needed are stored in memory, so during block processing this column won't be used + at all. 
This column is only used to load deltas into memory at `FlatStorageState` initialization time when the node starts.

`DBCol::FlatStateHead` stores the flat head at different shards.
- *Rows*: `shard_id`
- *Column type*: `CryptoHash`
+Similarly, the flat head is also stored in `FlatStorageState` in memory, so this column is only used to initialize
+ `FlatStorageState` when the node starts.
+
+### `FlatStateDelta`
+`FlatStateDelta` stores a mapping from trie keys to value refs. If the value is `None`, it means the key is deleted
+in the block.
+```rust
+pub struct FlatStateDelta(HashMap<Vec<u8>, Option<ValueRef>>);
+```
+
+```rust
+pub fn from_state_changes(changes: &[RawStateChangesWithTrieKey]) -> FlatStateDelta
+```
+Converts raw state changes to a flat state delta. The raw state changes will be returned as part of the result of
+`Runtime::apply_transactions`. They will be converted to `FlatStateDelta` to be added
+to `FlatStorageState` during `Chain::post_process_block`.

### ```FlatState```
```FlatState``` will be created for a shard `shard_id` and a block `block_hash`, and it can perform
@@ -146,14 +213,17 @@ flat_storage_state: FlatStorageState,
```

```FlatState``` will provide the following interface.
```rust
-/// get_ref returns the value or value reference corresponding to the given `key`
-/// for the state that this `FlatState` object represents, i.e., the state that after
-/// block `self.block_hash` is applied.
pub fn get_ref(
&self,
key: &[u8],
) -> Result<Option<ValueOrValueRef>, StorageError>
```
+Returns the value or value reference corresponding to the given `key`
+for the state that this `FlatState` object represents, i.e., the state after
+block `self.block_hash` is applied.
+
+`FlatState` will be stored as a field in `Tries`.
+

### ```ShardFlatStates```
`ShardFlatStates` will be stored as part of `ShardTries`. Similar to how `ShardTries` is used to
construct new `Trie` objects given a state root and a shard id, `ShardFlatStates` is used to construct
a new `FlatState` object given a block hash and a shard id.

```rust
pub fn new_flat_state_for_shard(
&self,
shard_id: ShardId,
block_hash: Option<CryptoHash>,
) -> FlatState
```
+Creates a new `FlatState` to be used for performing key value lookups on the state of shard `shard_id`
+after block `block_hash` is applied.
+
+```rust
+pub fn get_flat_storage_state_for_shard(
+    &self,
+    shard_id: ShardId,
+) -> Result<FlatStorageState, FlatStorageError>
+```
+Returns the `FlatStorageState` for the shard `shard_id`. This function is needed because even though
+`FlatStorageState` is part of `Runtime`, `Chain` also needs access to `FlatStorageState` to update the flat head.
+We will also create a function with the same name in `Runtime` that calls this function to provide `Chain` with access
+to `FlatStorageState`.

### ```FlatStorageState```
`FlatStorageState` is created per shard. It provides information about which blocks the flat storage
on the given shard currently supports and what block deltas need to be applied on top of the stored
flat state on disk to get the state of the target block.

```rust
fn get_deltas_between_blocks(
&self,
target_block_hash: &CryptoHash,
) -> Result<Vec<Arc<FlatStateDelta>>, FlatStorageError>
```
+Returns the list of deltas between the blocks `target_block_hash` (inclusive) and flat head (exclusive).
+Returns an error if `target_block_hash` is not a direct descendant of the current flat head.
+This function will be used in `FlatState::get_ref`.

```rust
fn update_flat_head(&self, new_head: &CryptoHash) -> Result<(), FlatStorageError>
```
+Updates the head of the flat storage, including updating the flat head in memory and on disk,
+updating the flat state on disk to reflect the state at the new head, and garbage collecting the `FlatStateDelta`s that
+are no longer needed from memory and from disk.
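As an illustration of the head-update steps just described, here is a rough, self-contained sketch; the types are simplified stand-ins, and a real implementation would also have to persist these changes to the DB columns atomically:

```rust
use std::collections::HashMap;

type CryptoHash = [u8; 32];
type Delta = HashMap<Vec<u8>, Option<Vec<u8>>>;

struct FlatStorageStateInner {
    flat_head: CryptoHash,
    /// Per-block deltas for blocks after the flat head, together with parent hashes.
    deltas: HashMap<CryptoHash, (CryptoHash, Delta)>,
    /// Stand-in for the on-disk flat state column.
    flat_state: HashMap<Vec<u8>, Vec<u8>>,
}

impl FlatStorageStateInner {
    fn update_flat_head(&mut self, new_head: CryptoHash) {
        // 1. Collect deltas on the path from `new_head` back to the old head.
        let mut path = Vec::new();
        let mut block = new_head;
        while block != self.flat_head {
            let (parent, delta) = self.deltas[&block].clone();
            path.push((block, delta));
            block = parent;
        }
        // 2. Apply them oldest-first so the flat state reflects `new_head`; on disk
        //    this would update DBCol::FlatState and DBCol::FlatStateHead as well.
        path.reverse();
        for (hash, delta) in path {
            for (key, value) in delta {
                match value {
                    Some(v) => {
                        self.flat_state.insert(key, v);
                    }
                    None => {
                        self.flat_state.remove(&key);
                    }
                }
            }
            // 3. Garbage collect the delta, in memory and in DBCol::FlatStateDeltas.
            self.deltas.remove(&hash);
        }
        self.flat_head = new_head;
    }
}
```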
```rust
fn add_delta(
&self,
block_hash: &CryptoHash,
delta: FlatStateDelta,
) -> Result<StoreUpdate, FlatStorageError>
```
+Adds `delta` to `FlatStorageState` and returns a `StoreUpdate` object that includes the disk write for storing
+the delta in `DBCol::FlatStateDeltas`.

#### Thread Safety
We should note that the implementation of `FlatStorageState` must be thread safe because it can
be concurrently accessed by multiple threads. A node can process multiple blocks at the same time
if they are on different forks. Therefore, `FlatStorageState` will be guarded by a `RwLock` so its
access can be shared safely.

```rust
pub struct FlatStorageState(Arc<RwLock<FlatStorageStateInner>>);
```

From 8fa49eed63d7d33898afc06c722dccab26542e17 Mon Sep 17 00:00:00 2001
From: Min Zhang
Date: Sun, 2 Oct 2022 18:25:01 -0400
Subject: [PATCH 04/24] add discussion for storage writes

--- neps/nep-0399.md | 78 +++++++++++++++++++++++++----------------------- 1 file changed, 41 insertions(+), 37 deletions(-)

diff --git a/neps/nep-0399.md b/neps/nep-0399.md index dab606046..59549c1e2 100644 --- a/neps/nep-0399.md +++ b/neps/nep-0399.md @@ -29,19 +29,23 @@

-- Why is this design the best in the space of possible designs?
-There are other ideas for how to improve storage performance, such as using
+Q: Why is this design the best in the space of possible designs?
+
+A: There are other ideas for how to improve storage performance, such as using
 other database instead of rocksdb, or changing the representation of states
 to achieve locality of data in the same account. Considering that these ideas
 will likely require much more work than FlatStorage, FlatStorage is a good investment
 of our effort to achieve better storage performances. In addition, the improvement
 from FlatStorage can be combined with the improvement brought by these other ideas,
 so the implementation of FlatStorage won't be rendered obsolete in the future.
-- What other designs have been considered and what is the rationale for not choosing them?
-Alternatively, we can still get rid of TTN cost by increasing the base fees for storage reads and writes. However,
+
+Q: What other designs have been considered and what is the rationale for not choosing them?
+
+A: Alternatively, we can still get rid of TTN cost by increasing the base fees for storage reads and writes. However,
 this could require increasing the fees by quite a lot, which could end up breaking many contracts.
-- What is the impact of not doing this?
-Storage reads will still be inefficiently implemented and cost more than it could be.
+
+Q: What is the impact of not doing this?
+
+A: Storage reads will remain inefficiently implemented and cost more than it could be.

## Specification
The key idea of FlatStorage is to store a direct mapping from trie keys to values on disk.

### Storage Writes
-// TODO
+Currently, storage writes are charged based on the number of touched trie nodes (TTN cost), because updating the leaf trie
+node which stores the value to the trie key requires updating all trie nodes on the path leading to the leaf node.
+All writes are committed at once in one db transaction at the end of block processing, outside of runtime after
+all receipts in a block are executed. However, at the time of execution, runtime needs to calculate the cost,
+which means it needs to know how many trie nodes the write affects, so runtime will issue a read for every write
+to calculate the TTN cost for the write. Such reads cannot be replaced by a read in FlatStorage because FlatStorage does
+not provide the path to the trie node.
+
+There are multiple proposals on how storage writes can work with FlatStorage.
+- Keep it the same.
The cost of writes remain the same. Note that this can increase the cost for writes in + some cases, for example, if a contract first read from a key and then writes to the same key in the same chunk. + Without FlatStorage, the key will be cached in the chunk cache after the read, so the write will cost less. + With FlatStorage, the read will go through FlatStorage, the write will not find the key in the chunk cache and + it will cost more. +- Remove the TTN cost from storage write fees. Currently, there are two ideas in this direction. + - Charge based on maximum depth of a contract’s state, instead of per-touch-trie node. + - Charge based on key length only. + + Both of the above ideas would allow us to remove writes from the critical path of block execution. However, + it is unclear at this point what the new cost would look like and whether further optimizations are needed + to bring down the cost for writes in the new cost model. + +### Migration Plan +// TODO ## Reference Implementation FlatStorage will implement the following structs. @@ -224,7 +251,6 @@ block `self.block_hash` is applied. `FlatState` will be stored as a field in `Tries`. - ###```ShardFlatStates``` `ShardFlatStates` will be stored as part of `ShardTries`. Similar to how `ShardTries` is used to construct new `Trie` objects given a state root and a shard id, `ShardFlatStates` is used to construct @@ -292,44 +318,22 @@ access can be shared safely. pub struct FlatStorageState(Arc>); ``` -## Security Implications (Optional) - -If there are security concerns in relation to the NEP, those concerns should be explicitly written out to make sure reviewers of the NEP are aware of them. - ## Drawbacks (Optional) Why should we *not* do this? -## Unresolved Issues (Optional) +## Unresolved Issues -### Storage Writes -### Fees -### Migration Plan +As we discussed in Section Specification, there are still unanswered questions around how the new cost model for storage +writes would look like and how the current storage can be upgraded to enabled FlatStorage. We expect to finalize +the migration plan before this NEP gets merged, but we might need more time to collect data and measurement around +storage write costs, which can be only be collected after FlatStorage is partially implemented. -- What parts of the design do you expect to resolve through the NEP process before this gets merged? -- What parts of the design do you expect to resolve through the implementation of this feature before stabilization? -- What related issues do you consider out of scope for this NEP that could be addressed in the future independently of the solution that comes out of this NEP? +Another big unanswered question is how FlatStorage would work when challenges are enabled. We consider that to be out of +the scope of this NEP because the details of how challenges will be implemented are not clear yet. ## Future possibilities -Think about what the natural extension and evolution of your proposal would -be and how it would affect the project as a whole in a holistic -way. Try to use this section as a tool to more fully consider all possible -interactions with the project in your proposal. -Also consider how the this all fits into the roadmap for the project -and of the relevant sub-team. - -This is also a good place to "dump ideas", if they are out of scope for the -NEP you are writing but otherwise related. - -If you have tried and cannot think of any future possibilities, -you may simply state that you cannot think of anything. 
- -Note that having something written down in the future-possibilities section -is not a reason to accept the current or a future NEP. Such notes should be -in the section on motivation or rationale in this or subsequent NEPs. -The section merely provides additional information. - ## Copyright [copyright]: #copyright From 3fa77c9585efb31e6a911c76dd9336dc3a1e9547 Mon Sep 17 00:00:00 2001 From: Min Zhang Date: Tue, 4 Oct 2022 17:16:44 -0400 Subject: [PATCH 05/24] address comments --- neps/nep-0399.md | 43 ++++++++++++++++++++++++------------------- 1 file changed, 24 insertions(+), 19 deletions(-) diff --git a/neps/nep-0399.md b/neps/nep-0399.md index 59549c1e2..663fb8f0c 100644 --- a/neps/nep-0399.md +++ b/neps/nep-0399.md @@ -6,33 +6,35 @@ DiscussionsTo: https://github.com/nearprotocol/neps/pull/0399 Status: Draft Type: Protocol Track Category: Storage -Created: 07-Sep-2022 +Created: 30-Sep-2022 --- ## Summary Currently, the state of blockchain is stored in our storage in the format of persistent merkelized tries. -Although the trie structure is needed to compute state roots and prove the validity of states, it is expensive -to read from the trie structure because it requires a traversal from the trie root to the leaf that contains the key -value pair, which could mean 2 * key_length of disk access in the worst case. - -In addition, we charge receipts by the number of trie nodes they touched (TTN cost), -which is both confusing to developers and unpredictable. This NEP proposes the idea of FlatStorage, -which stores a flattened key/value pairs of the current state on disk. This way, any storage read requires at most -2 disk reads. As a result, we can make storage reads faster, decrease the fees, and get rid of the TTN -cost. +Although the trie structure is needed to compute state roots and prove the validity of states, reading from it +because it requires a traversal from the trie root to the leaf node that contains the key +value pair, which could mean up to 2 * key_length of disk access in the worst case. + +In addition, we charge receipts by the number of trie nodes they touched (TTN cost). Note that the number +of touched trie node does not always equal to the key length, it depends on the internal trie structure. +As a result, this cost is confusing and hard to be estimated for developers. +This NEP proposes the idea of FlatStorage, which stores a flattened map of key/value pairs of the current state on disk. +Note that the original trie structure will not be removed. With FlatStorage, +any storage read requires at most 2 disk reads. As a result, we can make storage reads faster, +decrease the fees, and get rid of the TTN cost for storage reads. ## Motivation The motivation of this proposal is to increase performance of storage reads, reduce storage read costs and -simplifies how storage fees are charged by getting rid of TTN cost. +simplify how storage fees are charged by getting rid of TTN cost for storage reads. ## Rationale and alternatives Q: Why is this design the best in the space of possible designs? A: There are other ideas for how to improve storage performance, such as using - other database instead of rocksdb, or changing the representation of states + other databases instead of rocksdb, or changing the representation of states to achieve locality of data in the same account. Considering that these ideas will likely require much more work than FlatStorage, FlatStorage is a good investment of our effort to achieve better storage performances. 
In addition, the improvement @@ -43,9 +45,10 @@ Q: What other designs have been considered and what is the rationale for not cho A: Alternatively, we can still get rid of TTN cost by increasing the base fees for storage reads and writes. However, this could require increasing the fees by quite a lot, which could end up breaking many contracts. + Q: What is the impact of not doing this? -A: Storage reads will remain inefficiently implemented and cost more than it could be. +A: Storage reads will remain inefficiently implemented and cost more than they should. ## Specification The key idea of FlatStorage is to store a direct mapping from trie keys to values on disk. @@ -79,13 +82,15 @@ the design should satisfy the following requirements, the fees we charge must be consistent. The mapping of key value pairs FlatStorage stored on disk matches the state at some block. -We call this block the head of flat storage, or the flat head. There are two natural options -for which block should be the flat head, the chain head, or the last final block. Although -both options could work, the implementation we propose will use the last final block because -it is simpler. +We call this block the head of flat storage, or the flat head. During block processing, +the flat head is set to the last final block. The Doomslug consensus algorithm +guarantees that if a block is final, all future final blocks must be descendants of this block. +In other words, any block that is not built on top of the last final block can be discarded because they +will never be finalized. As a result, if we use the last final block as the flat head, any block +FlatStorage needs to process is a descendant of the flat head. To support key value lookups for other blocks that are not the flat head, FlatStorage will -also store key value changes(deltas) per block for these blocks. +store key value changes(deltas) per block for these blocks. We call these deltas FlatStorageDelta (FSD). Let’s say the flat storage head is at block h, and we are applying transactions based on block h’. Since h is the last final block, h is an ancestor of h'. To access the state at block h', we need FSDs of all blocks between h and h'. @@ -155,7 +160,7 @@ There are multiple proposals on how storage writes can work with FlatStorage. Both of the above ideas would allow us to remove writes from the critical path of block execution. However, it is unclear at this point what the new cost would look like and whether further optimizations are needed - to bring down the cost for writes in the new cost model. + to bring down the cost for writes in the new cost model. ### Migration Plan // TODO From 548e33ddea239b6efb2ee5874968d97a3cd762f9 Mon Sep 17 00:00:00 2001 From: Min Zhang Date: Tue, 4 Oct 2022 22:54:05 -0400 Subject: [PATCH 06/24] add more sections --- neps/nep-0399.md | 37 ++++++++++++++++++++++++++++++++----- 1 file changed, 32 insertions(+), 5 deletions(-) diff --git a/neps/nep-0399.md b/neps/nep-0399.md index 663fb8f0c..0b194ffc5 100644 --- a/neps/nep-0399.md +++ b/neps/nep-0399.md @@ -163,7 +163,26 @@ There are multiple proposals on how storage writes can work with FlatStorage. to bring down the cost for writes in the new cost model. ### Migration Plan -// TODO +There are two main questions regarding to how to enable FlatStorage. +1) Whether there should be database migration. The main challenge of enabling FlatStorage will be to build the flat state + column, which requires iterating the entire state. 
We currently estimate that it takes 1 hour to build
+ flat state for archival nodes and 15 minutes for rpc and validator nodes. Note that this estimation is very rough
+ and further verification is needed. The main concern is that if it takes too long for archival nodes to migrate,
+ they may have a hard time catching up later since the block processing speed of archival nodes is not very fast.
+
+ Alternatively, we can build the flat state in a background process while the node is running. This provides a better
+ experience for both archival and validator nodes since the migration process is transparent to them. It would require
+ more implementation effort from our side.
+
+ To make a decision, we will verify the time it takes to build flat state. If it would cause a problem for archival nodes
+ to catch up, we will implement the background migration process.
+2) Whether there should be a protocol upgrade. The enabling of FlatStorage itself does not require a protocol upgrade, since
+it is an internal storage implementation that doesn't change anything at the protocol level. However, a protocol upgrade is needed
+ if we want to adjust fees based on the storage performance with FlatStorage. These two changes can happen in one release,
+ or we can release them separately. We propose that the enabling of FlatStorage and the protocol upgrade
+ to adjust fees should happen in separate releases to reduce the risk. The period between the two releases can be
+ used to test the stability and performance of FlatStorage. Because it is not a protocol change, it is easy to roll back
+ the change in case any issue arises.

## Reference Implementation
FlatStorage will implement the following structs.
@@ -224,7 +273,7 @@ block `self.block_hash` is applied.
-`FlatState` will be stored as a field in `Tries`.
+`FlatState` will be stored as a field in `Tries`.

### ```ShardFlatStates```

## Drawbacks

Implementing FlatStorage will require a lot of engineering effort and introduce code that will make the codebase more
complicated. We are confident that FlatStorage will bring a lot of performance benefits, but we can only measure the exact
improvement after the implementation. In the very unlikely case, we may find that the benefit FlatStorage brings is not
worth the effort.

Another issue is that it will make state rollbacks harder in the future when we enable challenges in phase 2 of sharding.
When a challenge is accepted and the state needs to be rolled back to a previous block, the entire flat state needs to
be rebuilt, which could take a long time.

## Unresolved Issues

As we discussed in the Specification section, there are still unanswered questions around what the new cost model for storage
writes would look like and how the current storage can be upgraded to enable FlatStorage. We expect to finalize
the migration plan before this NEP gets merged, but we might need more time to collect data and measurements around
storage write costs, which can only be collected after FlatStorage is partially implemented.

Another big unanswered question is how FlatStorage would work when challenges are enabled. We consider that to be out of
the scope of this NEP because the details of how challenges will be implemented are not clear yet.
But this is something +we need to consider when we design challenges. ## Future possibilities From 181b65bc9e544ccfb86dff49e22dd86c335f4e5b Mon Sep 17 00:00:00 2001 From: mzhangmzz <34969888+mzhangmzz@users.noreply.github.com> Date: Thu, 6 Oct 2022 15:03:35 -0400 Subject: [PATCH 07/24] Update neps/nep-0399.md Co-authored-by: Marcelo Fornet --- neps/nep-0399.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/neps/nep-0399.md b/neps/nep-0399.md index 0b194ffc5..da6f11704 100644 --- a/neps/nep-0399.md +++ b/neps/nep-0399.md @@ -13,7 +13,7 @@ Created: 30-Sep-2022 Currently, the state of blockchain is stored in our storage in the format of persistent merkelized tries. Although the trie structure is needed to compute state roots and prove the validity of states, reading from it -because it requires a traversal from the trie root to the leaf node that contains the key +requires a traversal from the trie root to the leaf node that contains the key value pair, which could mean up to 2 * key_length of disk access in the worst case. In addition, we charge receipts by the number of trie nodes they touched (TTN cost). Note that the number From 51e35aba18ce771cc8e983e206a4891e6cf2b682 Mon Sep 17 00:00:00 2001 From: Aleksandr Logunov Date: Thu, 23 Feb 2023 16:16:45 +0400 Subject: [PATCH 08/24] Update neps/nep-0399.md Co-authored-by: Akhilesh Singhania --- neps/nep-0399.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/neps/nep-0399.md b/neps/nep-0399.md index da6f11704..4a8e48f08 100644 --- a/neps/nep-0399.md +++ b/neps/nep-0399.md @@ -26,8 +26,8 @@ decrease the fees, and get rid of the TTN cost for storage reads. ## Motivation -The motivation of this proposal is to increase performance of storage reads, reduce storage read costs and -simplify how storage fees are charged by getting rid of TTN cost for storage reads. +The motivation of this proposal is to increase performance of storage reads, reduce storage read gas fees and +simplify how storage gas fees are charged by getting rid of TTN cost for storage reads. ## Rationale and alternatives From 79f324e1de46719b6606837cf82a1a8b4491bb65 Mon Sep 17 00:00:00 2001 From: Aleksandr Logunov Date: Thu, 23 Feb 2023 16:17:08 +0400 Subject: [PATCH 09/24] Update neps/nep-0399.md Co-authored-by: Akhilesh Singhania --- neps/nep-0399.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/neps/nep-0399.md b/neps/nep-0399.md index 4a8e48f08..6afd1a1f1 100644 --- a/neps/nep-0399.md +++ b/neps/nep-0399.md @@ -97,7 +97,7 @@ h is an ancestor of h'. To access the state at block h', we need FSDs of all blo Note that all these FSDs must be stored in memory, otherwise, the access of FSDs will trigger more disk reads and we will have to set storage key read fee higher. -However, the consensus algorithm doesn’t provide any guarantees in the distance of blocks +However, the Doomslug consensus algorithm doesn’t provide any guarantees in the distance of blocks that we need to process since it could be arbitrarily long for a block to be finalized. To solve this problem, we make another proposal (TODO: attach link for the proposal) to set gas limit to zero for blocks with height larger than the latest final block’s height + X. 
From dabaea3af8954e88d4974c228657b8ca00d11139 Mon Sep 17 00:00:00 2001 From: Aleksandr Logunov Date: Thu, 23 Feb 2023 16:17:29 +0400 Subject: [PATCH 10/24] Update neps/nep-0399.md Co-authored-by: Akhilesh Singhania --- neps/nep-0399.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/neps/nep-0399.md b/neps/nep-0399.md index 6afd1a1f1..b185b63ea 100644 --- a/neps/nep-0399.md +++ b/neps/nep-0399.md @@ -14,7 +14,7 @@ Created: 30-Sep-2022 Currently, the state of blockchain is stored in our storage in the format of persistent merkelized tries. Although the trie structure is needed to compute state roots and prove the validity of states, reading from it requires a traversal from the trie root to the leaf node that contains the key -value pair, which could mean up to 2 * key_length of disk access in the worst case. +value pair, which could mean up to 2 * key_length disk accesses in the worst case. In addition, we charge receipts by the number of trie nodes they touched (TTN cost). Note that the number of touched trie node does not always equal to the key length, it depends on the internal trie structure. From a9ccf71a27b92c0259081cf7d7a75347c6a527ec Mon Sep 17 00:00:00 2001 From: Aleksandr Logunov Date: Thu, 23 Feb 2023 16:17:58 +0400 Subject: [PATCH 11/24] Update neps/nep-0399.md Co-authored-by: Akhilesh Singhania --- neps/nep-0399.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/neps/nep-0399.md b/neps/nep-0399.md index b185b63ea..c1ad5bc32 100644 --- a/neps/nep-0399.md +++ b/neps/nep-0399.md @@ -54,7 +54,7 @@ A: Storage reads will remain inefficiently implemented and cost more than they s The key idea of FlatStorage is to store a direct mapping from trie keys to values on disk. Here the values of this mapping can be either the value corresponding to the trie key itself, or the value ref, a hash that points to the address of the value. If the value itself is stored, -only one disk read is needed to look up a value from flat storage, otherwise two disk reads if the value +only one disk read is needed to look up a value from flat storage, otherwise two disk reads are needed if the value ref is stored. We will discuss more in the following section for whether we use values or value refs. For the purpose of high level discussion, it suffices to say that with FlatStorage, at most two disk reads are needed to perform a storage read. From 4ce8b6004d9c9ae098ec50d023df7b97b3ee5d28 Mon Sep 17 00:00:00 2001 From: Longarithm Date: Thu, 23 Feb 2023 17:19:11 +0400 Subject: [PATCH 12/24] rewrite summary+motivation --- neps/nep-0399.md | 31 +++++++++++++++++-------------- 1 file changed, 17 insertions(+), 14 deletions(-) diff --git a/neps/nep-0399.md b/neps/nep-0399.md index c1ad5bc32..3e03f01cf 100644 --- a/neps/nep-0399.md +++ b/neps/nep-0399.md @@ -11,23 +11,26 @@ Created: 30-Sep-2022 ## Summary -Currently, the state of blockchain is stored in our storage in the format of persistent merkelized tries. -Although the trie structure is needed to compute state roots and prove the validity of states, reading from it -requires a traversal from the trie root to the leaf node that contains the key -value pair, which could mean up to 2 * key_length disk accesses in the worst case. - -In addition, we charge receipts by the number of trie nodes they touched (TTN cost). Note that the number -of touched trie node does not always equal to the key length, it depends on the internal trie structure. -As a result, this cost is confusing and hard to be estimated for developers. 
-This NEP proposes the idea of FlatStorage, which stores a flattened map of key/value pairs of the current state on disk. -Note that the original trie structure will not be removed. With FlatStorage, -any storage read requires at most 2 disk reads. As a result, we can make storage reads faster, -decrease the fees, and get rid of the TTN cost for storage reads. +This NEP proposes the idea of Flat Storage, which stores a flattened map of key/value pairs of the current +blockchain state on disk. Note that original Trie (persistent merkelized trie) is not removed, but Flat Storage +allows to make storage reads faster, make storage fees more predictable and potentially decrease them. ## Motivation -The motivation of this proposal is to increase performance of storage reads, reduce storage read gas fees and -simplify how storage gas fees are charged by getting rid of TTN cost for storage reads. +Currently, the blockchain state is stored in our storage only in the format of persistent merkelized tries. +Although it is needed to compute state roots and prove the validity of states, reading from it requires a +traversal from the trie root to the leaf node that contains the key value pair, which could mean up to +2 * key_length disk accesses in the worst case. + +In addition, we charge receipts by the number of trie nodes they touched (TTN cost). Note that the number +of touched trie node does not always equal to the key length, it depends on the internal trie structure. +Based on some feedback from contract developers collected in the past, they are interested in predictable fees, +but TTN costs are annoying to predict and can lead to unexpected excess of the gas limit. They are also a burden +for NEAR Protocol client implementations, i.e. nearcore, as exact TTN number must be computed deterministically +by all clients. This prevents storage optimizations that use other strategies than nearcore uses today. + +With Flat Storage, number of disk reads reduces from 2 * key_length to 2, storage read gas fees are simplified +by getting rid of TTN cost, and potentially can be reduced because less disk reads are needed. ## Rationale and alternatives From 809095e023d5e2031068f249ab02fa78e791d725 Mon Sep 17 00:00:00 2001 From: Longarithm Date: Fri, 24 Feb 2023 16:20:14 +0400 Subject: [PATCH 13/24] size estimations --- neps/nep-0399.md | 168 +++++++++++++++++++++++++---------------------- 1 file changed, 91 insertions(+), 77 deletions(-) diff --git a/neps/nep-0399.md b/neps/nep-0399.md index 3e03f01cf..fdd818f2e 100644 --- a/neps/nep-0399.md +++ b/neps/nep-0399.md @@ -29,44 +29,69 @@ but TTN costs are annoying to predict and can lead to unexpected excess of the g for NEAR Protocol client implementations, i.e. nearcore, as exact TTN number must be computed deterministically by all clients. This prevents storage optimizations that use other strategies than nearcore uses today. -With Flat Storage, number of disk reads reduces from 2 * key_length to 2, storage read gas fees are simplified -by getting rid of TTN cost, and potentially can be reduced because less disk reads are needed. +With Flat Storage, number of disk reads is reduced from worst-case 2 * key_length to exactly 2, storage read gas +fees are simplified by getting rid of TTN cost, and potentially can be reduced because less disk reads are needed. ## Rationale and alternatives Q: Why is this design the best in the space of possible designs? 
-A: There are other ideas for how to improve storage performance, such as using - other databases instead of rocksdb, or changing the representation of states - to achieve locality of data in the same account. Considering that these ideas - will likely require much more work than FlatStorage, FlatStorage is a good investment - of our effort to achieve better storage performances. In addition, the improvement - from FlatStorage can be combined with the improvement brought by these other ideas, - so the implementation of FlatStorage won't be rendered obsolete in the future. +A: Space of possible designs is quite big here, let's show some meaningful examples. + +The most straightforward one is just to increase TTN, or align it with biggest possible value, or alternatively increase +the base fees for storage reads and writes. However, technically the biggest TTN could be 4096 as of today. And we tend +to strongly avoid increasing fees, because it may break existing contract calls, not even mentioning that it would greatly +reduce capacity of NEAR blocks, because for current mainnet usecases depth is usually below 20. + +allowing to make number of traversed nodes very stable and predictable. But implementation is tricky, we didn't move +significantly further POC here: https://github.com/near/nearcore/discussions/4815. Also for most key-value pairs +tree depth will actually increase - for example, if you have 1M keys, depth is always 20, and it would cause increasing +fees as well. + +Separate idea is to get rid of global state root completely: https://github.com/near/NEPs/discussions/425. Instead, +we could track latest account state in the block where it was changed. But after closer look, it brings us to +similar questions - if some key was untouched for a long time, it becomes harder to find exact block to find latest +value for it, and we need some tree-shaped structure again. Because ideas like that could be also extremely invasive, +we stopped considering them at this point. + +Another ideas around storage include exploring databases other than RocksDB, or moving State to a separate database. +We can tweak State representation, i.e. start trie key from account id to achieve data locality within +the same account. However, Flat Storage main goal is speed up reads and make their costs predictable, and these ideas +are orthogonal to that, although they still can improve storage in other ways. Q: What other designs have been considered and what is the rationale for not choosing them? -A: Alternatively, we can still get rid of TTN cost by increasing the base fees for storage reads and writes. However, - this could require increasing the fees by quite a lot, which could end up breaking many contracts. +A: There were several ideas on Flat Storage implementation. One of questions is whether we need global flat storage +for all shards or separate flat storages. Due to how sharding works, we have make flat storages separate, because in +the future node may have to catchup new shard while already tracking old shards, and flat storage heads (see +Specification) must be different for these shards. + +Flat storage deltas are another tricky part of design, but we cannot avoid them, because at certain points of time +different nodes can disagree what is the current chain head, and they have to support reads for some subset of latest +blocks with decent speed. We can't really fallback to Trie in such cases because storage reads for it are much slower. 
+
+Another implementation detail is where to put flat storage head. The planned implementation doesn't rely on that
+significantly and can be changed, but for MVP we assume flat storage head = chain final head as the simplest solution.

Q: What is the impact of not doing this?

-A: Storage reads will remain inefficiently implemented and cost more than they should.
+A: Storage reads will remain inefficiently implemented and cost more than they should, and the gas fees will remain
+difficult for the contract developers to predict.

## Specification

-The key idea of FlatStorage is to store a direct mapping from trie keys to values on disk.
+
+The key idea of Flat Storage is to store a direct mapping from trie keys to values on disk.
Here the values of this mapping can be either the value corresponding to the trie key itself,
or the value ref, a hash that points to the address of the value. If the value itself is stored,
only one disk read is needed to look up a value from flat storage, otherwise two disk reads are
needed if the value ref is stored. We will discuss in the following section whether we
use values or value refs.
-For the purpose of high level discussion, it suffices to say that with FlatStorage,
+For the purpose of high level discussion, it suffices to say that with Flat Storage,
at most two disk reads are needed to perform a storage read.

The simple design above won't work because there could be forks in the chain. In the following case,
FlatStorage must support key value lookups for states of the blocks on both forks.
```
- Block B1 - Block B2 - ...
- /
+ / Block B1 - Block B2 - ...
block A
\ Block C1 - Block C2 - ...
```
@@ -74,14 +99,15 @@ block A
The handling of forks will be the main consideration of the following design. More specifically, the design should
satisfy the following requirements,
1) It should support concurrent block processing. Blocks on different forks are processed
-   concurrently in our client code, so the flat storage API must support that.
+   concurrently in the nearcore Client code, the struct whose responsibility includes receiving blocks from the network,
+   scheduling chunk application and writing the results to disk. The flat storage API must be aligned with that.
2) In case of long forks, block processing time should not be too much longer than the average case.
   We don’t want this case to be exploitable. It is acceptable that block processing time is 200ms longer,
   which may slow down block production, but probably won’t cause missing blocks and chunks.
-   It is not acceptable if block processing time is 10s, which may lead to more forks and instability in the network.
+   10s delays are not acceptable and may lead to more forks and instability in the network.
3) The design must be able to decrease storage access cost in all cases,
   since we are going to change the storage read fees based on flat storage.
-   We can't conditionally enable Flat Storage for some blocks and disable it for others, because
+   We can't conditionally enable Flat Storage for some blocks and disable it for others, because
   the fees we charge must be consistent.

The mapping of key value pairs FlatStorage stores on disk matches the state at some block.
We call this block the head of flat storage, or "flat head". During block processing,
the flat head is set to the last final block. The Doomslug consensus algorithm
guarantees that if a block is final, all future final blocks must be descendants of this block.
In other words, if we make a flat storage snapshot at the last final block, the flat storage snapshot
will remain valid, since all future processing will be on top of the last final block.

We call these deltas FlatStorageDelta (FSD). Let’s say the flat storage head is at block h,
and we are applying transactions based on block h’. Since h is the last final block,
h is an ancestor of h'. To access the state at block h', we need FSDs of all blocks between h and h'. 
Note that all these FSDs must be stored in memory, otherwise, the access of FSDs will trigger
-more disk reads and we will have to set storage key read fee higher.
-
-However, the Doomslug consensus algorithm doesn’t provide any guarantees in the distance of blocks
-that we need to process since it could be arbitrarily long for a block to be finalized.
-To solve this problem, we make another proposal (TODO: attach link for the proposal) to
-set gas limit to zero for blocks with height larger than the latest final block’s height + X.
-If the gas limit is set to zero for a block, it won't contain any transactions or receipts,
-and FlatStorage won't need to store the delta for this block.
-With this change, FlatStorage only needs to store FSDs for blocks with height less than the latest
-final block’s height + X. And since there can be at most one valid block per height,
-FlatStorage only needs to store at most X FSDs in memory.
+more disk reads and we will have to set storage key read fee higher.

### FSD size estimation
-To set the value of X, we need to see how many block deltas can fit in memory.
-
-We can estimate FSD size using protocol fees.
-Assume that flat state stores a mapping from keys to value refs.
-Maximal key length is ~2 KiB which is the limit of contract data key size.
-During wasm execution, we pay `wasm_storage_write_base` = 64 Ggas per call and
-`wasm_storage_write_key_byte` = 70 Mgas per key byte.
-In the extreme case it means that we pay `(64_000 / 2 KiB + 70) Mgas ~= 102 Mgas` per byte.
-Then the total size of keys changed in a block is at most
-`block_gas_limit / gas_per_byte * num_shards = (1300 Tgas / 102 Mgas) * 4 ~= 50 MiB`.
-To estimate the sizes of value refs, there will be at most
-`block_gas_limit / wasm_storage_write_base * num_shards
-= 1300 Tgas / 64 Ggas * 4 = 80K` changed entries in a block.
-Since one value ref takes 40 bytes, limit of total size of changed value refs in a block
-is then 3.2 MiB.
+We prefer to store deltas in memory, because memory read is much faster than disk read, and even a single extra RocksDB
+access requires increasing storage fees, which is not desirable. To reduce delta size, we will store hashes of trie keys
+instead of keys, because deltas are read-only. Now let's carefully estimate FSD size.
+
+We can do so using protocol fees as of today. Assume that flat state stores a mapping from keys to value refs.
+Maximal key length is ~2 KiB which is the limit of contract data key size. During wasm execution, we pay
+`wasm_storage_write_base` = 64 Ggas per call. Entry size is 68 B for key hash and value ref.
+Then the total size of the delta for one block is at most
+`chunk_gas_limit / gas_per_entry * entry_size * num_shards = (1300 Tgas / 64 Ggas) * 68 B * 4 ~= 5.5 MiB`.

-To sum it up, we will have < 54 MiB for one block, and ~1.1 GiB for 20 blocks.
+Assuming that we can increase RAM requirements by 1 GiB, we can afford to store deltas for 100-200 blocks
+simultaneously.

Note that if we store a value instead of value ref, size of FSDs can potentially be much larger.
Because value limit is 4 MiB, we can’t apply previous argument about base cost.
-Since `wasm_storage_write_value_byte` = 31 Mgas, one FSD size can be estimated as
-`(1300 Tgas / min(storage_write_value_byte, storage_write_key_byte) * num_shards)`, or ~170 MiB,
-which is 3 times higher.
-
+Since `wasm_storage_write_value_byte` = 31 Mgas, the values' contribution to FSD size can be estimated as
+`(1300 Tgas / storage_write_value_byte * num_shards)`, or ~167 MiB. 
Same estimation for trie keys gives 54 MiB.

The advantage of storing values instead of value refs is that it saves one disk read if the key has been
modified in the recent blocks. It may be beneficial if we get many transactions or receipts touching the same
trie keys in consecutive blocks, but it is hard to estimate the value of such benefits without more data.
-Since storing values will cost much more memory than value refs, we will likely choose to store value refs
-in FSDs and set X to a value between 10 and 20.
+We may store only short values ("inlining"), but this idea is orthogonal and can be applied separately.

### Storage Writes
Currently, storage writes are charged based on the number of touched trie nodes (TTN cost), because updating the leaf trie
@@ -168,17 +176,16 @@ There are multiple proposals on how storage writes can work with FlatStorage.

### Migration Plan
There are two main questions regarding to how to enable FlatStorage.
1) Whether there should be database migration. The main challenge of enabling FlatStorage will be to build the flat state
-   column, which requires iterating the entire state. We currently estimate that it takes 1 hour to build
-   flat state for archival nodes and 15 minutes for rpc and validator nodes. Note that this estimation is very rough
-   and further verification is needed. The main concern is that if it takes too long for archival node to migrate,
+   column, which requires iterating the entire state. Estimations showed that it takes 10 hours to build
+   flat state for archival nodes and 5 hours for rpc and validator nodes using 8 threads. The main concern is that if
+   it takes too long for archival nodes to migrate,
   they may have a hard time catching up later since the block processing speed of archival nodes is not very
  fast.

   Alternatively, we can build the flat state in a background process while the node is running.
  This provides a better experience for both archival and validator nodes since the migration process
  is transparent to them. It would require more implementation effort from our side.

-  To make a decision, we will verify the time it takes to build flat state. If it will cause a problem for archival nodes
-  to catch up, we will implement the background migration process.
+  We currently proceed with background migration using 8 threads.

2) Whether there should be a protocol upgrade. The enabling of FlatStorage itself does not require a protocol upgrade, since
it is an internal storage implementation that doesn't change protocol level. However, a protocol upgrade is needed if we want to adjust fees based on the storage performance with FlatStorage. These two changes can happen in one release,

## Reference Implementation
FlatStorage will implement the following structs.
-`FlatStateDelta`: a HashMap that contains state changes introduced in a block. They can be applied
-on top the state at flat head to compute state at another block.
-
-`FlatState`: provides an interface to get value or value references from flat storage. It
-   will be part of `Trie`, and all trie reads will be directed to the FlatState object.
-   A `FlatState` object is based on a block `block_hash`, and it provides key value lookups
-   on the state after the block `block_hash` is applied.
+`FlatStorageChunkView`: interface for getting value or value reference from flat storage for
+a specific shard, block hash and trie key. 
In current logic we plan to make it part of `Trie`,
+and all trie reads will be directed to this object. Though we could work with chunk hashes, we don't,
+because block hashes are easier to navigate.

-`ShardFlatStates`: provides an interface to construct `FlatState` for each shard.
-
-`FlatStorageState`: stores information about the state of the flat storage itself,
+`FlatStorage`: API for interacting with flat storage for a fixed shard, including updating the head,
+   adding new delta and creating `FlatStorageChunkView`s.
   It stores, for example, all block deltas that are stored in flat storage and the flat
   storage head. `FlatState` can access `FlatStorageState` to get the list of
   deltas it needs to apply on top of state of current flat head in order to
   compute state of a target block.

-It may be noted that in this implementation, a separate `FlatState` and `FlatStorageState`
-will be created for each shard. The reason is that there are two modes of block processing,
+`FlatStorageManager`: owns flat storages for all shards, is stored in `NightshadeRuntime`, and accepts
+   updates from the `Chain` side, caused by successful processing of a chunk or block.
+
+`FlatStorageCreator`: handles creation of flat storage structs or initiates background creation (aka migration
+process) if flat storage data is not present in the DB yet.
+
+`FlatStateDelta`: a HashMap that contains state changes introduced in a chunk. They can be applied
+on top of the state at flat storage head to compute state at another block.
+
+The reason for having separate flat storages is that there are two modes of block processing,
normal block processing and block catchups.
Since they are performed on different ranges of blocks, flat storage needs to be able to support
different ranges of blocks on different shards. Therefore, we separate the flat storage objects
used for different shards.

@@ -217,18 +228,21 @@ used for different shards.
`DBCol::FlatState` stores a mapping from trie keys to the value corresponding to the trie keys,
based on the state of the block at flat storage head.
- *Rows*: trie key (`Vec<u8>`)
-- *Column type*: `ValueOrValueRef`
+- *Column type*: `ValueRef`
+
+`DBCol::FlatStateDeltas` stores all existing FSDs as a mapping from `(shard_id, block_hash, trie_key)` to the `ValueRef`.
+To read the whole delta, we read all values for the given key prefix. This delta stores all state changes introduced in the
+given shard of the given block.
+- *Rows*: `{ shard_id, block_hash, trie_key }`
+- *Column type*: `ValueRef`

-`DBCol::FlatStateDeltas` stores a mapping from `(shard_id, block_hash)` to the `FlatStateDelta` that stores
-state changes introduced in the given shard of the given block.
-- *Rows*: `{ shard_id, block_hash }`
-- *Column type*: `FlatStateDelta`

Note that `FlatStateDelta`s needed are stored in memory, so during block processing this column won't be used
 at all. This column is only used to load deltas into memory at `FlatStorageState` initialization time when node starts.

-`DBCol::FlatStateHead` stores the flat head at different shards.
-- *Rows*: `shard_id`
-- *Column type*: `CryptoHash`
+`DBCol::FlatStateMetadata` stores miscellaneous data about flat storage layout, including current flat storage
+head, current creation status and info about deltas existence. We don't specify exact format here because it is under
+discussion and can be tweaked until release.
+
Similarly, flat head is also stored in `FlatStorageState` in memory, so this column is only used to initialize
 `FlatStorageState` when node starts. 
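
For a quick sanity check of the FSD size estimation above, here is a small illustrative sketch. It is not part of the implementation; the constants are the protocol fees quoted in the estimation section, `NUM_SHARDS = 4` is the same assumption the text uses, and the small difference from the quoted ~5.5 MiB comes from rounding:

```rust
// Back-of-the-envelope check of the FSD size bound derived above.
const TGAS: u64 = 1_000_000_000_000;
const GGAS: u64 = 1_000_000_000;
const CHUNK_GAS_LIMIT: u64 = 1_300 * TGAS;
const WASM_STORAGE_WRITE_BASE: u64 = 64 * GGAS;
const ENTRY_SIZE_BYTES: u64 = 68; // 32 B key hash + value ref
const NUM_SHARDS: u64 = 4;

fn main() {
    // At most this many written entries fit into one chunk's gas limit.
    let entries_per_chunk = CHUNK_GAS_LIMIT / WASM_STORAGE_WRITE_BASE; // ~20k
    // Upper bound on the delta size for one block across all shards.
    let delta_bytes = entries_per_chunk * ENTRY_SIZE_BYTES * NUM_SHARDS;
    println!("delta per block <= {:.1} MiB", delta_bytes as f64 / (1u64 << 20) as f64);
    // Number of deltas that fit into one extra GiB of RAM (~190, i.e. 100-200 blocks).
    println!("deltas per GiB: ~{}", (1u64 << 30) / delta_bytes);
}
```
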
From 32b878b5a89e25e48b9500c8bfe1c6027626c341 Mon Sep 17 00:00:00 2001
From: Longarithm 
Date: Fri, 24 Feb 2023 17:45:06 +0400
Subject: [PATCH 14/24] drawbacks

---
 neps/nep-0399.md | 137 +++++++++++++++++++++++++++++++----------------
 1 file changed, 91 insertions(+), 46 deletions(-)

diff --git a/neps/nep-0399.md b/neps/nep-0399.md
index fdd818f2e..c3ea90800 100644
--- a/neps/nep-0399.md
+++ b/neps/nep-0399.md
@@ -80,7 +80,7 @@ difficult for the contract developers to predict.

## Specification

-The key idea of Flat Storage is to store a direct mapping from trie keys to values on disk.
+The key idea of Flat Storage is to store a direct mapping from trie keys to values in the DB.
Here the values of this mapping can be either the value corresponding to the trie key itself,
or the value ref, a hash that points to the address of the value. If the value itself is stored,
only one disk read is needed to look up a value from flat storage, otherwise two disk reads are needed if the value
@@ -173,6 +173,8 @@ There are multiple proposals on how storage writes can work with FlatStorage.
 it is unclear at this point what the new cost would look like and whether further optimizations are needed
 to bring down the cost for writes in the new cost model.

+See https://gov.near.org/t/storage-write-optimizations/30083 for more details.
+
### Migration Plan
There are two main questions regarding to how to enable FlatStorage.
1) Whether there should be database migration. The main challenge of enabling FlatStorage will be to build the flat state
@@ -258,80 +260,87 @@ pub fn from_state_changes(changes: &[RawStateChangesWithTrieKey]) -> FlatStateDe
```
Converts raw state changes to flat state delta. The raw state changes will be returned as part of the result of
`Runtime::apply_transactions`. They will be converted to `FlatStateDelta` to be added
-to `FlatStorageState` during `Chain::post_processblock`.
+to `FlatStorageState` during `Chain::postprocess_block` or `Chain::catch_up_postprocess`.

-### ```FlatState```
-```FlatState``` will be created for a shard `shard_id` and a block `block_hash`, and it can perform
+### ```FlatStorageChunkView```
+```FlatStorageChunkView``` will be created for a shard `shard_id` and a block `block_hash`, and it can perform
key value lookup for the state of shard `shard_id` after block `block_hash` is applied.
```rust
-pub struct FlatState {
+pub struct FlatStorageChunkView {
/// Used to access flat state stored at the head of flat storage.
store: Store,
/// The block for which key-value pairs of its state will be retrieved. The flat state
/// will reflect the state AFTER the block is applied.
block_hash: CryptoHash,
/// In-memory cache for the key value pairs stored on disk.
-#[allow(unused)]
cache: FlatStateCache,
/// Stores the state of the flat storage
-#[allow(unused)]
-flat_storage_state: FlatStorageState,
+flat_storage: FlatStorage,
}
```

-```FlatState``` will provide the following interface.
+```FlatStorageChunkView``` will provide the following interface.
```rust
pub fn get_ref(
&self,
key: &[u8],
-) -> Result<Option<ValueOrValueRef>, StorageError>
+) -> Result<Option<ValueRef>, StorageError>
```
Returns the value or value reference corresponding to the given `key`
-for the state that this `FlatState` object represents, i.e., the state after
+for the state that this `FlatStorageChunkView` object represents, i.e., the state after
block `self.block_hash` is applied.

-`FlatState` will be stored as a field in `Tries`. 
+
+### ```FlatStorageManager```

-###```ShardFlatStates```
-`ShardFlatStates` will be stored as part of `ShardTries`. Similar to how `ShardTries` is used to
-construct new `Trie` objects given a state root and a shard id, `ShardFlatStates` is used to construct
-a new `FlatState` object given a block hash and a shard id.
+`FlatStorageManager` will be stored as part of `ShardTries` and `NightshadeRuntime`. Similar to how `ShardTries` is used to
+construct new `Trie` objects given a state root and a shard id, `FlatStorageManager` is used to construct
+a new `FlatStorageChunkView` object given a block hash and a shard id.

```rust
-pub fn new_flat_state_for_shard(
+pub fn new_flat_storage_chunk_view(
&self,
shard_id: ShardId,
block_hash: Option<CryptoHash>,
-) -> FlatState
+) -> FlatStorageChunkView
```
Creates a new `FlatStorageChunkView` to be used for performing key value lookups on the state of shard `shard_id`
after block `block_hash` is applied.

```rust
-pub fn get_flat_storage_state_for_shard(
+pub fn get_flat_storage(
&self,
shard_id: ShardId,
-) -> Result<FlatStorageState, StorageError>
+) -> Result<FlatStorage, StorageError>
```
-Returns the `FlatStorageState` for the shard `shard_id`. This function is needed because even though
-`FlatStorageState` is part of `Runtime`, `Chain` also needs access to `FlatStorageState` to update flat head.
-We will also create a function with the same name in `Runtime` that calls this function to provide `Chain` access
+Returns the `FlatStorage` for the shard `shard_id`. This function is needed because even though
+`FlatStorage` is part of `NightshadeRuntime`, `Chain` also needs access to `FlatStorage` to update flat head.
We will also create a function with the same name in `NightshadeRuntime` that calls this function to provide `Chain` access
to `FlatStorageState`.

+```rust
+pub fn remove_flat_storage(
+&self,
+shard_id: ShardId,
+) -> Result<(), StorageError>
+```
+
+Removes flat storage for the shard if we stopped tracking it.
+
###```FlatStorage```
-`FlatStorageState` is created per shard. It provides information about which blocks the flat storage
+`FlatStorage` is created per shard. It provides information about which blocks the flat storage
on the given shard currently supports and what block deltas need to be applied on top of the stored flat state
on disk to get the state of the target block.

```rust
-fn get_deltas_between_blocks(
+fn get_blocks_to_head(
&self,
target_block_hash: &CryptoHash,
-) -> Result<Vec<Arc<FlatStateDelta>>, FlatStorageError>
+) -> Result<Vec<CryptoHash>, FlatStorageError>
```
-Returns the list of deltas between blocks `target_block_hash`(inclusive) and flat head(exclusive),
+Returns the list of blocks between `target_block_hash` (inclusive) and flat head (exclusive).
Returns an error if `target_block_hash` is not a direct descendant of the current flat head.
-This function will be used in `FlatState::get_ref`.
+This function will be used in `FlatStorageChunkView::get_ref`. Note that we can't call it once and cache the result while applying a
+chunk, because in parallel some block can be processed and the flat head can be updated.

```rust
fn update_flat_head(&self, new_head: &CryptoHash) -> Result<(), FlatStorageError>
```
Updates the head of the flat storage, including updating the flat head in memory and on disk,
update the flat state on disk to reflect the state at the new head, and gc the `FlatStateDelta`s that
are no longer needed from memory and from disk. 
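
For intuition, the read path described above can be sketched as follows. This is an illustrative sketch only, not actual nearcore code: `delta_for_block` and `read_from_flat_head` are hypothetical helpers, deltas are assumed to be looked up directly by trie key (the key hashing mentioned in the FSD size estimation is elided), and the block list returned by `get_blocks_to_head` is assumed to be ordered from the target block down towards the flat head:

```rust
// Illustrative sketch of a read on top of `target_block_hash`. A
// `FlatStateDelta` maps trie keys to `Option<ValueRef>`, where `None`
// means the key was deleted in that block.
fn get_ref_via_deltas(
    storage: &FlatStorage,
    target_block_hash: &CryptoHash,
    key: &[u8],
) -> Result<Option<ValueRef>, FlatStorageError> {
    // Blocks from `target_block_hash` (inclusive) down to flat head (exclusive).
    for block_hash in storage.get_blocks_to_head(target_block_hash)? {
        // Hypothetical helper returning the in-memory FSD for this block.
        if let Some(change) = storage.delta_for_block(&block_hash).get(key) {
            // The key was changed in this block; the newest change wins.
            return Ok(change.clone());
        }
    }
    // No delta touched the key: fall back to the on-disk state at flat head
    // (a hypothetical helper reading `DBCol::FlatState`).
    Ok(storage.read_from_flat_head(key))
}
```
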

```rust
-fn add_delta(
+fn add_block(
&self,
block_hash: &CryptoHash,
delta: FlatStateDelta,
-) -> Result<StoreUpdate>
+) -> Result<StoreUpdate, FlatStorageError>
```
-Adds `delta` to `FlatStorageState`, returns a `StoreUpdate` object that includes
-#### Thread Safety
-We should note that the implementation of `FlatStorageState` must be thread safe because it can
+Adds `delta` to `FlatStorage`, returns a `StoreUpdate` object that includes the DB transaction to be committed to persist
+that change.
+
+```rust
+fn get_ref(
+&self,
+block_hash: &CryptoHash,
+key: &[u8],
+) -> Result<Option<ValueRef>, FlatStorageError>
+```
+
+Returns `ValueRef` from flat storage state on top of `block_hash`. Returns `None` if key is not present, or an error if
+block is not supported.
+
+### Thread Safety
+We should note that the implementation of `FlatStorage` must be thread safe because it can
be concurrently accessed by multiple threads. A node can process multiple blocks at the same time
-if they are on different forks. Therefore, `FlatStorageState` will be guarded by a `RwLock` so its
-access can be shared safely.
+if they are on different forks, and chunks from these blocks can trigger storage reads in parallel.
+Therefore, `FlatStorage` will be guarded by a `RwLock` so its access can be shared safely:

```rust
-pub struct FlatStorageState(Arc<RwLock<FlatStorageStateInner>>);
+pub struct FlatStorage(Arc<RwLock<FlatStorageInner>>);
```

## Drawbacks

+The main drawback is that we need to control the total size of state updates in blocks after the current final head.
+While for current testnet/mainnet load the amount of blocks above the final head doesn't exceed 5 in 99.99% of cases, we still have to
+consider extreme cases, because Doomslug consensus doesn't give guarantees / an upper limit on that. If we don't consider
+this at all and there is no finality for a long time, validator nodes can crash because of too many FSDs in memory, and
+the chain slows down and stalls, which can have a negative impact on user/validator experience and reputation. For now, we
+claim that we support enough deltas in memory for the chain to be finalized, and the proper discussions are likely to happen
+in NEPs like https://github.com/near/NEPs/pull/460.
+
+Risk of DB corruption slightly increases, and it becomes harder to replay blocks on chain. While `Trie` entries are
+essentially immutable (in fact, the value for each key is
+unique, because the key is the value's hash), `FlatStorage` is read-modify-write, because values for the same `TrieKey` can be
+completely different. We believe that such flat mapping is reasonable to maintain anyway, e.g. for the newly discovered state
+sync idea. But if some change was applied incorrectly, we may have to recompute the whole flat storage, and for block
+hashes before flat head we can't access flat storage at all.
+
+Though Flat Storage significantly reduces amount of storage reads, we have to keep it up-to-date, which results in 1
+extra disk write for each changed key, and 1 auxiliary disk write + removal for each FSD.
+We see this as an acceptable tradeoff, because actual disk writes are executed in background and are not a bottleneck
+for block processing. For storage writes in general, Flat Storage is even a net improvement, because it removes
+necessity to traverse changed nodes during write execution ("reads-for-writes"), and we can apply optimizations
+there (see "Storage Writes" section).
+
Implementing FlatStorage will require a lot of engineering effort and introduce code that will make the codebase more
-complicated. We are confident that FlatStorage will bring a lot of performance benefit, but we can only measure the exact
-improvement after the implementation. 
In a very unlikely case, we may find that the benefit FlatStorage brings is not -worth the effort. +complicated. In particular, we had to extend `RuntimeAdapter` API with flat storage-related method after thorough +considerations. We are confident that FlatStorage will bring a lot of performance benefit, but we can only measure the exact +improvement after the implementation. We may find that the benefit FlatStorage brings is not +worth the effort, but it is very unlikely. -Another issue is that it will make the state rollback harder in the future when we enable challenges in phase 2 of sharding. +It will make the state rollback harder in the future when we enable challenges in phase 2 of sharding. When a challenge is accepted and the state needs to be rolled back to a previous block, the entire flat state needs to -be rebuilt, which could take a long time. +be rebuilt, which could take a long time. Alternatively, we could postpone garbage collection of deltas and add support +of applying them backwards. ## Unresolved Issues -As we discussed in Section Specification, there are still unanswered questions around how the new cost model for storage -writes would look like and how the current storage can be upgraded to enabled FlatStorage. We expect to finalize -the migration plan before this NEP gets merged, but we might need more time to collect data and measurement around -storage write costs, which can be only be collected after FlatStorage is partially implemented. +As we discussed in previous sections, there are still unanswered questions around how the new cost model for storage +writes would look like and how FSDs size should be safely limited. We might need more time to collect data and +measurements around storage write costs, which can be only be collected after FlatStorage is partially implemented. Another big unanswered question is how FlatStorage would work when challenges are enabled. We consider that to be out of the scope of this NEP because the details of how challenges will be implemented are not clear yet. But this is something From 210318ac096768719f773250c1d221a6d3c517e1 Mon Sep 17 00:00:00 2001 From: Longarithm Date: Fri, 24 Feb 2023 19:28:04 +0400 Subject: [PATCH 15/24] shard removals --- neps/nep-0399.md | 19 +++++++++++++++++-- 1 file changed, 17 insertions(+), 2 deletions(-) diff --git a/neps/nep-0399.md b/neps/nep-0399.md index c3ea90800..304ec3e16 100644 --- a/neps/nep-0399.md +++ b/neps/nep-0399.md @@ -150,6 +150,13 @@ modified in the recent blocks. It may be beneficial if we get many transactions trie keys in consecutive blocks, but it is hard to estimate the value of such benefits without more data. We may store only short values ("inlining"), but this idea is orthogonal and can be applied separately. +### Storage Reads + +The plan for new storage read costs is to essentially drop TTN and cover the cost with `storage_read_base`, which should +have an order of 100 Ggas. This corresponds to 100 us time needed for 1 DB read. Ideally we would increase it to 200 +Ggas and further, because we do at least 2 DB reads, but gas usage analytics showed that it would break existing +contracts. So we stick with 100 Ggas and plan to implement inlining separately. + ### Storage Writes Currently, storage writes are charged based on the number of touched trie nodes (TTN cost), because updating the leaf trie node which stores the value to the trie key requires updating all trie nodes on the path leading to the leaf node. @@ -399,8 +406,8 @@ sync idea. 
But if some change was applied incorrectly, we may have to recompute the whole flat storage, and for block
hashes before flat head we can't access flat storage at all.

Though Flat Storage significantly reduces amount of storage reads, we have to keep it up-to-date, which results in 1
-extra disk write for each changed key, and 1 auxiliary disk write + removal for each FSD.
-We see this as an acceptable tradeoff, because actual disk writes are executed in background and are not a bottleneck
+extra disk write for each changed key, and 1 auxiliary disk write + removal for each FSD. Disk requirements also slightly
+increase. We think it is acceptable, because actual disk writes are executed in background and are not a bottleneck
 for block processing. For storage writes in general, Flat Storage is even a net improvement, because it removes
 necessity to traverse changed nodes during write execution ("reads-for-writes"), and we can apply optimizations
 there (see "Storage Writes" section).
@@ -416,6 +423,14 @@ When a challenge is accepted and the state needs to be rolled back to a previous
 be rebuilt, which could take a long time. Alternatively, we could postpone garbage collection of deltas and add support
 of applying them backwards.

+Speaking of new sharding phases, once nodes are no longer tracking all shards, Flat Storage must have support for adding
+or removing state for some specific shard. Adding a new shard is a tricky but natural extension of the catchup process. Our
+current approach for removal is to iterate over all entries in `DBCol::FlatState` and find out for each trie key
+which shard it belongs to. We would be happy to assume that each shard is represented by a set of
+contiguous ranges in `DBCol::FlatState` and make removals simpler, but this is still under discussion.
+
+Last but not least, resharding is not supported by current implementation yet.
+
## Unresolved Issues

As we discussed in previous sections, there are still unanswered questions around how the new cost model for storage

From d7c88f5ead89fe97d326cf74d2fe21783aa072ad Mon Sep 17 00:00:00 2001
From: Longarithm 
Date: Fri, 24 Feb 2023 20:01:56 +0400
Subject: [PATCH 16/24] costs

---
 neps/nep-0399.md | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/neps/nep-0399.md b/neps/nep-0399.md
index 304ec3e16..c47609de9 100644
--- a/neps/nep-0399.md
+++ b/neps/nep-0399.md
@@ -152,13 +152,19 @@ modified in the recent blocks. It may be beneficial if we get many transactions
 trie keys in consecutive blocks, but it is hard to estimate the value of such benefits without more data.
 We may store only short values ("inlining"), but this idea is orthogonal and can be applied separately.

+### Storage Reads
+
+Current read cost does not exceed 56 Ggas + 30 Mgas * key.len() + 5 Mgas * value.len() + 16 Ggas * TTN. It makes sense
+to consider only children of root of contract state only. For small contracts I would expect TTN < 5 due to few amount
+of branches, for Aurora and Sweatcoin we've seen TTN around 10-15.
+
 The plan for new storage read costs is to essentially drop TTN and cover it with `storage_read_base`, which should
 have an order of 100 Ggas. This corresponds to 100 us time needed for 1 DB read. Ideally we would increase it to 200
 Ggas and further, because we do at least 2 DB reads, but gas usage analytics showed that it would break existing
-contracts. So we stick with 100 Ggas and plan to implement inlining separately.
+contracts. So we are likely to agree on 100 Ggas and plan to implement inlining separately. Note that cost reduction, if
+it happens, won't be significant, because in fact 16 Ggas per TTN is significantly off. 
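
As a worked example of the read-cost formula above (illustrative only; the constants are the parameter values quoted in this section):

```rust
// Worked example for the current read-cost formula quoted above (values in gas).
const GGAS: u64 = 1_000_000_000;
const MGAS: u64 = 1_000_000;

// Current model: 56 Ggas + 30 Mgas * key_len + 5 Mgas * value_len + 16 Ggas * TTN.
fn current_read_cost(key_len: u64, value_len: u64, ttn: u64) -> u64 {
    56 * GGAS + 30 * MGAS * key_len + 5 * MGAS * value_len + 16 * GGAS * ttn
}

fn main() {
    // Same read (64 B key, 1 KiB value), only TTN differs:
    // a small contract (TTN ~5) pays ~143 Ggas, an Aurora/Sweatcoin-like
    // contract (TTN ~15) pays ~303 Ggas. The TTN term dominates and is
    // exactly the part contract developers cannot predict.
    for ttn in [5u64, 15] {
        println!("TTN = {ttn}: {} Ggas", current_read_cost(64, 1024, ttn) / GGAS);
    }
}
```
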
### Storage Writes -Currently, storage writes are charged based on the number of touched trie nodes (TTN cost), because updating the leaf trie + +Storage writes are charged similarly and include TTN as well, because updating the leaf trie node which stores the value to the trie key requires updating all trie nodes on the path leading to the leaf node. All writes are committed at once in one db transaction at the end of block processing, outside of runtime after all receipts in a block are executed. However, at the time of execution, runtime needs to calculate the cost, @@ -176,7 +182,8 @@ There are multiple proposals on how storage writes can work with FlatStorage. - Charge based on maximum depth of a contract’s state, instead of per-touch-trie node. - Charge based on key length only. - Both of the above ideas would allow us to remove writes from the critical path of block execution. However, + Both of the above ideas would allow us to get rid of trie traversal ("reads-for-writes") from the critical path of + block execution. However, it is unclear at this point what the new cost would look like and whether further optimizations are needed to bring down the cost for writes in the new cost model. From f2a22fcec40b761747443900437f3d9f2a34c220 Mon Sep 17 00:00:00 2001 From: Longarithm Date: Mon, 27 Feb 2023 15:40:46 +0400 Subject: [PATCH 17/24] nits --- neps/nep-0399.md | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/neps/nep-0399.md b/neps/nep-0399.md index c47609de9..ac7b20ed7 100644 --- a/neps/nep-0399.md +++ b/neps/nep-0399.md @@ -43,10 +43,12 @@ the base fees for storage reads and writes. However, technically the biggest TTN to strongly avoid increasing fees, because it may break existing contract calls, not even mentioning that it would greatly reduce capacity of NEAR blocks, because for current mainnet usecases depth is usually below 20. -allowing to make number of traversed nodes very stable and predictable. But implementation is tricky, we didn't move -significantly further POC here: https://github.com/near/nearcore/discussions/4815. Also for most key-value pairs +We also consider changing tree type from Trie to AVL, B-tree, etc. to make number of traversed nodes more stable and +predictable. But we approached AVL idea, and implementation turned out to be tricky, so we didn't go +much further than POC here: https://github.com/near/nearcore/discussions/4815. Also for most key-value pairs tree depth will actually increase - for example, if you have 1M keys, depth is always 20, and it would cause increasing -fees as well. +fees as well. Size of intermediate node also increases, because we have to need to store a key there to decide whether +we should go to the left or right child. Separate idea is to get rid of global state root completely: https://github.com/near/NEPs/discussions/425. Instead, we could track latest account state in the block where it was changed. But after closer look, it brings us to @@ -189,6 +191,9 @@ There are multiple proposals on how storage writes can work with FlatStorage. See https://gov.near.org/t/storage-write-optimizations/30083 for more details. +While storage writes are not fully implemented yet, increasing parameter compute cost for storage writes in +https://github.com/near/NEPs/pull/455 may help as an intermediate solution. + ### Migration Plan There are two main questions regarding to how to enable FlatStorage. 1) Whether there should be database migration. 
The main challenge of enabling FlatStorage will be to build the flat state

From f2a22fcec40b761747443900437f3d9f2a34c220 Mon Sep 17 00:00:00 2001
From: Longarithm 
Date: Mon, 27 Feb 2023 15:40:46 +0400
Subject: [PATCH 17/24] nits

---
 neps/nep-0399.md | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/neps/nep-0399.md b/neps/nep-0399.md
index c47609de9..ac7b20ed7 100644
--- a/neps/nep-0399.md
+++ b/neps/nep-0399.md
@@ -43,10 +43,12 @@ the base fees for storage reads and writes. However, technically the biggest TTN
 to strongly avoid increasing fees, because it may break existing contract calls, not to mention that it would greatly
 reduce capacity of NEAR blocks, because for current mainnet use cases depth is usually below 20.

-allowing to make number of traversed nodes very stable and predictable. But implementation is tricky, we didn't move
-significantly further POC here: https://github.com/near/nearcore/discussions/4815. Also for most key-value pairs
-tree depth will actually increase - for example, if you have 1M keys, depth is always 20, and it would cause increasing
-fees as well.
+We also considered changing the tree type from Trie to AVL, B-tree, etc. to make the number of traversed nodes more stable and
+predictable. But when we approached the AVL idea, implementation turned out to be tricky, so we didn't go
+much further than the POC here: https://github.com/near/nearcore/discussions/4815. Also for most key-value pairs
+tree depth will actually increase - for example, if you have 1M keys, depth is always 20, and it would cause increasing
+fees as well. Size of an intermediate node also increases, because we need to store a key there to decide whether
+we should go to the left or right child.

Separate idea is to get rid of global state root completely: https://github.com/near/NEPs/discussions/425. Instead,
we could track latest account state in the block where it was changed. But after closer look, it brings us to

From 855492af475190e4abd1a1938ef25c18cc92a1d0 Mon Sep 17 00:00:00 2001
From: Longarithm 
Date: Fri, 3 Mar 2023 22:10:27 +0400
Subject: [PATCH 18/24] apply suggestions

---
 neps/nep-0399.md | 46 ++++++++++++++++++++--------------------------
 1 file changed, 20 insertions(+), 26 deletions(-)

diff --git a/neps/nep-0399.md b/neps/nep-0399.md
index ac7b20ed7..37cdbff1d 100644
--- a/neps/nep-0399.md
+++ b/neps/nep-0399.md
@@ -30,7 +30,8 @@ for NEAR Protocol client implementations, i.e. nearcore, as exact TTN number mus
 by all clients. This prevents storage optimizations that use other strategies than nearcore uses today.

 With Flat Storage, number of disk reads is reduced from worst-case 2 * key_length to exactly 2, storage read gas
-fees are simplified by getting rid of TTN cost, and potentially can be reduced because less disk reads are needed.
+fees are simplified by getting rid of TTN cost, and potentially can be further reduced because fewer disk reads
+are needed.

## Rationale and alternatives

@@ -51,12 +52,12 @@ fees as well. Size of an intermediate node also increases, because we need to store
 we should go to the left or right child.

 Separate idea is to get rid of global state root completely: https://github.com/near/NEPs/discussions/425. Instead,
-we could track latest account state in the block where it was changed. But after closer look, it brings us to
+we could track the latest account state in the block where it was changed. But after closer look, it brings us to
 similar questions - if some key was untouched for a long time, it becomes harder to find the exact block holding its latest
 value, and we need some tree-shaped structure again. Because ideas like that could also be extremely invasive,
 we stopped considering them at this point.

-Another ideas around storage include exploring databases other than RocksDB, or moving State to a separate database.
+Other ideas around storage include exploring databases other than RocksDB, or moving State to a separate database.
 We can tweak State representation, e.g. start trie key from account id to achieve data locality within
 the same account. However, Flat Storage's main goal is to speed up reads and make their costs predictable, and these ideas
 are orthogonal to that, although they still can improve storage in other ways.
@@ -158,11 +159,11 @@ Current read cost does not exceed 56 Ggas + 30 Mgas * key.len() + 5 Mgas * value
 to consider only children of root of contract state only. For small contracts I would expect TTN < 5 due to few amount
 of branches, for Aurora and Sweatcoin we've seen TTN around 10-15.

-The plan for new storage read costs is to essentially drop TTN and cover it with `storage_read_base`, which should
-have an order of 100 Ggas. This corresponds to 100 us time needed for 1 DB read. Ideally we would increase it to 200
-Ggas and further, because we do at least 2 DB reads, but gas usage analytics showed that it would break existing
-contracts. So we are likely to agree on 100 Ggas and plan to implement inlining separately. Note that cost reduction, if
-it happens, won't be significant, because in fact 16 Ggas per TTN is significantly off.
+The plan for new storage read costs is to essentially drop TTN and cover it with `storage_read_base`. The exact values
+will be determined by our estimations. Based on https://github.com/near/nearcore/discussions/6575, costs should have an
+order of hundreds of Ggas, which corresponds to hundreds of us per 1 DB read. 
Because current cost, 16 Ggas per TTN, is
+already significantly off, notable cost reductions are unlikely. And because we don't want to increase costs, we are
+going to cover undercharging with https://github.com/near/NEPs/pull/455.

### Storage Writes

@@ -226,7 +227,7 @@ because block hashes are easier to navigate.

`FlatStorage`: API for interacting with flat storage for a fixed shard, including updating the head,
 adding new delta and creating `FlatStorageChunkView`s. It stores, for example, all block deltas that are stored in flat storage and the flat
-   storage head. `FlatState` can access `FlatStorageState` to get the list of
+   storage head. `FlatStorageChunkView` can access `FlatStorage` to get the list of
   deltas it needs to apply on top of state of current flat head in order to
   compute state of a target block.

@@ -258,14 +259,14 @@ given shard of the given block.
- *Column type*: `ValueRef`

Note that `FlatStateDelta`s needed are stored in memory, so during block processing this column won't be used
-  at all. This column is only used to load deltas into memory at `FlatStorageState` initialization time when node starts.
+  at all. This column is only used to load deltas into memory at `FlatStorage` initialization time when node starts.

`DBCol::FlatStateMetadata` stores miscellaneous data about flat storage layout, including current flat storage
head, current creation status and info about deltas existence. We don't specify exact format here because it is under
discussion and can be tweaked until release.

-Similarly, flat head is also stored in `FlatStorageState` in memory, so this column is only used to initialize
- `FlatStorageState` when node starts.
+Similarly, flat head is also stored in `FlatStorage` in memory, so this column is only used to initialize
+ `FlatStorage` when node starts.

### `FlatStateDelta`
`FlatStateDelta` stores a mapping from trie keys to value refs. If the value is `None`, it means the key is deleted
@@ -279,7 +280,7 @@ pub fn from_state_changes(changes: &[RawStateChangesWithTrieKey]) -> FlatStateDe
```
Converts raw state changes to flat state delta. The raw state changes will be returned as part of the result of
`Runtime::apply_transactions`. They will be converted to `FlatStateDelta` to be added
-to `FlatStorageState` during `Chain::postprocess_block` or `Chain::catch_up_postprocess`.
+to `FlatStorage` during `Chain::postprocess_block` or `Chain::catch_up_postprocess`.

### ```FlatStorageChunkView```
```FlatStorageChunkView``` will be created for a shard `shard_id` and a block `block_hash`, and it can perform
key value lookup for the state of shard `shard_id` after block `block_hash` is applied.
```rust
pub struct FlatStorageChunkView {
/// Used to access flat state stored at the head of flat storage.
store: Store,
/// The block for which key-value pairs of its state will be retrieved. The flat state
/// will reflect the state AFTER the block is applied.
block_hash: CryptoHash,
-/// In-memory cache for the key value pairs stored on disk.
-cache: FlatStateCache,
/// Stores the state of the flat storage
flat_storage: FlatStorage,
}
```
@@ -334,7 +333,7 @@ pub fn get_flat_storage(
 Returns the `FlatStorage` for the shard `shard_id`. This function is needed because even though
`FlatStorage` is part of `NightshadeRuntime`, `Chain` also needs access to `FlatStorage` to update flat head.
We will also create a function with the same name in `NightshadeRuntime` that calls this function to provide `Chain` access
-to `FlatStorageState`.
+to `FlatStorage`.
@@ -443,18 +442,13 @@ contiguous ranges in `DBCol::FlatState` and make removals simpler, but this is s

Last but not least, resharding is not supported by current implementation yet. 
-## Unresolved Issues - -As we discussed in previous sections, there are still unanswered questions around how the new cost model for storage -writes would look like and how FSDs size should be safely limited. We might need more time to collect data and -measurements around storage write costs, which can be only be collected after FlatStorage is partially implemented. - -Another big unanswered question is how FlatStorage would work when challenges are enabled. We consider that to be out of -the scope of this NEP because the details of how challenges will be implemented are not clear yet. But this is something -we need to consider when we design challenges. - ## Future possibilities +Flat Storage maintains all state keys in sorted order, which seems beneficial. We currently investigate opportunity to +speed up state sync: instead of traversing state part in Trie, we can extract range of keys and values from Flat Storage +and build range of Trie nodes based on it. It is well known that reading Trie nodes is a bottleneck for state sync as +well. + ## Copyright [copyright]: #copyright From 1ad54cdd1cf74e06aee305d75d759b47cd884652 Mon Sep 17 00:00:00 2001 From: Vlad Frolov Date: Mon, 20 Mar 2023 21:53:19 +0100 Subject: [PATCH 19/24] feat: Added Changelog section --- neps/nep-0399.md | 26 ++++++++++++++++++++++++++ 1 file changed, 26 insertions(+) diff --git a/neps/nep-0399.md b/neps/nep-0399.md index 37cdbff1d..26a4d7e7b 100644 --- a/neps/nep-0399.md +++ b/neps/nep-0399.md @@ -449,6 +449,32 @@ speed up state sync: instead of traversing state part in Trie, we can extract ra and build range of Trie nodes based on it. It is well known that reading Trie nodes is a bottleneck for state sync as well. +## Changelog + +### 1.0.0 - Initial Version + +The NEP was approved by Protocol Working Group members on March 16, 2023 ([meeting recording](https://www.youtube.com/watch?v=4VxRoKwLXIs)): + +- [Bowen's vote](https://github.com/near/NEPs/pull/399#issuecomment-1467010125) +- [Marcello's vote](https://github.com/near/NEPs/pull/399#pullrequestreview-1341069564) +- [Marcin's vote](https://github.com/near/NEPs/pull/399#issuecomment-1465977749) + +#### Benefits + +* The proposal makes serving reads more efficient; making the NEAR protocol cheaper to use and increasing the capacity of the network; +* The proposal makes estimating gas costs for a transaction easier as the fees for reading are no longer a function of the trie structure whose shape the smart contract developer does not know ahead of time and can continuously change. +* The proposal should open doors to enabling future efficiency gains in the protocol and further simplifying gas fee estimations. +* 'Secondary' index over the state data - which would allow further optimisations in the future. + +#### Concerns + +| # | Concern | Resolution | Status | +| - | - | - | - | +| 1 | The cache requires additional database storage | There is an upper bound on how much additional storage is needed. The costs for the additional disk storage should be negligible | Not an issue | +| 2 | Additional implementation complexity | Given the benefits of the proposal, I believe the complexity is justified | not an issue | +| 3 | Additional memory requirement | Most node operators are already operating over-provisioned machines which can handle the additional memory requirement. 
The minimum requirements should be raised but it appears that minimum requirements are already not enough to operate a node | This is a concern but it is not specific to this project | +| 4 | Slowing down the read-update-write workload | This is common pattern in smart contracts so indeed a concern. However, there are future plans on how to address this by serving writes from the flat storage as well which will also reduce the fees of serving writes and make further improvements to the NEAR protocol | This is a concern but hopefully will be addressed in future iterations of the project | + ## Copyright [copyright]: #copyright From 9222c433b9b346f5b97a0d34d53f6e951199d59a Mon Sep 17 00:00:00 2001 From: Aleksandr Logunov Date: Tue, 21 Mar 2023 02:59:30 +0400 Subject: [PATCH 20/24] Update neps/nep-0399.md Co-authored-by: Marcelo Fornet --- neps/nep-0399.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/neps/nep-0399.md b/neps/nep-0399.md index 26a4d7e7b..892f2aa73 100644 --- a/neps/nep-0399.md +++ b/neps/nep-0399.md @@ -282,8 +282,8 @@ Converts raw state changes to flat state delta. The raw state changes will be re `Runtime::apply_transactions`. They will be converted to `FlatStateDelta` to be added to `FlatStorage` during `Chain::postprocess_block` or `Chain::catch_up_postprocess`. -### ```FlatStorageChunkView``` -```FlatStorageChunkView``` will be created for a shard `shard_id` and a block `block_hash`, and it can perform +### `FlatStorageChunkView` +`FlatStorageChunkView` will be created for a shard `shard_id` and a block `block_hash`, and it can perform key value lookup for the state of shard `shard_id` after block `block_hash` is applied. ```rust pub struct FlatStorageChunkView { From 066290f958fe03d4df34d40097053d7e0b9dad06 Mon Sep 17 00:00:00 2001 From: Aleksandr Logunov Date: Tue, 21 Mar 2023 02:59:40 +0400 Subject: [PATCH 21/24] Update neps/nep-0399.md Co-authored-by: Marcelo Fornet --- neps/nep-0399.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/neps/nep-0399.md b/neps/nep-0399.md index 892f2aa73..23990ac75 100644 --- a/neps/nep-0399.md +++ b/neps/nep-0399.md @@ -297,7 +297,7 @@ flat_storage: FlatStorage, } ``` -```FlatStorageChunkView``` will provide the following interface. +`FlatStorageChunkView` will provide the following interface. ```rust pub fn get_ref( &self, From da383fa99fdcba7ae4b0baa1d70d70e26a7c92ce Mon Sep 17 00:00:00 2001 From: Aleksandr Logunov Date: Tue, 21 Mar 2023 02:59:47 +0400 Subject: [PATCH 22/24] Update neps/nep-0399.md Co-authored-by: Marcelo Fornet --- neps/nep-0399.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/neps/nep-0399.md b/neps/nep-0399.md index 23990ac75..aa79eb362 100644 --- a/neps/nep-0399.md +++ b/neps/nep-0399.md @@ -308,7 +308,7 @@ Returns the value or value reference corresponding to the given `key` for the state that this `FlatStorageChunkView` object represents, i.e., the state that after block `self.block_hash` is applied. -### ```FlatStorageManager``` +### `FlatStorageManager` `FlatStorageManager` will be stored as part of `ShardTries` and `NightshadeRuntime`. 
Similar to how `ShardTries` is used to construct new `Trie` objects given a state root and a shard id, `FlatStorageManager` is used to construct From c36593e9e2b4120f04f4dd39c0d69fdc6a427a33 Mon Sep 17 00:00:00 2001 From: Aleksandr Logunov Date: Tue, 21 Mar 2023 02:59:55 +0400 Subject: [PATCH 23/24] Update neps/nep-0399.md Co-authored-by: Marcelo Fornet --- neps/nep-0399.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/neps/nep-0399.md b/neps/nep-0399.md index aa79eb362..55890cbad 100644 --- a/neps/nep-0399.md +++ b/neps/nep-0399.md @@ -344,7 +344,7 @@ pub fn remove_flat_storage( Removes flat storage for shard if we stopped tracking it. -###```FlatStorage``` +###`FlatStorage` `FlatStorage` is created per shard. It provides information to which blocks the flat storage on the given shard currently supports and what block deltas need to be applied on top the stored flat state on disk to get the state of the target block. From a886279462eca2c41c85798aa8b0afd6d939902d Mon Sep 17 00:00:00 2001 From: Longarithm Date: Mon, 20 Mar 2023 23:24:51 +0000 Subject: [PATCH 24/24] protocol change --- neps/nep-0399.md | 29 +++++++++++++++-------------- 1 file changed, 15 insertions(+), 14 deletions(-) diff --git a/neps/nep-0399.md b/neps/nep-0399.md index 26a4d7e7b..3168674bf 100644 --- a/neps/nep-0399.md +++ b/neps/nep-0399.md @@ -153,21 +153,22 @@ modified in the recent blocks. It may be beneficial if we get many transactions trie keys in consecutive blocks, but it is hard to estimate the value of such benefits without more data. We may store only short values ("inlining"), but this idea is orthogonal and can be applied separately. -### Storage Reads +### Protocol changes -Current read cost does not exceed 56 Ggas + 30 Mgas * key.len() + 5 Mgas * value.len() + 16 Ggas * TTN. It makes sense -to consider only children of root of contract state only. For small contracts I would expect TTN < 5 due to few amount -of branches, for Aurora and Sweatcoin we've seen TTN around 10-15. +Flat Storage itself doesn't change protocol. We only change impacted storage costs to reflect changes in performance. Below we describe reads and writes separately. -The plan for new storage read costs is to essentially drop TTN and cover it with `storage_read_base`. The exact values -will be determined by our estimations. Based on https://github.com/near/nearcore/discussions/6575, costs should have an -order of hundreds of Ggas, which corresponds to hundreds of us per 1 DB read. Because current cost, 16 Ggas per TTN, is -already significantly off, notable cost reductions are unlikely. And because we don't want to increase costs, we are -going to cover undercharging with https://github.com/near/NEPs/pull/455. +#### Storage Reads -### Storage Writes +Latest proposal for shipping storage reads is [here](https://github.com/near/nearcore/issues/8006#issuecomment-1473718509). +It solves several issues with costs, but the major impact of flat storage is that essentially for reads +`wasm_touching_trie_node` and `wasm_read_cached_trie_node` are reduced to 0. Reason is that before we had to cover costs +of reading nodes from memory or disk, and with flat storage we make only 2 DB reads. -Storage writes are charged similarly and include TTN as well, because updating the leaf trie +Latest up-to-date gas and compute costs can be found in nearcore repo. 
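
To make the parameter change concrete, here is an illustrative before/after sketch of read-cost accounting. The exact numbers are assumptions carried over from earlier sections of this NEP (56 Ggas base and 16 Ggas per TTN before the change, a ~100 Ggas base after); consult nearcore for the current values:

```rust
// Sketch: the same read charged before and after this NEP (gas units).
const GGAS: u64 = 1_000_000_000;

struct ReadParams {
    storage_read_base: u64,
    wasm_touching_trie_node: u64, // charged once per touched trie node
}

fn read_cost(p: &ReadParams, ttn: u64) -> u64 {
    p.storage_read_base + p.wasm_touching_trie_node * ttn
}

fn main() {
    // Before: a base cost plus ~16 Ggas for every touched trie node.
    let before = ReadParams { storage_read_base: 56 * GGAS, wasm_touching_trie_node: 16 * GGAS };
    // After: TTN-related read costs are reduced to 0, since flat storage
    // makes exactly 2 DB reads regardless of the trie shape.
    let after = ReadParams { storage_read_base: 100 * GGAS, wasm_touching_trie_node: 0 };
    for ttn in [5u64, 15] {
        println!(
            "TTN = {ttn}: {} -> {} Ggas",
            read_cost(&before, ttn) / GGAS,
            read_cost(&after, ttn) / GGAS
        );
    }
}
```
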
+ +#### Storage Writes + +Storage writes are charged similarly to reads and include TTN as well, because updating the leaf trie node which stores the value to the trie key requires updating all trie nodes on the path leading to the leaf node. All writes are committed at once in one db transaction at the end of block processing, outside of runtime after all receipts in a block are executed. However, at the time of execution, runtime needs to calculate the cost, @@ -192,8 +193,8 @@ There are multiple proposals on how storage writes can work with FlatStorage. See https://gov.near.org/t/storage-write-optimizations/30083 for more details. -While storage writes are not fully implemented yet, increasing parameter compute cost for storage writes in -https://github.com/near/NEPs/pull/455 may help as an intermediate solution. +While storage writes are not fully implemented yet, we may increase parameter compute cost for storage writes implemented +in https://github.com/near/NEPs/pull/455 as an intermediate solution. ### Migration Plan There are two main questions regarding to how to enable FlatStorage. @@ -456,7 +457,7 @@ well. The NEP was approved by Protocol Working Group members on March 16, 2023 ([meeting recording](https://www.youtube.com/watch?v=4VxRoKwLXIs)): - [Bowen's vote](https://github.com/near/NEPs/pull/399#issuecomment-1467010125) -- [Marcello's vote](https://github.com/near/NEPs/pull/399#pullrequestreview-1341069564) +- [Marcelo's vote](https://github.com/near/NEPs/pull/399#pullrequestreview-1341069564) - [Marcin's vote](https://github.com/near/NEPs/pull/399#issuecomment-1465977749) #### Benefits