Worst-case witness size for stateless validation #9378
It looks like I will need to split this into several stages. For data receipts, it can get quite tricky to figure out which witnesses they belong in, or how many of them can clump up in one place. I will write more details about them in a future comment. For now, let me list how large action receipts can become based on gas and other runtime limits.

Network Message Sizes
The table above shows that function call actions carry the most bytes per gas unit. A chunk filled with function calls could be up to 358MB large, just for storing the actions. This is true today, before we even start talking about the changes that would be necessary for stateless validation.
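For intuition, the calculation behind a figure like this is essentially the chunk gas limit divided by the cheapest gas cost per action byte. A minimal sketch, with placeholder fee values rather than the actual runtime parameters from the spreadsheet:

```rust
// Sketch only: the values below are placeholders, not the real runtime config.
// The actual per-byte send/exec fees come from the protocol's RuntimeConfig.
fn max_action_bytes_per_chunk(chunk_gas_limit: u64, gas_per_action_byte: u64) -> u64 {
    // Every byte of a FunctionCall action's method name and arguments is charged
    // some send + exec gas; the worst case packs as many bytes as the gas limit allows.
    chunk_gas_limit / gas_per_action_byte
}

fn main() {
    let chunk_gas_limit = 1_000_000_000_000_000; // 1000 Tgas (placeholder)
    let gas_per_action_byte = 2_500_000; // ~2.5 Mgas per byte (placeholder)
    let bytes = max_action_bytes_per_chunk(chunk_gas_limit, gas_per_action_byte);
    println!("worst-case action bytes per chunk: ~{} MB", bytes / 1_000_000);
}
```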
The data dependency rabbit hole looks a bit too deep to reach a conclusion just yet. I will ignore data receipts for now and just look at other boundaries we can figure out.
Accessed state size per action
To understand how much state an action receipt may access in the worst case, we have to consider two parts.
Values
Observations:
Number of trie nodes accessed
Observations:
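(Not part of the original observations, just an illustration: in a nibble-keyed Merkle-Patricia trie like NEAR's, the number of nodes touched by a single lookup grows with the key length, since in the worst case every nibble of the key can sit behind its own branch node.)

```rust
// Illustrative upper bound, assuming one branch node per nibble plus a leaf.
// Real tries contain extension nodes that usually make the path much shorter.
fn worst_case_nodes_touched(key_len_bytes: usize) -> usize {
    let nibbles = key_len_bytes * 2;
    nibbles + 1 // one branch node per nibble, plus the final leaf node
}

fn main() {
    // e.g. a 64-byte trie key could touch on the order of 129 nodes in the worst case
    println!("{}", worst_case_nodes_touched(64));
}
```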
Results
I've done all the analysis in the spreadsheet I also used for calculating how large receipts themselves are. Here is the summary. Note that the first row (action receipt) is always necessary, so the touched trie nodes for an account creation transaction will be
* empty action receipt needs to access data receipts; these calculations are still ongoing

Conclusion
Suggestion to enforce ~45MB witness size per chunk
Here is an idea for how to limit the worst-case sizes. We could add additional chunk space limits. Today, a chunk is already limited by gas, by compute costs (in the case of known-to-be-undercharged parameters), and by the size of transactions. We can add more conditions. Specifically, I would like to add a limit for:
For this to work, we need good congestion control. If the delayed receipts queue gets too long, this itself could blow up the witness size. Let's say we can keep the queue length per shard below 10MB; then we have used 3 * 10MB = 30MB of witness size so far. The remaining 15MB come from assumptions about how we enforce the limit.

Note on enforcing (soft-)limits
To avoid re-execution, we might want to make these soft limits. (The last applied receipt is allowed to go beyond, we just don't execute another one.) That's how we currently handle all existing limits. But that means the real limit is the soft limit plus the largest possible size of a single receipt application. If we do nothing, this completely blows up the limits back into the 100s-of-MBs territory. To limit the effect of the last receipt that goes above the soft limit, I suggest we also add per-receipt limits on network bandwidth and witness size. Going above the limit will result in an error, which is observable by users as a new failure mode. But the hypothesis is that we can set the limit high enough that normal usage will not hit it. Ballpark numbers:
Hard limit?
The obvious alternative would be to just make these hard limits. As in, the receipt that breaches the limit is reverted and will only be applied in the next chunk. In this design, the per-receipt limit would be the same as the per-chunk limit. Single receipts that go above the chunk limit will never execute successfully, so we will also need to make them fail explicitly, which again introduces a new failure mode for receipt execution. Still, enforcing hard limits could bring down the worst-case witness size from ~45MB to ~30MB if we just take all the suggested numbers used so far.
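To make the soft-limit vs. hard-limit mechanics concrete, here is a minimal sketch of a chunk application loop with both checks. All type names and constants are illustrative placeholders, not nearcore's actual code:

```rust
use std::collections::VecDeque;

const PER_CHUNK_SOFT_LIMIT: usize = 45_000_000; // bytes of witness per chunk (placeholder)
const PER_RECEIPT_HARD_LIMIT: usize = 4_000_000; // bytes of witness per receipt (placeholder)

struct Receipt;

enum Outcome {
    Applied,
    FailedWitnessTooLarge, // the new failure mode discussed above
}

fn apply_chunk(mut receipts: VecDeque<Receipt>, delayed: &mut Vec<Receipt>) -> Vec<Outcome> {
    let mut witness_size = 0usize;
    let mut outcomes = Vec::new();
    while let Some(receipt) = receipts.pop_front() {
        // Soft limit: the receipt that crossed the limit has already been applied;
        // everything that is still unprocessed goes to the delayed queue.
        if witness_size > PER_CHUNK_SOFT_LIMIT {
            delayed.push(receipt);
            delayed.extend(receipts.drain(..));
            break;
        }
        let recorded = apply_receipt(&receipt); // bytes of storage proof this receipt recorded
        witness_size += recorded;
        // Hard limit: a single receipt that records too much storage proof fails.
        if recorded > PER_RECEIPT_HARD_LIMIT {
            outcomes.push(Outcome::FailedWitnessTooLarge);
        } else {
            outcomes.push(Outcome::Applied);
        }
    }
    outcomes
}

fn apply_receipt(_receipt: &Receipt) -> usize {
    0 // placeholder for actual execution + storage-proof recording
}

fn main() {
    let mut delayed = Vec::new();
    let outcomes = apply_chunk(VecDeque::from(vec![Receipt, Receipt]), &mut delayed);
    println!("{} outcomes, {} delayed", outcomes.len(), delayed.len());
}
```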
@jakmeier, is this already handled with local congestion control? Or are you suggesting we do global congestion control as well?
Sounds like this needs a NEP?
@robin-near, when we are loading the trie in memory, how much do we have to pay attention to TTN? Can we completely get rid of it? In other words, can we cross out the following statement by Jakob?
Trie node access would be pretty cheap, yeah. But writes are not completely free (they still have to go to disk), so probably keep that. Not sure about the empty receipts part - what does that mean?
Access may be cheap, but this issue here is specifically about witness size in the worst case (hard limits we can prove, not practical usage patterns). Even if it's served from memory, you still have to upload and download it over the network, potentially archive it, and so on.
With stateless validation, my assumption here is/was that each trie node involved in the state transition needs to be in the witness (assuming a pre-ZK implementation). An empty receipt, as in a receipt with no actions, is still a receipt that is stored and loaded from the trie. But because it's empty, it's particularly cheap in terms of gas costs. In other words, many empty receipts may be included in a single chunk. Does that make sense?
Oh sorry, I was answering Yoon's question without looking at the context. For the greater state witness size issue I don't have any useful comment at this moment.
AFAIU the cost of touching a single Trie node is
I'm currently trying to construct a contract that generates the largest storage proof possible, and for now I think the most efficient option is to read large values. The cost of reading one value byte is only
So far I was able to construct a receipt which generates 24MB of storage proof by reading large values. It's only half of the theoretical maximum, but I guess other costs get in the way.
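For illustration (this is not jancionear's actual contract), a contract along these lines could look roughly like the following, assuming near-sdk 4.x; the struct, method names, and keys are made up. The point is that every value read during the call has to be included in the storage proof so validators can re-execute the receipt:

```rust
// Hedged sketch: store a few large values, then read them all back in one call
// to inflate the recorded storage proof.
use near_sdk::borsh::{self, BorshDeserialize, BorshSerialize};
use near_sdk::{env, near_bindgen};

#[near_bindgen]
#[derive(BorshDeserialize, BorshSerialize, Default)]
pub struct ProofInflator;

#[near_bindgen]
impl ProofInflator {
    /// Store `n` large values under sequential keys (done ahead of time).
    pub fn fill(&mut self, n: u64, value_len: u64) {
        let value = vec![0u8; value_len as usize];
        for i in 0..n {
            env::storage_write(&i.to_le_bytes(), &value);
        }
    }

    /// Read every value back; all of them end up in the recorded storage proof.
    pub fn read_all(&self, n: u64) {
        for i in 0..n {
            let _ = env::storage_read(&i.to_le_bytes());
        }
    }
}
```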
…tion (#11069)

During receipt execution we record all touched nodes from the pre-state trie. Those recorded nodes form the storage proof that is sent to validators, and validators use it to execute the receipts and validate the results.

In #9378 it's stated that in a worst-case scenario a single receipt can generate hundreds of megabytes of storage proof. That would cause problems, as it'd cause the `ChunkStateWitness` to also be hundreds of megabytes in size, and there would be problems with sending this much data over the network. Because of that we need to limit the size of the storage proof. We plan to have two limits:

* per-chunk soft limit - once a chunk has more than X MB of storage proof we stop processing new receipts, and move the remaining ones to the delayed receipt queue. This has been implemented in #10703
* per-receipt hard limit - once a receipt generates more than X MB of storage proof we fail the receipt, similarly to what happens when a receipt goes over the allowed gas limit. This one is implemented in this PR.

Most of the hard-limit code is straightforward - we need to track the size of recorded storage and fail the receipt if it goes over the limit. But there is one ugly problem: #10890. Because of the way the current `TrieUpdate` works we don't record all of the storage proof in real time. There are some corner cases (deleting one of two children of a branch) in which some nodes are not recorded until we do `finalize()` at the end of the chunk. This means that we can't really use `Trie::recorded_storage_size()` to limit the size, as it isn't fully accurate. If we did that, a malicious actor could prepare receipts which seem to have only 1MB of storage proof during execution, but actually record 10MB during `finalize()`. There is a long discussion in #10890 along with some possible solution ideas, please read that if you need more context.

This PR implements Idea 1 from #10890. Instead of using `Trie::recorded_storage_size()` we'll use `Trie::recorded_storage_size_upper_bound()`, which estimates the upper bound of the recorded storage size by assuming that every trie removal records an additional 2000 bytes:

```rust
/// Size of the recorded state proof plus some additional size added to cover removals.
/// An upper-bound estimation of the true recorded size after finalization.
/// See #10890 and #11000 for details.
pub fn recorded_storage_size_upper_bound(&self) -> usize {
    // Charge 2000 bytes for every removal
    let removals_size = self.removal_counter.saturating_mul(2000);
    self.recorded_storage_size().saturating_add(removals_size)
}
```

As long as the upper bound is below the limit we can be sure that the real recorded size is also below the limit. It's a rough estimation, which often exaggerates the actual recorded size (even by 20+ times), but it could be a good-enough/MVP solution for now. Doing it in a better way would require a lot of refactoring in the Trie code. We're now [moving fast](https://near.zulipchat.com/#narrow/stream/407237-core.2Fstateless-validation/topic/Faster.20decision.20making), so I decided to go with this solution for now.
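As a rough illustration of how this upper bound would be used (the actual check lives in nearcore's runtime and looks different), the per-receipt enforcement amounts to comparing the growth of the pessimistic bound against the limit after each receipt; all names below are hypothetical:

```rust
// Sketch with hypothetical names, not the actual nearcore API surface.
struct StorageProofSizeExceeded;

/// Returns an error if the storage proof recorded while executing one receipt
/// may exceed `per_receipt_limit` bytes, judged by the pessimistic upper bound.
fn check_receipt_storage_proof(
    upper_bound_before: usize, // recorded_storage_size_upper_bound() before the receipt
    upper_bound_after: usize,  // recorded_storage_size_upper_bound() after the receipt
    per_receipt_limit: usize,
) -> Result<(), StorageProofSizeExceeded> {
    let recorded_by_receipt = upper_bound_after.saturating_sub(upper_bound_before);
    // If even the upper bound fits under the limit, the true size after finalize() fits too.
    if recorded_by_receipt > per_receipt_limit {
        Err(StorageProofSizeExceeded)
    } else {
        Ok(())
    }
}

fn main() {
    // e.g. a receipt whose upper bound grew by 5MB fails against a 4MB limit
    assert!(check_receipt_storage_proof(1_000_000, 6_000_000, 4_000_000).is_err());
}
```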
The upper bound calculation has been added in a previous PR along with the metrics to see if using such a rough estimation is viable: #11000

I set up a mainnet node with shadow validation to gather some data about the size distribution with mainnet traffic: [Metrics link](https://nearinc.grafana.net/d/edbl9ztm5h1q8b/stateless-validation?orgId=1&var-chain_id=mainnet&var-shard_id=All&var-node_id=ci-b20a9aef-mainnet-rpc-europe-west4-01-84346caf&from=1713225600000&to=1713272400000)

![image](https://github.com/near/nearcore/assets/149345204/dc3daa88-5f59-4ae5-aa9e-ab2802f034b8)
![image](https://github.com/near/nearcore/assets/149345204/90602443-7a0f-4503-9bce-8fbce352c0ba)

The metrics show that:

* For all receipts both the recorded size and the upper bound estimate are below 2MB
* The overwhelming majority of receipts generate < 50KB of storage proof
* For all chunks the upper bound estimate is below 6MB
* For 99% of chunks the upper bound estimate is below 3MB

Based on this I believe that we can:

* Set the hard per-receipt limit to 4MB. All receipts were below 2MB, but it's good to have a bit of a safety margin here. This is a hard limit, so it might break existing contracts if they turn out to generate more storage proof than the limit.
* Set the soft per-chunk limit to 3MB. 99% of chunks will not be affected by this limit. For the 1% that hit the limit they'll execute fewer receipts, with the rest of the receipts put into the delayed receipt queue. This slightly lowers throughput of a single chunk, but it's not a big slowdown, by ~1%.

Having a 4MB per-receipt hard limit and a 3MB per-chunk soft limit would give us a hard guarantee that for all chunks the total storage proof size is below 7MB (once a chunk crosses the 3MB soft limit no further receipts are executed, and the receipt that crosses it can itself record at most 4MB, so the total stays below 3MB + 4MB = 7MB).

It's worth noting that gas usage already limits the storage proof size quite effectively. For 98% of chunks the storage proof size is already below 2MB, so the limit isn't really needed for typical mainnet traffic. The limit matters mostly for stopping malicious actors that'd try to DoS the network by generating large storage proofs.

Fixes: #11019
@jancionear Yes, you are correct for cases where we actually charge
But I am not up to speed with all the stateless validation changes. If you reintroduce the TTN cost for state reads as part of stateless validation, that indeed changes the calculations completely.
With stateless validation we always walk down the trie, even when using flat storage, because we have to record the touched nodes and place them in the storage proof:

```rust
fn lookup_from_flat_storage(
    &self,
    key: &[u8],
) -> Result<Option<OptimizedValueRef>, StorageError> {
    let flat_storage_chunk_view = self.flat_storage_chunk_view.as_ref().unwrap();
    let value = flat_storage_chunk_view.get_value(key)?;
    if self.recorder.is_some() {
        // If recording, we need to look up in the trie as well to record the trie nodes,
        // as they are needed to prove the value. Also, it's important that this lookup
        // is done even if the key was not found, because intermediate trie nodes may be
        // needed to prove the non-existence of the key.
        let value_ref_from_trie =
            self.lookup_from_state_column(NibbleSlice::new(key), false)?;
        debug_assert_eq!(
            &value_ref_from_trie,
            &value.as_ref().map(|value| value.to_value_ref())
        );
    }
    Ok(value.map(OptimizedValueRef::from_flat_value))
}
```

I thought that because of this we also charge for all TTN on the read path, but now I see that we actually don't charge gas for the trie read. Now the math makes much more sense to me: if we don't charge for TTN, then there could be severe undercharging. 👍 👍
Goals
Background
If we want to move to stateless validation, we need to understand the witness sizes involved.
This issue is about finding hard limits that are never exceeded, even when maliciously crafted transactions are sent in the worst possible order.
Why should NEAR One work on this
In the world of stateless validation, understanding and limiting state witness size is key, as the time to deliver a state witness over the network is a critical part of stateless validation. We need to make sure the state witness stays within a reasonable limit so that block processing time stays similar to what we have now and does not overload validator nodes.
One known problem is that a contract could be 4MB in size but a call to it might only consume 2.5 Tgas in execution cost. In other words, with a 1000 Tgas chunk limit we can fit 400 x 4MB = 1.6GB of WASM contract code in a single chunk.
Adding a gas cost that scales with the real contract code size would be a way to limit this.
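As a rough sketch of that idea (placeholder numbers, not an actual proposal for parameter values), charging per loaded code byte caps how many contract-code bytes a chunk can pull into the witness:

```rust
// Placeholder fee: with a per-byte cost p, a chunk with gas limit G can load at
// most G / p bytes of contract code, regardless of how cheap the calls themselves are.
const CODE_LOAD_COST_PER_BYTE: u64 = 10_000_000; // 10 Mgas per byte (placeholder)
const CHUNK_GAS_LIMIT: u64 = 1_000_000_000_000_000; // 1000 Tgas (placeholder)

fn code_load_cost(contract_code_len: u64) -> u64 {
    contract_code_len.saturating_mul(CODE_LOAD_COST_PER_BYTE)
}

fn main() {
    // A 4MB contract would cost 40 Tgas just to load under this placeholder fee,
    // so at most ~25 such calls (~100MB of code) would fit into one chunk.
    let cost = code_load_cost(4_000_000);
    println!("loading 4MB of code costs {} Tgas", cost / 1_000_000_000_000);
    println!(
        "max code bytes per chunk: {} MB",
        CHUNK_GAS_LIMIT / CODE_LOAD_COST_PER_BYTE / 1_000_000
    );
}
```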
What needs to be accomplished
Task list