[stateless_validation][witness size soft limit] Count nodes needed for key updates in state witness size limit #10890
Comments
To expand a bit on the problem: When the runtime reads or writes a value in the trie, it calls This sounds like a workable solution - we record all nodes on the path to all read and written values, but there's one corner case that isn't covered by this. When there's a branch node with two children and we delete one of the children, the branch node should be merged with the surviving child, as it doesn't make sense to have a branch node with only one child. This problem was addressed in #10841 by recording all nodes touched during This means that the runtime is not aware of those extra "surviving child" nodes, and as a result it's unable to charge gas or enforce the storage proof size limit for them. There are two main questions here:
Impact
What's the worst that could happen? A malicious actor could prepare a state that lets them trigger the corner case as many times as possible.
The cost to record the big node (2064 bytes) would be (113 + 65 = 178 bytes). This means that a malicious actor could record 12 times more state than what's accounted for in the storage proof size limit! ((2064 + 178) / 178 ≈ 12.6). That's pretty bad :/ It means that a malicious actor could grow the size of ChunkStateWitness about 10 times, from 8 MB to 80 MB. There are already problems with distributing an 8 MB witness, so an 80 MB one could fail altogether. IMO it would be good to fix it. Although with all the improvements in network distribution it might turn out that it isn't all that necessary, we'll see how it goes.
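The ratio quoted above can be double-checked with a small sketch. The two byte counts are taken from the comment; the constant and function names are hypothetical and exist only for illustration.

```rust
/// Bytes that the runtime accounts for per triggered corner case
/// (113 + 65, as quoted in the comment above).
pub const ACCOUNTED_BYTES: u64 = 113 + 65;

/// Size of the large surviving child that is only recorded during finalize
/// and is therefore not accounted for.
pub const UNACCOUNTED_CHILD_BYTES: u64 = 2064;

/// Ratio between the bytes that actually end up in the storage proof
/// and the bytes that the size limit knows about.
pub fn amplification_factor() -> f64 {
    (ACCOUNTED_BYTES + UNACCOUNTED_CHILD_BYTES) as f64 / ACCOUNTED_BYTES as f64
}
```

With these numbers the factor lands between 12 and 13, which matches the "about 12 times more state" estimate.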
Idea 1 - Charge extra to account for the unrecorded children
We could just assume that every time we delete something there's also a big malicious child that'll be recorded and charge 2000 bytes extra. The big problem with that is that it lowers chunk capacity - charging 10x more for an operation means that the number of receipts that fit in a chunk will be a few times lower. We could add some clever heuristics (only charge once per branch, count children, trace deletions, etc...).
Idea 2 - Immediately perform the operations on a real trie
A big reason why we needed This solution would be very clean - no strange tricks, we just apply the trie changes like we said we do.
Idea 3 - Finalize after every receipt
Currently we finalize after every chunk, but we could do it after every receipt instead. All operations during the receipt's execution would be performed on Postponing node recording could have additional performance benefits - recording the node on every node access means that we have to perform extra work during every access - maybe serialize the node, maybe hash it. This work is repeated every time we access this node, which is wasteful. Recording only during finalize would let us record every node exactly once, which could be more performant.
Idea 4 - Modify how the trie works
What if we didn't merge the branch during deletion? We could leave this work to the next person that tries to read the value from the surviving child. With such a modification it'd be enough to record nodes that are on the path to the value that is read/written. We'd need to think about the implications of such a fundamental change.
It's a very interesting idea, but it'd change how a fundamental piece of the system works, so it could be a lot of work. OTOH it'd prepare us for Verkle trees later ;) |
It makes sense, thank you very much for the overview!
In the longer term, anything from (2) to (4) looks feasible. Perhaps (4) is simpler and (3) is harder to implement, because it requires to serve state reads from |
Hmm that's a good point, if it's true that contracts rarely delete things then charging extra wouldn't hurt normal users that much, while still protecting us from malicious actors.
I really like (2), because it's a clean simple solution, but it might be a big effort to get it done, make sure that there are no performance problems, etc. (3) sounds more feasible to me, while still being a proper solution. I need to dive into the code and figure out what's feasible.
We could still serve reads from |
Just to clarify, adding 2000 to the size estimation on each storage_remove doesn't look like too much complexity, and it doesn't have to be inside Trie. Or you mean additional logic with "checking if there were 2 children"? |
Eh IMO it counts as complex, it's one more thing that you need to remember about and be cautious to do properly. It's very nonobvious, and it'll be easy to forget about charging the extra gas somewhere. People who will work with this code in the future won't be aware of it, it's a trap set for them. It's not the end of the world, but I'd prefer a solution which is intuitive, without such gotchas. |
Ah I see. Actually, I don't think we should necessarily charge more gas. I just mean maintaining u64 variable which estimates state witness size, which is usually size(recorded_storage), and say every time when storage_remove is invoked, we add 2000 there. |
@robin-near , @tayfunelmas , @shreyan-gupta , @bowenwang1996 , and I talked about ideas in the warroom and we converged into our own preference. We can talk about pros/cons of each idea during stateless validation meeting tomorrow |
For now I'll work towards implementing Solution 1 (charge extra). It's a simple solution that could be good enough (depends on the data gathered from #11000). From other discussion it seems that we've converged on Idea 3 (finalize after every receipt) as a more proper solution, but implementing it would involve more effort. We can consider implementing the proper solution later, after the MVP release. |
cc. @robin-near , @tayfunelmas , @shreyan-gupta . If we go with solution 1, do we plan to implement the hard limit separately? I thought the benefit of solution 3 was being able to tackle the soft limit corner case and the hard limit at the same time? Separate question. I understand that solution 1 will result in fewer receipts in a chunk, but what if one receipt is already too big (e.g. 50 MB)? Won't it still cause a chunk miss?
Btw, quick question, in case we are going with solution 1, would we have to maintain this extra charge implementation forever, even if we upgrade protocol later? |
Going with (1) doesn't cause any trouble with implementing the hard per-receipt limit. During |
I don't know, is it expected that the current binary can apply any block from the past? |
That's still expected today as far as I understand. |
So, confirming, solution 1 basically says the following right?
Meanwhile for the hard limit, we were having some more discussions here, and I'll try to give a brief summary. Our original thought of "finalize after each receipt execution" to get the size of the witness (per receipt) doesn't work, as it's impossible to roll back touched trie nodes. Once touched, they need to be a part of the state witness, otherwise the validation cannot happen properly. What we need to remember here is that chunk validators only work on the state witness partial trie and don't have access to the whole trie like the chunk producer does. Instead, we need to tap into the runtime and stop execution as soon as we reach the hard limit. This would be consistent across chunk producer and validators. To be noted, initially we were thinking about this as a hard limit per receipt execution, but we may need to change our mindset and think of it as a hard limit for the whole chunk/for all receipts instead, as that's the most straightforward way to implement it. (We only keep a running track of state witness size, not for individual receipts). We would have to then set the limits as something on the lines of
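The "stop execution as soon as we reach the hard limit" approach can be sketched roughly as follows. This is only an illustration under assumptions: `WitnessSizeTracker`, `WitnessLimitExceeded` and `record_node` are hypothetical names, not nearcore APIs. The point it shows is that the counter only ever grows (touched nodes cannot be rolled back) and that the check happens during execution rather than after it.

```rust
/// Hypothetical error returned once the running witness size crosses the hard limit.
#[derive(Debug, PartialEq, Eq)]
pub struct WitnessLimitExceeded {
    pub recorded: usize,
    pub limit: usize,
}

/// Hypothetical running tracker of state witness size for the whole chunk.
/// Chunk producer and validators would run the same logic, so both stop
/// at the same point.
pub struct WitnessSizeTracker {
    recorded: usize,
    hard_limit: usize,
}

impl WitnessSizeTracker {
    pub fn new(hard_limit: usize) -> Self {
        Self { recorded: 0, hard_limit }
    }

    /// Called for every trie node touched during execution.
    /// The node stays counted even when the limit is exceeded, because
    /// a touched node has to be part of the witness.
    pub fn record_node(&mut self, node_size: usize) -> Result<(), WitnessLimitExceeded> {
        self.recorded = self.recorded.saturating_add(node_size);
        if self.recorded > self.hard_limit {
            return Err(WitnessLimitExceeded {
                recorded: self.recorded,
                limit: self.hard_limit,
            });
        }
        Ok(())
    }

    /// Total size recorded so far.
    pub fn recorded(&self) -> usize {
        self.recorded
    }
}
```

In such a design the runtime would propagate the error and fail the receipt, similarly to how running out of gas is handled.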
In terms of implementation, the logic for the hard limit need to go via Code flow is
Oooh that's a good point. To prove that executing the receipt generated 100MB of storage proof we would have to include the 100 MB storage proof in the witness x.x. You're right, post execution checking doesn't really look viable. That kinda kills idea (3) :c |
Eh yeah, this sucks :/ But I'm kinda thinking that after moving to stateless validation we might want to throw some of the old trie infrastructure away anyway. There's a lot of things that don't make much sense with stateless validation - accounting cache, flat storage, etc. |
…ize based on number of trie removals (#11000)

In #10890 we considered adding extra 2000 bytes to the storage proof size for every removal operation on the trie (Idea 1). This PR implements the logic of recording the number of removals and calculating the adjusted storage proof size. It's not used in the soft witness size limit for now, as we have to evaluate how viable of a solution it is. There are concerns that charging 2000 bytes extra would cause the throughput to drop.

To evaluate the impact of adjusting the size calculation like this, the PR adds three new metrics.

* `near_receipt_recorded_size` - a histogram of how much storage proof is recorded when processing a receipt. Apart from #10890, it'll help us estimate what the hard per-receipt storage proof size limit should be.
* `near_receipt_adjusted_recorded_size` - a histogram of adjusted storage proof size calculated when processing a receipt
* `near_receipt_adjusted_recorded_size_ratio` - ratio of adjusted size to non-adjusted size. It'll allow us to evaluate how much the adjustment affects the final size. The hope is that contracts rarely delete things, so the effect will be small (a few percent), but it might turn out that this assumption is false and the adjusted size is e.g. 2x higher than the non-adjusted one. In that case we'd have to reevaluate whether it's a viable solution.

I'd like to run code from this branch on shadow validation nodes to gather data from mainnet traffic.
I added some metrics to evaluate viability of idea 1 (in #11000), and started a shadow validation node to see how the upper-bound size compares to the recorded size. Here are the grafana dashboards (starting at "Recorded storage size per receipt"): |
Some observation about the data:
Based on this we could try setting the per-receipt limit to 4MB, and the per-chunk soft limit to 8MB. It leaves us with a bit of safety margin to make sure that the existing contracts won't break, while ensuring that the total size of storage proof stays under 12MB. The truth is that gas costs seem to already limit the size quite effectively. All of the chunk-level storage proofs are under 2MB (on mainnet traffic). The limit doesn't really need to maintain the 2MB size of storage proof, as it's already maintained by gas costs. It only has to protect against malicious traffic aiming to generate a huge storage proof. |
How often does this go out of sync?
I guess this is the other 5% of the cases? (since you mentioned upper bound is accurate for 95% of chunks)
Is this pre-compression size? cc. @saketh-are , @shreyan-gupta |
I only collected this data for receipts which generate >100KB of storage proof, the ratio could get really high for small receipts, and small receipts don't matter for hard limit, so I excluded them. I can't say for sure, but I'd estimate that ~5% of receipts have a big difference between the upper bound estimation and the actual value.
Those really big ones are ~0.2% of all chunks. I guess we could ignore them and lower the soft limit further. The soft limit can be arbitrarily lowered as long as it doesn't affect throughput too much.
Yes, although Trie nodes are mostly hashes, which aren't compressible, so it might not matter that much. |
Added one more dashboard to see how many chunks are at most X MB Big: It looks like we could set the soft size limit to as low as 3MB, and 99% of the chunks would still fit in this limit. For the other 1% we would move the receipts to the delayed queue and execute them in another chunk. That would be a ~1% slowdown in chunk throughput, but it'd give us a solid guarantee that no chunk storage proof is larger than 7MB. |
Can you also check where these outliers are coming from? If they are from major dapps (e.g HOT), that can be troublesome |
I added some logs to print out receipts that generate more than 500KB of storage proof (upper bound).
For Here are the logs: |
…time To enforce the hard per-receipt limit we need to monitor how much storage proof has been recorded during execution of the receipt and halt the execution when the size of generated storage proof goes over the limit. To achieve this the runtime needs to be able to see how much proof was recorded, so let's expose this information so that it's available from the runtime. `recorded_storage_size()` doesn't provide the exact size of storage proof, as it doesn't cover some corner cases (see near#10890), so we use the `upper_bound` version to estimate how much storage proof could've been generated by the receipt. As long as upper bound is under the limit we can be sure that the actual value is also under the limit.
…limit The `recorded_storage_size()` function can sometimes return a value which is less than the actual recorded size. See near#10890 for details. This means that a malicious actor could create a workload which would bypass the soft size limit and e.g. generate 10x more storage proof than allowed. To fix this problem let's use the upper bound estimation of the total recorded size. As long as the upper bound estimation is under the limit we can be sure that the actual value is also under the limit.
We've got a new record!
…tion (#11069)

During receipt execution we record all touched nodes from the pre-state trie. Those recorded nodes form the storage proof that is sent to validators, and validators use it to execute the receipts and validate the results.

In #9378 it's stated that in a worst case scenario a single receipt can generate hundreds of megabytes of storage proof. That would cause problems, as it'd cause the `ChunkStateWitness` to also be hundreds of megabytes in size, and there would be problems with sending this much data over the network.

Because of that we need to limit the size of the storage proof. We plan to have two limits:

* per-chunk soft limit - once a chunk has more than X MB of storage proof we stop processing new receipts, and move the remaining ones to the delayed receipt queue. This has been implemented in #10703
* per-receipt hard limit - once a receipt generates more than X MB of storage proof we fail the receipt, similarly to what happens when a receipt goes over the allowed gas limit. This one is implemented in this PR.

Most of the hard-limit code is straightforward - we need to track the size of recorded storage and fail the receipt if it goes over the limit. But there is one ugly problem: #10890. Because of the way current `TrieUpdate` works we don't record all of the storage proof in real time. There are some corner cases (deleting one of two children of a branch) in which some nodes are not recorded until we do `finalize()` at the end of the chunk. This means that we can't really use `Trie::recorded_storage_size()` to limit the size, as it isn't fully accurate. If we do that, a malicious actor could prepare receipts which seem to have only 1MB of storage proof during execution, but actually record 10MB during `finalize()`.

There is a long discussion in #10890 along with some possible solution ideas, please read that if you need more context.

This PR implements Idea 1 from #10890. Instead of using `Trie::recorded_storage_size()` we'll use `Trie::recorded_storage_size_upper_bound()`, which estimates the upper bound of recorded storage size by assuming that every trie removal records additional 2000 bytes:

```rust
/// Size of the recorded state proof plus some additional size added to cover removals.
/// An upper-bound estimation of the true recorded size after finalization.
/// See #10890 and #11000 for details.
pub fn recorded_storage_size_upper_bound(&self) -> usize {
    // Charge 2000 bytes for every removal
    let removals_size = self.removal_counter.saturating_mul(2000);
    self.recorded_storage_size().saturating_add(removals_size)
}
```

As long as the upper bound is below the limit we can be sure that the real recorded size is also below the limit.

It's a rough estimation, which often exaggerates the actual recorded size (even by 20+ times), but it could be a good-enough/MVP solution for now. Doing it in a better way would require a lot of refactoring in the Trie code. We're now [moving fast](https://near.zulipchat.com/#narrow/stream/407237-core.2Fstateless-validation/topic/Faster.20decision.20making), so I decided to go with this solution for now.

The upper bound calculation has been added in a previous PR along with the metrics to see if using such a rough estimation is viable: #11000

I set up a mainnet node with shadow validation to gather some data about the size distribution with mainnet traffic: [Metrics link](https://nearinc.grafana.net/d/edbl9ztm5h1q8b/stateless-validation?orgId=1&var-chain_id=mainnet&var-shard_id=All&var-node_id=ci-b20a9aef-mainnet-rpc-europe-west4-01-84346caf&from=1713225600000&to=1713272400000)

![image](https://github.com/near/nearcore/assets/149345204/dc3daa88-5f59-4ae5-aa9e-ab2802f034b8)
![image](https://github.com/near/nearcore/assets/149345204/90602443-7a0f-4503-9bce-8fbce352c0ba)

The metrics show that:

* For all receipts both the recorded size and the upper bound estimate are below 2MB
* Overwhelming majority of receipts generate < 50KB of storage proof
* For all chunks the upper bound estimate is below 6MB
* For 99% of chunks the upper bound estimate is below 3MB

Based on this I believe that we can:

* Set the hard per-receipt limit to 4MB. All receipts were below 2MB, but it's good to have a bit of a safety margin here. This is a hard limit, so it might break existing contracts if they turn out to generate more storage proof than the limit.
* Set the soft per-chunk limit to 3MB. 99% of chunks will not be affected by this limit. For the 1% that hit the limit they'll execute fewer receipts, with the rest of the receipts put into the delayed receipt queue. This slightly lowers throughput of a single chunk, but it's not a big slowdown, by ~1%.

Having a 4MB per-receipt hard limit and a 3MB per-chunk soft limit would give us a hard guarantee that for all chunks the total storage proof size is below 7MB.

It's worth noting that gas usage already limits the storage proof size quite effectively. For 98% of chunks the storage proof size is already below 2MB, so the limit isn't really needed for typical mainnet traffic. The limit matters mostly for stopping malicious actors that'd try to DoS the network by generating large storage proofs.

Fixes: #11019
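To see why a 3MB soft limit plus a 4MB hard limit bounds a chunk at 7MB, here is a minimal sketch of the receipt loop. It is a simplification under stated assumptions, not the nearcore implementation: `apply_chunk` is a hypothetical function, each receipt is represented only by the proof size it would like to record, MB is taken as 1,000,000 bytes, and a receipt that hits the hard limit is modelled as recording exactly the limit before it fails.

```rust
const MB: usize = 1_000_000;

/// Soft per-chunk limit proposed in the PR description above.
pub const SOFT_CHUNK_LIMIT: usize = 3 * MB;
/// Hard per-receipt limit proposed in the PR description above.
pub const HARD_RECEIPT_LIMIT: usize = 4 * MB;

/// Hypothetical chunk application loop.
/// Returns (total recorded proof size, number of receipts moved to the delayed queue).
pub fn apply_chunk(receipt_proof_sizes: &[usize]) -> (usize, usize) {
    let mut total = 0usize;
    let mut executed = 0usize;
    for &size in receipt_proof_sizes {
        // Soft limit: checked only between receipts. Once reached,
        // the remaining receipts go to the delayed receipt queue.
        if total >= SOFT_CHUNK_LIMIT {
            break;
        }
        // Hard limit: a single receipt never contributes more than the
        // per-receipt limit, because it is failed once it goes over.
        total += size.min(HARD_RECEIPT_LIMIT);
        executed += 1;
    }
    (total, receipt_proof_sizes.len() - executed)
}
```

Because a receipt only starts while the total is still below the soft limit and can add at most the hard limit, the total stays below the sum of the two limits in this model.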
#11069 worked around this issue by implementing |
…11507) To limit the amount of storage proof generated during chunk application we calculate the upper bound estimation of how big the storage proof will be, and stop executing receipts when this estimated size gets too big. When estimating we assume that every trie removal generates 2000 bytes of storage proof, because this is the maximum size that a malicious attacker could generate (#11069, #10890). This estimation was meant to limit the size of proof generated while executing receipts, but currently it also applies to other trie removals performed by the runtime, for example when removing receipts from the delayed receipt queue. This is really wasteful - removing 1000 receipts would cause the estimation to jump by 2MB, hitting the soft limit. We don't really need to charge this much for internal operations performed by the runtime, they aren't malicious. Let's change it so that only contracts are charged extra for removals. This will avoid the extra big estimation caused by normal queue manipulation. Refs: https://near.zulipchat.com/#narrow/stream/308695-nearone.2Fprivate/topic/Large.20number.20of.20delayed.20receipts.20in.20statelessnet/near/442878068
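One way to picture the change is an estimator that only bumps its removal counter for removals that originate from contracts. This is a hedged sketch: `RemovalSource` and `ProofSizeEstimator` are hypothetical names made up for illustration, and only the 2000-byte charge and the saturating arithmetic mirror `recorded_storage_size_upper_bound()` quoted earlier in the thread.

```rust
/// Hypothetical tag describing who performed a trie removal.
#[derive(Clone, Copy, PartialEq, Eq)]
pub enum RemovalSource {
    /// Removal requested by contract code (potentially malicious).
    Contract,
    /// Internal runtime bookkeeping, e.g. popping the delayed receipt queue.
    Runtime,
}

/// Hypothetical stand-in for the state that tracks recorded proof size.
#[derive(Default)]
pub struct ProofSizeEstimator {
    recorded_size: usize,
    removal_counter: usize,
}

impl ProofSizeEstimator {
    /// Account for bytes that were actually recorded.
    pub fn record(&mut self, bytes: usize) {
        self.recorded_size = self.recorded_size.saturating_add(bytes);
    }

    /// Count a removal, but only charge the extra estimate for contracts.
    pub fn remove(&mut self, source: RemovalSource) {
        if source == RemovalSource::Contract {
            self.removal_counter = self.removal_counter.saturating_add(1);
        }
    }

    /// Upper bound estimate: recorded size plus 2000 bytes per charged removal.
    pub fn upper_bound(&self) -> usize {
        let removals_size = self.removal_counter.saturating_mul(2000);
        self.recorded_size.saturating_add(removals_size)
    }
}
```

With this shape, removing 1000 receipts from the delayed queue leaves the estimate unchanged, while a single contract removal still adds 2000 bytes.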
When we change state KV pairs, it happens on e.g. `TrieUpdate::set`, which doesn't access nodes on its own. Nodes required to update the trie are read at the end, on `TrieUpdate::finalize`. While `Runtime::apply` produces a correct state witness, its "online" size is not computed correctly because of that.

One way to compute a better size is to always call `Trie::get` on `TrieUpdate::set`.

Note that `TrieUpdate::finalize` may still access new nodes due to branch restructuring. But I think the impact of that is small, only up to 2 nodes per key removal.

Discussion: https://near.zulipchat.com/#narrow/stream/295558-core/topic/Trie.3A.3Aget.20on.20TrieUpdate.3A.3Aset/near/428163849