fix: record nodes for writes in memtrie #10841
Conversation
Force-pushed from 691b53c to 2c8493c (Compare)
Codecov Report
Attention: Patch coverage is
Additional details and impacted files:

@@           Coverage Diff            @@
##           master   #10841    +/-  ##
========================================
  Coverage   71.41%   71.42%
========================================
  Files         763      764       +1
  Lines      153717   153956     +239
  Branches   153717   153956     +239
========================================
+ Hits       109781   109965     +184
- Misses      38938    38991      +53
- Partials     4998     5000       +2

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
Out of curiosity, how much additional memory do we use with this change?
core/store/src/trie/mod.rs (outdated):

// Retroactively record all accessed trie items to account for
// key-value pairs which were only written but never read, thus
There's a gap in my logical understanding: what is the link that goes from "a key is written to but not read from" to "we must have the previous value for that key"?
For the specific trie-restructuring case I knew of (converting a branch with one child into an extension), the child node must be read in order to do the conversion, and at the same time the child node is deleted. So in that case, yes, the child is written to (in the sense that its refcount was decremented) but not read from (it was never queried), and its previous value was needed (to perform the restructuring). But is that in general the only case where the update phase would cause a value to be written to but not read from?
In other words, is the logic this: "if a node is written to but not read from, then that node must have been replaced during a trie restructuring, and therefore its previous value must be known" (which I would need to be convinced of), or is there some other way to draw this conclusion?
if a node is written to but not read from, then that node must have been replaced during a trie restructuring, and therefore its previous value must be known
I think "key" got replaced with "node" somewhere in the message.
The logic is: "if a key is written to but not read from" => "all nodes on the trie path to it must be known".
Say you don't know some node on the path to the key, and take the lowest such node. Then you can't recompute it after the write, because you don't know the hashes of the key's neighbours, so you can't compute the new state root.
And, yeah, when we descend into the node corresponding to the key, all refcounts on the way are decremented.
I actually want to propose simpler logic in this PR: "for every key on which any state operation was called (get/has/write/remove), all nodes on the old path must be recorded". That's what the current disk trie logic does, and it's easy to explain and follow.
For memtries, the union of (nodes recorded during chunk application) and (nodes recorded during trie restructuring) should give this set. So I didn't even think much about whether this is strictly necessary; it's enough for me that it is verifiable.
I see, ok, thanks for clarifying your intent in the mathematical form :)
I think my question is a bit more detailed than this.
Suppose we replace a node in the trie. Do we actually need the node that we're replacing? The parent of that node holds the hashes of the sibling nodes that we need to reconstruct the parent. For example, we have A -> B -> C -> D and we're replacing D with D'. Do we need to have D? To reconstruct A' -> B' -> C' -> D', we only need A, B, C, and D'; D itself is not needed. But the logic in this PR seems to require that D be present. That's the part I don't understand.
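The shape of this argument can be sketched on a toy binary Merkle tree (a deliberate simplification of the 16-ary trie; all names below are hypothetical, not nearcore code): recomputing the root after a leaf change needs every internal node on the path, because each rehash needs the sibling hashes those nodes contain, but the old leaf value itself is never consulted:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Toy hash function standing in for a cryptographic hash.
fn h(data: &[u64]) -> u64 {
    let mut s = DefaultHasher::new();
    data.hash(&mut s);
    s.finish()
}

/// Recompute the root of a toy binary Merkle tree after replacing one leaf.
/// `path_nodes[i]` holds the two child hashes of the i-th node on the
/// root-to-leaf path; `directions[i]` says which child continues the path.
/// Note the old leaf never appears: only the path nodes (for sibling
/// hashes) and the new leaf are required.
fn recompute_root(path_nodes: &[[u64; 2]], directions: &[usize], new_leaf: u64) -> u64 {
    let mut acc = new_leaf;
    // Walk back up the path, rehashing each node with the updated child.
    for (node, &dir) in path_nodes.iter().zip(directions.iter()).rev() {
        let mut children = *node;
        children[dir] = acc;
        acc = h(&children);
    }
    acc
}
```

Plugging the old leaf back in reproduces the old root, which is the sense in which the path nodes (but not the replaced node) are what the proof needs.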
It makes sense.
Then, my argument comes down to:
- That's what the current disk logic does. Every time we descend to a node, we pull it out via move_node_to_mutable or delete_value. By default I would avoid adding "if this is the last node, don't do it" logic in a couple of places, plus a corner case for when this node is a restructured branch.
- The logic "we take all nodes on the path" is simpler to explain than "we take all nodes on reads, plus all nodes except the last one on writes".
- The savings from not including the previous value shouldn't be high, because modifying a value without reading it is a rare case: action_deploy_contract reads the contract before redeploying it, etc. Still, we could store only the ValueRefs of previous values, but I'm not 100% confident I'd do it right the first time :D
If we are recording node D in A -> B -> C -> D for on disk tries, let's try to keep consistency and do the same for memtries even if it's not the most efficient.
core/store/src/trie/mem/updating.rs (outdated):

pub nodes: HashMap<CryptoHash, Arc<[u8]>>,
/// Hashes of accessed values - because values themselves are not
/// necessarily present in memtrie.
pub values: HashSet<CryptoHash>,
Should this have an optional value, if the value is present in memtrie?
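As a hedged sketch of what that suggestion might look like (the field names follow the snippet above, but the Option-carrying shape and the helper are hypothetical, not the PR's actual code): carry the value bytes when memtrie already has them, and keep only the hash otherwise, so only the misses need a later storage lookup:

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Stand-in for near_primitives::hash::CryptoHash in this sketch.
type CryptoHash = [u8; 32];

/// Hypothetical variant of the accesses struct: if the full value happens
/// to be present in memtrie, carry it along (`Some`); otherwise keep only
/// the hash (`None`) and fetch the bytes from storage at recording time.
pub struct TrieAccesses {
    /// Serialized nodes touched during the update.
    pub nodes: HashMap<CryptoHash, Arc<[u8]>>,
    /// Accessed values: hash -> value bytes, if already available.
    pub values: HashMap<CryptoHash, Option<Vec<u8>>>,
}

impl TrieAccesses {
    /// Hashes whose bytes still need to be fetched from storage.
    pub fn missing_values(&self) -> Vec<CryptoHash> {
        self.values
            .iter()
            .filter(|(_, v)| v.is_none())
            .map(|(hash, _)| *hash)
            .collect()
    }
}
```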
core/store/src/trie/mem/updating.rs (outdated):

if let Some(trie_refcount_changes) = self.trie_refcount_changes.as_mut() {
    trie_refcount_changes.subtract(hash, 1);
}
if let Some(tracked_node_changes) = self.tracked_trie_changes.as_mut() {
    tracked_node_changes.accesses.values.insert(hash);
}
similar to below comment, does subtracting refcount for the value guarantee that we must need the previous value of the value? If so, why?
Should be zero, it doesn't change the contents of the memtries.
core/store/src/trie/mod.rs (outdated):

// not recorded before.
if let Some(recorder) = &self.recorder {
    for (node_hash, serialized_node) in trie_accesses.nodes {
        recorder.borrow_mut().record(&node_hash, serialized_node);
    }
    for value_hash in trie_accesses.values {
        let value = self.storage.retrieve_raw_bytes(&value_hash)?;
        recorder.borrow_mut().record(&value_hash, value);
    }
}
Ok(trie_changes)
It's sad that the changes are recorded retroactively.
For the state witness size limit it would be best to record them as they happen; that would make it possible to stop executing a receipt once it generates more than X MB of PartialState. With retroactive recording we can only measure how much PartialState was generated after the update is applied, which AFAIU happens after applying the whole chunk :/
Actually, doesn't this break #10703? The soft size limit looks at the size of the recorder state after applying each receipt and stops executing new ones when this size gets above the limit. It depends on online recording; it wouldn't work with retroactive recording.
Would it be possible to somehow put the recording logic in TrieUpdate::set?
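A rough sketch of the kind of online recording this suggests (all names here are hypothetical; nearcore's real types and signatures differ): record the old path nodes at write time, so a per-operation limit check becomes possible instead of a retroactive measurement at finalization:

```rust
use std::collections::HashMap;

// Stand-in for CryptoHash in this sketch.
type Hash = u64;

/// Minimal, hypothetical recorder: accumulates node bytes and exposes a
/// soft size-limit check that can be consulted after every operation.
struct Recorder {
    recorded: HashMap<Hash, Vec<u8>>,
    size_limit: usize,
}

impl Recorder {
    fn record(&mut self, hash: Hash, bytes: Vec<u8>) {
        // Record each node at most once, like a witness would.
        self.recorded.entry(hash).or_insert(bytes);
    }

    fn recorded_size(&self) -> usize {
        self.recorded.values().map(|b| b.len()).sum()
    }

    /// False once the soft limit is exceeded, signalling the caller to
    /// stop scheduling further receipts.
    fn within_limit(&self) -> bool {
        self.recorded_size() <= self.size_limit
    }
}

/// Hypothetical online-recording `set`: record the nodes on the key's old
/// path immediately, apply the write, then report whether the soft limit
/// still holds.
fn set_with_recording(
    recorder: &mut Recorder,
    path_nodes: Vec<(Hash, Vec<u8>)>,
    state: &mut HashMap<Vec<u8>, Vec<u8>>,
    key: Vec<u8>,
    value: Vec<u8>,
) -> bool {
    for (hash, bytes) in path_nodes {
        recorder.record(hash, bytes);
    }
    state.insert(key, value);
    recorder.within_limit()
}
```

The point of the design is that the limit check happens at the same granularity as the writes, which is exactly what retroactive recording cannot offer.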
And I guess it works the same way with normal trie writes. That sucks, it really throws a wrench into the witness size limit. This solution makes sense for now, but I think the whole logic of trie updating will need a refactor for the size limit :/
Okay actually we don't need to record writes for the state witness, only the reads need to be recorded :)
As discussed in the meeting today, it turns out we do want to record the nodes for updates/deletions :(
I believe this update function is only called in trie_update finalize for all practical purposes?
We would need to revisit this and find a better solution for recording on the fly, like in the get function, instead of in trie_update finalize. Required for both the soft and hard state witness limits.
Currently (other than for trie restructuring), I guess we just rely on the fact that the runtime first calls a get before an update/delete, for gas cost estimation.
/// Hashes of accessed values - because values themselves are not
/// necessarily present in memtrie.
pub values: HashSet<CryptoHash>,
}
I don't fully understand what the additional nodes recorded by this change are.
AFAIU the runtime updates the trie by calling storage_write, which always does a trie read before writing a value:

let evicted_ptr = self.ext.storage_get(&key, StorageGetMode::Trie)?;

This trie read records all nodes that were accessed to reach this value, even in the case of memtries:

nearcore/core/store/src/trie/mod.rs, line 1296 in f2f6ed0:
if let Some(recorder) = &self.recorder {

Doesn't that record everything that is needed to prove execution of the contract? Why do we need an additional access log?
Ah I guess it doesn't record nodes that are created when adding new values...
And this path is called only during contract execution. The gateway for updating the trie outside of it is set<T: BorshSerialize>(state_update: &mut TrieUpdate, key: TrieKey, value: &T).
Overall looks good, although I would wait for an approval from Robin.
We still don't know what to do with witness size limit right?
Yeah, that's a follow-up PR.
Fixes #10769.

UPD: New motivation

While all said below is true, the main motivation is to account for nodes impacted by squash_nodes, which can be touched without much pain only in Trie::update.

Old motivation

The root cause of the "node not found" errors on reads from recorded storage is that nodes touched only in Trie::update - the KV updates at the end of chunk application - were never recorded for memtrie. This happens automatically for disk tries; for memtries we retroactively account for trie node accesses, but this wasn't done in a similar way for memtrie updates.

While the infrastructure indeed needs to be improved so these issues won't surprise us anymore, the easiest way is to account for all accesses in the same way move_node_to_mutable does - on memtrie paths like ensure_updated, which move a node out of storage into memory, meaning that node has to be recorded so the trie can be validated. Unfortunately, refcount changes are not enough to cover this, so I record trie accesses separately. Also, memtrie doesn't have access to values, so I have to retrieve the values and record them in Trie::update.

Testing

destructively_delete_in_memory_state_from_disk was removing all keys because key decoding was incorrect. Now these tests fully cover the trie access scenario: first we have get & get_ref accesses, and at the very end we have an update call for some KV pairs. I confirmed that with an incorrect impl some tests fail with the same "node not found" error, and also that the new sanity check in Trie::update works.