trie: enable batch updates #24556

holiman · 2022-03-17T12:53:02Z

A long time ago, I did an experiment (here) with using the trie in a mode where it didn't always copy-on-change. That way of using the trie is unsafe from a concurrency perspective, but in many situations, that's fine.
In the end, I didn't push that feature, I found that it didn't make a lot of difference since most of the time spent during the intermediate root wasn't related to gc, rather disk IO.

However, when we verify proofs (that are large enough that the full trie contents do not fit into a snap response), then the proof verifier uses a trie, shoves in the left and right side bounds, and fills it with e.g. ~10K values, to obtain the trie root. In this case, the batched mode could make a pretty significant change.

So I cherry-picked and tested:

name                      old time/op    new time/op    delta
VerifyProof-6               19.8µs ±34%    19.3µs ±25%     ~     (p=1.000 n=5+5)
VerifyRangeProof10-6         190µs ±88%     121µs ±18%  -36.59%  (p=0.032 n=5+5)
VerifyRangeProof100-6       1.02ms ±20%    0.36ms ±15%  -64.79%  (p=0.008 n=5+5)
VerifyRangeProof1000-6      8.45ms ±16%    4.58ms ±19%  -45.73%  (p=0.008 n=5+5)
VerifyRangeProof5000-6      33.3ms ±17%    19.8ms ±33%  -40.56%  (p=0.008 n=5+5)
VerifyRangeNoProof10-6       410µs ±20%     443µs ±22%     ~     (p=0.690 n=5+5)
VerifyRangeNoProof500-6     1.73ms ±32%    1.64ms ±17%     ~     (p=1.000 n=5+5)
VerifyRangeNoProof1000-6    3.22ms ± 7%    3.33ms ±11%     ~     (p=0.690 n=5+5)

name                      old alloc/op   new alloc/op   delta
VerifyProof-6               5.09kB ± 5%    5.11kB ± 4%     ~     (p=0.690 n=5+5)
VerifyRangeProof10-6        41.0kB ±13%    21.6kB ± 1%  -47.31%  (p=0.008 n=5+5)
VerifyRangeProof100-6        269kB ± 1%      78kB ± 2%  -71.07%  (p=0.008 n=5+5)
VerifyRangeProof1000-6      1.99MB ± 3%    0.87MB ± 1%  -56.15%  (p=0.008 n=5+5)
VerifyRangeProof5000-6      9.52MB ± 0%    4.41MB ± 1%  -53.73%  (p=0.008 n=5+5)
VerifyRangeNoProof10-6      34.6kB ± 3%    34.4kB ± 2%     ~     (p=0.690 n=5+5)
VerifyRangeNoProof500-6      114kB ± 2%     114kB ± 1%     ~     (p=0.841 n=5+5)
VerifyRangeNoProof1000-6     212kB ± 1%     211kB ± 1%     ~     (p=0.841 n=5+5)

name                      old allocs/op  new allocs/op  delta
VerifyProof-6                  104 ± 4%       105 ± 2%     ~     (p=0.905 n=5+5)
VerifyRangeProof10-6           428 ± 8%       366 ± 2%  -14.53%  (p=0.008 n=5+5)
VerifyRangeProof100-6        1.96k ± 1%     1.34k ± 1%  -31.62%  (p=0.008 n=5+5)
VerifyRangeProof1000-6       16.0k ± 2%     12.4k ± 1%  -22.43%  (p=0.008 n=5+5)
VerifyRangeProof5000-6       78.3k ± 0%     61.8k ± 1%  -20.98%  (p=0.008 n=5+5)
VerifyRangeNoProof10-6         937 ± 2%       938 ± 1%     ~     (p=0.548 n=5+5)
VerifyRangeNoProof500-6      2.90k ± 1%     2.92k ± 1%     ~     (p=0.421 n=5+5)
VerifyRangeNoProof1000-6     5.31k ± 1%     5.30k ± 1%     ~     (p=0.690 n=5+5)

Note: this PR contains some changes to statedb, which comes from previous experiment. Those changes would need to be checked if they're (still) ok.

rjl493456442 · 2022-03-18T03:20:02Z

I think it's worthwhile for experiment. Not only the Insert, but also Get operation can apply the same optimization.

For Get, we can load the node from the disk and link it with its parent without copy. I think it's fine.
For Update, we do need to ensure that the entire trie path(from root to leave) are updated in an 'atomic' operation.

rjl493456442 · 2022-03-18T03:42:34Z

And also the Hash operation, we also do an extra copy for linking the generated hash to the node.

Semantically, all the operations in trie right now will always create a copy, apply the mutations(e.g. add the hash flag, replace the hashNode with a resolved node, etc) and eventually replace to root node, this procedure is treated as "atomic". The only concurrency issue will happen at "root" node.

We can explore to change the semantics. For example, when the root is resolved from the disk by Get, we can directly link it with its parent without changing all the ancestor nodes. The concurrency happens at this particular node instead of root node.

holiman · 2022-03-18T07:21:55Z

Just one thing I'd like to point out... So right now, the range prover seems like it shaves off ~50% on a 10K slice of leafs. However, we could do it even better.

Currently, we

take the first-leaf-path, and last-leaf-path, construct the proofs, and dump the nodes into the trie.
unset internals,
Fill with remaining leafs to recreate the original trie with hashes on the sides.
verify root, done

This method will always require a moderately large trie (~15-20k nodes) balooning up during the verification, regardless of how much we improve the trie. The alternative way to do this is to

take the first leaf-path, construct the proof, and initialize a stacktrie along that path (avoiding the right-hand side data)
fill remaining leafs
take the last leaf-path, finish up the stacktrie with the hashes/data on the right-hand side.
verify root, done

This method will have around 50-100 nodes at any time, and the pool makes it close to alloc-free. So it's intrinsically much better.
I started implementing that almost a year ago, and have an old stale branch with that attempt. I didn't follow through on it -- got the 'init'-part done, but didn't complete the finalization-part.

rjl493456442 · 2022-03-21T02:40:58Z

This method will always require a moderately large trie (~15-20k nodes) balooning up during the verification

You mean we will always have that many nodes in memory during the verification? If so, yes. But most of them are created in runtime, not resolved from the disk

holiman · 2022-09-23T08:37:31Z

I have now updated this PR, and modified it so that it does not touch the core state transition functions, but is only used when verifying range proof. For a somewhat large range, it shaves off 50% of allocations, and 30-50% of the time.

Specifically this case as mentioned in the original description:

However, when we verify proofs (that are large enough that the full trie contents do not fit into a snap response), then the proof verifier uses a trie, shoves in the left and right side bounds, and fills it with e.g. ~10K values, to obtain the trie root. In this case, the batched mode could make a pretty significant change.

name                      old time/op    new time/op    delta
VerifyRangeProof10-6        80.7µs ±64%    46.3µs ±11%  -42.59%  (p=0.000 n=10+8)
VerifyRangeProof100-6        585µs ±32%     298µs ±40%  -49.06%  (p=0.000 n=10+10)
VerifyRangeProof1000-6      4.27ms ±29%    2.91ms ±75%  -31.82%  (p=0.019 n=10+10)
VerifyRangeProof5000-6      15.2ms ±42%    14.0ms ±39%     ~     (p=0.631 n=10+10)
VerifyRangeNoProof10-6       334µs ±49%     223µs ± 8%     ~     (p=0.089 n=10+10)
VerifyRangeNoProof500-6     1.41ms ±19%    1.49ms ± 7%     ~     (p=0.139 n=9+8)
VerifyRangeNoProof1000-6    2.73ms ± 3%    2.10ms ±49%     ~     (p=0.156 n=9+10)

name                      old alloc/op   new alloc/op   delta
VerifyRangeProof10-6        41.5kB ± 1%    22.2kB ± 1%  -46.62%  (p=0.000 n=9+9)
VerifyRangeProof100-6        282kB ± 7%      78kB ± 2%  -72.18%  (p=0.000 n=10+9)
VerifyRangeProof1000-6      1.97MB ± 1%    0.86MB ± 2%  -56.13%  (p=0.000 n=10+10)
VerifyRangeProof5000-6      9.47MB ± 0%    4.41MB ± 1%  -53.44%  (p=0.000 n=9+10)
VerifyRangeNoProof10-6      34.2kB ± 3%    34.4kB ± 4%     ~     (p=0.796 n=10+10)
VerifyRangeNoProof500-6      114kB ± 2%     114kB ± 2%     ~     (p=0.315 n=10+10)
VerifyRangeNoProof1000-6     212kB ± 1%     212kB ± 2%     ~     (p=0.190 n=10+10)

name                      old allocs/op  new allocs/op  delta
VerifyRangeProof10-6           377 ± 1%       315 ± 1%  -16.36%  (p=0.000 n=9+9)
VerifyRangeProof100-6        1.94k ± 4%     1.29k ± 1%  -33.61%  (p=0.000 n=10+10)
VerifyRangeProof1000-6       15.8k ± 1%     12.3k ± 1%  -22.40%  (p=0.000 n=10+10)
VerifyRangeProof5000-6       77.6k ± 0%     61.7k ± 1%  -20.48%  (p=0.000 n=9+10)
VerifyRangeNoProof10-6         933 ± 2%       937 ± 2%     ~     (p=0.698 n=10+10)
VerifyRangeNoProof500-6      2.90k ± 1%     2.90k ± 1%     ~     (p=0.591 n=10+10)
VerifyRangeNoProof1000-6     5.29k ± 1%     5.28k ± 1%     ~     (p=0.101 n=10+10)

holiman · 2022-09-29T08:53:43Z

This probably adds more complexity than is warranted.

holiman requested review from zsfelfoldi, rjl493456442 and karalabe as code owners March 17, 2022 12:53

holiman mentioned this pull request Mar 18, 2022

Stacktrie prover v3 #24560

Closed

trie: enable batch updates

9b39cf7

holiman force-pushed the temp_trie branch from b24471d to a3c07bc Compare September 23, 2022 08:30

core, light, trie: undo core changes

53bc27d

holiman force-pushed the temp_trie branch from a3c07bc to 53bc27d Compare September 23, 2022 08:34

holiman added the status:triage label Sep 23, 2022

holiman closed this Sep 29, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

trie: enable batch updates #24556

trie: enable batch updates #24556

holiman commented Mar 17, 2022 •

edited

Loading

rjl493456442 commented Mar 18, 2022

rjl493456442 commented Mar 18, 2022 •

edited

Loading

holiman commented Mar 18, 2022

rjl493456442 commented Mar 21, 2022

holiman commented Sep 23, 2022 •

edited

Loading

holiman commented Sep 29, 2022

trie: enable batch updates #24556

trie: enable batch updates #24556

Conversation

holiman commented Mar 17, 2022 • edited Loading

rjl493456442 commented Mar 18, 2022

rjl493456442 commented Mar 18, 2022 • edited Loading

holiman commented Mar 18, 2022

rjl493456442 commented Mar 21, 2022

holiman commented Sep 23, 2022 • edited Loading

holiman commented Sep 29, 2022

holiman commented Mar 17, 2022 •

edited

Loading

rjl493456442 commented Mar 18, 2022 •

edited

Loading

holiman commented Sep 23, 2022 •

edited

Loading