
mmr proof generation failures on rococo #12864

Closed
Tracked by #1636
Lederstrumpf opened this issue Dec 7, 2022 · 10 comments

@Lederstrumpf
Contributor

We've been seeing mmr_generateProof call failures on rococo over the last few weeks.

1. InconsistentStore

The first source of these errors is mmr_lib::Error::InconsistentStore.

> curl -H "Content-Type: application/json" -d '{"id":1, "jsonrpc":"2.0", "method": "mmr_generateProof", "params":[3, "0x0910d049debc60f0f2ee10b892b1bd5044efff9dbb3f6479c63baacb83f92f26"]}' $NODE_SYNCED_NORMALLY

{"jsonrpc":"2.0","error":{"code":8010,"message":"Unexpected MMR error","data":"Error::Verify"},"id":1}

---
node logs:
2022-12-07 10:08:25 [<wasm:stripped>] MMR error: InconsistentStore

This issue is due to the offchain worker not being triggered on every block.

Solution

This can be mitigated by firing notifications on the initial block sync too, as suggested by @serban300 and done here:
Lederstrumpf@a571c51
However, this should already be solved by #12753 - will test once runtime 9330 is active on rococo.

2. CodecError

The other issue is codec::Error. The source of these is that the onchain runtime version's API may not match the one the client was built with. As such, the failures we've seen changed based on which runtime version was current (unless historic chain state is queried).
For example, on a v0.9.31 or v0.9.32 release node, mmr_generateProof works for blocks with onchain runtime version 9321 or 9310, that is, from block 2744973 onwards, but fails with the codec error for earlier blocks. In this case, PR #12345 changed the API for runtime 9310 to take block numbers in lieu of leaf indices, so the move from u64 → u32 leads to the codec error thrown at

// RPC server call into the runtime API where the codec error surfaces: a
// pre-9310 runtime expects a u64 leaf index here instead of the u32 block number.
let (leaf, proof) = api
    .generate_proof_with_context(
        &BlockId::hash(block_hash),
        sp_core::ExecutionContext::OffchainCall(None),
        block_number,
    )
    .map_err(runtime_error_into_rpc_error)?
    .map_err(mmr_error_into_rpc_error)?;

Confirmed that this is the root cause of the codec error, since changing the proof generation API back to leaf-index inputs atop v0.9.31 resolves it:
Lederstrumpf@72cdcad

> curl -H "Content-Type: application/json" -d '{"id":1, "jsonrpc":"2.0", "method": "mmr_generateProof", "params":[3, "0x0910d049debc60f0f2ee10b892b1bd5044efff9dbb3f6479c63baacb83f92f26"]}' $NODE_RUNNING_PATCHED_v0.9.3.1

{"jsonrpc":"2.0","result":{"blockHash":"0x0910...2f26","leaf":"0xc5010...0000","proof":"0x0300...f299"},"id":1}

Solution

Keeping the MMR API interface stable would avoid this.
If we do change the API interface, the issue will disappear once the runtime aligned with the API server's changes becomes active. It will then still fail for historic queries, for which I see the following options (option 1 is sketched after the list):

  1. version the API, and then either fail gracefully on a mismatch or retain older versions
  2. check against runtime version and fail gracefully if known API break
  3. ignore this since once the new runtime becomes active, it will only affect historic state queries
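
For option 1, a rough sketch of how the runtime API could be versioned using sp_api's built-in support (the trait below is simplified and illustrative, not the actual MmrApi definition):

sp_api::decl_runtime_apis! {
    // Declare the current version of the API.
    #[api_version(2)]
    pub trait MmrApi {
        // Old (v1) signature, retained so clients can still call it for historic
        // blocks via the generated `generate_proof_before_version_2`.
        #[changed_in(2)]
        fn generate_proof(leaf_index: u64) -> Option<Vec<u8>>;

        // Current (v2) signature: takes a block number instead of a leaf index.
        fn generate_proof(block_number: u32) -> Option<Vec<u8>>;
    }
}

The RPC server could then check which API version the runtime at the queried block reports and either dispatch to the matching call or fail gracefully with an "unsupported API version" error (option 2), instead of surfacing an opaque codec error.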
@Lederstrumpf Lederstrumpf added the I3-bug The node fails to follow expected behavior. label Dec 7, 2022
@acatangiu
Contributor

  1. version the API, and then either fail gracefully on a mismatch or retain older versions

This is definitely the right way to do it. In this case we consciously decided not to do it, since the MMR pallet was not yet stable and we didn't want to carry around old versions of the API which were never used in production.


Since we've made multiple breaking changes to it lately, I am definitely leaning towards resetting the pallet once these changes are all deployed and thus starting fresh on a clean & stable MMR.

@serban300
Contributor

With the new MMR client gadget I'm not sure if we can reset the pallet (if resetting the pallet = setting NUM_MMR_LEAVES to 0). The MMR gadget will take the first block with an MMR and compute leaf numbers based on it. After that, if we set NUM_MMR_LEAVES to 0, I think we'll get errors for canonicalization and pruning.
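
For illustration, this is roughly the mapping the gadget relies on (hypothetical, simplified code, not the gadget's actual implementation):

// The gadget learns the first block that contains an MMR and derives leaf
// numbers from it, assuming one leaf is appended per block from then on.
fn leaf_index(block_number: u64, first_mmr_block: u64) -> u64 {
    block_number - first_mmr_block
}

fn main() {
    let first_mmr_block = 100;
    assert_eq!(leaf_index(100, first_mmr_block), 0);
    assert_eq!(leaf_index(105, first_mmr_block), 5);
    // If NUM_MMR_LEAVES were reset to 0 at block 106, the pallet would treat
    // block 106 as leaf 0 again, while the gadget would still compute 6 and
    // canonicalize/prune the wrong offchain entries.
}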

@acatangiu
Contributor

We should then coordinate to reset the pallet in the same runtime upgrade that brings in the new API that the gadget relies on.

@acatangiu acatangiu assigned serban300 and unassigned Lederstrumpf Dec 13, 2022
@serban300
Contributor

The new API was already released. As discussed offline, we need to see if we can support resets in the gadget.

@serban300
Contributor

The idea that I'm thinking about in order to make the MMR client gadget support pallet resets is to change the way we save MMR nodes in the offchain DB. Currently we save them node by node, keyed by their position in the MMR. The option that I'm working on is to save each subtree keyed by the block that introduced it. For example:

For the tree:

  2  
 / \    
1   3

we are now saving the data for each node at the following keys:

1_[genesis_hash]
2_[block_1_hash]
3_[block_1_hash]

I'm trying to save:

1_[genesis_hash] -> the subtree containing node 1 (the subtree added by block 1)
2_[block_1_hash] -> the subtree containing nodes 2 and 3 (the subtree added by block 2)

And then these get canonicalized by the client gadget.

If we store subtrees by the block number that introduced them, we don't care about the position of the nodes in the mmr tree anymore, so when the pallet resets, we can continue to canonicalize by block number. The client gadget won't be impacted by the fact that at block x the pallet got reset.

However the disadvantage would be that when we need to retrieve a node in order to generate a proof, we would retrieve the entire subtree introduced by a block, which could be less efficient.
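
A rough sketch of the two keying schemes (key layout and names are illustrative, not the actual pallet-mmr code):

use codec::Encode;

// Current scheme: one offchain-DB entry per MMR node, keyed by its position.
fn node_key(prefix: &[u8], pos: u64, parent_hash: &[u8]) -> Vec<u8> {
    (prefix, pos, parent_hash).encode()
}

// Proposed scheme: one entry per block, holding the whole subtree (all nodes)
// that this block appended to the MMR.
fn subtree_key(prefix: &[u8], block_number: u32, parent_hash: &[u8]) -> Vec<u8> {
    (prefix, block_number, parent_hash).encode()
}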

In the meantime I'm still thinking about whether there could be other simpler or more efficient solutions. The easiest solution would probably be to call mmr_leaf_count from the gadget at every finality notification, but I don't think it would be very efficient.

@acatangiu
Contributor

I'm trying to save:

 1_[genesis_hash] -> the subtree containing node 1 (the subtree added by block 1)
 2_[block_1_hash] -> the subtree containing nodes 2 and 3 (the subtree added by block 2)

However the disadvantage would be that when we need to retrieve a node in order to generate a proof, we would retrieve the entire subtree introduced by a block, which could be less efficient.

Efficiency is much more important for on-chain activity than off-chain activity. Making the above change would add more complexity to the process of "adding a node to the mmr", which happens multiple times per block during block execution.
Currently, each node is directly added to offchain db (single entry per node); in your proposal (one entry holding a vec of nodes per block), we'd either have:

  • for each node addition: read offchain db vec + append to vec + write vec
  • for each node, append to some temporary runtime storage item + on mmr.finalize() move the vec from runtime storage to offchain db in a single read-onchain + write-offchain + kill-onchain-temp-item (rough sketch below)
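
A rough sketch of that second bullet (all names are hypothetical; this is not pallet-mmr code), using the sp_io host functions for onchain storage and offchain indexing:

use codec::{Decode, Encode};
use sp_io::{offchain_index, storage};

/// Placeholder node type; the real pallet stores hashes / encoded leaves.
type MmrNode = Vec<u8>;

/// Well-known key for the temporary per-block node buffer in onchain storage.
const TEMP_NODES_KEY: &[u8] = b":mmr_temp_nodes";

/// Called for every node appended during block execution; stays in the
/// (cached) onchain storage overlay, so it should be cheap.
fn buffer_node(node: MmrNode) {
    let mut nodes: Vec<MmrNode> = storage::get(TEMP_NODES_KEY)
        .and_then(|raw| Decode::decode(&mut &raw[..]).ok())
        .unwrap_or_default();
    nodes.push(node);
    storage::set(TEMP_NODES_KEY, &nodes.encode());
}

/// Called once from `mmr.finalize()`: a single offchain write for the whole
/// subtree added by this block, after which the temporary onchain item is
/// killed so nothing ultimately changes in onchain storage.
fn flush_subtree(block_number: u32, parent_hash: &[u8]) {
    if let Some(raw_nodes) = storage::get(TEMP_NODES_KEY) {
        let key = (b"mmr_subtree", block_number, parent_hash).encode();
        offchain_index::set(&key, &raw_nodes);
        storage::clear(TEMP_NODES_KEY);
    }
}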

Because of onchain db cache, I think the second option is actually faster than what we currently have as it does everything in memory, and hopefully doesn't even touch the underlying onchain storage db since by the end of the process, nothing ultimately changes to onchain storage; and for offchain db we get a single write instead of one per node.

If offchain db also had(has?) cache, then what we have now is the most efficient.


I would actually not change anything here, but simply add a new pallet-mmr runtime API pallet_genesis_block_num() and use that.

The easiest solution would probably be to call mmr_leaf_count() from the gadget at every finality notification, but I don't think it would be very efficient.

Or this ^ - which is equivalent, as long as we check that the count only ever goes up. Efficiency-wise, it's no problem for the gadget to call a runtime API per finality notification, or even once per block.

@serban300
Contributor

serban300 commented Dec 20, 2022

Because of onchain db cache, I think the second option is actually faster than what we currently have as it does everything in memory, and hopefully doesn't even touch the underlying onchain storage db since by the end of the process, nothing ultimately changes to onchain storage; and for offchain db we get a single write instead of one per node.

If offchain db also had(has?) cache, then what we have now is the most efficient.

Yes, I'm experimenting with something close to the second option. I did some tests with a PoC version and indeed this is more efficient than what we have now. Also, unexpectedly it's faster when generating proofs. I guess it's because the DB probably does cache the subtrees when first accessed. There are 2 problems however:

  1. In the PoC version I'm just saving the subtree in an in-mem variable. But for this I'm relying on the fact that MMRStore::append() always provides an entire subtree. This is true now, but we depend on the underlying MMR library; with future versions of the library this may change, and if it does we would need to do what you suggested (adding some temp on-chain storage as well), which would add some complexity but should still remain more efficient than what we have now.
  2. Will this approach be compatible with MMBs in the future? Are they structured in such a way that we could also save subtrees for each block?

On the other hand, this should simplify the client gadget a bit, because we wouldn't need to know the first MMR block and compute leaf/node indexes.

I would actually not change anything here, but simply add a new pallet-mmr runtime API pallet_genesis_block_num() and use that.

Yes, this is actually easier to use than mmr_leaf_count() if we keep the current overall approach.

@acatangiu
Contributor

I did some tests with a PoC version and indeed this is more efficient than what we have now.

"more efficient" on what ops, based on what tests/benchmarks?

Also, unexpectedly it's faster when generating proofs. I guess it's because the DB probably does cache the subtrees when first accessed.

Yup, you're getting hot cache hits. Just as a note, generating proofs is not "in the hot path"; we shouldn't make design decisions based on small optimizations there.

Will this approach be compatible with MMBs in the future? Are they structured in such a way that we could also save subtrees for each block?

I suggest we take the easy route now (use mmr_leaf_count() or pallet_genesis_block_num()), and optimize it when we bring in MMBs, since like you say optimizations here might not be compatible with MMBs later anyway.

@serban300
Contributor

"more efficient" on what ops, based on what tests/benchmarks?

The time needed to generate 10k-100k blocks.

I suggest we take the easy route now (use mmr_leaf_count() or pallet_genesis_block_num()), and optimize it when we bring in MMBs, since like you say optimizations here might not be compatible with MMBs later anyway.

Agree. I'll try the pallet_genesis_block_num() approach. And we can revisit this when switching to MMBs.

@acatangiu
Contributor

I believe this is fixed now

@github-project-automation github-project-automation bot moved this from Need for Kusama 🗒 to Done ✅ in BEEFY Mar 31, 2023