Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Question]: Cosmos added module upgrade encountered consensus issues #20746

Closed
1 task done
lmkdbd opened this issue Jun 21, 2024 · 10 comments
Closed
1 task done

[Question]: Cosmos added module upgrade encountered consensus issues #20746

lmkdbd opened this issue Jun 21, 2024 · 10 comments

Comments

@lmkdbd
Copy link

lmkdbd commented Jun 21, 2024

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

We encountered a strange issue when upgrading with cosmovisor. When attempting to add a module on multiple validator nodes, everything runs fine if all validators remain running post-upgrade. However, manually stopping and restarting a validator node results in a consensus issue. The specific error is:

ERR prevote step: consensus deems this block invalid; prevoting nil err="wrong Block.Header.AppHash.  Expected ..., got ..."

We spent several weeks troubleshooting this, including:

  1. Reviewing code for random values, time, maps, etc.
  2. Using iaviewer to inspect data across nodes and comparing exported leveldb data
  3. Incrementally reducing code complexity and repeatedly upgrading to pinpoint the issue
  4. Migrating the module to simapp to check if the issue could be reproduced

We ultimately found that having multiple structs in the keeper and modifying one of the struct's values in EndBlocker causes the consensus failure. We created a minimal module based on v0.50.6 to test the upgrade, adding an almost empty testmodule with only the UpdateParams method and the data structure to be tested.

The changes based on v0.50.6 are here: v0.50.6...lmkdbd:cosmos-sdk:test

In the following cases, the upgrade version is fully compiled from v0.50.6 code:

  1. When the keeper only declares the Params struct and attempts to modify Params in EndBlocker, stopping and restarting a validator node post-upgrade works fine. Successful upgrade commit: 3ab5d10
  2. When the keeper declares both Params and TestStruct, and attempts to modify TestStruct in EndBlocker, stopping and restarting a validator node post-upgrade results in consensus failure. The difference from the successful upgrade commit is here: d97a912

We hope someone can help identify the issue.

Our software versions are:

cosmovisor: v0.50.0
cosmos-sdk: v0.50.6

Cosmos SDK Version

0.50.6

How to reproduce?

  1. Compile simapp based on v0.50.6

  2. Start two validator nodes using cosmovisor

  3. Compile the upgraded simapp version using the [test](https://github.com/lmkdbd/cosmos-sdk/tree/test) branch

  4. Place the upgraded simapp version in cosmovisor under the directory named test-module

  5. Submit an upgrade proposal named test-module

  6. After the upgrade completes, let the nodes produce a few more blocks

  7. Stop one of the validator nodes and restart

@beer-1
Copy link
Contributor

beer-1 commented Jun 24, 2024

Maybe related with cosmos/iavl#880

iavl had problem for adding new store(module) and we patched it but release does not contain the patch yet. So we are also waiting the release which includes the patch.

@lmkdbd
Copy link
Author

lmkdbd commented Jun 24, 2024

Maybe related with cosmos/iavl#880

iavl had problem for adding new store(module) and we patched it but release does not contain the patch yet. So we are also waiting the release which includes the patch.

Thank you very much! We haven't looked into the deeper code logic and could only observe the behavior at the Cosmos-SDK level. Based on the issue you provided, we finally understand the possible cause of the problem. Once the code is merged into Cosmos-SDK, we will also try to see if it resolves the issue.

@zhi-lu
Copy link

zhi-lu commented Jun 25, 2024

@beer-1 Good brother, thank you very much for solving our big problem.

@tac0turtle
Copy link
Member

tac0turtle commented Jun 25, 2024

what version of iavl are you using?

also are you writing the structs to storage or keeping them in memory?

looking at the linked code im not sure what the issue is?

the issue linked from iavl is very old. No one else is having this issue i dont think its related to what beer-1 is saying

@tac0turtle tac0turtle changed the title [Bug]: Cosmos added module upgrade encountered consensus issues [Question]: Cosmos added module upgrade encountered consensus issues Jun 25, 2024
@lmkdbd
Copy link
Author

lmkdbd commented Jun 25, 2024

what version of iavl are you using?

also are you writing the structs to storage or keeping them in memory?

looking at the linked code im not sure what the issue is?

the issue linked from iavl is very old. No one else is having this issue i dont think its related to what beer-1 is saying

Currently, we are testing by directly writing the code mentioned in the issue cosmos/iavl#880 into the iavl package.

I checked the source code of iavl, and although the code was merged, it was removed in this commit: cosmos/iavl@2894221. You can search for the following code in this commit to confirm the deletion:

tree.version = int64(tree.ndb.opts.InitialVersion)

I also checked versions v1.1.1 and v1.1.2 of iavl, which are currently used by cosmos-sdk up to v0.50.7. In the v1.1.1 version, found here: https://github.com/cosmos/iavl/blob/v1.1.1/mutable_tree.go, the mentioned code is not present.

However, in v1.1.3, it seems that a consensus issue was fixed, which might be another approach to solving our problem. The commit is here: cosmos/iavl@86c273d. I'm not sure if this fix will resolve our issue.

But what I can confirm is that neither v1.1.1 nor v1.1.2 of iavl contains either of these commits.

@tac0turtle
Copy link
Member

please use the latest version of iavl 1.1.x as 1.1.1 and 1.1.2 had issues. We retracted those versions.

@lmkdbd
Copy link
Author

lmkdbd commented Jun 25, 2024

please use the latest version of iavl 1.1.x as 1.1.1 and 1.1.2 had issues. We retracted those versions.

We noticed that the new version of iavl has fixed a consensus error. The main issue now is that the latest release version of cosmos-sdk is v0.50.7, which depends on iavl version v1.1.2. We need to confirm whether future releases of cosmos-sdk will update the iavl dependency to v1.1.x so that we can update our modules by referencing cosmos-sdk v0.50.x. Currently, the cosmos-sdk v0.50.x branch still references iavl version v1.1.2.

@scirner22
Copy link

scirner22 commented Jul 5, 2024

@tac0turtle
Copy link
Member

yes this should fix the issue

@tac0turtle
Copy link
Member

closing this issue, if there are still issues please let me know

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
Status: 🥳 Done
Development

No branches or pull requests

5 participants