Define and implement any necessary migration logic for an upgrade from v0.6.0 to v0.7.0 #608

Closed
evan-forbes opened this issue Aug 11, 2022 · 14 comments

@evan-forbes
Member

The upgrade from v0.6.0 to v0.7.0 will require a hardfork. There were many breaking changes that were not necessarily documented well, given that we originally based Mamaki on the v0.46.0-beta2 tag of the cosmos-sdk. We need to investigate whether any specific migration logic is needed, and test out the upgrade.

@rootulp
Collaborator

rootulp commented Aug 11, 2022

One idea from @liamsi is to start an entirely new testnet and avoid the upgrade effort on Mamaki.

There are multiple options here:

  1. Export the state, re-genesis, start a new chain with the same chain id: mamaki at the upgrade height
  2. Start a brand new chain, chain id: mamaki-2

@rootulp rootulp self-assigned this Aug 15, 2022
@rootulp
Collaborator

rootulp commented Aug 15, 2022

@evan-forbes
Member Author

@evan-forbes
Member Author

Diff: cosmos/cosmos-sdk@v0.46.0...v0.46.0-beta2 (no changes)

I'm confused by this linked diff. There was a huge diff between the two.

@rootulp
Collaborator

rootulp commented Aug 15, 2022

I was also surprised to see no changes, but I was looking at the diff in the wrong direction. Your link appears correct.

@rootulp
Collaborator

rootulp commented Aug 16, 2022

I tested this in two ways:

  1. Run the latest main (c7b5692) consensus full node on Mamaki. The node was able to catch up to the latest block and continue making progress. I occasionally see logs of the form:
1:20AM ERR Error stopping connection err="already stopped" module=p2p peer={"Hostname":"173.212.245.122","NodeID":"e9c5c0bebab8f180cc4c43c3acc4cb061a4f7383","Path":"","Port":26656,"Protocol":"mconn"}

but these seem like red herrings.

  2. Run a v0.6.0 local devnet node. The node doesn't make progress after starting:
9:20PM ERR no progress since last advance last_advance=2022-08-15T21:19:09-04:00 module=blockchain
9:20PM INF switching to consensus module=consensus
9:20PM INF starting service impl=ConsensusState module=consensus service=State
9:20PM INF starting service impl=baseWAL module=consensus service=baseWAL wal=/Users/rootulp/.celestia-app/data/cs.wal/wal
9:20PM INF starting service impl=Group module=consensus service=Group wal=/Users/rootulp/.celestia-app/data/cs.wal/wal
9:20PM INF starting service impl=TimeoutTicker module=consensus service=TimeoutTicker
9:20PM INF Searching for height height=1 max=0 min=0 module=consensus wal=/Users/rootulp/.celestia-app/data/cs.wal/wal
9:20PM INF Searching for height height=0 max=0 min=0 module=consensus wal=/Users/rootulp/.celestia-app/data/cs.wal/wal
9:20PM INF Found height=0 index=0 module=consensus wal=/Users/rootulp/.celestia-app/data/cs.wal/wal
9:20PM INF Catchup by replaying consensus messages height=1 module=consensus
9:20PM INF Replay: Done module=consensus
9:20PM INF Timed out dur=-35025.483 height=1 module=consensus round=0 step=1
9:20PM INF Timed out dur=3000 height=1 module=consensus round=0 step=3

I am able to start a local devnet with the latest main (c7b5692), so I'm not sure why 0.6.0 times out. I ran rm -rf $HOME/.celestia-app in between tests.

@evan-forbes
Member Author

evan-forbes commented Aug 16, 2022

the network sometimes having trouble with v0.6.0 is definitely plausible, but getting it to work on the latest main is super weird!!! (and interesting)

When I try syncing from scratch or try starting the chain at any height, all I get is this error, which at least intuitively makes sense due to this change cosmos/cosmos-sdk#6510, but I could be wrong. Do you mind triple checking the version by calling the version subcommand just so I keep my sanity lolol

10:53PM INF starting node with ABCI Tendermint in-process
Error: error reading GenesisDoc at /home/evan/.celestia-app/config/genesis.json: block.TimeIotaMs must be greater than 0. Got 0

@rootulp
Collaborator

rootulp commented Aug 16, 2022

Thanks for providing guidance!

Devnet

v0.6.0

$ celestia-appd version
0.6.0

and I was able to start a node but it doesn't make progress after starting (same as previous comment)

main

I was able to repro the error you encountered by not running rm -rf $HOME/.celestia-app

$ celestia-appd version
0.6.0-25-gc7b5692
$ celestia-appd start
11:54AM INF starting node with ABCI Tendermint in-process
Error: error reading GenesisDoc at /Users/rootulp/.celestia-app/config/genesis.json: block.TimeIotaMs must be greater than 0. Got 0

main starts a devnet successfully if I repeat these steps so it looks like the genesis files are incompatible between v0.6.0 and main (hypothetical v0.7.0).
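
The incompatibility can be checked before starting a node. Below is a minimal Python sketch, assuming the genesis file keeps its block parameters under consensus_params.block (the layout suggested by the error message and by Tendermint-style genesis files). The helper name time_iota_ms_ok and the sample documents are hypothetical and not part of celestia-appd.

```python
def time_iota_ms_ok(genesis: dict) -> bool:
    """Return True if consensus_params.block.time_iota_ms parses to a value > 0."""
    # Assumption: block params live under consensus_params.block.
    block = (genesis.get("consensus_params") or {}).get("block") or {}
    raw = block.get("time_iota_ms", "0")  # a missing field is treated like 0
    try:
        return int(raw) > 0
    except (TypeError, ValueError):
        return False


# Hypothetical sample documents: one without the field, one with it.
old_style = {"consensus_params": {"block": {"max_bytes": "22020096", "max_gas": "-1"}}}
new_style = {
    "consensus_params": {
        "block": {"max_bytes": "22020096", "max_gas": "-1", "time_iota_ms": "1"}
    }
}

print(time_iota_ms_ok(old_style), time_iota_ms_ok(new_style))
```

To check a real file, load it with json.load and pass the resulting dict to the helper.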

Since Mamaki uses a downloaded genesis file, I repeated the investigation on Mamaki, being careful not to run rm -rf $HOME/.celestia-app between the v0.6.0 and main tests.

Mamaki

v0.6.0

$ celestia-appd version
0.6.0
4:03PM INF Replay: Vote blockID={"hash":"","parts":{"hash":"","total":0}} height=178627 module=consensus peer=f0c58d904dec824605ac36114db28f1bf84f6ea3 round=0 type=2
4:03PM INF Replay: Timeout dur=1000 height=178627 module=consensus round=0 step=7
4:03PM INF Replay: New Step height=178627 module=consensus round=1 step=RoundStepPropose
4:03PM INF Replay: Timeout dur=3500 height=178627 module=consensus round=1 step=3
4:03PM INF Replay: New Step height=178627 module=consensus round=1 step=RoundStepPrevote
4:03PM INF Replay: Done module=consensus
4:03PM INF Timed out dur=3500 height=178627 module=consensus round=1 step=3
...
4:11PM ERR failed to dial peer err="all endpoints failed" module=p2p peer={"Hostname":"207.180.248.186","NodeID":"906d8f2d6057fb94e0265b10c466622bb804d508","Path":"","Port":26656,"Protocol":"mconn"}
4:11PM ERR failed to dial endpoint endpoint={"IP":"144.76.59.109","Path":"","Port":26656,"Protocol":"mconn"} err="dial tcp 144.76.59.109:26656: connect: connection refused" module=p2p peer=c1e56d7aa2e12b8e4d3f5e7e498578c46eb29313
4:11PM ERR failed to dial peer err="all endpoints failed" module=p2p peer={"Hostname":"144.76.59.109","NodeID":"c1e56d7aa2e12b8e4d3f5e7e498578c46eb29313","Path":"","Port":26656,"Protocol":"mconn"}
...

It's unclear to me if this node is behaving as expected. Lots of errors about failing to connect to peers but the logs emitted after start-up don't indicate any issues.

main

$ celestia-appd version
0.6.0-25-gc7b5692
$ celestia-appd start
4:18PM INF starting node with ABCI Tendermint in-process
Error: error reading GenesisDoc at /root/.celestia-app/config/genesis.json: block.TimeIotaMs must be greater than 0. Got 0

Which is inconsistent with what I observed before. I'll investigate a migration path for the genesis file.

@evan-forbes
Member Author

It's unclear to me if this node is behaving as expected. Lots of errors about failing to connect to peers but the logs emitted after start-up don't indicate any issues.

this should most likely be expected given the differences between the two p2p stacks, despite them being kinda compatible

Which is inconsistent with what I observed before. I'll investigate a migration path for the genesis file.

good plan, we might have to slightly modify our fork of the sdk, but if we do, we only want that change to be included on the version used for mamaki (not mainnet or incentivized testnet)

@rootulp
Collaborator

rootulp commented Aug 18, 2022

Thoughts from @evan-forbes

  1. It may be useful to learn why Cosmos is on cosmos-4.
  2. It's possible that wrapped transactions differ between v0.6.0 and v0.7.0 because we added a field (share index) to wrapped transactions. See Test compatibility of SplitMessages with/without share index #632

@rootulp
Collaborator

rootulp commented Aug 24, 2022

Note from @musalbas

  • We want to preserve existing account balances from Mamaki on a new testnet
  • We want to keep the existing testnet alive while we spin up a new testnet. Only after we verify that the new testnet works will we shut down the existing testnet

@evan-forbes
Member Author

evan-forbes commented Aug 24, 2022

just to note that going with the new regenesis hardfork approach will need to be communicated well to the other validators and node operators. While the upgrade process will be as simple as spinning up a new node, this is atypical.

also worth noting that to achieve the new approach requires changing the chain-id.

@evan-forbes
Member Author

The final decision on the upgrade approach is to extract any state that we want to keep (validator set, balances, delegations, etc.) and hardfork/regenesis a new network. After this network is stable, we will end official support for Mamaki and ask validators to shut down their nodes.
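
One way to sanity-check that balances survive a regenesis is to compare per-denom totals in the genesis exported from the old chain and in the migrated genesis. The sketch below is only an illustration: it assumes the Cosmos SDK bank genesis layout (app_state.bank.balances, each entry holding an address and a list of coins), and the helper name total_balances, the addresses, and the denom are hypothetical.

```python
from collections import defaultdict


def total_balances(genesis: dict) -> dict:
    """Sum bank balances per denom from an exported genesis document."""
    totals = defaultdict(int)
    # Assumption: balances live under app_state.bank.balances.
    balances = genesis.get("app_state", {}).get("bank", {}).get("balances", [])
    for entry in balances:
        for coin in entry.get("coins", []):
            totals[coin["denom"]] += int(coin["amount"])  # amounts are strings
    return dict(totals)


# Hypothetical exported state with two accounts.
exported = {
    "app_state": {
        "bank": {
            "balances": [
                {"address": "addr-a", "coins": [{"denom": "utia", "amount": "100"}]},
                {"address": "addr-b", "coins": [{"denom": "utia", "amount": "250"}]},
            ]
        }
    }
}

print(total_balances(exported))
```

Running the same function on the old and new genesis files and comparing the two results gives a quick signal that no balance was dropped, though it does not check individual accounts.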

@evan-forbes
Member Author

We've used this process to upgrade arabica, and will likely use a similar process to upgrade mamaki

  • stop the celestia-appd process
  • checkout and install special patched version of either v0.7.0-rc-1 or v0.6.0 that has a fix for exporting state
    • git checkout evan/patch-v0.7.0-rc-1-for-export or evan/patch-v0.7.0-rc-1-for-export
    • make install
  • export the genesis using the special v0.7.0-rc-1 binary we just installed
    • celestia-appd export --for-zero-height >> /arbitrary/path/to/save/new/state/genesis.json
  • manually change the new genesis.json from the above step to include the time_iota_ms field in the consensus parameters struct
    • "block":{"max_bytes":"22020096","max_gas":"-1","time_iota_ms":"1"}
  • erase everything in the .celestia-app/data directory (after making a backup)
    • rm -r .celestia-app/data
  • upgrade to the special migration binary (strictly v0.7.0 w/ an additional script sub cmd)
    • git checkout evan/upgrade-arabica-script
    • make install
  • run the migration script on the exported state and replace the genesis.json we just deleted
    • celestia-appd migrate-arabica --path /arbitrary/path/to/the/saved/new/state/genesis.json >> ./.celestia-app/config/genesis.json
  • replace the genesis.json with this new modified version of the state
  • create a new priv_validator_state.json file in the .celestia-app/data directory
{
  "height": "0",
  "round": 0,
  "step": 0
}
  • restart the application
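
The two manual edits in the steps above (adding time_iota_ms to the exported genesis and recreating priv_validator_state.json) could be scripted roughly as follows. This is a sketch under assumptions, not the tooling that was actually used: the function names are hypothetical, the genesis layout is assumed to be consensus_params.block, and the demo writes into a temporary directory instead of a real .celestia-app home. It does not replace the migrate-arabica step.

```python
import json
import tempfile
from pathlib import Path


def add_time_iota_ms(genesis: dict, value: str = "1") -> dict:
    """Return a copy of the genesis doc with consensus_params.block.time_iota_ms set."""
    patched = json.loads(json.dumps(genesis))  # deep copy via a JSON round trip
    block = patched.setdefault("consensus_params", {}).setdefault("block", {})
    block["time_iota_ms"] = value
    return patched


def fresh_priv_validator_state() -> dict:
    """Initial signing state, matching the JSON shown in the steps above."""
    return {"height": "0", "round": 0, "step": 0}


def write_json(path: Path, doc: dict) -> None:
    """Write doc as indented JSON, creating parent directories if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(doc, indent=2) + "\n")


# Hypothetical exported genesis without the time_iota_ms field.
exported = {
    "chain_id": "example",
    "consensus_params": {"block": {"max_bytes": "22020096", "max_gas": "-1"}},
}
patched = add_time_iota_ms(exported)

# Demo only: a temporary directory stands in for the node home.
with tempfile.TemporaryDirectory() as home:
    state_path = Path(home) / "data" / "priv_validator_state.json"
    write_json(state_path, fresh_priv_validator_state())
    written = json.loads(state_path.read_text())

print(patched["consensus_params"]["block"])
```

Writing whole files this way, rather than appending with >>, avoids ending up with two JSON documents in one file if the target already exists.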
