Network stops producing blocks after upgrade from v0.45.x to v0.46.0-rc1 #12041

Closed
4 tasks
Tracked by #11096
kaustubhkapatral opened this issue May 25, 2022 · 15 comments

Comments

@kaustubhkapatral
Contributor

kaustubhkapatral commented May 25, 2022

Summary of Bug

After completing the software upgrade using the fix and instructions provided in #12028, a multi-node network stops producing blocks once the upgrade handler is applied. All nodes in the network lose their p2p connections and do not attempt to dial the node addresses specified in persistent_peers.

6:29AM INF not caught up yet height=151 max_peer_height=0 module=blockchain timeout_in=997.693432
6:29AM ERR no progress since last advance last_advance=2022-05-25T06:28:33Z module=blockchain
6:29AM INF switching to consensus module=consensus
6:29AM INF starting service impl=ConsensusState module=consensus service=State
6:29AM INF starting service impl=baseWAL module=consensus service=baseWAL wal=/root/.simapp/data/cs.wal/wal
6:29AM INF starting service impl=Group module=consensus service=Group wal=/root/.simapp/data/cs.wal/wal
6:29AM INF starting service impl=TimeoutTicker module=consensus service=TimeoutTicker
6:29AM INF Searching for height height=151 max=0 min=0 module=consensus wal=/root/.simapp/data/cs.wal/wal
6:29AM INF Searching for height height=150 max=0 min=0 module=consensus wal=/root/.simapp/data/cs.wal/wal
6:29AM INF Found height=150 index=0 module=consensus wal=/root/.simapp/data/cs.wal/wal
6:29AM INF Catchup by replaying consensus messages height=151 module=consensus
6:29AM INF Replay: Done module=consensus
6:29AM INF Timed out dur=-59039.217293 height=151 module=consensus round=0 step=1
6:29AM INF received proposal module=consensus proposal={"Type":32,"block_id":{"hash":"E72AEB39BA6E73197AE4EB94D37699544FBFD4C03BEB43AE1BF8E23EDC9B6AC1","parts":{"hash":"2A75B30323352A17714584961D9426734304D1941F19589E5778666ECB730991","total":1}},"height":151,"pol_round":-1,"round":0,"signature":"IgKgHu9ZjzjTga4Y6uXmv7faVVLAEVZapWV3glSb9hmGx8cnl6thJOVw5LvUIMsquadL0SZdfZ54u9JvB5H/Ag==","timestamp":"2022-05-25T06:29:33.398587982Z"}
6:29AM INF received complete proposal block hash=E72AEB39BA6E73197AE4EB94D37699544FBFD4C03BEB43AE1BF8E23EDC9B6AC1 height=151 module=consensus
6:29AM INF Timed out dur=3000 height=151 module=consensus round=0 step=3

The log snippet above was taken from the validator that proposed the block at upgrade height + 1. It made no attempt to establish a peer connection with the other nodes and stalled at that point.

6:29AM INF switching to consensus module=consensus
6:29AM INF starting service impl=ConsensusState module=consensus service=State
6:29AM INF starting service impl=baseWAL module=consensus service=baseWAL wal=/root/.simapp/data/cs.wal/wal
6:29AM INF starting service impl=Group module=consensus service=Group wal=/root/.simapp/data/cs.wal/wal
6:29AM INF starting service impl=TimeoutTicker module=consensus service=TimeoutTicker
6:29AM INF Searching for height height=151 max=0 min=0 module=consensus wal=/root/.simapp/data/cs.wal/wal
6:29AM INF Searching for height height=150 max=0 min=0 module=consensus wal=/root/.simapp/data/cs.wal/wal
6:29AM INF Found height=150 index=0 module=consensus wal=/root/.simapp/data/cs.wal/wal
6:29AM INF Catchup by replaying consensus messages height=151 module=consensus
6:29AM INF Replay: Done module=consensus
6:29AM INF Timed out dur=-59036.332995 height=151 module=consensus round=0 step=1
6:29AM INF Timed out dur=3000 height=151 module=consensus round=0 step=3

The log snippet above appeared on the remaining validator nodes in the network.

The number of p2p connections on each node was checked using curl localhost:26657/net_info | jq .result.n_peers, which returned 0 in all cases.
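The same check can be scripted when there are several nodes to poll. The following is a minimal sketch, not part of the original report: the helper names are hypothetical, and it assumes the /net_info response carries the count at result.n_peers, encoded either as a number or as a numeric string.

```python
import json
from urllib.request import urlopen


def peer_count(net_info_body):
    """Extract result.n_peers from a /net_info response body."""
    result = json.loads(net_info_body)["result"]
    # Assumption: the RPC may encode the count as "0" or as 0; int() handles both.
    return int(result["n_peers"])


def fetch_peer_count(rpc_addr, timeout=5.0):
    """Query one node's /net_info endpoint and return its peer count."""
    with urlopen("http://%s/net_info" % rpc_addr, timeout=timeout) as resp:
        return peer_count(resp.read().decode("utf-8"))
```

Calling fetch_peer_count("localhost:26657") on each validator would be expected to return 0 in the state described here.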

The migration performed by the upgrade handler was verified by observing the logs:

6:28AM INF applying upgrade "v045-to-v046" at height: 150
6:28AM INF migrating module authz from version 1 to version 2
6:28AM INF migrating module bank from version 2 to version 3
6:28AM INF migrating module feegrant from version 1 to version 2
6:28AM INF migrating module gov from version 2 to version 3
6:28AM INF adding a new module: group
6:28AM INF adding a new module: nft
6:28AM INF migrating module staking from version 2 to version 3
6:28AM INF migrating module upgrade from version 1 to version 2
6:28AM INF minted coins from module account amount=1441stake from=mint module=x/bank
6:28AM INF executed block height=150 module=consensus num_invalid_txs=0 num_valid_txs=0
6:28AM INF commit synced commit=436F6D6D697449447B5B3634203138312031303620323037203337203635203435203134352037312039203235312032333620393020343120363320313232203139382031302031383020343120362032343920313738203232302032353520393920313937203139392031373320313030203930203138315D3A39367D
6:28AM INF committed state app_hash=40B56ACF25412D914709FBEC5A293F7AC60AB42906F9B2DCFF63C5C7AD645AB5 height=150 module=consensus num_txs=0
6:28AM INF Completed ABCI Handshake - Tendermint and App are synced appHash="�<.��$�&ΰ\b���\x11-�\x00=\"swբy�z_X���" appHeight=149 module=consensus
6:28AM INF Version info block=11 mode=validator p2p=8 tmVersion=0.35.0-unreleased
6:28AM INF This node is a validator addr=965F201169F1B5C975C026CCC9A0A5F8D8DC7578 module=consensus pubKey="�H������\x12\x1e\x11\x1b�P%EW�\x1e:�\a\x15}���c�h��"
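For repeated test runs, the migration lines can be pulled out of a node log automatically instead of being read by eye. This is a rough sketch that assumes the log format shown above ("migrating module X from version A to version B" and "adding a new module: X"); the function name is hypothetical.

```python
import re

# Patterns matching the upgrade-handler log lines shown above.
MIGRATION_RE = re.compile(r"migrating module (\w+) from version (\d+) to version (\d+)")
NEW_MODULE_RE = re.compile(r"adding a new module: (\w+)")


def summarize_upgrade(log_text):
    """Return ({module: (from_version, to_version)}, [newly added modules])."""
    migrated = {}
    added = []
    for line in log_text.splitlines():
        match = MIGRATION_RE.search(line)
        if match:
            migrated[match.group(1)] = (int(match.group(2)), int(match.group(3)))
            continue
        match = NEW_MODULE_RE.search(line)
        if match:
            added.append(match.group(1))
    return migrated, added
```

Comparing the returned mapping across validators is one way to confirm that every node ran the same set of migrations at the upgrade height.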

This issue does not occur on a single-node localnet.

Version

#12028

Steps to Reproduce

cc @alexanderbez @marbar3778


For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
@faddat
Contributor

faddat commented May 26, 2022

RIGOR!

Thank you. Anything we can do to help?

@cmwaters
Contributor

Can you confirm that peerstore.db is created in the node's data directory?

@kaustubhkapatral
Contributor Author

@cmwaters Yes, peerstore.db is created when simd tendermint key-migrate is executed.

simd tendermint key-migrate --home ~/.simapp/

6:41AM INF beginning a key migration dbctx=blockstore num=1 total=6
6:41AM INF beginning a key migration dbctx=state num=2 total=6
6:41AM INF beginning a key migration dbctx=peerstore num=3 total=6
6:41AM INF beginning a key migration dbctx=tx_index num=4 total=6
6:41AM INF beginning a key migration dbctx=evidence num=5 total=6
6:41AM INF beginning a key migration dbctx=light num=6 total=6
6:41AM INF completed database migration successfully
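A quick way to confirm the result on disk is to look for one database per migrated context. The sketch below is illustrative only: it assumes each context from the log above is stored as NAME.db under the data directory of the node home (peerstore.db is the only one named in this thread), and the function name is hypothetical. Some databases, such as light, may legitimately be absent on a given node.

```python
from pathlib import Path

# Database contexts listed in the key-migrate log output.
DB_CONTEXTS = ["blockstore", "state", "peerstore", "tx_index", "evidence", "light"]


def missing_databases(home):
    """Return the contexts with no NAME.db entry under <home>/data."""
    data_dir = Path(home).expanduser() / "data"
    # Assumption: each context lives at <home>/data/<name>.db.
    return [name for name in DB_CONTEXTS if not (data_dir / (name + ".db")).exists()]
```

For the setup in this thread the call would be missing_databases("~/.simapp/").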

@kaustubhkapatral
Contributor Author

Here are the complete logs after restarting the node with new binary.

Block proposer validator: https://pastebin.com/A5f3ZsY3
Other validator: https://pastebin.com/TSQ0N190

Note that in this test the upgrade height was kept at 120.

@tac0turtle tac0turtle moved this to 📝 Todo in Cosmos-SDK May 27, 2022
@cmwaters
Contributor

Ok, so looking a bit deeper at NetInfo, it seems that it should return all the addresses in the peerStore, not just the peers a node is currently connected to. This means that the nodes aren't successfully adding the addresses that were stated in the config.toml.

If you move the list of peers from persistent_peers to bootstrap_peers and run the nodes again, do they start to dial?

The other thing I can try to do is add logs to check that the addresses are being added in a local testnet.

@faddat
Contributor

faddat commented May 27, 2022

I wondered about the same things that @cmwaters did-- basically even when a network loses consensus you can often get the nodes to reconnect to one another, and get consensus back.

@kaustubhkapatral kaustubhkapatral mentioned this issue May 30, 2022
19 tasks
@cmwaters
Contributor

cmwaters commented Jun 2, 2022

@kaustubhkapatral do you mind dumping a copy of the tendermint config.toml file?

@kaustubhkapatral
Contributor Author

@cmwaters It was the default config file generated with v0.45.4 with persistent_peers added in.
https://pastebin.com/aQb5spAs

@cmwaters
Contributor

cmwaters commented Jun 3, 2022

Ok, we made some breaking changes to the config file from v0.34 to v0.35 which require users to adjust their file. You can see the notes in our upgrading document here: https://github.com/tendermint/tendermint/blob/master/UPGRADING.md#config-changes-1. To accompany this, we also made a tool, confix (https://github.com/tendermint/tendermint/tree/master/scripts/confix), which automatically updates a user's config file. Note that this is still a new tool that hasn't had much "production" experience.
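To illustrate the kind of adjustment involved: as far as I understand the linked upgrading notes, one of the v0.35 changes is that config keys are written with hyphens rather than underscores, so a v0.34-style persistent_peers entry would not be picked up. Treat that as an assumption to check against UPGRADING.md. The sketch below only renames keys, does not handle renamed sections or removed options, and is not a substitute for confix; the function name is hypothetical.

```python
import re

# A TOML key at the start of a line, e.g. `persistent_peers = "..."`.
# Comments (#...) and section headers ([p2p]) do not match.
KEY_RE = re.compile(r"^([ \t]*)([A-Za-z0-9_]+)([ \t]*=)", re.MULTILINE)


def hyphenate_keys(toml_text):
    """Rewrite snake_case keys as hyphenated keys; values are left untouched."""
    def rename(match):
        indent, key, equals = match.groups()
        return indent + key.replace("_", "-") + equals
    return KEY_RE.sub(rename, toml_text)
```

Running the output through a diff against the original config.toml before restarting the node makes it easy to see exactly which keys changed.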

@amaury1093 amaury1093 mentioned this issue Jun 3, 2022
56 tasks
@alexanderbez
Contributor

Just an FYI -- we've decided to downgrade to v0.34.x for the SDK v0.46 release (including a prioritized mempool).

@robert-zaremba
Collaborator

@alexanderbez could you share more reasoning for that? If we won't have tendermint 0.35 in SDK v0.46, then it would be great to have a 0.47 SDK release with tendermint 0.35.

@alexanderbez
Contributor

This discussion is taking place with various teams testing out v0.35 and I just don't have the cognitive bandwidth to recap everything. In short, v0.35 is not stable enough for us to garner confidence to release SDK v0.46 with it.

@robert-zaremba
Collaborator

robert-zaremba commented Jun 9, 2022

What's the ETA for updating tendermint 0.34 to add tx prioritization, and for the Cosmos SDK update?
cc: @marbar3778 @alexanderbez

@alexanderbez
Contributor

alexanderbez commented Jun 9, 2022

We already have the PRs in progress -- we just need Tendermint teams' blessing/approval (which is...not going so great right now). I want to have this released next week at the latest.

@tac0turtle
Member

closing as 0.35 isn't used

Repository owner moved this from 📝 Todo to 👏 Done in Cosmos-SDK Aug 4, 2022
@tac0turtle tac0turtle removed this from Cosmos-SDK Nov 16, 2023

7 participants