Consensus fails when using statesync mode to synchronize the application state #9324

chengwenxi · 2021-05-13T01:46:26Z

Summary of Bug

Consensus fails when using statesync mode to synchronize the application state and then execute the ibc-transfer transaction.

Description

When the cosmos-sdk-based chain is started, the capability/keeper/keeper.go#L177:InitializeCapability(...) method will be called to initialize the memStore from the application store. However, if the node is started using statesync mode, the application store will not be loaded until the node is switched to fastsync mode. But in this case, the method InitializeCapability will not be called again to initialize memStore. Therefore, when calling the method capability/keeper/keeper.go#L344:GetCapability(...), the node started using statesync mode cannot get the same result as other node.

Steps to Reproduce

The GetCapability(...) mothod used in IBC module, so it can be reproduced through ibc-transfer:

Start two testnets via gaia and create relayer for them, then create clients and channels. Refer: https://github.com/cosmos/relayer#demo

Create node1, node2 to join testnet ibc-0:

gaiad init node1 --home node1
cp data/ibc-0/config/genesis.json node1/config/genesis.json

gaiad init node2 --home node2
cp data/ibc-0/config/genesis.json node2/config/genesis.json

then update state-sync config in node1/config/app.toml and node2/config/app.toml:

[state-sync]
snapshot-interval = 100
snapshot-keep-recent = 4

Start node1 and node2:

# NOTE: modify ports and add ibc-0 peer
gaiad start --home node1

# NOTE: modify ports and add ibc-0 peer
gaiad start --home node2

Create node3 to join testnet ibc-0.

gaiad init node3 --home node3
cp data/ibc-0/config/genesis.json node3/config/genesis.json

Update config:

# config.toml
[statesync]
enable = true

rpc_servers = "ibc-0 node rpc"
trust_height = 1
trust_hash = "block 1 hash"
trust_period = "168h0m0s"

Send ibc-transfer

rly tx transfer ibc-0 ibc-1 1000000samoleans $(rly chains address ibc-1)
rly tx relay-packets demo -d

Start node3 using statesync mode

# NOTE: modify ports and add ibc-0 peer
gaiad start --home node3

Get consensus failure error on executing the ibc-transfer transaction:

NOTE: if the latest block height is greater than the ibc-transfer tranaction exexuted height, no error is returned, you can unsafe-reset-all node3 and repeat steps 4-5.

4:49PM INF committed state app_hash=0475A43BE9A8BD240551895B01A31C5B1ABACD710DC273D819B863C9F355804C height=34721 module=state num_txs=1
4:49PM INF indexed block height=34721 module=txindex
panic: Failed to process committed block (34722:51062BEB78119D7A5CF971B7FEC787C428E8347FE20C308BF98311C2F95BFA1B): wrong Block.Header.AppHash.  Expected 0475A43BE9A8BD240551895B01A31C5B1ABACD710DC273D819B863C9F355804C, got 2443E8D78F4B2025252055EDD384DBF80839893092110C0A5D072DCABED9FB17

goroutine 135 [running]:
github.com/tendermint/tendermint/blockchain/v0.(*BlockchainReactor).poolRoutine(0xc000548a80, 0xc0032dac01)
  github.com/tendermint/tendermint@v0.34.9/blockchain/v0/reactor.go:401 +0x15bf
created by github.com/tendermint/tendermint/blockchain/v0.(*BlockchainReactor).SwitchToFastSync
  github.com/tendermint/tendermint@v0.34.9/blockchain/v0/reactor.go:125 +0xd8

For Admin Use

Not duplicate issue
Appropriate labels applied
Appropriate contributors tagged
Contributor assigned/self-assigned

The text was updated successfully, but these errors were encountered:

chengwenxi · 2021-05-13T01:48:21Z

@alessio @alexanderbez @tessr I posted it here, so we can discuss it more.

alexanderbez · 2021-05-13T12:25:49Z

Thank you @chengwenxi! I appreciate the detailed writeup. We definitely need to investigate this. However, my initial suspicious is why have we not experienced or seen this reported on the mainnet hub? I know for sure there are state-sync nodes that take snapshots post IBC-enablement.

@marbar3778 can you confirm? I suppose we could easily spin up a node to verify this. Do we know of any state sync providers?

tac0turtle · 2021-05-13T18:16:20Z

I was able to state sync prior to IBC was enabled, but this smells like an issue where state sync only syncs state in IAVL, I'm guessing capability is not in there.

alexanderbez · 2021-05-13T18:27:57Z

Yes, caps are in-memory only. They're not part of state-sync on purpose.

chengwenxi · 2021-05-14T01:58:22Z

I was able to state sync prior to IBC was enabled

Yes, in this case, this problem is not reproduced in the mainnet hub because of the caps is added on channelOpen.

alexanderbez · 2021-05-14T11:05:09Z

/cc @colin-axner @AdityaSripal @cwgoes for visibility.

colin-axner · 2021-05-14T11:23:25Z

This is the issue right? Does NewApp get called for state sync? Or I guess any usage of capabilities during state sync is a problem?

In IBC capabilities are created at various times, sometimes during InitChain for binding ports, always during a channel handshake, and randomly by applications as they decide to bind to new port names.

chengwenxi · 2021-05-17T03:04:23Z

This is the issue right? Does NewApp get called for state sync?

Yes, NewApp is called. But if using state sync, the App generated by genesis that does not contain the capabilities.

I think app.CapabilityKeeper.InitializeAndSeal(ctx) should be called again to initialize the capabilities after the state synchronization is completed.

kwunyeung · 2021-05-22T13:24:44Z

Just experience this on Cosmos Hub. I’m trying to state sync a new node.

May 22 13:13:56 cosmos-m-rpc-ln-fra gaiad[65965]: 1:13PM INF committed state app_hash=67B9677E9E3638F0AB4077CFF984E14F4596BC73436534E1977CB1841CC0043D height=6308367 module=state num_txs=0
May 22 13:13:56 cosmos-m-rpc-ln-fra gaiad[65965]: 1:13PM INF indexed block height=6308367 module=txindex
May 22 13:13:56 cosmos-m-rpc-ln-fra gaiad[65965]: 1:13PM INF minted coins from module account amount=4385553uatom from=mint module=x/bank
May 22 13:13:56 cosmos-m-rpc-ln-fra gaiad[65965]: 1:13PM INF executed block height=6308368 module=state num_invalid_txs=0 num_valid_txs=2
May 22 13:13:56 cosmos-m-rpc-ln-fra gaiad[65965]: 1:13PM INF commit synced commit=436F6D6D697449447B5B383920313139203137203635203231362032312031303420313420323720313736203232312031373620383020333820353220343320393520393120323134203132302032333320313335203136352032312032313320313134203138312031393420313635203132322032372039325D3A3630343231307D
May 22 13:13:56 cosmos-m-rpc-ln-fra gaiad[65965]: 1:13PM INF committed state app_hash=59771141D815680E1BB0DDB05026342B5F5BD678E987A515D572B5C2A57A1B5C height=6308368 module=state num_txs=2
May 22 13:13:56 cosmos-m-rpc-ln-fra gaiad[65965]: 1:13PM INF indexed block height=6308368 module=txindex
May 22 13:13:56 cosmos-m-rpc-ln-fra gaiad[65965]: 1:13PM INF minted coins from module account amount=4385553uatom from=mint module=x/bank
May 22 13:13:56 cosmos-m-rpc-ln-fra gaiad[65965]: 1:13PM INF client state updated client-id=07-tendermint-146 height=2-1047307 module=x/ibc/client
May 22 13:13:57 cosmos-m-rpc-ln-fra gaiad[65965]: 1:13PM INF executed block height=6308369 module=state num_invalid_txs=1 num_valid_txs=2
May 22 13:13:57 cosmos-m-rpc-ln-fra gaiad[65965]: 1:13PM INF commit synced commit=436F6D6D697449447B5B313537203139372031343120313436203836203220323238203335203936203138322033372031383320323333203137312031343120373620323534203134332038332031313720323131203434203134312031353720323337203130392031393920323337203139342032353320323035203134315D3A3630343231317D
May 22 13:13:57 cosmos-m-rpc-ln-fra gaiad[65965]: 1:13PM INF committed state app_hash=9DC58D925602E42360B625B7E9AB8D4CFE8F5375D32C8D9DED6DC7EDC2FDCD8D height=6308369 module=state num_txs=3
May 22 13:13:57 cosmos-m-rpc-ln-fra gaiad[65965]: panic: Failed to process committed block (6308370:37377A88958D37238FC4E2AF223E4C29456C8F8A91383BD41A6ECE015A140A01): wrong Block.Header.AppHash.  Expected 9DC58D925602E42360B625B7E9AB8D4CFE8F5375D32C8D9DED6DC7EDC2FDCD8D, got F7567C7F223363DF9A6273BAA54111661026E6EBA6EDA8A911A412B78784BBA5
May 22 13:13:57 cosmos-m-rpc-ln-fra gaiad[65965]: goroutine 326 [running]:
May 22 13:13:57 cosmos-m-rpc-ln-fra gaiad[65965]: github.com/tendermint/tendermint/blockchain/v0.(*BlockchainReactor).poolRoutine(0xc000c8ee00, 0xc029c65d01)
May 22 13:13:57 cosmos-m-rpc-ln-fra gaiad[65965]:         github.com/tendermint/tendermint@v0.34.9/blockchain/v0/reactor.go:401 +0x15bf
May 22 13:13:57 cosmos-m-rpc-ln-fra gaiad[65965]: created by github.com/tendermint/tendermint/blockchain/v0.(*BlockchainReactor).SwitchToFastSync
May 22 13:13:57 cosmos-m-rpc-ln-fra gaiad[65965]:         github.com/tendermint/tendermint@v0.34.9/blockchain/v0/reactor.go:125 +0xd8
May 22 13:14:00 cosmos-m-rpc-ln-fra gaiad[65985]: 1:14PM INF starting ABCI with Tendermint
May 22 13:14:03 cosmos-m-rpc-ln-fra gaiad[65985]: 1:14PM INF Starting multiAppConn service impl=multiAppConn module=proxy
May 22 13:14:03 cosmos-m-rpc-ln-fra gaiad[65985]: 1:14PM INF Starting localClient service connection=query impl=localClient module=abci-client
May 22 13:14:03 cosmos-m-rpc-ln-fra gaiad[65985]: 1:14PM INF Starting localClient service connection=snapshot impl=localClient module=abci-client
May 22 13:14:03 cosmos-m-rpc-ln-fra gaiad[65985]: 1:14PM INF Starting localClient service connection=mempool impl=localClient module=abci-client
May 22 13:14:03 cosmos-m-rpc-ln-fra gaiad[65985]: 1:14PM INF Starting localClient service connection=consensus impl=localClient module=abci-client
May 22 13:14:03 cosmos-m-rpc-ln-fra gaiad[65985]: 1:14PM INF Starting EventBus service impl=EventBus module=events
May 22 13:14:03 cosmos-m-rpc-ln-fra gaiad[65985]: 1:14PM INF Starting PubSub service impl=PubSub module=pubsub
May 22 13:14:03 cosmos-m-rpc-ln-fra gaiad[65985]: 1:14PM INF Starting IndexerService service impl=IndexerService module=txindex
May 22 13:14:03 cosmos-m-rpc-ln-fra gaiad[65985]: 1:14PM INF Found local state with non-zero height, skipping state sync
May 22 13:14:03 cosmos-m-rpc-ln-fra gaiad[65985]: 1:14PM INF ABCI Handshake App Info hash="�ō�V\x02�#`�%�髍L��Su�,���m����͍" height=6308369 module=consensus protocol-version=0 software-version=
May 22 13:14:03 cosmos-m-rpc-ln-fra gaiad[65985]: 1:14PM INF ABCI Replay Blocks appHeight=6308369 module=consensus stateHeight=6308369 storeHeight=6308370
May 22 13:14:03 cosmos-m-rpc-ln-fra gaiad[65985]: 1:14PM INF Replay last block using real app module=consensus
May 22 13:14:03 cosmos-m-rpc-ln-fra gaiad[65985]: Error: error during handshake: error on replay: wrong Block.Header.AppHash.  Expected 9DC58D925602E42360B625B7E9AB8D4CFE8F5375D32C8D9DED6DC7EDC2FDCD8D, got F7567C7F223363DF9A6273BAA54111661026E6EBA6EDA8A911A412B78784BBA5

Tried a few times and it panics at the same block height.

alexanderbez · 2021-05-23T18:29:37Z

Thanks for confirming @kwunyeung! How do we go about fixing this @colin-axner @AdityaSripal? Seems like a pretty serious bug.

chengwenxi · 2021-05-24T02:07:50Z

@kwunyeung Hi, kwunyeung! Could you get me some snapshot nodes? I want to reproduce the issue #9234.

kwunyeung · 2021-05-24T04:22:53Z

@chengwenxi you can connect to the following nodes. They both have snapshots.

ae26f01b2bc504532a1cc15ce9da0b85ee5a98e7@139.177.178.149:26656
ee27245d88c632a556cf72cc7f3587380c09b469@45.79.249.253:26656

And if you need RPCs
https://rpc.cosmoshub.forbole.com/
https://rpc.cosmoshub.bigdipper.live/

chengwenxi · 2021-05-24T05:35:09Z

@chengwenxi you can connect to the following nodes. They both have snapshots.

extremely grateful ❤️ !

zmanian · 2021-05-24T05:40:33Z

I think we need have something that iterates every open port and creates a capability after state is synced.

Should be a non-consensus breaking fix.

colin-axner · 2021-05-24T10:01:41Z

How do we go about fixing this @colin-axner @AdityaSripal? Seems like a pretty serious bug.

We will discuss this later today

I think we need have something that iterates every open port and creates a capability after state is synced.

This sounds like the right approach to me. I'm not sure how claiming capabilities will work. I'd also be nice to have a general solution. This does not only affect IBC, this affects any module that uses x/capability

AdityaSripal · 2021-05-24T15:46:03Z

The issue is that NewApp gets called before state-sync fills in the persistent store data for capabilities. Thus we cannot initialize the in-memory part.

The solution I have in mind is to fill in the in-memory component as-needed whenever GetCapability or a similar function is called using the data that is already available in the persistent store.

AdityaSripal · 2021-05-24T23:07:26Z

I have a non-breaking fix up that will be able to fix the issue for the 0.42 line here: https://github.com/cosmos/cosmos-sdk/tree/aditya/cap-init

Here's the diff: https://github.com/cosmos/cosmos-sdk/compare/v0.42.5...aditya/cap-init?expand=1

Unfortunately the fix I proposed above can only be done efficiently if we move the reverse mapping into the persistent store. The reverse mapping is deterministic so there's no issue moving it, it's just a breaking change. Once that is done, reconstructing the forward mapping and capmap on-the-fly is trivial. This fix should go into 0.43

I will write tests for this tomorrow, but in the meantime it would be great if someone is able to test it out and see if statesync works.

alessio · 2021-05-25T01:40:14Z

@chengwenxi @kwunyeung can you apply @AdityaSripal's patch to the sources, rebuild and see whether this fixes the issue for Cosmos SDK v0.42 and gaia, please?
https://github.com/cosmos/cosmos-sdk/compare/v0.42.5...aditya/cap-init.patch

chengwenxi · 2021-05-25T01:48:23Z

Here's the diff: https://github.com/cosmos/cosmos-sdk/compare/v0.42.5...aditya/cap-init?expand=1

It seems that k.InitializeAndSeal(ctx) will cause additional gas compared to the already initialized node, use a cache ctx?

kwunyeung · 2021-05-25T10:13:41Z

@alessio applied the patch to v0.42.5 and now trying to state sync a new node. It's weird that even the new node is connected to my other 2 nodes with snapshots, it can't find the snapshots during the snapshots discovering period. I'm going to run it for a day and see if it will start and won't crash.

zmanian · 2021-05-25T13:55:25Z

I've encountered this issue before and I think it tends to happen if the pruning settings and the snapshot settings are not aligned correctly but it's a total pain to debug.

AdityaSripal · 2021-05-25T22:58:17Z

Note the previous commit had an issue. Please try against the latest PR above.

kwunyeung · 2021-05-26T08:27:05Z

@zmanian those nodes are using default pruning and I set snapshot-interval to 500. I keep getting

ERR Failed to add snapshot err="failed to obtain the header at height #6353217: no witnesses connected. please reset light client" format=1 height=6353000 module=statesync peer=ee27245d88c632a556cf72cc7f3587380c09b469

or

ERR Failed to add snapshot err="no witnesses connected. please reset light client" format=1 height=6353000 module=statesync peer=ae26f01b2bc504532a1cc15ce9da0b85ee5a98e7

now

zmanian · 2021-05-26T10:23:33Z

What is the trust period on the syncing nodes set to?

kwunyeung · 2021-05-26T12:43:16Z

@zmanian 336h

tac0turtle · 2021-05-26T12:56:02Z

@zmanian those nodes are using default pruning and I set snapshot-interval to 500. I keep getting

ERR Failed to add snapshot err="failed to obtain the header at height #6353217: no witnesses connected. please reset light client" format=1 height=6353000 module=statesync peer=ee27245d88c632a556cf72cc7f3587380c09b469

or

ERR Failed to add snapshot err="no witnesses connected. please reset light client" format=1 height=6353000 module=statesync peer=ae26f01b2bc504532a1cc15ce9da0b85ee5a98e7

now

Could you post an exerpt of your config.toml statesync section? It is either you dont have enough rpc connections or the light client issue I also ran into. (tendermint/tendermint#6389).

Could you also change your log level to debug to see what is going on?

kwunyeung · 2021-05-27T16:17:18Z

@marbar3778 I had started a new node in the same private network as those two nodes serving snapshots and enabled debug last night. You can download a short extract of the debug log below.

https://forbole-val.s3.ap-east-1.amazonaws.com/gaiad-state-sync-log.txt.gz

I saw the following log

DBG Matching header received by witness height=6355500 module=light witness=0

And then the node did requested snapshot chunks but I never saw it applied the chunks.

May 26 13:02:50 cosmos-m-test-fullnode-ln-fra gaiad[65397]: 1:02PM INF Fetching snapshot chunk chunk=11 format=1 height=6355500 module=statesync total=31
May 26 13:02:50 cosmos-m-test-fullnode-ln-fra gaiad[65397]: 1:02PM DBG Requesting snapshot chunk chunk=11 format=1 height=6355500 module=statesync peer=ae26f01b2bc504532a1cc15ce9da0b85ee5a98e7
May 26 13:02:50 cosmos-m-test-fullnode-ln-fra gaiad[65397]: 1:02PM DBG Send channel=97 conn={"Logger":{}} module=p2p msgBytes=1A0908ACF483031001180B peer={"id":"ae26f01b2bc504532a1cc15ce9da0b85ee5a98e7","ip":"139.177.178.149","port":26656}
May 26 13:02:50 cosmos-m-test-fullnode-ln-fra gaiad[65397]: 1:02PM INF Fetching snapshot chunk chunk=10 format=1 height=6355500 module=statesync total=31
May 26 13:02:50 cosmos-m-test-fullnode-ln-fra gaiad[65397]: 1:02PM DBG Requesting snapshot chunk chunk=10 format=1 height=6355500 module=statesync peer=ae26f01b2bc504532a1cc15ce9da0b85ee5a98e7
May 26 13:02:50 cosmos-m-test-fullnode-ln-fra gaiad[65397]: 1:02PM DBG Send channel=97 conn={"Logger":{}} module=p2p msgBytes=1A0908ACF483031001180A peer={"id":"ae26f01b2bc504532a1cc15ce9da0b85ee5a98e7","ip":"139.177.178.149","port":26656}
May 26 13:02:50 cosmos-m-test-fullnode-ln-fra gaiad[65397]: 1:02PM DBG Flush conn={"Logger":{}} module=p2p peer={"id":"ae26f01b2bc504532a1cc15ce9da0b85ee5a98e7","ip":"139.177.178.149","port":26656}
May 26 13:03:00 cosmos-m-test-fullnode-ln-fra gaiad[65397]: 1:03PM DBG Requesting snapshot chunk chunk=8 format=1 height=6355500 module=statesync peer=ae26f01b2bc504532a1cc15ce9da0b85ee5a98e7
May 26 13:03:00 cosmos-m-test-fullnode-ln-fra gaiad[65397]: 1:03PM DBG Send channel=97 conn={"Logger":{}} module=p2p msgBytes=1A0908ACF4830310011808 peer={"id":"ae26f01b2bc504532a1cc15ce9da0b85ee5a98e7","ip":"139.177.178.149","port":26656}
May 26 13:03:00 cosmos-m-test-fullnode-ln-fra gaiad[65397]: 1:03PM INF Fetching snapshot chunk chunk=12 format=1 height=6355500 module=statesync total=31
May 26 13:03:00 cosmos-m-test-fullnode-ln-fra gaiad[65397]: 1:03PM DBG Requesting snapshot chunk chunk=12 format=1 height=6355500 module=statesync peer=ae26f01b2bc504532a1cc15ce9da0b85ee5a98e7
May 26 13:03:00 cosmos-m-test-fullnode-ln-fra gaiad[65397]: 1:03PM DBG Send channel=97 conn={"Logger":{}} module=p2p msgBytes=1A0908ACF483031001180C peer={"id":"ae26f01b2bc504532a1cc15ce9da0b85ee5a98e7","ip":"139.177.178.149","port":26656}
May 26 13:03:00 cosmos-m-test-fullnode-ln-fra gaiad[65397]: 1:03PM DBG Requesting snapshot chunk chunk=10 format=1 height=6355500 module=statesync peer=ae26f01b2bc504532a1cc15ce9da0b85ee5a98e7
May 26 13:03:00 cosmos-m-test-fullnode-ln-fra gaiad[65397]: 1:03PM DBG Send channel=97 conn={"Logger":{}} module=p2p msgBytes=1A0908ACF483031001180A peer={"id":"ae26f01b2bc504532a1cc15ce9da0b85ee5a98e7","ip":"139.177.178.149","port":26656}
May 26 13:03:00 cosmos-m-test-fullnode-ln-fra gaiad[65397]: 1:03PM INF Fetching snapshot chunk chunk=13 format=1 height=6355500 module=statesync total=31
May 26 13:03:00 cosmos-m-test-fullnode-ln-fra gaiad[65397]: 1:03PM DBG Requesting snapshot chunk chunk=13 format=1 height=6355500 module=statesync peer=ae26f01b2bc504532a1cc15ce9da0b85ee5a98e7

After getting all snapshot chunks, it said no witnesses connected.

May 26 13:04:52 cosmos-m-test-fullnode-ln-fra gaiad[65397]: 1:04PM DBG error from light block request from primary, replacing... error="client failed to respond" height=6355008 module=light primary={}
May 26 13:04:52 cosmos-m-test-fullnode-ln-fra gaiad[65397]: 1:04PM ERR Can't verify err="failed to obtain the header at height #6355008: no witnesses connected. please reset light client" module=light
May 26 13:04:52 cosmos-m-test-fullnode-ln-fra gaiad[65397]: 1:04PM ERR Failed to add snapshot err="failed to obtain the header at height #6355008: no witnesses connected. please reset light client" format=1 height=6355000 module=statesync peer=ae26f01b2bc504532a1cc15ce9da0b85ee5a98e7
May 26 13:04:52 cosmos-m-test-fullnode-ln-fra gaiad[65397]: 1:04PM INF Header has already been verified hash=3CAC485961E4FC23199088649FD7D0BBC442603467A618E8C78D921A5216D4B1 height=6355501 module=light
May 26 13:04:52 cosmos-m-test-fullnode-ln-fra gaiad[65397]: 1:04PM INF Header has already been verified hash=348F8434CA521F483D52E2ED452B230AFB332987DB2F1140B2CB127DC133F2EC height=6355502 module=light
May 26 13:04:52 cosmos-m-test-fullnode-ln-fra gaiad[65397]: 1:04PM INF Header has already been verified hash=9DDD287823DE2058A254A01F24F3ADB6357BDE962B41991B8F53D5BF07FA01BF height=6355500 module=light
May 26 13:04:52 cosmos-m-test-fullnode-ln-fra gaiad[65397]: 1:04PM INF Header has already been verified hash=3CAC485961E4FC23199088649FD7D0BBC442603467A618E8C78D921A5216D4B1 height=6355501 module=light
May 26 13:04:52 cosmos-m-test-fullnode-ln-fra gaiad[65397]: 1:04PM INF Header has already been verified hash=348F8434CA521F483D52E2ED452B230AFB332987DB2F1140B2CB127DC133F2EC height=6355502 module=light
May 26 13:04:52 cosmos-m-test-fullnode-ln-fra gaiad[65397]: 1:04PM INF Header has already been verified hash=9DDD287823DE2058A254A01F24F3ADB6357BDE962B41991B8F53D5BF07FA01BF height=6355500 module=light
May 26 13:04:52 cosmos-m-test-fullnode-ln-fra gaiad[65397]: 1:04PM INF Discovered new snapshot format=1 hash="8\x1a%�Z��@��\b\x15�}�B�l��Y[V}[�;�\x1a��7" height=6355500 module=statesync
May 26 13:04:52 cosmos-m-test-fullnode-ln-fra gaiad[65397]: 1:04PM INF Header has already been verified hash=9DDD287823DE2058A254A01F24F3ADB6357BDE962B41991B8F53D5BF07FA01BF height=6355500 module=light
May 26 13:04:52 cosmos-m-test-fullnode-ln-fra gaiad[65397]: 1:04PM INF Header has already been verified hash=3CAC485961E4FC23199088649FD7D0BBC442603467A618E8C78D921A5216D4B1 height=6355501 module=light

The most interesting is that the node got panic earlier is actually in the same private network and it started download the snapshot and applied the chunks immediately. The state was restored in minutes and started committing blocks but got panic at certain height. Actually that node is now running as I unsafe reset and restarted it right after that panic. The state was synced again. I stopped that node right after the state restored, disabled state sync and started it again. Then the node runs smoothly till now.

When I tried to test the fix, I was running on another machine. I couldn’t restore the state, so as this new machine with the no witnesses connected thing.

I left that new node running the whole day as I had to finish some other work. I have just had time to check it again. State not restored yet and the node keep only trying to connect new nodes. I stopped it, unsafe reset it and started it again. Voila! It downloaded snapshot chunks and applied them immediately! I’m going to run it over night again and see if it would panic at some point.

As I restarted the node without changing any settings, the only interesting thing to notice would be the trust_height. I did that last night so the value is comparatively long before the current height. It’s 6355395 in the config but the snapshot applied is at height 6368500. I wonder if picking an far earlier block height would actually help state sync easier.

kwunyeung · 2021-05-30T08:51:07Z

@AdityaSripal the node has been running without issue for 3 days already. No crash no restart.

AdityaSripal self-assigned this May 24, 2021

AdityaSripal mentioned this issue May 25, 2021

Cap Initialization Fix #9392

Merged

9 tasks

anilcse mentioned this issue Jun 2, 2021

CONSENSUS FAILURE caused by prune stores #9234

Closed

github-actions bot added the stale label Jul 15, 2021

github-actions bot closed this as completed Jul 21, 2021

neutron-robot mentioned this issue Feb 14, 2024

[Snyk] Upgrade @docusaurus/preset-classic from 2.2.0 to 2.4.3 neutron-org/cosmos-sdk#21

Closed

This was referenced Apr 28, 2024

[Snyk] Upgrade @docusaurus/core from 2.4.1 to 2.4.3 hedgeblock/cosmos-sdk#5

Open

[Snyk] Upgrade @docusaurus/plugin-client-redirects from 2.4.1 to 2.4.3 hedgeblock/cosmos-sdk#6

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consensus fails when using statesync mode to synchronize the application state #9324

Consensus fails when using statesync mode to synchronize the application state #9324

chengwenxi commented May 13, 2021

chengwenxi commented May 13, 2021

alexanderbez commented May 13, 2021

tac0turtle commented May 13, 2021

alexanderbez commented May 13, 2021

chengwenxi commented May 14, 2021

alexanderbez commented May 14, 2021

colin-axner commented May 14, 2021 •

edited

Loading

chengwenxi commented May 17, 2021 •

edited

Loading

kwunyeung commented May 22, 2021

alexanderbez commented May 23, 2021

chengwenxi commented May 24, 2021

kwunyeung commented May 24, 2021

chengwenxi commented May 24, 2021

zmanian commented May 24, 2021

colin-axner commented May 24, 2021

AdityaSripal commented May 24, 2021

AdityaSripal commented May 24, 2021

alessio commented May 25, 2021

chengwenxi commented May 25, 2021

kwunyeung commented May 25, 2021

zmanian commented May 25, 2021

AdityaSripal commented May 25, 2021

kwunyeung commented May 26, 2021

zmanian commented May 26, 2021

kwunyeung commented May 26, 2021

tac0turtle commented May 26, 2021

kwunyeung commented May 27, 2021

kwunyeung commented May 30, 2021

Consensus fails when using statesync mode to synchronize the application state #9324

Consensus fails when using statesync mode to synchronize the application state #9324

Comments

chengwenxi commented May 13, 2021

Summary of Bug

Description

Steps to Reproduce

For Admin Use

chengwenxi commented May 13, 2021

alexanderbez commented May 13, 2021

tac0turtle commented May 13, 2021

alexanderbez commented May 13, 2021

chengwenxi commented May 14, 2021

alexanderbez commented May 14, 2021

colin-axner commented May 14, 2021 • edited Loading

chengwenxi commented May 17, 2021 • edited Loading

kwunyeung commented May 22, 2021

alexanderbez commented May 23, 2021

chengwenxi commented May 24, 2021

kwunyeung commented May 24, 2021

chengwenxi commented May 24, 2021

zmanian commented May 24, 2021

colin-axner commented May 24, 2021

AdityaSripal commented May 24, 2021

AdityaSripal commented May 24, 2021

alessio commented May 25, 2021

chengwenxi commented May 25, 2021

kwunyeung commented May 25, 2021

zmanian commented May 25, 2021

AdityaSripal commented May 25, 2021

kwunyeung commented May 26, 2021

zmanian commented May 26, 2021

kwunyeung commented May 26, 2021

tac0turtle commented May 26, 2021

kwunyeung commented May 27, 2021

kwunyeung commented May 30, 2021

colin-axner commented May 14, 2021 •

edited

Loading

chengwenxi commented May 17, 2021 •

edited

Loading