Consensus fails when using statesync mode to synchronize the application state #9324

Closed
chengwenxi opened this issue May 13, 2021 · 28 comments

@chengwenxi
Contributor

Summary of Bug

Consensus fails when a node uses statesync mode to synchronize the application state and then executes an ibc-transfer transaction.

Description

When a cosmos-sdk-based chain starts, the capability/keeper/keeper.go#L177:InitializeCapability(...) method is called to initialize the memStore from the application store. However, if the node is started in statesync mode, the application store is not loaded until the node switches to fastsync mode, and InitializeCapability is not called again to initialize the memStore at that point. Therefore, when the capability/keeper/keeper.go#L344:GetCapability(...) method is called, a node started in statesync mode does not get the same result as other nodes.
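
To make the failure mode concrete, here is a minimal, self-contained Go sketch (simplified stand-ins, not the SDK's actual types) of an in-memory index that is rebuilt from the persistent store only once, at construction time; when state sync fills the persistent store afterwards, lookups through the in-memory index still fail:

    package main

    import "fmt"

    // Simplified stand-in for the capability keeper: a persistent store plus an
    // in-memory index that is only populated at construction time.
    type capKeeper struct {
        persistent map[string]uint64 // stands in for the IAVL-backed capability store
        mem        map[string]uint64 // stands in for the memStore / capability map
    }

    // initializeMemStore plays the role of InitializeCapability: copy whatever is
    // currently in the persistent store into the in-memory index.
    func (k *capKeeper) initializeMemStore() {
        for name, idx := range k.persistent {
            k.mem[name] = idx
        }
    }

    // getCapability only consults the in-memory index, like GetCapability does.
    func (k *capKeeper) getCapability(name string) (uint64, bool) {
        idx, ok := k.mem[name]
        return idx, ok
    }

    func main() {
        k := &capKeeper{persistent: map[string]uint64{}, mem: map[string]uint64{}}

        // NewApp initializes the memStore while the persistent store is still
        // empty on a node that is about to state sync.
        k.initializeMemStore()

        // State sync later restores the persistent store directly from a snapshot...
        k.persistent["ports/transfer/channels/channel-0"] = 5

        // ...but the in-memory index is never rebuilt, so the lookup fails and the
        // node diverges from nodes that replayed the channel handshake.
        _, ok := k.getCapability("ports/transfer/channels/channel-0")
        fmt.Println("capability found:", ok) // prints: capability found: false
    }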

Steps to Reproduce

The GetCapability(...) method is used in the IBC module, so the issue can be reproduced through ibc-transfer:

  1. Start two testnets via gaia and create a relayer for them, then create clients and channels. Refer to: https://github.com/cosmos/relayer#demo

  2. Create node1, node2 to join testnet ibc-0:

    gaiad init node1 --home node1
    cp data/ibc-0/config/genesis.json node1/config/genesis.json
    
    gaiad init node2 --home node2
    cp data/ibc-0/config/genesis.json node2/config/genesis.json

    Then update the state-sync config in node1/config/app.toml and node2/config/app.toml:

    [state-sync]
    snapshot-interval = 100
    snapshot-keep-recent = 4

    Start node1 and node2:

    # NOTE: modify ports and add ibc-0 peer
    gaiad start --home node1
    
    # NOTE: modify ports and add ibc-0 peer
    gaiad start --home node2
  3. Create node3 to join testnet ibc-0.

    gaiad init node3 --home node3
    cp data/ibc-0/config/genesis.json node3/config/genesis.json

    Update config:

    # config.toml
    [statesync]
    enable = true
    
    rpc_servers = "ibc-0 node rpc"
    trust_height = 1
    trust_hash = "block 1 hash"
    trust_period = "168h0m0s"
  4. Send ibc-transfer

    rly tx transfer ibc-0 ibc-1 1000000samoleans $(rly chains address ibc-1)
    rly tx relay-packets demo -d
  5. Start node3 using statesync mode

    # NOTE: modify ports and add ibc-0 peer
    gaiad start --home node3

    A consensus failure error occurs when the ibc-transfer transaction is executed:

    NOTE: if the latest block height is greater than the height at which the ibc-transfer transaction was executed, no error is returned; you can unsafe-reset-all node3 and repeat steps 4-5.

    4:49PM INF committed state app_hash=0475A43BE9A8BD240551895B01A31C5B1ABACD710DC273D819B863C9F355804C height=34721 module=state num_txs=1
    4:49PM INF indexed block height=34721 module=txindex
    panic: Failed to process committed block (34722:51062BEB78119D7A5CF971B7FEC787C428E8347FE20C308BF98311C2F95BFA1B): wrong Block.Header.AppHash.  Expected 0475A43BE9A8BD240551895B01A31C5B1ABACD710DC273D819B863C9F355804C, got 2443E8D78F4B2025252055EDD384DBF80839893092110C0A5D072DCABED9FB17
    
    goroutine 135 [running]:
    github.com/tendermint/tendermint/blockchain/v0.(*BlockchainReactor).poolRoutine(0xc000548a80, 0xc0032dac01)
      github.com/tendermint/tendermint@v0.34.9/blockchain/v0/reactor.go:401 +0x15bf
    created by github.com/tendermint/tendermint/blockchain/v0.(*BlockchainReactor).SwitchToFastSync
      github.com/tendermint/tendermint@v0.34.9/blockchain/v0/reactor.go:125 +0xd8

For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
@chengwenxi
Contributor Author

@alessio @alexanderbez @tessr I posted it here, so we can discuss it more.

@alexanderbez
Contributor

Thank you @chengwenxi! I appreciate the detailed writeup. We definitely need to investigate this. However, my initial suspicion is: why have we not experienced or seen this reported on the mainnet hub? I know for sure there are state-sync nodes that take snapshots post IBC-enablement.

@marbar3778 can you confirm? I suppose we could easily spin up a node to verify this. Do we know of any state sync providers?

@tac0turtle
Member

I was able to state sync prior to IBC being enabled, but this smells like an issue where state sync only syncs state in IAVL; I'm guessing capability state is not in there.

@alexanderbez
Contributor

Yes, caps are in-memory only. They're not part of state-sync on purpose.

@chengwenxi
Contributor Author

I was able to state sync prior to IBC being enabled

Yes, in this case the problem is not reproduced on the mainnet hub because the caps are added on channelOpen.

@alexanderbez
Contributor

/cc @colin-axner @AdityaSripal @cwgoes for visibility.

@colin-axner
Contributor

colin-axner commented May 14, 2021

This is the issue, right? Does NewApp get called for state sync? Or I guess any usage of capabilities during state sync is a problem?

In IBC capabilities are created at various times, sometimes during InitChain for binding ports, always during a channel handshake, and randomly by applications as they decide to bind to new port names.

@chengwenxi
Contributor Author

chengwenxi commented May 17, 2021

This is the issue, right? Does NewApp get called for state sync?

Yes, NewApp is called. But when using state sync, the App is generated from genesis and therefore does not contain the capabilities.

I think app.CapabilityKeeper.InitializeAndSeal(ctx) should be called again to initialize the capabilities after state synchronization is completed.
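
A rough sketch of that idea, reusing the simplified capKeeper model from the issue description above (the hook name and where it would be wired into the node are assumptions, not an existing SDK API):

    // Hypothetical post-state-sync hook: once the snapshot restore has filled
    // the persistent store, rebuild the in-memory index exactly as app
    // construction would have done on a normally synced node.
    func (k *capKeeper) onSnapshotRestored() {
        k.initializeMemStore()
    }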

@kwunyeung
Contributor

Just experienced this on Cosmos Hub. I'm trying to state sync a new node.

May 22 13:13:56 cosmos-m-rpc-ln-fra gaiad[65965]: 1:13PM INF committed state app_hash=67B9677E9E3638F0AB4077CFF984E14F4596BC73436534E1977CB1841CC0043D height=6308367 module=state num_txs=0
May 22 13:13:56 cosmos-m-rpc-ln-fra gaiad[65965]: 1:13PM INF indexed block height=6308367 module=txindex
May 22 13:13:56 cosmos-m-rpc-ln-fra gaiad[65965]: 1:13PM INF minted coins from module account amount=4385553uatom from=mint module=x/bank
May 22 13:13:56 cosmos-m-rpc-ln-fra gaiad[65965]: 1:13PM INF executed block height=6308368 module=state num_invalid_txs=0 num_valid_txs=2
May 22 13:13:56 cosmos-m-rpc-ln-fra gaiad[65965]: 1:13PM INF commit synced commit=436F6D6D697449447B5B383920313139203137203635203231362032312031303420313420323720313736203232312031373620383020333820353220343320393520393120323134203132302032333320313335203136352032312032313320313134203138312031393420313635203132322032372039325D3A3630343231307D
May 22 13:13:56 cosmos-m-rpc-ln-fra gaiad[65965]: 1:13PM INF committed state app_hash=59771141D815680E1BB0DDB05026342B5F5BD678E987A515D572B5C2A57A1B5C height=6308368 module=state num_txs=2
May 22 13:13:56 cosmos-m-rpc-ln-fra gaiad[65965]: 1:13PM INF indexed block height=6308368 module=txindex
May 22 13:13:56 cosmos-m-rpc-ln-fra gaiad[65965]: 1:13PM INF minted coins from module account amount=4385553uatom from=mint module=x/bank
May 22 13:13:56 cosmos-m-rpc-ln-fra gaiad[65965]: 1:13PM INF client state updated client-id=07-tendermint-146 height=2-1047307 module=x/ibc/client
May 22 13:13:57 cosmos-m-rpc-ln-fra gaiad[65965]: 1:13PM INF executed block height=6308369 module=state num_invalid_txs=1 num_valid_txs=2
May 22 13:13:57 cosmos-m-rpc-ln-fra gaiad[65965]: 1:13PM INF commit synced commit=436F6D6D697449447B5B313537203139372031343120313436203836203220323238203335203936203138322033372031383320323333203137312031343120373620323534203134332038332031313720323131203434203134312031353720323337203130392031393920323337203139342032353320323035203134315D3A3630343231317D
May 22 13:13:57 cosmos-m-rpc-ln-fra gaiad[65965]: 1:13PM INF committed state app_hash=9DC58D925602E42360B625B7E9AB8D4CFE8F5375D32C8D9DED6DC7EDC2FDCD8D height=6308369 module=state num_txs=3
May 22 13:13:57 cosmos-m-rpc-ln-fra gaiad[65965]: panic: Failed to process committed block (6308370:37377A88958D37238FC4E2AF223E4C29456C8F8A91383BD41A6ECE015A140A01): wrong Block.Header.AppHash.  Expected 9DC58D925602E42360B625B7E9AB8D4CFE8F5375D32C8D9DED6DC7EDC2FDCD8D, got F7567C7F223363DF9A6273BAA54111661026E6EBA6EDA8A911A412B78784BBA5
May 22 13:13:57 cosmos-m-rpc-ln-fra gaiad[65965]: goroutine 326 [running]:
May 22 13:13:57 cosmos-m-rpc-ln-fra gaiad[65965]: github.com/tendermint/tendermint/blockchain/v0.(*BlockchainReactor).poolRoutine(0xc000c8ee00, 0xc029c65d01)
May 22 13:13:57 cosmos-m-rpc-ln-fra gaiad[65965]:         github.com/tendermint/tendermint@v0.34.9/blockchain/v0/reactor.go:401 +0x15bf
May 22 13:13:57 cosmos-m-rpc-ln-fra gaiad[65965]: created by github.com/tendermint/tendermint/blockchain/v0.(*BlockchainReactor).SwitchToFastSync
May 22 13:13:57 cosmos-m-rpc-ln-fra gaiad[65965]:         github.com/tendermint/tendermint@v0.34.9/blockchain/v0/reactor.go:125 +0xd8
May 22 13:14:00 cosmos-m-rpc-ln-fra gaiad[65985]: 1:14PM INF starting ABCI with Tendermint
May 22 13:14:03 cosmos-m-rpc-ln-fra gaiad[65985]: 1:14PM INF Starting multiAppConn service impl=multiAppConn module=proxy
May 22 13:14:03 cosmos-m-rpc-ln-fra gaiad[65985]: 1:14PM INF Starting localClient service connection=query impl=localClient module=abci-client
May 22 13:14:03 cosmos-m-rpc-ln-fra gaiad[65985]: 1:14PM INF Starting localClient service connection=snapshot impl=localClient module=abci-client
May 22 13:14:03 cosmos-m-rpc-ln-fra gaiad[65985]: 1:14PM INF Starting localClient service connection=mempool impl=localClient module=abci-client
May 22 13:14:03 cosmos-m-rpc-ln-fra gaiad[65985]: 1:14PM INF Starting localClient service connection=consensus impl=localClient module=abci-client
May 22 13:14:03 cosmos-m-rpc-ln-fra gaiad[65985]: 1:14PM INF Starting EventBus service impl=EventBus module=events
May 22 13:14:03 cosmos-m-rpc-ln-fra gaiad[65985]: 1:14PM INF Starting PubSub service impl=PubSub module=pubsub
May 22 13:14:03 cosmos-m-rpc-ln-fra gaiad[65985]: 1:14PM INF Starting IndexerService service impl=IndexerService module=txindex
May 22 13:14:03 cosmos-m-rpc-ln-fra gaiad[65985]: 1:14PM INF Found local state with non-zero height, skipping state sync
May 22 13:14:03 cosmos-m-rpc-ln-fra gaiad[65985]: 1:14PM INF ABCI Handshake App Info hash="�ō�V\x02�#`�%�髍L��Su�,���m����͍" height=6308369 module=consensus protocol-version=0 software-version=
May 22 13:14:03 cosmos-m-rpc-ln-fra gaiad[65985]: 1:14PM INF ABCI Replay Blocks appHeight=6308369 module=consensus stateHeight=6308369 storeHeight=6308370
May 22 13:14:03 cosmos-m-rpc-ln-fra gaiad[65985]: 1:14PM INF Replay last block using real app module=consensus
May 22 13:14:03 cosmos-m-rpc-ln-fra gaiad[65985]: Error: error during handshake: error on replay: wrong Block.Header.AppHash.  Expected 9DC58D925602E42360B625B7E9AB8D4CFE8F5375D32C8D9DED6DC7EDC2FDCD8D, got F7567C7F223363DF9A6273BAA54111661026E6EBA6EDA8A911A412B78784BBA5

Tried a few times and it panics at the same block height.

@alexanderbez
Contributor

Thanks for confirming @kwunyeung! How do we go about fixing this @colin-axner @AdityaSripal? Seems like a pretty serious bug.

@chengwenxi
Contributor Author

@kwunyeung Hi, kwunyeung! Could you point me to some snapshot nodes? I want to reproduce issue #9234.

@kwunyeung
Contributor

@chengwenxi you can connect to the following nodes. They both have snapshots.

ae26f01b2bc504532a1cc15ce9da0b85ee5a98e7@139.177.178.149:26656
ee27245d88c632a556cf72cc7f3587380c09b469@45.79.249.253:26656

And if you need RPCs
https://rpc.cosmoshub.forbole.com/
https://rpc.cosmoshub.bigdipper.live/

@chengwenxi
Contributor Author

@chengwenxi you can connect to the following nodes. They both have snapshots.

extremely grateful ❤️ !

@zmanian
Member

zmanian commented May 24, 2021

I think we need to have something that iterates over every open port and creates a capability after state is synced.

Should be a non-consensus-breaking fix.

@colin-axner
Contributor

How do we go about fixing this @colin-axner @AdityaSripal? Seems like a pretty serious bug.

We will discuss this later today

I think we need to have something that iterates over every open port and creates a capability after state is synced.

This sounds like the right approach to me. I'm not sure how claiming capabilities will work. It'd also be nice to have a general solution. This does not only affect IBC; it affects any module that uses x/capability.

@AdityaSripal AdityaSripal self-assigned this May 24, 2021
@AdityaSripal
Member

The issue is that NewApp gets called before state sync fills in the persistent store data for capabilities. Thus we cannot initialize the in-memory part.

The solution I have in mind is to fill in the in-memory component as needed, whenever GetCapability or a similar function is called, using the data that is already available in the persistent store.
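
Sketched against the same simplified model as in the issue description (not the actual SDK code), the lazy-fill idea looks roughly like this:

    // getCapabilityLazy falls back to the persistent store when the in-memory
    // index has no entry and repopulates it on the fly. Sketch only; the real
    // fix also has to rebuild the reverse mapping and capability pointers.
    func (k *capKeeper) getCapabilityLazy(name string) (uint64, bool) {
        if idx, ok := k.mem[name]; ok {
            return idx, true
        }
        idx, ok := k.persistent[name]
        if !ok {
            return 0, false
        }
        k.mem[name] = idx // fill in the in-memory component as needed
        return idx, true
    }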

@AdityaSripal
Member

I have a non-breaking fix up that should fix the issue for the 0.42 line here: https://github.com/cosmos/cosmos-sdk/tree/aditya/cap-init

Here's the diff: https://github.com/cosmos/cosmos-sdk/compare/v0.42.5...aditya/cap-init?expand=1

Unfortunately, the fix I proposed above can only be done efficiently if we move the reverse mapping into the persistent store. The reverse mapping is deterministic, so there's no issue moving it; it's just a breaking change. Once that is done, reconstructing the forward mapping and capmap on the fly is trivial. This fix should go into 0.43.

I will write tests for this tomorrow, but in the meantime it would be great if someone is able to test it out and see whether statesync works.

@alessio
Contributor

alessio commented May 25, 2021

@chengwenxi @kwunyeung can you apply @AdityaSripal's patch to the sources, rebuild and see whether this fixes the issue for Cosmos SDK v0.42 and gaia, please?
https://github.com/cosmos/cosmos-sdk/compare/v0.42.5...aditya/cap-init.patch

@chengwenxi
Contributor Author

Here's the diff: https://github.com/cosmos/cosmos-sdk/compare/v0.42.5...aditya/cap-init?expand=1

It seems that k.InitializeAndSeal(ctx) will consume additional gas compared to an already-initialized node. Should we use a cache ctx?
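
One way to avoid metering that work, sketched with the usual sdk.Context helpers (illustrative only, not the actual patch):

    package app // illustrative placement only

    import (
        sdk "github.com/cosmos/cosmos-sdk/types"
        capabilitykeeper "github.com/cosmos/cosmos-sdk/x/capability/keeper"
    )

    // initCapabilitiesNoGas runs the (re)initialization on a branched context
    // with an unmetered gas meter, so a state-synced node does not consume gas
    // that already-initialized nodes never paid.
    func initCapabilitiesNoGas(ctx sdk.Context, k *capabilitykeeper.Keeper) {
        noGasCtx := ctx.WithGasMeter(sdk.NewInfiniteGasMeter())
        cacheCtx, writeCache := noGasCtx.CacheContext()
        k.InitializeAndSeal(cacheCtx)
        writeCache() // keep only what InitializeAndSeal wrote to the branch
    }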

@kwunyeung
Contributor

@alessio I applied the patch to v0.42.5 and am now trying to state sync a new node. It's weird that even though the new node is connected to my other 2 nodes with snapshots, it can't find the snapshots during the snapshot discovery period. I'm going to run it for a day and see whether it starts and doesn't crash.

@zmanian
Member

zmanian commented May 25, 2021

I've encountered this issue before, and I think it tends to happen if the pruning settings and the snapshot settings are not aligned correctly, but it's a total pain to debug.
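
For reference, one way to keep the two aligned in app.toml (values are illustrative; the point is that the heights at which snapshots are taken must not be pruned before peers can fetch them):

    # app.toml – illustrative values only
    pruning = "custom"
    pruning-keep-recent = "100"
    pruning-keep-every = "500"   # keep every 500th height...
    pruning-interval = "10"

    [state-sync]
    snapshot-interval = 500      # ...and take snapshots at those retained heights
    snapshot-keep-recent = 2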

@AdityaSripal
Member

Note the previous commit had an issue. Please try against the latest PR above.

@kwunyeung
Contributor

@zmanian those nodes are using default pruning and I set snapshot-interval to 500. I keep getting

ERR Failed to add snapshot err="failed to obtain the header at height #6353217: no witnesses connected. please reset light client" format=1 height=6353000 module=statesync peer=ee27245d88c632a556cf72cc7f3587380c09b469

or

ERR Failed to add snapshot err="no witnesses connected. please reset light client" format=1 height=6353000 module=statesync peer=ae26f01b2bc504532a1cc15ce9da0b85ee5a98e7

now

@zmanian
Member

zmanian commented May 26, 2021

What is the trust period on the syncing nodes set to?

@kwunyeung
Contributor

@zmanian 336h

@tac0turtle
Member

@zmanian those nodes are using default pruning and I set snapshot-interval to 500. I keep getting

ERR Failed to add snapshot err="failed to obtain the header at height #6353217: no witnesses connected. please reset light client" format=1 height=6353000 module=statesync peer=ee27245d88c632a556cf72cc7f3587380c09b469

or

ERR Failed to add snapshot err="no witnesses connected. please reset light client" format=1 height=6353000 module=statesync peer=ae26f01b2bc504532a1cc15ce9da0b85ee5a98e7

now

Could you post an excerpt of your config.toml statesync section? Either you don't have enough RPC connections, or it's the light client issue I also ran into (tendermint/tendermint#6389).

Could you also change your log level to debug to see what is going on?
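
For comparison, a working [statesync] section in config.toml typically looks like the following (endpoints and values are illustrative, reusing the RPCs mentioned earlier in this thread); the light client needs at least two rpc_servers entries, one primary plus at least one witness, otherwise it reports "no witnesses connected":

    [statesync]
    enable = true

    # At least two RPC endpoints: one primary plus one or more witnesses.
    rpc_servers = "https://rpc.cosmoshub.forbole.com:443,https://rpc.cosmoshub.bigdipper.live:443"
    trust_height = 6355395
    trust_hash = "<hash of block 6355395>"
    trust_period = "336h0m0s"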

@kwunyeung
Contributor

@marbar3778 I started a new node in the same private network as those two nodes serving snapshots and enabled debug logging last night. You can download a short extract of the debug log below.

https://forbole-val.s3.ap-east-1.amazonaws.com/gaiad-state-sync-log.txt.gz

I saw the following log

DBG Matching header received by witness height=6355500 module=light witness=0

The node did request snapshot chunks, but I never saw it apply the chunks.

May 26 13:02:50 cosmos-m-test-fullnode-ln-fra gaiad[65397]: 1:02PM INF Fetching snapshot chunk chunk=11 format=1 height=6355500 module=statesync total=31
May 26 13:02:50 cosmos-m-test-fullnode-ln-fra gaiad[65397]: 1:02PM DBG Requesting snapshot chunk chunk=11 format=1 height=6355500 module=statesync peer=ae26f01b2bc504532a1cc15ce9da0b85ee5a98e7
May 26 13:02:50 cosmos-m-test-fullnode-ln-fra gaiad[65397]: 1:02PM DBG Send channel=97 conn={"Logger":{}} module=p2p msgBytes=1A0908ACF483031001180B peer={"id":"ae26f01b2bc504532a1cc15ce9da0b85ee5a98e7","ip":"139.177.178.149","port":26656}
May 26 13:02:50 cosmos-m-test-fullnode-ln-fra gaiad[65397]: 1:02PM INF Fetching snapshot chunk chunk=10 format=1 height=6355500 module=statesync total=31
May 26 13:02:50 cosmos-m-test-fullnode-ln-fra gaiad[65397]: 1:02PM DBG Requesting snapshot chunk chunk=10 format=1 height=6355500 module=statesync peer=ae26f01b2bc504532a1cc15ce9da0b85ee5a98e7
May 26 13:02:50 cosmos-m-test-fullnode-ln-fra gaiad[65397]: 1:02PM DBG Send channel=97 conn={"Logger":{}} module=p2p msgBytes=1A0908ACF483031001180A peer={"id":"ae26f01b2bc504532a1cc15ce9da0b85ee5a98e7","ip":"139.177.178.149","port":26656}
May 26 13:02:50 cosmos-m-test-fullnode-ln-fra gaiad[65397]: 1:02PM DBG Flush conn={"Logger":{}} module=p2p peer={"id":"ae26f01b2bc504532a1cc15ce9da0b85ee5a98e7","ip":"139.177.178.149","port":26656}
May 26 13:03:00 cosmos-m-test-fullnode-ln-fra gaiad[65397]: 1:03PM DBG Requesting snapshot chunk chunk=8 format=1 height=6355500 module=statesync peer=ae26f01b2bc504532a1cc15ce9da0b85ee5a98e7
May 26 13:03:00 cosmos-m-test-fullnode-ln-fra gaiad[65397]: 1:03PM DBG Send channel=97 conn={"Logger":{}} module=p2p msgBytes=1A0908ACF4830310011808 peer={"id":"ae26f01b2bc504532a1cc15ce9da0b85ee5a98e7","ip":"139.177.178.149","port":26656}
May 26 13:03:00 cosmos-m-test-fullnode-ln-fra gaiad[65397]: 1:03PM INF Fetching snapshot chunk chunk=12 format=1 height=6355500 module=statesync total=31
May 26 13:03:00 cosmos-m-test-fullnode-ln-fra gaiad[65397]: 1:03PM DBG Requesting snapshot chunk chunk=12 format=1 height=6355500 module=statesync peer=ae26f01b2bc504532a1cc15ce9da0b85ee5a98e7
May 26 13:03:00 cosmos-m-test-fullnode-ln-fra gaiad[65397]: 1:03PM DBG Send channel=97 conn={"Logger":{}} module=p2p msgBytes=1A0908ACF483031001180C peer={"id":"ae26f01b2bc504532a1cc15ce9da0b85ee5a98e7","ip":"139.177.178.149","port":26656}
May 26 13:03:00 cosmos-m-test-fullnode-ln-fra gaiad[65397]: 1:03PM DBG Requesting snapshot chunk chunk=10 format=1 height=6355500 module=statesync peer=ae26f01b2bc504532a1cc15ce9da0b85ee5a98e7
May 26 13:03:00 cosmos-m-test-fullnode-ln-fra gaiad[65397]: 1:03PM DBG Send channel=97 conn={"Logger":{}} module=p2p msgBytes=1A0908ACF483031001180A peer={"id":"ae26f01b2bc504532a1cc15ce9da0b85ee5a98e7","ip":"139.177.178.149","port":26656}
May 26 13:03:00 cosmos-m-test-fullnode-ln-fra gaiad[65397]: 1:03PM INF Fetching snapshot chunk chunk=13 format=1 height=6355500 module=statesync total=31
May 26 13:03:00 cosmos-m-test-fullnode-ln-fra gaiad[65397]: 1:03PM DBG Requesting snapshot chunk chunk=13 format=1 height=6355500 module=statesync peer=ae26f01b2bc504532a1cc15ce9da0b85ee5a98e7

After getting all snapshot chunks, it said no witnesses connected.

May 26 13:04:52 cosmos-m-test-fullnode-ln-fra gaiad[65397]: 1:04PM DBG error from light block request from primary, replacing... error="client failed to respond" height=6355008 module=light primary={}
May 26 13:04:52 cosmos-m-test-fullnode-ln-fra gaiad[65397]: 1:04PM ERR Can't verify err="failed to obtain the header at height #6355008: no witnesses connected. please reset light client" module=light
May 26 13:04:52 cosmos-m-test-fullnode-ln-fra gaiad[65397]: 1:04PM ERR Failed to add snapshot err="failed to obtain the header at height #6355008: no witnesses connected. please reset light client" format=1 height=6355000 module=statesync peer=ae26f01b2bc504532a1cc15ce9da0b85ee5a98e7
May 26 13:04:52 cosmos-m-test-fullnode-ln-fra gaiad[65397]: 1:04PM INF Header has already been verified hash=3CAC485961E4FC23199088649FD7D0BBC442603467A618E8C78D921A5216D4B1 height=6355501 module=light
May 26 13:04:52 cosmos-m-test-fullnode-ln-fra gaiad[65397]: 1:04PM INF Header has already been verified hash=348F8434CA521F483D52E2ED452B230AFB332987DB2F1140B2CB127DC133F2EC height=6355502 module=light
May 26 13:04:52 cosmos-m-test-fullnode-ln-fra gaiad[65397]: 1:04PM INF Header has already been verified hash=9DDD287823DE2058A254A01F24F3ADB6357BDE962B41991B8F53D5BF07FA01BF height=6355500 module=light
May 26 13:04:52 cosmos-m-test-fullnode-ln-fra gaiad[65397]: 1:04PM INF Header has already been verified hash=3CAC485961E4FC23199088649FD7D0BBC442603467A618E8C78D921A5216D4B1 height=6355501 module=light
May 26 13:04:52 cosmos-m-test-fullnode-ln-fra gaiad[65397]: 1:04PM INF Header has already been verified hash=348F8434CA521F483D52E2ED452B230AFB332987DB2F1140B2CB127DC133F2EC height=6355502 module=light
May 26 13:04:52 cosmos-m-test-fullnode-ln-fra gaiad[65397]: 1:04PM INF Header has already been verified hash=9DDD287823DE2058A254A01F24F3ADB6357BDE962B41991B8F53D5BF07FA01BF height=6355500 module=light
May 26 13:04:52 cosmos-m-test-fullnode-ln-fra gaiad[65397]: 1:04PM INF Discovered new snapshot format=1 hash="8\x1a%�Z��@��\b\x15�}�B�l��Y[V}[�;�\x1a��7" height=6355500 module=statesync
May 26 13:04:52 cosmos-m-test-fullnode-ln-fra gaiad[65397]: 1:04PM INF Header has already been verified hash=9DDD287823DE2058A254A01F24F3ADB6357BDE962B41991B8F53D5BF07FA01BF height=6355500 module=light
May 26 13:04:52 cosmos-m-test-fullnode-ln-fra gaiad[65397]: 1:04PM INF Header has already been verified hash=3CAC485961E4FC23199088649FD7D0BBC442603467A618E8C78D921A5216D4B1 height=6355501 module=light

The most interesting thing is that the node that panicked earlier is actually in the same private network, and it started downloading the snapshot and applied the chunks immediately. The state was restored in minutes and it started committing blocks, but it panicked at a certain height. That node is running now, as I did an unsafe reset and restarted it right after that panic. The state was synced again. I stopped that node right after the state was restored, disabled state sync and started it again. The node has been running smoothly ever since.

When I tried to test the fix, I was running on another machine. I couldn't restore the state there, just like on this new machine with the "no witnesses connected" errors.

I left that new node running the whole day as I had to finish some other work, and have just had time to check it again. The state was still not restored and the node kept only trying to connect to new nodes. I stopped it, unsafe-reset it and started it again. Voila! It downloaded the snapshot chunks and applied them immediately! I'm going to run it overnight again and see whether it panics at some point.

As I restarted the node without changing any settings, the only interesting thing to note is the trust_height. I set it last night, so the value is quite far behind the current height: it's 6355395 in the config, but the snapshot applied is at height 6368500. I wonder whether picking a much earlier block height actually makes state sync easier.
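
If the stale trust point does matter, a quick way to refresh trust_height and trust_hash from one of the RPCs above before restarting (the jq dependency and the 1000-block offset are just illustrative choices):

    # fetch a recent height and its hash from a trusted RPC
    LATEST=$(curl -s https://rpc.cosmoshub.forbole.com/block | jq -r .result.block.header.height)
    TRUST_HEIGHT=$((LATEST - 1000))
    TRUST_HASH=$(curl -s "https://rpc.cosmoshub.forbole.com/block?height=$TRUST_HEIGHT" | jq -r .result.block_id.hash)
    echo "trust_height = $TRUST_HEIGHT"
    echo "trust_hash = \"$TRUST_HASH\""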

@kwunyeung
Contributor

@AdityaSripal the node has been running without issue for 3 days already. No crash, no restart.
