CONSENSUS FAILURE caused by prune stores #9234

Closed
dreamer-zq opened this issue Apr 30, 2021 · 15 comments

@dreamer-zq
Collaborator

8:02PM ERR Stopping peer for error err=EOF module=p2p peer={"Data":{},"Logger":{}}

8:02PM ERR Stopping peer for error err=EOF module=p2p peer={"Data":{},"Logger":{}}

8:03PM ERR Stopping peer for error err=EOF module=p2p peer={"Data":{},"Logger":{}}

8:04PM ERR Stopping peer for error err=EOF module=p2p peer={"Data":{},"Logger":{}}

8:04PM ERR Stopping peer for error err=EOF module=p2p peer={"Data":{},"Logger":{}}

8:05PM ERR Stopping peer for error err=EOF module=p2p peer={"Data":{},"Logger":{}}

8:05PM ERR Stopping peer for error err=EOF module=p2p peer={"Data":{},"Logger":{}}

8:06PM ERR Stopping peer for error err=EOF module=p2p peer={"Data":{},"Logger":{}}

8:06PM ERR Stopping peer for error err=EOF module=p2p peer={"Data":{},"Logger":{}}

8:07PM ERR Failed to read request err="read tcp 10.140.0.28:26657->60.12.58.83:64467: i/o timeout" module=rpc-server protocol=websocket remote={"IP":"60.12.58.83","Port":64467,"Zone":""}

8:07PM ERR error while stopping connection error="already stopped" module=rpc-server protocol=websocket

8:07PM ERR Stopping peer for error err=EOF module=p2p peer={"Data":{},"Logger":{}}

8:09PM ERR Stopping peer for error err=EOF module=p2p peer={"Data":{},"Logger":{}}

8:09PM ERR Stopping peer for error err=EOF module=p2p peer={"Data":{},"Logger":{}}

8:10PM ERR Stopping peer for error err=EOF module=p2p peer={"Data":{},"Logger":{}}

8:10PM ERR Stopping peer for error err=EOF module=p2p peer={"Data":{},"Logger":{}}

8:10PM ERR CONSENSUS FAILURE!!! err="unable to delete version 9978100 with 1 active readers" module=consensus stack="goroutine 956 [running]:\nruntime/debug.Stack(0xc02fb77f88, 0x1d1ff40, 0xc00a995ee0)\n\t/usr/local/go/src/runtime/debug/stack.go:24 +0x9f\ngithub.com/tendermint/tendermint/consensus.(*State).receiveRoutine.func2(0xc0011c4700, 0x225c5f0)\n\t/root/go/pkg/mod/github.com/tendermint/tendermint@v0.34.9/consensus/state.go:726 +0x57\npanic(0x1d1ff40, 0xc00a995ee0)\n\t/usr/local/go/src/runtime/panic.go:969 +0x1b9\ngithub.com/cosmos/cosmos-sdk/store/rootmulti.(*Store).pruneStores(0xc0000d6280)\n\t/root/go/pkg/mod/github.com/cosmos/cosmos-sdk@v0.42.4/store/rootmulti/store.go:388 +0x268\ngithub.com/cosmos/cosmos-sdk/store/rootmulti.(*Store).Commit(0xc0000d6280, 0x9840fe, 0x0, 0xc012a21080, 0x71374491c28a2f98)\n\t/root/go/pkg/mod/github.com/cosmos/cosmos-sdk@v0.42.4/store/rootmulti/store.go:362 +0x1c6\ngithub.com/cosmos/cosmos-sdk/baseapp.(*BaseApp).Commit(0xc000d621a0, 0x0, 0x0, 0x0, 0x0)\n\t/root/go/pkg/mod/github.com/cosmos/cosmos-sdk@v0.42.4/baseapp/abci.go:293 +0x29c\ngithub.com/tendermint/tendermint/abci/client.(*localClient).CommitSync(0xc000dc7320, 0x0, 0x0, 0x0)\n\t/root/go/pkg/mod/github.com/tendermint/tendermint@v0.34.9/abci/client/local_client.go:258 +0xab\ngithub.com/tendermint/tendermint/proxy.(*appConnConsensus).CommitSync(0xc002472180, 0x0, 0x0, 0xc008cf7fb0)\n\t/root/go/pkg/mod/github.com/tendermint/tendermint@v0.34.9/proxy/app_conn.go:93 +0x33\ngithub.com/tendermint/tendermint/state.(*BlockExecutor).Commit(0xc000ca9500, 0xb, 0x0, 0x0, 0x0, 0xc0024935d0, 0x9, 0x8b9058, 0x9840fe, 0xc0131bc5a0, ...)\n\t/root/go/pkg/mod/github.com/tendermint/tendermint@v0.34.9/state/execution.go:228 +0x23f\ngithub.com/tendermint/tendermint/state.(*BlockExecutor).ApplyBlock(0xc000ca9500, 0xb, 0x0, 0x0, 0x0, 0xc0024935d0, 0x9, 0x8b9058, 0x9840fe, 0xc0131bc5a0, ...)\n\t/root/go/pkg/mod/github.com/tendermint/tendermint@v0.34.9/state/execution.go:180 +0x6c5\ngithub.com/tendermint/tendermint/consensus.(*State).finalizeCommit(0xc0011c4700, 0x9840fe)\n\t/root/go/pkg/mod/github.com/tendermint/tendermint@v0.34.9/consensus/state.go:1635 +0xaa8\ngithub.com/tendermint/tendermint/consensus.(*State).tryFinalizeCommit(0xc0011c4700, 0x9840fe)\n\t/root/go/pkg/mod/github.com/tendermint/tendermint@v0.34.9/consensus/state.go:1546 +0x3ba\ngithub.com/tendermint/tendermint/consensus.(*State).enterCommit.func1(0xc0011c4700, 0xc000000000, 0x9840fe)\n\t/root/go/pkg/mod/github.com/tendermint/tendermint@v0.34.9/consensus/state.go:1481 +0x8e\ngithub.com/tendermint/tendermint/consensus.(*State).enterCommit(0xc0011c4700, 0x9840fe, 0x0)\n\t/root/go/pkg/mod/github.com/tendermint/tendermint@v0.34.9/consensus/state.go:1519+0x665\ngithub.com/tendermint/tendermint/consensus.(*State).addVote(0xc0011c4700, 0xc01a7f9d60, 0xc00152b590, 0x28, 0xc0117c8ea0, 0x107, 0x107)\n\t/root/go/pkg/mod/github.com/tendermint/tendermint@v0.34.9/consensus/state.go:2132 +0xd68\ngithub.com/tendermint/tendermint/consensus.(*State).tryAddVote(0xc0011c4700, 0xc01a7f9d60, 0xc00152b590, 0x28, 0x267d1c8e80ed5, 0x3423be0, 0x27f0ceed)\n\t/root/go/pkg/mod/github.com/tendermint/tendermint@v0.34.9/consensus/state.go:1930 +0x53\ngithub.com/tendermint/tendermint/consensus.(*State).handleMsg(0xc0011c4700, 0x246ae20, 0xc01ff61f00, 0xc00152b590, 0x28)\n\t/root/go/pkg/mod/github.com/tendermint/tendermint@v0.34.9/consensus/state.go:838 +0x8b7\ngithub.com/tendermint/tendermint/consensus.(*State).receiveRoutine(0xc0011c4700, 
0x0)\n\t/root/go/pkg/mod/github.com/tendermint/tendermint@v0.34.9/consensus/state.go:762 +0x7d6\ncreated by github.com/tendermint/tendermint/consensus.(*State).OnStart\n\t/root/go/pkg/mod/github.com/tendermint/tendermint@v0.34.9/consensus/state.go:378 +0x896\n"

panic: runtime error: invalid memory address or nil pointer dereference

[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xf7621a]

goroutine 1211848566 [running]:

github.com/cosmos/iavl.(*nodeDB).incrVersionReaders(0x0, 0x0)

/root/go/pkg/mod/github.com/cosmos/iavl@v0.15.3/nodedb.go:595 +0x3a

github.com/cosmos/iavl.newExporter(0xc0177f7d80, 0x9840f4)

/root/go/pkg/mod/github.com/cosmos/iavl@v0.15.3/export.go:45 +0xd2

github.com/cosmos/iavl.(*ImmutableTree).Export(...)

/root/go/pkg/mod/github.com/cosmos/iavl@v0.15.3/immutable_tree.go:142

github.com/cosmos/cosmos-sdk/store/iavl.(*Store).Export(0xc0004608c0, 0x9840f4, 0xc000b37a60, 0x1, 0x0)

/root/go/pkg/mod/github.com/cosmos/cosmos-sdk@v0.42.4/store/iavl/store.go:239 +0x116

github.com/cosmos/cosmos-sdk/store/rootmulti.(*Store).Snapshot.func2(0xc015e70cc0, 0xc013552c00, 0x16, 0x20, 0x9840f4)

/root/go/pkg/mod/github.com/cosmos/cosmos-sdk@v0.42.4/store/rootmulti/store.go:656 +0x636

created by github.com/cosmos/cosmos-sdk/store/rootmulti.(*Store).Snapshot

/root/go/pkg/mod/github.com/cosmos/cosmos-sdk@v0.42.4/store/rootmulti/store.go:623 +0x4de
@alexanderbez
Contributor

unable to delete version 9978100 with 1 active readers seems suspect to me... Why was there an active reader? Did you have any other external process trying to read from the database? What application is this?

@dreamer-zq
Collaborator Author

unable to delete version 9978100 with 1 active readers seems suspect to me... Why was there an active reader? Did you have any other external process trying to read from the database? What application is this?

This node provides API services for external system queries, so there may be external access while data is being pruned. Setting aside the mistake on our side, I think that while pruning is in progress the API should either fail immediately or let the caller wait until a timeout; it should not affect the normal operation of the node.

@alexanderbez
Contributor

You cannot let there be other readers of the database. I'm pretty sure we prevented this from even happening.

@chengwenxi
Contributor

You cannot let there be other readers of the database. I'm pretty sure we prevented this from even happening.

@alexanderbez It may be caused by another reader being opened for snapshotting when snapshot-interval is not a multiple of pruning-keep-every; I found that there is no check for this.
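
For context on the error itself, here is a minimal, self-contained Go sketch of the reader-count guard pattern that produces "unable to delete version X with N active readers" (a hypothetical illustration, not the actual iavl nodeDB code): a snapshot export increments a per-version reader count before iterating a version, and pruning refuses to delete a version while that count is non-zero. In the stack trace above, Commit panics when pruneStores hits this error, which is what surfaces as the CONSENSUS FAILURE.

package main

import (
	"fmt"
	"sync"
)

// versionDB is a toy stand-in for a versioned store that tracks active
// readers per version, similar in spirit to the guard behind the
// "unable to delete version X with N active readers" error.
type versionDB struct {
	mu      sync.Mutex
	readers map[int64]int // version -> number of open readers (e.g. snapshot exporters)
}

func newVersionDB() *versionDB {
	return &versionDB{readers: make(map[int64]int)}
}

// openReader is what a snapshot export would do before iterating a version.
func (db *versionDB) openReader(version int64) {
	db.mu.Lock()
	defer db.mu.Unlock()
	db.readers[version]++
}

// closeReader releases the reader once the export finishes.
func (db *versionDB) closeReader(version int64) {
	db.mu.Lock()
	defer db.mu.Unlock()
	db.readers[version]--
}

// deleteVersion is what pruning would call; it must not delete a version
// that still has open readers, so it returns an error instead.
func (db *versionDB) deleteVersion(version int64) error {
	db.mu.Lock()
	defer db.mu.Unlock()
	if n := db.readers[version]; n > 0 {
		return fmt.Errorf("unable to delete version %d with %d active readers", version, n)
	}
	// ... actual deletion of the version's nodes would happen here ...
	return nil
}

func main() {
	db := newVersionDB()
	db.openReader(9978100) // a snapshot export of this version is still running
	if err := db.deleteVersion(9978100); err != nil {
		fmt.Println(err) // pruning at Commit time fails with the error seen above
	}
}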

ryanchristo added this to the v0.43 milestone May 7, 2021
@alexanderbez
Contributor

@chengwenxi yes, you are correct. I took a look and I cannot find anywhere where we explicitly check that snapshot-interval is a multiple of pruning-keep-every. However, we clearly document this requirement in the configuration template/docs.

It might be good to enforce this invariant somewhere, but that somewhere is not clear. We could do it in BaseApp, we could do it in the server command logic, and perhaps a few other places.

Regardless, it is documented.

@chengwenxi
Contributor

@alexanderbez It's my fault; there is a check for it in baseapp#L308:

if pruningOpts.KeepEvery > 0 && app.snapshotInterval%pruningOpts.KeepEvery != 0 {
	return fmt.Errorf(
		"state sync snapshot interval %v must be a multiple of pruning keep every interval %v",
		app.snapshotInterval, pruningOpts.KeepEvery)
}

But it does not cover the case where pruningOpts.KeepEvery == 0.

@alexanderbez
Contributor

Ahh I didn't bother to check BaseApp#init -- IMHO having it there is not that intuitive.

Do we need to add a case for pruningOpts.KeepEvery == 0 and app.snapshotInterval > 0?
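
Something roughly along these lines, perhaps, next to the check quoted above (a sketch only, not a proposed final patch; the pruningOpts.Interval field name is an assumption, not verified against the code):

// Sketch only: also reject snapshots combined with pruning that keeps no
// permanent heights, since pruning may then delete the very version a
// snapshot exporter is still reading.
if app.snapshotInterval > 0 && pruningOpts.KeepEvery == 0 && pruningOpts.Interval > 0 {
	return fmt.Errorf(
		"state sync snapshot interval %v requires pruning-keep-every > 0 when pruning is enabled",
		app.snapshotInterval)
}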

@chengwenxi
Contributor

Yes, I think this is required, but I'm not completely sure.

If the application data is large enough, the snapshot will take a long time (more than 30s), and this issue should be reproducible. Do you know of any snapshotting nodes on cosmoshub-4? I plan to run a full node and reproduce it.

@alexanderbez
Contributor

alexanderbez commented May 11, 2021

There are many snapshotting nodes but it's not easy to find out where they are. @marbar3778 do you know of any?

@blushi
Contributor

blushi commented Jun 1, 2021

@robert-zaremba @anilcse any updates on this?

@anilcse
Collaborator

anilcse commented Jun 2, 2021

Looks like it is actively being discussed in #9324. @alexanderbez @chengwenxi are these issues relevant? Do we need to fix this separately?

@alexanderbez
Contributor

I believe this is a separate fix, yes.

@chengwenxi
Contributor

Looks like it is actively being discussed in #9324. @alexanderbez @chengwenxi are these issues relevant? Do we need to fix this separately?

It's a separate issue; I mentioned it only because I want to find some snapshotting nodes to reproduce this issue.

I'm sorry, I haven't done the test yet; I will do it this weekend.

@chengwenxi
Contributor

chengwenxi commented Jun 7, 2021

@alexanderbez @anilcse You can reproduce it using the configuration files below

app.toml

pruning = "custom"

# These are applied if and only if the pruning strategy is custom.
pruning-keep-recent = "100"
pruning-keep-every = "0"
pruning-interval = "100"

[state-sync]

# snapshot-interval specifies the block interval at which local state sync snapshots are
# taken (0 to disable). Must be a multiple of pruning-keep-every.
snapshot-interval = 100

# snapshot-keep-recent specifies the number of recent snapshots to keep and serve (0 to keep all).
snapshot-keep-recent = 2

config.toml

persistent_peers = "ae26f01b2bc504532a1cc15ce9da0b85ee5a98e7@139.177.178.149:26656,ee27245d88c632a556cf72cc7f3587380c09b469@45.79.249.253:26656"

[statesync]
# State sync rapidly bootstraps a new node by discovering, fetching, and restoring a state machine
# snapshot from peers instead of fetching and replaying historical blocks. Requires some peers in
# the network to take and serve state machine snapshots. State sync is not attempted if the node
# has any local state (LastBlockHeight > 0). The node will have a truncated block history,
# starting from the height of the snapshot.
enable = false

# RPC servers (comma-separated) for light client verification of the synced state machine and
# retrieval of state data for node bootstrapping. Also needs a trusted height and corresponding
# header hash obtained from a trusted source, and a period during which validators can be trusted.
#
# For Cosmos SDK-based chains, trust_period should usually be about 2/3 of the unbonding time (~2
# weeks) during which they can be financially punished (slashed) for misbehavior.
rpc_servers = "tcp://139.177.178.149:26657,tcp://45.79.249.253:26657"
trust_height = 6480000
trust_hash = "ACE641C2509D02DBB4063AAE6F8AAE953FE11057D93AF1C7A0EB263CACBA3D53"
trust_period = "168h0m0s"

# Time to spend discovering snapshots before initiating a restore.
discovery_time = "15s"
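
To see why the existing check accepts this configuration, here is a small standalone sketch (illustration only, not SDK code) plugging these values into the condition quoted from baseapp above: with pruning-keep-every = 0 the KeepEvery > 0 guard short-circuits, so nothing is rejected even though pruning and snapshots both fire every 100 blocks.

package main

import "fmt"

func main() {
	// Values from the app.toml above.
	var (
		keepRecent       uint64 = 100 // pruning-keep-recent
		keepEvery        uint64 = 0   // pruning-keep-every
		pruningInterval  uint64 = 100 // pruning-interval
		snapshotInterval uint64 = 100 // snapshot-interval
	)

	// The condition from the baseapp check quoted earlier in this thread.
	rejected := keepEvery > 0 && snapshotInterval%keepEvery != 0
	fmt.Println("config rejected by the existing check:", rejected) // false

	// Both pruning and snapshotting trigger at every height that is a
	// multiple of 100, so a long-running snapshot export can still hold a
	// reader on a version by the time it falls out of the keep-recent window
	// and pruning tries to delete it.
	fmt.Printf("pruning every %d blocks (keeping %d recent), snapshot every %d blocks\n",
		pruningInterval, keepRecent, snapshotInterval)
}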

@tac0turtle
Member

closing in favour of #10352
