CONSENSUS FAILURE caused by prune stores #9234

Closed
dreamer-zq opened this issue Apr 30, 2021 · 15 comments

@dreamer-zq
Collaborator

8:02PM ERR Stopping peer for error err=EOF module=p2p peer={"Data":{},"Logger":{}}

8:02PM ERR Stopping peer for error err=EOF module=p2p peer={"Data":{},"Logger":{}}

8:03PM ERR Stopping peer for error err=EOF module=p2p peer={"Data":{},"Logger":{}}

8:04PM ERR Stopping peer for error err=EOF module=p2p peer={"Data":{},"Logger":{}}

8:04PM ERR Stopping peer for error err=EOF module=p2p peer={"Data":{},"Logger":{}}

8:05PM ERR Stopping peer for error err=EOF module=p2p peer={"Data":{},"Logger":{}}

8:05PM ERR Stopping peer for error err=EOF module=p2p peer={"Data":{},"Logger":{}}

8:06PM ERR Stopping peer for error err=EOF module=p2p peer={"Data":{},"Logger":{}}

8:06PM ERR Stopping peer for error err=EOF module=p2p peer={"Data":{},"Logger":{}}

8:07PM ERR Failed to read request err="read tcp 10.140.0.28:26657->60.12.58.83:64467: i/o timeout" module=rpc-server protocol=websocket remote={"IP":"60.12.58.83","Port":64467,"Zone":""}

8:07PM ERR error while stopping connection error="already stopped" module=rpc-server protocol=websocket

8:07PM ERR Stopping peer for error err=EOF module=p2p peer={"Data":{},"Logger":{}}

8:09PM ERR Stopping peer for error err=EOF module=p2p peer={"Data":{},"Logger":{}}

8:09PM ERR Stopping peer for error err=EOF module=p2p peer={"Data":{},"Logger":{}}

8:10PM ERR Stopping peer for error err=EOF module=p2p peer={"Data":{},"Logger":{}}

8:10PM ERR Stopping peer for error err=EOF module=p2p peer={"Data":{},"Logger":{}}

8:10PM ERR CONSENSUS FAILURE!!! err="unable to delete version 9978100 with 1 active readers" module=consensus stack="goroutine 956 [running]:\nruntime/debug.Stack(0xc02fb77f88, 0x1d1ff40, 0xc00a995ee0)\n\t/usr/local/go/src/runtime/debug/stack.go:24 +0x9f\ngithub.com/tendermint/tendermint/consensus.(*State).receiveRoutine.func2(0xc0011c4700, 0x225c5f0)\n\t/root/go/pkg/mod/github.com/tendermint/tendermint@v0.34.9/consensus/state.go:726 +0x57\npanic(0x1d1ff40, 0xc00a995ee0)\n\t/usr/local/go/src/runtime/panic.go:969 +0x1b9\ngithub.com/cosmos/cosmos-sdk/store/rootmulti.(*Store).pruneStores(0xc0000d6280)\n\t/root/go/pkg/mod/github.com/cosmos/cosmos-sdk@v0.42.4/store/rootmulti/store.go:388 +0x268\ngithub.com/cosmos/cosmos-sdk/store/rootmulti.(*Store).Commit(0xc0000d6280, 0x9840fe, 0x0, 0xc012a21080, 0x71374491c28a2f98)\n\t/root/go/pkg/mod/github.com/cosmos/cosmos-sdk@v0.42.4/store/rootmulti/store.go:362 +0x1c6\ngithub.com/cosmos/cosmos-sdk/baseapp.(*BaseApp).Commit(0xc000d621a0, 0x0, 0x0, 0x0, 0x0)\n\t/root/go/pkg/mod/github.com/cosmos/cosmos-sdk@v0.42.4/baseapp/abci.go:293 +0x29c\ngithub.com/tendermint/tendermint/abci/client.(*localClient).CommitSync(0xc000dc7320, 0x0, 0x0, 0x0)\n\t/root/go/pkg/mod/github.com/tendermint/tendermint@v0.34.9/abci/client/local_client.go:258 +0xab\ngithub.com/tendermint/tendermint/proxy.(*appConnConsensus).CommitSync(0xc002472180, 0x0, 0x0, 0xc008cf7fb0)\n\t/root/go/pkg/mod/github.com/tendermint/tendermint@v0.34.9/proxy/app_conn.go:93 +0x33\ngithub.com/tendermint/tendermint/state.(*BlockExecutor).Commit(0xc000ca9500, 0xb, 0x0, 0x0, 0x0, 0xc0024935d0, 0x9, 0x8b9058, 0x9840fe, 0xc0131bc5a0, ...)\n\t/root/go/pkg/mod/github.com/tendermint/tendermint@v0.34.9/state/execution.go:228 +0x23f\ngithub.com/tendermint/tendermint/state.(*BlockExecutor).ApplyBlock(0xc000ca9500, 0xb, 0x0, 0x0, 0x0, 0xc0024935d0, 0x9, 0x8b9058, 0x9840fe, 0xc0131bc5a0, ...)\n\t/root/go/pkg/mod/github.com/tendermint/tendermint@v0.34.9/state/execution.go:180 +0x6c5\ngithub.com/tendermint/tendermint/consensus.(*State).finalizeCommit(0xc0011c4700, 0x9840fe)\n\t/root/go/pkg/mod/github.com/tendermint/tendermint@v0.34.9/consensus/state.go:1635 +0xaa8\ngithub.com/tendermint/tendermint/consensus.(*State).tryFinalizeCommit(0xc0011c4700, 0x9840fe)\n\t/root/go/pkg/mod/github.com/tendermint/tendermint@v0.34.9/consensus/state.go:1546 +0x3ba\ngithub.com/tendermint/tendermint/consensus.(*State).enterCommit.func1(0xc0011c4700, 0xc000000000, 0x9840fe)\n\t/root/go/pkg/mod/github.com/tendermint/tendermint@v0.34.9/consensus/state.go:1481 +0x8e\ngithub.com/tendermint/tendermint/consensus.(*State).enterCommit(0xc0011c4700, 0x9840fe, 0x0)\n\t/root/go/pkg/mod/github.com/tendermint/tendermint@v0.34.9/consensus/state.go:1519+0x665\ngithub.com/tendermint/tendermint/consensus.(*State).addVote(0xc0011c4700, 0xc01a7f9d60, 0xc00152b590, 0x28, 0xc0117c8ea0, 0x107, 0x107)\n\t/root/go/pkg/mod/github.com/tendermint/tendermint@v0.34.9/consensus/state.go:2132 +0xd68\ngithub.com/tendermint/tendermint/consensus.(*State).tryAddVote(0xc0011c4700, 0xc01a7f9d60, 0xc00152b590, 0x28, 0x267d1c8e80ed5, 0x3423be0, 0x27f0ceed)\n\t/root/go/pkg/mod/github.com/tendermint/tendermint@v0.34.9/consensus/state.go:1930 +0x53\ngithub.com/tendermint/tendermint/consensus.(*State).handleMsg(0xc0011c4700, 0x246ae20, 0xc01ff61f00, 0xc00152b590, 0x28)\n\t/root/go/pkg/mod/github.com/tendermint/tendermint@v0.34.9/consensus/state.go:838 +0x8b7\ngithub.com/tendermint/tendermint/consensus.(*State).receiveRoutine(0xc0011c4700, 
0x0)\n\t/root/go/pkg/mod/github.com/tendermint/tendermint@v0.34.9/consensus/state.go:762 +0x7d6\ncreated by github.com/tendermint/tendermint/consensus.(*State).OnStart\n\t/root/go/pkg/mod/github.com/tendermint/tendermint@v0.34.9/consensus/state.go:378 +0x896\n"

panic: runtime error: invalid memory address or nil pointer dereference

[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xf7621a]

goroutine 1211848566 [running]:

github.com/cosmos/iavl.(*nodeDB).incrVersionReaders(0x0, 0x0)

/root/go/pkg/mod/github.com/cosmos/iavl@v0.15.3/nodedb.go:595 +0x3a

github.com/cosmos/iavl.newExporter(0xc0177f7d80, 0x9840f4)

/root/go/pkg/mod/github.com/cosmos/iavl@v0.15.3/export.go:45 +0xd2

github.com/cosmos/iavl.(*ImmutableTree).Export(...)

/root/go/pkg/mod/github.com/cosmos/iavl@v0.15.3/immutable_tree.go:142

github.com/cosmos/cosmos-sdk/store/iavl.(*Store).Export(0xc0004608c0, 0x9840f4, 0xc000b37a60, 0x1, 0x0)

/root/go/pkg/mod/github.com/cosmos/cosmos-sdk@v0.42.4/store/iavl/store.go:239 +0x116

github.com/cosmos/cosmos-sdk/store/rootmulti.(*Store).Snapshot.func2(0xc015e70cc0, 0xc013552c00, 0x16, 0x20, 0x9840f4)

/root/go/pkg/mod/github.com/cosmos/cosmos-sdk@v0.42.4/store/rootmulti/store.go:656 +0x636

created by github.com/cosmos/cosmos-sdk/store/rootmulti.(*Store).Snapshot

/root/go/pkg/mod/github.com/cosmos/cosmos-sdk@v0.42.4/store/rootmulti/store.go:623 +0x4de
@alexanderbez
Contributor

unable to delete version 9978100 with 1 active readers seems suspect to me... Why was there an active reader? Did you have any other external process trying to read from the database? What application is this?

@dreamer-zq
Collaborator Author

unable to delete version 9978100 with 1 active readers seems suspect to me... Why was there an active reader? Did you have any other external process trying to read from the database? What application is this?

This node provides API services for external system queries, so there may be external access while data is being pruned. Setting aside the mistake on our side, I think that while pruning is in progress the API should either fail immediately or let the caller wait until a timeout; it should not affect the normal operation of the node.

@alexanderbez
Contributor

You cannot let there be other readers of the database. I'm pretty sure we prevented this from even happening.

@chengwenxi
Contributor

You cannot let there be other readers of the database. I'm pretty sure we prevented this from even happening.

@alexanderbez It may be caused by another reader being opened for snapshotting when snapshot-interval is not a multiple of pruning-keep-every; I found that there is no check for this.
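
For context on the error itself, here is a minimal, self-contained Go sketch of the reader-count guard pattern that produces "unable to delete version X with N active readers" (a hypothetical illustration, not the actual iavl nodeDB code): a snapshot export increments a per-version reader count before iterating a version, and pruning refuses to delete a version while that count is non-zero. In the stack trace above, Commit panics when pruneStores hits this error, which is what surfaces as the CONSENSUS FAILURE.

package main

import (
	"fmt"
	"sync"
)

// versionDB is a toy stand-in for a versioned store that tracks active
// readers per version, similar in spirit to the guard behind the
// "unable to delete version X with N active readers" error.
type versionDB struct {
	mu      sync.Mutex
	readers map[int64]int // version -> number of open readers (e.g. snapshot exporters)
}

func newVersionDB() *versionDB {
	return &versionDB{readers: make(map[int64]int)}
}

// openReader is what a snapshot export would do before iterating a version.
func (db *versionDB) openReader(version int64) {
	db.mu.Lock()
	defer db.mu.Unlock()
	db.readers[version]++
}

// closeReader releases the reader once the export finishes.
func (db *versionDB) closeReader(version int64) {
	db.mu.Lock()
	defer db.mu.Unlock()
	db.readers[version]--
}

// deleteVersion is what pruning would call; it must not delete a version
// that still has open readers, so it returns an error instead.
func (db *versionDB) deleteVersion(version int64) error {
	db.mu.Lock()
	defer db.mu.Unlock()
	if n := db.readers[version]; n > 0 {
		return fmt.Errorf("unable to delete version %d with %d active readers", version, n)
	}
	// ... actual deletion of the version's nodes would happen here ...
	return nil
}

func main() {
	db := newVersionDB()
	db.openReader(9978100) // a snapshot export of this version is still running
	if err := db.deleteVersion(9978100); err != nil {
		fmt.Println(err) // pruning at Commit time fails with the error seen above
	}
}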

ryanchristo added this to the v0.43 milestone May 7, 2021
@alexanderbez
Contributor

@chengwenxi yes, you are correct. I took a look and I cannot find anywhere where we explicitly check that snapshot-interval is a multiple of pruning-keep-every. However, we clearly document this requirement in the configuration template/docs.

It might be good to enforce this invariant somewhere, but that somewhere is not clear. We could do it in BaseApp, we could do it in the server command logic, and perhaps a few other places.

Regardless, it is documented.

@chengwenxi
Contributor

@alexanderbez It's my fault; there is a check for it in baseapp#L308:

if pruningOpts.KeepEvery > 0 && app.snapshotInterval%pruningOpts.KeepEvery != 0 {
	return fmt.Errorf(
		"state sync snapshot interval %v must be a multiple of pruning keep every interval %v",
		app.snapshotInterval, pruningOpts.KeepEvery)
}

But it does not cover the case where pruningOpts.KeepEvery == 0.

@alexanderbez
Contributor

Ahh I didn't bother to check BaseApp#init -- IMHO having it there is not that intuitive.

Do we need to add a case for pruningOpts.KeepEvery == 0 and app.snapshotInterval > 0?
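
Something roughly along these lines, perhaps, next to the check quoted above (a sketch only, not a proposed final patch; the pruningOpts.Interval field name is an assumption, not verified against the code):

// Sketch only: also reject snapshots combined with pruning that keeps no
// permanent heights, since pruning may then delete the very version a
// snapshot exporter is still reading.
if app.snapshotInterval > 0 && pruningOpts.KeepEvery == 0 && pruningOpts.Interval > 0 {
	return fmt.Errorf(
		"state sync snapshot interval %v requires pruning-keep-every > 0 when pruning is enabled",
		app.snapshotInterval)
}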

@chengwenxi
Contributor

Yes, I think this is required, but I'm not completely sure.

If the application data is large enough, the snapshot will take a long time (more than 30s), and this issue should be reproducible. Do you know of any snapshotting nodes on cosmoshub-4? I plan to run a full node and reproduce it.

@alexanderbez
Contributor

alexanderbez commented May 11, 2021

There are many snapshotting nodes but it's not easy to find out where they are. @marbar3778 do you know of any?

@blushi
Contributor

blushi commented Jun 1, 2021

@robert-zaremba @anilcse any updates on this?

@anilcse
Collaborator

anilcse commented Jun 2, 2021

Looks like it is actively being discussed in #9324. @alexanderbez @chengwenxi are these issues relevant? Do we need to fix this separately?

@alexanderbez
Contributor

I believe this is a separate fix, yes.

@chengwenxi
Contributor

Looks like it is actively being discussed in #9324. @alexanderbez @chengwenxi are these issues relevant? Do we need to fix this separately?

It's a separate issue; I mentioned it only because I want to find some snapshotting nodes to reproduce this issue.

I'm sorry, I haven't done the test yet; I will do it this weekend.

@chengwenxi
Contributor

chengwenxi commented Jun 7, 2021

@alexanderbez @anilcse You can reproduce it using the configuration files below

app.toml

pruning = "custom"

# These are applied if and only if the pruning strategy is custom.
pruning-keep-recent = "100"
pruning-keep-every = "0"
pruning-interval = "100"

[state-sync]

# snapshot-interval specifies the block interval at which local state sync snapshots are
# taken (0 to disable). Must be a multiple of pruning-keep-every.
snapshot-interval = 100

# snapshot-keep-recent specifies the number of recent snapshots to keep and serve (0 to keep all).
snapshot-keep-recent = 2

config.toml

persistent_peers = "ae26f01b2bc504532a1cc15ce9da0b85ee5a98e7@139.177.178.149:26656,ee27245d88c632a556cf72cc7f3587380c09b469@45.79.249.253:26656"

[statesync]
# State sync rapidly bootstraps a new node by discovering, fetching, and restoring a state machine
# snapshot from peers instead of fetching and replaying historical blocks. Requires some peers in
# the network to take and serve state machine snapshots. State sync is not attempted if the node
# has any local state (LastBlockHeight > 0). The node will have a truncated block history,
# starting from the height of the snapshot.
enable = false

# RPC servers (comma-separated) for light client verification of the synced state machine and
# retrieval of state data for node bootstrapping. Also needs a trusted height and corresponding
# header hash obtained from a trusted source, and a period during which validators can be trusted.
#
# For Cosmos SDK-based chains, trust_period should usually be about 2/3 of the unbonding time (~2
# weeks) during which they can be financially punished (slashed) for misbehavior.
rpc_servers = "tcp://139.177.178.149:26657,tcp://45.79.249.253:26657"
trust_height = 6480000
trust_hash = "ACE641C2509D02DBB4063AAE6F8AAE953FE11057D93AF1C7A0EB263CACBA3D53"
trust_period = "168h0m0s"

# Time to spend discovering snapshots before initiating a restore.
discovery_time = "15s"
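
To see why the existing check accepts this configuration, here is a small standalone sketch (illustration only, not SDK code) plugging these values into the condition quoted from baseapp above: with pruning-keep-every = 0 the KeepEvery > 0 guard short-circuits, so nothing is rejected even though pruning and snapshots both fire every 100 blocks.

package main

import "fmt"

func main() {
	// Values from the app.toml above.
	var (
		keepRecent       uint64 = 100 // pruning-keep-recent
		keepEvery        uint64 = 0   // pruning-keep-every
		pruningInterval  uint64 = 100 // pruning-interval
		snapshotInterval uint64 = 100 // snapshot-interval
	)

	// The condition from the baseapp check quoted earlier in this thread.
	rejected := keepEvery > 0 && snapshotInterval%keepEvery != 0
	fmt.Println("config rejected by the existing check:", rejected) // false

	// Both pruning and snapshotting trigger at every height that is a
	// multiple of 100, so a long-running snapshot export can still hold a
	// reader on a version by the time it falls out of the keep-recent window
	// and pruning tries to delete it.
	fmt.Printf("pruning every %d blocks (keeping %d recent), snapshot every %d blocks\n",
		pruningInterval, keepRecent, snapshotInterval)
}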

@tac0turtle
Member

closing in favour of #10352
