CONSENSUS FAILURE caused by prune stores #9234
Comments
This node provides API services for external system queries, so there may be external access while data is being pruned. Setting aside configuration mistakes, I think the API should either return a failure directly or let the caller wait until it times out while pruning is in progress; it must not affect the normal operation of the node. |
You cannot let there be other readers of the database. I'm pretty sure we prevented this from even happening. |
@alexanderbez It may be caused by another reader opened for snapshotting, if the snapshot interval is not a multiple of the pruning keep-every interval. |
@chengwenxi yes, you are correct. I took a look and I cannot find anywhere where we explicitly check that invariant. It might be good to enforce it somewhere, but where is not clear: we could do it in BaseApp, in the server command logic, and perhaps a few other places. Regardless, it is documented. |
@alexanderbez It's my fault, there is a check for it:

```go
if pruningOpts.KeepEvery > 0 && app.snapshotInterval%pruningOpts.KeepEvery != 0 {
	return fmt.Errorf(
		"state sync snapshot interval %v must be a multiple of pruning keep every interval %v",
		app.snapshotInterval, pruningOpts.KeepEvery)
}
```

But it does not consider the case where `pruning-keep-every` is 0. |
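For illustration, here is a hedged sketch of how that check might be extended so the `pruning-keep-every = 0` case is rejected up front. The helper name, its placement, and the error wording are assumptions made for this sketch, not the actual SDK fix; the field names follow the snippet above.

```go
package baseapp // hypothetical placement; the check could also live in the server command logic

import (
	"fmt"

	storetypes "github.com/cosmos/cosmos-sdk/store/types"
)

// validateSnapshotPruning is a hypothetical helper sketching the extended
// invariant: snapshotted heights must never be pruned while an export may
// still be reading them.
func validateSnapshotPruning(snapshotInterval uint64, pruningOpts storetypes.PruningOptions) error {
	if snapshotInterval == 0 {
		return nil // snapshotting disabled, nothing to check
	}
	if pruningOpts.KeepEvery == 0 {
		// keep-every = 0 means every old version is eventually deleted, so a
		// long-running snapshot export can race with pruneStores (this issue).
		return fmt.Errorf(
			"state sync snapshot interval %d requires pruning-keep-every > 0 so that snapshotted heights are never pruned",
			snapshotInterval)
	}
	if snapshotInterval%pruningOpts.KeepEvery != 0 {
		return fmt.Errorf(
			"state sync snapshot interval %d must be a multiple of pruning keep every interval %d",
			snapshotInterval, pruningOpts.KeepEvery)
	}
	return nil
}
```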
Ahh, I didn't bother to check that. Do we need to add a case for `KeepEvery == 0`? |
Yes, I think this is required, but I'm not sure. If the application data is large enough, the snapshot will take a long time (more than 30s), and this issue should be reproducible. Do you know of any snapshotting nodes? |
There are many snapshotting nodes but it's not easy to find out where they are. @marbar3778 do you know of any? |
@robert-zaremba @anilcse any updates on this? |
Looks like it is actively being discussed here: #9324. @alexanderbez @chengwenxi are these issues relevant? Do we need to fix this separately? |
I believe this is a separate fix, yes. |
It's a separate issue; I mentioned it only because I want to find some snapshotting nodes to reproduce this one. Sorry, I haven't run the test yet; I will do it this weekend. |
@alexanderbez @anilcse You can reproduce it using the configuration files below.

app.toml:

```toml
pruning = "custom"

# These are applied if and only if the pruning strategy is custom.
pruning-keep-recent = "100"
pruning-keep-every = "0"
pruning-interval = "100"

[state-sync]

# snapshot-interval specifies the block interval at which local state sync snapshots are
# taken (0 to disable). Must be a multiple of pruning-keep-every.
snapshot-interval = 100

# snapshot-keep-recent specifies the number of recent snapshots to keep and serve (0 to keep all).
snapshot-keep-recent = 2
```

config.toml:

```toml
persistent_peers = "ae26f01b2bc504532a1cc15ce9da0b85ee5a98e7@139.177.178.149:26656,ee27245d88c632a556cf72cc7f3587380c09b469@45.79.249.253:26656"

[statesync]
# State sync rapidly bootstraps a new node by discovering, fetching, and restoring a state machine
# snapshot from peers instead of fetching and replaying historical blocks. Requires some peers in
# the network to take and serve state machine snapshots. State sync is not attempted if the node
# has any local state (LastBlockHeight > 0). The node will have a truncated block history,
# starting from the height of the snapshot.
enable = false

# RPC servers (comma-separated) for light client verification of the synced state machine and
# retrieval of state data for node bootstrapping. Also needs a trusted height and corresponding
# header hash obtained from a trusted source, and a period during which validators can be trusted.
#
# For Cosmos SDK-based chains, trust_period should usually be about 2/3 of the unbonding time (~2
# weeks) during which they can be financially punished (slashed) for misbehavior.
rpc_servers = "tcp://139.177.178.149:26657,tcp://45.79.249.253:26657"
trust_height = 6480000
trust_hash = "ACE641C2509D02DBB4063AAE6F8AAE953FE11057D93AF1C7A0EB263CACBA3D53"
trust_period = "168h0m0s"

# Time to spend discovering snapshots before initiating a restore.
discovery_time = "15s"
```
|
closing in favour of #10352 |
```
8:02PM ERR Stopping peer for error err=EOF module=p2p peer={"Data":{},"Logger":{}}
8:02PM ERR Stopping peer for error err=EOF module=p2p peer={"Data":{},"Logger":{}}
8:03PM ERR Stopping peer for error err=EOF module=p2p peer={"Data":{},"Logger":{}}
8:04PM ERR Stopping peer for error err=EOF module=p2p peer={"Data":{},"Logger":{}}
8:04PM ERR Stopping peer for error err=EOF module=p2p peer={"Data":{},"Logger":{}}
8:05PM ERR Stopping peer for error err=EOF module=p2p peer={"Data":{},"Logger":{}}
8:05PM ERR Stopping peer for error err=EOF module=p2p peer={"Data":{},"Logger":{}}
8:06PM ERR Stopping peer for error err=EOF module=p2p peer={"Data":{},"Logger":{}}
8:06PM ERR Stopping peer for error err=EOF module=p2p peer={"Data":{},"Logger":{}}
8:07PM ERR Failed to read request err="read tcp 10.140.0.28:26657->60.12.58.83:64467: i/o timeout" module=rpc-server protocol=websocket remote={"IP":"60.12.58.83","Port":64467,"Zone":""}
8:07PM ERR error while stopping connection error="already stopped" module=rpc-server protocol=websocket
8:07PM ERR Stopping peer for error err=EOF module=p2p peer={"Data":{},"Logger":{}}
8:09PM ERR Stopping peer for error err=EOF module=p2p peer={"Data":{},"Logger":{}}
8:09PM ERR Stopping peer for error err=EOF module=p2p peer={"Data":{},"Logger":{}}
8:10PM ERR Stopping peer for error err=EOF module=p2p peer={"Data":{},"Logger":{}}
8:10PM ERR Stopping peer for error err=EOF module=p2p peer={"Data":{},"Logger":{}}
8:10PM ERR CONSENSUS FAILURE!!! err="unable to delete version 9978100 with 1 active readers" module=consensus stack="goroutine 956 [running]:\nruntime/debug.Stack(0xc02fb77f88, 0x1d1ff40, 0xc00a995ee0)\n\t/usr/local/go/src/runtime/debug/stack.go:24 +0x9f\ngithub.com/tendermint/tendermint/consensus.(*State).receiveRoutine.func2(0xc0011c4700, 0x225c5f0)\n\t/root/go/pkg/mod/github.com/tendermint/tendermint@v0.34.9/consensus/state.go:726 +0x57\npanic(0x1d1ff40, 0xc00a995ee0)\n\t/usr/local/go/src/runtime/panic.go:969 +0x1b9\ngithub.com/cosmos/cosmos-sdk/store/rootmulti.(*Store).pruneStores(0xc0000d6280)\n\t/root/go/pkg/mod/github.com/cosmos/cosmos-sdk@v0.42.4/store/rootmulti/store.go:388 +0x268\ngithub.com/cosmos/cosmos-sdk/store/rootmulti.(*Store).Commit(0xc0000d6280, 0x9840fe, 0x0, 0xc012a21080, 0x71374491c28a2f98)\n\t/root/go/pkg/mod/github.com/cosmos/cosmos-sdk@v0.42.4/store/rootmulti/store.go:362 +0x1c6\ngithub.com/cosmos/cosmos-sdk/baseapp.(*BaseApp).Commit(0xc000d621a0, 0x0, 0x0, 0x0, 0x0)\n\t/root/go/pkg/mod/github.com/cosmos/cosmos-sdk@v0.42.4/baseapp/abci.go:293 +0x29c\ngithub.com/tendermint/tendermint/abci/client.(*localClient).CommitSync(0xc000dc7320, 0x0, 0x0, 0x0)\n\t/root/go/pkg/mod/github.com/tendermint/tendermint@v0.34.9/abci/client/local_client.go:258 +0xab\ngithub.com/tendermint/tendermint/proxy.(*appConnConsensus).CommitSync(0xc002472180, 0x0, 0x0, 0xc008cf7fb0)\n\t/root/go/pkg/mod/github.com/tendermint/tendermint@v0.34.9/proxy/app_conn.go:93 +0x33\ngithub.com/tendermint/tendermint/state.(*BlockExecutor).Commit(0xc000ca9500, 0xb, 0x0, 0x0, 0x0, 0xc0024935d0, 0x9, 0x8b9058, 0x9840fe, 0xc0131bc5a0, ...)\n\t/root/go/pkg/mod/github.com/tendermint/tendermint@v0.34.9/state/execution.go:228 +0x23f\ngithub.com/tendermint/tendermint/state.(*BlockExecutor).ApplyBlock(0xc000ca9500, 0xb, 0x0, 0x0, 0x0, 0xc0024935d0, 0x9, 0x8b9058, 0x9840fe, 0xc0131bc5a0, ...)\n\t/root/go/pkg/mod/github.com/tendermint/tendermint@v0.34.9/state/execution.go:180 +0x6c5\ngithub.com/tendermint/tendermint/consensus.(*State).finalizeCommit(0xc0011c4700, 0x9840fe)\n\t/root/go/pkg/mod/github.com/tendermint/tendermint@v0.34.9/consensus/state.go:1635 +0xaa8\ngithub.com/tendermint/tendermint/consensus.(*State).tryFinalizeCommit(0xc0011c4700, 0x9840fe)\n\t/root/go/pkg/mod/github.com/tendermint/tendermint@v0.34.9/consensus/state.go:1546 +0x3ba\ngithub.com/tendermint/tendermint/consensus.(*State).enterCommit.func1(0xc0011c4700, 0xc000000000, 0x9840fe)\n\t/root/go/pkg/mod/github.com/tendermint/tendermint@v0.34.9/consensus/state.go:1481 +0x8e\ngithub.com/tendermint/tendermint/consensus.(*State).enterCommit(0xc0011c4700, 0x9840fe, 0x0)\n\t/root/go/pkg/mod/github.com/tendermint/tendermint@v0.34.9/consensus/state.go:1519+0x665\ngithub.com/tendermint/tendermint/consensus.(*State).addVote(0xc0011c4700, 0xc01a7f9d60, 0xc00152b590, 0x28, 0xc0117c8ea0, 0x107, 0x107)\n\t/root/go/pkg/mod/github.com/tendermint/tendermint@v0.34.9/consensus/state.go:2132 +0xd68\ngithub.com/tendermint/tendermint/consensus.(*State).tryAddVote(0xc0011c4700, 0xc01a7f9d60, 0xc00152b590, 0x28, 0x267d1c8e80ed5, 0x3423be0, 0x27f0ceed)\n\t/root/go/pkg/mod/github.com/tendermint/tendermint@v0.34.9/consensus/state.go:1930 +0x53\ngithub.com/tendermint/tendermint/consensus.(*State).handleMsg(0xc0011c4700, 0x246ae20, 0xc01ff61f00, 0xc00152b590, 0x28)\n\t/root/go/pkg/mod/github.com/tendermint/tendermint@v0.34.9/consensus/state.go:838 +0x8b7\ngithub.com/tendermint/tendermint/consensus.(*State).receiveRoutine(0xc0011c4700, 
0x0)\n\t/root/go/pkg/mod/github.com/tendermint/tendermint@v0.34.9/consensus/state.go:762 +0x7d6\ncreated by github.com/tendermint/tendermint/consensus.(*State).OnStart\n\t/root/go/pkg/mod/github.com/tendermint/tendermint@v0.34.9/consensus/state.go:378 +0x896\n"
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xf7621a]
goroutine 1211848566 [running]:
github.com/cosmos/iavl.(*nodeDB).incrVersionReaders(0x0, 0x0)
github.com/cosmos/iavl.newExporter(0xc0177f7d80, 0x9840f4)
github.com/cosmos/iavl.(*ImmutableTree).Export(...)
github.com/cosmos/cosmos-sdk/store/iavl.(*Store).Export(0xc0004608c0, 0x9840f4, 0xc000b37a60, 0x1, 0x0)
github.com/cosmos/cosmos-sdk/store/rootmulti.(*Store).Snapshot.func2(0xc015e70cc0, 0xc013552c00, 0x16, 0x20, 0x9840f4)
created by github.com/cosmos/cosmos-sdk/store/rootmulti.(*Store).Snapshot
```