Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failed to slash a validator for double sign at the first block #2372

Closed
4 tasks
kidinamoto01 opened this issue Sep 21, 2018 · 19 comments
Closed
4 tasks

Failed to slash a validator for double sign at the first block #2372

kidinamoto01 opened this issue Sep 21, 2018 · 19 comments
Assignees

Comments

@kidinamoto01
Copy link
Contributor

Summary of Bug

When someone double signed at the first block, the slashing module will not handle it correctly.

Steps to Reproduce

When generating the first block, someone may double sign and trigger the slashing module.
log:

Confirmed double sign from FDC8E91E2F361F4BFCCEFA80C1ABAD138E5175AE at height 1, age of 0 less than max age of 120 module=x/slashing 
E[09-21|13:27:34.653] CONSENSUS FAILURE!!!                         module=consensus err="impossible attempt to slash future infraction at height 1 but we are at height 0" stack="goroutine 1193 [running]:\nruntime/debug.Stack(0xc424c4cea0, 0xd5e1c0, 0xc426777690)\n\t/usr/lib/go-

For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
@rigelrozanski
Copy link
Contributor

CC @cwgoes

@cwgoes
Copy link
Contributor

cwgoes commented Sep 22, 2018

That's unusual, can you replicate this in a testcase or describe more about the order in which you ran the daemons and double-signed? I don't understand why you would get evidence for a double-sign at height 1 before block 0 was committed.

@kidinamoto01
Copy link
Contributor Author

@cwgoes we did some analysis and figured out double-sign evidence was included in the first block, then got broadcasted. However, there's mismatch to treat double-sign evidence. in tendermint the infractionHeight is 1 but when slashing module process the evidence, ctx.BlockHeight() is still 0.
We tried to solve the issue with this PR: irisnet#109
Hope to get more input from you.

@abelliumnt
Copy link

@cwgoes
I think it is possible to include evidence at the same height block. For instance, if the network is poor and the network come to consensus of a block in multiple rounds,then then double sign evidence maybe happen to be included by a block proposal. Then in applying the block,the infractionHeight will equal to ctx.BlockHeight()+1.
In this situation, cosmos will throw panic error.

@abelliumnt
Copy link

If a new validator is slashed at the first height after it becomes the real validator, cosmos may panic. Because currently the validator has no sign info which is necessary for figuring out jail period. I think this may be a bug too.

@alexanderbez
Copy link
Contributor

If a new validator is slashed at the first height after it becomes the real validator, cosmos may panic.

This definitely seems possible.

I'm still a bit lost on how evidence reaches the block height +1 except maybe the case of the first block?

@abelliumnt
Copy link

@alexanderbez
Suppose the blockchain height is A. The block store height is A. ctx.blockHeight() in delieverTx is A. Now the consensus height is A+1. In the consensus process, due to poor network, the network may come to consensus of new block in multiple rounds. Suppose the new validator has done double sign in round N, but in this round, the consensus is failed. Then in next round, the double sign evidence may be included by a block proposal. Thus in applying block, evidence height euqals to ctx.blockHeight() + 1 .
How do you think about the above assumption?

@alexanderbez
Copy link
Contributor

Thus in applying block, evidence height equals to ctx.blockHeight() + 1 .

Yes, but wont the application also be onto the next block?

@abelliumnt
Copy link

@alexanderbez
This is low probability thing. Only when the evidence is broadcasted to the block proposal and then it is included in the proposal block, then this will happed. In most case, the evidence will be included in next block.

@cwgoes cwgoes self-assigned this Oct 2, 2018
@cwgoes
Copy link
Contributor

cwgoes commented Oct 3, 2018

Suppose the blockchain height is A. The block store height is A. ctx.blockHeight() in delieverTx is A. Now the consensus height is A+1. In the consensus process, due to poor network, the network may come to consensus of new block in multiple rounds. Suppose the new validator has done double sign in round N, but in this round, the consensus is failed. Then in next round, the double sign evidence may be included by a block proposal. Thus in applying block, evidence height euqals to ctx.blockHeight() + 1 .

We don't care about the consensus round, just the consensus height. If we're attempting to come to consensus at some height n, and I double-sign in round 1 but we don't come to consensus until round 3, the evidence passed to the ABCI app when the block is executed after round 3 should still have height n. Have you found otherwise?

@cwgoes
Copy link
Contributor

cwgoes commented Oct 3, 2018

If a new validator is slashed at the first height after it becomes the real validator, cosmos may panic. Because currently the validator has no sign info which is necessary for figuring out jail period. I think this may be a bug too.

This is why evidence handling is run after validator signature handling in the slashing tick function. Do you mean as a result of slashing from elsewhere (e.g. governance), or do you have a testcase which replicate this (which seems like a separate issue to evidence at a future height)?

@cwgoes
Copy link
Contributor

cwgoes commented Oct 3, 2018

We tried to solve the issue with this PR: irisnet#109

Although this will probably fix the symptom, I don't think this is the right approach. It's illogical to handle double-signature evidence for a block at a height we haven't started processing yet - if Tendermint is indeed sending such evidence, that's a bug in Tendermint.

Do you know how I might replicate either issue, either by running commands locally or with a test case? That's the next step to debug the root cause.

@abelliumnt
Copy link

abelliumnt commented Oct 10, 2018

  1. Clone my code: https://github.com/HaoyangLiu/cosmos-sdk.git
  2. Checkout branch develop and build gaiad
make get_vendor_deps build-linux

I have synced my develop with the latest cosmos-sdk develop branch
3. Checkout branch develop-byzantine and build gaiad-byzantine. gaiad-byzantine will always vote nil. Here is my change
4. Create five full nodes with normal gaiad, mark them as node0, node1, node2, node3 and node4
5. Create two byzantine nodes with gaiad-byzantine, make them as node-byz-0 and node-byz-1.
6. node1 and node-byz-0 share one validator address. node2 and node-byz-1 share another one validator address. So there will be two groups conflict voting.
7. I have upload all the node home directory which include genesis file and executable binary: iris.zip
8. Start these node in turns: node1 , node-byz-0, node2, node-byz-1
9. Wait half a minute, then start node0
10. Wait half a minute, then start node4.
11. Then after the first block is applied, a panic error will arise:

E[10-10|16:26:26.205] CONSENSUS FAILURE!!!                         module=consensus err="Expected signing info for validator cosmosvalcons15ad46aj0u39n6grh7gjyk494d380jen6xp89w4 but not found" stack="goroutine 709 [running]:\nruntime/debug.Stack(0xc42302eae0, 0xd6c6e0, 0xc42300d600)
	/usr/local/go/src/runtime/debug/stack.go:24 +0xa7\ngithub.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/consensus.(*ConsensusState).receiveRoutine.func2(0xc4200f0900, 0x105f450)
	/home/lhy/go/src/github.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/consensus/state.go:585 +0x57\npanic(0xd6c6e0, 0xc42300d600)
	/usr/local/go/src/runtime/panic.go:502 +0x229\ngithub.com/cosmos/cosmos-sdk/x/slashing.Keeper.handleDoubleSign(0x10dc4e0, 0xc420b55720, 0xc420a46cb0, 0x10eb700, 0xc4200dd1d0, 0xc420a46cb0, 0x10dc4e0, 0xc420b55750, 0xa, 0x10e4060, ...)
	/home/lhy/go/src/github.com/cosmos/cosmos-sdk/x/slashing/keeper.go:84 +0xe75\ngithub.com/cosmos/cosmos-sdk/x/slashing.BeginBlocker(0x10e4060, 0xc422fbcde0, 0xc420413040, 0xc, 0xc42301b800, 0x14, 0x20, 0xc422e5e710, 0x5, 0x1, ...)
	/home/lhy/go/src/github.com/cosmos/cosmos-sdk/x/slashing/tick.go:33 +0x6ac\ngithub.com/cosmos/cosmos-sdk/cmd/gaia/app.(*GaiaApp).BeginBlocker(0xc4200c8000, 0x10e4060, 0xc422fbcde0, 0xc420413040, 0xc, 0xc42301b800, 0x14, 0x20, 0xc422e5e710, 0x5, ...)
	/home/lhy/go/src/github.com/cosmos/cosmos-sdk/cmd/gaia/app/app.go:140 +0xea\ngithub.com/cosmos/cosmos-sdk/cmd/gaia/app.(*GaiaApp).BeginBlocker-fm(0x10e4060, 0xc422fbcde0, 0xc420413040, 0xc, 0xc42301b800, 0x14, 0x20, 0xc422e5e710, 0x5, 0x1, ...)
	/home/lhy/go/src/github.com/cosmos/cosmos-sdk/cmd/gaia/app/app.go:111 +0xba\ngithub.com/cosmos/cosmos-sdk/baseapp.(*BaseApp).BeginBlock(0xc420f7f680, 0xc42301b800, 0x14, 0x20, 0xc422e5e710, 0x5, 0x1, 0x3aca512d, 0xed34faeaa, 0x0, ...)
	/home/lhy/go/src/github.com/cosmos/cosmos-sdk/baseapp/baseapp.go:433 +0x23a\ngithub.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/abci/client.(*localClient).BeginBlockSync(0xc420b670e0, 0xc42301b800, 0x14, 0x20, 0xc422e5e710, 0x5, 0x1, 0x3aca512d, 0xed34faeaa, 0x0, ...)
	/home/lhy/go/src/github.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/abci/client/local_client.go:206 +0x9e\ngithub.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/proxy.(*appConnConsensus).BeginBlockSync(0xc420480b80, 0xc42301b800, 0x14, 0x20, 0xc422e5e710, 0x5, 0x1, 0x3aca512d, 0xed34faeaa, 0x0, ...)
	/home/lhy/go/src/github.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/proxy/app_conn.go:69 +0x6b\ngithub.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/state.execBlockOnProxyApp(0x10e4ce0, 0xc420b2b260, 0x10e9d80, 0xc420480b80, 0xc4228ec8c0, 0xc422fbc870, 0x10ee220, 0xc42000e918, 0x0, 0x0, ...)
	/home/lhy/go/src/github.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/state/execution.go:209 +0x421\ngithub.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/state.(*BlockExecutor).ApplyBlock(0xc420b29860, 0xc420b388a0, 0x5, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/home/lhy/go/src/github.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/state/execution.go:77 +0x111\ngithub.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/consensus.(*ConsensusState).finalizeCommit(0xc4200f0900, 0x1)
	/home/lhy/go/src/github.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/consensus/state.go:1334 +0xa7f\ngithub.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/consensus.(*ConsensusState).tryFinalizeCommit(0xc4200f0900, 0x1)
	/home/lhy/go/src/github.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/consensus/state.go:1265 +0x462\ngithub.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/consensus.(*ConsensusState).enterCommit.func1(0xc4200f0900, 0x5, 0x1)
	/home/lhy/go/src/github.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/consensus/state.go:1213 +0x98\ngithub.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/consensus.(*ConsensusState).enterCommit(0xc4200f0900, 0x1, 0x5)
	/home/lhy/go/src/github.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/consensus/state.go:1242 +0x72c\ngithub.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/consensus.(*ConsensusState).addVote(0xc4200f0900, 0xc422f8caa0, 0xc420a18360, 0x28, 0x1060690, 0xf5, 0x7fc4f3379000)
	/home/lhy/go/src/github.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/consensus/state.go:1649 +0xb7f\ngithub.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/consensus.(*ConsensusState).tryAddVote(0xc4200f0900, 0xc422f8caa0, 0xc420a18360, 0x28, 0xc4208dc960, 0xc422f98200, 0xf5)
	/home/lhy/go/src/github.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/consensus/state.go:1507 +0x59\ngithub.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/consensus.(*ConsensusState).handleMsg(0xc4200f0900, 0xdab4a0, 0xc422a4d888, 0xc420a18360, 0x28)
	/home/lhy/go/src/github.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/consensus/state.go:659 +0x667\ngithub.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/consensus.(*ConsensusState).receiveRoutine(0xc4200f0900, 0x0)
	/home/lhy/go/src/github.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/consensus/state.go:616 +0x6a2\ncreated by github.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/consensus.(*ConsensusState).OnStart
	/home/lhy/go/src/github.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/consensus/state.go:310 +0x140\n"
  1. To reproduce the issue, the key point is to make sure the first proposer has received the evidence. Here node0 is the first proposer. We start byzantine nodes first, then when the proposer is started, it will include the received evidence into the first block.

@abelliumnt
Copy link

@cwgoes
Currently, tendermint does include evidence in blocks with the same height. If you think my fix is not the right approach, maybe we should change tendermint to ensure evidence will only be included in later blocks.

@cwgoes
Copy link
Contributor

cwgoes commented Oct 10, 2018

Thanks @HaoyangLiu - I'm trying to replicate this, working on a shell script at https://gist.github.com/cwgoes/3ab18585a72966bb9e68e8a9428a28b8 to make it easier.

So far I can replicate the Byzantine duplicate-voting but not the crash - the VoteInfos are processed before the evidence, as expected, and blocks continue onwards.

@abelliumnt
Copy link

@cwgoes
Maybe it is much diffcult to reproduce this issue just by your script. Because, to reproduce this issue, we have to control the node starting order, as well as the validator order in genesis.json. It should be the same as the node names. Here the name order is node0, node1, node2, node3, node4. But the genesis.json which is generated by command gaiad init will make the order random.
Here I just created a repository to provide a easier way to reproduce.
https://github.com/HaoyangLiu/cosmos_issue_reproduction
Just clone this repository and run run.sh.

@cwgoes
Copy link
Contributor

cwgoes commented Oct 11, 2018

Thanks @HaoyangLiu, I can replicate the issue now. On the first block, it looks like the vote information sent by Tendermint is empty (which I suppose makes sense, since that's for the last commit) - but the SDK currently expects to have seen a validator signature in LastCommit before evidence can be submitted for the validator.

I wonder if this is also a problem in general - does Tendermint send evidence for e.g. double-signatures at the current height which the SDK will not necessarily have seen yet in the LastCommit vote info list?

@cwgoes
Copy link
Contributor

cwgoes commented Oct 11, 2018

Ah OK - we should set a SigningInfo for a validator whenever we instruct Tendermint to add that validator to the validator set, including at genesis. That should solve this issue, this was just subtly flawed design on the SDK end.

@cwgoes
Copy link
Contributor

cwgoes commented Oct 11, 2018

#2480 fixes already, trying to incorporate #1867 (comment) into that PR too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants