Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consensus failure on gaia-6002 #1496

Closed
7768 opened this issue Jul 2, 2018 · 9 comments
Closed

Consensus failure on gaia-6002 #1496

7768 opened this issue Jul 2, 2018 · 9 comments

Comments

@7768
Copy link
Contributor

7768 commented Jul 2, 2018

I[07-02|03:03:13.313] Absent validator 747A72E570DD41F95687D3C2DD40128A9DBF798A at height 721617 module=x/slashing
I[07-02|03:03:13.314] Validator 747A72E570DD41F95687D3C2DD40128A9DBF798A past min height of 582143 and below signed blocks threshold of 50 module=x/slashing
E[07-02|03:03:13.333] CONSENSUS FAILURE!!!                         module=consensus err="Attempted to slash a nonexistent validator with address 747A72E570DD41F95687D3C2DD40128A9DBF798A" stack="goroutine 2537 [running]:\nruntime/debug.Stack(0xc49e5225b0, 0xbf46c0, 0xc42ce1a560)\n\t/usr/lib/go-1.10/src/runtime/debug/stack.go:24 +0xa7\ngithub.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/consensus.(*ConsensusState).receiveRoutine.func1(0xc435b74000)\n\t/home/fish/go/src/github.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/consensus/state.go:538 +0x57\npanic(0xbf46c0, 0xc42ce1a560)\n\t/usr/lib/go-1.10/src/runtime/panic.go:505 +0x229\ngithub.com/cosmos/cosmos-sdk/x/stake.Keeper.Slash(0xed9bc0, 0xc421136590, 0xc42007e480, 0xed9bc0, 0xc421136570, 0xee65a0, 0xc420bb4730, 0xc42007e480, 0x4, 0xee03a0, ...)\n\t/home/fish/go/src/github.com/cosmos/cosmos-sdk/x/stake/keeper.go:807 +0x8b6\ngithub.com/cosmos/cosmos-sdk/x/slashing.Keeper.handleValidatorSignature(0xed9bc0, 0xc4211365a0, 0xc42007e480, 0xee3e00, 0xc420bb4780, 0xa, 0xee03a0, 0xc433520a80, 0xc43886c380, 0x9, ...)\n\t/home/fish/go/src/github.com/cosmos/cosmos-sdk/x/slashing/keeper.go:87 +0x634\ngithub.com/cosmos/cosmos-sdk/x/slashing.BeginBlocker(0xee03a0, 0xc433520a80, 0xc43886c380, 0x9, 0xc43cfb18e0, 0x14, 0x20, 0xc499d9e750, 0x9, 0xb02d1, ...)\n\t/home/fish/go/src/github.com/cosmos/cosmos-sdk/x/slashing/tick.go:40 +0x5a2\ngithub.com/cosmos/cosmos-sdk/cmd/gaia/app.(*GaiaApp).BeginBlocker(0xc4200aca00, 0xee03a0, 0xc433520a80, 0xc43886c380, 0x9, 0xc43cfb18e0, 0x14, 0x20, 0xc499d9e750, 0x9, ...)\n\t/home/fish/go/src/github.com/cosmos/cosmos-sdk/cmd/gaia/app/app.go:117 +0xc3\ngithub.com/cosmos/cosmos-sdk/cmd/gaia/app.(*GaiaApp).BeginBlocker-fm(0xee03a0, 0xc433520a80, 0xc43886c380, 0x9, 0xc43cfb18e0, 0x14, 0x20, 0xc499d9e750, 0x9, 0xb02d1, ...)\n\t/home/fish/go/src/github.com/cosmos/cosmos-sdk/cmd/gaia/app/app.go:90 +0xa0\ngithub.com/cosmos/cosmos-sdk/baseapp.(*BaseApp).BeginBlock(0xc4200da380, 0xc43cfb18e0, 0x14, 0x20, 0xc499d9e750, 0x9,
0xb02d1, 0x5b3995f0, 0x0, 0x6c74, ...)\n\t/home/fish/go/src/github.com/cosmos/cosmos-sdk/baseapp/baseapp.go:386 +0x157\ngithub.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/abci/client.(*localClient).BeginBlockSync(0xc4601dac60, 0xc43cfb18e0, 0x14, 0x20, 0xc499d9e750, 0x9, 0xb02d1, 0x5b3995f0, 0x0, 0x6c74, ...)\n\t/home/fish/go/src/github.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/abci/client/local_client.go:206 +0xab\ngithub.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/proxy.(*appConnConsensus).BeginBlockSync(0xc437e325a0, 0xc43cfb18e0, 0x14, 0x20, 0xc499d9e750, 0x9, 0xb02d1, 0x5b3995f0, 0x0, 0x6c74, ...)\n\t/home/fish/go/src/github.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/proxy/app_conn.go:69 +0x78\ngithub.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/state.execBlockOnProxyApp(0xee0aa0, 0xc449a1c7c0, 0xee38c0, 0xc437e325a0, 0xc43fefeb90, 0xc46c8ff6e0, 0xee6f20, 0xc42cb455c8, 0x1, 0xc4240a3280, ...)\n\t/home/fish/go/src/github.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/state/execution.go:190 +0x547\ngithub.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/state.(*BlockExecutor).ApplyBlock(0xc4601db440, 0xc4697ad990, 0x9, 0xb02d0, 0x6c74, 0xc45d45cfe0, 0x14, 0x20, 0x1, 0xc4240a3280, ...)\n\t/home/fish/go/src/github.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/state/execution.go:76 +0x12f\ngithub.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/consensus.(*ConsensusState).finalizeCommit(0xc435b74000, 0xb02d1)\n\t/home/fish/go/src/github.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/consensus/state.go:1267 +0xbb2\ngithub.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/consensus.(*ConsensusState).tryFinalizeCommit(0xc435b74000, 0xb02d1)\n\t/home/fish/go/src/github.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/consensus/state.go:1198 +0x468\ngithub.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/consensus.(*ConsensusState).enterComm
it.func1(0xc435b74000, 0x0, 0xb02d1)\n\t/home/fish/go/src/github.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/consensus/state.go:1146 +0x98\ngithub.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/consensus.(*ConsensusState).enterCommit(0xc435b74000, 0xb02d1, 0x0)\n\t/home/fish/go/src/github.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/consensus/state.go:1175 +0x802\ngithub.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/consensus.(*ConsensusState).addVote(0xc435b74000, 0xc4758f08c0, 0xc45ef49560, 0x28, 0x0, 0x0, 0xc469841658)\n\t/home/fish/go/src/github.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/consensus/state.go:1532 +0xbb4\ngithub.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/consensus.(*ConsensusState).tryAddVote(0xc435b74000, 0xc4758f08c0, 0xc45ef49560, 0x28, 0xf6, 0xc47ade3e00)\n\t/home/fish/go/src/github.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/consensus/state.go:1390 +0x56\ngithub.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/consensus.(*ConsensusState).handleMsg(0xc435b74000, 0xbf6fa0, 0xc4269f4420, 0xc45ef49560, 0x28)\n\t/home/fish/go/src/github.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/consensus/state.go:607 +0x643\ngithub.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/consensus.(*ConsensusState).receiveRoutine(0xc435b74000, 0x0)\n\t/home/fish/go/src/github.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/consensus/state.go:560 +0x6c0\ncreated by github.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/consensus.(*ConsensusState).OnStart\n\t/home/fish/go/src/github.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/consensus/state.go:272 +0x140\n"
I[07-02|03:03:13.632] Ensure peers                                 module=p2p numOutPeers=10 numInPeers=30 numDialing=0 numToDial=0
E[07-02|03:03:13.753] Send failed                                  module=p2p peer=206.81.7.62:55370 channel=32 conn=MConn{206.81.7.62:55370} msgBytes=C96A6FAB0900000000000B02CD100018042000280004

This may be related to #1466 and #1484

Right before this consensus failure, I killed off a bunch of my ex-validators. They haven't been voting or bonded for a long time since I didn't have enough steak (you needed 3/4 steaks due to the issues referenced above)

@7768 7768 changed the title Consensus failure on gaia-6002 Consensus failure on gaia-6002 Jul 2, 2018
@7768
Copy link
Contributor Author

7768 commented Jul 2, 2018

This may be related: #1427

@7768
Copy link
Contributor Author

7768 commented Jul 2, 2018

This is definitely related: #1470

@kwunyeung
Copy link
Contributor

The only missing validator in the latest height is the zero voting power validator 747A72E570DD41F95687D3C2DD40128A9DBF798A.

@cwgoes
Copy link
Contributor

cwgoes commented Jul 2, 2018

This validator - 747A72E570DD41F95687D3C2DD40128A9DBF798A / cosmosvaladdr1w3a89etsm4qlj45860pd6sqj32wm77v28s72rk (bech32) had zero power at this block (as returned by Tendermint through ABCI) and shoudn't have been in the validator set. Debugging further...

@7768 7768 mentioned this issue Jul 2, 2018
@cwgoes
Copy link
Contributor

cwgoes commented Jul 3, 2018

Possibly a mismatch in Tendermint & SDK-side validator set tracking - #1470 (comment).

@cwgoes
Copy link
Contributor

cwgoes commented Jul 4, 2018

Following up, I believe this is a symptom of the combination of #1519 and #1513 - validator 747A72E570DD41F95687D3C2DD40128A9DBF798A was mistakenly stored as the cliff validator and kept in the Tendermint validator set even after it has been unbonded (and its candidacy had been removed). Later, that validator eventually went offline, and we tried to slash it - but, of course, it wasn't in the application state since it shouldn't have been in the Tendermint validator set in the first place.

Checking that hypothesis by tracing store states.

@cwgoes
Copy link
Contributor

cwgoes commented Jul 5, 2018

At block 582090, validator 747A72E570DD41F95687D3C2DD40128A9DBF798A was unbonded and deleted from the store:

gaiadebug tx "1gHwYl3rD6U26qMKFHaNOFi6UxrMkOx+K4mpynX7VHKeEhR2jThYulMazJDsfiuJqcp1+1RynhoDTUFYBBMOAwERAAAAAAAAAAAEEQAAAAAAAw1ABB4DAQ8WJN5iIOs8d8HI0lIJO+4fGflR+3MT9KJCStTuu81E32eVTHfiFz2h2ypAbWL3f+v6lgW2RQzpnbD/0rHtT5D4Jeal8OPD/yGNbda7mcyUNSwqn5xuJSNp0bjuHqgcrqEGdvGjl5bAsjcAABkAAAAAAAAA5yEAAAAAAAAABAQE"

{
  "type": "8EFE47F0625DE8",
  "value": {
    "msg": {
      "type": "ECCB5CA536EAA0",
      "value": {
        "delegator_addr": "768D3858BA531ACC90EC7E2B89A9CA75FB54729E",
        "validator_addr": "768D3858BA531ACC90EC7E2B89A9CA75FB54729E",
        "shares": "MAX"
      }
    },
    "fee": {
      "amount": [
        {
          "denom": "",
          "amount": 0
        }
      ],
      "gas": 200000
    },
    "signatures": [
      {
        "pub_key": {
          "type": "AC26791624DE60",
          "value": "6zx3wcjSUgk77h8Z+VH7cxP0okJK1O67zUTfZ5VMd+I="
        },
        "signature": {
          "type": "6BF5903DA1DB28",
          "value": "bWL3f+v6lgW2RQzpnbD/0rHtT5D4Jeal8OPD/yGNbda7mcyUNSwqn5xuJSNp0bjuHqgcrqEGdvGjl5bAsjcAAA=="
        },
        "account_number": 231,
        "sequence": 4
      }
    ]
  }
}

At that block, this validator wasn't bonded in the application state - so the app did not instruct Tendermint to remove it (no validator set updates) - but it was in Tendermint's validator set, so it remained there (and kept signing blocks!) until later going offline just before the consensus fault, when we tried to slash the validator, couldn't find it in the app state, and panicked. Why was it in Tendermint's validator set? Continuing to trace store updates, we find:

At block 571445, an unrevoke transaction was sent for validator 631E2B3AF1F9FE4873D21B7B74CEF568E834F77D.

gaiadebug tx "uwHwYl3rD7A8uusKFGMeKzrx+f5Ic9Ibe3TO9WjoNPd9BBMOAwERAAAAAAAAAAAEEQAAAAAABhqABB4DAQ8WJN5iIGok5Vt9nXlgYBnpQxbHGQ236WtH17Cy/KBKoTJI4W49Fz2h2ypAZSbGwBzookjGj//4ncSsrzO6eXrWMpF0QqpGbR3gYkRatijiS5LpKYV7m+sq/Iw5V50ILZ3s3qVhhRXADQ41ChkAAAAAAAABLCEAAAAAAAAABAQE"

{
  "type": "8EFE47F0625DE8",
  "value": {
    "msg": {
      "type": "A73B89B03CBAE8",
      "value": {
        "address": "631E2B3AF1F9FE4873D21B7B74CEF568E834F77D"
      }
    },
    "fee": {
      "amount": [
        {
          "denom": "",
          "amount": 0
        }
      ],
      "gas": 400000
    },
    "signatures": [
      {
        "pub_key": {
          "type": "AC26791624DE60",
          "value": "aiTlW32deWBgGelDFscZDbfpa0fXsLL8oEqhMkjhbj0="
        },
        "signature": {
          "type": "6BF5903DA1DB28",
          "value": "ZSbGwBzookjGj//4ncSsrzO6eXrWMpF0QqpGbR3gYkRatijiS5LpKYV7m+sq/Iw5V50ILZ3s3qVhhRXADQ41Cg=="
        },
        "account_number": 300,
        "sequence": 4
      }
    ]
  }
}

This resulted in two validator set updates. The first was validator 631E2B3AF1F9FE4873D21B7B74CEF568E834F77D being unrevoked, as expected, but the second was validator 747A72E570DD41F95687D3C2DD40128A9DBF798A being passed with a power of zero - which was then added to Tendermint's validator set at the next block.

Debugging that transaction, logging bonded/unbonded validators, we see:

ID CommitID{[218 151 19 112 168 117 250 210 47 69 183 61 136 96 161 151 201 225 160 109]:B02D0}
LastBlockHeight 721616
Txn: BB01F0625DEB0FB03CBAEB0A14631E2B3AF1F9FE4873D21B7B74CEF568E834F77D04130E030111000000000000000004110000000000061A80041E03010F1624DE62206A24E55B7D9D79606019E94316C7190DB7E96B47D7B0B2FCA04AA13248E16E3D173DA1DB2A406526C6C01CE8A248C68FFFF89DC4ACAF33BA797AD632917442AA466D1DE062445AB628E24B92E929857B9BEB2AFC8C39579D082D9DECDEA5618515C00D0E350A19000000000000012C2100000000000000040404
bonded validator 5DE0434D4E5ACA408BB0823B7243B79B1B251717
bonded validator 747A72E570DD41F95687D3C2DD40128A9DBF798A
unbonded validator: 747A72E570DD41F95687D3C2DD40128A9DBF798A
I[07-05|23:00:41.255] Validator 5DE0434D4E5ACA408BB0823B7243B79B1B251717 unrevoked module=x/stake 

This validator, stored with zero power - #1513 - was incorrectly set as the cliff validator - #1519 - then rebonded and unbonded through the logic in updateBondedValidators. When a validator is rebonded and unbonded, the SDK will send it with zero power back to Tendermint - which if it was not previously in Tendermint's validator set (as in this case), will cause Tendermint to add it with zero power!

@cwgoes
Copy link
Contributor

cwgoes commented Jul 5, 2018

Closing in favor of #1513 / #1519.

@cwgoes cwgoes closed this as completed Jul 5, 2018
@rigelrozanski
Copy link
Contributor

makes sense as far as I can tell

chillyvee pushed a commit to chillyvee/cosmos-sdk that referenced this issue Mar 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants