Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gaiad network crash in executing redelegation transactions #2241

Closed
abelliumnt opened this issue Sep 5, 2018 · 23 comments
Closed

Gaiad network crash in executing redelegation transactions #2241

abelliumnt opened this issue Sep 5, 2018 · 23 comments
Assignees

Comments

@abelliumnt
Copy link

abelliumnt commented Sep 5, 2018

Summary of Bug

When I tried to send redelegate (maybe unbond can also reproduce this issue?) transactions twice between two given validators, the blockchain network encountered a consensus failure and failed to produce new blocks.

E[09-05|07:45:54.249] Error on ApplyBlock. Did the application crash? Please restart tendermint module=consensus err="Commit failed for application: Error changing validator set: Failed to remove validator BE3D48310A7CB5284A6FF73A48A9F0E1E2CC25A5"

The above error log locates in tendermint/state/execution.go

Code for reproduce

Steps to Reproduce

  1. Create gaia network with two nodes on the same machine. Only the first node is the validator.
gaiad init --name testA --home $HOME/testA --chain-id find-bug
gaiad init --name testB --home $HOME/testB --chain-id find-bug
cp $HOME/testA/config/genesis.json $HOME/testB/config/genesis.json
gaiad start --home $HOME/testA
  1. Edit $HOME/testB/config/config.toml
    1. Assign testA's p2p address to persistent_peers, for instance:
    # Comma separated list of nodes to keep persistent connections to
    persistent_peers = "81ce2efeefbf9634d99063b4c704a9ee9dd044c7@10.0.2.15:26656"
    
    1. Change the following items to avoid ports conflict:
    proxy_app = "tcp://127.0.0.1:26648"
    rpc.laddr = "tcp://0.0.0.0:26647"
    p2p.laddr = "tcp://0.0.0.0:26646"
    
  2. Start testB node. Now testB is not a validator.
  3. Create new validator:
    1. Get public key, mard ad publicKeyB:
    gaiad tendermint show-validator --home $HOME/testB
    
    1. Send token to addrB:
    gaiacli send --to=<addrB> --from=testA --amount=10steak --chain-id=find-bug
    
    1. Send transction to create new validator:
    gaiacli stake create-validator --address-delegator=<addrB> --chain-id=find-bug --from=testB --    pubkey=<publicKeyB> --amount 5steak
    
    1. Query the validators, there will be two validators, mark validator address as validatorA and validatorB:
    gaiacli stake validators
    
  4. Send redelegate transactions:
    1. First transaction:
    gaiacli stake redelegate begin --addr-validator-source=<validatorB> --addr-validator-dest=<validatorA > --chain-id find-bug --shares-percent=0.95 --from testB
    
    1. Second redelegate:
    gaiacli stake redelegate begin --addr-validator-source=<validatorB> --addr-validator-dest=<validatorA > --chain-id find-bug --shares-percent=0.03 --from testB
    
  5. Then the blockchain network will be crash.

Analysis of Bug

Currently, in staking module, we use sdk.Dec as the date type of token amount and shares. When calculating voting power, we convert it to an int64.

v.BondedTokens().RoundInt64()

After the first redelegation is done, the remaining bonded token on validatorB is 0.025 and its equivalent voting power is zero. So once the first delegation is done, validatorB will be removed from the validator set in Tendermint.

When the second redelegation transaction is executed, the EndBlock will produce a new validator set change: set validatorB voting power to zero again.

However, the validatorB has already removed from validator set. The second remove operation will cause the fatal error.

Ideas about bug fix

The simplest way to fix this issue is to change code in tendermint: check if the validator exist before executing remove operation.

Maybe we can also fix this bug in staking module. Currently in staking, only the the validator bonded token is zero, will the validator be removed. Maybe here we should take its voting power into consideration.

@alexanderbez
Copy link
Contributor

Interesting. Thanks for submitting this @HaoyangLiu. We can take a look at this ASAP.

/cc @rigelrozanski

@alexanderbez
Copy link
Contributor

alexanderbez commented Sep 5, 2018

Btw @HaoyangLiu, you can use the make localnet-start|stop command to easily create local testnets via docker-compose. All the ports are mapped to localhost. Once the network is running, you can then add nodes for your convenience.

Was able to reproduce this on a local testnet (4 nodes) by simply calling gaiacli stake redelegate begin a bunch of times until a node crashed.

@abelliumnt
Copy link
Author

@alexanderbez
Awesome idea! I never know that. Thanks.
I have tried the samplest method I mentioned above to fix this issue: https://github.com/HaoyangLiu/tendermint/commit/2cc97a05fde49713d927234c5986b429e59493ef.
I have done some tests and it did work. But more tests is still required. Besides, this change will bring in some difference between validator list in staking and validator set in tendermint. I not sure if this difference will result in more issues.

@alexanderbez
Copy link
Contributor

alexanderbez commented Sep 5, 2018

@HaoyangLiu I'd like to look through the staking code more to understand this.

@alexanderbez
Copy link
Contributor

alexanderbez commented Sep 5, 2018

I'm thinking this should ideally be handled in the SDK state machine. I don't think the second tx should be valid and make it through under these circumstances. Seems like we might need to take a look at Keeper#unbond and/or it's caller, BeginRedelegation, in more detail.

Side question, is it valid to create a bunch of redelegation begin txs from the same src to the same dst without completing them @cwgoes @rigelrozanski?

@rigelrozanski
Copy link
Contributor

Thanks for the submission - I'll look further into this issue. This should be fixed on the SDK side not the tendermint side - Related to #2188

@cwgoes
Copy link
Contributor

cwgoes commented Sep 11, 2018

Thanks @HaoyangLiu - #2238 will fix this, just as you suggest (the latter option): checking whether a validator is bonded or not before we tell Tendermint to remove it. We still want Tendermint to only deleted validators previously in the validator set because it serves as an additional sanity check on the SDK staking state machine.

@abelliumnt
Copy link
Author

Awesome!Hope the PR can be merged soon.

@alexanderbez
Copy link
Contributor

@cwgoes I tested the latest develop with #2238 merged. This issue still persists. I think this is a redelegation issue.

@rigelrozanski
Copy link
Contributor

rigelrozanski commented Sep 13, 2018

Oh wow - pretty crazy that this crashes the blockchain - Redelegation is not feature complete in that is only is supporting a single redelegation per delegator-sourceValidator pair - (see #1402)

I've got a strong feeling it's related to this ^

but a bug nonetheless as this shouldn't cause a crash it should just be a failed transaction for the time being... @bez or @HaoyangLiu does somebody have output logs they could post here?

@cwgoes
Copy link
Contributor

cwgoes commented Sep 13, 2018

@cwgoes I tested the latest develop with #2238 merged. This issue still persists. I think this is a redelegation issue.

Oh really, OK, then this continues to block 0.25. Can you paste the logs?

@alexanderbez
Copy link
Contributor

Yes, I will paste logs shortly...and try to tackle this.

@alexanderbez
Copy link
Contributor

Log:

I[09-13|21:04:03.363] Executed block                               module=state height=68 validTxs=1 invalidTxs=0
I[09-13|21:04:03.363] Updates to validators                        module=state updates="[{\"address\":\"17D3D083061CAB0140118B61407DA3F9F5273ED5\",\"pub_key\":\"DYg2eATRJPz4riN5jxVgK43q7nC1pDoRVl1wRlKiUNA=\",\"power\":0}]"
2018/09/13 21:04:03 http: multiple response.WriteHeader calls
E[09-13|21:04:03.364] Error on ApplyBlock. Did the application crash? Please restart tendermint module=consensus err="Commit failed for application: Error changing validator set: Failed to remove validator 17D3D083061CAB0140118B61407DA3F9F5273ED5"
captured terminated, exiting...

@rigelrozanski
Copy link
Contributor

That log looks like the error that we're attempting to remove a cliff validator which is not part of the validator set (aka a cliff bug)

@alexanderbez
Copy link
Contributor

@rigelrozanski Are you sure? I think the analysis of @HaoyangLiu is correct -- this has to do with rounding of power. In any case, I'll try to replicate this via a unit test.

@rigelrozanski
Copy link
Contributor

hmm well maybe not a cliff validator issue (could very well be rounding!) - but it definitely looks like we're sending a 0 validator update for a non-existent validator which is killing the Tendermint.

@alexanderbez
Copy link
Contributor

alexanderbez commented Sep 13, 2018

@rigelrozanski yes indeed!

I can send two REDs:

  • First puts the validator tokens at .2425000000 -- still bonded! But TM sees power as zero (rounding down), so it removes.
  • Attempt second RED, redelagates more tokens, sends TM another zero validator update, but it's already been removed.
return abci.Validator{
    PubKey:  tmtypes.TM2PB.PubKey(v.ConsPubKey),
    Address: v.ConsPubKey.Address(),
    Power:   v.BondedTokens().RoundInt64(),
}

So I suppose we should not allow a redelegation, if you're rounded power is zero?

@abelliumnt
Copy link
Author

If a validator rounding voting power is zero,then Tendermint will remove it but staking mudule will still keep it. As a result, the validator set of Tendermint and staking are conflict. So how about just remove the validator in staking if its rounding voting power is zero.

@rigelrozanski
Copy link
Contributor

This is a really extreme edge case, I hope we should never see a validator with one voting power in production. That said, the validator is still persisting in the staking side of things thus, all staking related functionality for a bonded validator should be available. REALLY the only difference should be that the Tendermint update is not sent, however, redelegation should still be possible (as the validator is still there and real! - just not in tendermint). As well, the validator should NOT simply be removed, those funds should be kept and maintained until withdrawn. (of course you cannot withdrawn less than one token, however it can be redelegated to another validator and then withdrawn)

@alexanderbez
Copy link
Contributor

So how about just remove the validator in staking if its rounding voting power is zero.

We can't do this as the validator is still bonded and rightfully so. What @rigelrozanski is suggesting is cleaner and better approach. I'll amend my PR shortly. Thanks all 👍

@alexanderbez
Copy link
Contributor

Although I had a PR out that fixed this, we'll be addressing this and other issues via #2312.

@alexanderbez
Copy link
Contributor

Actually, re-opening this to keep track of the issue.

@alexanderbez alexanderbez reopened this Sep 25, 2018
@cwgoes
Copy link
Contributor

cwgoes commented Oct 2, 2018

This will be fixed in #2394, which checks the rounded power before determining whether or not to send Tendermint a validator update.

Thanks @HaoyangLiu.

@cwgoes cwgoes closed this as completed Oct 2, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants