Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AcceptState not timing out to RoundChangeState #243

Closed
mrwillis opened this issue Nov 25, 2021 · 2 comments
Closed

AcceptState not timing out to RoundChangeState #243

mrwillis opened this issue Nov 25, 2021 · 2 comments
Assignees

Comments

@mrwillis
Copy link
Contributor

mrwillis commented Nov 25, 2021

AcceptState not timing out to RoundChangeState

Description

When debugging our latest testnet crash (#235 and #232), we noticed a lot of the nodes got stuck in AcceptState, unable to get pushed out to RoundChangeState. So essentially we had a bunch of nodes in RoundChangeState, waiting for others to get into RoundChangeState, and some nodes in AcceptState unable to get booted to RoundChangeState.

We're not quite sure how or why this happens yet.

Unless I'm missing something, in runAcceptState, it looks like the only way this can happen is if getNextMessage continues to return a false value in the second parameter.

Your environment

  • OS and version Ubuntu 20
  • version of the Polygon SDK 2bebf50
  • branch that causes this issue

Steps to reproduce

  • Tell us how to reproduce this issue
  • Where the issue is, if you know
  • Which commands triggered the issue, if any

Expected behaviour

After timing out, it should get pushed to RoundChangeState according to the spec

image

Actual behaviour

Nodes get stuck in AcceptState in an exponentially-increasing timeout loop.

Logs

Nov 18 23:37:15 ip-172-31-39-197 main[63679]: 2021-11-18T23:37:15.529Z [INFO]  polygon.consensus.ibft: current snapshot: validators=15 votes=0
Nov 18 23:37:15 ip-172-31-39-197 main[63679]: 2021-11-18T23:37:15.530Z [INFO]  polygon.consensus.ibft: proposer calculated: proposer=0x64B1845355e8D85D62d7ea01CAec8Df7ecB09e84 block=173592
Nov 18 23:37:25 ip-172-31-39-197 main[63679]: 2021-11-18T23:37:25.721Z [INFO]  polygon.consensus.ibft.acceptState: Accept state: sequence=173592
Nov 18 23:37:25 ip-172-31-39-197 main[63679]: 2021-11-18T23:37:25.723Z [INFO]  polygon.consensus.ibft: current snapshot: validators=15 votes=0
Nov 18 23:37:25 ip-172-31-39-197 main[63679]: 2021-11-18T23:37:25.724Z [INFO]  polygon.consensus.ibft: proposer calculated: proposer=0x27F6Ae1f03986A374Cca5A4CF4f9d43324B9F09E block=173592
Nov 18 23:37:37 ip-172-31-39-197 main[63679]: 2021-11-18T23:37:37.736Z [INFO]  polygon.consensus.ibft.acceptState: Accept state: sequence=173592
Nov 18 23:37:37 ip-172-31-39-197 main[63679]: 2021-11-18T23:37:37.736Z [INFO]  polygon.consensus.ibft: current snapshot: validators=15 votes=0
Nov 18 23:37:37 ip-172-31-39-197 main[63679]: 2021-11-18T23:37:37.737Z [INFO]  polygon.consensus.ibft: proposer calculated: proposer=0xce6e21a1f84C68C9C2F45EA43B93F5525dfE2a95 block=173592
Nov 18 23:37:51 ip-172-31-39-197 main[63679]: 2021-11-18T23:37:51.860Z [INFO]  polygon.consensus.ibft.acceptState: Accept state: sequence=173592
Nov 18 23:37:51 ip-172-31-39-197 main[63679]: 2021-11-18T23:37:51.860Z [INFO]  polygon.consensus.ibft: current snapshot: validators=15 votes=0
Nov 18 23:37:51 ip-172-31-39-197 main[63679]: 2021-11-18T23:37:51.863Z [INFO]  polygon.consensus.ibft: proposer calculated: proposer=0x1AC502B21b8b73d881dA7F5af6abDe15FC22EB33 block=173592
Nov 18 23:38:08 ip-172-31-39-197 main[63679]: 2021-11-18T23:38:08.201Z [INFO]  polygon.network: Peer disconnected: id=16Uiu2HAmFpTiQcctV2KNpdKm9FehHKFLTUC13oaBGVzvZta7iFMc
Nov 18 23:38:09 ip-172-31-39-197 main[63679]: 2021-11-18T23:38:09.873Z [INFO]  polygon.consensus.ibft.acceptState: Accept state: sequence=173592
Nov 18 23:38:09 ip-172-31-39-197 main[63679]: 2021-11-18T23:38:09.873Z [INFO]  polygon.consensus.ibft: current snapshot: validators=15 votes=0
Nov 18 23:38:09 ip-172-31-39-197 main[63679]: 2021-11-18T23:38:09.873Z [INFO]  polygon.consensus.ibft: proposer calculated: proposer=0x097F8AB58FaAa6f282C3a50cAb0c1F43b26AA9aF block=173592
Nov 18 23:38:35 ip-172-31-39-197 main[63679]: 2021-11-18T23:38:35.895Z [INFO]  polygon.consensus.ibft.acceptState: Accept state: sequence=173592
Nov 18 23:38:35 ip-172-31-39-197 main[63679]: 2021-11-18T23:38:35.895Z [INFO]  polygon.consensus.ibft: current snapshot: validators=15 votes=0
Nov 18 23:38:35 ip-172-31-39-197 main[63679]: 2021-11-18T23:38:35.896Z [INFO]  polygon.consensus.ibft: proposer calculated: proposer=0xB73a1fBE13bAc3b92d6fc9c10a6068C8Edb3b19c block=173592
Nov 18 23:39:17 ip-172-31-39-197 main[63679]: 2021-11-18T23:39:17.948Z [INFO]  polygon.consensus.ibft.acceptState: Accept state: sequence=173592
Nov 18 23:39:17 ip-172-31-39-197 main[63679]: 2021-11-18T23:39:17.948Z [INFO]  polygon.consensus.ibft: current snapshot: validators=15 votes=0
Nov 18 23:39:17 ip-172-31-39-197 main[63679]: 2021-11-18T23:39:17.948Z [INFO]  polygon.consensus.ibft: proposer calculated: proposer=0x369a595Acea231aEA9809ddab58117204838EA71 block=173592
Nov 18 23:40:31 ip-172-31-39-197 main[63679]: 2021-11-18T23:40:31.965Z [INFO]  polygon.consensus.ibft.acceptState: Accept state: sequence=173592
Nov 18 23:40:31 ip-172-31-39-197 main[63679]: 2021-11-18T23:40:31.965Z [INFO]  polygon.consensus.ibft: current snapshot: validators=15 votes=0
Nov 18 23:40:31 ip-172-31-39-197 main[63679]: 2021-11-18T23:40:31.966Z [INFO]  polygon.consensus.ibft: proposer calculated: proposer=0x7C482A9234255433A96c1431ef92A8DaC5ebF488 block=173592
@brkomir brkomir self-assigned this Nov 29, 2021
@brkomir
Copy link
Contributor

brkomir commented Nov 29, 2021

Hey @mrwillis, thank you for reaching out!

I have investigated this issue and concluded the following:

  1. The method you mentioned getNextMessage, returns false only when the "closeCh" channel is closed. In that case, the node is being shut down and all code execution in the ibft state machine will stop, so this isn’t relevant for this situation.
  2. Exponential timeout is achieved by adding 2 to the power of the current round number to the base timeout of 10 seconds, so this implies that the node isn’t stuck in the "accept state" because rounds are progressing, effectively the node is constantly switching between the "accept", "validate" and "round change" states.
  3. The misleading thing here is that no logs are indicating that rounds are progressing, so this is something that I will address.
  4. The situation you described is a consequence of the network falling out of consensus, meaning that not enough validator nodes are functioning (more than 1/3 of total validators). You can easily reproduce this by, for example, starting the 4 node cluster, shutting down 2 nodes, and observing the logs on running nodes which will show identical patterns to those you left in the description.

To summarize, I haven’t found any problems in the code logic for the "accept state" regarding the situation you experienced.

@mrwillis
Copy link
Contributor Author

mrwillis commented Nov 29, 2021

@brkomir thanks for looking into this. I just checked on 3. and yes as there wasn't debug logging available, it's possible they were cycling between the states.

We were 100% confident that at the time that all the nodes were up and running, so we do believe there is an issue somewhere.

Then it is possible the logs were the same under the hood as in #248.

Added logging would be great to see the state changes at the INFO level. We can close this for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants