Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exponential round timeouts cause very long delays in recovering lost consesus #263

Conversation

brkomir
Copy link
Contributor

@brkomir brkomir commented Dec 1, 2021

Description

The problem addressed in this PR was that recovering consensus was delayed for a long time which is caused by long timeouts produced by the randomTimeout() method in ibft.go, which calculated timeout values in seconds as 10 + 2^(round number), so the timeout could grow to be very large. Issues #261, #245 and #248 have all emerged from this essentially.

Changes include

  • Bugfix (non-breaking change that solves an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (change that is not backwards-compatible and/or changes current functionality)

Checklist

  • I have assigned this PR to myself
  • I have added at least 1 reviewer
  • I have tested this code
  • I have updated the README and other relevant documents (guides...)
  • I have added sufficient documentation both in code, as well as in the READMEs

-improved logs for state chages and round count
Copy link
Contributor

@zivkovicmilos zivkovicmilos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great 💯

Thank you for looking into the linked issues and for adding great in-code docs and coverage tests 🙏

consensus/ibft/timeout_provider.go Outdated Show resolved Hide resolved
Copy link
Contributor

@lazartravica lazartravica left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for linting!

  1. Can you elaborate a bit more on the Round num + 1 when printing?
    It is zero indexed, if the only thing here is to add more clarity then we can remove +1, as it's clear in the code that the first round will be 0.

  2. Please resolve the timeout Milos noticed

  3. _provider is bugging me. The entire thing where we have a file for an IBFT timeout function is bugging me a bit, can we perhaps have it as timeout.go or move the func back to ibft.go

@brkomir
Copy link
Contributor Author

brkomir commented Dec 1, 2021

@lazartravica

  1. Can you elaborate a bit more on the Round num + 1 when printing?
    It is zero indexed, if the only thing here is to add more clarity then we can remove +1, as it's clear in the code that the first round will be 0.

I have left a comment above why i did this. #263 (comment)
If you think this should not be done in such way, I can make the changes.

3. _provider is bugging me. The entire thing where we have a file for an IBFT timeout function is bugging me a bit, can we perhaps have it as timeout.go or move the func back to ibft.go

I agree that file name can be changed to the timeout.go. But I'm not inclined to move this logic back into the ibft.go, since more code lives there than it should, as you can conclude just by observing that file is more than 1000 lines long. I'm not a fan of such design practices where we put so much logic in a single file, because reading, navigating and understanding the responsibilities of the code there becomes hard.

@mrwillis
Copy link
Contributor

mrwillis commented Dec 1, 2021

Just chiming in here and maybe I just need some clarity. Are the exponential timeouts necessary for the lower round nodes to "catch up"? Or is there another catch up mechanism for this in place.

@brkomir
Copy link
Contributor Author

brkomir commented Dec 1, 2021

@mrwillis great question, as always :)

The exponential timeout isn't used as a mechanism for the lower round number nodes to catch up as you asked. It's simply being used for waiting on the response from other nodes. So it basically works like this, the nodes that are running but out of consensus, will be waiting for the other nodes to return, and every time this timeout expires a new round will start.

When the minimal number of nodes to restore consensus is achieved, this timeout will trigger a new round in which all active nodes will participate, the "returning" nodes will receive the reached round number by the previously running nodes, and the block production will continue.

I hope I was clear enough to answer your question.

@brkomir brkomir merged commit 621241d into develop Dec 1, 2021
@brkomir brkomir deleted the fix/exponential-timeouts-cause-long-delay-in-consesus-recovery branch December 1, 2021 14:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants