Exponential round timeouts cause very long delays in recovering lost consensus #263
- improved logs for state changes and round count
Looks great 💯
Thank you for looking into the linked issues and for adding great in-code docs and coverage tests 🙏
Thanks for linting!
- Can you elaborate a bit more on the Round num + 1 when printing? It is zero-indexed; if the only purpose is to add clarity, we can remove the +1, as it's clear in the code that the first round will be 0.
- Please resolve the timeout Milos noticed.
- `_provider` is bugging me. The entire thing where we have a file for an IBFT timeout function is bugging me a bit. Can we perhaps have it as `timeout.go`, or move the func back to `ibft.go`?
I left a comment above explaining why I did this: #263 (comment)
I agree that the file name can be changed to timeout.go. But I'm not inclined to move this logic back into ibft.go, since more code already lives there than should, as you can see from the fact that the file is more than 1000 lines long. I'm not a fan of putting so much logic in a single file, because reading, navigating and understanding the responsibilities of the code becomes hard.
Just chiming in here, and maybe I just need some clarity. Are the exponential timeouts necessary for the lower-round nodes to "catch up"? Or is there another catch-up mechanism in place for this?
@mrwillis great question, as always :) The exponential timeout isn't used as a mechanism for the lower-round nodes to catch up. It's simply used for waiting on responses from other nodes. It basically works like this: the nodes that are running but out of consensus wait for the other nodes to return, and every time this timeout expires a new round starts. Once the minimum number of nodes needed to restore consensus is available, this timeout triggers a new round in which all active nodes participate, the "returning" nodes receive the round number reached by the previously running nodes, and block production continues. I hope that was clear enough to answer your question.
Description
The problem addressed in this PR is that recovering consensus was delayed for a long time because of the long timeouts produced by the `randomTimeout()` method in `ibft.go`, which calculated timeout values in seconds as 10 + 2^(round number), so the timeout could grow to be very large. Issues #261, #245 and #248 all essentially stem from this.
Changes include
Checklist