Exponential round timeouts cause very long delays in recovering lost consesus #263

brkomir · 2021-12-01T12:40:10Z

Description

The problem addressed in this PR was that recovering consensus was delayed for a long time which is caused by long timeouts produced by the randomTimeout() method in ibft.go, which calculated timeout values in seconds as 10 + 2^(round number), so the timeout could grow to be very large. Issues #261, #245 and #248 have all emerged from this essentially.

Changes include

Bugfix (non-breaking change that solves an issue)
New feature (non-breaking change that adds functionality)
Breaking change (change that is not backwards-compatible and/or changes current functionality)

Checklist

I have assigned this PR to myself
I have added at least 1 reviewer
I have tested this code
I have updated the README and other relevant documents (guides...)
I have added sufficient documentation both in code, as well as in the READMEs

-improved logs for state chages and round count

consensus/ibft/ibft.go

zivkovicmilos

Looks great 💯

Thank you for looking into the linked issues and for adding great in-code docs and coverage tests 🙏

consensus/ibft/timeout_provider.go

lazartravica

Thanks for linting!

Can you elaborate a bit more on the Round num + 1 when printing?
It is zero indexed, if the only thing here is to add more clarity then we can remove +1, as it's clear in the code that the first round will be 0.
Please resolve the timeout Milos noticed
_provider is bugging me. The entire thing where we have a file for an IBFT timeout function is bugging me a bit, can we perhaps have it as timeout.go or move the func back to ibft.go

brkomir · 2021-12-01T13:49:18Z

@lazartravica

Can you elaborate a bit more on the Round num + 1 when printing?
It is zero indexed, if the only thing here is to add more clarity then we can remove +1, as it's clear in the code that the first round will be 0.

I have left a comment above why i did this. #263 (comment)
If you think this should not be done in such way, I can make the changes.

3. _provider is bugging me. The entire thing where we have a file for an IBFT timeout function is bugging me a bit, can we perhaps have it as timeout.go or move the func back to ibft.go

I agree that file name can be changed to the timeout.go. But I'm not inclined to move this logic back into the ibft.go, since more code lives there than it should, as you can conclude just by observing that file is more than 1000 lines long. I'm not a fan of such design practices where we put so much logic in a single file, because reading, navigating and understanding the responsibilities of the code there becomes hard.

mrwillis · 2021-12-01T13:54:29Z

Just chiming in here and maybe I just need some clarity. Are the exponential timeouts necessary for the lower round nodes to "catch up"? Or is there another catch up mechanism for this in place.

brkomir · 2021-12-01T14:30:12Z

@mrwillis great question, as always :)

The exponential timeout isn't used as a mechanism for the lower round number nodes to catch up as you asked. It's simply being used for waiting on the response from other nodes. So it basically works like this, the nodes that are running but out of consensus, will be waiting for the other nodes to return, and every time this timeout expires a new round will start.

When the minimal number of nodes to restore consensus is achieved, this timeout will trigger a new round in which all active nodes will participate, the "returning" nodes will receive the reached round number by the previously running nodes, and the block production will continue.

I hope I was clear enough to answer your question.

-added a 5min timeout cap for the round timeout

c5c05c7

-improved logs for state chages and round count

brkomir requested review from lazartravica, Vuksan, zivkovicmilos, Kourin1996, dbrajovic and munna0908 December 1, 2021 12:40

brkomir mentioned this pull request Dec 1, 2021

Consensus failure cycling AcceptState -> ValidateState -> RoundChangeState -> AcceptState #248

Closed

brkomir self-assigned this Dec 1, 2021

zivkovicmilos reviewed Dec 1, 2021

View reviewed changes

consensus/ibft/ibft.go Outdated Show resolved Hide resolved

zivkovicmilos reviewed Dec 1, 2021

View reviewed changes

consensus/ibft/ibft.go Show resolved Hide resolved

zivkovicmilos approved these changes Dec 1, 2021

View reviewed changes

Kourin1996 approved these changes Dec 1, 2021

View reviewed changes

consensus/ibft/timeout_provider.go Outdated Show resolved Hide resolved

lazartravica suggested changes Dec 1, 2021

View reviewed changes

Addressing code review findings

3b61239

lazartravica approved these changes Dec 1, 2021

View reviewed changes

brkomir merged commit 621241d into develop Dec 1, 2021

brkomir deleted the fix/exponential-timeouts-cause-long-delay-in-consesus-recovery branch December 1, 2021 14:47

dankostiuk mentioned this pull request Dec 1, 2021

Stuck chain due to block locked is incorrect #232

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Exponential round timeouts cause very long delays in recovering lost consesus #263

Exponential round timeouts cause very long delays in recovering lost consesus #263

brkomir commented Dec 1, 2021 •

edited

Loading

zivkovicmilos left a comment

lazartravica left a comment

brkomir commented Dec 1, 2021

mrwillis commented Dec 1, 2021

brkomir commented Dec 1, 2021

Exponential round timeouts cause very long delays in recovering lost consesus #263

Exponential round timeouts cause very long delays in recovering lost consesus #263

Conversation

brkomir commented Dec 1, 2021 • edited Loading

Description

Changes include

Checklist

zivkovicmilos left a comment

Choose a reason for hiding this comment

lazartravica left a comment

Choose a reason for hiding this comment

brkomir commented Dec 1, 2021

mrwillis commented Dec 1, 2021

brkomir commented Dec 1, 2021

brkomir commented Dec 1, 2021 •

edited

Loading