[DRAFT] Require locking agreement to ChangeView and allowing committed nodes to move to higher views #653
Conversation
…reement to ChangeView.
It appears the current code in master doesn't pass unit tests. These changes did not break the unit tests.
You can update master now ;)
Maybe we need to lock prepare response to ask for changing view as well (otherwise it may "fork"), but let's investigate carefully.
…t lock changing view if we have sent prepare response.
I agree that we should not allow locking a change view after sending a prepare response, to prevent the scenario of duplicate blocks, since this change now allows un-committing from the lower view. I made that change.
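A minimal sketch of that guard, with assumed names rather than the actual code in this PR:

```csharp
// Hypothetical helper for illustration only; not the actual code in this PR.
static class ChangeViewLockGuard
{
    // Once this node has broadcast its prepare response it must not lock a
    // change view: un-committing from the lower view could otherwise allow
    // two blocks to be produced for the same height.
    public static bool MayLockChangeView(bool preparationResponseSent, int knownChangeViews, int m)
    {
        if (preparationResponseSent) return false; // already committed to a preparation
        return knownChangeViews >= m;              // locking still requires >= M change view messages
    }
}
```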
I am testing this code now using:
…des that missed prepare response.
I believe this is now a major improvement over #642. In the general case, liveness will be maintained with these changes with less than …
It is working reasonably now from testing after the last commit; however, it would be better if the recovery messages were requested more often. I'll let tests run for a few hours and see how it is doing.
Something seems to be broken in the prepare request message from the recovery message some of the time. Looking into it.
…ge is inhibited if a validator has sent its preparation.
```csharp
if (!knownHashes.Add(payload.Hash)) return;
// If we see a recovery message from a lower view we should recover them to our higher view; they may
// be stuck in commit.
if (!IsRecoveryAllowed(payload.ValidatorIndex, 1, context.F() + 1))
```
Double check when ready, because maybe this node in the `F + 1` is not in the view, and thus it would recover the node anyway.
At least one of `F + 1` nodes must be in the correct view, or they never would have changed view.
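For reference, the quorum arithmetic behind that argument, using assumed helper names rather than the actual neo ConsensusContext API:

```csharp
// Sketch of the standard BFT counts assumed in this thread, where N = 3F + 1
// validators tolerate at most F faulty nodes. Helper names are hypothetical.
static class Quorum
{
    public static int F(int validatorCount) => (validatorCount - 1) / 3;           // max faulty validators
    public static int M(int validatorCount) => validatorCount - F(validatorCount); // quorum size, 2F + 1

    // Any F + 1 distinct senders must include at least one honest node, so a
    // recovery triggered by F + 1 requests cannot come only from faulty nodes.
    public static bool IncludesHonestNode(int distinctSenders, int validatorCount)
        => distinctSenders >= F(validatorCount) + 1;
}
```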
Testing now seems to be reliable whether running with all nodes or running with …
Why can it prevent stalls?
Give me a couple of days for testing, I am trying to put together some stats.
There is still a flaw. I will have to pick this back up later. Anyone else testing can hold off for now until the issue is fixed.
The thought was that it is useful to have a mechanism to know how long to continue to allow accepting preparations. By requiring …
This situation is acceptable, because if a node gets stuck we can call the node's maintainer and have them restart it; then consensus will be restored.
I am afraid that this change may be like "covering the sun with a sieve". Furthermore, it looks like … Maybe we can merge #642 and #643 and keep this PR open for more experiments. I do not see a problem with changing master later.
Wouldn't it be great if consensus still worked while you were trying to get ahold of the maintainer whose node had failed? This helps achieve that.
I have no objection to merging #642 and then possibly accepting this later if testing shows it is better. #642 does not stall if all nodes are available, except in the rare case of nodes stuck in commits in earlier views due to poor network scenarios, in which case the logs make it clear who is stuck. These changes also should never stall with all nodes available (once the bugs are out). Also, this should not stall with 1 node shut down (except if it shuts down in certain edge case states; I still need to document exactly what those states are). In the general case with this PR, if you start the consensus nodes with … In general, the operational burden and liveness of the system should be better if the algorithm can continue to create blocks with up to … That being said, I believe this change needs more testing and documentation before it will be ready to be accepted, whereas #642 is basically well understood and tested at this point.
This still has more issues. Let's go for #642 and revisit this later.
@jsolman, let's keep this open for some time, I think it has good insights. In the worst case, if we do not use anything from here, it will help us to reaffirm the quality of the current implementation.
Summary of the insights of this PR.
Motivation: …
Current modifications of PR653: …
Detected edge cases: …
@erikzhang and @igormcoelho, do you think it is worth keeping this discussion open? Or does this cause have a low chance of success?
Very good, @shargon. 🗡️ This PR still has some problems and it is stalling sometimes. |
Let's discuss this off the PR and implement it carefully.
@vncoelho Sounds good, that was my thought also when I closed it earlier.
Hope is the last to die 💃 |
This change maintains liveness with up to `F` failed nodes and prevents stalls due to commits getting stuck in lower views. It is an alternative solution to #642 to prevent stalling.

This change adds a new concept/phase of locking change views. `>= M` change view messages must be known in order for a node to send `ChangeView` with a `Locked` flag set. In order for a node to move to a new view, it must see `>= M` change views with the `Locked` flag set and with a new view at or above the view to which it is changing.
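A minimal sketch of the two locking rules described above, using assumed type and member names rather than the actual PR code:

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical payload shape for illustration; the real ChangeView message differs.
class ChangeViewVote
{
    public bool Locked;
    public byte NewViewNumber;
}

static class ChangeViewLocking
{
    // Rule 1: a node may only send ChangeView with the Locked flag set once it
    // has seen >= M change view messages.
    public static bool CanSendLockedChangeView(int knownChangeViews, int m)
        => knownChangeViews >= m;

    // Rule 2: a node may only move to a new view after seeing >= M locked
    // change views whose new view is at or above the view it is moving to.
    public static bool CanMoveToView(IEnumerable<ChangeViewVote> votes, byte targetView, int m)
        => votes.Count(v => v.Locked && v.NewViewNumber >= targetView) >= m;
}
```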