
[DRAFT] Require locking agreement to ChangeView and allowing committed nodes to move to higher views #653

Closed
jsolman wants to merge 23 commits

Conversation

@jsolman (Contributor) commented Mar 24, 2019

This change maintains liveness with up to F failed nodes and prevents stalls due to commits getting stuck in lower views. It is an alternative solution to #642 to prevent stalling.

This change adds a new concept/phase of locking change views. A node must know of >= M ChangeView messages before it may send a ChangeView with the Locked flag set. In order to move to a new view, a node must see >= M ChangeViews with the Locked flag set whose new view is at or above the view to which it is changing.
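As a rough illustration of those two rules (a hypothetical sketch, not the PR's actual code: the ChangeViewInfo type, the field names, and the helpers are invented for the example):

using System.Collections.Generic;
using System.Linq;

// Hypothetical summary of a received ChangeView message, used only for this sketch.
public class ChangeViewInfo
{
    public bool Locked;          // the sender set the Locked flag on its ChangeView
    public byte NewViewNumber;   // the view the sender wants to move to
}

public static class ChangeViewLocking
{
    // Rule 1: a node may send ChangeView with the Locked flag only once it
    // already knows of >= M ChangeView messages.
    public static bool CanSendLockedChangeView(IReadOnlyCollection<ChangeViewInfo> known, int m)
        => known.Count >= m;

    // Rule 2: a node may move to newView only once it has seen >= M ChangeViews
    // that carry the Locked flag and target a view at or above newView.
    public static bool CanMoveToView(IEnumerable<ChangeViewInfo> known, byte newView, int m)
        => known.Count(cv => cv.Locked && cv.NewViewNumber >= newView) >= m;
}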

@jsolman (Contributor, Author) commented Mar 24, 2019

It appears the current code in master doesn't pass unit tests. These changes did not break the unit tests.

@igormcoelho (Contributor) commented:

You can update master now ;)

@vncoelho (Member) commented Mar 24, 2019

Maybe we need to lock prepare response when asking to change view as well (otherwise it may "fork"), but let's investigate carefully.

…t lock changing view if we have sent prepare response.
@jsolman (Contributor, Author) commented Mar 24, 2019

Maybe we need to lock prepare response when asking to change view as well, but let's investigate carefully.

I agree that we need to not allow locking the change view after sending a prepare response, to prevent the scenario of duplicate blocks, since this change now allows uncommitting from the lower view. I made that change.

@jsolman (Contributor, Author) commented Mar 24, 2019

I am testing this code now using:

// Namespace usings added here for completeness; they assume the neo 2.x layout of the time.
using System;
using Neo.Consensus;
using Neo.Network.P2P;
using Neo.Network.P2P.Payloads;
using Neo.Plugins;

// Test plugin that randomly drops consensus traffic to simulate a lossy network.
public class P2PLossyConsensus : Plugin, IP2PPlugin
{
    private static Random MyRandom = new Random();

    public override void Configure()
    {
    }

    public bool OnP2PMessage(Message message)
    {
        return true;
    }

    public bool OnConsensusMessage(ConsensusPayload payload)
    {
        // Drop roughly half of the Commit and PrepareResponse messages.
        if (payload.ConsensusMessage.Type == ConsensusMessageType.Commit ||
            payload.ConsensusMessage.Type == ConsensusMessageType.PrepareResponse)
        {
            return MyRandom.Next() % 2 == 0;
        }

        // Drop roughly a quarter of the PrepareRequest messages.
        if (payload.ConsensusMessage.Type == ConsensusMessageType.PrepareRequest)
        {
            return MyRandom.Next() % 4 > 0;
        }

        return true;
    }
}
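(Dropping Commit and PrepareResponse traffic at these rates is presumably what forces frequent view changes and nodes stuck in commit, i.e. the scenarios the locking rules above are meant to handle.)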

@jsolman (Contributor, Author) commented Mar 24, 2019

I believe this is now a major improvement over #642. In the general case, liveness will be maintained with these changes with fewer than F failed validators. There may be a couple of crash scenarios where it could stall with fewer than F failed validators, but those may be fixed if we introduce saving and restoring the context on top of these changes.

@vncoelho added the critical and potential enhancement labels Mar 24, 2019
@jsolman (Contributor, Author) commented Mar 24, 2019

It is working reasonably well in testing after the last commit; however, it would be better if the recovery messages were requested more often. I'll let tests run for a few hours and see how it is doing.

@jsolman (Contributor, Author) commented Mar 24, 2019

Something seems to be broken in the prepare request carried by the recovery message some of the time. Looking into it.

…ge is inhibitted if a validator has sent its preparation.
if (!knownHashes.Add(payload.Hash)) return;
// If we see a recovery message from a lower view we should recover them to our higher view; they may
// be stuck in commit.
if (!IsRecoveryAllowed(payload.ValidatorIndex, 1, context.F() + 1 ))
Member (review comment):

Double check when ready, because maybe this guy +1 is not in the view and thus would recover the guy anyway.

jsolman (Contributor, Author) replied:

At least one of the F+1 nodes must be in the correct view, or they never would have changed view.
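The body of IsRecoveryAllowed is not shown in this excerpt; judging only from its arguments (the sender's validator index, an offset of 1, and a count of F + 1), one plausible reading is that the F + 1 validators whose indices immediately follow the sender's are the ones allowed to answer with a recovery message, so the sender is not flooded by every node at once. A hypothetical, self-contained version of such a gate (with the validator count and the local index passed in explicitly, unlike the actual three-argument call above) could look like:

// Hypothetical reconstruction only; the PR's actual IsRecoveryAllowed presumably reads
// these extra values from the consensus context rather than taking them as parameters.
private static bool IsRecoveryAllowed(ushort senderIndex, int startOffset, int allowedCount,
                                      int validatorCount, int myIndex)
{
    // Allow the allowedCount validators whose indices follow the sender's
    // (wrapping around the validator set) to respond with a recovery message.
    for (int i = 0; i < allowedCount; i++)
    {
        int chosenIndex = (senderIndex + startOffset + i) % validatorCount;
        if (chosenIndex == myIndex) return true;
    }
    return false;
}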

@jsolman (Contributor, Author) commented Mar 25, 2019

Testing now seems to be reliable, whether running with all nodes or with F nodes shut down.

@erikzhang (Member) commented:

Why can it prevent stalls?

@shargon (Member) commented Mar 25, 2019

Give me a couple of days for testing; I am trying to make some stats.

@jsolman (Contributor, Author) commented Mar 25, 2019

There is still a flaw. I will have to pick this back up later. Others who are testing can stop for now until the issue is fixed.

Why can it prevent stalls?

The thought was that it is useful to have a mechanism for knowing how long to keep accepting preparations. By requiring >= M ChangeViews to be seen before a decision to change view is locked, a node can use that lock as the criterion for when the view is really changing and it can no longer accept prepare requests. The theory was that this should prevent stalling when F nodes are failed or offline: since the primary will never lock its change view, there won't be enough nodes to lock a change view while F nodes are failed.

Thinking about it now, though, I think this code still has an issue, because it allows the primary to send its change view unlocked, which will allow the others to lock their change views. The primary probably shouldn't ever send a change view message at all, unless the view changes and it is no longer the primary. With that adjustment, the other nodes keep accepting preparations with F nodes failed and the network should continue generating blocks. In #642, once F nodes failed, if one node had committed and the others had sent change views, they would no longer accept preparations due to ViewChanging and it would stall.
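To put some assumed numbers on that reasoning: with N = 7 validators, F = 2 and M = 5. If the primary never sends a ChangeView and F = 2 nodes are failed, at most 7 - 2 - 1 = 4 nodes can ever contribute ChangeView messages, which is below M = 5, so no node can lock a view change; the remaining nodes therefore keep accepting preparations and blocks keep being produced.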

@erikzhang (Member) commented:

In #642, once F nodes failed, if one node had committed and the others had sent change views, they would no longer accept preparations due to ViewChanging and it would stall.

This situation is acceptable, because if it gets stuck, we will call the node's maintainer and have them restart the node. Then consensus will be restored.

@vncoelho (Member) commented Mar 25, 2019

I am afraid that this change may be like "covering the sun with a sieve" (hiding the problem rather than fixing it).
My feeling is that nodes would stall during PrepareResponse under much the same conditions as the stall with 2 committed guys and 2 guys on view changing; it looks almost the same to me.

Furthermore, it looks like the commit phase is still needed.

Maybe we can merge #642 and #643 and keep this PR open for more experiments. I do not see a problem with changing master later.

@jsolman (Contributor, Author) commented Mar 25, 2019

This situation is acceptable, because if it gets stuck, we will call the node's maintainer and have them restart the node. Then consensus will be restored.

Wouldn't it be great if consensus still worked while you were trying to get hold of the maintainer whose node had failed? This helps achieve that.

@jsolman (Contributor, Author) commented Mar 25, 2019

I have no objection to merging #642 and then possibly accepting this later if testing shows it is better. #642 does not stall if all nodes are available, except in the rare case of nodes stuck in commits in earlier views due to poor network conditions; in that case the logs make it clear who is stuck.

These changes, though, should also never stall with all nodes available (once the bugs are worked out). They should also not stall with 1 node shut down (except if nodes shut down in certain edge case states; I still need to document exactly what those states are). In the general case with this PR, if you start the consensus nodes with F nodes shut down, it will create blocks and never stall. With #642, however, starting the consensus nodes with F nodes shut down stalls if F nodes commit and the other nodes are view changing.

In general, the operational burden is lower and the liveness of the system is better if the algorithm can continue to create blocks with up to F failed (crashed or shut off) nodes.

That being said, I believe this change needs more testing and documentation before it will be ready to be accepted, whereas #642 is basically well understood and tested at this point.

@jsolman (Contributor, Author) commented Mar 25, 2019

This still has more issues. Let’s go for #642 and revisit this later.

@jsolman closed this Mar 25, 2019
@vncoelho changed the title from "Restore Liveness: Prevent Commit after ChangeView. Require locking agreement to ChangeView." to "[DRAFT] Modify Liveness: Require locking agreement to ChangeView and allowing committed nodes to move to higher views" Mar 25, 2019
@vncoelho (Member) commented Mar 25, 2019

@jsolman, let's keep this open for some time, I think it has good insights.

In the worst case, if we do not use anything from here, it will help us reaffirm the quality of the current implementation.

@vncoelho reopened this Mar 25, 2019
@vncoelho changed the title from "[DRAFT] Modify Liveness: Require locking agreement to ChangeView and allowing committed nodes to move to higher views" to "[DRAFT] Require locking agreement to ChangeView (extra phase) and allowing committed nodes to move to higher views" Mar 25, 2019
@vncoelho changed the title from "[DRAFT] Require locking agreement to ChangeView (extra phase) and allowing committed nodes to move to higher views" to "[DRAFT] Require locking agreement to ChangeView and allowing committed nodes to move to higher views" Mar 25, 2019
@vncoelho (Member) commented Mar 25, 2019

Summary of the insights from this PR.

Motivation:
Currently, nodes can stall if f nodes are crashed/non-accessible and at least one of the M remaining nodes is committed.

Current modifications in PR #653:

  • An additional layer of "locked"/"unlocked" states during change view
  • The primary will not send a ChangeView
  • The primary and nodes that have already sent their response (ResponseSent) are not going to lock their ChangeViews, thus not really contributing to a view increase
  • Committed nodes are going to move to a higher view if anybody can provide M valid locked ChangeViews

Detected edge cases:

  • The network can still stall if nodes sent unlocked ChangeViews before dying (f nodes dead and one of them sent an unlocked ChangeView; that would stall until some are brought back by the operators) (one possible reading of this case is worked through below);
  • Other possibilities that need to be analysed.
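One possible reading of that first edge case, again assuming N = 7 (F = 2, M = 5) and a primary that never sends ChangeView: if one of the 2 dead nodes sent an unlocked ChangeView before dying, the 4 live backups plus that message add up to 5 = M known ChangeViews, so the live backups can lock their view change and stop accepting preparations; but only those 4 live backups can ever produce locked ChangeViews, which is below M = 5, so they can never actually move to the next view either, and the network stalls until an operator brings a node back.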

@vncoelho (Member) commented:

@erikzhang and @igormcoelho, do you think it is worth keeping this discussion going? Or does this approach have a low chance of success?

@vncoelho removed the critical label Mar 25, 2019
@shargon (Member) commented Mar 25, 2019

My tests:

[image: test results]

@vncoelho (Member) commented:

Very good, @shargon. 🗡️
The idea is exactly this. Now that we all have better tools for analyzing and testing, it is useful to keep this draft PR around so the discussion stays alive.

This PR still has some problems and it is still stalling sometimes.

@vncoelho closed this Mar 27, 2019
@vncoelho (Member) commented:

Let's discuss this off the PR and implement it carefully.

@jsolman (Contributor, Author) commented Mar 27, 2019

@vncoelho sounds good; that was my thought as well when I closed it earlier.

@vncoelho (Member) commented:

Hope is the last to die 💃

@vncoelho deleted the consensus/liveness branch May 6, 2019 13:14
Thacryba pushed a commit to simplitech/neo that referenced this pull request Feb 17, 2020