
Prevent Commit after ChangeView #642

Merged: 35 commits into master, Mar 25, 2019
Conversation

@jsolman (Contributor) commented Mar 18, 2019

Since we never allow the view to jump backward during recovery, we cannot allow a change view to occur after more than F nodes have committed in any view. Therefore, if a node has sent a change view message, it cannot be allowed to reach commit in the view from which it has requested to change. This fix addresses the issue and prevents a stall.
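
In code terms, the fix amounts to a guard before sending a commit. A minimal sketch, assuming the helper names discussed later in this thread (ViewChanging, MoreThanFNodesCommitted) and simplified wiring around the real ConsensusService:

    // Sketch only: once this node has asked to change view, it must not commit
    // in the current view unless the view can no longer change.
    private void CheckPreparations()
    {
        // Need M preparations before a commit can be sent.
        if (context.PreparationPayloads.Count(p => p != null) < context.M()) return;

        // If we have requested a view change and F or fewer nodes have committed,
        // the view can still change, so committing here could stall the network.
        if (context.ViewChanging() && !context.MoreThanFNodesCommitted()) return;

        ConsensusPayload payload = context.MakeCommit();
        localNode.Tell(new LocalNode.SendDirectly { Inventory = payload });
        CheckCommits();
    }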

@vncoelho (Member) left a comment

We just detected this issue this afternoon.

Perfect and well done, @jsolman.
Thanks for solving this quickly and in a simple manner. Good vision.

I will soon complete some basic test cases and approve it.

@jsolman jsolman changed the title Prevent commit after ChangeView Prevent Commit after ChangeView Mar 18, 2019
@jsolman jsolman requested review from shargon and belane March 18, 2019 22:33
@vncoelho (Member) commented Mar 18, 2019

Check this, Jeff:

[22:44:38.406] initialize: height=550 view=0 index=3 role=Backup
[22:44:40.282] OnPrepareRequestReceived: height=550 view=0 index=2 tx=1
[22:44:40.282] send prepare response
[22:44:40.946] timeout: height=550 view=0
[22:44:40.946] request change view: height=550 view=0 nv=1
[22:44:40.997] OnCommitReceived: height=550 view=0 index=1
[22:44:41.003] OnCommitReceived: height=550 view=0 index=2
[22:44:41.009] OnCommitReceived: height=550 view=0 index=0
[22:44:41.014] relay block: 0x596994b17b42de6480863dfa4df823fdb56a22f054f8d7efcc91d54c377c9dfd
[22:44:41.044] persist block: 0x596994b17b42de6480863dfa4df823fdb56a22f054f8d7efcc91d54c377c9dfd
[22:44:41.044] initialize: height=551 view=0 index=3 role=Primary

haha, as expected: relaying a block by acting as a bridge. :D 💃

@jsolman (Contributor, author) commented Mar 18, 2019

One more adjustment is needed to prevent things from getting stuck: if we detect that F nodes have committed, then nodes should stop trying to change view.
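
Concretely, that suggests a check in the timeout path along these lines (a sketch; SendRecoveryMessage is a hypothetical helper used only for illustration):

    private void OnTimeout()
    {
        // If more than F nodes have committed, a view change can no longer gather
        // the required M ChangeView messages, so stop requesting one; just help
        // the committed nodes finish by rebroadcasting our state.
        if (context.MoreThanFNodesCommitted())
        {
            SendRecoveryMessage(); // hypothetical helper for illustration
            return;
        }
        RequestChangeView();
    }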

@vncoelho (Member) commented Mar 19, 2019

For those who want to check the logs of this last problem (one segment per node):

[23:17:07.861] initialize: height=2055 view=0 index=2 role=Backup
[23:17:09.819] OnPrepareRequestReceived: height=2055 view=0 index=3 tx=1
[23:17:09.820] send prepare response
[23:17:09.872] timeout: height=2055 view=0
[23:17:09.878] request change view: height=2055 view=0 nv=1
[23:17:10.645] OnCommitReceived: height=2055 view=0 index=3
[23:17:10.655] OnCommitReceived: height=2055 view=0 index=0
[23:17:10.664] OnChangeViewReceived: height=2055 view=0 index=1 nv=1
[23:17:10.672] OnChangeViewReceived: height=2055 view=0 index=2 nv=1

[23:17:07.927] initialize: height=2055 view=0 index=0 role=Backup
[23:17:09.820] OnPrepareRequestReceived: height=2055 view=0 index=3 tx=1
[23:17:09.820] send prepare response
[23:17:09.877] OnPrepareResponseReceived: height=2055 view=0 index=2
[23:17:09.878] send commit
[23:17:09.935] OnCommitReceived: height=2055 view=0 index=3
[23:17:10.848] OnRecoveryMessageReceived: height=2055 view=0 index=3
[23:17:10.929] timeout: height=2055 view=0

[23:17:08.812] initialize: height=2055 view=0 index=3 role=Primary
[23:17:09.816] timeout: height=2055 view=0
[23:17:09.816] send prepare request: height=2055 view=0
[23:17:09.836] OnPrepareResponseReceived: height=2055 view=0 index=2
[23:17:09.837] OnPrepareResponseReceived: height=2055 view=0 index=0
[23:17:09.838] send commit
[23:17:10.718] OnCommitReceived: height=2055 view=0 index=0
[23:17:10.846] timeout: height=2055 view=0
[23:17:10.846] send recovery to resend commit

[23:17:07.931] initialize: height=2055 view=0 index=1 role=Backup
[23:17:09.823] OnPrepareResponseReceived: height=2055 view=0 index=2
[23:17:09.832] OnPrepareResponseReceived: height=2055 view=0 index=0
[23:17:09.938] timeout: height=2055 view=0
[23:17:09.940] request change view: height=2055 view=0 nv=1
[23:17:09.994] OnCommitReceived: height=2055 view=0 index=0
[23:17:10.005] OnCommitReceived: height=2055 view=0 index=3

@vncoelho (Member) commented Mar 19, 2019

Let's just leave this comment here and let it flow... haha

// A possible attack can happen if the last committed node does not want to recover the others.
// In addition, currently, if the node asking to change views crashes, it may come back and accept
// recovery from any committed node, thus splitting nodes among committed nodes and possibly stalling the network.
public static bool FNodesValidCommitted(this IConsensusContext context) => context.CommitPayloads.Count(p => p != null) >= context.F();
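
For reference, F here is the standard BFT fault-tolerance bound. In this codebase the thresholds are derived from the validator count roughly as follows (a sketch consistent with the helpers quoted in this thread):

    // With N validators, tolerate F faults; M signatures are required to finalize a block.
    public static int F(this IConsensusContext context) => (context.Validators.Length - 1) / 3;
    public static int M(this IConsensusContext context) => context.Validators.Length - context.F();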

neo/Consensus/ConsensusService.cs (review thread, outdated — resolved)
@@ -293,7 +296,7 @@ private void OnRecoveryMessageReceived(ConsensusPayload payload, RecoveryMessage
         ReverifyAndProcessPayload(changeViewPayload);
     }
     if (message.ViewNumber != context.ViewNumber) return;
-    if (!context.CommitSent())
+    if (!context.ViewChanging() && !context.CommitSent())
A Member commented:

You should allow receiving PrepareRequest and PrepareResponse even when ViewChanging() == true, because ViewChanging() only indicates that the node wants to change views; the change may not succeed.

@jsolman (author) replied:

See the comments below. If a node accepts prepare requests and prepare responses in the current view after it has sent a ChangeView message, it can end up sending the other nodes to the next view, where they will never be able to receive the recovery message carrying the commits from this view, since it will have a lower view number; thus the network will be stalled.

The Member replied:

If the view has not been changed, then we should allow consensus in the current view. Otherwise, we may be able neither to reach consensus nor to change views.

@jsolman (author) replied Mar 19, 2019:

If the view is changing for a node, the view has to be considered changed from its perspective: it has already sent a change view message that others can use to accept a view change, so it cannot commit in the current view. If it did commit in the current view after sending a change view, the others could all accept moving to the later view, where they would not be able to reach consensus.

If more than F nodes have committed, then this code does allow receiving PrepareRequest and PrepareResponse in the current view, effectively treating the ViewChanging flag as not changing so that consensus can be reached. Maybe you missed this line that is changed in this PR:

public static bool MoreThanFNodesCommitted(this IConsensusContext context) => context.CommitPayloads.Count(p => p != null) > context.F();

If F or fewer nodes have committed, the network will still be able to reach consensus in a higher view. So this code should address your concern about nodes being able to reach consensus in the current view when it is no longer possible to change view.
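
For context, the ViewChanging flag referred to here is derived from the node's own pending ChangeView payload. A sketch of what such a helper could look like (the payload accessors are assumptions, not quoted from the PR):

    // Sketch: a node counts as "view changing" once it has broadcast a ChangeView
    // message asking for a view higher than the current one.
    public static bool ViewChanging(this IConsensusContext context)
    {
        var myChangeView = context.ChangeViewPayloads[context.MyIndex]?.GetDeserializedMessage<ChangeView>();
        return myChangeView != null && myChangeView.NewViewNumber > context.ViewNumber;
    }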

neo/Consensus/ConsensusService.cs (two further review threads, outdated — resolved)
@erikzhang (Member) commented:

NGD will test #641 first. And if there are too many stalls, they will test #642 then.

@jsolman (author) commented Mar 19, 2019:

> And if there are too many stalls, they will test #642 then.

It will be good to have a baseline before this change. However, it does not take very severe network latency to potentially cause the stall that this PR fixes, so I suspect we should have this tested afterward if the network stalls even once during NGD testing.

@vncoelho (Member) commented Mar 19, 2019:

@erikzhang, up to commit 57065f8 this basically solves the issue.

Maybe we can revert to that point, @jsolman, and open a new PR with 442bfe8.

For me, both are good, but everything up to that commit looks quite necessary at the moment.

@jsolman (author) commented Mar 19, 2019:

@vncoelho I already moved the code that saves the consensus context out to a separate PR. This PR now contains only the minimal changes required to fix the issue.

@vncoelho (Member) commented Mar 19, 2019:

Thanks, @jsolman. The last tests were carried out with >= f; let's start them again with > f. Thanks for the reminder.

…cepting PrepareRequest and PrepareResponse from different views."

This reverts commit 9838afc.
Conflicts:
	neo/Consensus/ConsensusService.cs
	neo/Consensus/Helper.cs
@vncoelho (Member) commented:

As always, great job @erikzhang; it looks cleaner now. Tests are being updated with this version.

@erikzhang (Member) commented:

NGD will test it these days.

@jsolman (author) commented Mar 21, 2019:

I'm running a test using the P2P plugin now that drops commits and prepare responses 50% of the time.

    using System;
    using Neo.Consensus;
    using Neo.Network.P2P;
    using Neo.Network.P2P.Payloads;
    using Neo.Plugins;

    // Test plugin: randomly drops selected consensus messages to simulate a lossy
    // network. (Usings above are my best guess for the neo 2.x namespace layout.)
    public class P2PLossyConsensus : Plugin, IP2PPlugin
    {
        private static readonly Random MyRandom = new Random();

        public override void Configure()
        {
        }

        // Never drop ordinary (non-consensus) P2P messages.
        public bool OnP2PMessage(Message message)
        {
            return true;
        }

        // Drop Commit and PrepareResponse messages roughly 50% of the time;
        // deliver everything else.
        public bool OnConsensusMessage(ConsensusPayload payload)
        {
            if (payload.ConsensusMessage.Type == ConsensusMessageType.Commit ||
                payload.ConsensusMessage.Type == ConsensusMessageType.PrepareResponse)
            {
                return MyRandom.Next() % 2 == 0;
            }
            return true;
        }
    }

Edit: I added dropping PrepareResponse 50% of the time above, since this simulates the problem this PR solves.

@jsolman (author) commented Mar 21, 2019:

So far testing is going well. No stalls.

@jsolman (author) commented Mar 21, 2019:

Now stressing it harder by dropping other message types too. Dropping PrepareResponse 50% of the time is a good way to simulate the problem this PR solves.
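
A sketch of what widening the filter to other message types could look like (the extra drop rate for ChangeView is an arbitrary test knob, not a value from the thread):

    public bool OnConsensusMessage(ConsensusPayload payload)
    {
        switch (payload.ConsensusMessage.Type)
        {
            case ConsensusMessageType.Commit:
            case ConsensusMessageType.PrepareResponse:
                return MyRandom.Next(2) == 0;  // drop ~50%
            case ConsensusMessageType.ChangeView:
                return MyRandom.Next(4) != 0;  // drop ~25%
            default:
                return true;                   // deliver everything else
        }
    }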

@jsolman jsolman requested a review from erikzhang March 22, 2019 09:04
@jsolman (author) commented Mar 22, 2019:

Testing indicates this is working well so far. I've heard good things from NGD so far also. I look forward to hearing their final result.

@shargon (Member) commented Mar 22, 2019:

Good test, @jsolman!

@vncoelho (Member) commented Mar 22, 2019:

Blocks have been quite stable for almost 48 hours (resetting every 12) on a network with 2-second block times and low computational resources:

[attached image: block interval chart]

Some outliers can be seen, but those are mostly due to nodes lagging. Even one of the CNs sometimes lags.

@jsolman (author) commented Mar 23, 2019:

I've thought of a better solution than this. Expect an alternative PR within 48 hours that solves the same problem as this PR while being cleaner and maintaining liveness, and without needing #643.

@vncoelho (Member) commented Mar 24, 2019:

@jsolman, I like the idea of allowing committed nodes to move to higher views! I think it is a necessary assumption for the liveness of pBFT, as @igormcoelho has been advocating. It is a modification we should make after this one.

However, I think we still need #643, and also the lock protection for nodes with the ViewChanging flag, which was the main motivation of this PR.

We need to discuss this "special rule" in detail and think it through, to be sure it will not break our assumption of never creating double headers when there are fewer than f+1 malicious nodes.

We created an example in the thread above; let's discuss that.
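
One way to see why f+1 is the relevant bound (a back-of-envelope argument, not taken from the thread): with N = 3f+1 validators and M = 2f+1 commit signatures required per header, two conflicting headers at the same height would need 2M signatures in total, so the two signer sets must overlap in at least 2M - N validators:

    N = 3f + 1 (validators),  M = 2f + 1 (commit signatures per header)
    2M - N = (4f + 2) - (3f + 1) = f + 1

Hence double headers require at least f+1 double-signers, i.e. at least f+1 malicious nodes, matching the assumption above.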

@shargon (Member) commented Mar 25, 2019:

Give me a couple of days for testing; I am trying to gather some stats.

@vncoelho (Member) commented:

Go for it, @shargon. For me this is approved. I think this is the natural solution we have, considering the path we picked in our previous design of dBFT 2.0.

In addition, #643 has potential for "helping" (though not completely "preventing") a crashed node from contributing in a lower view than the last one it knew.

@erikzhang (Member) commented:

Are we going to merge this first? @jsolman @vncoelho

@vncoelho (Member) commented Mar 25, 2019:

Such a hard question, @erikzhang... haha. I will leave this "box of bees" with you 📦

I am joking, Erik. This PR has been tested and it solves the critical problem. Maybe it is not the best solution (which is partly a consequence of our design; whether it is optimal is something I cannot answer right now), because f nodes can stall the network if some of the committed nodes (from the M remaining) do not contribute to changing view.

I do not see a problem in merging this and discussing #653 later, even if that is after 1-2 weeks.

@vncoelho (Member) left a comment

@erikzhang, feel free to merge when NGD finishes testing. I think this should go to Testnet ASAP.

If unsatisfactory behavior is detected on Testnet, we move in another direction. But if it is OK, let's put it on Mainnet and keep the discussions going. Maybe in another couple of months/weeks/even days we can achieve an even more polished version.

@jsolman (author) commented Mar 25, 2019:

@erikzhang I agree with moving forward with this when you are ready. Other solutions like #653 still have issues and need further adjustments before they are ready; arguably they have other issues that could potentially be worse.

@erikzhang erikzhang merged commit 28443ef into master Mar 25, 2019
@vncoelho vncoelho deleted the consensus/preventStall branch March 25, 2019 16:27
@shargon (Member) commented Mar 25, 2019:

My tests:

[attached image: test results chart]

Thacryba pushed a commit to simplitech/neo that referenced this pull request Feb 17, 2020
Labels: critical (issues/bugs that need to be fixed ASAP)