
Prevent Commit after ChangeView #642

Merged: 35 commits into master, Mar 25, 2019
Conversation

@jsolman (Contributor) commented Mar 18, 2019

Since we never allow the view to jump backward during recovery, we cannot allow a change view to occur after more than F nodes have committed in any view. Therefore, if a node has sent a change view message, it cannot be allowed to reach commit in the view from which it has requested to change. This fix addresses the issue and prevents a stall.
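
In code terms, the fix amounts to a guard before sending a commit. A minimal sketch, assuming the helper names discussed later in this thread (ViewChanging, MoreThanFNodesCommitted) and simplified wiring around the real ConsensusService:

    // Sketch only: once this node has asked to change view, it must not commit
    // in the current view unless the view can no longer change.
    private void CheckPreparations()
    {
        // Need M preparations before a commit can be sent.
        if (context.PreparationPayloads.Count(p => p != null) < context.M()) return;

        // If we have requested a view change and F or fewer nodes have committed,
        // the view can still change, so committing here could stall the network.
        if (context.ViewChanging() && !context.MoreThanFNodesCommitted()) return;

        ConsensusPayload payload = context.MakeCommit();
        localNode.Tell(new LocalNode.SendDirectly { Inventory = payload });
        CheckCommits();
    }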

@vncoelho (Member) left a comment

We just detected this issue this afternoon.

Perfect and well done, @jsolman.
Thanks for solving this quickly and in a simple manner. Good vision.

I will soon complete some basic test cases and approve it.

@jsolman jsolman changed the title Prevent commit after ChangeView Prevent Commit after ChangeView Mar 18, 2019
@jsolman jsolman requested review from shargon and belane March 18, 2019 22:33
@vncoelho (Member) commented Mar 18, 2019

Check this, Jeff:

[22:44:38.406] initialize: height=550 view=0 index=3 role=Backup
[22:44:40.282] OnPrepareRequestReceived: height=550 view=0 index=2 tx=1
[22:44:40.282] send prepare response
[22:44:40.946] timeout: height=550 view=0
[22:44:40.946] request change view: height=550 view=0 nv=1
[22:44:40.997] OnCommitReceived: height=550 view=0 index=1
[22:44:41.003] OnCommitReceived: height=550 view=0 index=2
[22:44:41.009] OnCommitReceived: height=550 view=0 index=0
[22:44:41.014] relay block: 0x596994b17b42de6480863dfa4df823fdb56a22f054f8d7efcc91d54c377c9dfd
[22:44:41.044] persist block: 0x596994b17b42de6480863dfa4df823fdb56a22f054f8d7efcc91d54c377c9dfd
[22:44:41.044] initialize: height=551 view=0 index=3 role=Primary

haha, as expected: relaying a block by acting as a bridge. :D 💃

@jsolman (Contributor, author) commented Mar 18, 2019

One more adjustment is needed to prevent things from getting stuck: if we detect that F nodes have committed, then nodes should stop trying to change view.
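
Concretely, that suggests a check in the timeout path along these lines (a sketch; SendRecoveryMessage is a hypothetical helper used only for illustration):

    private void OnTimeout()
    {
        // If more than F nodes have committed, a view change can no longer gather
        // the required M ChangeView messages, so stop requesting one; just help
        // the committed nodes finish by rebroadcasting our state.
        if (context.MoreThanFNodesCommitted())
        {
            SendRecoveryMessage(); // hypothetical helper for illustration
            return;
        }
        RequestChangeView();
    }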

@vncoelho (Member) commented Mar 19, 2019

For those who want to check the logs of this last problem (one segment per node):

[23:17:07.861] initialize: height=2055 view=0 index=2 role=Backup
[23:17:09.819] OnPrepareRequestReceived: height=2055 view=0 index=3 tx=1
[23:17:09.820] send prepare response
[23:17:09.872] timeout: height=2055 view=0
[23:17:09.878] request change view: height=2055 view=0 nv=1
[23:17:10.645] OnCommitReceived: height=2055 view=0 index=3
[23:17:10.655] OnCommitReceived: height=2055 view=0 index=0
[23:17:10.664] OnChangeViewReceived: height=2055 view=0 index=1 nv=1
[23:17:10.672] OnChangeViewReceived: height=2055 view=0 index=2 nv=1

[23:17:07.927] initialize: height=2055 view=0 index=0 role=Backup
[23:17:09.820] OnPrepareRequestReceived: height=2055 view=0 index=3 tx=1
[23:17:09.820] send prepare response
[23:17:09.877] OnPrepareResponseReceived: height=2055 view=0 index=2
[23:17:09.878] send commit
[23:17:09.935] OnCommitReceived: height=2055 view=0 index=3
[23:17:10.848] OnRecoveryMessageReceived: height=2055 view=0 index=3
[23:17:10.929] timeout: height=2055 view=0

[23:17:08.812] initialize: height=2055 view=0 index=3 role=Primary
[23:17:09.816] timeout: height=2055 view=0
[23:17:09.816] send prepare request: height=2055 view=0
[23:17:09.836] OnPrepareResponseReceived: height=2055 view=0 index=2
[23:17:09.837] OnPrepareResponseReceived: height=2055 view=0 index=0
[23:17:09.838] send commit
[23:17:10.718] OnCommitReceived: height=2055 view=0 index=0
[23:17:10.846] timeout: height=2055 view=0
[23:17:10.846] send recovery to resend commit

[23:17:07.931] initialize: height=2055 view=0 index=1 role=Backup
[23:17:09.823] OnPrepareResponseReceived: height=2055 view=0 index=2
[23:17:09.832] OnPrepareResponseReceived: height=2055 view=0 index=0
[23:17:09.938] timeout: height=2055 view=0
[23:17:09.940] request change view: height=2055 view=0 nv=1
[23:17:09.994] OnCommitReceived: height=2055 view=0 index=0
[23:17:10.005] OnCommitReceived: height=2055 view=0 index=3

@vncoelho (Member) commented Mar 19, 2019

Let's just leave this comment here and let it flow... haha

// A possible attack can happen if the last committed node does not want to recover the others.
// In addition, currently, if the node asking to change views crashes, it may come back and accept
// recovery from any committed node, thus splitting nodes among committed nodes and possibly stalling the network.
public static bool FNodesValidCommitted(this IConsensusContext context) => context.CommitPayloads.Count(p => p != null) >= context.F();
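
For reference, F here is the standard BFT fault-tolerance bound. In this codebase the thresholds are derived from the validator count roughly as follows (a sketch consistent with the helpers quoted in this thread):

    // With N validators, tolerate F faults; M signatures are required to finalize a block.
    public static int F(this IConsensusContext context) => (context.Validators.Length - 1) / 3;
    public static int M(this IConsensusContext context) => context.Validators.Length - context.F();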

neo/Consensus/ConsensusService.cs (review thread, outdated — resolved)
@@ -293,7 +296,7 @@ private void OnRecoveryMessageReceived(ConsensusPayload payload, RecoveryMessage
         ReverifyAndProcessPayload(changeViewPayload);
     }
     if (message.ViewNumber != context.ViewNumber) return;
-    if (!context.CommitSent())
+    if (!context.ViewChanging() && !context.CommitSent())
A Member commented:

You should allow receiving PrepareRequest and PrepareResponse even when ViewChanging() == true, because ViewChanging() only indicates that the node wants to change views; the change may not succeed.

@jsolman (author) replied:

See the comments below. If a node accepts prepare requests and prepare responses in the current view after it has sent a ChangeView message, it can end up sending the other nodes to the next view, where they will never be able to receive the recovery message carrying the commits from this view, since it will have a lower view number; thus the network will be stalled.

The Member replied:

If the view has not been changed, then we should allow consensus in the current view. Otherwise, we may be able neither to reach consensus nor to change views.

@jsolman (author) replied Mar 19, 2019:

If the view is changing for a node, the view has to be considered changed from its perspective: it has already sent a change view message that others can use to accept a view change, so it cannot commit in the current view. If it did commit in the current view after sending a change view, the others could all accept moving to the later view, where they would not be able to reach consensus.

If more than F nodes have committed, then this code does allow receiving PrepareRequest and PrepareResponse in the current view, effectively treating the ViewChanging flag as not changing so that consensus can be reached. Maybe you missed this line that is changed in this PR:

public static bool MoreThanFNodesCommitted(this IConsensusContext context) => context.CommitPayloads.Count(p => p != null) > context.F();

If F or fewer nodes have committed, the network will still be able to reach consensus in a higher view. So this code should address your concern about nodes being able to reach consensus in the current view when it is no longer possible to change view.
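
For context, the ViewChanging flag referred to here is derived from the node's own pending ChangeView payload. A sketch of what such a helper could look like (the payload accessors are assumptions, not quoted from the PR):

    // Sketch: a node counts as "view changing" once it has broadcast a ChangeView
    // message asking for a view higher than the current one.
    public static bool ViewChanging(this IConsensusContext context)
    {
        var myChangeView = context.ChangeViewPayloads[context.MyIndex]?.GetDeserializedMessage<ChangeView>();
        return myChangeView != null && myChangeView.NewViewNumber > context.ViewNumber;
    }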

neo/Consensus/ConsensusService.cs (two further review threads, outdated — resolved)
@erikzhang (Member) commented:

NGD will test #641 first. And if there are too many stalls, they will test #642 then.

@jsolman (author) commented Mar 19, 2019:

> And if there are too many stalls, they will test #642 then.

It will be good to have a baseline before this change. However, it does not take very severe network latency to potentially cause the stall that this PR fixes, so I suspect we should have this tested afterward if the network stalls even once during NGD testing.

@vncoelho (Member) commented Mar 19, 2019:

@erikzhang, up to commit 57065f8 this basically solves the issue.

Maybe we can revert to that point, @jsolman, and open a new PR with 442bfe8.

For me, both are good, but everything up to that commit looks quite necessary at the moment.

@jsolman (author) commented Mar 19, 2019:

@vncoelho I already moved the code that saves the consensus context out to a separate PR. This PR now contains only the minimal changes required to fix the issue.

@vncoelho (Member) commented Mar 19, 2019:

Thanks, @jsolman. The last tests were carried out with >= f; let's start them again with > f. Thanks for the reminder.

…cepting PrepareRequest and PrepareResponse from different views."

This reverts commit 9838afc.
Conflicts:
	neo/Consensus/ConsensusService.cs
	neo/Consensus/Helper.cs
@vncoelho (Member) commented:

As always, great job @erikzhang; it looks cleaner now. Tests are being updated with this version.

@erikzhang (Member) commented:

NGD will test it these days.

@jsolman (author) commented Mar 21, 2019:

I'm running a test using the P2P plugin now that drops commits and prepare responses 50% of the time.

    using System;
    using Neo.Consensus;
    using Neo.Network.P2P;
    using Neo.Network.P2P.Payloads;
    using Neo.Plugins;

    // Test plugin: randomly drops selected consensus messages to simulate a lossy
    // network. (Usings above are my best guess for the neo 2.x namespace layout.)
    public class P2PLossyConsensus : Plugin, IP2PPlugin
    {
        private static readonly Random MyRandom = new Random();

        public override void Configure()
        {
        }

        // Never drop ordinary (non-consensus) P2P messages.
        public bool OnP2PMessage(Message message)
        {
            return true;
        }

        // Drop Commit and PrepareResponse messages roughly 50% of the time;
        // deliver everything else.
        public bool OnConsensusMessage(ConsensusPayload payload)
        {
            if (payload.ConsensusMessage.Type == ConsensusMessageType.Commit ||
                payload.ConsensusMessage.Type == ConsensusMessageType.PrepareResponse)
            {
                return MyRandom.Next() % 2 == 0;
            }
            return true;
        }
    }

Edit: I added dropping PrepareResponse 50% of the time above, since this simulates the problem this PR solves.

@jsolman (author) commented Mar 21, 2019:

So far testing is going well. No stalls.

@jsolman (author) commented Mar 21, 2019:

Now stressing it harder by dropping other message types too. Dropping PrepareResponse 50% of the time is a good way to simulate the problem this PR solves.
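
A sketch of what widening the filter to other message types could look like (the extra drop rate for ChangeView is an arbitrary test knob, not a value from the thread):

    public bool OnConsensusMessage(ConsensusPayload payload)
    {
        switch (payload.ConsensusMessage.Type)
        {
            case ConsensusMessageType.Commit:
            case ConsensusMessageType.PrepareResponse:
                return MyRandom.Next(2) == 0;  // drop ~50%
            case ConsensusMessageType.ChangeView:
                return MyRandom.Next(4) != 0;  // drop ~25%
            default:
                return true;                   // deliver everything else
        }
    }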

@jsolman jsolman requested a review from erikzhang March 22, 2019 09:04
@jsolman (author) commented Mar 22, 2019:

Testing indicates this is working well so far. I've heard good things from NGD so far also. I look forward to hearing their final result.

@shargon (Member) commented Mar 22, 2019:

Good test, @jsolman!

@vncoelho (Member) commented Mar 22, 2019:

Blocks have been quite stable for almost 48 hours (resetting every 12) on a network with 2-second block times and low computational resources:

[attached image: block interval chart]

Some outliers can be seen, but those are mostly due to nodes lagging. Even one of the CNs sometimes lags.

@jsolman (author) commented Mar 23, 2019:

I've thought of a better solution than this. Expect an alternative PR within 48 hours that solves the same problem as this PR while being cleaner and maintaining liveness, and without needing #643.

@vncoelho (Member) commented Mar 24, 2019:

@jsolman, I like the idea of allowing committed nodes to move to higher views! I think it is a necessary assumption for the liveness of pBFT, as @igormcoelho has been advocating. It is a modification we should make after this one.

However, I think we still need #643, and also the lock protection for nodes with the ViewChanging flag, which was the main motivation of this PR.

We need to discuss this "special rule" in detail and think it through, to be sure it will not break our assumption of never creating double headers when there are fewer than f+1 malicious nodes.

We created an example in the thread above; let's discuss that.
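
One way to see why f+1 is the relevant bound (a back-of-envelope argument, not taken from the thread): with N = 3f+1 validators and M = 2f+1 commit signatures required per header, two conflicting headers at the same height would need 2M signatures in total, so the two signer sets must overlap in at least 2M - N validators:

    N = 3f + 1 (validators),  M = 2f + 1 (commit signatures per header)
    2M - N = (4f + 2) - (3f + 1) = f + 1

Hence double headers require at least f+1 double-signers, i.e. at least f+1 malicious nodes, matching the assumption above.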

@shargon (Member) commented Mar 25, 2019:

Give me a couple of days for testing; I am trying to gather some stats.

@vncoelho (Member) commented:

Go for it, @shargon. For me this is approved. I think this is the natural solution we have, considering the path we picked in our previous design of dBFT 2.0.

In addition, #643 has potential for "helping" (though not completely "preventing") a crashed node from contributing in a lower view than the last one it knew.

@erikzhang (Member) commented:

Are we going to merge this first? @jsolman @vncoelho

@vncoelho (Member) commented Mar 25, 2019:

Such a hard question, @erikzhang... haha. I will leave this "box of bees" with you 📦

I am joking, Erik. This PR has been tested and it solves the critical problem. Maybe it is not the best solution (which is partly a consequence of our design; whether it is optimal is something I cannot answer right now), because f nodes can stall the network if some of the committed nodes (from the M remaining) do not contribute to changing view.

I do not see a problem in merging this and discussing #653 later, even if that is after 1-2 weeks.

@vncoelho (Member) left a comment

@erikzhang, feel free to merge when NGD finishes testing. I think this should go to Testnet ASAP.

If unsatisfactory behavior is detected on Testnet, we move in another direction. But if it is OK, let's put it on Mainnet and keep the discussions going. Maybe in another couple of months/weeks/even days we can achieve an even more polished version.

@jsolman (author) commented Mar 25, 2019:

@erikzhang I agree with moving forward with this when you are ready. Other solutions like #653 still have issues and need further adjustments before they are ready; arguably they have other issues that could potentially be worse.

@erikzhang erikzhang merged commit 28443ef into master Mar 25, 2019
@vncoelho vncoelho deleted the consensus/preventStall branch March 25, 2019 16:27
@shargon (Member) commented Mar 25, 2019:

My tests:

[attached image: test results chart]

Thacryba pushed a commit to simplitech/neo that referenced this pull request Feb 17, 2020
Labels: critical (issues/bugs that need to be fixed ASAP)