
Stage 3 of dBFT (Commit) #320

Closed · wants to merge 16 commits

Conversation

@shargon (Member) commented Jul 16, 2018

  • First proposal for stage 3 of dBFT

TODO:

  • Be able to reproduce the fork issue
  • Test that this PR is working well
  • Test that this PR solves the problem

Fixes #193
Fixes neo-project/neo-node#219

@shargon shargon requested review from erikzhang and AshRolls July 16, 2018 13:12
@shargon (Member Author) commented Jul 16, 2018

Please @belane, @igormcoelho and @vncoelho review this too

@vncoelho (Member) commented Jul 16, 2018

Hi, @shargon, thanks for this.
Could you give us a brief description of stage 3 in the original Practical or Delegated BFT (http://pmg.csail.mit.edu/papers/osdi99.pdf)? Or is this an additional step that is now being envisioned?

I checked the code (it looks precise and to the point) and I gather that the Commits array will contain those who agreed with the block.
Is it now going to be included in the ConsensusMessage and appear in the MinerTransaction? If so, I suggest including some other minor info that will help us a lot in analyzing the network later on.


if (payload.ValidatorIndex >= context.Validators.Length) return;

(Member) commented:

Check whether the payload's ValidatorIndex could have changed in RequestGetBlocks().


@shargon (Member Author) commented Jul 16, 2018

@vncoelho it is exactly this paper, you are a smart guy :)

[image]

We need the commit phase to ensure that all nodes agree on the same block hash before they spread the block to the network.
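Roughly the idea, as a minimal sketch (not the code in this PR; `string` stands in for `UInt256` and `m` for the 2f + 1 threshold, so the names here are illustrative only):

```csharp
using System.Linq;

// Sketch of the commit gate: a validator only relays the block once at least m
// validators have committed to the *same* block hash.
class CommitGate
{
    readonly string[] commits;   // commit hash seen from each validator (null = none yet)
    readonly int m;              // agreement threshold (2f + 1)

    public CommitGate(int validatorCount, int m)
    {
        commits = new string[validatorCount];
        this.m = m;
    }

    // Returns true when the block may be released to the network.
    public bool OnCommitReceived(int validatorIndex, string blockHash)
    {
        if (validatorIndex < 0 || validatorIndex >= commits.Length) return false;
        if (commits[validatorIndex] != null) return false;   // ignore duplicate commits
        commits[validatorIndex] = blockHash;
        return commits.Count(h => h == blockHash) >= m;
    }
}
```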

@shargon shargon requested a review from snowypowers July 16, 2018 14:58
{
if (context.State.HasFlag(ConsensusState.BlockSent)) return;
(Member) commented:

Is this the "jump of the cat", Shargon? aheuahueahuea

{
if (context.State.HasFlag(ConsensusState.BlockSent)) return;
if (!context.TryToCommit(payload, message)) return;

if (context.Signatures.Count(p => p != null) >= context.M && context.TransactionHashes.All(p => context.Transactions.ContainsKey(p)))
(Member) commented:

Is this double check necessary? Because CheckSignatures() already checked it and called OnCommitAgreement.


if (Commits[payload.ValidatorIndex] != null)
{
return false;
(Member) commented:

Let's be consistent with your single-line ifs. I see some with curly brackets and some without.

{
if (!context.CommitAgreementSent)
{
if (context.Signatures.Count(p => p != null) >= context.M && context.TransactionHashes.All(p => context.Transactions.ContainsKey(p)))
(Member) commented:

Merge these nested ifs.

@@ -28,21 +28,63 @@ internal class ConsensusContext
public byte[] ExpectedView;
public KeyPair KeyPair;

private UInt256[] Commits;
private Block _header = null;
public bool CommitAgreementSent = false;
(Member) commented:

This bool looks like it should belong in ConsensusState

@vncoelho (Member) commented Jul 16, 2018:

That is true, Snowy. It does look like it.

@vncoelho (Member) commented:
Haduken. Good moves, Snowy and Shargon.

@vncoelho (Member) commented Jul 17, 2018

@shargon brothers, I am confused...
I just met @igormcoelho today and we were talking about those "forks" you mentioned.
Is this PR related to that?
In addition, is this "fork" related to the CN getting stuck and changing view several times without agreement?

If yes, I think that this may not be the right solution.
As far as I knew until now, phase 3 was already implemented by @erikzhang, right? (originally in the CheckSignatures method, https://github.com/neo-project/neo/blob/master/neo/Consensus/ConsensusService.cs#L88)
That problem of changing view in a loop is more a matter of fine-tuning the consensus times. We already got quite good results with some parameter adjustments while keeping the same structure, see #268.

@shargon (Member Author) commented Jul 17, 2018

Phase 3 is not implemented yet; it ensures that the other backup nodes have enough signatures to spread the block before you spread it yourself. Then you can't spread a block alone, change view, and fork.

@shargon (Member Author) commented Jul 17, 2018

It is compatible with your PR. Because this may slow down the network, we should add more improvements (like you told me), for example:

  • Compression of the P2P protocol (Gzip).

  • Store in the pool the date of receipt of each TX, so that a block does not accept TXs that have just arrived or arrived in the last 5 seconds (time TBD).

  • Furthermore, the system could consider the size of the incoming TX and stipulate whether it should be included or not.

  • Max block size of 2 MB (size TBD).

@belane (Member) commented Jul 18, 2018

It looks good, but it introduces more phases into the consensus and will carry a penalty in consensus time, which can make view changes more frequent.

We have to do more intensive tests, because a local laboratory will not reflect reality and the commit phase will be quick there.

*It may be a good idea to increase the timer by 2 seconds when the node reaches the commit phase. It may be worth waiting two more seconds rather than changing view and starting the whole process again. What do you think?
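For example, something along these lines (just a sketch of the idea; the 15 s base, the shift-based back-off, and the 2 s bonus are illustrative values, not code from this PR):

```csharp
using System;

// Sketch of belane's suggestion: give the node a couple of extra seconds once it
// has reached the commit phase, instead of changing view right away.
static class ConsensusTimerSketch
{
    static readonly TimeSpan TimePerBlock = TimeSpan.FromSeconds(15);
    static readonly TimeSpan CommitBonus  = TimeSpan.FromSeconds(2);   // value TBD

    // The usual exponential back-off per view, plus the bonus while committing.
    public static TimeSpan NextTimeout(byte viewNumber, bool inCommitPhase)
    {
        TimeSpan timeout = TimeSpan.FromTicks(TimePerBlock.Ticks << (viewNumber + 1));
        return inCommitPhase ? timeout + CommitBonus : timeout;
    }
}
```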

@vncoelho (Member) commented:
@shargon, thanks for this nice explanation, my friend.
About the compatibility, do not worry, I know it is all compatible. The point was about the necessity of this state even after CheckSignatures and a node broadcasting the relay.
But now that you have explained it, I see that this phase can surely avoid some unexpected situations.

@belane,
I think that the purpose of this PR should include the point that you mentioned, such as playing with the times.
From my point of view, it is unacceptable to make such changes and not "carve" the blocks in 15s ± 1s (something like that).
Let's merge the core idea of #268 (which is very simple), include the 2s extra that you mentioned, and advance the Primary speaker with average values from the last blocks. Then we can fine-tune to achieve the 15s blocks. Everyone knows it is possible, because lower block times can already be achieved. Thus, let's do it precisely now, before this PR is merged.
We should slightly redesign the times in order to ensure that view changes only happen in very unexpected cases and not due to delays or network instabilities.

@toghrulmaharram commented Jul 19, 2018

@vncoelho Something similar to Aardvark can be implemented. Aardvark requires the primary replica to achieve 90% of the throughput achieved during the previous N views. This will allow us to bring the block times down without fine-tuning the strict timing assumptions (a hardcoded maximum threshold should still be kept, though). However, some transparency will be required to monitor the performance without the nodes being able to game the system (even without any malicious intent). I would require the nodes to provide signed atomic clock times from time to time to prove that the local clocks did not drift and the nodes are synchronized.
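To make the Aardvark-style requirement concrete, a rough sketch (purely illustrative; nothing like this exists in the PR, and the 90% factor and window size are just the numbers mentioned above):

```csharp
using System.Collections.Generic;
using System.Linq;

// Sketch: demand that the current primary sustains at least 90% of the
// throughput observed over the previous N views.
class ThroughputMonitor
{
    readonly Queue<double> history = new Queue<double>();   // tx/s of recent views
    readonly int window;

    public ThroughputMonitor(int window) { this.window = window; }

    public void Record(double txPerSecond)
    {
        history.Enqueue(txPerSecond);
        if (history.Count > window) history.Dequeue();
    }

    // True if the primary of the current view is performing acceptably.
    public bool PrimaryAcceptable(double currentTxPerSecond)
    {
        if (history.Count == 0) return true;                 // nothing to compare against yet
        return currentTxPerSecond >= 0.9 * history.Average();
    }
}
```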

@vncoelho (Member) commented Jul 20, 2018

Thanks, @shargon, the magister, for the patience and great explanations.
Soon we are going to test this carefully. @igormcoelho is trying to finish something very interesting that may help considerably.

@toghrulmaharramov, I will try to read it during the next few days, thanks for the tip.
We are focusing on some quite subtle adjustments for now, which is surely the best approach for us to advance slowly but surely.
But be sure that your points are always precious.

@edwardzpeng commented:
I think just fully implementing the pBFT algorithm will solve this problem.

@erikzhang (Member) commented:
> Maybe a new consensus message for: hello, I am new here, please tell me where I am?

That's it!

@edwardzpeng commented:
I think the client-and-replicas model can be used here, just like in pBFT (but the detailed protocol should strictly follow the pBFT model, not like what we did before). If so, I think we can get a formal security proof of dBFT.

@toghrulmaharram commented Sep 18, 2018

@edwardz246003 PBFT was not created for decentralized platforms, as it only requires a leader change in case a majority of the nodes (2f + 1) deem the leader to be faulty or Byzantine. The protocol can be very slow in a decentralized network, plus it doesn't solve the issue of a Byzantine node propagating a different "valid" block.

@shargon (Member Author) commented Sep 18, 2018

@erikzhang if you want, I could post the flow chart for the fork here, because together is better.

@erikzhang (Member) commented:
> if you want, I could post the flow chart for the fork here, because together is better.

That is good.

@shargon (Member Author) commented Sep 18, 2018

I will describe the current problem here. All of this could be produced by network errors; it IS DIFFICULT, but it sometimes happens.

For this example we only have 4 consensus nodes; of course, it is harder with more nodes.

Step 1

The first step to produce the fork is the hard part:

  • Backup 2 (B2) hangs
  • The Primary has 3 signatures
  • B3 and B4 are not connected
  • B3 and B4 never receive signatures from the Primary
  • The Primary loses the connection

Result:

  • The Primary has the block B10-H1, but can't release it

[image]

Step 2

The second step is automatic:

  • The view changes because the leader doesn't respond
  • B2 recovers the connection
  • A new regular consensus round starts
  • The current consensus round (without Primary 1) releases the current block (B10-H2)
  • The previous primary releases his well-signed block

Result:

  • We have a fork

[image]

@shargon (Member Author) commented Sep 18, 2018

I think that the solution to all these problems is that the real signature should only be sent in the commit message.

@erikzhang (Member) commented:
> I think that the solution to all these problems is that the real signature should only be sent in the commit message.

Yes. But if we allow the view to be changed during the commit phase, it may also cause a fork.

@shargon (Member Author) commented Sep 18, 2018

Maybe we should change the view if we receive more than X view change messages

@erikzhang (Member) commented:
Consider a consensus network of 7 nodes with 2 malicious nodes. We name the nodes A, B, C, D, E, F, and G, in which A and B are malicious nodes and the others are good ones.

When A is primary, it sends PrepareRequest1 to nodes B, C, and D, and sends PrepareRequest2 to nodes E, F, and G.

Since B is a malicious node, it will send a PrepareResponse to both the [A, B, C, D] group and the [A, B, E, F, G] group. Then the [A, B, E, F, G] group will enter the commit phase and can create a Block1.

But A and B can withhold their commit messages and Block1. And then they request a view change. In the next view, they can work together and create a Block2.

After that, A and B can release Block1 and Block2 at the same height.

In order to prevent this attack, we must prohibit view changes during the commit phase. In this case, [E, F, G] does not change view, so [A, B, C, D] cannot generate a fork block.
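In code, the rule could look roughly like this (a sketch only; `CommitSent` is an illustrative flag name, not necessarily what the implementation will use):

```csharp
using System;

[Flags]
enum NodeState
{
    None        = 0,
    RequestSent = 1,
    CommitSent  = 2,   // the node has already revealed its real block signature
}

static class ViewChangeGuard
{
    // Once a node has sent its commit, it must refuse any view change for this
    // height; otherwise [A, B] could gather signatures in two different views
    // and release two blocks at the same height.
    public static bool MayChangeView(NodeState state)
    {
        return !state.HasFlag(NodeState.CommitSent);
    }
}
```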

@vncoelho (Member) commented Sep 18, 2018

First of all, as Erik said, thanks for the eagle eyes of @edwardz246003 and his team.
@erikzhang, I was discussing this yesterday with @igormcoelho.
We considered the idea of two layers of signatures as you suggested (one partial and one final), but we thought it would be a layer on top of a layer and more complicated. aehuaheauea
Then, in the end, we thought about the solution of signing the relay message (in essence, also another layer, but very computationally expensive).

But, as always, it seems that you are proposing the right thing again. If this solves the problem, it will be better than requiring nodes to double-check 2f + 1 blocks before persisting.

Please check whether the following reasoning is right. In summary:

  • [E, F, G] send PrepareResponse_2 (only with the partial signatures over PrepareRequest_2)
  • then, probably a rule that says: if more than 2f + 1 PrepareResponse_2 signatures are received -> send the full signature of the block.
  • In this sense, entering the commit phase would be synonymous with the block being signed. Thus, once the block is signed, changing view is no longer allowed (see the sketch after this comment).
  • it is not a problem, but [A, B], the malicious nodes, would still be able to send their signatures at any time (sending the full signature is optional).

However, the only problem I see, which is not a big problem, is:
*If [E, F] send their signatures in the commit phase and, coincidentally, the view changes, then [G] would change view.
*Then, in the next round (view 2), [A, B, C, D, G] would keep generating blocks.

  • We would probably need to reset [E, F] if they receive a valid block, right?
  • But we can imagine that, after that, the crazy [A] (we only need one malicious node now) will not sign a block anymore. Thus, even if [B, C, D, G] want to publish a block, they will never be able to. In this sense, f - 1 malicious nodes would be able to stall/stop the system.
  • We would need a way to recover [E, F] in that situation.

Maybe, as usual, I missed something... aehauheau Is this right?
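A sketch of the two-step signing flow described in the bullets above (illustrative names only, assuming the commit message is what carries the real block signature):

```csharp
using System.Linq;

// Step 1: collect PrepareResponses, which carry only preliminary signatures over
//         the proposal.
// Step 2: once M (= 2f + 1) of them agree, reveal the real block signature in a
//         Commit message and refuse any later view change for this height.
class TwoStepSigner
{
    readonly bool[] prepareResponses;   // which validators answered the proposal
    readonly int m;                     // 2f + 1 threshold
    public bool CommitSent { get; private set; }

    public TwoStepSigner(int validatorCount, int m)
    {
        prepareResponses = new bool[validatorCount];
        this.m = m;
    }

    // Returns true once this node has broadcast its real signature.
    public bool OnPrepareResponse(int validatorIndex)
    {
        prepareResponses[validatorIndex] = true;
        if (!CommitSent && prepareResponses.Count(r => r) >= m)
            CommitSent = true;          // the full block signature goes out here
        return CommitSent;
    }

    public bool MayChangeView() => !CommitSent;
}
```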

@erikzhang (Member) commented:
> If [E, F] send their signatures in the commit phase and, coincidentally, the view changes, then [G] would change view.
> Then, in the next round (view 2), [A, B, C, D, G] would keep generating blocks.

This won't result in a block fork.

@vncoelho (Member) commented Sep 18, 2018

Yes, you are right, it won't result in a fork; it would just be a case where network failures (delays, etc.) make it possible for "f - 1 malicious nodes to stop block generation". This is not a problem, because these delays should not, in essence, really be considered a Byzantine fault.
A "manual/automatic" reset of [E, F] would be needed.

I think this is not the point of this discussion, but I wrote it as a way of understanding and documenting the case (if I understood the steps correctly). :D

@erikzhang (Member) commented:
A "manual/automatic" reset to [E,F] would be needed.

When [A, B, C, D, G] create a bock, [E, F] will resume normal automatically.

@vncoelho (Member) commented Sep 18, 2018

You are right; if they produce a block, I think that [E, F] should resume normal operation automatically.

But there was that other case where [A] is malicious and does not want to sign anymore; then [B, C, D, G] would not be able to create a block anymore.
But I think that, in the future, some metrics can be designed. An RPC plugin could be built just to keep watching the consensus messages; then another tool would be able to monitor these behaviors.

This was just an insight; I think we do not need to worry about it now. But maybe, with this line of reasoning, we will find something else... 🗡️ aehuaheuaea

@erikzhang (Member) commented:
> But [A] is malicious, Erik. He will not want to sign anymore.

You are right.

In the case of a poor network, a malicious node can stop the consensus of the entire network.

In the above case, G changes the view because it cannot communicate with [E, F].

But this is better than a block fork, and if there is a network failure, it is acceptable to temporarily stop the consensus.

@vncoelho (Member) commented Sep 18, 2018

I also agree with you, Erik. As I said, this is a brainstorming discussion; maybe with it we can get insight into the future.
I will keep thinking about this ongoing solution (which may already be in its final design). Thanks for the teaching.

@shargon (Member Author) commented Sep 18, 2018

With a bad guy in the commit phase, you could only produce the current situation, not the fork. You need all the requirements mentioned here #320 (comment) (and the bad guy) to produce the fork, so a bad guy alone is not enough to produce the fork.

@vncoelho (Member) commented Sep 18, 2018

You are right, Sharrrgon. You are a good guy, Shargon, not the bad node guy 🗡️
Without you it would have been hard to understand/design the fork scripts, thank you, man! :D

@shargon, maybe it is better to implement these changes before merging, right? Or do you think it should be merged and then modified? If you need any help (and if I am able to contribute) let us know.
But maybe it is easier for you to port it to 2.9.0 and modify it with these aforementioned new ideas 📦 aheuaheuaheauea jajajajaj
I think that maybe we should all take some time, give it some time, and make it almost a "definitive improvement".

@igormcoelho (Contributor) commented Sep 18, 2018

Such a nice discussion here! At least, even if the current commit phase is kept and no view changes are allowed after this point, it guarantees that coordinated network problems won't cause forks anymore (which is rare), unless a bad agent tries to force the fork (by submitting the hash early, as @edwardz246003 realized), but that compromises its credibility during the voting process. We could still reach a situation where coordinated hardware issues (losing state) together with network issues cause a fork. This is possible, but at least I don't see it happening in the near future :) I mean, the problem is not fully solved, but it's a huge evolution already, congratulations again Shargon.

@shargon (Member Author) commented Oct 15, 2018

Next week we will start porting this to the Akka model :)

@longfeiWan9 (Member) commented:
Following @erikzhang and @vncoelho's example, I have noticed an issue with the current logic of the new commit phase.

Consider a consensus network of 7 nodes with 2 malicious nodes. We name the nodes A, B, C, D, E, F, and G, in which A and B are malicious nodes and the others are good ones.

When A is primary, it sends PrepareRequest1 to nodes B, C, and D, and sends PrepareRequest2 to nodes E, F, and G.

Since B is a malicious node, it will send a PrepareResponse to both the [A, B, C, D] group and the [A, B, E, F, G] group.

[A, B, C, D] will not enter the commit phase, but will request a view change. There are NOT enough votes to change view.

[A, B, E, F, G] will enter the commit phase, but [A, B] may withhold their signatures for the commit phase. Then [E, F, G]'s commit votes are not enough to publish this block, and [E, F, G] are not allowed to change view.

In this situation, the network stops producing blocks, which I think is a problem, right? Please see the following image.

[image: attack_stop_producing_block]

@shargon (Member Author) commented Oct 23, 2018

Closed to continue in #422

@shargon shargon closed this Oct 23, 2018
@shargon shargon deleted the dBFT-stage-3 branch October 23, 2018 12:04
@shargon shargon mentioned this pull request Oct 23, 2018
@erikzhang erikzhang added this to the NEO 3.0 milestone Jan 25, 2019
Thacryba pushed a commit to simplitech/neo that referenced this pull request Feb 17, 2020