
[DRAFT] Require locking agreement to ChangeView and allowing committed nodes to move to higher views #653

Closed
jsolman wants to merge 23 commits

Conversation

@jsolman (Contributor) commented Mar 24, 2019

This change maintains liveness with up to F failed nodes and prevents stalls due to commits getting stuck in lower views. It is an alternative solution to #642 to prevent stalling.

This change adds a new concept/phase of locking change views. A node must know of >= M ChangeView messages before it may send a ChangeView with the Locked flag set. In order to move to a new view, a node must see >= M ChangeViews with the Locked flag set whose new view is at or above the view to which it is changing.
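As a rough illustration of those two rules (a hypothetical sketch, not the PR's actual code: the ChangeViewInfo type, the field names, and the helpers are invented for the example):

using System.Collections.Generic;
using System.Linq;

// Hypothetical summary of a received ChangeView message, used only for this sketch.
public class ChangeViewInfo
{
    public bool Locked;          // the sender set the Locked flag on its ChangeView
    public byte NewViewNumber;   // the view the sender wants to move to
}

public static class ChangeViewLocking
{
    // Rule 1: a node may send ChangeView with the Locked flag only once it
    // already knows of >= M ChangeView messages.
    public static bool CanSendLockedChangeView(IReadOnlyCollection<ChangeViewInfo> known, int m)
        => known.Count >= m;

    // Rule 2: a node may move to newView only once it has seen >= M ChangeViews
    // that carry the Locked flag and target a view at or above newView.
    public static bool CanMoveToView(IEnumerable<ChangeViewInfo> known, byte newView, int m)
        => known.Count(cv => cv.Locked && cv.NewViewNumber >= newView) >= m;
}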

@jsolman (Contributor, Author) commented Mar 24, 2019

It appears the current code in master doesn't pass unit tests. These changes did not break the unit tests.

@igormcoelho (Contributor) commented:

You can update master now ;)

@vncoelho (Member) commented Mar 24, 2019

Maybe we need to lock prepare response when asking to change view as well (otherwise it may "fork"), but let's investigate carefully.

…t lock changing view if we have sent prepare response.
@jsolman (Contributor, Author) commented Mar 24, 2019

Maybe we need to lock prepare response when asking to change view as well, but let's investigate carefully.

I agree that we need to not allow locking the change view after sending a prepare response, to prevent the scenario of duplicate blocks, since this change now allows uncommitting from the lower view. I made that change.

@jsolman (Contributor, Author) commented Mar 24, 2019

I am testing this code now using:

// Namespace usings added here for completeness; they assume the neo 2.x layout of the time.
using System;
using Neo.Consensus;
using Neo.Network.P2P;
using Neo.Network.P2P.Payloads;
using Neo.Plugins;

// Test plugin that randomly drops consensus traffic to simulate a lossy network.
public class P2PLossyConsensus : Plugin, IP2PPlugin
{
    private static Random MyRandom = new Random();

    public override void Configure()
    {
    }

    public bool OnP2PMessage(Message message)
    {
        return true;
    }

    public bool OnConsensusMessage(ConsensusPayload payload)
    {
        // Drop roughly half of the Commit and PrepareResponse messages.
        if (payload.ConsensusMessage.Type == ConsensusMessageType.Commit ||
            payload.ConsensusMessage.Type == ConsensusMessageType.PrepareResponse)
        {
            return MyRandom.Next() % 2 == 0;
        }

        // Drop roughly a quarter of the PrepareRequest messages.
        if (payload.ConsensusMessage.Type == ConsensusMessageType.PrepareRequest)
        {
            return MyRandom.Next() % 4 > 0;
        }

        return true;
    }
}
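(Dropping Commit and PrepareResponse traffic at these rates is presumably what forces frequent view changes and nodes stuck in commit, i.e. the scenarios the locking rules above are meant to handle.)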

@jsolman (Contributor, Author) commented Mar 24, 2019

I believe this is now a major improvement over #642. In the general case, liveness will be maintained with these changes with fewer than F failed validators. There may be a couple of crash scenarios where it could stall with fewer than F failed validators, but those may be fixed if we introduce saving and restoring the context on top of these changes.

@vncoelho added the critical and potential enhancement labels Mar 24, 2019
@jsolman (Contributor, Author) commented Mar 24, 2019

It is working reasonably well in testing after the last commit; however, it would be better if the recovery messages were requested more often. I'll let tests run for a few hours and see how it is doing.

@jsolman (Contributor, Author) commented Mar 24, 2019

Something seems to be broken in the prepare request carried by the recovery message some of the time. Looking into it.

…ge is inhibitted if a validator has sent its preparation.
if (!knownHashes.Add(payload.Hash)) return;
// If we see a recovery message from a lower view we should recover them to our higher view; they may
// be stuck in commit.
if (!IsRecoveryAllowed(payload.ValidatorIndex, 1, context.F() + 1 ))
Member (review comment):

Double check when ready, because maybe this guy +1 is not in the view and thus would recover the guy anyway.

jsolman (Contributor, Author) replied:

At least one of the F+1 nodes must be in the correct view, or they never would have changed view.
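The body of IsRecoveryAllowed is not shown in this excerpt; judging only from its arguments (the sender's validator index, an offset of 1, and a count of F + 1), one plausible reading is that the F + 1 validators whose indices immediately follow the sender's are the ones allowed to answer with a recovery message, so the sender is not flooded by every node at once. A hypothetical, self-contained version of such a gate (with the validator count and the local index passed in explicitly, unlike the actual three-argument call above) could look like:

// Hypothetical reconstruction only; the PR's actual IsRecoveryAllowed presumably reads
// these extra values from the consensus context rather than taking them as parameters.
private static bool IsRecoveryAllowed(ushort senderIndex, int startOffset, int allowedCount,
                                      int validatorCount, int myIndex)
{
    // Allow the allowedCount validators whose indices follow the sender's
    // (wrapping around the validator set) to respond with a recovery message.
    for (int i = 0; i < allowedCount; i++)
    {
        int chosenIndex = (senderIndex + startOffset + i) % validatorCount;
        if (chosenIndex == myIndex) return true;
    }
    return false;
}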

@jsolman (Contributor, Author) commented Mar 25, 2019

Testing now seems to be reliable, whether running with all nodes or with F nodes shut down.

@erikzhang (Member) commented:

Why can it prevent stalls?

@shargon (Member) commented Mar 25, 2019

Give me a couple of days for testing; I am trying to make some stats.

@jsolman (Contributor, Author) commented Mar 25, 2019

There is still a flaw. I will have to pick this back up later. Others who are testing can stop for now until the issue is fixed.

Why can it prevent stalls?

The thought was that it is useful to have a mechanism for knowing how long to keep accepting preparations. By requiring >= M ChangeViews to be seen before a decision to change view is locked, a node can use that lock as the criterion for when the view is really changing and it can no longer accept prepare requests. The theory was that this should prevent stalling when F nodes are failed or offline: since the primary will never lock its change view, there won't be enough nodes to lock a change view while F nodes are failed.

Thinking about it now, though, I think this code still has an issue, because it allows the primary to send its change view unlocked, which will allow the others to lock their change views. The primary probably shouldn't ever send a change view message at all, unless the view changes and it is no longer the primary. With that adjustment, the other nodes keep accepting preparations with F nodes failed and the network should continue generating blocks. In #642, once F nodes failed, if one node had committed and the others had sent change views, they would no longer accept preparations due to ViewChanging and it would stall.
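To put some assumed numbers on that reasoning: with N = 7 validators, F = 2 and M = 5. If the primary never sends a ChangeView and F = 2 nodes are failed, at most 7 - 2 - 1 = 4 nodes can ever contribute ChangeView messages, which is below M = 5, so no node can lock a view change; the remaining nodes therefore keep accepting preparations and blocks keep being produced.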

@erikzhang (Member) commented:

In #642, once F nodes failed, if one node had committed and the others had sent change views, they would no longer accept preparations due to ViewChanging and it would stall.

This situation is acceptable, because if it gets stuck, we will call the node's maintainer and have them restart the node. Then consensus will be restored.

@vncoelho (Member) commented Mar 25, 2019

I am afraid that this change may be like "covering the sun with a sieve" (hiding the problem rather than fixing it).
My feeling is that nodes would stall during PrepareResponse under much the same conditions as the stall with 2 committed guys and 2 guys on view changing; it looks almost the same to me.

Furthermore, it looks like the commit phase is still needed.

Maybe we can merge #642 and #643 and keep this PR open for more experiments. I do not see a problem with changing master later.

@jsolman (Contributor, Author) commented Mar 25, 2019

This situation is acceptable, because if it gets stuck, we will call the node's maintainer and have them restart the node. Then consensus will be restored.

Wouldn't it be great if consensus still worked while you were trying to get hold of the maintainer whose node had failed? This helps achieve that.

@jsolman (Contributor, Author) commented Mar 25, 2019

I have no objection to merging #642 and then possibly accepting this later if testing shows it is better. #642 does not stall if all nodes are available, except in the rare case of nodes stuck in commits in earlier views due to poor network conditions; in that case the logs make it clear who is stuck.

These changes, though, should also never stall with all nodes available (once the bugs are worked out). They should also not stall with 1 node shut down (except if nodes shut down in certain edge case states; I still need to document exactly what those states are). In the general case with this PR, if you start the consensus nodes with F nodes shut down, it will create blocks and never stall. With #642, however, starting the consensus nodes with F nodes shut down stalls if F nodes commit and the other nodes are view changing.

In general, the operational burden is lower and the liveness of the system is better if the algorithm can continue to create blocks with up to F failed (crashed or shut off) nodes.

That being said, I believe this change needs more testing and documentation before it will be ready to be accepted, whereas #642 is basically well understood and tested at this point.

@jsolman (Contributor, Author) commented Mar 25, 2019

This still has more issues. Let’s go for #642 and revisit this later.

@jsolman closed this Mar 25, 2019
@vncoelho changed the title from "Restore Liveness: Prevent Commit after ChangeView. Require locking agreement to ChangeView." to "[DRAFT] Modify Liveness: Require locking agreement to ChangeView and allowing committed nodes to move to higher views" Mar 25, 2019
@vncoelho (Member) commented Mar 25, 2019

@jsolman, let's keep this open for some time, I think it has good insights.

In the worst case, if we do not use anything from here, it will help us reaffirm the quality of the current implementation.

@vncoelho reopened this Mar 25, 2019
@vncoelho changed the title from "[DRAFT] Modify Liveness: Require locking agreement to ChangeView and allowing committed nodes to move to higher views" to "[DRAFT] Require locking agreement to ChangeView (extra phase) and allowing committed nodes to move to higher views" Mar 25, 2019
@vncoelho changed the title from "[DRAFT] Require locking agreement to ChangeView (extra phase) and allowing committed nodes to move to higher views" to "[DRAFT] Require locking agreement to ChangeView and allowing committed nodes to move to higher views" Mar 25, 2019
@vncoelho (Member) commented Mar 25, 2019

Summary of the insights from this PR.

Motivation:
Currently, nodes can stall if f nodes are crashed/non-accessible and at least one of the M remaining nodes is committed.

Current modifications in PR #653:

  • An additional layer of "locked"/"unlocked" states during change view
  • The primary will not send a ChangeView
  • The primary and nodes that have already sent their response (ResponseSent) are not going to lock their ChangeViews, thus not really contributing to a view increase
  • Committed nodes are going to move to a higher view if anybody can provide M valid locked ChangeViews

Detected edge cases:

  • The network can still stall if nodes sent unlocked ChangeViews before dying (f nodes dead and one of them sent an unlocked ChangeView; that would stall until some are brought back by the operators) (one possible reading of this case is worked through below);
  • Other possibilities that need to be analysed.
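One possible reading of that first edge case, again assuming N = 7 (F = 2, M = 5) and a primary that never sends ChangeView: if one of the 2 dead nodes sent an unlocked ChangeView before dying, the 4 live backups plus that message add up to 5 = M known ChangeViews, so the live backups can lock their view change and stop accepting preparations; but only those 4 live backups can ever produce locked ChangeViews, which is below M = 5, so they can never actually move to the next view either, and the network stalls until an operator brings a node back.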

@vncoelho (Member) commented:

@erikzhang and @igormcoelho, do you think it is worth keeping this discussion going? Or does this approach have a low chance of success?

@vncoelho removed the critical label Mar 25, 2019
@shargon (Member) commented Mar 25, 2019

My tests:

[image: test results]

@vncoelho (Member) commented:

Very good, @shargon. 🗡️
The idea is exactly this. Now that we all have better tools for analyzing and testing, it is useful to keep this draft PR around so the discussion stays alive.

This PR still has some problems and it is still stalling sometimes.

@vncoelho closed this Mar 27, 2019
@vncoelho (Member) commented:

Let's discuss this off the PR and implement it carefully.

@jsolman (Contributor, Author) commented Mar 27, 2019

@vncoelho sounds good; that was my thought as well when I closed it earlier.

@vncoelho (Member) commented:

Hope is the last to die 💃

@vncoelho deleted the consensus/liveness branch May 6, 2019 13:14
Thacryba pushed a commit to simplitech/neo that referenced this pull request Feb 17, 2020