
Invariant MoreUpToDateCorrectInv violated in TLA+ specification #3837

Closed
heidihoward opened this issue May 13, 2022 · 10 comments

@heidihoward
Member

Describe the bug
TLC found a violation of the MoreUpToDateCorrectInv invariant when checking the current TLA+ specification in https://github.com/microsoft/CCF/blob/main/tla/raft_spec/ccfraft.tla.

Note that this is a different TLA+ specification (and thus a different issue) from #3828.

To Reproduce
Use TLC to check the current spec in https://github.com/microsoft/CCF/blob/main/tla/raft_spec/MCraft.tla using a term limit of 3. As you can see from the output below, TLC took >2 days to find this bug (after checking 37 billion states) on a well-resourced machine, so reproducing it will require some patience.

Expected behavior
TLC should have completed checking without finding a safety violation.

Environment information
Host: Azure standard_HB120rs_v3
OS: Ubuntu 20.04 LTS
TLC: TLC2 Version 2.17 of 02 February 2022

Additional context
The final output & counterexample provided by TLC is copied below:

State 27: <ClientRequest line 351, col 5 to line 362, col 80 of module ccfraft>
/\ ReconfigurationCount = 1
/\ committedLogDecrease = FALSE
/\ messages = { [ mtype |-> RequestVoteRequest,
    mterm |-> 2,
    mlastLogTerm |-> 0,
    mlastLogIndex |-> 0,
    msource |-> NodeTwo,
    mdest |-> NodeThree ] }
/\ clientRequests = 3
/\ matchIndex = ( NodeOne :>
      ( NodeOne :> 0 @@
        NodeTwo :> 0 @@
        NodeThree :> 0 @@
        NodeFour :> 0 @@
        NodeFive :> 0 ) @@
  NodeTwo :>
      ( NodeOne :> 2 @@
        NodeTwo :> 0 @@
        NodeThree :> 0 @@
        NodeFour :> 0 @@
        NodeFive :> 0 ) @@
  NodeThree :>
      ( NodeOne :> 0 @@
        NodeTwo :> 0 @@
        NodeThree :> 0 @@
        NodeFour :> 0 @@
        NodeFive :> 0 ) @@
  NodeFour :>
      ( NodeOne :> 0 @@
        NodeTwo :> 0 @@
        NodeThree :> 0 @@
        NodeFour :> 0 @@
        NodeFive :> 0 ) @@
  NodeFive :>
      ( NodeOne :> 0 @@
        NodeTwo :> 0 @@
        NodeThree :> 0 @@
        NodeFour :> 0 @@
        NodeFive :> 0 ) )
/\ messagesSent = ( NodeOne :>
      ( NodeOne :> <<>> @@
        NodeTwo :> <<>> @@
        NodeThree :> <<>> @@
        NodeFour :> <<>> @@
        NodeFive :> <<>> ) @@
  NodeTwo :>
      ( NodeOne :> <<1, 1>> @@
        NodeTwo :> <<>> @@
        NodeThree :> <<>> @@
        NodeFour :> <<>> @@
        NodeFive :> <<>> ) @@
  NodeThree :>
      ( NodeOne :> <<>> @@
        NodeTwo :> <<>> @@
        NodeThree :> <<>> @@
        NodeFour :> <<>> @@
        NodeFive :> <<>> ) @@
  NodeFour :>
      ( NodeOne :> <<>> @@
        NodeTwo :> <<>> @@
        NodeThree :> <<>> @@
        NodeFour :> <<>> @@
        NodeFive :> <<>> ) @@
  NodeFive :>
      ( NodeOne :> <<>> @@
        NodeTwo :> <<>> @@
        NodeThree :> <<>> @@
        NodeFour :> <<>> @@
        NodeFive :> <<>> ) )
/\ log = ( NodeOne :>
      << [term |-> 2, contentType |-> TypeReconfiguration, value |-> {NodeTwo}],
         [term |-> 2, contentType |-> TypeSignature, value |-> 0],
         [term |-> 3, contentType |-> TypeEntry, value |-> 2] >> @@
  NodeTwo :>
      << [term |-> 2, contentType |-> TypeReconfiguration, value |-> {NodeTwo}],
         [term |-> 2, contentType |-> TypeSignature, value |-> 0],
         [term |-> 2, contentType |-> TypeEntry, value |-> 1],
         [term |-> 2, contentType |-> TypeSignature, value |-> 1] >> @@
  NodeThree :> <<>> @@
  NodeFour :> <<>> @@
  NodeFive :> <<>> )
/\ state = ( NodeOne :> Leader @@
  NodeTwo :> Leader @@
  NodeThree :> Follower @@
  NodeFour :> Pending @@
  NodeFive :> Pending )
/\ votesRequested = ( NodeOne :>
      ( NodeOne :> 0 @@
        NodeTwo :> 0 @@
        NodeThree :> 1 @@
        NodeFour :> 0 @@
        NodeFive :> 0 ) @@
  NodeTwo :>
      ( NodeOne :> 1 @@
        NodeTwo :> 0 @@
        NodeThree :> 1 @@
        NodeFour :> 0 @@
        NodeFive :> 0 ) @@
  NodeThree :>
      ( NodeOne :> 0 @@
        NodeTwo :> 0 @@
        NodeThree :> 0 @@
        NodeFour :> 0 @@
        NodeFive :> 0 ) @@
  NodeFour :>
      ( NodeOne :> 0 @@
        NodeTwo :> 0 @@
        NodeThree :> 0 @@
        NodeFour :> 0 @@
        NodeFive :> 0 ) @@
  NodeFive :>
      ( NodeOne :> 0 @@
        NodeTwo :> 0 @@
        NodeThree :> 0 @@
        NodeFour :> 0 @@
        NodeFive :> 0 ) )
/\ commitIndex = ( NodeOne :> 0 @@
  NodeTwo :> 4 @@
  NodeThree :> 0 @@
  NodeFour :> 0 @@
  NodeFive :> 0 )
/\ currentTerm = ( NodeOne :> 3 @@
  NodeTwo :> 2 @@
  NodeThree :> 3 @@
  NodeFour :> 1 @@
  NodeFive :> 1 )
/\ committedLog = << [term |-> 2, contentType |-> TypeReconfiguration, value |-> {NodeTwo}],
   [term |-> 2, contentType |-> TypeSignature, value |-> 0],
   [term |-> 2, contentType |-> TypeEntry, value |-> 1],
   [term |-> 2, contentType |-> TypeSignature, value |-> 1] >>
/\ nextIndex = ( NodeOne :>
      ( NodeOne :> 3 @@
        NodeTwo :> 3 @@
        NodeThree :> 3 @@
        NodeFour :> 3 @@
        NodeFive :> 3 ) @@
  NodeTwo :>
      ( NodeOne :> 3 @@
        NodeTwo :> 1 @@
        NodeThree :> 1 @@
        NodeFour :> 1 @@
        NodeFive :> 1 ) @@
  NodeThree :>
      ( NodeOne :> 1 @@
        NodeTwo :> 1 @@
        NodeThree :> 1 @@
        NodeFour :> 1 @@
        NodeFive :> 1 ) @@
  NodeFour :>
      ( NodeOne :> 1 @@
        NodeTwo :> 1 @@
        NodeThree :> 1 @@
        NodeFour :> 1 @@
        NodeFive :> 1 ) @@
  NodeFive :>
      ( NodeOne :> 1 @@
        NodeTwo :> 1 @@
        NodeThree :> 1 @@
        NodeFour :> 1 @@
        NodeFive :> 1 ) )
/\ votesGranted = ( NodeOne :> {NodeOne, NodeThree} @@
  NodeTwo :> {NodeOne, NodeTwo} @@
  NodeThree :> {} @@
  NodeFour :> {} @@
  NodeFive :> {} )
/\ votesSent = ( NodeOne :> TRUE @@
  NodeTwo :> TRUE @@
  NodeThree :> FALSE @@
  NodeFour :> FALSE @@
  NodeFive :> FALSE )
/\ votedFor = ( NodeOne :> NodeOne @@
  NodeTwo :> NodeTwo @@
  NodeThree :> NodeOne @@
  NodeFour :> Nil @@
  NodeFive :> Nil )
/\ Configurations = ( NodeOne :> <<<<0, {NodeOne, NodeTwo, NodeThree}>>, <<1, {NodeTwo}>>>> @@
  NodeTwo :> <<<<1, {NodeTwo}>>>> @@
  NodeThree :> <<<<0, {NodeOne, NodeTwo, NodeThree}>>>> @@
  NodeFour :> <<<<0, {NodeOne, NodeTwo, NodeThree}>>>> @@
  NodeFive :> <<<<0, {NodeOne, NodeTwo, NodeThree}>>>> )
/\ commitsNotified = ( NodeOne :> <<0, 0>> @@
  NodeTwo :> <<0, 0>> @@
  NodeThree :> <<0, 0>> @@
  NodeFour :> <<0, 0>> @@
  NodeFive :> <<0, 0>> )

37797101298 states generated, 8813296222 distinct states found, 4094258023 states left on queue.
The depth of the complete state graph search is 29.
The average outdegree of the complete state graph is 2 (minimum is 0, the maximum 31 and the 95th percentile is 2).
Finished in 2d 05h at (2022-05-12 22:38:16)
@heidihoward heidihoward added the tla TLA+ specifications label May 13, 2022
@heidihoward heidihoward self-assigned this May 13, 2022
@achamayou
Member

I am probably missing something here, but

/\ votesRequested = ( NodeOne :>
      ( NodeOne :> 0 @@
        NodeTwo :> 0 @@
        NodeThree :> 1 @@
        NodeFour :> 0 @@
        NodeFive :> 0 ) @@

suggests node 3 is running for election. But

/\ votesGranted = ( NodeOne :> {NodeOne, NodeThree} @@
  NodeTwo :> {NodeOne, NodeTwo} @@
  NodeThree :> {} @@
  NodeFour :> {} @@
  NodeFive :> {} )

suggests that other nodes are receiving votes!?

@heidihoward
Member Author

votesRequested is first indexed by the candidate sending the RequestVote message and then by the node which received the message (it ignores candidates voting for themselves).

So the following state means that node 1 requested a vote from node 3:

/\ votesRequested = ( NodeOne :>
      ( NodeOne :> 0 @@
        NodeTwo :> 0 @@
        NodeThree :> 1 @@
        NodeFour :> 0 @@
        NodeFive :> 0 ) @@

Likewise, votesGranted is indexed by the candidate who was granted the vote, so the following means that node 1 received a vote from itself and from node 3.

/\ votesGranted = ( NodeOne :> {NodeOne, NodeThree} @@

I can update the current documentation of votesRequested to make this clearer.

\* State space limitation: Restrict each node to send a limited amount
\* of requests to other nodes
VARIABLE votesRequested

I could also add an invariant that checks that if a vote was granted (except by the candidate itself) then it must have been requested.
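
A minimal sketch of such an invariant (Servers here is assumed to be the spec's constant for the full set of nodes, and the operator name is made up for illustration):

\* Sketch only: a vote granted to candidate i by any node j other than i
\* itself must have been preceded by a vote request from i to j.
VotesGrantedWereRequestedInv ==
    \A i \in Servers :
        \A j \in votesGranted[i] \ {i} : votesRequested[i][j] > 0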

@heidihoward
Member Author

This issue seems to have been caused by a problem with the logic for determining what set of servers constitutes a quorum when a candidate is trying to become a leader whilst having an uncommitted but signed reconfiguration tx in its log.

In this particular case, node 1 becomes leader in term 3 after receiving votes from itself and node 3. However, the first entry in its log is a reconfiguration transaction which changes the set of servers to just node 2. This transaction has been committed by node 2 (and thus the configuration has changed) but node 1 does not know that yet. The full trace is here (ccf-reconfig-bug.txt) if anyone is interested.

I believe a candidate should get a quorum in each of the current configurations before becoming a leader. The TLA+ spec instead computes the union of servers across all active configurations and requires a single quorum over that combined set. The code is below:

/\ \* To become leader, a Quorum of _all_ known nodes must have voted for this server (across configurations)
LET relevantServers == GetServerSet(i)
IN (votesGranted[i] \cap relevantServers) \in CalculateQuorum(relevantServers)

This explains why node 1 was able to become a leader without a vote from node 2. This approach could cause other issues as well: for instance, if a system was switching from nodes 1, 2 and 3 to nodes 4, 5 and 6, then a new leader could be elected with only the support of nodes 3, 4, 5 and 6, leaving a majority quorum (nodes 1 and 2) in the old configuration which might not know about the new configuration.

I have some thoughts about how to patch this in the spec, but I would like to discuss it first to work out what CCF is doing in practice.
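
For concreteness, a per-configuration variant of the check above could look roughly like this (a sketch only, not a tested patch; it assumes Configurations[i] is a sequence of <<index, serverSet>> pairs, as in the trace above, and reuses the spec's existing CalculateQuorum):

/\ \* Sketch: require a quorum of votes in every active configuration of i
   \A c \in DOMAIN Configurations[i] :
       (votesGranted[i] \cap Configurations[i][c][2])
           \in CalculateQuorum(Configurations[i][c][2])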

@achamayou
Member

@heidihoward the implementation does use a quorum across all active nodes, as opposed to a quorum in each active configuration. I have not been able to find any notes about why that choice was made, but I remember it was discussed at least once.

I suspect MoreUpToDateCorrectInv does not hold with the currently implemented election scheme, which is based on committable watermarks rather than committed watermarks: https://github.com/microsoft/CCF/blob/main/src/consensus/aft/raft.h#L1642

The best entry point for history on why this change was made is probably #589.

@heidihoward
Member Author

So I think there are two distinct issues here which are worth considering separately:

  1. I am curious about the idea of quorums across all active configurations instead of a quorum in each active configuration. It is not obvious to me why this is safe. As demonstrated by node 1 in the example, it violates the constraint that if a node becomes a leader in some term t then its log contains all entries which were committed in terms <t. This constraint is useful as it is the basis for why it is safe for followers to overwrite log entries if they are told to do so by the leader (as the leader will already know all the committed entries).
  2. I agree that MoreUpToDateCorrectInv doesn't match the current election scheme. In my mind, txs after the last signed tx are provisional and are ignored for the purpose of leader election. This invariant can be fixed by using the last signed index instead of the last index (see the sketch below).
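
For illustration, a helper such a fix could build on (a sketch only; the operator name and the CHOOSE-based maximum are my assumptions, not taken from the spec):

\* Sketch only: index of the last signed (committable) entry in i's log,
\* or 0 if there is none; contentType/TypeSignature are as in the trace above.
LastSignedIndex(i) ==
    LET signed == {n \in DOMAIN log[i] : log[i][n].contentType = TypeSignature}
    IN IF signed = {} THEN 0
       ELSE CHOOSE n \in signed : \A m \in signed : m <= n

The up-to-date comparison in the invariant would then use this index (and the term of the entry at this index) in place of the last index of the log.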

@achamayou
Member

achamayou commented May 26, 2022

In the diagram below, C represents the commit level (the last commit_idx received from the primary) and c represents the committable level (the last_idx containing a valid signature from the primary).

R represents a reconfiguration transaction from (0, 1, 2) to (2).

Node 2 was primary until a partition occurred.

Node 0: |--------C
Node 1: |--------C-R-c
--------------------------------------
Node 2: |----------R-C----c

In the "each" election mode (a quorum in each active configuration), Node 2 elects itself easily and continues. Node 0 and Node 1 continuously run elections, unsuccessfully, since they can never get a majority in (2).

In the "across" election mode (a quorum across all active nodes), Node 2 also elects itself and continues. But Node 0 and Node 1 elect Node 1, which replicates its log. R commits, which signals the removal of Node 0 and Node 1. Retirement details.

We think the latter is preferable, because:

  1. It does not seem to create a possibility for a fork: if R is committable, it will eventually be committed, and the minority partition will shut down when it is committed there.
  2. The minority partition does catch up. If Node 2 were to crash immediately post-partition, after having advertised a commit level further than Node 0 and Node 1 have advertised, the minority partition would never reach it if it used "each" elections. But with "across" it will.

Edit: while I think this example is useful in illustrating why it seems good for elections to work this way, it does not illustrate a breach of MoreUpToDateCorrectInv as I understand it, because both post-partition primaries have in their respective logs all the entries that their electors believe are committed. So I need to go read the trace again.

@achamayou
Member

achamayou commented May 26, 2022

Ok, so it seems that:

/\ log = ( NodeOne :>
      << [term |-> 2, contentType |-> TypeReconfiguration, value |-> {NodeTwo}],
         [term |-> 2, contentType |-> TypeSignature, value |-> 0],
         [term |-> 3, contentType |-> TypeEntry, value |-> 2] >> @@

despite:

/\ Configurations = ( NodeOne :> <<<<0, {NodeOne, NodeTwo, NodeThree}>>, <<1, {NodeTwo}>>>> @@

Node 1 is emitting an Entry after observing its reconfiguration becoming committable! That can't happen in the current implementation: https://github.com/microsoft/CCF/blob/main/src/consensus/aft/raft.h#L251

@heidihoward
Member Author

I think at the heart of this discussion is the definition of committable. My understanding is that a committable transaction (one that is followed by a signed transaction) is not guaranteed to eventually be committed (and is thus not guaranteed to be on a majority of replicas). This differs from your claim that if R is committable then it will eventually happen.

I think the problem with quorums across committable configurations is that a leader can be elected without the support of a majority quorum in the current configuration. Consider the set of replicas {n0,n1,n2}:

  1. n0 adds a reconfiguration tx for switching to the configuration {n0,n3,n4}, followed by a signature.
  2. n0 stops being the leader before replicating this reconfiguration tx to n1 or n2.
  3. n0 becomes a candidate again. n0 now believes that it can become a leader with just the support of itself, n3 and n4 (as the full set of active nodes is {n0,n1,n2,n3,n4}). In other words, n0 can become a leader without a majority quorum in the old configuration {n0,n1,n2} (see the sketch after this list).
  4. Meanwhile, n1 and n2 (which have no knowledge of the reconfiguration) elect node n1 to be the leader in the same term.
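
To make the quorum arithmetic in step 3 concrete, here is a small stand-alone sketch (the Quorums operator below is a plain majority and is only assumed to match the spirit of the spec's CalculateQuorum; node names are the ones from the steps above):

---- MODULE QuorumCheckExample ----
EXTENDS FiniteSets

Quorums(S) == {q \in SUBSET S : Cardinality(q) * 2 > Cardinality(S)}

OldConfig == {"n0", "n1", "n2"}
NewConfig == {"n0", "n3", "n4"}
AllActive == OldConfig \cup NewConfig   \* the five nodes n0 considers active
Votes     == {"n0", "n3", "n4"}         \* the votes n0 can gather

AcrossOK  == Votes \in Quorums(AllActive)                   \* TRUE: 3 of 5 is a majority
PerOldOK  == (Votes \cap OldConfig) \in Quorums(OldConfig)  \* FALSE: only n0 from the old configuration
====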

@achamayou
Member

I think at the heart of this discussion is the definition of committable. My understanding is that a committable transaction (one that is followed by a signed transaction) is not guaranteed to eventually be committed (and is thus not guaranteed to be on a majority of replicas).

Yes, that's correct.

This differs from your claim that if R is committable then it will eventually happen.

That claim on its own is definitely incorrect. I think it holds in the context of the example I gave, but I agree not in general, and in particular not in the example you have given.

As discussed, this has unpleasant implications for converging back to a previously advertised commit in at least some scenarios, like the one in my example. Clearly though, ending up with two actual leaders (not just a caretaker and a real leader) is much worse!

@heidihoward
Member Author

Fixed by #3965
