
Invariant MoreUpToDateCorrectInv violated in TLA+ specification #3837

Closed
heidihoward opened this issue May 13, 2022 · 10 comments

@heidihoward
Member

Describe the bug
TLC found a violation of the MoreUpToDateCorrectInv invariant when checking the current TLA+ specification in https://github.com/microsoft/CCF/blob/main/tla/raft_spec/ccfraft.tla.

Note that this is a different TLA+ specification (and thus a different issue) from #3828.

To Reproduce
Use TLC to check the current spec in https://github.com/microsoft/CCF/blob/main/tla/raft_spec/MCraft.tla using a term limit of 3. As you can see from the output below, TLC took >2 days to find this bug (after checking 37 billion states) on a well-resourced machine, so reproducing it will require some patience.

Expected behavior
TLC should have completed checking without finding a safety violation.

Environment information
Host: Azure standard_HB120rs_v3
OS: Ubuntu 20.04 LTS
TLC: TLC2 Version 2.17 of 02 February 2022

Additional context
The final output & counterexample provided by TLC is copied below:

State 27: <ClientRequest line 351, col 5 to line 362, col 80 of module ccfraft>
/\ ReconfigurationCount = 1
/\ committedLogDecrease = FALSE
/\ messages = { [ mtype |-> RequestVoteRequest,
    mterm |-> 2,
    mlastLogTerm |-> 0,
    mlastLogIndex |-> 0,
    msource |-> NodeTwo,
    mdest |-> NodeThree ] }
/\ clientRequests = 3
/\ matchIndex = ( NodeOne :>
      ( NodeOne :> 0 @@
        NodeTwo :> 0 @@
        NodeThree :> 0 @@
        NodeFour :> 0 @@
        NodeFive :> 0 ) @@
  NodeTwo :>
      ( NodeOne :> 2 @@
        NodeTwo :> 0 @@
        NodeThree :> 0 @@
        NodeFour :> 0 @@
        NodeFive :> 0 ) @@
  NodeThree :>
      ( NodeOne :> 0 @@
        NodeTwo :> 0 @@
        NodeThree :> 0 @@
        NodeFour :> 0 @@
        NodeFive :> 0 ) @@
  NodeFour :>
      ( NodeOne :> 0 @@
        NodeTwo :> 0 @@
        NodeThree :> 0 @@
        NodeFour :> 0 @@
        NodeFive :> 0 ) @@
  NodeFive :>
      ( NodeOne :> 0 @@
        NodeTwo :> 0 @@
        NodeThree :> 0 @@
        NodeFour :> 0 @@
        NodeFive :> 0 ) )
/\ messagesSent = ( NodeOne :>
      ( NodeOne :> <<>> @@
        NodeTwo :> <<>> @@
        NodeThree :> <<>> @@
        NodeFour :> <<>> @@
        NodeFive :> <<>> ) @@
  NodeTwo :>
      ( NodeOne :> <<1, 1>> @@
        NodeTwo :> <<>> @@
        NodeThree :> <<>> @@
        NodeFour :> <<>> @@
        NodeFive :> <<>> ) @@
  NodeThree :>
      ( NodeOne :> <<>> @@
        NodeTwo :> <<>> @@
        NodeThree :> <<>> @@
        NodeFour :> <<>> @@
        NodeFive :> <<>> ) @@
  NodeFour :>
      ( NodeOne :> <<>> @@
        NodeTwo :> <<>> @@
        NodeThree :> <<>> @@
        NodeFour :> <<>> @@
        NodeFive :> <<>> ) @@
  NodeFive :>
      ( NodeOne :> <<>> @@
        NodeTwo :> <<>> @@
        NodeThree :> <<>> @@
        NodeFour :> <<>> @@
        NodeFive :> <<>> ) )
/\ log = ( NodeOne :>
      << [term |-> 2, contentType |-> TypeReconfiguration, value |-> {NodeTwo}],
         [term |-> 2, contentType |-> TypeSignature, value |-> 0],
         [term |-> 3, contentType |-> TypeEntry, value |-> 2] >> @@
  NodeTwo :>
      << [term |-> 2, contentType |-> TypeReconfiguration, value |-> {NodeTwo}],
         [term |-> 2, contentType |-> TypeSignature, value |-> 0],
         [term |-> 2, contentType |-> TypeEntry, value |-> 1],
         [term |-> 2, contentType |-> TypeSignature, value |-> 1] >> @@
  NodeThree :> <<>> @@
  NodeFour :> <<>> @@
  NodeFive :> <<>> )
/\ state = ( NodeOne :> Leader @@
  NodeTwo :> Leader @@
  NodeThree :> Follower @@
  NodeFour :> Pending @@
  NodeFive :> Pending )
/\ votesRequested = ( NodeOne :>
      ( NodeOne :> 0 @@
        NodeTwo :> 0 @@
        NodeThree :> 1 @@
        NodeFour :> 0 @@
        NodeFive :> 0 ) @@
  NodeTwo :>
      ( NodeOne :> 1 @@
        NodeTwo :> 0 @@
        NodeThree :> 1 @@
        NodeFour :> 0 @@
        NodeFive :> 0 ) @@
  NodeThree :>
      ( NodeOne :> 0 @@
        NodeTwo :> 0 @@
        NodeThree :> 0 @@
        NodeFour :> 0 @@
        NodeFive :> 0 ) @@
  NodeFour :>
      ( NodeOne :> 0 @@
        NodeTwo :> 0 @@
        NodeThree :> 0 @@
        NodeFour :> 0 @@
        NodeFive :> 0 ) @@
  NodeFive :>
      ( NodeOne :> 0 @@
        NodeTwo :> 0 @@
        NodeThree :> 0 @@
        NodeFour :> 0 @@
        NodeFive :> 0 ) )
/\ commitIndex = ( NodeOne :> 0 @@
  NodeTwo :> 4 @@
  NodeThree :> 0 @@
  NodeFour :> 0 @@
  NodeFive :> 0 )
/\ currentTerm = ( NodeOne :> 3 @@
  NodeTwo :> 2 @@
  NodeThree :> 3 @@
  NodeFour :> 1 @@
  NodeFive :> 1 )
/\ committedLog = << [term |-> 2, contentType |-> TypeReconfiguration, value |-> {NodeTwo}],
   [term |-> 2, contentType |-> TypeSignature, value |-> 0],
   [term |-> 2, contentType |-> TypeEntry, value |-> 1],
   [term |-> 2, contentType |-> TypeSignature, value |-> 1] >>
/\ nextIndex = ( NodeOne :>
      ( NodeOne :> 3 @@
        NodeTwo :> 3 @@
        NodeThree :> 3 @@
        NodeFour :> 3 @@
        NodeFive :> 3 ) @@
  NodeTwo :>
      ( NodeOne :> 3 @@
        NodeTwo :> 1 @@
        NodeThree :> 1 @@
        NodeFour :> 1 @@
        NodeFive :> 1 ) @@
  NodeThree :>
      ( NodeOne :> 1 @@
        NodeTwo :> 1 @@
        NodeThree :> 1 @@
        NodeFour :> 1 @@
        NodeFive :> 1 ) @@
  NodeFour :>
      ( NodeOne :> 1 @@
        NodeTwo :> 1 @@
        NodeThree :> 1 @@
        NodeFour :> 1 @@
        NodeFive :> 1 ) @@
  NodeFive :>
      ( NodeOne :> 1 @@
        NodeTwo :> 1 @@
        NodeThree :> 1 @@
        NodeFour :> 1 @@
        NodeFive :> 1 ) )
/\ votesGranted = ( NodeOne :> {NodeOne, NodeThree} @@
  NodeTwo :> {NodeOne, NodeTwo} @@
  NodeThree :> {} @@
  NodeFour :> {} @@
  NodeFive :> {} )
/\ votesSent = ( NodeOne :> TRUE @@
  NodeTwo :> TRUE @@
  NodeThree :> FALSE @@
  NodeFour :> FALSE @@
  NodeFive :> FALSE )
/\ votedFor = ( NodeOne :> NodeOne @@
  NodeTwo :> NodeTwo @@
  NodeThree :> NodeOne @@
  NodeFour :> Nil @@
  NodeFive :> Nil )
/\ Configurations = ( NodeOne :> <<<<0, {NodeOne, NodeTwo, NodeThree}>>, <<1, {NodeTwo}>>>> @@
  NodeTwo :> <<<<1, {NodeTwo}>>>> @@
  NodeThree :> <<<<0, {NodeOne, NodeTwo, NodeThree}>>>> @@
  NodeFour :> <<<<0, {NodeOne, NodeTwo, NodeThree}>>>> @@
  NodeFive :> <<<<0, {NodeOne, NodeTwo, NodeThree}>>>> )
/\ commitsNotified = ( NodeOne :> <<0, 0>> @@
  NodeTwo :> <<0, 0>> @@
  NodeThree :> <<0, 0>> @@
  NodeFour :> <<0, 0>> @@
  NodeFive :> <<0, 0>> )

37797101298 states generated, 8813296222 distinct states found, 4094258023 states left on queue.
The depth of the complete state graph search is 29.
The average outdegree of the complete state graph is 2 (minimum is 0, the maximum 31 and the 95th percentile is 2).
Finished in 2d 05h at (2022-05-12 22:38:16)
@heidihoward heidihoward added the tla TLA+ specifications label May 13, 2022
@heidihoward heidihoward self-assigned this May 13, 2022
@achamayou
Member

I am probably missing something here, but

/\ votesRequested = ( NodeOne :>
      ( NodeOne :> 0 @@
        NodeTwo :> 0 @@
        NodeThree :> 1 @@
        NodeFour :> 0 @@
        NodeFive :> 0 ) @@

suggests node 3 is running for election. But

/\ votesGranted = ( NodeOne :> {NodeOne, NodeThree} @@
  NodeTwo :> {NodeOne, NodeTwo} @@
  NodeThree :> {} @@
  NodeFour :> {} @@
  NodeFive :> {} )

suggests that other nodes are receiving votes!?

@heidihoward
Member Author

votesRequested is first indexed by the candidate sending the RequestVote message and then by the node which received the message (it ignores candidates voting for themselves).

So the following state means that node 1 requested a vote from node 3:

/\ votesRequested = ( NodeOne :>
      ( NodeOne :> 0 @@
        NodeTwo :> 0 @@
        NodeThree :> 1 @@
        NodeFour :> 0 @@
        NodeFive :> 0 ) @@

Likewise, votesGranted is indexed by the candidate who was granted the vote, so the following means that node 1 received a vote from itself and from node 3.

/\ votesGranted = ( NodeOne :> {NodeOne, NodeThree} @@

I can update the current documentation of votesRequested to make this clearer.

\* State space limitation: Restrict each node to send a limited amount
\* of requests to other nodes
VARIABLE votesRequested

I could also add an invariant that checks that if a vote was granted (except by the candidate itself) then it must have been requested.
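
A minimal sketch of such an invariant (Servers here is assumed to be the spec's constant for the full set of nodes, and the operator name is made up for illustration):

\* Sketch only: a vote granted to candidate i by any node j other than i
\* itself must have been preceded by a vote request from i to j.
VotesGrantedWereRequestedInv ==
    \A i \in Servers :
        \A j \in votesGranted[i] \ {i} : votesRequested[i][j] > 0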

@heidihoward
Member Author

This issue seems to have been caused by a problem with the logic for determining what set of servers constitutes a quorum when a candidate is trying to become a leader whilst having an uncommitted but signed reconfiguration tx in its log.

In this particular case, node 1 becomes leader in term 3 after receiving votes from itself and node 3. However, the first entry in its log is a reconfiguration transaction which changes the set of servers to just node 2. This transaction has been committed by node 2 (and thus the configuration has changed) but node 1 does not know that yet. The full trace is here (ccf-reconfig-bug.txt) if anyone is interested.

I believe a candidate should get a quorum in each of the current configurations before becoming a leader. The TLA+ spec instead computes the union of servers across all active configurations and requires a single quorum over that combined set. The code is below:

/\ \* To become leader, a Quorum of _all_ known nodes must have voted for this server (across configurations)
LET relevantServers == GetServerSet(i)
IN (votesGranted[i] \cap relevantServers) \in CalculateQuorum(relevantServers)

This explains why node 1 was able to become a leader without a vote from node 2. This approach could cause other issues as well: for instance, if a system was switching from nodes 1, 2 and 3 to nodes 4, 5 and 6, then a new leader could be elected with only the support of nodes 3, 4, 5 and 6, leaving a majority quorum (nodes 1 and 2) in the old configuration which might not know about the new configuration.

I have some thoughts about how to patch this in the spec, but I would like to discuss it first to work out what CCF is doing in practice.
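
For concreteness, a per-configuration variant of the check above could look roughly like this (a sketch only, not a tested patch; it assumes Configurations[i] is a sequence of <<index, serverSet>> pairs, as in the trace above, and reuses the spec's existing CalculateQuorum):

/\ \* Sketch: require a quorum of votes in every active configuration of i
   \A c \in DOMAIN Configurations[i] :
       (votesGranted[i] \cap Configurations[i][c][2])
           \in CalculateQuorum(Configurations[i][c][2])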

@achamayou
Member

@heidihoward the implementation does use a quorum across all active nodes, as opposed to a quorum in each active configuration. I have not been able to find any notes about why that choice was made, but I remember it was discussed at least once.

I suspect MoreUpToDateCorrectInv does not hold with the currently implemented election scheme, which is based on committable watermarks rather than committed watermarks: https://github.com/microsoft/CCF/blob/main/src/consensus/aft/raft.h#L1642

The best entry point for history on why this change was made is probably #589.

@heidihoward
Member Author

So I think there are two distinct issues here which are worth considering separately:

  1. I am curious about the idea of quorums across all active configurations instead of a quorum in each active configuration. It is not obvious to me why this is safe. As demonstrated by node 1 in the example, it violates the constraint that if a node becomes a leader in some term t then its log contains all entries which were committed in terms <t. This constraint is useful as it is the basis for why it is safe for followers to overwrite log entries if they are told to do so by the leader (as the leader will already know all the committed entries).
  2. I agree that MoreUpToDateCorrectInv doesn't match the current election scheme. In my mind, txs after the last signed tx are provisional and are ignored for the purpose of leader election. This invariant can be fixed by using the last signed index instead of the last index (see the sketch below).
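
For illustration, a helper such a fix could build on (a sketch only; the operator name and the CHOOSE-based maximum are my assumptions, not taken from the spec):

\* Sketch only: index of the last signed (committable) entry in i's log,
\* or 0 if there is none; contentType/TypeSignature are as in the trace above.
LastSignedIndex(i) ==
    LET signed == {n \in DOMAIN log[i] : log[i][n].contentType = TypeSignature}
    IN IF signed = {} THEN 0
       ELSE CHOOSE n \in signed : \A m \in signed : m <= n

The up-to-date comparison in the invariant would then use this index (and the term of the entry at this index) in place of the last index of the log.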

@achamayou
Member

achamayou commented May 26, 2022

In the diagram below, C represents the commit level (the last commit_idx received from the primary) and c represents the committable level (the last_idx containing a valid signature from the primary).

R represents a reconfiguration transaction from (0, 1, 2) to (2).

Node 2 was primary until a partition occurred.

Node 0: |--------C
Node 1: |--------C-R-c
--------------------------------------
Node 2: |----------R-C----c

In the "each" election mode (a quorum in each active configuration), Node 2 elects itself easily and continues. Node 0 and Node 1 continuously run elections, unsuccessfully, since they can never get a majority in (2).

In the "across" election mode (a quorum across all active nodes), Node 2 also elects itself and continues. But Node 0 and Node 1 elect Node 1, which replicates its log. R commits, which signals the removal of Node 0 and Node 1. Retirement details.

We think the latter is preferable, because:

  1. It does not seem to create a possibility for a fork: if R is committable, it will eventually be committed, and the minority partition will shut down when it is committed there.
  2. The minority partition does catch up. If Node 2 were to crash immediately post-partition, after having advertised a commit level further than Node 0 and Node 1 have advertised, the minority partition would never reach it if it used "each" elections. But with "across" it will.

Edit: while I think this example is useful in illustrating why it seems good for elections to work this way, it does not illustrate a breach of MoreUpToDateCorrectInv as I understand it, because both post-partition primaries have in their respective logs all the entries that their electors believe are committed. So I need to go read the trace again.

@achamayou
Member

achamayou commented May 26, 2022

Ok, so it seems that:

/\ log = ( NodeOne :>
      << [term |-> 2, contentType |-> TypeReconfiguration, value |-> {NodeTwo}],
         [term |-> 2, contentType |-> TypeSignature, value |-> 0],
         [term |-> 3, contentType |-> TypeEntry, value |-> 2] >> @@

despite:

/\ Configurations = ( NodeOne :> <<<<0, {NodeOne, NodeTwo, NodeThree}>>, <<1, {NodeTwo}>>>> @@

Node 1 is emitting an Entry after observing its reconfiguration becoming committable! That can't happen in the current implementation: https://github.com/microsoft/CCF/blob/main/src/consensus/aft/raft.h#L251

@heidihoward
Member Author

I think at the heart of this discussion is the definition of committable. My understanding is that a committable transaction (one that is followed by a signed transaction) is not guaranteed to eventually be committed (and is thus not guaranteed to be on a majority of replicas). This differs from your claim that if R is committable then it will eventually happen.

I think the problem with quorums across committable configurations is that a leader can be elected without the support of a majority quorum in the current configuration. Consider the set of replicas {n0,n1,n2}:

  1. n0 adds a reconfiguration tx for switching to the configuration {n0,n3,n4}, followed by a signature.
  2. n0 stops being the leader before replicating this reconfiguration tx to n1 or n2.
  3. n0 becomes a candidate again. n0 now believes that it can become a leader with just the support of itself, n3 and n4 (as the full set of active nodes is {n0,n1,n2,n3,n4}). In other words, n0 can become a leader without a majority quorum in the old configuration {n0,n1,n2} (see the sketch after this list).
  4. Meanwhile, n1 and n2 (which have no knowledge of the reconfiguration) elect node n1 to be the leader in the same term.
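
To make the quorum arithmetic in step 3 concrete, here is a small stand-alone sketch (the Quorums operator below is a plain majority and is only assumed to match the spirit of the spec's CalculateQuorum; node names are the ones from the steps above):

---- MODULE QuorumCheckExample ----
EXTENDS FiniteSets

Quorums(S) == {q \in SUBSET S : Cardinality(q) * 2 > Cardinality(S)}

OldConfig == {"n0", "n1", "n2"}
NewConfig == {"n0", "n3", "n4"}
AllActive == OldConfig \cup NewConfig   \* the five nodes n0 considers active
Votes     == {"n0", "n3", "n4"}         \* the votes n0 can gather

AcrossOK  == Votes \in Quorums(AllActive)                   \* TRUE: 3 of 5 is a majority
PerOldOK  == (Votes \cap OldConfig) \in Quorums(OldConfig)  \* FALSE: only n0 from the old configuration
====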

@achamayou
Member

I think at the heart of this discussion is the definition of committable. My understanding is that a committable transaction (one that is followed by a signed transaction) is not guaranteed to eventually be committed (and is thus not guaranteed to be on a majority of replicas).

Yes, that's correct.

This differs from your claim that if R is committable then it will eventually happen.

That claim on its own is definitely incorrect. I think it holds in the context of the example I gave, but I agree not in general, and in particular not in the example you have given.

As discussed, this has unpleasant implications for converging back to a previously advertised commit in at least some scenarios, like the one in my example. Clearly though, ending up with two actual leaders (not just a caretaker and a real leader) is much worse!

@heidihoward
Member Author

Fixed by #3965
