
Cardano ThreadNet test temporary network partition can be too long when d=1 #2554

Closed

nfrisby opened this issue Aug 28, 2020 · 1 comment

Labels: consensus (issues related to ouroboros-consensus testing)

nfrisby commented Aug 28, 2020

I ran https://buildkite.com/input-output-hk/ouroboros-network-nightly/builds/211#90daaa25-f7ff-407e-aaec-ecffb2e319e4 expecting failures. There were multiple, most expected. But this one was not expected:

==> /tmp/tmp.BPkjiFiO4G/logs/005-Cardano.log <==
cardano
  Cardano ThreadNet
    simple convergence: FAIL (227.41s)
      *** Failed! (after 192 tests):
      Exception:
        Assertion failed
        CallStack (from HasCallStack):
          assert, called at test/Test/ThreadNet/Cardano.hs:906:5 in main:Test.ThreadNet.Cardano
      TestSetup
        { setupByronLowerBound = True
        , setupD = DecentralizationParam (1 % 1)
        , setupHardFork = True
        , setupInitialNonce = Nonce "0158b05591a3aaf3ab052ec3dfa473b3a51c3b682247df8b58de561e7e7d2cbe"
        , setupK = SecurityParam 5
        , setupPartition = Partition (SlotNo 146) (NumSlots 8)
        , setupSlotLengthByron = SlotLength 16.199s
        , setupSlotLengthShelley = SlotLength 0.562s
        , setupTestConfig = TestConfig
            { initSeed = Seed 2990599849885521969
            , nodeTopology = NodeTopology (fromList [(CoreNodeId 0,fromList []),(CoreNodeId 1,fromList [CoreNodeId 0])])
            , numCoreNodes = NumCoreNodes 2
            , numSlots = NumSlots 325
            }
        , setupVersion = (NodeToNodeV_3,HardForkNodeToNodeEnabled (EraNodeToNodeEnabled ByronNodeToNodeVersion2 :* EraNodeToNodeEnabled ShelleyNodeToNodeVersion1 :* Nil))
        }
      Use --quickcheck-replay=45201 to reproduce.

That assertion corresponded to a wedge in the final chains (PR #2550 made it a proper failure).

Note that d=1. This means every slot is an overlay slot, which means the leader schedule is completely round-robin. Thus the partition of 8 slots must somehow be responsible (indeed, the two final chains' last shared block is from slot 144).
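The round-robin claim can be illustrated with a tiny sketch (a hypothetical model for this two-node test, not the actual overlay-schedule code; `numCoreNodes` and `leaderOf` are made up for illustration):

```haskell
-- With d = 1 every slot is an overlay slot, so in this two-node test the
-- slot leader is determined purely by the slot number, round-robin.
numCoreNodes :: Int
numCoreNodes = 2

leaderOf :: Int -> Int
leaderOf slot = slot `mod` numCoreNodes
```

Under this schedule `leaderOf 144 == 0` and `leaderOf 145 == 1`, so a healthy run would interleave blocks from both nodes.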

My main suspect is the genPartition function.

This Issue is to debug the above failure (and probably to revise the generator to avoid it).

... And I just noticed that the common prefix only has blocks from every other slot. That's also unexpected: it suggests one of the nodes is (effectively) never (successfully) leading. There might be more going on here than I thought at first.

nfrisby commented Aug 30, 2020

When I rebased this repro onto master (= 01f1d62 - Merge #2543 (2 days ago) <iohk-bors[bot]>) and reverted 1aeab35 (cardano: tune ThreadNet parameters to avoid wedges) the failure changed to a fatal error OverlayFailure (VRFKeyUnknown (KeyHash "a12ba2194b0b3e80fdad43e4eed374b52a76cfab55fdc66937fa1f5a")).

This makes me suspect that this error was also being masked by the too-coarse catch-and-restart logic that PR #2548 recently refined.

Moreover, if I then also cherry-pick my local rough-draft patch for Issue #2559, this repro runs successfully.

This suggests to me that my previous careful calculations, recorded in detail in the comments of genPartition, were indeed correct, and that the failure (due to Issue #2559) was causing the mini protocols to crash and restart (nearly?) every slot. That could effectively add one slot's worth of unanticipated delay to the partition, which would explain the original wedge error.
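The arithmetic behind that theory, as a sketch (the slot numbers come from the TestSetup above; the single extra slot of delay is the hypothesis under investigation, not a measured quantity):

```haskell
-- Partition (SlotNo 146) (NumSlots 8): the scheduled partition covers
-- slots 146 through 153.
scheduledPartition :: [Int]
scheduledPartition = take 8 [146 ..]

-- Hypothesized: the crash-and-restart churn adds roughly one extra slot
-- of delay, so the effective partition is one slot longer than the
-- genPartition calculations accounted for.
effectivePartitionLen :: Int
effectivePartitionLen = length scheduledPartition + 1
```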

I'm now looking for evidence to support that theory.

Edit: Yep, if I revert back to the earlier repro and enable the mini-protocol-restarting tracer, I see the same VRFKeyUnknown (KeyHash "a12ba...") error as the reason for the first protocol restart (in slot 154, which is the end of the repro's scheduled partition). I don't have any idea where VRFKeyUnknown would come from, so my leading hunch is Issue #2558/#2559. Moreover, this is Byron+Shelley with k=5 and f=0.5, so the first slots of the epochs are 0, 50, and 150 -- the epoch transition happens during the network partition. That is indeed the necessary recipe for Issue #2559.
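The epoch arithmetic in that edit can be checked with a small sketch (assuming a Byron epoch of 10k slots and a Shelley epoch of 10k/f slots; those formulas are my assumption about the test's parameterization, not quoted from the issue):

```haskell
k :: Int
k = 5

byronEpochLen :: Int
byronEpochLen = 10 * k        -- 50 slots

shelleyEpochLen :: Int
shelleyEpochLen = 10 * k * 2  -- 10k / f with f = 1/2, i.e. 100 slots

-- First slot of each epoch, assuming the hard fork after one Byron epoch:
epochFirstSlots :: [Int]
epochFirstSlots = 0 : [byronEpochLen + n * shelleyEpochLen | n <- [0 ..]]
```

This gives epoch first slots 0, 50, and 150, and slot 150 indeed falls inside the scheduled partition (slots 146 through 153).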

I'm closing this as Duplicate of #2559.

nfrisby closed this as completed Aug 30, 2020