
Cardano ThreadNet test temporary network partition can be too long when d=1 #2554

Closed

nfrisby opened this issue Aug 28, 2020 · 1 comment

Labels: consensus (issues related to ouroboros-consensus testing)

nfrisby commented Aug 28, 2020

I ran https://buildkite.com/input-output-hk/ouroboros-network-nightly/builds/211#90daaa25-f7ff-407e-aaec-ecffb2e319e4 expecting failures. There were multiple, most expected. But this one was not expected:

==> /tmp/tmp.BPkjiFiO4G/logs/005-Cardano.log <==
cardano
  Cardano ThreadNet
    simple convergence: FAIL (227.41s)
      *** Failed! (after 192 tests):
      Exception:
        Assertion failed
        CallStack (from HasCallStack):
          assert, called at test/Test/ThreadNet/Cardano.hs:906:5 in main:Test.ThreadNet.Cardano
      TestSetup
        { setupByronLowerBound = True
        , setupD = DecentralizationParam (1 % 1)
        , setupHardFork = True
        , setupInitialNonce = Nonce "0158b05591a3aaf3ab052ec3dfa473b3a51c3b682247df8b58de561e7e7d2cbe"
        , setupK = SecurityParam 5
        , setupPartition = Partition (SlotNo 146) (NumSlots 8)
        , setupSlotLengthByron = SlotLength 16.199s
        , setupSlotLengthShelley = SlotLength 0.562s
        , setupTestConfig = TestConfig
            { initSeed = Seed 2990599849885521969
            , nodeTopology = NodeTopology (fromList [(CoreNodeId 0,fromList []),(CoreNodeId 1,fromList [CoreNodeId 0])])
            , numCoreNodes = NumCoreNodes 2
            , numSlots = NumSlots 325
            }
        , setupVersion = (NodeToNodeV_3,HardForkNodeToNodeEnabled (EraNodeToNodeEnabled ByronNodeToNodeVersion2 :* EraNodeToNodeEnabled ShelleyNodeToNodeVersion1 :* Nil))
        }
      Use --quickcheck-replay=45201 to reproduce.

That assertion corresponded to a wedge in the final chains (PR #2550 made it a proper failure).

Note that d=1. This means every slot is an overlay slot, which means the leader schedule is completely round-robin. Thus the partition of 8 slots must somehow be responsible (indeed, the two final chains' last shared block is from slot 144).
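The round-robin claim can be illustrated with a tiny sketch (a hypothetical model for this two-node test, not the actual overlay-schedule code; `numCoreNodes` and `leaderOf` are made up for illustration):

```haskell
-- With d = 1 every slot is an overlay slot, so in this two-node test the
-- slot leader is determined purely by the slot number, round-robin.
numCoreNodes :: Int
numCoreNodes = 2

leaderOf :: Int -> Int
leaderOf slot = slot `mod` numCoreNodes
```

Under this schedule `leaderOf 144 == 0` and `leaderOf 145 == 1`, so a healthy run would interleave blocks from both nodes.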

My main suspect is the genPartition function.

This Issue is to debug the above failure (and probably to revise the generator to avoid it).

... And I just noticed that the common prefix only has blocks from every other slot. That's also unexpected: it suggests one of the nodes is (effectively) never (successfully) leading. There might be more going on here than I thought at first.

nfrisby commented Aug 30, 2020

When I rebased this repro onto master (= 01f1d62 - Merge #2543 (2 days ago) <iohk-bors[bot]>) and reverted 1aeab35 (cardano: tune ThreadNet parameters to avoid wedges) the failure changed to a fatal error OverlayFailure (VRFKeyUnknown (KeyHash "a12ba2194b0b3e80fdad43e4eed374b52a76cfab55fdc66937fa1f5a")).

This makes me suspect that this error was also being masked by the too-coarse catch-and-restart logic that PR #2548 recently refined.

Moreover, if I then also cherry-pick my local rough-draft patch for Issue #2559, this repro runs successfully.

This suggests to me that my previous careful calculations, recorded in detail in the comments of genPartition, were indeed correct, and that the failure (due to Issue #2559) was causing the mini protocols to crash and restart (nearly?) every slot. That could effectively add one slot's worth of unanticipated delay to the partition, which would explain the original wedge error.
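The arithmetic behind that theory, as a sketch (the slot numbers come from the TestSetup above; the single extra slot of delay is the hypothesis under investigation, not a measured quantity):

```haskell
-- Partition (SlotNo 146) (NumSlots 8): the scheduled partition covers
-- slots 146 through 153.
scheduledPartition :: [Int]
scheduledPartition = take 8 [146 ..]

-- Hypothesized: the crash-and-restart churn adds roughly one extra slot
-- of delay, so the effective partition is one slot longer than the
-- genPartition calculations accounted for.
effectivePartitionLen :: Int
effectivePartitionLen = length scheduledPartition + 1
```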

I'm now looking for evidence to support that theory.

Edit: Yep, if I revert back to the earlier repro and enable the mini-protocol-restarting tracer, I see the same VRFKeyUnknown (KeyHash "a12ba...") error as the reason for the first protocol restart (in slot 154, which is the end of the repro's scheduled partition). I don't have any idea where VRFKeyUnknown would come from, so my leading hunch is Issue #2558/#2559. Moreover, this is Byron+Shelley with k=5 and f=0.5, so the first slots of the epochs are 0, 50, and 150 -- the epoch transition happens during the network partition. That is indeed the necessary recipe for Issue #2559.
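The epoch arithmetic in that edit can be checked with a small sketch (assuming a Byron epoch of 10k slots and a Shelley epoch of 10k/f slots; those formulas are my assumption about the test's parameterization, not quoted from the issue):

```haskell
k :: Int
k = 5

byronEpochLen :: Int
byronEpochLen = 10 * k        -- 50 slots

shelleyEpochLen :: Int
shelleyEpochLen = 10 * k * 2  -- 10k / f with f = 1/2, i.e. 100 slots

-- First slot of each epoch, assuming the hard fork after one Byron epoch:
epochFirstSlots :: [Int]
epochFirstSlots = 0 : [byronEpochLen + n * shelleyEpochLen | n <- [0 ..]]
```

This gives epoch first slots 0, 50, and 150, and slot 150 indeed falls inside the scheduled partition (slots 146 through 153).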

I'm closing this as Duplicate of #2559.

nfrisby closed this as completed Aug 30, 2020