Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Debug the mini protocol restarts #2553

Closed
wants to merge 4 commits into from

Conversation

mrBliss
Copy link
Contributor

@mrBliss mrBliss commented Aug 27, 2020

No description provided.

As of #2525, the ChainSyncClient just returns when a shutdown is expected,
instead of throwing an exception. This means the ThreadNet tests need to restart
the terminated mini-protocol application after a certain amount of time (e.g.,
at the start of the next slot).

All exceptions thrown by mini-protocol apps are now unexpected and fatal.
@mrBliss mrBliss added the consensus issues related to ouroboros-consensus label Aug 27, 2020
@mrBliss mrBliss requested a review from nfrisby August 27, 2020 14:37
@mrBliss mrBliss force-pushed the mrBliss/mini-protocol-restart-debugging branch from 046f1b0 to 31408fe Compare August 27, 2020 14:44
@mrBliss
Copy link
Contributor Author

mrBliss commented Aug 27, 2020

cabal run ouroboros-consensus-mock -- -p "BUG" --quickcheck-replay=0 --quickcheck-tests=1

ouroboros-consensus
  BFT
    BUG: Starting ChainSync from CoreNodeId 1 to CoreNodeId 0
Starting ChainSync from CoreNodeId 0 to CoreNodeId 1
Starting BlockFetch from CoreNodeId 1 to CoreNodeId 0
Starting BlockFetch from CoreNodeId 0 to CoreNodeId 1
Starting TxSubmission from CoreNodeId 1 to CoreNodeId 0
Starting TxSubmission from CoreNodeId 0 to CoreNodeId 1
Starting KeepAlive from CoreNodeId 1 to CoreNodeId 0
Starting KeepAlive from CoreNodeId 0 to CoreNodeId 1
Registered peer: CoreId (CoreNodeId 0)
Starting ChainSync from CoreNodeId 0 to CoreNodeId 1
Registered peer: CoreId (CoreNodeId 1)
Starting ChainSync from CoreNodeId 1 to CoreNodeId 0
CS start: CoreId (CoreNodeId 0)
Starting BlockFetch from CoreNodeId 0 to CoreNodeId 1
CS start: CoreId (CoreNodeId 1)
Starting BlockFetch from CoreNodeId 1 to CoreNodeId 0
Starting TxSubmission from CoreNodeId 0 to CoreNodeId 1
Starting TxSubmission from CoreNodeId 1 to CoreNodeId 0
Starting KeepAlive from CoreNodeId 0 to CoreNodeId 1
Starting KeepAlive from CoreNodeId 1 to CoreNodeId 0
CS done: CoreId (CoreNodeId 1)
Terminated ChainSync from CoreNodeId 1 to CoreNodeId 0 in SlotNo 4
(SlotNo 4,MiniProtocolDelayed)
CS really done: CoreId (CoreNodeId 1)
Unregistered peer: CoreId (CoreNodeId 1)
Terminated ChainSync from CoreNodeId 0 to CoreNodeId 1 in SlotNo 4
(SlotNo 4,MiniProtocolDelayed)
(SlotNo 5,MiniProtocolRestarting)
Restarting ChainSync from CoreNodeId 1 to CoreNodeId 0 in SlotNo 5
Starting ChainSync from CoreNodeId 1 to CoreNodeId 0
(SlotNo 5,MiniProtocolRestarting)
Restarting ChainSync from CoreNodeId 0 to CoreNodeId 1 in SlotNo 5
Starting ChainSync from CoreNodeId 0 to CoreNodeId 1
Registered peer: CoreId (CoreNodeId 1)
CS start: CoreId (CoreNodeId 1)
FAIL (0.01s)
      *** Failed! (after 1 test):
      Exception:
        peerChains: fromList [CoreId (CoreNodeId 1)] peerStates: fromList []
        CallStack (from HasCallStack):
          error, called at src/Ouroboros/Network/BlockFetch/State.hs:201:9 in ouroboros-network-0.1.0.0-inplace:Ouroboros.Network.BlockFetch.State
      ..

@nfrisby
Copy link
Contributor

nfrisby commented Aug 27, 2020

@mrBliss I pushed up a commit that addresses it, I think. 76fa5b9

ecc2e73e2 - (origin/nfrisby/mini-protocol-restart-debugging) WIP minimal failing test (4 minutes ago) <Thomas Winant>
76fa5b9b6 - test-infra: simplify mini-protocol restart logic (4 minutes ago) <Nicolas Frisby>
51283af36 - ChainSyncClient: gracefully terminate if the chain is uninteresting (82 minutes ago) <Thomas Winant>
64351aaa3 - (origin/master, origin/bors/staging, origin/HEAD) Merge #2539 (4 hours ago) <iohk-bors[bot]>

It's been passing your test, and also the rest of mock and RealPBFT.

I tweaked a few comments while I was in there to hopefully make it clear why the RestartCause data type remains necessary.

HTH. Let me know if you'd like more from me here. Really glad to see this overall simplification!

P.S. - This diff loses a little information, but that's because aChainSyncClient returns () instead of the ChainSyncClientResult. Previously we were able to record which of the possible CS client exceptions caused the restart.

@mrBliss
Copy link
Contributor Author

mrBliss commented Aug 28, 2020

Great, thanks for helping me out! I cherry picked that commit into #2548.

@mrBliss mrBliss closed this Aug 28, 2020
@mrBliss mrBliss deleted the mrBliss/mini-protocol-restart-debugging branch August 28, 2020 13:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
consensus issues related to ouroboros-consensus
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants