Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spurious CI failure in lts_compatibility test #3043

Closed
letmaik opened this issue Oct 1, 2021 · 1 comment
Closed

Spurious CI failure in lts_compatibility test #3043

letmaik opened this issue Oct 1, 2021 · 1 comment
Labels

Comments

@letmaik
Copy link
Member

letmaik commented Oct 1, 2021

https://dev.azure.com/MSRC-CCF/CCF/_build/results?buildId=33592&view=logs&jobId=8f3dc89c-3708-5926-47e7-27120a268dab&j=8f3dc89c-3708-5926-47e7-27120a268dab&t=d37f6062-24ee-5eff-15b2-0eb74651c675

"TimeoutError: A new primary was not elected after 16.0 seconds"

Some possibly relevant node log output:

Node 0:

53: 09:44:19.863 | ERROR | infra.remote:log_errors:63 - /__w/1/s/build/workspace/lts_compatibility_0/out: 2021-10-01T09:43:53.426591Z -0.009 0 [fail ] ../src/node/node_to_node.h:332 | Exception in virtual void ccf::NodeToNodeImpl::recv_message(const ccf::NodeId &, OArray &&)

Node 1:

2021-10-01T09:43:55.823981Z 100 [fail ] ../src/host/node_connections.h:379 | Unknown node connection d55030e68cb1d26cbaa27d2d1b02faf60c1db454d98fe6f5aea121784cd14efb

Node 4:

2021-10-01T09:43:53.422348Z -0.002 0 [fail ] ../src/node/channels.h:548 | <- n[d55030e68cb1d26cbaa27d2d1b02faf60c1db454d98fe6f5aea121784cd14efb] (WAITING_FOR_FINAL): Peer certificate verification failed

Full logs:
lts_compatibility.zip

@letmaik letmaik added the bug label Oct 1, 2021
@eddyashton
Copy link
Member

I believe this is resolved by #3035.

The diagnostic I claimed in that PR is

35:17.470 |04|         Node has now joined the network as node n[4=e75f]: all domains (node_state.h:578)
35:17.508 |00| Initiating node channel with 4=e75f. (channels.h:713)
35:17.614 |00| Initiating node channel with 4=e75f. (channels.h:713)

But I think the first line is just an indicator, and the real issue is if we have multiple initiation attempts after "network connections now accepted", which is present in these logs:

43:53.241 |04|         Network TLS connections now accepted (node_state.h:1603)
43:53.264 |00| Initiating node channel with 4=a5ba. (channels.h:713)
...
43:53.384 |00| Initiating node channel with 4=a5ba. (channels.h:713)
43:53.395 |00| Node channel with 4=a5ba is now established. (channels.h:690)
43:53.422 |04|         <- n[d55030e68cb1d26cbaa27d2d1b02faf60c1db454d98fe6f5aea121784cd14efb] (WAITING_FOR_FINAL): Peer certificate verification failed (channels.h:548)
43:53.422 |04|         Node channel with n[d55030e68cb1d26cbaa27d2d1b02faf60c1db454d98fe6f5aea121784cd14efb] cannot receive authenticated message: not established, status=WAITING_FOR_FINAL (channels.h:833)
43:53.422 |04|         Dropped invalid message from n[d55030e68cb1d26cbaa27d2d1b02faf60c1db454d98fe6f5aea121784cd14efb] (raft.h:986)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants