Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flaky test: TestPeersTotal #8135

Closed
aschmahmann opened this issue May 13, 2021 · 13 comments · Fixed by #8577
Closed

Flaky test: TestPeersTotal #8135

aschmahmann opened this issue May 13, 2021 · 13 comments · Fixed by #8577
Assignees
Labels
kind/bug A bug in existing code (including security flaws) kind/test Testing work need/triage Needs initial labeling and prioritization status/blocked Unable to be worked further until needs are met topic/test failure Topic test failure

Comments

@aschmahmann
Copy link
Contributor

https://app.circleci.com/pipelines/github/ipfs/go-ipfs/4610/workflows/d1451a93-9637-482c-b9e1-f1914dfaa687/jobs/51152/tests#failed-test-0

=== RUN   TestPeersTotal
2021/05/12 20:33:38 failed to sufficiently increase receive buffer size (was: 208 kiB, wanted: 2048 kiB, got: 416 kiB). See https://github.com/lucas-clemente/quic-go/wiki/UDP-Receive-Buffer-Size for details.
    TestPeersTotal: metrics_test.go:45: expected 1 peers transport, got 2
--- FAIL: TestPeersTotal (0.16s)

After re-running it is green, so filing this issue so we don't forget to fix this).

@aschmahmann aschmahmann added kind/bug A bug in existing code (including security flaws) topic/test failure Topic test failure kind/test Testing work need/triage Needs initial labeling and prioritization labels May 13, 2021
@masih
Copy link
Member

masih commented Jul 22, 2021

Looking at CircleCi config, it looks like gotest job runs on Docker. To set sysctl we need to run that on a machine executor; See docs.

Running that job on machine will have longer start-up time, and will potentially involve installing go manually.
@aschmahmann Do we want to make this change?

@aschmahmann
Copy link
Contributor Author

@masih the UDP buffer size should be an independent problem from the flaky test, it's just a data transfer efficiency thing.

At the moment it doesn't seem worthwhile to change the setup to run as a machine, especially given that we'd like to move towards using GitHub Actions #8348

@schomatis schomatis self-assigned this Sep 14, 2021
@schomatis
Copy link
Contributor

The test on which this one is based

https://github.com/ipfs/go-ipfs/blob/ef0428af407410d7f7e98ec0b3e57afd8c828031/core/corehttp/metrics_test.go#L15

has been fixed for the same error in libp2p/go-libp2p-swarm@0538806.

@aschmahmann I'm not sure how to fix this as I don't know the original design rationale behind the test nor the function we are testing. The original fix changes its test to track peers instead of open connections (because of what's explained in the commit message) but in our function being tested we track connections and I imagine changing that to tracking peers instead would have unintended side effects in the callers consuming this information.

@BigLep
Copy link
Contributor

BigLep commented Oct 1, 2021

Next step is for @petar to look to identify next step. We may not need this if this test is a copy from libp2p.

@petar
Copy link
Contributor

petar commented Oct 4, 2021

I've added more logging to the test for now, so that next time it breaks we can understand why the connections are established over different transports.

@BigLep BigLep added the status/blocked Unable to be worked further until needs are met label Oct 8, 2021
@BigLep
Copy link
Contributor

BigLep commented Oct 8, 2021

This is blocked until we find another occurrence of this so we can look at the more verbose logs.

@petar : I assume you haven't seen another instance?

@petar
Copy link
Contributor

petar commented Oct 8, 2021 via email

@schomatis
Copy link
Contributor

@petar Following up on this:

On the extra logging, we will normally dial both TCP and QUIC/UDP simultaneously and add both connections to the list, see the swarm dial worker logic for more details.

As mentioned above this is already acknowledged upstream and that is why libp2p/go-libp2p-swarm@0538806 no longer checks the list of connections but explicitly the list of peers in its own test (because one peer might have more than one active connection).

I think we should port this fix here as well, consulting the peer list instead of the connection list where appropriate. My main doubt is understanding the actual purpose of PeersTotalValues to decide how to best do that.

@petar
Copy link
Contributor

petar commented Nov 11, 2021

@schomatis I agree. The ipfs test should use peer count, and not be dependent on the underlying implementation of swarm (which can vary the connection count per peer).

@schomatis
Copy link
Contributor

Ok, will push something for this next week.

@BigLep
Copy link
Contributor

BigLep commented Nov 12, 2021

Per 2021-11-12 verbal and slack: the test should be updated to assert on the number of peers (not the number of connections).

@galargh: if you have other comments/context from Slack, feel free to add.

@galargh
Copy link
Contributor

galargh commented Nov 12, 2021

Nothing to add, #8135 (comment) says it all.

@schomatis
Copy link
Contributor

Picking this up again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug A bug in existing code (including security flaws) kind/test Testing work need/triage Needs initial labeling and prioritization status/blocked Unable to be worked further until needs are met topic/test failure Topic test failure
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants