
Fix flaky libp2p tests #105

Merged: 3 commits merged into main from fix-libp2p-flakiness on Aug 18, 2023
Conversation

guillaumemichel (Contributor)

Addresses #103


iand commented Aug 17, 2023

Hit the race that is fixed in #102

@@ -425,7 +430,7 @@ func TestReqTimeout(t *testing.T) {
 	go func() {
 		// timeout is queued in the scheduler 0
 		for !scheds[0].RunOne(ctx) {
-			time.Sleep(1 * time.Millisecond)
+			time.Sleep(10 * time.Millisecond)
guillaumemichel (Contributor Author)

The test fails because the remote peer (scheds[1]) likely isn't receiving the request within 100 ms (!!!), so it adds nothing to its queue, has nothing to run at the next step, and the test fails.

I am not able to reproduce this on my machine. I assume that if this thread sleeps longer, it yields more time to the other goroutines, and the message can be received within the 100 ms.

iand (Contributor)

This is problematic and means we don't have deterministic tests. If we're tweaking test timings, then we're in no better a position than using raw goroutines.

We should try to remove sleeps from all tests. As a side issue, why aren't these sleeps using the scheduler's clock?

guillaumemichel (Contributor Author)

@iand unfortunately we cannot get deterministic testing when using go-libp2p. When go-libp2p sends messages over the network stack, multiple goroutines are at play, and we cannot use the scheduler to enforce sequentiality.

We should try to remove sleeps from all tests

I wish we were able to, but we need to wait for the message to be passed from one goroutine to another over the network stack.

This test exercises the request timeout. A sends a request to B. No action runs in B's scheduler until A's request times out. After A has timed out, B sends a late reply, which A should discard. However, sometimes B's scheduler has no action to run after A's 100 ms timeout (!!!).

There are 3 goroutines at play:

  1. The single worker, running both A's and B's schedulers.
  2. A's go-libp2p goroutine sending the request.
  3. B's go-libp2p goroutine listening for incoming requests.

This means that either (2) is not sending the request within 100 ms, or (3) isn't processing the request within 100 ms (or the request got lost in the local network stack, but that seems unlikely). It is very hard to understand what is wrong, because I cannot reproduce the failure on my machine (which is certainly faster than GH CI), and it only happens once in a while on GH CI.

I will try to add verbose logging when the test fails, and run it many times in GH CI.

As a side issue, why aren't these sleeps using the scheduler's clock?

You are right! In this case, however, it won't solve the flakiness, because the scheduler's clock is a real clock.

guillaumemichel (Contributor Author)

I am confident that TestConnections is fixed. The issue may have been introduced with go-libp2p 0.28, because loopback addresses may no longer be advertised.

I am less confident about TestReqTimeout, because I am unable to reproduce the failure on my machine. I gave it a shot, and for now it isn't failing on the CI (even though it only failed rarely before). I will keep running the CI many times, and if the test doesn't fail I would suggest that we merge. If the same test fails again in the future, we can open a new PR.

@guillaumemichel guillaumemichel marked this pull request as ready for review August 17, 2023 15:28

iand commented Aug 17, 2023

You could run the following to reproduce the flake:

go test -c . && while ./libp2p.test -test.shuffle=on -test.count=100 -test.failfast; do date; done


guillaumemichel (Contributor Author)

You could run the following to reproduce the flake:

go test -c . && while ./libp2p.test -test.shuffle=on -test.count=100 -test.failfast; do date; done

Unfortunately it doesn't help; the test keeps passing on my machine.

I have even run the following script for quite some time without getting the test to fail.

while true; do
    output=$(go1.20 test -run ^TestReqTimeout$)

    # Count the number of lines in the output
    line_count=$(echo "$output" | wc -l)

    # If the line count is greater than 2, the test failed. break out of the loop
    if (( line_count > 2 )); then
        echo "$output"
        break
    fi
done

@guillaumemichel guillaumemichel marked this pull request as draft August 18, 2023 06:09
guillaumemichel (Contributor Author)

TestReqTimeout alone took more than 1 s to run on Windows Run Tests (32-bit). The issue probably comes from the timing; I will try to reproduce the failure.

ok  	github.com/plprobelab/go-kademlia/libp2p	1.136s

@guillaumemichel guillaumemichel marked this pull request as ready for review August 18, 2023 09:43
guillaumemichel (Contributor Author)

The tests have run successfully 15 times in addition to the displayed tests.

IMO this is good to merge

iand (Contributor) left a comment

LGTM

@guillaumemichel guillaumemichel merged commit 070b692 into main Aug 18, 2023
@guillaumemichel guillaumemichel deleted the fix-libp2p-flakiness branch August 18, 2023 12:03