
Fix flaky libp2p tests #105

Merged: 3 commits merged into main from fix-libp2p-flakiness on Aug 18, 2023
Conversation

guillaumemichel (Contributor)

Addresses #103


iand commented Aug 17, 2023

Hit the race that is fixed in #102

@@ -425,7 +430,7 @@ func TestReqTimeout(t *testing.T) {
 	go func() {
 		// timeout is queued in the scheduler 0
 		for !scheds[0].RunOne(ctx) {
-			time.Sleep(1 * time.Millisecond)
+			time.Sleep(10 * time.Millisecond)
guillaumemichel (Contributor Author)

The test fails because the remote peer (scheds[1]) likely isn't receiving the request within 100 ms (!!!), so it adds nothing to its queue, has nothing to run at the next step, and the test fails.

I am not able to reproduce this on my machine. I assume that if this thread sleeps longer, it yields more time to the other goroutines, and the message can be received within the 100 ms.

iand (Contributor)

This is problematic and means we don't have deterministic tests. If we're tweaking test timings, then we're in no better a position than using raw goroutines.

We should try to remove sleeps from all tests. As a side issue, why aren't these sleeps using the scheduler's clock?

guillaumemichel (Contributor Author)

@iand unfortunately we cannot get deterministic testing when using go-libp2p. When go-libp2p sends messages over the network stack, multiple goroutines are at play, and we cannot use the scheduler to enforce sequentiality.

We should try to remove sleeps from all tests

I wish we were able to, but we need to wait for the message to be passed from one goroutine to another over the network stack.

This test exercises the request timeout. A sends a request to B. No action runs in B's scheduler until A's request times out. After A has timed out, B sends a late reply, which A should discard. However, sometimes B's scheduler has no action to run after A's 100 ms timeout (!!!).

There are 3 goroutines at play:

  1. The single worker, running both A's and B's schedulers.
  2. A's go-libp2p goroutine sending the request.
  3. B's go-libp2p goroutine listening for incoming requests.

This means that either (2) is not sending the request within 100 ms, or (3) isn't processing the request within 100 ms (or the request got lost in the local network stack, but that seems unlikely). It is very hard to understand what is wrong, because I cannot reproduce the failure on my machine (which is certainly faster than GH CI), and it only happens once in a while on GH CI.

I will try to add verbose logging when the test fails, and run it many times in GH CI.

As a side issue, why aren't these sleeps using the scheduler's clock?

You are right! In this case, however, it won't solve the flakiness, because the scheduler's clock is a real clock.

guillaumemichel (Contributor Author)

I am confident that TestConnections is fixed. The issue may have been introduced with go-libp2p 0.28, because loopback addresses may no longer be advertised.

I am less confident about TestReqTimeout, because I am unable to reproduce the failure on my machine. I gave it a shot, and for now it isn't failing on the CI (even though it only failed rarely before). I will keep running the CI many times, and if the test doesn't fail I would suggest that we merge. If the same test fails again in the future, we can open a new PR.

@guillaumemichel guillaumemichel marked this pull request as ready for review August 17, 2023 15:28

iand commented Aug 17, 2023

You could run the following to reproduce the flake:

go test -c . && while ./libp2p.test -test.shuffle=on -test.count=100 -test.failfast; do date; done


guillaumemichel (Contributor Author)

You could run the following to reproduce the flake:

go test -c . && while ./libp2p.test -test.shuffle=on -test.count=100 -test.failfast; do date; done

Unfortunately it doesn't help; the test keeps passing on my machine.

I have even run the following script for quite some time without getting the test to fail.

while true; do
    output=$(go1.20 test -run ^TestReqTimeout$)

    # Count the number of lines in the output
    line_count=$(echo "$output" | wc -l)

    # If the line count is greater than 2, the test failed. break out of the loop
    if (( line_count > 2 )); then
        echo "$output"
        break
    fi
done

@guillaumemichel guillaumemichel marked this pull request as draft August 18, 2023 06:09
guillaumemichel (Contributor Author)

TestReqTimeout alone took more than 1 s to run on Windows Run Tests (32-bit). The issue probably comes from the timing; I will try to reproduce the failure.

ok  	github.com/plprobelab/go-kademlia/libp2p	1.136s

@guillaumemichel guillaumemichel marked this pull request as ready for review August 18, 2023 09:43
guillaumemichel (Contributor Author)

The tests have run successfully 15 times in addition to the displayed tests.

IMO this is good to merge

iand (Contributor) left a comment

LGTM

@guillaumemichel guillaumemichel merged commit 070b692 into main Aug 18, 2023
@guillaumemichel guillaumemichel deleted the fix-libp2p-flakiness branch August 18, 2023 12:03