-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix flaky libp2p tests #105
Conversation
Hit the race that is fixed in #102 |
libp2p/libp2pendpoint_test.go
Outdated
@@ -425,7 +430,7 @@ func TestReqTimeout(t *testing.T) { | |||
go func() { | |||
// timeout is queued in the scheduler 0 | |||
for !scheds[0].RunOne(ctx) { | |||
time.Sleep(1 * time.Millisecond) | |||
time.Sleep(10 * time.Millisecond) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test is failing because the remote peer (sched[1]
) likely isn't receiving the request within 100ms (!!!), so it isn't adding anything to its queue, and hence has nothing to run at the next step and fails the test.
I am not able to reproduce this on my machine. I assume that if this thread sleeps longer, it gives more time to other threads, and the message can be received within the 100 ms.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is problematic and means we don't have deterministic tests. If we're tweaking test timings then we're not in a better position than using raw goroutines.
We should try and remove sleeps from all tests. As a side issue, why aren't these sleeps using the scheduler's clock?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@iand unfortunately we cannot get deterministic testing when using go-libp2p. When using go-libp2p to send messages over the network stack, multiple go routines are at play, and we cannot use the scheduler to obtain sequentiality.
We should try and remove sleeps from all tests
I wish we were able to do it, but we need to wait for the message to be passed from one go routine to another over the network stack.
This test is testing the request timeout. A
is sending a request to B
. No action is run in B
's scheduler until A
's request times out. After A
has timed out, B
will send a late reply, and A
should discard it. However, sometimes B
's scheduler has no action to run after A
's timeout of 100 ms (!!!).
There are 3 go routines at play:
- The single worker, running both
A
andB
schedulers. A
's go-libp2p go routine sending the request.B
's go-libp2p go routine listening at incoming requests.
It means that either (2) is not sending the request within 100ms or that (3) isn't processing the request within 100ms (, or that the request got lost in the local network stack, but it seems unlikely). It is very hard to understand what is wrong, because I cannot reproduce the failure on my machine (that is certainly faster than GH CI), and it only happens once in a while on GH CI.
I will try to add verbose when the test fails, and run it many times in GH CI.
As a side issue, why aren't these sleeps using the scheduler's clock?
You are right! In this case however it won't solve the flakiness, because the scheduler's clock is a real clock.
I am confident that I am less confident with |
You could run the following to reproduce the flake:
|
|
Unfortunately it doesn't help, the test keeps passing on my machine. I have even run the following script for quite some time without getting the test to fail. while true; do
output=$(go1.20 test -run ^TestReqTimeout$)
# Count the number of lines in the output
line_count=$(echo "$output" | wc -l)
# If the line count is greater than 2, the test failed. break out of the loop
if (( line_count > 2 )); then
echo "$output"
break
fi
done |
|
c009efc
to
ce31ec6
Compare
The tests have run successfully 15 times in addition to the displayed tests. IMO this is good to merge |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Addresses #103