Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Configuration creation with configurable replica count and wait time #159

Open
wants to merge 19 commits into
base: master
Choose a base branch
from

Conversation

hanish520
Copy link

Tests to form configuration with a configurable number of replicas and wait time.

Hanish Gogada and others added 12 commits February 27, 2022 14:58
This fixes various problems with error reporting when running
tests with goroutines. Unfortunately, it still seems to fail for
Style1 for any replica count, but Style2 does not fail even for
replica counts up to 180.
To avoid that the process create to many connections (open files)
each test should stop their servers.
This still contains the old versions that don't work, but I've
disabled them. Things should be cleaned up now that the new
version (seems to) work. Have been able to start 200 servers.
Previously there was no way to know whether or not a goroutine had
in fact been started; henceforth, the errChan would be closed by
the defer since the createReplicas() call would reach the return
before all goroutines had the chance to send their err, causing a
panic. In this new version, we wait for all goroutines to start
before we check for errors on the channel.
@meling meling requested a review from johningve March 7, 2022 19:14
nodeMap[replica.address] = replica.id
}
for _, replica := range replicas {
replica.createConfiguration(nodeMap)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error returned by this method is not checked.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interestingly, this seems to bring back the bug we are fighting with here. If we run the code with:

$ go test -v -run TestAllToAllConfiguration -replicas 10
=== RUN   TestAllToAllConfiguration
--- PASS: TestAllToAllConfiguration (1.06s)

vs

go test -v -run TestAllToAllConfiguration -replicas 11
=== RUN   TestAllToAllConfiguration
    alltoallconfig_test.go:33: could not create configuration: connection failed for addr: 127.0.0.1:65204: dialing node failed: context deadline exceeded
--- FAIL: TestAllToAllConfiguration (1.16s)

And this is consistently working for up to 10 replicas, but not for 11 or above.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added TestGrpcDial to try to understand better what's going on. This version fails consistently with 84 replicas (but not with fewer on my machine that is):

go test -v -run TestGrpc -replicas 84
=== RUN   TestGrpcDial
    alltoallconfig_test.go:133: context deadline exceeded
--- FAIL: TestGrpcDial (2.05s)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't reproduce this on my machine

❯ ./all2all.test -test.v -test.run="TestGrpcDial" -replicas=10000
=== RUN   TestGrpcDial
--- PASS: TestGrpcDial (4.84s)
PASS

It only fails for me when the kernel runs out of ports 😅

❯ ./all2all.test -test.v -test.run="TestGrpcDial" -replicas=50000 
=== RUN   TestGrpcDial
    alltoallconfig_test.go:123: listen tcp 127.0.0.1:0: bind: address already in use
--- FAIL: TestGrpcDial (9.77s)
FAIL

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants