-
Notifications
You must be signed in to change notification settings - Fork 190
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Try to fix flaky tests by waiting for subscriptions & mesh to be ready #203
Conversation
gossipsub_test.go
Outdated
@@ -23,6 +23,26 @@ func getGossipsubs(ctx context.Context, hs []host.Host, opts ...Option) []*PubSu | |||
return psubs | |||
} | |||
|
|||
func waitForMeshConstruction(topic string, psubs []*PubSub) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Races with the heartbeat thread for GossipsubRouter.mesh[T]. Let me think of something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, if your function is very computationally inexpensive you can generally use something along the lines of this:
done := make(chan resultType,1)
p.eval <- func() {
done <- myFn()
}
return <-done
Note: the channel closure is only going to be needed if your function needs to be synchronous. Also, you may want to add in select
statements that check for an expired context if you want to reuse these functions a bunch.
gossipsub_test.go
Outdated
|
||
// wait for all incoming subscription messages to be processed | ||
// this method should ONLY be called if all connected peers subscribe to the same topic | ||
func waitForSubscriptionProcessing(topic string, psubs []*PubSub) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Races with PubSub.handleIncomingRPC when it tries to update Pubsub.topics[T].
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above: PubSub is currently setup to basically have a few long running goroutines with event loops in them instead of using mutexes. There's an internal p.eval
channel that will run arbitrary functions for you inside of the event loop so you don't run into races.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aha... that is brilliant. Let me fix this.
@@ -704,7 +726,7 @@ func TestGossipsubGraftPruneRetry(t *testing.T) { | |||
} | |||
|
|||
func TestGossipsubControlPiggyback(t *testing.T) { | |||
t.Skip("travis regularly fails on this test") | |||
//t.Skip("travis regularly fails on this test") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aschmahmann I'm not being able to understand the last line of this test & of TestGossipsubGossip
// and wait for some gossip flushing
time.Sleep(time.Second * 2)
Please can you explain what this is for ? Why is it important to flush the remaining gossip/control messages once the test is complete ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
closing as it doesn't fix the tests and needs quite a bit of work. |
For #202
@aschmahmann
In spite of many repeated runs, I was unable to reproduce the flakiness. However, I can intuitively see that the mesh overlay is not guaranteed to be in a steady state if we just sleep & hope for the best.
Two main changes made to all the 3 tests mentioned in the issue are:
PubSub.topics
reflects the correct stateDlow
peers rather than sleeping for 2 seconds & hoping the corresponding heartbeats get processed in the meantimeLet me know what you think. If this approach sounds reasonable, we can do something similar for the other tests.