Try to fix flaky tests by waiting for subscriptions & mesh to be ready #203

Closed

aarshkshah1992
Contributor

For #202

@aschmahmann

In spite of many repeated runs, I was unable to reproduce the flakiness. However, I can intuitively see that the mesh overlay is not guaranteed to be in a steady state if we just sleep & hope for the best.

The two main changes made to all three tests mentioned in the issue are:

  1. Poll until the incoming subscription messages on all peers have been processed, so that PubSub.topics reflects the correct state
  2. Poll until the overlay mesh has at least Dlow peers, rather than sleeping for 2 seconds and hoping the corresponding heartbeats get processed in the meantime

Let me know what you think. If this approach sounds reasonable, we can do something similar for the other tests.

@@ -23,6 +23,26 @@ func getGossipsubs(ctx context.Context, hs []host.Host, opts ...Option) []*PubSu
return psubs
}

func waitForMeshConstruction(topic string, psubs []*PubSub) {
Contributor Author

@aarshkshah1992 aarshkshah1992 Oct 4, 2019

This races with the heartbeat goroutine over GossipsubRouter.mesh[T]. Let me think of something.

Contributor

yep, if your function is very computationally inexpensive you can generally use something along the lines of this:

	done := make(chan resultType, 1)
	p.eval <- func() {
		done <- myFn()
	}
	return <-done

Note: the done channel is only needed if your function has to be synchronous. Also, you may want to add select statements that check for an expired context if you want to reuse these functions a bunch.


// wait for all incoming subscription messages to be processed
// this method should ONLY be called if all connected peers subscribe to the same topic
func waitForSubscriptionProcessing(topic string, psubs []*PubSub) {
Contributor Author

@aarshkshah1992 aarshkshah1992 Oct 4, 2019

This races with PubSub.handleIncomingRPC when it tries to update PubSub.topics[T].

Contributor

See above: PubSub is currently set up to have a few long-running goroutines with event loops in them instead of using mutexes. There's an internal p.eval channel that runs arbitrary functions for you inside the event loop, so you don't run into races.

Contributor Author

Aha... that is brilliant. Let me fix this.

@@ -704,7 +726,7 @@ func TestGossipsubGraftPruneRetry(t *testing.T) {
}

func TestGossipsubControlPiggyback(t *testing.T) {
-	t.Skip("travis regularly fails on this test")
+	//t.Skip("travis regularly fails on this test")
Contributor Author

@aschmahmann I can't make sense of the last line of this test and of TestGossipsubGossip:

	// and wait for some gossip flushing
	time.Sleep(time.Second * 2)

Can you please explain what this is for? Why is it important to flush the remaining gossip/control messages once the test is complete?

Contributor

@aschmahmann aschmahmann Oct 4, 2019

Not really sure, perhaps it's just supposed to help close things down when running consecutive tests, even though that should be covered by cancelling the context 🤷‍♂.

Hopefully either @vyzo can shed some light or perhaps there's an answer to be found in the first gossipsub PR #67.

@aarshkshah1992 changed the title from "Fix flaky tests by waiting for subscriptions & mesh to be ready" to "Try to fix flaky tests by waiting for subscriptions & mesh to be ready" on Oct 6, 2019
Collaborator

vyzo commented Mar 23, 2020

closing as it doesn't fix the tests and needs quite a bit of work.

@vyzo vyzo closed this Mar 23, 2020