Fix data races in fetcher tests #1121

janos · 2019-01-15T12:54:15Z

This PR addresses #1109 issue and fixes another set of data races.

This PR introduces no changes to function signatures, but adds one if statement to Fetcher.run with insignificant performance degradation on production code.

Fetcher.run function is controlled by the provided Context, but there is no synchronization between the termination of Fetcher.run goroutine and test termination. Some of the fetcher tests change the global variable searchTimeout and Fetcher.run reads it. If Fetcher.run goroutine is not terminated before serachTimeout is set back to the original value (with defer), or another tests starts, there is a race between reading and writing serachTimeout.

The first race is described in #1109, and the second type of race is go test -race -count 1 -v github.com/ethereum/go-ethereum/swarm/network

=== RUN   TestDiscovery
--- PASS: TestDiscovery (0.02s)
=== RUN   TestFetcherSingleRequest
--- PASS: TestFetcherSingleRequest (0.10s)
=== RUN   TestFetcherCancelStopsFetcher
--- PASS: TestFetcherCancelStopsFetcher (0.21s)
=== RUN   TestFetcherCancelStopsRequest
--- PASS: TestFetcherCancelStopsRequest (0.31s)
=== RUN   TestFetcherOfferUsesSource
--- PASS: TestFetcherOfferUsesSource (0.60s)
=== RUN   TestFetcherOfferAfterRequestUsesSourceFromContext
--- PASS: TestFetcherOfferAfterRequestUsesSourceFromContext (0.21s)
=== RUN   TestFetcherRetryOnTimeout
==================
WARNING: DATA RACE
Write at 0x000005136cf8 by goroutine 102:
  github.com/ethereum/go-ethereum/swarm/network.TestFetcherRetryOnTimeout()
      /Users/janos/go/src/github.com/ethereum/go-ethereum/swarm/network/fetcher_test.go:294 +0x3f7
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:827 +0x162

Previous read at 0x000005136cf8 by goroutine 92:
  github.com/ethereum/go-ethereum/swarm/network.(*Fetcher).run()
      /Users/janos/go/src/github.com/ethereum/go-ethereum/swarm/network/fetcher.go:239 +0x4d5

Goroutine 102 (running) created at:
  testing.(*T).Run()
      /usr/local/go/src/testing/testing.go:878 +0x659
  testing.runTests.func1()
      /usr/local/go/src/testing/testing.go:1119 +0xa8
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:827 +0x162
  testing.runTests()
      /usr/local/go/src/testing/testing.go:1117 +0x4ee
  testing.(*M).Run()
      /usr/local/go/src/testing/testing.go:1034 +0x2ee
  main.main()
      _testmain.go:106 +0x221

Goroutine 92 (finished) created at:
  github.com/ethereum/go-ethereum/swarm/network.TestFetcherCancelStopsRequest()
      /Users/janos/go/src/github.com/ethereum/go-ethereum/swarm/network/fetcher_test.go:153 +0x4b8
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:827 +0x162
==================
--- FAIL: TestFetcherRetryOnTimeout (0.51s)
    testing.go:771: race detected during execution of test

where two goroutines from two different tests are racing on the same global value.

Synchronization is achieved by providing a test hook function that waits for a channel to be closed when Fetcher.run goroutine is done. The right order of defer functions in Fetcher tests is important to avoid races on changing searchTimeout and testHookFetcherRun. Function testHookFetcherRun is added to avoid changes to FetcherFactory.New function signature which calls Fetcher.runs in a goroutine in some of the tests.

With Fetcher.run goroutine synchronization, a deadlock in mockRequester.doRequest function was detected when context is canceled and nothing is receiving on requestC channel.

All races and problems handled in this PR are not affecting production code, just tests.

All Fetcher tests should finish without races:

frncmx

Nice catch! I missed that multiple tests running after each other can cause an issue, too. As the previous test's goroutine did not stop yet.

I think the test hooks introduce a lot of complexity. I wonder if that necessary. Have you considered using sync.WaitGroup? Something like:

	// set searchTimeOut to low value so the test is quicker
	originalSearchTimeout := searchTimeout
	ctx, cancel := context.WithCancel(context.Background())
	var wg sync.WaitGroup
	defer func() {
		cancel()
		wg.Wait()
		searchTimeout = originalSearchTimeout
	}()
	searchTimeout = 250 * time.Millisecond

	wg.Add(1)
	go func() {
		fetcher.run(ctx, peersToSkip)
		wg.Done()
	}()

That is still a lot of code, but I guess familiarity helps to understand, i.e., wait groups and context should be familiar to people working with Go. Also, the above would not require production code change.

That being said, there would 1 case remain where it's impossible to use the above concept. Namely TestFetcherFactory. I think FetcherFactory.New must be changed as it starts a goroutine but does not provide any way for clear termination (possibility to wait for shutdown).

janos · 2019-01-16T12:09:07Z

@frncmx Thanks for the suggestion, but you already answered why I did not use wait group approach. There has to be a change in FetcherFactory.New which would accept a synchronization argument (wait group or a channel) which would actually only be used in test, not in production. I do not think that test hook is complex, it is even used in standard library, and there is also a test for it in this pr. My first attempt was to actually pass a channel fetcher.run, but wait group is also a nice trick. Would you prefer to change FetcherFactory.New to accept the wait group and remove the test hook?

frncmx · 2019-01-16T14:31:20Z

I looked into what effort would it take to change the interface of FetcherFactory.New. Well. Not simple; not cheap. I'm not sure that would worth it.

Clearly:

Fetcher's shutdown is not clean. The program may exit sooner after a call to cancel() than the goroutine finishes. I'm not sure, but I think that may not cause problems in this case.
contextdoes not provide a way to wait for cancellation as it is. There are some articles criticizing that, some going as far as Context isn't for cancellation. However, in the original/official context blogpost they use basically quit/done channels to signal back.
I'm on the same side with you about not liking production code changes for test, or at least really minimize them.

That being said what would you think about adding a quit channel to Fetcher?
That would require a trivial change in the NewFetcher() and on type Fetcher. Then when the context is done we can close that channel and in TestFetcherFactory we could listen to the event.

In code;

diff --git a/swarm/network/fetcher.go b/swarm/network/fetcher.go
index 6aed57e22..80524b4e4 100644
--- a/swarm/network/fetcher.go
+++ b/swarm/network/fetcher.go
@@ -215,6 +215,7 @@ func (f *Fetcher) run(ctx context.Context, peers *sync.Map) {
                        // all Fetcher context closed, can quit
                case <-ctx.Done():
                        log.Trace("terminate fetcher", "request addr", f.addr)
+                       close(f.quitC)
                        // TODO: send cancelations to all peers left over in peers map (i.e., those we requested from)
                        return
                }

frncmx · 2019-01-16T14:33:53Z

mockNetFetcher actually has the same approach.
swarm/storage/netstore_test.go:37

frncmx · 2019-01-16T14:39:22Z

I still don't like the hook that much even though I seen something similar in standard library once and I admit you tested the hook. I don't like the hook because the usage of the hook doesn't feel clean and simple to me in the tests.

I actually liked your first approach better. Changing a private function signature for a test (even though I know that should not be the default approach). But then I realized that did not fix all the data races.

So, please bare with me. :)

janos · 2019-01-16T14:52:52Z

@frncmx thanks for the explanations, they are very helpful. Yes, quit channel is cleaner approach, but we will have that field in fetcher that is actually not used in production, only in tests. If it is ok to have that overhead in production, I will change the implementation, but we would still need to change FetcherFactory.New function to be able to wait in tests until internal fetcher quit channel is closed. That can be done in a few ways, by passing the channel as the argument, having a callback, or extending NetFetcher interface. What would you prefer? Or maybe you would like to take over this PR, given the amount of work you invested in it?

janos · 2019-01-17T16:01:46Z

Rejected in favour to ethereum/go-ethereum#18469.

janos added 3 commits January 15, 2019 12:22

swarm/network: fix TestFetcherRetryOnTimeout race

4d82225

swarm/network: fix races between different fetcher tests

f35879a

swarm/network: prevent mockRequester.doDequest deadlock

f8b4dbb

janos added bug test stability labels Jan 15, 2019

janos requested review from zelig and frncmx January 15, 2019 12:54

zelig assigned janos Jan 15, 2019

janos mentioned this pull request Jan 15, 2019

Data race in storage.(*mockNetFetcher).Request() #1117

Closed

zelig approved these changes Jan 15, 2019

View reviewed changes

frncmx self-assigned this Jan 16, 2019

frncmx suggested changes Jan 16, 2019

View reviewed changes

frncmx mentioned this pull request Jan 17, 2019

swarm/network: fix data race in fetcher_test.go ethereum/go-ethereum#18469

Merged

zelig added the submitted to ethereum/go-ethereum label Jan 17, 2019

janos closed this Jan 17, 2019

janos deleted the race-fetcher branch January 17, 2019 16:01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix data races in fetcher tests #1121

Fix data races in fetcher tests #1121

janos commented Jan 15, 2019

frncmx left a comment •

edited

Loading

janos commented Jan 16, 2019

frncmx commented Jan 16, 2019 •

edited

Loading

frncmx commented Jan 16, 2019

frncmx commented Jan 16, 2019 •

edited

Loading

janos commented Jan 16, 2019

janos commented Jan 17, 2019

Fix data races in fetcher tests #1121

Fix data races in fetcher tests #1121

Conversation

janos commented Jan 15, 2019

frncmx left a comment • edited Loading

Choose a reason for hiding this comment

janos commented Jan 16, 2019

frncmx commented Jan 16, 2019 • edited Loading

frncmx commented Jan 16, 2019

frncmx commented Jan 16, 2019 • edited Loading

janos commented Jan 16, 2019

janos commented Jan 17, 2019

frncmx left a comment •

edited

Loading

frncmx commented Jan 16, 2019 •

edited

Loading

frncmx commented Jan 16, 2019 •

edited

Loading