
[BUGFIX] Fix race condition in freeport #7567

Merged (6 commits), Apr 1, 2020

Conversation

pierresouchay (Contributor)

This removes a race condition in reset(), since pendingPorts can be set to nil in reset().

If the ticker fires at the wrong time, it crashes the unit test.

In reset() we avoid this race condition by cancelling the goroutine via the
killTicker chan.

We also properly clean everything up, so the garbage collector can work as expected.

To reproduce the existing bug:
while go test -timeout 30s github.com/hashicorp/consul/sdk/freeport -run '^(Test.*)$'; do go clean -testcache; done

This crashes after a few 10s runs on my machine.

The error could sometimes be seen in the unit tests:

[INFO] freeport: resetting the freeport package state
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0x1125536]

goroutine 25 [running]:
container/list.(*List).Len(...)
/usr/local/Cellar/go/1.14/libexec/src/container/list/list.go:66
github.com/hashicorp/consul/sdk/freeport.checkFreedPortsOnce()
/Users/p.souchay/go/src/github.com/hashicorp/consul/sdk/freeport/freeport.go:157 +0x86
github.com/hashicorp/consul/sdk/freeport.checkFreedPorts()
/Users/p.souchay/go/src/github.com/hashicorp/consul/sdk/freeport/freeport.go:147 +0x71
created by github.com/hashicorp/consul/sdk/freeport.initialize
/Users/p.souchay/go/src/github.com/hashicorp/consul/sdk/freeport/freeport.go:113 +0x2cf
FAIL github.com/hashicorp/consul/sdk/freeport 1.607s

@hanshasselberg (Member) left a comment

I think I found a race :)

sdk/freeport/freeport.go (review comment, resolved)
@pierresouchay (Contributor, Author)

@i0rek I fixed it with your suggestion.

@rboyer (Member) commented Apr 1, 2020

Ah I see the issue now. I think there's still a slight problem with goroutine lifespans with your suggestion. I'm testing locally and will report back.

@pierresouchay (Contributor, Author)

@rboyer the test has now been running for 1h+ without any failure with:

while go test -timeout 30s github.com/hashicorp/consul/sdk/freeport -run '^(Test.*)$'; do go clean -testcache; done

@rboyer (Member) commented Apr 1, 2020

It's a very thin edge case, but the version in this PR only signals the goroutine to stop; it does not actually wait for that to happen. I suspect you could get the same panic if the machine were heavily loaded, causing everything to slow down a lot.

Here's an alternate implementation that actually waits:

diff --git sdk/freeport/freeport.go sdk/freeport/freeport.go
index 4ddc93737..af46b24fe 100644
--- sdk/freeport/freeport.go
+++ sdk/freeport/freeport.go
@@ -66,6 +66,14 @@ var (
 
 	// total is the total number of available ports in the block for use.
 	total int
+
+	// stopCh is used to signal to background goroutines to terminate. Only
+	// really exists for the safety of reset() during unit tests.
+	stopCh chan struct{}
+
+	// stopWg is used to keep track of background goroutines that are still
+	// alive. Only really exists for the safety of reset() during unit tests.
+	stopWg sync.WaitGroup
 )
 
 // initialize is used to initialize freeport.
@@ -108,7 +116,12 @@ func initialize() {
 	}
 	total = freePorts.Len()
 
-	go checkFreedPorts()
+	stopWg.Add(1)
+	stopCh = make(chan struct{})
+	// Note: we pass this param explicitly to the goroutine so that we can
+	// freely recreate the underlying stop channel during reset() after closing
+	// the original.
+	go checkFreedPorts(stopCh)
 }
 
 // reset will reverse the setup from initialize() and then redo it (for tests)
@@ -118,6 +131,17 @@ func reset() {
 
 	logf("INFO", "resetting the freeport package state")
 
+	// Terminate background goroutine and wait for it to complete.
+	if stopCh != nil {
+		now := time.Now()
+		close(stopCh)
+		stopCh = nil
+
+		stopWg.Wait()
+		dur := time.Since(now)
+		logf("INFO", "took %s to wait for freeport goroutine to die", dur)
+	}
+
 	effectiveMaxBlocks = 0
 	firstPort = 0
 	if lockLn != nil {
@@ -132,11 +156,18 @@ func reset() {
 	total = 0
 }
 
-func checkFreedPorts() {
+func checkFreedPorts(stopCh <-chan struct{}) {
+	defer stopWg.Done()
+
 	ticker := time.NewTicker(250 * time.Millisecond)
 	for {
-		<-ticker.C
-		checkFreedPortsOnce()
+		select {
+		case <-stopCh:
+			logf("INFO", "Closing checkFreedPorts()")
+			return
+		case <-ticker.C:
+			checkFreedPortsOnce()
+		}
 	}
 }

@pierresouchay (Contributor, Author)

@rboyer I agree, your solution is more elegant and probably safer. I applied your patch.

@pierresouchay (Contributor, Author)

@rboyer While I was not able to crash it for more than 1h with the previous implementation, with your patch it fails after ~100 test runs with a timeout:

[INFO] freeport: resetting the freeport package state
panic: test timed out after 30s

goroutine 42 [running]:
testing.(*M).startAlarm.func1()
	/usr/local/Cellar/go/1.14/libexec/src/testing/testing.go:1460 +0xdf
created by time.goFunc
	/usr/local/Cellar/go/1.14/libexec/src/time/sleep.go:168 +0x44

goroutine 1 [chan receive]:
testing.(*T).Run(0xc000154480, 0x117d05e, 0xe, 0x1185ab8, 0x107d801)
	/usr/local/Cellar/go/1.14/libexec/src/testing/testing.go:1044 +0x37e
testing.runTests.func1(0xc000154240)
	/usr/local/Cellar/go/1.14/libexec/src/testing/testing.go:1285 +0x78
testing.tRunner(0xc000154240, 0xc000165e10)
	/usr/local/Cellar/go/1.14/libexec/src/testing/testing.go:992 +0xdc
testing.runTests(0xc000136160, 0x12a1fe0, 0x3, 0x3, 0x0)
	/usr/local/Cellar/go/1.14/libexec/src/testing/testing.go:1283 +0x2a7
testing.(*M).Run(0xc00016e000, 0x0)
	/usr/local/Cellar/go/1.14/libexec/src/testing/testing.go:1200 +0x15f
main.main()
	_testmain.go:48 +0x135

goroutine 22 [semacquire]:
sync.runtime_Semacquire(0x12cf2c8)
	/usr/local/Cellar/go/1.14/libexec/src/runtime/sema.go:56 +0x42
sync.(*WaitGroup).Wait(0x12cf2c0)
	/usr/local/Cellar/go/1.14/libexec/src/sync/waitgroup.go:130 +0x64
github.com/hashicorp/consul/sdk/freeport.reset()
	/Users/p.souchay/go/src/github.com/hashicorp/consul/sdk/freeport/freeport.go:139 +0x295
github.com/hashicorp/consul/sdk/freeport.TestTakeReturn(0xc000154480)
	/Users/p.souchay/go/src/github.com/hashicorp/consul/sdk/freeport/freeport_test.go:206 +0x184
testing.tRunner(0xc000154480, 0x1185ab8)
	/usr/local/Cellar/go/1.14/libexec/src/testing/testing.go:992 +0xdc
created by testing.(*T).Run
	/usr/local/Cellar/go/1.14/libexec/src/testing/testing.go:1043 +0x357

goroutine 25 [semacquire]:
sync.runtime_SemacquireMutex(0x12cf13c, 0x0, 0x1)
	/usr/local/Cellar/go/1.14/libexec/src/runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0x12cf138)
	/usr/local/Cellar/go/1.14/libexec/src/sync/mutex.go:138 +0xfc
sync.(*Mutex).Lock(...)
	/usr/local/Cellar/go/1.14/libexec/src/sync/mutex.go:81
github.com/hashicorp/consul/sdk/freeport.checkFreedPortsOnce()
	/Users/p.souchay/go/src/github.com/hashicorp/consul/sdk/freeport/freeport.go:174 +0x4be
github.com/hashicorp/consul/sdk/freeport.checkFreedPorts(0xc00010e480)
	/Users/p.souchay/go/src/github.com/hashicorp/consul/sdk/freeport/freeport.go:168 +0x86
created by github.com/hashicorp/consul/sdk/freeport.initialize
	/Users/p.souchay/go/src/github.com/hashicorp/consul/sdk/freeport/freeport.go:124 +0x316
FAIL	github.com/hashicorp/consul/sdk/freeport	30.011s
FAIL

@pierresouchay (Contributor, Author)

@rboyer Found the issue: in your patch, you waited for the goroutine to end while holding the mutex.
In the latest patch, I wait without holding the mutex, to avoid a possible deadlock.

@rboyer (Member) commented Apr 1, 2020

I'm having terrifying memories come back to me around agent service registration locking.

Relevant subset of changed code from the last snippet:

 // reset will reverse the setup from initialize() and then redo it (for tests)
 func reset() {
+	logf("INFO", "resetting the freeport package state")
+
+	shutdownGoroutine()
+
 	mu.Lock()
 	defer mu.Unlock()
 
-	logf("INFO", "resetting the freeport package state")
-
 	effectiveMaxBlocks = 0
 	firstPort = 0
 	if lockLn != nil {
@@ -132,11 +147,32 @@ func reset() {
 	total = 0
 }
 
+func shutdownGoroutine() {
+	mu.Lock()
+	if stopCh == nil {
+		mu.Unlock()
+		return
+	}
+
+	close(stopCh)
+	stopCh = nil
+	mu.Unlock()
+
+	stopWg.Wait()
+}

Contractually, no initialize() calls should be happening while we call reset(), so stopCh won't be reassigned in between the two locked sections.

@pierresouchay (Contributor, Author)

@rboyer Done as you suggested, in the last commit.

@pierresouchay (Contributor, Author)

grrrr.. unstable tests

@rboyer (Member) left a comment

LGTM.

@rboyer rboyer merged commit 5703aad into hashicorp:master Apr 1, 2020