Prevent panic errors in tombstone GC #2582
This patch fixes a race condition between a concurrent call to the "SetEnabled" method and the timer handler. If "SetEnabled" acquires the lock first, it erases the contents of the "expires" map. When the timer handler then acquires the lock, it looks up an interval that no longer exists (the lookup evaluates to nil), which causes the panic. I added a check that ensures the tombstone GC is enabled before anything is sent to the expire channel.
I made the following replacements in the testutil/server.go file to preserve the logs of the started
diff --git a/testutil/server.go b/testutil/server.go
index 8ab196e..7af8c95 100644
--- a/testutil/server.go
+++ b/testutil/server.go
@@ -174,7 +173,6 @@ func NewTestServerConfig(t TestingT, cb ServerConfigCallback) *TestServer {
configFile, err := ioutil.TempFile(dataDir, "config")
if err != nil {
- defer os.RemoveAll(dataDir)
t.Fatalf("err: %s", err)
}
@@ -195,6 +193,9 @@ func NewTestServerConfig(t TestingT, cb ServerConfigCallback) *TestServer {
}
configFile.Close()
+ consulConfig.Stdout, _ = ioutil.TempFile(dataDir, "log")
+ consulConfig.Stderr = consulConfig.Stdout
+
stdout := io.Writer(os.Stdout)
if consulConfig.Stdout != nil {
stdout = consulConfig.Stdout
@@ -256,7 +257,6 @@ func NewTestServerConfig(t TestingT, cb ServerConfigCallback) *TestServer {
// Stop stops the test Consul server, and removes the Consul data
// directory once we are done.
func (s *TestServer) Stop() {
- defer os.RemoveAll(s.Config.DataDir)
if err := s.cmd.Process.Kill(); err != nil {
s.t.Errorf("err: %s", err)
After that each spawned
go test --tags=consul ./api
--- FAIL: TestClientPutGetDelete (10.36s)
server.go:300: Request failed: Get http://127.0.0.1:37001/v1/catalog/nodes: dial tcp 127.0.0.1:37001: getsockopt: connection refused
server.go:300: Request failed: Get http://127.0.0.1:37001/v1/catalog/nodes: dial tcp 127.0.0.1:37001: getsockopt: connection refused
server.go:300: Request failed: Get http://127.0.0.1:37001/v1/catalog/nodes: dial tcp 127.0.0.1:37001: getsockopt: connection refused
I looked into the persisted logs of the Consul agent:
# fgrep -l 37001 /tmp
/tmp/consul984933444/config063953430
cat /tmp/consul984933444/config063953430 | jq .
{
"node_name": "node34761",
"performance": {
"raft_multiplier": 1
},
"bootstrap": true,
"server": true,
"data_dir": "/tmp/consul984933444",
"disable_update_check": true,
"log_level": "debug",
"bind_addr": "127.0.0.1",
"addresses": {},
"ports": {
"dns": 39441,
"http": 37001,
"rpc": 33613,
"serf_lan": 37127,
"serf_wan": 45291,
"server": 33417
}
}
cat /tmp/consul984933444/log113046967
==> WARNING: Bootstrap mode enabled! Do not enable unless necessary
==> Starting Consul agent...
==> Starting Consul agent RPC...
==> Error starting dns server: dns tcp setup failed: listen tcp 127.0.0.1:39441: bind: address already in use
Sorry, I accidentally closed the pull request. To my mind, this happens because of the way ports are allocated for the test server:
// randomPort asks the kernel for a random port to use.
func randomPort() int {
l, err := net.Listen("tcp", "127.0.0.1:0")
if err != nil {
panic(err)
}
defer l.Close()
return l.Addr().(*net.TCPAddr).Port
}
Because the listener is closed immediately, this approach does not guarantee that the allocated port will not be taken by a different process before the test server binds it.
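A small sketch of why the port is not reserved: once the listener is closed, anyone can bind the port again. The `canBind` helper is a hypothetical name added for this illustration, and `randomPort` is changed here to return an error instead of panicking.

```go
package main

import (
	"fmt"
	"net"
)

// randomPort asks the kernel for a free port and closes the listener,
// mirroring the approach quoted above.
func randomPort() (int, error) {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer l.Close()
	return l.Addr().(*net.TCPAddr).Port, nil
}

// canBind reports whether the port can be bound at this moment.
// The probe listener is closed again before returning.
func canBind(port int) bool {
	l, err := net.Listen("tcp", fmt.Sprintf("127.0.0.1:%d", port))
	if err != nil {
		return false
	}
	l.Close()
	return true
}

func main() {
	port, err := randomPort()
	if err != nil {
		panic(err)
	}
	// The port was only observed, not reserved: any process may bind it now.
	fmt.Printf("port %d can be bound by anyone: %v\n", port, canBind(port))
}
```

Between the call to `randomPort` and the moment the agent actually binds the port, another test server can be handed the same number, which matches the "address already in use" error in the log above.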
To test this theory, I used the following command to search for another Consul agent that used the same port:
fgrep -rl --include="*config*" 39411 /tmp
/tmp/consul760041623/config283621642
/tmp/consul114789517/config198625416
And I found that two of them used the same port:
fgrep -rl --include="*config*" 39411 /tmp | xargs -I{} cat {} | jq . | grep 39411 -C 2
"addresses": {},
"ports": {
"dns": 39411,
"http": 33507,
"rpc": 40427,
--
"serf_lan": 45281,
"serf_wan": 39465,
"server": 39411
}
}
@mckennajones, it looks like your pull request #2577 failed for the same reason.
This patch creates a global reservation map that remembers the ports already allocated to Consul agents. To release the reservations when a server is stopped, it also adds a new function, "releasePorts", that removes the ports from the map.
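The reservation idea can be sketched as below. This is an illustration under assumptions and not the code from this patch: only the name `releasePorts` comes from the description above, while `kernelPort`, `reservePort`, the retry limit of 100, and the variadic signature are hypothetical.

```go
package main

import (
	"fmt"
	"net"
	"sync"
)

var (
	reservedMu sync.Mutex
	// reserved holds the ports already handed out within this process.
	reserved = map[int]struct{}{}
)

// kernelPort asks the kernel for a free port and closes the listener.
func kernelPort() (int, error) {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer l.Close()
	return l.Addr().(*net.TCPAddr).Port, nil
}

// reservePort retries until the kernel returns a port that has not
// been handed out before by this process, then records it.
func reservePort() (int, error) {
	for attempt := 0; attempt < 100; attempt++ {
		port, err := kernelPort()
		if err != nil {
			return 0, err
		}
		reservedMu.Lock()
		_, taken := reserved[port]
		if !taken {
			reserved[port] = struct{}{}
		}
		reservedMu.Unlock()
		if !taken {
			return port, nil
		}
	}
	return 0, fmt.Errorf("no unreserved port found")
}

// releasePorts removes the given ports from the reservation map.
func releasePorts(ports ...int) {
	reservedMu.Lock()
	defer reservedMu.Unlock()
	for _, p := range ports {
		delete(reserved, p)
	}
}

func main() {
	a, err := reservePort()
	if err != nil {
		panic(err)
	}
	b, err := reservePort()
	if err != nil {
		panic(err)
	}
	fmt.Println("distinct ports:", a != b)
	releasePorts(a, b)
}
```

The map only prevents duplicates inside one process; it does not stop another process, or another test binary, from taking the same port.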
l, err := net.Listen("tcp", "127.0.0.1:0")
if err != nil {
panic(err)
makeport := func() (int, bool) {
@sethvargo, could you please say whether the changes are acceptable to you?
Hi @ybubnov
This is the testutil server and should never be used in production. Pushing the ports into a map does not solve the issue though (and I've talked with DC about this). Global variables are not shared across multiple packages: https://gist.github.com/sethvargo/ebc696c9ed8f541a9887307115918181.
If more than one package imports this package, they both start with empty, independent maps. It's really a terrible problem that does not have a good solution 😦
Right, I understand that this juggling of port allocation is needed only to speed up test runs; it is definitely not intended for production use.
But I did not quite get the point about global variables not being shared across multiple packages (I assume you are talking about the test packages).
If so, the go test command compiles a standalone binary for each test package, one package at a time, and runs tests in parallel only within a single package:
go test --help
...
-parallel n
Allow parallel execution of test functions that call t.Parallel.
The value of this flag is the maximum number of tests to run
simultaneously; by default, it is set to the value of GOMAXPROCS.
Note that -parallel only applies within a single test binary.
The 'go test' command may run tests for different packages
in parallel as well, according to the setting of the -p flag
(see 'go help build').
...
So this map will be shared within a single binary. I also agree that the proposed changes do not solve the problem (because there is clearly no solution). To my mind, these changes reduce the probability of allocating the same port twice while testing a single package.
Okay, as long as we accept the tradeoffs, I'm okay with this.
fixes #2087