Prevent panic errors in tombstone GC #2582

Closed

@ybubnov ybubnov commented Dec 8, 2016

This patch fixes a race condition between a concurrent call to the SetEnabled method and the timer handler. If SetEnabled acquires the lock first, it erases the contents of the expires map.

When the timer handler then acquires the lock, it looks up an interval that no longer exists (the lookup evaluates to nil), which causes the panic.

I added a check that ensures the tombstone GC is enabled before sending anything to the expire channel.

fixes #2087


ybubnov commented Dec 9, 2016

I made the following changes to the testutil/server.go file to preserve the logs of the started consul daemon, so that I could track down the error seen in CI:

diff --git a/testutil/server.go b/testutil/server.go
index 8ab196e..7af8c95 100644
--- a/testutil/server.go
+++ b/testutil/server.go
@@ -174,7 +173,6 @@ func NewTestServerConfig(t TestingT, cb ServerConfigCallback) *TestServer {
 
        configFile, err := ioutil.TempFile(dataDir, "config")
        if err != nil {
-               defer os.RemoveAll(dataDir)
                t.Fatalf("err: %s", err)
        }
 
@@ -195,6 +193,9 @@ func NewTestServerConfig(t TestingT, cb ServerConfigCallback) *TestServer {
        }
        configFile.Close()
 
+       consulConfig.Stdout, _ = ioutil.TempFile(dataDir, "log")
+       consulConfig.Stderr = consulConfig.Stdout
+
        stdout := io.Writer(os.Stdout)
        if consulConfig.Stdout != nil {
                stdout = consulConfig.Stdout
@@ -256,7 +257,6 @@ func NewTestServerConfig(t TestingT, cb ServerConfigCallback) *TestServer {
 // Stop stops the test Consul server, and removes the Consul data
 // directory once we are done.
 func (s *TestServer) Stop() {
-       defer os.RemoveAll(s.Config.DataDir)
 
        if err := s.cmd.Process.Kill(); err != nil {
                s.t.Errorf("err: %s", err)

After that, each spawned consul saved all of its data into a separate sub-directory, /tmp/consul*. Then I ran the following command multiple times to reproduce the test failures:

go test --tags=consul ./api
--- FAIL: TestClientPutGetDelete (10.36s)
        server.go:300: Request failed: Get http://127.0.0.1:37001/v1/catalog/nodes: dial tcp 127.0.0.1:37001: getsockopt: connection refused
        server.go:300: Request failed: Get http://127.0.0.1:37001/v1/catalog/nodes: dial tcp 127.0.0.1:37001: getsockopt: connection refused
        server.go:300: Request failed: Get http://127.0.0.1:37001/v1/catalog/nodes: dial tcp 127.0.0.1:37001: getsockopt: connection refused

Then I looked into the persisted config and logs of the consul agent:

# fgrep -l 37001 /tmp
/tmp/consul984933444/config063953430
cat /tmp/consul984933444/config063953430 | jq .
{
  "node_name": "node34761",
  "performance": {
    "raft_multiplier": 1
  },
  "bootstrap": true,
  "server": true,
  "data_dir": "/tmp/consul984933444",
  "disable_update_check": true,
  "log_level": "debug",
  "bind_addr": "127.0.0.1",
  "addresses": {},
  "ports": {
    "dns": 39441,
    "http": 37001,
    "rpc": 33613,
    "serf_lan": 37127,
    "serf_wan": 45291,
    "server": 33417
  }
}
cat /tmp/consul984933444/log113046967 
==> WARNING: Bootstrap mode enabled! Do not enable unless necessary
==> Starting Consul agent...
==> Starting Consul agent RPC...
==> Error starting dns server: dns tcp setup failed: listen tcp 127.0.0.1:39441: bind: address already in use

@ybubnov ybubnov closed this Dec 9, 2016
@ybubnov ybubnov reopened this Dec 9, 2016

ybubnov commented Dec 9, 2016

Sorry, I accidentally closed the pull request.

In my opinion, this happens because of the way ports are allocated for the test server:

// randomPort asks the kernel for a random port to use.
func randomPort() int {
        l, err := net.Listen("tcp", "127.0.0.1:0")
        if err != nil {
                panic(err)
        }
        defer l.Close()
        return l.Addr().(*net.TCPAddr).Port
}

Because the listener is closed immediately, this approach does not guarantee that the allocated port will not be taken by a different process before the test server binds to it.
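
The window can be illustrated with a small sketch. The `randomPort` variant below returns an error instead of panicking, and `canRebind` is a hypothetical helper added for the illustration; neither is part of testutil.

```go
package main

import (
	"fmt"
	"net"
)

// randomPort asks the kernel for a free port and releases it right away,
// like the helper quoted above (this variant returns the error instead of
// panicking).
func randomPort() (int, error) {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer l.Close()
	return l.Addr().(*net.TCPAddr).Port, nil
}

// canRebind is a hypothetical helper: it reports whether some listener
// (here, this same process) can bind the given port again.
func canRebind(port int) bool {
	l, err := net.Listen("tcp", fmt.Sprintf("127.0.0.1:%d", port))
	if err != nil {
		return false
	}
	l.Close()
	return true
}

func main() {
	port, err := randomPort()
	if err != nil {
		panic(err)
	}
	// Once randomPort returns, nothing holds the port any more, so any
	// process, including another test server, is free to bind it.
	fmt.Println(port, canRebind(port))
}
```

If `canRebind` succeeds here, so would any other process that asked for the same port between the call to `randomPort` and the start of the consul agent.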


ybubnov commented Dec 9, 2016

To test this theory, I used the following command to search for another consul that used the same port:

fgrep -rl --include="*config*" 39411 /tmp
/tmp/consul760041623/config283621642
/tmp/consul114789517/config198625416

And I found that two of them used the same port 39411:

fgrep -rl --include="*config*" 39411 /tmp | xargs -I{} cat {} | jq . | grep 39411 -C 2
  "addresses": {},
  "ports": {
    "dns": 39411,
    "http": 33507,
    "rpc": 40427,
--
    "serf_lan": 45281,
    "serf_wan": 39465,
    "server": 39411
  }
}


ybubnov commented Dec 9, 2016

@mckennajones, looks like your pull request #2577 failed for the same reason.

This patch creates a global reservation mapping that remembers
the ports already allocated for the consul agents.

To release the reservations when the server is stopped,
it also ships a new function "releasePorts" that removes the ports
from the mapping.
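
A rough sketch of such a reservation mapping follows. Only the name `releasePorts` comes from the commit message; its signature, the `reserved` map, and the helpers `reservePort` and `isReserved` are assumptions made for illustration, not the code of this patch.

```go
package main

import (
	"fmt"
	"net"
	"sync"
)

// reserved remembers the ports already handed out within this process.
// The variable names are hypothetical.
var (
	reservedMu sync.Mutex
	reserved   = make(map[int]bool)
)

// reservePort asks the kernel for a free port and retries until it gets one
// that has not been handed out before by this process.
func reservePort() int {
	for {
		l, err := net.Listen("tcp", "127.0.0.1:0")
		if err != nil {
			panic(err)
		}
		port := l.Addr().(*net.TCPAddr).Port
		l.Close()

		reservedMu.Lock()
		taken := reserved[port]
		if !taken {
			reserved[port] = true
		}
		reservedMu.Unlock()

		if !taken {
			return port
		}
	}
}

// releasePorts removes the given ports from the mapping, for example when a
// test server is stopped. The variadic signature is an assumption.
func releasePorts(ports ...int) {
	reservedMu.Lock()
	defer reservedMu.Unlock()
	for _, p := range ports {
		delete(reserved, p)
	}
}

// isReserved reports whether a port is currently in the mapping.
func isReserved(port int) bool {
	reservedMu.Lock()
	defer reservedMu.Unlock()
	return reserved[port]
}

func main() {
	a, b := reservePort(), reservePort()
	fmt.Println(a != b) // prints "true": the mapping rules out a repeat
	releasePorts(a, b)
}
```

The mapping only protects against duplicates inside one process; a port can still be taken by an unrelated process after the listener is closed.
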
l, err := net.Listen("tcp", "127.0.0.1:0")
if err != nil {
panic(err)
makeport := func() (int, bool) {

Author

@sethvargo, could you please say whether these changes are acceptable to you?

Contributor

Hi @ybubnov

This is the testutil server and should never be used in production. Pushing the ports into a map does not solve the issue though (and I've talked with DC about this). Global variables are not shared across multiple packages: https://gist.github.com/sethvargo/ebc696c9ed8f541a9887307115918181.

If more than one package imports this package, they both start with empty, independent maps. It's really a terrible problem that does not have a good solution 😦

Author

@ybubnov ybubnov Dec 9, 2016

Right, I understand that this juggling of port allocation is only needed to speed up test runs; it is definitely not meant for production usage.

But I did not quite get the point about global variables not being shared across multiple packages (I assume you are talking about the test packages).

If so, then the go test command sequentially (one package at a time) compiles a standalone binary for each test package and runs the tests in parallel only within a single package:

go test --help
...
        -parallel n
            Allow parallel execution of test functions that call t.Parallel.
            The value of this flag is the maximum number of tests to run
            simultaneously; by default, it is set to the value of GOMAXPROCS.
            Note that -parallel only applies within a single test binary.
            The 'go test' command may run tests for different packages
            in parallel as well, according to the setting of the -p flag
            (see 'go help build').
...

So this map will be shared within a single binary. I also agree that the proposed changes do not solve the problem (because there is clearly no solution). In my view, these changes reduce the probability of allocating the same port twice while testing a single package.

Contributor

Okay, as long as we accept the tradeoffs, I'm okay with this.

@ybubnov ybubnov deleted the prevent-panic-in-tombstone-gc branch April 28, 2017 16:09