teamcity: failed tests on master: acceptance/bank/node-restart #29149

Closed

cockroach-teamcity opened this issue Aug 27, 2018 · 23 comments

Labels: A-sql-execution Relating to SQL execution. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot.
Milestone: 2.1

@cockroach-teamcity (Member)

The following tests appear to have failed:

#864595:

--- FAIL: acceptance/TestNodeRestart (8.210s)
test_log_scope.go:81: test logs captured to: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/acceptance/logTestNodeRestart083184425
test_log_scope.go:62: use -show-logs to present logs inline
--- FAIL: acceptance/TestNodeRestart: TestNodeRestart/runMode=local (8.210s)
zchaos_test.go:152: pq: unable to dial n1: breaker open
------- Stdout: -------
CockroachDB node starting at 2018-08-27 20:44:25.653667204 +0000 UTC (took 0.7s)
build:               CCL v2.1.0-alpha.20180702-2035-gf47e44b @ 2018/08/27 20:34:40 (go1.10.3)
webui:               http://127.0.0.1:26258
sql:                 postgresql://root@127.0.0.1:26257?sslmode=disable
client flags:        /home/agent/work/.go/src/github.com/cockroachdb/cockroach/cockroach --host=127.0.0.1:26257 --insecure
logs:                /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/acceptance/TestNodeRestart/runMode=local/1
temp dir:            /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/acceptance/.localcluster413238900/1/cockroach-temp269295897
external I/O path:   /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/acceptance/.localcluster413238900/1/extern
store[0]:            path=/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/acceptance/.localcluster413238900/1
status:              initialized new cluster
clusterID:           c57929c9-0394-4a97-b769-f24308b3e766
nodeID:              1
CockroachDB node starting at 2018-08-27 20:44:25.784217118 +0000 UTC (took 0.8s)
build:               CCL v2.1.0-alpha.20180702-2035-gf47e44b @ 2018/08/27 20:34:40 (go1.10.3)
webui:               http://127.0.0.1:26262
sql:                 postgresql://root@127.0.0.1:26261?sslmode=disable
client flags:        /home/agent/work/.go/src/github.com/cockroachdb/cockroach/cockroach --host=127.0.0.1:26261 --insecure
logs:                /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/acceptance/TestNodeRestart/runMode=local/3
temp dir:            /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/acceptance/.localcluster413238900/3/cockroach-temp395953582
external I/O path:   /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/acceptance/.localcluster413238900/3/extern
store[0]:            path=/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/acceptance/.localcluster413238900/3
status:              initialized new node, joined pre-existing cluster
clusterID:           c57929c9-0394-4a97-b769-f24308b3e766
nodeID:              2
CockroachDB node starting at 2018-08-27 20:44:25.79403994 +0000 UTC (took 0.8s)
build:               CCL v2.1.0-alpha.20180702-2035-gf47e44b @ 2018/08/27 20:34:40 (go1.10.3)
webui:               http://127.0.0.1:26260
sql:                 postgresql://root@127.0.0.1:26259?sslmode=disable
client flags:        /home/agent/work/.go/src/github.com/cockroachdb/cockroach/cockroach --host=127.0.0.1:26259 --insecure
logs:                /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/acceptance/TestNodeRestart/runMode=local/2
temp dir:            /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/acceptance/.localcluster413238900/2/cockroach-temp206654075
external I/O path:   /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/acceptance/.localcluster413238900/2/extern
store[0]:            path=/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/acceptance/.localcluster413238900/2
status:              initialized new node, joined pre-existing cluster
clusterID:           c57929c9-0394-4a97-b769-f24308b3e766
nodeID:              4
CockroachDB node starting at 2018-08-27 20:44:25.843164199 +0000 UTC (took 0.9s)
build:               CCL v2.1.0-alpha.20180702-2035-gf47e44b @ 2018/08/27 20:34:40 (go1.10.3)
webui:               http://127.0.0.1:26264
sql:                 postgresql://root@127.0.0.1:26263?sslmode=disable
client flags:        /home/agent/work/.go/src/github.com/cockroachdb/cockroach/cockroach --host=127.0.0.1:26263 --insecure
logs:                /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/acceptance/TestNodeRestart/runMode=local/4
temp dir:            /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/acceptance/.localcluster413238900/4/cockroach-temp941819280
external I/O path:   /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/acceptance/.localcluster413238900/4/extern
store[0]:            path=/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/acceptance/.localcluster413238900/4
status:              initialized new node, joined pre-existing cluster
clusterID:           c57929c9-0394-4a97-b769-f24308b3e766
nodeID:              3
CockroachDB node starting at 2018-08-27 20:44:28.052850002 +0000 UTC (took 0.7s)
build:               CCL v2.1.0-alpha.20180702-2035-gf47e44b @ 2018/08/27 20:34:40 (go1.10.3)
webui:               http://127.0.0.1:26258
sql:                 postgresql://root@127.0.0.1:26257?sslmode=disable
client flags:        /home/agent/work/.go/src/github.com/cockroachdb/cockroach/cockroach --host=127.0.0.1:26257 --insecure
logs:                /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/acceptance/TestNodeRestart/runMode=local/1
temp dir:            /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/acceptance/.localcluster413238900/1/cockroach-temp034657623
external I/O path:   /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/acceptance/.localcluster413238900/1/extern
store[0]:            path=/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/acceptance/.localcluster413238900/1
status:              restarted pre-existing node
clusterID:           c57929c9-0394-4a97-b769-f24308b3e766
nodeID:              1
E180827 20:44:31.426595 6963 acceptance/zchaos_test.go:272  round 1: failed to do consistency check against node 3: Consistency checking is unimplmented and should be re-implemented using SQL
CockroachDB node starting at 2018-08-27 20:44:32.133032435 +0000 UTC (took 0.6s)
build:               CCL v2.1.0-alpha.20180702-2035-gf47e44b @ 2018/08/27 20:34:40 (go1.10.3)
webui:               http://127.0.0.1:26258
sql:                 postgresql://root@127.0.0.1:26257?sslmode=disable
client flags:        /home/agent/work/.go/src/github.com/cockroachdb/cockroach/cockroach --host=127.0.0.1:26257 --insecure
logs:                /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/acceptance/TestNodeRestart/runMode=local/1
temp dir:            /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/acceptance/.localcluster413238900/1/cockroach-temp004689993
external I/O path:   /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/acceptance/.localcluster413238900/1/extern
store[0]:            path=/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/acceptance/.localcluster413238900/1
status:              restarted pre-existing node
clusterID:           c57929c9-0394-4a97-b769-f24308b3e766
nodeID:              1
test logs left over in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/acceptance/logTestNodeRestart083184425

Please assign, take a look and update the issue accordingly.

cockroach-teamcity added this to the 2.1 milestone Aug 27, 2018
cockroach-teamcity added the C-test-failure and O-robot labels Aug 27, 2018
tbg self-assigned this Aug 28, 2018
tbg added the A-kv-client (Relating to the KV client and the KV interface) label Aug 28, 2018
petermattis assigned petermattis and unassigned tbg Sep 6, 2018
petermattis changed the title from "teamcity: failed tests on master: acceptance/TestNodeRestart" to "teamcity: failed tests on master: acceptance/bank/node-restart" Sep 6, 2018
@petermattis (Collaborator)

This is reproducible with the new acceptance/bank/node-restart roachtest via `roachtest run -l --count 100 acceptance/bank/node-restart`.

@tbg (Member) commented Sep 25, 2018

This is the same underlying issue as #30613 (comment).

tbg assigned jordanlewis and unassigned petermattis Sep 25, 2018
tbg added the A-sql-execution label and removed the A-kv-client label Sep 25, 2018
@petermattis (Collaborator)

In trying to reproduce / debug #29678, I'm instead hitting this bug. I'm adding some instrumentation to verify that the error occurs when distsql attempts to dial a connection.

@tschottdorf It looks to me that Dialer.DialInternalClient (used by distsender) is not using the circuit breaker. Am I missing something there?

@petermattis (Collaborator)

The instrumentation shows that the error is being generated from:

pkg/sql/distsql_running.go
65:	conn, err := req.nodeDialer.Dial(req.ctx, req.nodeID)

This roachtest is frequently stopping and restarting nodes. I'm guessing we're trying to schedule a flow on a recently restarted node.

@petermattis (Collaborator)

Note that for the node-restart test, I believe the only query that will use distsql is run when all of the nodes are up.

@jordanlewis or @RaduBerinde Does DistSQL take node health into consideration during planning? I don't see any reference to ConnHealth in the planning code (but admittedly I'm not familiar with that code and might have missed how that is done).

@RaduBerinde (Member)

We do, see checkNodeHealth:

if err := dsp.checkNodeHealth(ctx, nodeID, addr); err != nil {

@tbg (Member) commented Oct 1, 2018

@tschottdorf It looks to me that Dialer.DialInternalClient (used by distsender) is not using the circuit breaker. Am I missing something there?

Yeah, that looks right. I don't think this follows any deliberate system; see:

// TODO(bdarnell): Reconcile the different health checks and circuit
// breaker behavior in this file

@petermattis (Collaborator)

@tschottdorf Yep, that is exactly the discrepancy. Interestingly, if I move distsql in the other direction and add a Dialer.Dial variant that doesn't check the breaker (similar to DialInternalClient), then the breaker open error goes away for me.

@petermattis (Collaborator)

@RaduBerinde Thanks. I missed that. I'll enable some of that vlogging to try to find out why distsql is attempting to talk to a node that just restarted but which has an open breaker. I think it might be possible for Dialer.DialInternalClient to open a conn, causing conn health to report OK even though the breaker is still open.
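
To make that theory concrete, here is a minimal, self-contained Go sketch of the suspected interleaving (toyBreaker and toyDialer are hypothetical stand-ins, not the real rpc/nodedialer types):

package main

import (
	"errors"
	"fmt"
)

type toyBreaker struct{ tripped bool }

func (b *toyBreaker) Ready() bool { return !b.tripped }

type toyDialer struct {
	breaker *toyBreaker
	healthy bool // stands in for a successfully opened gRPC conn
}

// dialInternalClient mirrors DialInternalClient: it never consults the breaker.
func (d *toyDialer) dialInternalClient() error {
	d.healthy = true // the connection succeeds; breaker state is untouched
	return nil
}

// dial mirrors Dialer.Dial: it checks the breaker before dialing.
func (d *toyDialer) dial() error {
	if !d.breaker.Ready() {
		return errors.New("breaker open")
	}
	return nil
}

func main() {
	// The node restarted, so its breaker tripped open.
	d := &toyDialer{breaker: &toyBreaker{tripped: true}}
	_ = d.dialInternalClient()              // DistSender path: succeeds anyway
	fmt.Println("conn healthy:", d.healthy) // health check reports OK
	fmt.Println("dial err:", d.dial())      // DistSQL path: breaker open
}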

@petermattis (Collaborator)

@tschottdorf I'm going to try the following diff:

--- a/pkg/rpc/nodedialer/nodedialer.go
+++ b/pkg/rpc/nodedialer/nodedialer.go
@@ -144,8 +144,10 @@ func (n *Dialer) DialInternalClient(
        if err != nil {
                return nil, nil, err
        }
-       // TODO(bdarnell): Reconcile the different health checks and circuit
-       // breaker behavior in this file
+       // TODO(bdarnell): Reconcile the different health checks and circuit breaker
+       // behavior in this file. Marking the breaker as healthy here is a stop-gap
+       // before that reconciliation.
+       n.getBreaker(nodeID).Success()
        if err := grpcutil.ConnectionReady(conn); err != nil {
                return nil, nil, err
        }

@petermattis (Collaborator)

Hmm, it took 85 runs, but the breaker open error happened again. Time for some more instrumentation.

@petermattis (Collaborator)

FYI, I'm trying to reproduce with some additional instrumentation which will make it clearer when this breaker open situation is occurring.

@petermattis (Collaborator) commented Oct 3, 2018

My additional instrumentation is showing that my theory above seems correct:

I181003 01:47:43.091994 393 rpc/nodedialer/nodedialer.go:154  [intExec=get-table-statistics,n2,txn=9c8dd3e8] n1: breaker open: but dialed anyways
I181003 01:47:43.092986 55 rpc/nodedialer/nodedialer.go:100  [intExec=get-table-statistics] unable to dial n1: breaker open

So one goroutine (the first log message) is using Dialer.DialInternalClient and constructing a gRPC connection while another goroutine (distsql in this instance, though not shown in the message) is using Dialer.Dial and failing to use the existing connection because the breaker is open. The distsql planning code attempts to not plan processors on down nodes by checking both node liveness and connection health. The connection health check does not examine the breaker. Distsql planning actually goes directly to the RPC context and avoids using the rpc/nodedialer, which feels like a bug in itself, though even if it used Dialer.ConnHealth the problem would occur because that method does not check the breaker either.

So now I'm confused as to why my diff above didn't fix the issue. Or perhaps it fixed one issue but there is another issue that has the same symptoms. I wonder if it would be better to fix this from the other end and have distsql not plan processors on nodes where the connection breaker is open.

@tschottdorf, @RaduBerinde Thoughts?

Oh, looking at the circuit breaker code just now, the problem might be that I'm calling breaker.Success when I should be calling breaker.Reset. The former only works if the breaker is in the halfOpen state, which occurs when a retry was requested.
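
A toy three-state breaker (hypothetical; not the actual library in use) shows why Success was effectively a no-op here:

package main

import "fmt"

type breakerState int

const (
	closed breakerState = iota
	open
	halfOpen
)

type toyBreaker struct{ state breakerState }

// Success only closes the breaker from halfOpen, i.e. after the breaker
// itself has requested a retry. Calling it on an open breaker does nothing.
func (b *toyBreaker) Success() {
	if b.state == halfOpen {
		b.state = closed
	}
}

// Reset unconditionally returns the breaker to closed.
func (b *toyBreaker) Reset() { b.state = closed }

func main() {
	b := &toyBreaker{state: open}
	b.Success()
	fmt.Println("after Success:", b.state == open) // true: still open
	b.Reset()
	fmt.Println("after Reset:", b.state == closed) // true: closed now
}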

@jordanlewis (Member)

To add a little additional context, this was discovered in #28659, resulting in #28704, but it hasn't been prioritized as DistSQL isn't resilient to node failure during flow execution either.

@tbg (Member) commented Oct 3, 2018

@petermattis it seems like the right thing for DistSQL to do is to check the breaker when determining node health, and to retry on breaker errors (i.e. not return them to clients), which can still occur if the breaker flips after DistSQL has decided that the node is healthy.

Even then, this test would still see errors that occur during DistSQL execution bubble up, because I don't think those are handled correctly (per Jordan's comment above).
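
In Go, that might look roughly like the following (a sketch; scheduleFlow, breakerReady, and runFlowOn are assumed names, not CockroachDB APIs):

package main

import (
	"errors"
	"fmt"
	"strings"
)

const (
	gatewayNodeID = 1
	maxAttempts   = 3
)

func isBreakerOpen(err error) bool {
	return err != nil && strings.Contains(err.Error(), "breaker open")
}

// scheduleFlow consults the breaker at planning time and retries breaker
// errors that slip through afterwards instead of returning them to the client.
func scheduleFlow(nodeID int, breakerReady func(int) bool, runFlowOn func(int) error) error {
	if !breakerReady(nodeID) {
		// Open breaker: treat the node as unhealthy, plan on the gateway.
		nodeID = gatewayNodeID
	}
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = runFlowOn(nodeID); !isBreakerOpen(err) {
			return err // success, or a non-breaker error that we do surface
		}
		// The breaker flipped after planning; retry rather than bubble up.
	}
	return fmt.Errorf("n%d still unreachable after %d attempts: %v", nodeID, maxAttempts, err)
}

func main() {
	allOpen := func(int) bool { return false } // every remote breaker is open
	run := func(n int) error {
		if n != gatewayNodeID {
			return errors.New("breaker open")
		}
		return nil
	}
	fmt.Println(scheduleFlow(2, allOpen, run)) // <nil>: replanned on the gateway
}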

I think what shoves all of these DistSQL problems under the rug is this (which the bank test also calls):

// IsSQLRetryableError returns true if err is retryable. This is true
// for errors that show a connection issue or an issue with the node
// itself. This can occur when a node is restarting or is unstable in
// some other way. Note that retryable errors may occur even in cases
// where the SQL execution ran to completion.
//
// TODO(bdarnell): Why are RPC errors in this list? These should
// generally be retried on the server side or transformed into
// ambiguous result errors ("connection reset/refused" are needed for
// the pgwire connection, but anything RPC-related should be handled
// within the cluster).
func IsSQLRetryableError(err error) bool {
	// Don't forget to update the corresponding test when making adjustments
	// here.
	return IsError(err, "(no inbound stream connection|connection reset by peer|connection refused|failed to send RPC|rpc error: code = Unavailable|EOF|result is ambiguous)")
}
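
The bank test's use of it presumably amounts to a retry loop along these lines (a sketch with assumed names, not the actual test code; assuming the usual testing, database/sql (as gosql), and testutils imports):

// runWithRetry sketches the retry pattern: keep retrying anything that
// IsSQLRetryableError accepts and only fail the test on other errors.
// The UPDATE statement and all names here are assumptions.
func runWithRetry(t *testing.T, db *gosql.DB, amount, id int) {
	for {
		_, err := db.Exec(
			`UPDATE bank.accounts SET balance = balance + $1 WHERE id = $2`,
			amount, id,
		)
		if err == nil {
			return
		}
		if !testutils.IsSQLRetryableError(err) {
			t.Fatal(err) // a real failure, not a restart blip
		}
		// Retryable (connection refused, breaker open, ...): try again.
		// This is what shoves the breaker errors under the rug.
	}
}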

@petermattis (Collaborator)

@tschottdorf Is that the right thing to do in the short term? The breaker is open incorrectly as another part of the system disregarded the breaker and successfully opened a connection. Seems to me that the breaker should be reset when that occurs. Medium to long term we should see about unifying the usage of breakers across all dial attempts, but I'm very nervous about making that change now.

I'm currently running with this diff, which is similar to the one above except s/Success/Reset/. So far I haven't seen the breaker open error across 250 runs.

--- a/pkg/rpc/nodedialer/nodedialer.go
+++ b/pkg/rpc/nodedialer/nodedialer.go
@@ -144,8 +144,10 @@ func (n *Dialer) DialInternalClient(
        if err != nil {
                return nil, nil, err
        }
-       // TODO(bdarnell): Reconcile the different health checks and circuit
-       // breaker behavior in this file
+       // TODO(bdarnell): Reconcile the different health checks and circuit breaker
+       // behavior in this file. Resetting the breaker here is a stop-gap
+       // before that reconciliation.
+       n.getBreaker(nodeID).Reset()
        if err := grpcutil.ConnectionReady(conn); err != nil {
                return nil, nil, err
        }

To add a little additional context, this was discovered in #28659, resulting in #28704, but it hasn't been prioritized as DistSQL isn't resilient to node failure during flow execution either.

@jordanlewis Making DistSQL resilient to node failure seems like something we should give more priority to. This feels like a decent-sized project.

@tbg (Member) commented Oct 3, 2018

This looks good (since Connect() above came back). I'm not sure when you would get that far but have ConnectionReady return false. Perhaps the breaker should only be reset once ConnectionReady also passes (but that's me FUDing).
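
Concretely, that would mean reordering the diff above so the reset happens only after the readiness check, roughly:

if err := grpcutil.ConnectionReady(conn); err != nil {
	return nil, nil, err
}
// Only reset the breaker once the connection is known to be ready.
n.getBreaker(nodeID).Reset()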

@jordanlewis (Member)

@jordanlewis Making DistSQL resilient to node failure seems like something we should give more priority to. This feels like a decent-sized project.

See #28703 for one idea, and #15637 for more details.

@petermattis (Collaborator)

Ugh, just got a breaker open failure. This one is coming from a different area of the distsql code:

I181003 14:52:50.756940 113 rpc/nodedialer/nodedialer.go:100  [n4,client=127.0.0.1:60164,user=root] unable to dial n2: breaker open
goroutine 113 [running]:
runtime/debug.Stack(0x6a96c80, 0xc4202e2640, 0x6581644)
        /Users/pmattis/Development/go1.10/src/runtime/debug/stack.go:24 +0xa7
github.com/cockroachdb/cockroach/pkg/rpc/nodedialer.(*Dialer).Dial(0xc4209dda40, 0x6ac5380, 0xc4289fbad0, 0xc400000002, 0x0, 0x0, 0x0)
        /Users/pmattis/Development/go/src/github.com/cockroachdb/cockroach/pkg/rpc/nodedialer/nodedialer.go:100 +0x366
github.com/cockroachdb/cockroach/pkg/sql.runnerRequest.run(0x6ac5380, 0xc4289fbad0, 0xc4209dda40, 0xc4283520e0, 0x2, 0xc4264cad80)
        /Users/pmattis/Development/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:65 +0x65
github.com/cockroachdb/cockroach/pkg/sql.(*DistSQLPlanner).initRunners.func1(0x6ac5380, 0xc4207c2390)
        /Users/pmattis/Development/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:92 +0x9a
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc4202e3c80, 0xc42058a750, 0xc4202e3c70)
        /Users/pmattis/Development/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:199 +0xe9
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
        /Users/pmattis/Development/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:192 +0xad

I think my patch above helped and this is another symptom of the problem. Investigating.

@petermattis (Collaborator)

Ok, got another repro. Here's the timeline:

21:14:58.357833 - n4 is killed and restarted
21:14:59.028001 - n4 acquires the lease for r24 (the bank.accounts range)
21:14:59.?????? - n3 is killed
21:14:59.606433 - n4 breaker to n3 is opened (due to ct-client failing to dial n3)
21:14:59.757865 - n3 is restarted
21:15:00.660740 - n4 reports unable to dial n3: breaker open from distsql_running.go

My suspicion as to what is happening is that distsql planning is seeing an open conn to n3 that has not been heartbeated. The lack of a heartbeat allows it to plan a processor on n3, but it subsequently discovers n3 is down when it goes to run the processor. This can, of course, happen during normal operation. But it's affecting the test because distsql allows planning on nodes which have not had a successful heartbeat:

		if err != nil && err != rpc.ErrNotHeartbeated {
			// This host is known to be unhealthy. Don't use it (use the gateway
			// instead). Note: this can never happen for our nodeID (which
			// always has its address in the nodeMap).
			log.VEventf(ctx, 1, "marking n%d as unhealthy for this plan: %v", nodeID, err)
			return err
		}

Looks like this code originated in 3e2446f. @RaduBerinde, do you remember why you included ErrNotHeartbeated here? Also, I'm still speculating above; I'm going to verify that this speculation is correct.
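
If ErrNotHeartbeated were instead treated as unhealthy, the quoted check would collapse to something like this (a sketch against the snippet above, not a committed change):

		if err != nil {
			// Any error, including rpc.ErrNotHeartbeated: don't plan on a
			// node whose connection hasn't had a successful heartbeat.
			log.VEventf(ctx, 1, "marking n%d as unhealthy for this plan: %v", nodeID, err)
			return err
		}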

@RaduBerinde (Member)

I don't think I had a good reason, I was probably being conservative.

@petermattis (Collaborator)

Ack. I added more instrumentation that was supposed to highlight this as the problem, but so far that instrumentation has not pointed a finger at this code. So I'm adding even more instrumentation to see which nodes distsql is planning on.

@petermattis (Collaborator)

Ok, my additional instrumentation is pointing to a problem with considering rpc.ErrNotHeartbeated as healthy:

I181004 16:18:12.312907 327 sql/distsql_physical_planner.go:621  [n2,client=127.0.0.1:65122,user=root] n1: checking node health
I181004 16:18:12.312944 327 sql/distsql_physical_planner.go:658  [n2,client=127.0.0.1:65122,user=root] n1: node health: not yet heartbeated
I181004 16:18:12.313086 89 rpc/nodedialer/nodedialer.go:100  [n2,client=127.0.0.1:65122,user=root] unable to dial n1: breaker open
goroutine 89 [running]:
runtime/debug.Stack(0x6a97bc0, 0xc4202b0f90, 0x6582416)
	/Users/pmattis/Development/go1.10/src/runtime/debug/stack.go:24 +0xa7
github.com/cockroachdb/cockroach/pkg/rpc/nodedialer.(*Dialer).Dial(0xc42088b350, 0x6ac62c0, 0xc425ae12f0, 0xc400000001, 0x0, 0x0, 0x0)
	/Users/pmattis/Development/go/src/github.com/cockroachdb/cockroach/pkg/rpc/nodedialer/nodedialer.go:100 +0x57c
github.com/cockroachdb/cockroach/pkg/sql.runnerRequest.run(0x6ac62c0, 0xc425ae12f0, 0xc42088b350, 0xc429824000, 0x1, 0xc427e489c0)
	/Users/pmattis/Development/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:65 +0x65
github.com/cockroachdb/cockroach/pkg/sql.(*DistSQLPlanner).initRunners.func1(0x6ac62c0, 0xc42053db90)
	/Users/pmattis/Development/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:92 +0x9a
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc4202fef20, 0xc420569200, 0xc4202fef10)
	/Users/pmattis/Development/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:199 +0xe9
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
	/Users/pmattis/Development/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:192 +0xad

An interesting bit about this failure is that the breaker was opened before distsql planning even began:

I181004 16:18:10.905143 720 rpc/nodedialer/nodedialer.go:112  [ct-client] n1: breaker opened

So perhaps @tschottdorf's suggestion to not plan on nodes with an open breaker is the right approach here. That's straightforward to implement.
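
A breaker-aware health check might look roughly like this (a sketch with assumed names and signature; the actual change landed in the PR referenced below):

// connHealthWithBreaker sketches folding the breaker into the node-health
// decision used by DistSQL planning. The method name and the connHealth
// callback are assumptions, not the committed API.
func (n *Dialer) connHealthWithBreaker(
	nodeID roachpb.NodeID, connHealth func(roachpb.NodeID) error,
) error {
	if !n.getBreaker(nodeID).Ready() {
		// An open breaker records recent dial failures: treat the node as
		// unhealthy so planning falls back to the gateway.
		return circuit.ErrBreakerOpen
	}
	// Otherwise defer to the existing connection-health check.
	return connHealth(nodeID)
}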

petermattis added a commit to petermattis/cockroach that referenced this issue Oct 4, 2018
`Dialer.DialInternalClient` does not check the circuit breaker but
blindly attempts a connection and can succeed, leaving the system in a
state where there is a healthy connection to a node, but the circuit
breaker used for dialing is open. DistSQL checks for connection health
when scheduling processors, but the connection health check does not
examine the breaker. So DistSQL will proceed to schedule a processor on
a node but then be unable to use the connection to that node because
`Dialer.Dial` will return with a `breaker open` error. The code contains
a TODO to reconcile the handling of circuit breakers in the various
`Dialer` methods, but changing the handling is risky in the short
term. As a stop-gap, we reset the breaker after a connection is
successfully opened.

Fixes cockroachdb#29149

Release note: None
craig bot pushed a commit that referenced this issue Oct 5, 2018
30987: sql,rpc/nodedialer: improve distsql node health checks r=tschottdorf a=petermattis

Improve distsql node health checks so that the presence of an open
circuit breaker is consider. Previously it was possible for distsql to
plan a processor on a node with an open circuit breaker which ensured an
"unable to dial" error when the plan was run.

Fixes #29149
Fixes #28704

Release note: None

Co-authored-by: Peter Mattis <petermattis@gmail.com>
craig bot closed this as completed in #30987 Oct 5, 2018
petermattis added a commit to petermattis/cockroach that referenced this issue Oct 5, 2018