distsql: panic 'flow already registered' #12876

RaduBerinde · 2017-01-12T14:32:49Z

Ran into the panic below the first time I tried to run a distsql query on a node. Did not run into it again after restart. Nothing useful in the logs, except a long spam of messages like the one below (prior to the query being run).

I170112 14:11:20.066619 5270719 vendor/google.golang.org/grpc/server.go:723  grpc: Server.processUnaryRPC failed to write status: stream error: code = 4 desc = "context deadline exceeded"

I looked through the code and the only explanation I can come up with is that the distsql planner's nodeID was not set correctly. Then we may end up sending a setup flow request to ourselves AND setting up a sync flow locally with the same id.

Investigate the nodeID issue and make this into an error not a panic.

panic: flow already registered [recovered]
	panic: SELECT COUNT(*) FROM comments;: flow already registered [recovered]
	panic: SELECT COUNT(*) FROM comments;: flow already registered

goroutine 5064283 [running]:
panic(0x1767a00, 0xc4f5d471e0)
	/home/radu/goroot/src/runtime/panic.go:500 +0x1a1
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).Recover(0xc4200b0870)
	/home/radu/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:185 +0x6e
panic(0x1767a00, 0xc4f5d471e0)
	/home/radu/goroot/src/runtime/panic.go:458 +0x243
github.com/cockroachdb/cockroach/pkg/sql.(*Executor).ExecuteStatements.func1(0xc4abcf31b7, 0x1e)
	/home/radu/go/src/github.com/cockroachdb/cockroach/pkg/sql/executor.go:455 +0x149
panic(0x1722600, 0xc4f5d471c0)
	/home/radu/goroot/src/runtime/panic.go:458 +0x243
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*flowRegistry).RegisterFlow(0xc4202285d0, 0xe94ba1002c773502, 0x5fd584bb82d464b0, 0xc4f5a11680, 0xc4c729cc30)
	/home/radu/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/flow_registry.go:130 +0x328
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*Flow).Start(0xc4f5a11680, 0x1af3fe0)
	/home/radu/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/flow.go:350 +0x1f7
github.com/cockroachdb/cockroach/pkg/sql.(*distSQLPlanner).PlanAndRun(0xc4203e6310, 0x7fd97b5218e8, 0xc48798a000, 0xc4ea2e4000, 0x249cfc0, 0xc4af580300, 0xc46fe48000, 0x0, 0xc4d6dc90c0)
	/home/radu/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsql_physical_planner.go:1474 +0x1366
github.com/cockroachdb/cockroach/pkg/sql.(*Executor).execDistSQL(0xc4201ba9a0, 0xc43c922d70, 0x249cfc0, 0xc4af580300, 0xc50af37ee8, 0x0, 0x0)
	/home/radu/go/src/github.com/cockroachdb/cockroach/pkg/sql/executor.go:1222 +0x113
github.com/cockroachdb/cockroach/pkg/sql.(*Executor).execStmt(0xc4201ba9a0, 0x2494da0, 0xc4dca0bfb0, 0xc43c922d70, 0xc50af30001, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/home/radu/go/src/github.com/cockroachdb/cockroach/pkg/sql/executor.go:1343 +0x49d
github.com/cockroachdb/cockroach/pkg/sql.(*Executor).execStmtInOpenTxn(0xc4201ba9a0, 0x2494da0, 0xc4dca0bfb0, 0xc43c922d70, 0x101, 0xc43c922c68, 0x0, 0x0, 0x0, 0x0, ...)
	/home/radu/go/src/github.com/cockroachdb/cockroach/pkg/sql/executor.go:1113 +0x2ab
github.com/cockroachdb/cockroach/pkg/sql.(*Executor).execStmtsInCurrentTxn(0xc4201ba9a0, 0xc4ed3f0c50, 0x1, 0x1, 0xc43c922d70, 0xc43c922c68, 0x101, 0x0, 0x6, 0xaf38b48, ...)
	/home/radu/go/src/github.com/cockroachdb/cockroach/pkg/sql/executor.go:850 +0x7db
github.com/cockroachdb/cockroach/pkg/sql.runTxnAttempt(0xc4201ba9a0, 0xc43c922d70, 0x1, 0xc43c922c68, 0xc4ed3f0c90, 0xc4ed3f0c50, 0x1, 0x1, 0x0, 0x0, ...)
	/home/radu/go/src/github.com/cockroachdb/cockroach/pkg/sql/executor.go:768 +0xf8
github.com/cockroachdb/cockroach/pkg/sql.(*Executor).execRequest.func2(0xc4ea2e4000, 0xc4ed3f0c90, 0x0, 0x0)
	/home/radu/go/src/github.com/cockroachdb/cockroach/pkg/sql/executor.go:621 +0x2b7
github.com/cockroachdb/cockroach/pkg/internal/client.(*Txn).Exec(0xc4ea2e4000, 0xc4ad550101, 0xc420018140, 0xc4ad555880, 0x0, 0x0)
	/home/radu/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/txn.go:520 +0x215
github.com/cockroachdb/cockroach/pkg/sql.(*Executor).execRequest(0xc4201ba9a0, 0xc43c922c00, 0xc4abcf31b7, 0x1e, 0x0, 0x0, 0x0, 0x0, 0xc4e6bad300)
	/home/radu/go/src/github.com/cockroachdb/cockroach/pkg/sql/executor.go:643 +0x5b0
github.com/cockroachdb/cockroach/pkg/sql.(*Executor).ExecuteStatements(0xc4201ba9a0, 0xc43c922c00, 0xc4abcf31b7, 0x1e, 0x0, 0x0, 0x0, 0x0, 0xc4abcf3100)
	/home/radu/go/src/github.com/cockroachdb/cockroach/pkg/sql/executor.go:461 +0x11f
github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*v3Conn).executeStatements(0xc49b401800, 0x7fd97b5218e8, 0xc4b0e53950, 0xc4abcf31b7, 0x1e, 0x0, 0x0, 0x0, 0x0, 0x701601, ...)
	/home/radu/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/v3.go:792 +0x94
github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*v3Conn).handleSimpleQuery(0xc49b401800, 0x7fd97b5218e8, 0xc4b0e53950, 0xc49b401828, 0x23, 0x0)
	/home/radu/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/v3.go:470 +0xb3
github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*v3Conn).serve(0xc49b401800, 0x7fd97b5218e8, 0xc4b0e53950, 0x5400, 0xc420250138, 0x7fd97b5218e8, 0xc4b0e53920, 0x0, 0x0)
	/home/radu/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/v3.go:397 +0xa62
github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*Server).ServeConn(0xc420250000, 0x7fd97b5218e8, 0xc4b0e53920, 0x249c280, 0xc4d66a0000, 0x0, 0x0)
	/home/radu/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/server.go:309 +0x999
github.com/cockroachdb/cockroach/pkg/server.(*Server).Start.func8.1(0x249c280, 0xc4d66a0000)
	/home/radu/go/src/github.com/cockroachdb/cockroach/pkg/server/server.go:522 +0x106
github.com/cockroachdb/cockroach/pkg/util/netutil.(*Server).ServeWith.func1(0xc4200b0870, 0xc420042030, 0x249c280, 0xc4d66a0000, 0xc4209c5780)
	/home/radu/go/src/github.com/cockroachdb/cockroach/pkg/util/netutil/net.go:136 +0x95
created by github.com/cockroachdb/cockroach/pkg/util/netutil.(*Server).ServeWith
	/home/radu/go/src/github.com/cockroachdb/cockroach/pkg/util/netutil/net.go:138 +0x277
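
The request above to turn the panic into an error could look roughly like the sketch below. This is a minimal illustration, not the actual `distsqlrun.flowRegistry`: the `registry` type, `FlowID`, and `Register` are hypothetical names, and the real registry tracks more state than a set of ids.

```go
package main

import (
	"fmt"
	"sync"
)

// FlowID is a stand-in for the real flow identifier type (hypothetical).
type FlowID string

// registry is a simplified, hypothetical flow registry; it is not the
// actual distsqlrun.flowRegistry.
type registry struct {
	mu    sync.Mutex
	flows map[FlowID]struct{}
}

func newRegistry() *registry {
	return &registry{flows: make(map[FlowID]struct{})}
}

// Register records the flow and returns an error, rather than panicking,
// if a flow with the same id is already present.
func (r *registry) Register(id FlowID) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, ok := r.flows[id]; ok {
		return fmt.Errorf("flow %s already registered", id)
	}
	r.flows[id] = struct{}{}
	return nil
}

func main() {
	r := newRegistry()
	fmt.Println(r.Register("f1")) // first registration succeeds
	fmt.Println(r.Register("f1")) // second registration reports an error
}
```

With this shape the caller decides how to surface the failure (as a query error for a local sync flow, or as an RPC error for a remote one) instead of taking down the node.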

RaduBerinde · 2017-02-24T18:01:19Z

Unassigning issues that someone should look into while I'm gone.

dianasaur323 · 2017-04-18T15:07:29Z

@RaduBerinde do you still want to take a look at this now that you are back?

andreimatei · 2017-08-23T18:41:34Z

https://sentry.io/cockroach-labs/cockroachdb/issues/333253424/

Two flows with the same id seem to be scheduled on the same node, which used to cause panics. How that can happen is currently unclear; cosmic rays? This patch turns the panic into an RPC or query error (depending on sync/async flow), adds some more sanity checks, and adds a representation of the flow to the error. The error also invites users to report on cockroachdb#12876. Touches cockroachdb#12876.

RaduBerinde · 2017-08-28T15:23:54Z

This is now converted to an error. Moving to 1.2 since we don't have much to go on in terms of reproducing the condition.

dt · 2017-09-07T15:41:48Z

hello!

on sapphire-1, right now:

root@localhost:26257/> IMPORT TABLE lineitem (l_orderkey INTEGER NOT NULL, l_partkey INTEGER NOT NULL, l_suppkey INTEGER NOT NULL, l_linenumber INTEGER NOT NULL, l_quantity DECIMAL(15,2) NOT NULL, l_extendedprice DECIMAL(15,2) NOT NULL, l_discount DECIMAL(15,2) NOT NULL, l_tax DECIMAL(15,2) NOT NULL, l_returnflag CHAR(1) NOT NULL, l_linestatus CHAR(1) NOT NULL, l_shipdate DATE NOT NULL, l_commitdate DATE NOT NULL, l_receiptdate DATE NOT NULL, l_shipinstruct CHAR(25) NOT NULL, l_shipmode CHAR(10) NOT NULL, l_comment VARCHAR(44) NOT NULL, PRIMARY KEY (l_orderkey, l_linenumber), INDEX l_ok (l_orderkey ASC), INDEX l_pk (l_partkey ASC), INDEX l_sk (l_suppkey ASC), INDEX l_sd (l_shipdate ASC), INDEX l_cd (l_commitdate ASC), INDEX l_rd (l_receiptdate ASC), INDEX l_pk_sk (l_partkey ASC, l_suppkey ASC), INDEX l_sk_pk (l_suppkey ASC, l_partkey ASC)) CSV DATA ('http://192.168.1.4:2015/csv/sf-5/lineitem.tbl.1', 'http://192.168.1.5:2015/csv/sf-5/lineitem.tbl.2', 'http://192.168.1.6:2015/csv/sf-5/lineitem.tbl.3', 'http://192.168.1.7:2015/csv/sf-5/lineitem.tbl.4', 'http://192.168.1.4:2015/csv/sf-5/lineitem.tbl.5', 'http://192.168.1.5:2015/csv/sf-5/lineitem.tbl.6', 'http://192.168.1.6:2015/csv/sf-5/lineitem.tbl.7', 'http://192.168.1.7:2015/csv/sf-5/lineitem.tbl.8') WITH temp = 'http://192.168.1.4:2015,192.168.1.5:2015,192.168.1.6:2015,192.168.1.7:2015/import-temp', delimiter = '|', distributed, transform_only;
pq: unexpected error: different nodes with the same address: [3 2] and %!d(MISSING) (we've been trying to track this particular issue down; please report your reproduction at https://github.com/cockroachdb/cockroach/issues/12876)

We were incorrectly passing all the args as a single slice arg. Also improving the error message for cockroachdb#12876.
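
The `[3 2] and %!d(MISSING)` in the error above is what Go's `fmt` produces when a slice of arguments is forwarded as one value instead of being expanded with `...`. The sketch below reproduces that behaviour in isolation; `buggyMessage` and `fixedMessage` are hypothetical helpers written for illustration, not the actual `UnexpectedWithIssueErrorf` code, and the format string is an assumed reconstruction from the error text.

```go
package main

import "fmt"

// buggyMessage passes the whole slice as a single argument, so the first
// verb formats the slice and the second verb has nothing left to consume.
// (Hypothetical helper, not the real util function.)
func buggyMessage(format string, args []interface{}) string {
	return fmt.Sprintf(format, args)
}

// fixedMessage expands the slice with "..." so each verb gets its own argument.
func fixedMessage(format string, args []interface{}) string {
	return fmt.Sprintf(format, args...)
}

func main() {
	// Assumed format string, reconstructed from the error message in this thread.
	const format = "different nodes with the same address: %d and %d"
	args := []interface{}{3, 2}
	fmt.Println(buggyMessage(format, args))
	fmt.Println(fixedMessage(format, args))
}
```

The two helpers take a plain slice rather than a variadic parameter only to keep the contrast explicit; in a variadic wrapper the same mistake is writing `args` where `args...` was intended.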

RaduBerinde · 2017-09-07T15:58:47Z

n2 and n3 indeed look like they have the same address (though n2 doesn't look live in gossip). Possibly some misconfiguration? We should either detect this and fail earlier or handle it gracefully in distsql.
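
Detecting the condition early, as suggested above, amounts to inverting the nodeID -> address map and looking for addresses claimed by more than one node. The following is a minimal sketch under that assumption; `NodeID` and `findSharedAddresses` are hypothetical names, and the addresses in `main` are made up for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// NodeID is a stand-in for the real node identifier type (hypothetical).
type NodeID int

// findSharedAddresses returns, for every address claimed by more than one
// node, the sorted list of node ids claiming it.
func findSharedAddresses(nodeAddresses map[NodeID]string) map[string][]NodeID {
	byAddr := make(map[string][]NodeID)
	for id, addr := range nodeAddresses {
		byAddr[addr] = append(byAddr[addr], id)
	}
	for addr, ids := range byAddr {
		if len(ids) < 2 {
			// Only one node uses this address; nothing to report.
			delete(byAddr, addr)
			continue
		}
		sort.Slice(ids, func(i, j int) bool { return ids[i] < ids[j] })
	}
	return byAddr
}

func main() {
	// Illustrative addresses only; n2 and n3 share one, as in the report above.
	addrs := map[NodeID]string{1: "10.0.0.1:26257", 2: "10.0.0.2:26257", 3: "10.0.0.2:26257"}
	for addr, ids := range findSharedAddresses(addrs) {
		fmt.Printf("different nodes with the same address %s: %v\n", addr, ids)
	}
}
```

A planner could run a check like this before scheduling flows and either fail with a clear message or drop the stale node from the plan.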

a-robinson · 2018-04-27T18:15:18Z

Looks like this problem is CSV-specific. The normal DistSQL code path nicely health-checks and version-checks nodes before planning on them:

cockroach/pkg/sql/distsql_physical_planner.go

Lines 896 to 906 in 1700c40

addr := replInfo.NodeDesc.Address.String()
if err := dsp.checkNodeHealth(planCtx.ctx, nodeID, addr); err != nil {
	log.Eventf(planCtx.ctx, "not planning on node %d. unhealthy", nodeID)
	return dsp.nodeDesc.NodeID, nil
}
if !dsp.nodeVersionIsCompatible(nodeID, dsp.planVersion) {
	log.Eventf(planCtx.ctx, "not planning on node %d. incompatible version", nodeID)
	return dsp.nodeDesc.NodeID, nil
}
planCtx.nodeAddresses[nodeID] = addr

The CSV code, on the other hand, uses a different code path that just grabs all nodes that have ever existed, including decommissioned nodes, seemingly without doing any checks on them:

cockroach/pkg/ccl/importccl/csv.go

Lines 927 to 935 in 1700c40

// TODO(dan): Filter out unhealthy nodes.
resp, err := p.ExecCfg().StatusServer.Nodes(ctx, &serverpb.NodesRequest{})
if err != nil {
	return err
}
var nodes []roachpb.NodeDescriptor
for _, node := range resp.Nodes {
	nodes = append(nodes, node.Desc)
}

cockroach/pkg/sql/distsql_plan_csv.go

Lines 215 to 219 in 1700c40

// Because we're not going through the normal pathways, we have to set up
// the nodeID -> nodeAddress map ourselves.
for _, node := range nodes {
	planCtx.nodeAddresses[node.NodeID] = node.Address.String()
}

Passing back to @mjibson since this doesn't appear to be a gossip-related issue and I'm not super familiar with the code paths involved.

andreimatei · 2018-04-27T18:17:21Z

Ah, missed the part where both repros were for imports. Good catch.

25154: opt: add WITH ORDINALITY via a RowNumber operator r=justinj a=justinj

WITH ORDINALITY has the behaviour of a special case of the window function RANK with the empty set of partitioning columns. This commit leaves most of window functions unimplemented, only implementing enough to do WITH ORDINALITY.

WITH ORDINALITY introduces a new column `ordinality` to the input set, whose value corresponds to a given row's position in that input set. The semantics of the ordering are briefly discussed at https://www.cockroachlabs.com/docs/stable/query-order.html#order-preservation.

This commit introduces window functions in a very limited form - only as required for WITH ORDINALITY, so no partitioning or window functions besides ROW_NUMBER() are supported. In addition, a given window function can only have a single windowing operator.

Release note: None

25162: importccl: check node health and compatibility during IMPORT planning r=mjibson a=mjibson

Simplify the LoadCSV signature by taking just a PlanHookState for any argument that can be fetched from it. Determine the node list using this new health check function. We can remove the rand.Shuffle call because the map iteration should produce some level of randomness.

Fixes #12876

Release note (bug fix): Fix problems with imports sometimes failing after node decommissioning.

25226: opt: Hoist EXISTS and ANY operators r=andy-kimball a=andy-kimball

This PR contains several commits that flatten EXISTS and ANY subqueries by hoisting them up and joining them with the higher level relational query. Hoisting and flattening subqueries into a single, simple tree of relational operators is the preparatory step to decorrelation. Next will come rules which will try to eliminate correlation in the flattened tree.

Co-authored-by: Justin Jaffray <justin@cockroachlabs.com>
Co-authored-by: Matt Jibson <matt.jibson@gmail.com>
Co-authored-by: Andrew Kimball <andyk@cockroachlabs.com>

25307: release-2.0: importccl: check node health and compatibility during IMPORT planning r=mjibson a=mjibson

Backport 1/1 commits from #25162.

/cc @cockroachdb/release

---

Simplify the LoadCSV signature by taking just a PlanHookState for any argument that can be fetched from it. Determine the node list using this new health check function. We can remove the rand.Shuffle call because the map iteration should produce some level of randomness.

Fixes #12876

Release note (bug fix): Fix problems with imports sometimes failing after node decommissioning.

Co-authored-by: Matt Jibson <matt.jibson@gmail.com>

SJAnderson · 2018-05-19T00:14:46Z

@andreimatei just got this. it happened after i rebooted my server and my PD SSD didn't attach itself before my daemon started cockroach again.

30563: sql: remove an UnexpectedWithIssueErrorf r=andreimatei a=andreimatei Since the respective issue has been closed. Touches #12876 Release note: None Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>

RaduBerinde self-assigned this Jan 12, 2017

cuongdo added this to the Q1 2017 milestone Feb 22, 2017

RaduBerinde removed their assignment Feb 24, 2017

RaduBerinde added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Feb 24, 2017

RaduBerinde self-assigned this Apr 18, 2017

dianasaur323 modified the milestones: 1.0, Q1 2017 Apr 19, 2017

cuongdo modified the milestones: 1.1, 1.0 Apr 25, 2017

andreimatei mentioned this issue Aug 23, 2017

distsqlrun: turn "flow already registered" into an error #17876

Merged

RaduBerinde modified the milestones: 1.2, 1.1 Aug 28, 2017

RaduBerinde added a commit to RaduBerinde/cockroach that referenced this issue Sep 7, 2017

util: fix UnexpectedWithIssueErrorf

0eec20d

We were incorrectly passing all the args as a single slice arg. Also improving the error message for cockroachdb#12876.

RaduBerinde mentioned this issue Sep 7, 2017

util: fix UnexpectedWithIssueErrorf #18325

Merged

RaduBerinde added a commit to RaduBerinde/cockroach that referenced this issue Sep 7, 2017

util: fix UnexpectedWithIssueErrorf

9946d6a

We were incorrectly passing all the args as a single slice arg. Also improving the error message for cockroachdb#12876.

RaduBerinde added a commit to RaduBerinde/cockroach that referenced this issue Sep 7, 2017

util: fix UnexpectedWithIssueErrorf

c894da6

We were incorrectly passing all the args as a single slice arg. Also improving the error message for cockroachdb#12876.

RaduBerinde mentioned this issue Sep 7, 2017

cherrypick-1.1: util: fix UnexpectedWithIssueErrorf #18328

Merged

andreimatei removed this from the 1.2 milestone Sep 19, 2017

a-robinson assigned maddyblue and unassigned a-robinson Apr 27, 2018

a-robinson added the A-disaster-recovery label Apr 27, 2018

maddyblue modified the milestones: 1.1, 2.1 Apr 30, 2018

maddyblue mentioned this issue Apr 30, 2018

importccl: check node health and compatibility during IMPORT planning #25162

Merged

craig bot closed this as completed in #25162 May 3, 2018

maddyblue mentioned this issue May 4, 2018

release-2.0: importccl: check node health and compatibility during IMPORT planning #25307

Merged

andreimatei added a commit to andreimatei/cockroach that referenced this issue Sep 24, 2018

sql: remove an UnexpectedWithIssueErrorf

bad7b57

Since the respective issue has been closed. Touches cockroachdb#12876 Release note: None

andreimatei mentioned this issue Sep 24, 2018

sql: remove an UnexpectedWithIssueErrorf #30563

Merged

andreimatei added a commit to andreimatei/cockroach that referenced this issue Sep 24, 2018

sql: remove an UnexpectedWithIssueErrorf

0b1954f

Since the respective issue has been closed. Touches cockroachdb#12876 Release note: None

andreimatei mentioned this issue Sep 24, 2018

release-2.1: sql: remove an UnexpectedWithIssueErrorf #30609

Closed

andreimatei added a commit to andreimatei/cockroach that referenced this issue Sep 25, 2018

sql: remove an UnexpectedWithIssueErrorf

2a3a88e

Since the respective issue has been closed. Touches cockroachdb#12876 Release note: None

andreimatei added a commit to andreimatei/cockroach that referenced this issue Sep 25, 2018

sql: remove an UnexpectedWithIssueErrorf

9dcae4a

Since the respective issue has been closed. Touches cockroachdb#12876 Release note: None
