distsql: panic 'flow already registered' #12876

Closed
RaduBerinde opened this issue Jan 12, 2017 · 11 comments · Fixed by #25162
Labels
A-disaster-recovery A-sql-execution Relating to SQL execution. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting

@RaduBerinde
Member

Ran into the panic below the first time I tried to run a distsql query on a node. Did not run into it again after restart. Nothing useful in the logs, except a long spam of messages like the one below (prior to the query being run).

I170112 14:11:20.066619 5270719 vendor/google.golang.org/grpc/server.go:723  grpc: Server.processUnaryRPC failed to write status: stream error: code = 4 desc = "context deadline exceeded"

I looked through the code, and the only explanation I can come up with is that the distsql planner's nodeID was not set correctly. We could then end up sending a setup flow request to ourselves AND setting up a sync flow locally with the same id.

Investigate the nodeID issue and turn this into an error rather than a panic.

panic: flow already registered [recovered]
	panic: SELECT COUNT(*) FROM comments;: flow already registered [recovered]
	panic: SELECT COUNT(*) FROM comments;: flow already registered

goroutine 5064283 [running]:
panic(0x1767a00, 0xc4f5d471e0)
	/home/radu/goroot/src/runtime/panic.go:500 +0x1a1
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).Recover(0xc4200b0870)
	/home/radu/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:185 +0x6e
panic(0x1767a00, 0xc4f5d471e0)
	/home/radu/goroot/src/runtime/panic.go:458 +0x243
github.com/cockroachdb/cockroach/pkg/sql.(*Executor).ExecuteStatements.func1(0xc4abcf31b7, 0x1e)
	/home/radu/go/src/github.com/cockroachdb/cockroach/pkg/sql/executor.go:455 +0x149
panic(0x1722600, 0xc4f5d471c0)
	/home/radu/goroot/src/runtime/panic.go:458 +0x243
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*flowRegistry).RegisterFlow(0xc4202285d0, 0xe94ba1002c773502, 0x5fd584bb82d464b0, 0xc4f5a11680, 0xc4c729cc30)
	/home/radu/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/flow_registry.go:130 +0x328
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*Flow).Start(0xc4f5a11680, 0x1af3fe0)
	/home/radu/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/flow.go:350 +0x1f7
github.com/cockroachdb/cockroach/pkg/sql.(*distSQLPlanner).PlanAndRun(0xc4203e6310, 0x7fd97b5218e8, 0xc48798a000, 0xc4ea2e4000, 0x249cfc0, 0xc4af580300, 0xc46fe48000, 0x0, 0xc4d6dc90c0)
	/home/radu/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsql_physical_planner.go:1474 +0x1366
github.com/cockroachdb/cockroach/pkg/sql.(*Executor).execDistSQL(0xc4201ba9a0, 0xc43c922d70, 0x249cfc0, 0xc4af580300, 0xc50af37ee8, 0x0, 0x0)
	/home/radu/go/src/github.com/cockroachdb/cockroach/pkg/sql/executor.go:1222 +0x113
github.com/cockroachdb/cockroach/pkg/sql.(*Executor).execStmt(0xc4201ba9a0, 0x2494da0, 0xc4dca0bfb0, 0xc43c922d70, 0xc50af30001, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/home/radu/go/src/github.com/cockroachdb/cockroach/pkg/sql/executor.go:1343 +0x49d
github.com/cockroachdb/cockroach/pkg/sql.(*Executor).execStmtInOpenTxn(0xc4201ba9a0, 0x2494da0, 0xc4dca0bfb0, 0xc43c922d70, 0x101, 0xc43c922c68, 0x0, 0x0, 0x0, 0x0, ...)
	/home/radu/go/src/github.com/cockroachdb/cockroach/pkg/sql/executor.go:1113 +0x2ab
github.com/cockroachdb/cockroach/pkg/sql.(*Executor).execStmtsInCurrentTxn(0xc4201ba9a0, 0xc4ed3f0c50, 0x1, 0x1, 0xc43c922d70, 0xc43c922c68, 0x101, 0x0, 0x6, 0xaf38b48, ...)
	/home/radu/go/src/github.com/cockroachdb/cockroach/pkg/sql/executor.go:850 +0x7db
github.com/cockroachdb/cockroach/pkg/sql.runTxnAttempt(0xc4201ba9a0, 0xc43c922d70, 0x1, 0xc43c922c68, 0xc4ed3f0c90, 0xc4ed3f0c50, 0x1, 0x1, 0x0, 0x0, ...)
	/home/radu/go/src/github.com/cockroachdb/cockroach/pkg/sql/executor.go:768 +0xf8
github.com/cockroachdb/cockroach/pkg/sql.(*Executor).execRequest.func2(0xc4ea2e4000, 0xc4ed3f0c90, 0x0, 0x0)
	/home/radu/go/src/github.com/cockroachdb/cockroach/pkg/sql/executor.go:621 +0x2b7
github.com/cockroachdb/cockroach/pkg/internal/client.(*Txn).Exec(0xc4ea2e4000, 0xc4ad550101, 0xc420018140, 0xc4ad555880, 0x0, 0x0)
	/home/radu/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/txn.go:520 +0x215
github.com/cockroachdb/cockroach/pkg/sql.(*Executor).execRequest(0xc4201ba9a0, 0xc43c922c00, 0xc4abcf31b7, 0x1e, 0x0, 0x0, 0x0, 0x0, 0xc4e6bad300)
	/home/radu/go/src/github.com/cockroachdb/cockroach/pkg/sql/executor.go:643 +0x5b0
github.com/cockroachdb/cockroach/pkg/sql.(*Executor).ExecuteStatements(0xc4201ba9a0, 0xc43c922c00, 0xc4abcf31b7, 0x1e, 0x0, 0x0, 0x0, 0x0, 0xc4abcf3100)
	/home/radu/go/src/github.com/cockroachdb/cockroach/pkg/sql/executor.go:461 +0x11f
github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*v3Conn).executeStatements(0xc49b401800, 0x7fd97b5218e8, 0xc4b0e53950, 0xc4abcf31b7, 0x1e, 0x0, 0x0, 0x0, 0x0, 0x701601, ...)
	/home/radu/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/v3.go:792 +0x94
github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*v3Conn).handleSimpleQuery(0xc49b401800, 0x7fd97b5218e8, 0xc4b0e53950, 0xc49b401828, 0x23, 0x0)
	/home/radu/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/v3.go:470 +0xb3
github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*v3Conn).serve(0xc49b401800, 0x7fd97b5218e8, 0xc4b0e53950, 0x5400, 0xc420250138, 0x7fd97b5218e8, 0xc4b0e53920, 0x0, 0x0)
	/home/radu/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/v3.go:397 +0xa62
github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*Server).ServeConn(0xc420250000, 0x7fd97b5218e8, 0xc4b0e53920, 0x249c280, 0xc4d66a0000, 0x0, 0x0)
	/home/radu/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/server.go:309 +0x999
github.com/cockroachdb/cockroach/pkg/server.(*Server).Start.func8.1(0x249c280, 0xc4d66a0000)
	/home/radu/go/src/github.com/cockroachdb/cockroach/pkg/server/server.go:522 +0x106
github.com/cockroachdb/cockroach/pkg/util/netutil.(*Server).ServeWith.func1(0xc4200b0870, 0xc420042030, 0x249c280, 0xc4d66a0000, 0xc4209c5780)
	/home/radu/go/src/github.com/cockroachdb/cockroach/pkg/util/netutil/net.go:136 +0x95
created by github.com/cockroachdb/cockroach/pkg/util/netutil.(*Server).ServeWith
	/home/radu/go/src/github.com/cockroachdb/cockroach/pkg/util/netutil/net.go:138 +0x277
@RaduBerinde RaduBerinde self-assigned this Jan 12, 2017
@cuongdo cuongdo added this to the Q1 2017 milestone Feb 22, 2017
@RaduBerinde RaduBerinde removed their assignment Feb 24, 2017
@RaduBerinde
Member Author

Unassigning issues that someone should look into while I'm gone.

@RaduBerinde RaduBerinde added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Feb 24, 2017
@dianasaur323
Contributor

@RaduBerinde do you still want to take a look at this now that you are back?

@RaduBerinde RaduBerinde self-assigned this Apr 18, 2017
@dianasaur323 dianasaur323 modified the milestones: 1.0, Q1 2017 Apr 19, 2017
@cuongdo cuongdo modified the milestones: 1.1, 1.0 Apr 25, 2017
@andreimatei
Contributor

andreimatei added a commit to andreimatei/cockroach that referenced this issue Aug 23, 2017
Two flows with the same id seem to be scheduled on the same node which
used to cause panics. How that can happen is currently unclear; cosmic
rays? This patch turns the panic into an RPC or query error (depending
on sync/async flow), adds some more sanity checks and adds a
representation of the flow to the error and the error will also invite
users to report on cockroachdb#12876.

Touches cockroachdb#12876.
@RaduBerinde
Member Author

This is now converted to an error. Moving to 1.2 since we don't have much to go on in terms of reproducing the condition.

@RaduBerinde RaduBerinde modified the milestones: 1.2, 1.1 Aug 28, 2017
@dt
Member

dt commented Sep 7, 2017

hello!

on sapphire-1, right now:

root@localhost:26257/> IMPORT TABLE lineitem (l_orderkey INTEGER NOT NULL, l_partkey INTEGER NOT NULL, l_suppkey INTEGER NOT NULL, l_linenumber INTEGER NOT NULL, l_quantity DECIMAL(15,2) NOT NULL, l_extendedprice DECIMAL(15,2) NOT NULL, l_discount DECIMAL(15,2) NOT NULL, l_tax DECIMAL(15,2) NOT NULL, l_returnflag CHAR(1) NOT NULL, l_linestatus CHAR(1) NOT NULL, l_shipdate DATE NOT NULL, l_commitdate DATE NOT NULL, l_receiptdate DATE NOT NULL, l_shipinstruct CHAR(25) NOT NULL, l_shipmode CHAR(10) NOT NULL, l_comment VARCHAR(44) NOT NULL, PRIMARY KEY (l_orderkey, l_linenumber), INDEX l_ok (l_orderkey ASC), INDEX l_pk (l_partkey ASC), INDEX l_sk (l_suppkey ASC), INDEX l_sd (l_shipdate ASC), INDEX l_cd (l_commitdate ASC), INDEX l_rd (l_receiptdate ASC), INDEX l_pk_sk (l_partkey ASC, l_suppkey ASC), INDEX l_sk_pk (l_suppkey ASC, l_partkey ASC)) CSV DATA ('http://192.168.1.4:2015/csv/sf-5/lineitem.tbl.1', 'http://192.168.1.5:2015/csv/sf-5/lineitem.tbl.2', 'http://192.168.1.6:2015/csv/sf-5/lineitem.tbl.3', 'http://192.168.1.7:2015/csv/sf-5/lineitem.tbl.4', 'http://192.168.1.4:2015/csv/sf-5/lineitem.tbl.5', 'http://192.168.1.5:2015/csv/sf-5/lineitem.tbl.6', 'http://192.168.1.6:2015/csv/sf-5/lineitem.tbl.7', 'http://192.168.1.7:2015/csv/sf-5/lineitem.tbl.8') WITH temp = 'http://192.168.1.4:2015,192.168.1.5:2015,192.168.1.6:2015,192.168.1.7:2015/import-temp', delimiter = '|', distributed, transform_only;
pq: unexpected error: different nodes with the same address: [3 2] and %!d(MISSING) (we've been trying to track this particular issue down; please report your reproduction at https://github.com/cockroachdb/cockroach/issues/12876)

RaduBerinde added a commit to RaduBerinde/cockroach that referenced this issue Sep 7, 2017
We were incorrectly passing all the args as a single slice arg.

Also improving the error message for cockroachdb#12876.
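
The "single slice arg" mistake described in this commit message is easy to reproduce with the standard `fmt` package. The sketch below is not the CockroachDB code; `formatBuggy` and `formatFixed` are hypothetical helpers. It shows how passing a slice as one operand produces the `[3 2] and %!d(MISSING)` text seen in the error reported above, and how expanding the slice with `...` fixes it.

```go
package main

import "fmt"

// formatBuggy passes the whole slice as ONE operand. The first verb formats
// the entire slice and the second verb has nothing left to consume.
func formatBuggy(format string, args []interface{}) string {
	return fmt.Sprintf(format, args)
}

// formatFixed expands the slice with "..." so each verb gets its own operand.
func formatFixed(format string, args []interface{}) string {
	return fmt.Sprintf(format, args...)
}

func main() {
	format := "different nodes with the same address: %d and %d"
	args := []interface{}{3, 2}
	fmt.Println(formatBuggy(format, args))
	// different nodes with the same address: [3 2] and %!d(MISSING)
	fmt.Println(formatFixed(format, args))
	// different nodes with the same address: 3 and 2
}
```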
@RaduBerinde
Member Author

image

n2 and n3 indeed look like they have the same address (though n2 doesn't look live in gossip). Possibly some misconfiguration? We should either detect this and fail earlier or handle it gracefully in distsql.

@andreimatei andreimatei removed this from the 1.2 milestone Sep 19, 2017
@a-robinson
Contributor

Looks like this problem is CSV-specific. The normal DistSQL code path nicely health-checks and version-checks nodes before planning on them:

addr := replInfo.NodeDesc.Address.String()
if err := dsp.checkNodeHealth(planCtx.ctx, nodeID, addr); err != nil {
	log.Eventf(planCtx.ctx, "not planning on node %d. unhealthy", nodeID)
	return dsp.nodeDesc.NodeID, nil
}
if !dsp.nodeVersionIsCompatible(nodeID, dsp.planVersion) {
	log.Eventf(planCtx.ctx, "not planning on node %d. incompatible version", nodeID)
	return dsp.nodeDesc.NodeID, nil
}
planCtx.nodeAddresses[nodeID] = addr

The CSV code, on the other hand, uses a different code path that just grabs all nodes that have ever existed, including decommissioned nodes, seemingly without doing any checks on them:

// TODO(dan): Filter out unhealthy nodes.
resp, err := p.ExecCfg().StatusServer.Nodes(ctx, &serverpb.NodesRequest{})
if err != nil {
	return err
}
var nodes []roachpb.NodeDescriptor
for _, node := range resp.Nodes {
	nodes = append(nodes, node.Desc)
}

// Because we're not going through the normal pathways, we have to set up
// the nodeID -> nodeAddress map ourselves.
for _, node := range nodes {
	planCtx.nodeAddresses[node.NodeID] = node.Address.String()
}

Passing back to @mjibson since this doesn't appear to be a gossip-related issue and I'm not super familiar with the code paths involved.

@andreimatei
Contributor

Ah, missed the part where both repros were for imports. Good catch.

@maddyblue maddyblue modified the milestones: 1.1, 2.1 Apr 30, 2018
craig bot pushed a commit that referenced this issue May 3, 2018
25154: opt: add WITH ORDINALITY via a RowNumber operator r=justinj a=justinj

WITH ORDINALITY has the behaviour of a special case of the window
function RANK with the empty set of partitioning columns. This commit
leaves most of window functions unimplemented, only implementing enough
to do WITH ORDINALITY.

WITH ORDINALITY introduces a new column `ordinality` to the input set,
whose value corresponds to a given row's position in that input set.
The semantics of the ordering are briefly discussed at
https://www.cockroachlabs.com/docs/stable/query-order.html#order-preservation.

This commit introduces window functions in a very limited form - only as
required for WITH ORDINALITY, so no partitioning or window functions
besides ROW_NUMBER() are supported. In addition, a given window function
can only have a single windowing operator.

Release note: None

25162: importccl: check node health and compatibility during IMPORT planning r=mjibson a=mjibson

Simplify the LoadCSV signature by taking just a PlanHookState for any
argument that can be fetched from it. Determine the node list using
this new health check function. We can remove the rand.Shuffle call
because the map iteration should produce some level of randomness.

Fixes #12876

Release note (bug fix): Fix problems with imports sometimes failing
after node decommissioning.

25226: opt: Hoist EXISTS and ANY operators r=andy-kimball a=andy-kimball

This PR contains several commits that flatten EXISTS and ANY subqueries by hoisting them up and joining them with the higher level relational query. Hoisting and flattening subqueries into a single, simple tree of relational operators is the preparatory step to decorrelation. Next will come rules which will try to eliminate correlation in the flattened tree.


Co-authored-by: Justin Jaffray <justin@cockroachlabs.com>
Co-authored-by: Matt Jibson <matt.jibson@gmail.com>
Co-authored-by: Andrew Kimball <andyk@cockroachlabs.com>
@craig craig bot closed this as completed in #25162 May 3, 2018
craig bot pushed a commit that referenced this issue May 14, 2018
25307: release-2.0: importccl: check node health and compatibility during IMPORT planning r=mjibson a=mjibson

Backport 1/1 commits from #25162.

/cc @cockroachdb/release

---

Simplify the LoadCSV signature by taking just a PlanHookState for any
argument that can be fetched from it. Determine the node list using
this new health check function. We can remove the rand.Shuffle call
because the map iteration should produce some level of randomness.

Fixes #12876

Release note (bug fix): Fix problems with imports sometimes failing
after node decommissioning.


Co-authored-by: Matt Jibson <matt.jibson@gmail.com>
@SJAnderson

@andreimatei I just got this. It happened after I rebooted my server and my PD SSD didn't attach itself before my daemon started cockroach again.

andreimatei added a commit to andreimatei/cockroach that referenced this issue Sep 24, 2018
Since the respective issue has been closed.
Touches cockroachdb#12876

Release note: None
craig bot pushed a commit that referenced this issue Sep 25, 2018
30563: sql: remove an UnexpectedWithIssueErrorf r=andreimatei a=andreimatei

Since the respective issue has been closed.
Touches #12876

Release note: None

Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
9 participants