distsql: panic 'flow already registered' #12876
Unassigning issues that someone should look into while I'm gone.
@RaduBerinde do you still want to take a look at this now that you are back?
Two flows with the same ID seem to be scheduled on the same node, which used to cause panics. How that can happen is currently unclear; cosmic rays? This patch turns the panic into an RPC or query error (depending on sync/async flow), adds some more sanity checks, and adds a representation of the flow to the error; the error also invites users to report on cockroachdb#12876. Touches cockroachdb#12876.
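The patch described above can be sketched roughly as follows. This is a minimal, hypothetical illustration (the names `flowRegistry` and `RegisterFlow` here are stand-ins, not the actual CockroachDB types): a per-node registry that rejects a duplicate flow ID with an error carrying a representation of the existing flow, instead of panicking.

```go
package main

import (
	"fmt"
	"sync"
)

// flowID stands in for the real distsql flow identifier.
type flowID string

// flowRegistry is a hypothetical sketch of a per-node registry that
// rejects duplicate flow IDs with an error instead of panicking.
type flowRegistry struct {
	mu    sync.Mutex
	flows map[flowID]string // id -> textual representation of the flow
}

func newFlowRegistry() *flowRegistry {
	return &flowRegistry{flows: make(map[flowID]string)}
}

// RegisterFlow returns an error (to be surfaced as an RPC or query
// error, depending on sync/async flow) when the ID is already present,
// including a representation of the existing flow to aid debugging.
func (r *flowRegistry) RegisterFlow(id flowID, desc string) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if existing, ok := r.flows[id]; ok {
		return fmt.Errorf(
			"flow %s already registered (existing flow: %s); please report at cockroachdb#12876",
			id, existing)
	}
	r.flows[id] = desc
	return nil
}

func main() {
	r := newFlowRegistry()
	fmt.Println(r.RegisterFlow("f1", "scan"))         // succeeds
	fmt.Println(r.RegisterFlow("f1", "scan (retry)")) // returns an error
}
```

The caller can then propagate the error up through flow setup rather than crashing the node.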
This is now converted to an error. Moving to 1.2 since we don't have much to go on in terms of reproducing the condition.
hello! on sapphire-1, right now:
We were incorrectly passing all the args as a single slice arg. Also improving the error message for cockroachdb#12876.
Looks like this problem is CSV-specific. The normal DistSQL code path nicely health-checks and version-checks nodes before planning on them:

cockroach/pkg/sql/distsql_physical_planner.go, lines 896 to 906 at 1700c40

The CSV code, on the other hand, uses a different code path that just grabs all nodes that have ever existed, including decommissioned nodes, seemingly without doing any checks on them:

cockroach/pkg/ccl/importccl/csv.go, lines 927 to 935 at 1700c40
cockroach/pkg/sql/distsql_plan_csv.go, lines 215 to 219 at 1700c40
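The difference between the two code paths can be sketched as below. This is a hedged illustration, not the actual planner code: `node`, `checkNodeHealthAndVersion`, and `planningNodes` are hypothetical stand-ins for the checks the normal DistSQL planner performs before assigning work to a node.

```go
package main

import "fmt"

// node is a stand-in for the real node descriptor type.
type node struct {
	id      int
	healthy bool
	version string
}

// checkNodeHealthAndVersion mimics the checks done in
// distsql_physical_planner.go: skip nodes that are unhealthy (e.g.
// decommissioned) or running an incompatible DistSQL version.
func checkNodeHealthAndVersion(n node, wantVersion string) error {
	if !n.healthy {
		return fmt.Errorf("n%d is not healthy", n.id)
	}
	if n.version != wantVersion {
		return fmt.Errorf("n%d has incompatible version %s", n.id, n.version)
	}
	return nil
}

// planningNodes keeps only nodes that pass the checks, which is what
// the IMPORT/CSV path should do instead of grabbing every node that
// has ever existed.
func planningNodes(all []node, wantVersion string) []node {
	var out []node
	for _, n := range all {
		if err := checkNodeHealthAndVersion(n, wantVersion); err != nil {
			continue // plan around this node
		}
		out = append(out, n)
	}
	return out
}

func main() {
	all := []node{{1, true, "v2"}, {2, false, "v2"}, {3, true, "v1"}}
	fmt.Println(len(planningNodes(all, "v2"))) // only n1 survives the checks
}
```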
Passing back to @mjibson since this doesn't appear to be a gossip-related issue and I'm not super familiar with the code paths involved.
Ah, missed the part where both repros were for imports. Good catch.
25154: opt: add WITH ORDINALITY via a RowNumber operator r=justinj a=justinj

WITH ORDINALITY has the behaviour of a special case of the window function RANK with the empty set of partitioning columns. This commit leaves most of window functions unimplemented, only implementing enough to do WITH ORDINALITY. WITH ORDINALITY introduces a new column `ordinality` to the input set, whose value corresponds to a given row's position in that input set. The semantics of the ordering are briefly discussed at https://www.cockroachlabs.com/docs/stable/query-order.html#order-preservation. This commit introduces window functions in a very limited form, only as required for WITH ORDINALITY, so no partitioning or window functions besides ROW_NUMBER() are supported. In addition, a given window function can only have a single windowing operator.

Release note: None

25162: importccl: check node health and compatibility during IMPORT planning r=mjibson a=mjibson

Simplify the LoadCSV signature by taking just a PlanHookState for any argument that can be fetched from it. Determine the node list using this new health check function. We can remove the rand.Shuffle call because the map iteration should produce some level of randomness.

Fixes #12876

Release note (bug fix): Fix problems with imports sometimes failing after node decommissioning.

25226: opt: Hoist EXISTS and ANY operators r=andy-kimball a=andy-kimball

This PR contains several commits that flatten EXISTS and ANY subqueries by hoisting them up and joining them with the higher level relational query. Hoisting and flattening subqueries into a single, simple tree of relational operators is the preparatory step to decorrelation. Next will come rules which will try to eliminate correlation in the flattened tree.

Co-authored-by: Justin Jaffray <justin@cockroachlabs.com>
Co-authored-by: Matt Jibson <matt.jibson@gmail.com>
Co-authored-by: Andrew Kimball <andyk@cockroachlabs.com>
25307: release-2.0: importccl: check node health and compatibility during IMPORT planning r=mjibson a=mjibson

Backport 1/1 commits from #25162. /cc @cockroachdb/release

---

Simplify the LoadCSV signature by taking just a PlanHookState for any argument that can be fetched from it. Determine the node list using this new health check function. We can remove the rand.Shuffle call because the map iteration should produce some level of randomness.

Fixes #12876

Release note (bug fix): Fix problems with imports sometimes failing after node decommissioning.

Co-authored-by: Matt Jibson <matt.jibson@gmail.com>
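The "map iteration should produce some level of randomness" remark in the fix above relies on a real Go property: the language specification leaves map iteration order unspecified, and the runtime deliberately randomizes it between runs. A small sketch (values and key names are illustrative only):

```go
package main

import "fmt"

// Go randomizes map iteration order, which is why the PR could drop
// the explicit rand.Shuffle: ranging over the node map already visits
// nodes in a non-deterministic order from run to run.
func main() {
	nodes := map[int]string{1: "n1", 2: "n2", 3: "n3", 4: "n4"}
	var order []int
	for id := range nodes {
		order = append(order, id)
	}
	fmt.Println(order) // order varies between runs
}
```

Note the randomization is a runtime implementation detail intended to keep programs from depending on any particular order; it is not a substitute for a proper shuffle when uniform randomness matters.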
@andreimatei just got this. It happened after I rebooted my server and my PD SSD didn't attach itself before my daemon started cockroach again.
Since the respective issue has been closed. Touches cockroachdb#12876. Release note: None
Ran into the panic below the first time I tried to run a distsql query on a node. Did not run into it again after restart. Nothing useful in the logs, except a long spam of messages like the one below (prior to the query being run).
I looked through the code, and the only explanation I can come up with is that the distsql planner's nodeID was not set correctly. Then we may end up sending a setup flow request to ourselves AND setting up a sync flow locally with the same ID.
Investigate the nodeID issue and make this into an error, not a panic.