-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-17251][SQL] Support OuterReference in projection list of IN correlated subqueries
#16012
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…orrelated subquery
OuterReference in projection list of a correlated subqueryOuterReference in projection list of IN correlated subqueries
| override def nullable: Boolean = e.nullable | ||
| override def prettyName: String = "outer" | ||
|
|
||
| override def name: String = throw new UnsupportedOperationException |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implement as much of these as possible (delegate to the NamedExpression). It is a resolved expression after all.
| @@ -0,0 +1,18 @@ | |||
| CREATE TEMPORARY VIEW t1 AS SELECT * FROM VALUES 1, 2 AS t1(a); | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Naming: lets call this subqueries.sql. Correlation has a different meaning in different contexts
|
Test build #69163 has finished for PR 16012 at commit
|
|
Thank you, @hvanhovell ! I'll fix like that! |
| */ | ||
| case class OuterReference(e: NamedExpression) extends LeafExpression with Unevaluable { | ||
| case class OuterReference(e: NamedExpression)( | ||
| val exprId: ExprId = NamedExpression.newExprId) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use the exprId of the NamedExpression.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it okay? I thought it works like 'Alias'. Anyway, no problem. I'll update like that.
| override def name: String = e.name | ||
| override def qualifier: Option[String] = e.qualifier | ||
| override def toAttribute: Attribute = e.toAttribute | ||
| override def newInstance(): NamedExpression = OuterReference(e)() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OuterReference(e.newInstance())()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure.
| try { | ||
| outer.resolve(nameParts, resolver) match { | ||
| case Some(outerAttr) => OuterReference(outerAttr) | ||
| case Some(outerAttr) => OuterReference(outerAttr)() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure the analyzer change has the desired effect. This just remove the outer reference from the tree, and this won't work if we use the attribute anywhere in the tree. For example:
select *
from tbl_a
where id in (select x
from (select tbl_b.id,
tbl_a.id + 1 as x,
tbl_a.id + tbl_b.id as y
from tbl_b)
where y > 0)I think we need to break this down into two steps:
- Do not support this for now and just fix the named expression. That would be my goal for 2.1.
- Try to see if we can rewrite the tree in such a way that we can extract the value. That would be my goal for 2.2. I am not sure how well we can make this work. In the end I think we need a dedicated subquery operator.
cc @nsyca what do you think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. Correct. I'll check that again.
BTW, What about the predicates? It felt the predicates are handled the same way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have not looked at the code changes closely but got a general idea of what the originally reported problem is. I second @hvanhovell to not support outer reference in a SELECT clause of a subquery in 2.1. Just fix the named expression first.
IN subquery might be okay as it reflects the inner join semantics more or less. NOT IN subquery is converted to a special case of an anti-join with extra logic for the null value.
select *
from tbl_a
where tbl_a.c1 not in (select tbl_a.c2 from tbl_b)Does the LeftAnti with effectively no join predicate, i.e.,
(isnull(tbl_a.c1 = tbl_a.c2) || (tbl_a.c1 = tbl_a.c2))
work correctly today? And if it returns a correct result, is it by design, not by chance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another interesting case to consider:
select ...
from t1
where t1.c1 in (select sum(t1.c2) from t2)If we support correlated columns in SELECT clause, do we build the Aggregate on T2 or T1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The LEFT ANTI join should produce the correct result. Unfortunately we push down the tbl_a.c1 = tbl_a.c2 expression into the tbl_a side of the plan. So we need to fix this. I have created https://issues.apache.org/jira/browse/SPARK-18597 to track this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @hvanhovell .
BTW, may I rebase this PR and try to the second plan for 2.2 here?
- Try to see if we can rewrite the tree in such a way that we can extract the value. That would be my goal for 2.2. I am not sure how well we can make this work. In the end I think we need a dedicated subquery operator.
|
Test build #69173 has finished for PR 16012 at commit
|
|
Thank you for review, @hvanhovell and @nsyca . |
|
Hi, @hvanhovell and @nsyca . I created a JIRA issue for Option 2 with simple and complex example (@hvanhovell 's). Now, I'm closing this PR. Thank you for reviews, @hvanhovell and @nsyca . |
What changes were proposed in this pull request?
Currently, correlated subqueries do not allow
OuterReferencecolumns in projection lists of IN correlated subqueries. This PR aims to support that by makingOuterReferenceasNamedExpressionand extendingpullOutCorrelatedPredicatesintopullOutCorrelatedProjectionAndPredicates.How was this patch tested?
Pass the Jenkins test with new test cases.