Skip to content

Conversation

@dongjoon-hyun
Copy link
Member

@dongjoon-hyun dongjoon-hyun commented Nov 25, 2016

What changes were proposed in this pull request?

Currently, correlated subqueries do not allow OuterReference columns in projection lists of IN correlated subqueries. This PR aims to support that by making OuterReference as NamedExpression and extending pullOutCorrelatedPredicates into pullOutCorrelatedProjectionAndPredicates.

scala> sql("CREATE TEMPORARY VIEW t1 AS SELECT * FROM VALUES 1, 2 AS t1(a)")
scala> sql("CREATE TEMPORARY VIEW t2 AS SELECT * FROM VALUES 1 AS t2(b)")
scala> sql("SELECT a FROM t1 WHERE a IN (SELECT a FROM t2)").show
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.OuterReference cannot be cast to org.apache.spark.sql.catalyst.expressions.NamedExpression

How was this patch tested?

Pass the Jenkins test with new test cases.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-17251][SQL] Support OuterReference in projection list of a correlated subquery [SPARK-17251][SQL] Support OuterReference in projection list of IN correlated subqueries Nov 25, 2016
override def nullable: Boolean = e.nullable
override def prettyName: String = "outer"

override def name: String = throw new UnsupportedOperationException
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implement as much of these as possible (delegate to the NamedExpression). It is a resolved expression after all.

@@ -0,0 +1,18 @@
CREATE TEMPORARY VIEW t1 AS SELECT * FROM VALUES 1, 2 AS t1(a);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naming: lets call this subqueries.sql. Correlation has a different meaning in different contexts

@SparkQA
Copy link

SparkQA commented Nov 25, 2016

Test build #69163 has finished for PR 16012 at commit 58449b3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class OuterReference(e: NamedExpression)

@dongjoon-hyun
Copy link
Member Author

Thank you, @hvanhovell ! I'll fix like that!

*/
case class OuterReference(e: NamedExpression) extends LeafExpression with Unevaluable {
case class OuterReference(e: NamedExpression)(
val exprId: ExprId = NamedExpression.newExprId)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use the exprId of the NamedExpression.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it okay? I thought it works like 'Alias'. Anyway, no problem. I'll update like that.

override def name: String = e.name
override def qualifier: Option[String] = e.qualifier
override def toAttribute: Attribute = e.toAttribute
override def newInstance(): NamedExpression = OuterReference(e)()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OuterReference(e.newInstance())()?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.

try {
outer.resolve(nameParts, resolver) match {
case Some(outerAttr) => OuterReference(outerAttr)
case Some(outerAttr) => OuterReference(outerAttr)()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure the analyzer change has the desired effect. This just remove the outer reference from the tree, and this won't work if we use the attribute anywhere in the tree. For example:

select *
from   tbl_a
where id in (select x
             from (select tbl_b.id,
                          tbl_a.id + 1 as x,
                          tbl_a.id + tbl_b.id as y
                   from   tbl_b)
              where y > 0)

I think we need to break this down into two steps:

  1. Do not support this for now and just fix the named expression. That would be my goal for 2.1.
  2. Try to see if we can rewrite the tree in such a way that we can extract the value. That would be my goal for 2.2. I am not sure how well we can make this work. In the end I think we need a dedicated subquery operator.

cc @nsyca what do you think.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. Correct. I'll check that again.

BTW, What about the predicates? It felt the predicates are handled the same way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not looked at the code changes closely but got a general idea of what the originally reported problem is. I second @hvanhovell to not support outer reference in a SELECT clause of a subquery in 2.1. Just fix the named expression first.

IN subquery might be okay as it reflects the inner join semantics more or less. NOT IN subquery is converted to a special case of an anti-join with extra logic for the null value.

select *
from   tbl_a
where  tbl_a.c1 not in (select tbl_a.c2 from tbl_b)

Does the LeftAnti with effectively no join predicate, i.e.,

(isnull(tbl_a.c1 = tbl_a.c2) || (tbl_a.c1 = tbl_a.c2))

work correctly today? And if it returns a correct result, is it by design, not by chance?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another interesting case to consider:

select ...
from   t1
where  t1.c1 in (select sum(t1.c2) from t2)

If we support correlated columns in SELECT clause, do we build the Aggregate on T2 or T1?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The LEFT ANTI join should produce the correct result. Unfortunately we push down the tbl_a.c1 = tbl_a.c2 expression into the tbl_a side of the plan. So we need to fix this. I have created https://issues.apache.org/jira/browse/SPARK-18597 to track this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @hvanhovell .
BTW, may I rebase this PR and try to the second plan for 2.2 here?

  1. Try to see if we can rewrite the tree in such a way that we can extract the value. That would be my goal for 2.2. I am not sure how well we can make this work. In the end I think we need a dedicated subquery operator.

@SparkQA
Copy link

SparkQA commented Nov 25, 2016

Test build #69173 has finished for PR 16012 at commit 3de9419.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class OuterReference(e: NamedExpression)(

@dongjoon-hyun
Copy link
Member Author

Thank you for review, @hvanhovell and @nsyca .
I agree with you. We need enough time for this.
So, the option one for 2.1 is spun off into #16015.

@dongjoon-hyun
Copy link
Member Author

Hi, @hvanhovell and @nsyca .

I created a JIRA issue for Option 2 with simple and complex example (@hvanhovell 's).

Now, I'm closing this PR. Thank you for reviews, @hvanhovell and @nsyca .

@dongjoon-hyun dongjoon-hyun deleted the SPARK-17251 branch November 28, 2016 00:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants