Original file line number Diff line number Diff line change
@@ -68,8 +68,14 @@ object RewritePredicateSubquery extends Rule[LogicalPlan] with PredicateHelper {
// Note that will almost certainly be planned as a Broadcast Nested Loop join.
// Use EXISTS if performance matters to you.
val (joinCond, outerPlan) = rewriteExistentialExpr(conditions, p)
-val anyNull = splitConjunctivePredicates(joinCond.get).map(IsNull).reduceLeft(Or)
-Join(outerPlan, sub, LeftAnti, Option(Or(anyNull, joinCond.get)))
+// Expand the NOT IN expression with the NULL-aware semantic
+// to its full form. That is from:
+// (a1,b1,...) = (a2,b2,...)
+// to
+// (a1=a2 OR isnull(a1=a2)) AND (b1=b2 OR isnull(b1=b2)) AND ...
+val joinConds = splitConjunctivePredicates(joinCond.get)
+val pairs = joinConds.map(c => Or(c, IsNull(c))).reduceLeft(And)
+Join(outerPlan, sub, LeftAnti, Option(pairs))
case (p, predicate) =>
val (newCond, inputPlan) = rewriteExistentialExpr(Seq(predicate), p)
Project(p.output, Filter(newCond.get, inputPlan))
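
To see why the old join condition empties the result while the new one does not, the two conditions can be modeled outside Spark. What follows is a minimal sketch and not Catalyst code: `Option` stands in for SQL NULL, and the names `eq3`, `and3`, `or3`, `oldCond`, `newCond`, and `leftAnti` are hypothetical helpers introduced only for this illustration. The data is the `t1`/`t2` content from the test file in this PR.

```scala
object NotInRewriteSketch {
  type Row = (Option[Int], Option[Int])

  // SQL equality under three-valued logic: None stands for NULL/unknown.
  def eq3(x: Option[Int], y: Option[Int]): Option[Boolean] =
    for (a <- x; b <- y) yield a == b

  // SQL AND and OR under three-valued logic.
  def and3(p: Option[Boolean], q: Option[Boolean]): Option[Boolean] = (p, q) match {
    case (Some(false), _) | (_, Some(false)) => Some(false)
    case (Some(true), Some(true))            => Some(true)
    case _                                   => None
  }

  def or3(p: Option[Boolean], q: Option[Boolean]): Option[Boolean] = (p, q) match {
    case (Some(true), _) | (_, Some(true)) => Some(true)
    case (Some(false), Some(false))        => Some(false)
    case _                                 => None
  }

  // isnull never returns unknown.
  def isNull(p: Option[Boolean]): Option[Boolean] = Some(p.isEmpty)

  // Old condition: (isnull(c1) OR isnull(c2)) OR (c1 AND c2)
  def oldCond(l: Row, r: Row): Option[Boolean] = {
    val cs = Seq(eq3(l._1, r._1), eq3(l._2, r._2))
    or3(cs.map(isNull).reduceLeft(or3), cs.reduceLeft(and3))
  }

  // New condition: (c1 OR isnull(c1)) AND (c2 OR isnull(c2))
  def newCond(l: Row, r: Row): Option[Boolean] = {
    val cs = Seq(eq3(l._1, r._1), eq3(l._2, r._2))
    cs.map(c => or3(c, isNull(c))).reduceLeft(and3)
  }

  // Left anti join: keep an outer row only if no subquery row makes the condition true.
  def leftAnti(outer: Seq[Row], sub: Seq[Row])(cond: (Row, Row) => Option[Boolean]): Seq[Row] =
    outer.filterNot(l => sub.exists(r => cond(l, r).contains(true)))

  val t1: Seq[Row] = Seq(
    (Some(1), Some(1)), (Some(2), Some(1)), (None, Some(1)),
    (Some(1), Some(3)), (None, Some(3)),
    (Some(1), None), (None, Some(2)))

  val t2: Seq[Row] = Seq((Some(1), Some(1)), (None, Some(3)), (Some(1), None))

  def main(args: Array[String]): Unit = {
    println(leftAnti(t1, t2)(oldCond)) // prints List()
    println(leftAnti(t1, t2)(newCond)) // prints List((Some(2),Some(1)))
  }
}
```

Under the old condition, a single NULL in any compared column of any `t2` row makes the condition true for every outer row, so the `t2` row `(null, 3)` alone removes all of `t1`. Under the new condition, each column pair must either match or be unknown, so `(2, 1)` survives.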
@@ -0,0 +1,55 @@
-- This file contains test cases for NOT IN subquery with multiple columns.

-- The data sets are populated as follows:
-- 1) When T1.A1 = T2.A2
--    1.1) T1.B1 = T2.B2
--    1.2) T1.B1 = T2.B2 returns false
--    1.3) T1.B1 is null
--    1.4) T2.B2 is null
-- 2) When T1.A1 = T2.A2 returns false
-- 3) When T1.A1 is null
-- 4) When T2.A2 is null

-- T1.A1  T1.B1     T2.A2  T2.B2
-- -----  -----     -----  -----
--     1      1         1      1    (1.1)
--     1      3                     (1.2)
--     1   null         1   null    (1.3 & 1.4)
--
--     2      1         1      1    (2)
--  null      1                     (3)
--                   null      3    (4)

create temporary view t1 as select * from values
(1, 1), (2, 1), (null, 1),
(1, 3), (null, 3),
(1, null), (null, 2)
as t1(a1, b1);

create temporary view t2 as select * from values
(1, 1),
(null, 3),
(1, null)
as t2(a2, b2);

-- multiple columns in NOT IN
-- TC 01.01
select a1,b1
from t1
where (a1,b1) not in (select a2,b2
from t2);

-- multiple columns with expressions in NOT IN
-- TC 01.02
select a1,b1
from t1
where (a1-1,b1) not in (select a2,b2
from t2);

-- multiple columns with expressions in NOT IN
-- TC 01.03
select a1,b1
from t1
where (a1,b1) not in (select a2+1,b2
from t2);

@@ -0,0 +1,59 @@
-- Automatically generated by SQLQueryTestSuite
-- Number of queries: 5


-- !query 0
create temporary view t1 as select * from values
(1, 1), (2, 1), (null, 1),
(1, 3), (null, 3),
(1, null), (null, 2)
as t1(a1, b1)
-- !query 0 schema
struct<>
-- !query 0 output



-- !query 1
create temporary view t2 as select * from values
(1, 1),
(null, 3),
(1, null)
as t2(a2, b2)
-- !query 1 schema
struct<>
-- !query 1 output



-- !query 2
select a1,b1
from t1
where (a1,b1) not in (select a2,b2
from t2)
-- !query 2 schema
struct<a1:int,b1:int>
-- !query 2 output
2 1

Contributor Author

It returns an empty set without this fix.

Contributor

Why is (null, 2) missing? There is no tuple in t2 for which b2=2.

Contributor Author

@nsyca nsyca Jan 6, 2017

Let's consider this:

(null, 2) NOT IN { (1, 1), (null, 3), (1, null) }

which is equal to

.... AND (null <> 1 OR 2 <> null) => ... AND (unknown OR unknown)
                                  => ... AND unknown
                                  => unknown

Therefore (null, 2) is not part of the result set.

Contributor

OK, yes, you are right. I was confusing this with the OR rules.
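
The evaluation walked through in the thread above can be checked mechanically. This is a small sketch under the assumption that `(a, b) NOT IN S` is read as the conjunction, over every `(a2, b2)` in `S`, of `(a <> a2 OR b <> b2)`, evaluated in SQL three-valued logic. `Option[Boolean]` models true/false/unknown, and all names (`ne3`, `and3`, `or3`, `notIn`) are hypothetical and exist only in this example.

```scala
object NotInThreeValued {
  type Row = (Option[Int], Option[Int])

  // SQL inequality: unknown (None) if either side is NULL.
  def ne3(x: Option[Int], y: Option[Int]): Option[Boolean] =
    for (a <- x; b <- y) yield a != b

  def and3(p: Option[Boolean], q: Option[Boolean]): Option[Boolean] = (p, q) match {
    case (Some(false), _) | (_, Some(false)) => Some(false)
    case (Some(true), Some(true))            => Some(true)
    case _                                   => None
  }

  def or3(p: Option[Boolean], q: Option[Boolean]): Option[Boolean] = (p, q) match {
    case (Some(true), _) | (_, Some(true)) => Some(true)
    case (Some(false), Some(false))        => Some(false)
    case _                                 => None
  }

  // (a, b) NOT IN set, as an AND over every (a2, b2) of (a <> a2 OR b <> b2).
  def notIn(row: Row, set: Seq[Row]): Option[Boolean] =
    set.map(s => or3(ne3(row._1, s._1), ne3(row._2, s._2)))
      .foldLeft(Option(true))(and3)

  val t2: Seq[Row] = Seq((Some(1), Some(1)), (None, Some(3)), (Some(1), None))

  def main(args: Array[String]): Unit = {
    println(notIn((None, Some(2)), t2))    // prints None, i.e. unknown
    println(notIn((Some(2), Some(1)), t2)) // prints Some(true)
  }
}
```

For `(null, 2)`, the comparisons against `(1, 1)` and `(null, 3)` are true because `2 <> 1` and `2 <> 3`, but the comparison against `(1, null)` is unknown, so the overall result is unknown and the WHERE clause drops the row.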


-- !query 3
select a1,b1
from t1
where (a1-1,b1) not in (select a2,b2
from t2)
-- !query 3 schema
struct<a1:int,b1:int>
-- !query 3 output
1 1

Contributor Author

It returns an empty set without this fix.


-- !query 4
select a1,b1
from t1
where (a1,b1) not in (select a2+1,b2
from t2)
-- !query 4 schema
struct<a1:int,b1:int>
-- !query 4 output
1 1
Contributor Author

It returns an empty set without this fix.

@@ -163,7 +163,12 @@ class SQLQueryTestSuite extends QueryTest with SharedSQLContext {
s"-- Number of queries: ${outputs.size}\n\n\n" +
outputs.zipWithIndex.map{case (qr, i) => qr.toString(i)}.mkString("\n\n\n") + "\n"
}
-stringToFile(new File(testCase.resultFile), goldenOutput)
+val resultFile = new File(testCase.resultFile);
+val parent = resultFile.getParentFile();
+if (!parent.exists()) {
+assert(parent.mkdirs(), "Could not create directory: " + parent)
+}
+stringToFile(resultFile, goldenOutput)
Contributor Author

This newly added code addresses an issue that arises when test files are located in a hierarchy of subdirectories: at the time the golden result files are generated, those subdirectories may not exist yet. The code creates the required subdirectories.

}

// Read back the golden file.
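
As a standalone illustration of the directory-creation step in this hunk, the same idea can be sketched with `java.nio.file`. This is not the code in the PR, which uses `java.io.File` and `mkdirs()`; `writeWithParents` and the example path are hypothetical names chosen for the sketch.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

object GoldenFileWriterSketch {
  // Hypothetical helper: create any missing parent directories, then write the file.
  def writeWithParents(target: Path, content: String): Path = {
    val parent = target.getParent
    // getParent returns null for a path that has no parent component.
    if (parent != null) Files.createDirectories(parent)
    Files.write(target, content.getBytes(StandardCharsets.UTF_8))
  }

  def main(args: Array[String]): Unit = {
    val root = Files.createTempDirectory("golden-sketch")
    // Nested subdirectories that do not exist yet.
    val target = root.resolve("subquery").resolve("nested").resolve("example.sql.out")
    writeWithParents(target, "-- Number of queries: 0\n")
    println(Files.exists(target)) // prints true
  }
}
```

`Files.createDirectories` does not fail when the directory already exists, so no separate existence check is needed in this variant.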
@@ -263,12 +263,12 @@ class SubquerySuite extends QueryTest with SharedSQLContext {
Row(1, 2.0) :: Row(1, 2.0) :: Nil)

checkAnswer(
-sql("select * from l where a not in (select c from t where b < d)"),
-Row(1, 2.0) :: Row(1, 2.0) :: Row(3, 3.0) :: Nil)
+sql("select * from l where (a, b) not in (select c, d from t) and a < 4"),
+Row(1, 2.0) :: Row(1, 2.0) :: Row(2, 1.0) :: Row(2, 1.0) :: Row(3, 3.0) :: Nil)
Contributor Author

A query with correlated predicates in a NOT IN subquery can generate incorrect results (this problem is tracked by SPARK-18966). This fix exposes that problem, so I modified the test case to cover the code path for multiple columns instead.


// Empty sub-query
checkAnswer(
-sql("select * from l where a not in (select c from r where c > 10 and b < d)"),
+sql("select * from l where (a, b) not in (select c, d from r where c > 10)"),
Contributor Author

The predicate c > 10 filters out all the rows in the subquery, which masks the correlated-predicate problem. Instead of removing the test case completely, I modified it to cover multiple columns.

Contributor

Then we also should test an empty subquery :)

Contributor Author

This test case effectively covers the empty-subquery case.

Row(1, 2.0) :: Row(1, 2.0) :: Row(2, 1.0) :: Row(2, 1.0) ::
Row(3, 3.0) :: Row(null, null) :: Row(null, 5.0) :: Row(6, null) :: Nil)
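
The expected answer above includes rows that contain NULLs, which follows from NOT IN over an empty set being true for every row. The sketch below illustrates that under two assumptions: the rows of `l` are taken from the expected answer of this test (which equals the full table only if the subquery really is empty), and `rowNe`, `notIn`, and `where` are hypothetical helpers rather than Spark APIs.

```scala
object EmptySubqueryNotIn {
  type Row = (Option[Int], Option[Double])

  // Three-valued "x <> y" for two-column rows: true if any column differs,
  // unknown (None) if no column differs but a comparison involves NULL.
  def rowNe(x: Row, y: Row): Option[Boolean] = {
    val cols: Seq[Option[Boolean]] = Seq(
      for (a <- x._1; b <- y._1) yield a != b,
      for (a <- x._2; b <- y._2) yield a != b)
    if (cols.contains(Some(true))) Some(true)
    else if (cols.contains(None)) None
    else Some(false)
  }

  // NOT IN is the three-valued AND of rowNe over every subquery row.
  def notIn(row: Row, sub: Seq[Row]): Option[Boolean] = {
    val results = sub.map(rowNe(row, _))
    if (results.contains(Some(false))) Some(false)
    else if (results.contains(None)) None
    else Some(true) // also reached when sub is empty
  }

  // WHERE keeps a row only when the predicate is true, not when it is unknown.
  def where(rows: Seq[Row], sub: Seq[Row]): Seq[Row] =
    rows.filter(r => notIn(r, sub).contains(true))

  // Assumed contents of l, copied from the expected answer of the test.
  val l: Seq[Row] = Seq(
    (Some(1), Some(2.0)), (Some(1), Some(2.0)), (Some(2), Some(1.0)), (Some(2), Some(1.0)),
    (Some(3), Some(3.0)), (None, None), (None, Some(5.0)), (Some(6), None))

  def main(args: Array[String]): Unit = {
    println(where(l, Seq.empty) == l)            // prints true
    println(where(l, Seq((None, None))).isEmpty) // prints true
  }
}
```

The second call shows the contrast: a single all-NULL row in the subquery makes the predicate unknown for every outer row, so nothing is returned.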
