-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-19017][SQL] NOT IN subquery with more than one column may return incorrect results #16467
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
b988651
069ed8f
edca333
64184fd
29f82b0
ac43ab4
631d396
7eb9b2d
1387cf5
3faa2d5
a308634
473c81b
278ebae
f1524b9
9e1b29e
de655d0
f3f7773
5c36dce
05d2683
6a1a415
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,55 @@ | ||
| -- This file contains test cases for NOT IN subquery with multiple columns. | ||
|
|
||
| -- The data sets are populated as follows: | ||
| -- 1) When T1.A1 = T2.A2 | ||
| -- 1.1) T1.B1 = T2.B2 | ||
| -- 1.2) T1.B1 = T2.B2 returns false | ||
| -- 1.3) T1.B1 is null | ||
| -- 1.4) T2.B2 is null | ||
| -- 2) When T1.A1 = T2.A2 returns false | ||
| -- 3) When T1.A1 is null | ||
| -- 4) When T1.A2 is null | ||
|
|
||
| -- T1.A1 T1.B1 T2.A2 T2.B2 | ||
| -- ----- ----- ----- ----- | ||
| -- 1 1 1 1 (1.1) | ||
| -- 1 3 (1.2) | ||
| -- 1 null 1 null (1.3 & 1.4) | ||
| -- | ||
| -- 2 1 1 1 (2) | ||
| -- null 1 (3) | ||
| -- null 3 (4) | ||
|
|
||
| create temporary view t1 as select * from values | ||
| (1, 1), (2, 1), (null, 1), | ||
| (1, 3), (null, 3), | ||
| (1, null), (null, 2) | ||
| as t1(a1, b1); | ||
|
|
||
| create temporary view t2 as select * from values | ||
| (1, 1), | ||
| (null, 3), | ||
| (1, null) | ||
| as t2(a2, b2); | ||
|
|
||
| -- multiple columns in NOT IN | ||
| -- TC 01.01 | ||
| select a1,b1 | ||
| from t1 | ||
| where (a1,b1) not in (select a2,b2 | ||
| from t2); | ||
|
|
||
| -- multiple columns with expressions in NOT IN | ||
| -- TC 01.02 | ||
| select a1,b1 | ||
| from t1 | ||
| where (a1-1,b1) not in (select a2,b2 | ||
| from t2); | ||
|
|
||
| -- multiple columns with expressions in NOT IN | ||
| -- TC 01.02 | ||
| select a1,b1 | ||
| from t1 | ||
| where (a1,b1) not in (select a2+1,b2 | ||
| from t2); | ||
|
|
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,59 @@ | ||
| -- Automatically generated by SQLQueryTestSuite | ||
| -- Number of queries: 5 | ||
|
|
||
|
|
||
| -- !query 0 | ||
| create temporary view t1 as select * from values | ||
| (1, 1), (2, 1), (null, 1), | ||
| (1, 3), (null, 3), | ||
| (1, null), (null, 2) | ||
| as t1(a1, b1) | ||
| -- !query 0 schema | ||
| struct<> | ||
| -- !query 0 output | ||
|
|
||
|
|
||
|
|
||
| -- !query 1 | ||
| create temporary view t2 as select * from values | ||
| (1, 1), | ||
| (null, 3), | ||
| (1, null) | ||
| as t2(a2, b2) | ||
| -- !query 1 schema | ||
| struct<> | ||
| -- !query 1 output | ||
|
|
||
|
|
||
|
|
||
| -- !query 2 | ||
| select a1,b1 | ||
| from t1 | ||
| where (a1,b1) not in (select a2,b2 | ||
| from t2) | ||
| -- !query 2 schema | ||
| struct<a1:int,b1:int> | ||
| -- !query 2 output | ||
| 2 1 | ||
|
|
||
|
|
||
| -- !query 3 | ||
| select a1,b1 | ||
| from t1 | ||
| where (a1-1,b1) not in (select a2,b2 | ||
| from t2) | ||
| -- !query 3 schema | ||
| struct<a1:int,b1:int> | ||
| -- !query 3 output | ||
| 1 1 | ||
|
|
||
|
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It returns an empty set without this fix. |
||
|
|
||
| -- !query 4 | ||
| select a1,b1 | ||
| from t1 | ||
| where (a1,b1) not in (select a2+1,b2 | ||
| from t2) | ||
| -- !query 4 schema | ||
| struct<a1:int,b1:int> | ||
| -- !query 4 output | ||
| 1 1 | ||
|
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It returns an empty set without this fix. |
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -163,7 +163,12 @@ class SQLQueryTestSuite extends QueryTest with SharedSQLContext { | |
| s"-- Number of queries: ${outputs.size}\n\n\n" + | ||
| outputs.zipWithIndex.map{case (qr, i) => qr.toString(i)}.mkString("\n\n\n") + "\n" | ||
| } | ||
| stringToFile(new File(testCase.resultFile), goldenOutput) | ||
| val resultFile = new File(testCase.resultFile); | ||
| val parent = resultFile.getParentFile(); | ||
| if (!parent.exists()) { | ||
| assert(parent.mkdirs(), "Could not create directory: " + parent) | ||
| } | ||
| stringToFile(resultFile, goldenOutput) | ||
|
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This newly added code is to address an issue, when test files are located in a hierarchy of sub-directories, at the time the golden result files are generated it could happen that the structure of those sub-directories are not yet created. The code will create the required sub-directories. |
||
| } | ||
|
|
||
| // Read back the golden file. | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -263,12 +263,12 @@ class SubquerySuite extends QueryTest with SharedSQLContext { | |
| Row(1, 2.0) :: Row(1, 2.0) :: Nil) | ||
|
|
||
| checkAnswer( | ||
| sql("select * from l where a not in (select c from t where b < d)"), | ||
| Row(1, 2.0) :: Row(1, 2.0) :: Row(3, 3.0) :: Nil) | ||
| sql("select * from l where (a, b) not in (select c, d from t) and a < 4"), | ||
| Row(1, 2.0) :: Row(1, 2.0) :: Row(2, 1.0) :: Row(2, 1.0) :: Row(3, 3.0) :: Nil) | ||
|
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Query with correlated predicates in NOT IN subquery could generate incorrect results (this problem is tracked by SPARK-18966). With this fix, it reveals the problem. Here I modify the test case to cover the code path for multiple columns instead. |
||
|
|
||
| // Empty sub-query | ||
| checkAnswer( | ||
| sql("select * from l where a not in (select c from r where c > 10 and b < d)"), | ||
| sql("select * from l where (a, b) not in (select c, d from r where c > 10)"), | ||
|
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. With the predicate
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Then we also should test an empty subquery :)
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This test case is effectively covering the case of empty subquery. |
||
| Row(1, 2.0) :: Row(1, 2.0) :: Row(2, 1.0) :: Row(2, 1.0) :: | ||
| Row(3, 3.0) :: Row(null, null) :: Row(null, 5.0) :: Row(6, null) :: Nil) | ||
|
|
||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It returns an empty set without this fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is
(null, 2)missing? There is no tuple in t2 for which b2=2.Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's consider this:
(null, 2) NOT IN { (1, 1), (null, 3), (1, null) }which is equal to
Therefore
(null, 2)is not part of the result set.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok yeah you are right. I was confusing this with the or rules.