[SPARK-21274][SQL] Implement INTERSECT ALL clause #21886

dilipbiswal · 2018-07-26T17:58:03Z

What changes were proposed in this pull request?

Implements the INTERSECT ALL clause through query rewrites using existing operators in Spark. Please refer to Link for the design.

Input Query

SELECT c1 FROM ut1 INTERSECT ALL SELECT c1 FROM ut2

Rewritten Query

SELECT c1
FROM (
     SELECT replicate_row(min_count, c1)
     FROM (
          SELECT c1,
                 IF (vcol1_cnt > vcol2_cnt, vcol2_cnt, vcol1_cnt) AS min_count
          FROM (
               SELECT c1, count(vcol1) AS vcol1_cnt, count(vcol2) AS vcol2_cnt
               FROM (
                    SELECT c1, true AS vcol1, null AS vcol2 FROM ut1
                    UNION ALL
                    SELECT c1, null AS vcol1, true AS vcol2 FROM ut2
                    ) AS union_all
               GROUP BY c1
               HAVING vcol1_cnt >= 1 AND vcol2_cnt >= 1
               )
          )
     )
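
To make the rewrite easier to follow, here is a minimal sketch in plain Python that models each step of the rewritten query on in-memory lists. It is not Spark code: the function name `intersect_all_via_rewrite` and the sample data are assumptions made for illustration, and the real implementation is the optimizer rule in this PR.

```python
from collections import Counter, defaultdict

def intersect_all_via_rewrite(left, right):
    """Model of the rewrite: tag rows, UNION ALL, GROUP BY, min count, replicate."""
    # UNION ALL of both inputs, tagging each row with (vcol1, vcol2) markers.
    union_all = [(c1, True, None) for c1 in left] + \
                [(c1, None, True) for c1 in right]

    # GROUP BY c1 with count(vcol1) and count(vcol2); count() ignores NULLs.
    counts = defaultdict(lambda: [0, 0])
    for c1, vcol1, vcol2 in union_all:
        counts[c1][0] += int(vcol1 is not None)
        counts[c1][1] += int(vcol2 is not None)

    result = []
    for c1, (vcol1_cnt, vcol2_cnt) in counts.items():
        # HAVING vcol1_cnt >= 1 AND vcol2_cnt >= 1
        if vcol1_cnt >= 1 and vcol2_cnt >= 1:
            # IF (vcol1_cnt > vcol2_cnt, vcol2_cnt, vcol1_cnt) AS min_count
            min_count = vcol2_cnt if vcol1_cnt > vcol2_cnt else vcol1_cnt
            # replicate_row(min_count, c1): emit the row min_count times.
            result.extend([c1] * min_count)
    return result

# Hypothetical sample data standing in for ut1 and ut2.
ut1 = [1, 1, 1, 2, 3]
ut2 = [1, 1, 2, 2, 4]
rows = intersect_all_via_rewrite(ut1, ut2)
print(sorted(rows))  # [1, 1, 2]

# Cross-check against multiset intersection, which keeps the minimum count.
assert Counter(rows) == Counter(ut1) & Counter(ut2)
```

The value 1 appears three times in ut1 and twice in ut2, so it is kept twice; 3 and 4 each appear on one side only and are dropped by the HAVING step.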

How was this patch tested?

Added test cases in SQLQueryTestSuite, DataFrameSuite, and SetOperationSuite.

viirya · 2018-07-26T19:16:46Z

Typo in the PR description: IF (vcol1_cnt > vcol1_cnt, vcol2_cnt, vcol1_cnt) -> IF (vcol1_cnt > vcol2_cnt, vcol2_cnt, vcol1_cnt).

viirya · 2018-07-26T19:19:08Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

typo here vcol1_cnt > vcol1_cnt -> vcol1_cnt > vcol2_cnt.

viirya · 2018-07-26T19:19:36Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

Do we need to have vcol1_cnt and vcol2_cnt here? I think the replicate_row above only takes min_count as input.

@viirya Thanks !! No we don't. In the actual code, we don't project these columns out. I will fix the doc.

viirya · 2018-07-26T19:21:11Z

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

Better to add "resolves columns by position (not by name)"?

viirya · 2018-07-26T19:21:57Z

python/pyspark/sql/dataframe.py

typo: bothe.

frame -> dataframe.

@viirya Will fix.

viirya · 2018-07-26T19:26:57Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

Based on the implementation, I think this should be:

SELECT true as vcol1, null as vcol2, c1 FROM ut1 UNION ALL SELECT null as vcol1, true as vcol2, c1 FROM ut2

@viirya OK :-)

SparkQA · 2018-07-26T19:57:58Z

Test build #93611 has finished for PR 21886 at commit 1039e47.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class Intersect(

dilipbiswal · 2018-07-26T20:46:04Z

@gatorsmile I see this failure in other PRs as well. Was it introduced by some recent changes?

viirya · 2018-07-26T21:16:07Z

retest this please.

SparkQA · 2018-07-26T22:23:45Z

Test build #93617 has finished for PR 21886 at commit 6392469.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-07-27T00:07:29Z

Test build #93629 has finished for PR 21886 at commit 6392469.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-07-27T00:13:46Z

Test build #93642 has finished for PR 21886 at commit 7268736.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

dilipbiswal · 2018-07-27T01:49:57Z

retest this please

SparkQA · 2018-07-27T05:50:23Z

Test build #93651 has finished for PR 21886 at commit 7268736.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-07-27T20:43:11Z

Test build #93677 has finished for PR 21886 at commit bfe7030.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2018-07-27T20:48:49Z

cc @dilipbiswal Could you resolve the conflicts? I will start the review after the rebase.

dilipbiswal · 2018-07-28T01:07:54Z

@gatorsmile Rebased.

SparkQA · 2018-07-28T05:22:05Z

Test build #93708 has finished for PR 21886 at commit 67b15ee.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2018-07-28T05:23:28Z

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

+   * @since 2.4.0
+   */
+  def intersectAll(other: Dataset[T]): Dataset[T] = withSetOperator {
+    Intersect(planWithBarrier, other.planWithBarrier, isAll = true)


could you use logicalPlan?

@gatorsmile Sure. How about exceptAll, which was checked in today?

yes. Please do it too.

gatorsmile · 2018-07-28T05:41:53Z

sql/core/src/test/resources/sql-tests/results/intersect-all.sql.out

+-- !query 8
+SELECT c1 FROM tab1
+INTERSECT ALL
+SELECT c1, c2 FROM tab2


use k and v

gatorsmile · 2018-07-28T05:43:05Z

sql/core/src/test/resources/sql-tests/results/intersect-all.sql.out

+
+-- !query 1
+CREATE TEMPORARY VIEW tab2 AS SELECT * FROM VALUES
+    (1, 2), 


Also add another duplicate row for (1, 2).

gatorsmile · 2018-07-28T05:43:17Z

sql/core/src/test/resources/sql-tests/results/intersect-all.sql.out

+CREATE TEMPORARY VIEW tab1 AS SELECT * FROM VALUES
+    (1, 2), 
+    (1, 2),
+    (1, 3),


also add another duplicate row (1, 3)

gatorsmile · 2018-07-28T05:43:35Z

sql/core/src/test/resources/sql-tests/results/intersect-all.sql.out

+-- !query 1
+CREATE TEMPORARY VIEW tab2 AS SELECT * FROM VALUES
+    (1, 2), 
+    (2, 3)


add one more row (3, 4)

gatorsmile · 2018-07-28T05:44:07Z

The code looks good to me. Let us improve the test cases.

dilipbiswal · 2018-07-28T05:45:09Z

@gatorsmile Thank you.. I will make the changes.

SparkQA · 2018-07-28T15:11:56Z

Test build #93725 has finished for PR 21886 at commit 8ba4b71.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2018-07-29T16:59:34Z

sql/core/src/test/resources/sql-tests/results/intersect-all.sql.out

+1	2
+2	3
+NULL	NULL
+NULL	NULL


This misses one row (1, 3). Could you investigate the cause?

@gatorsmile Thank you. I just went over my notes. The output differs because Spark gives the same precedence to all the set operators, so they are evaluated in the order they appear in the query, from left to right. Per the standard, however, INTERSECT should have higher precedence than UNION and EXCEPT. We have the same problem in our current support of EXCEPT (DISTINCT) and INTERSECT (DISTINCT). I am fixing the test now by adding parentheses around the query block to force a specific order of evaluation. I have opened https://issues.apache.org/jira/browse/SPARK-24966 to work on fixing the precedence in our grammar.
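
A small sketch of why the evaluation order matters, using Python `Counter` objects as multisets. The table names and rows are hypothetical and are not the data from intersect-all.sql; the helper functions only model the ALL variants of the set operators.

```python
from collections import Counter

# Multiset models of the set operators (ALL variants).
def union_all(a, b):
    return a + b          # counts are added

def intersect_all(a, b):
    return a & b          # minimum count per row

# Hypothetical tables; each row is a (k, v) tuple.
tab_a = Counter([(1, 2), (1, 3)])
tab_b = Counter([(1, 2), (2, 3)])
tab_c = Counter([(1, 2)])

# Query: tab_a UNION ALL tab_b INTERSECT ALL tab_c

# Equal precedence, evaluated left to right:
# (tab_a UNION ALL tab_b) INTERSECT ALL tab_c
left_to_right = sorted(intersect_all(union_all(tab_a, tab_b), tab_c).elements())

# INTERSECT binds tighter, as the standard requires:
# tab_a UNION ALL (tab_b INTERSECT ALL tab_c)
standard = sorted(union_all(tab_a, intersect_all(tab_b, tab_c)).elements())

print(left_to_right)  # [(1, 2)]
print(standard)       # [(1, 2), (1, 2), (1, 3)]
```

In this made-up case the left-to-right evaluation drops the row (1, 3) that the standard precedence keeps, which is the kind of difference that parenthesizing the query block in the test avoids.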

gatorsmile · 2018-07-30T02:38:00Z

LGTM pending Jenkins

HyukjinKwon · 2018-07-30T04:36:41Z

...alyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala

+case class Intersect(
+   left: LogicalPlan,
+   right: LogicalPlan,
+   isAll: Boolean = false) extends SetOperation(left, right) {


not a big deal at all but this has three spaces ..

HyukjinKwon · 2018-07-30T04:40:30Z

sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala

+          "logical intersect  operator should have been replaced by semi-join in the optimizer")
+      case logical.Intersect(left, right, true) =>
+        throw new IllegalStateException(
+          "logical intersect operator should have been replaced by union, aggregate" +


nit: looks like we need a trailing space here: aggregate" -> aggregate "

SparkQA · 2018-07-30T04:42:42Z

Test build #93760 has finished for PR 21886 at commit 5d5461a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2018-07-30T05:10:30Z

@dilipbiswal Please address the style issues in your other PRs.

gatorsmile · 2018-07-30T05:12:06Z

Thanks! Merged to master.

gatorsmile · 2018-07-30T05:12:59Z

@dilipbiswal The merged PR does not pick up your last commit.

dilipbiswal · 2018-07-30T05:13:05Z

@gatorsmile Ok Sean.. I will correct in next PR. Thank you very very much.

SparkQA · 2018-07-30T07:05:01Z

Test build #93769 has finished for PR 21886 at commit 89d03af.

This patch fails due to an unknown error code, -9.
This patch does not merge cleanly.
This patch adds no public classes.

viirya reviewed Jul 26, 2018

dilipbiswal force-pushed the dkb_intersect_all_final branch from 7268736 to bfe7030 Compare July 27, 2018 16:34

dilipbiswal added 5 commits July 27, 2018 15:03

generator

782be2b

[SPARK-21274] Implement INTERSECT ALL clause

fbaba34

Code review

65a9a68

code review

aa3ce4a

rebase errors

67b15ee

dilipbiswal force-pushed the dkb_intersect_all_final branch from bfe7030 to 67b15ee Compare July 28, 2018 01:06

gatorsmile reviewed Jul 28, 2018

Code review

8ba4b71

gatorsmile reviewed Jul 29, 2018

Enforce precedence of set operatorss in test

5d5461a

HyukjinKwon reviewed Jul 30, 2018

Code review

89d03af

asfgit closed this in 65a4bc1 Jul 30, 2018

dilipbiswal mentioned this pull request Aug 1, 2018

[SPARK-24966][SQL] Implement precedence rules for set operations. #21941

Closed
