@dilipbiswal
Contributor

@dilipbiswal dilipbiswal commented Jul 24, 2018

What changes were proposed in this pull request?

Implements the EXCEPT ALL clause through query rewrites that use existing Spark operators. This PR adds an internal UDTF (replicate_rows) to help preserve duplicate rows. Please refer to Link for the design.

Note: The proposed UDTF is kept as an internal function, used purely to aid this particular rewrite, which gives us the flexibility to change to a more generalized UDTF in the future.

Input Query

SELECT c1 FROM ut1 EXCEPT ALL SELECT c1 FROM ut2

Rewritten Query

SELECT c1
    FROM (
     SELECT replicate_rows(sum_val, c1)
       FROM (
         SELECT c1, sum_val
           FROM (
             SELECT c1, sum(vcol) AS sum_val
               FROM (
                 SELECT 1L as vcol, c1 FROM ut1
                 UNION ALL
                 SELECT -1L as vcol, c1 FROM ut2
              ) AS union_all
            GROUP BY union_all.c1
          )
        WHERE sum_val > 0
       )
   )
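
For readers following the rewrite, here is a minimal Python sketch of the same idea, not the Spark implementation: tag rows from the left input with +1 and rows from the right input with -1, sum the tags per distinct row, keep the rows whose sum is positive, and emit each one that many times. The function name `except_all` and the sample data are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import repeat


def except_all(left, right):
    """Multiset difference, computed the way the rewritten query does it."""
    # UNION ALL of both inputs: left rows tagged +1, right rows tagged -1.
    tagged = [(1, row) for row in left] + [(-1, row) for row in right]

    # GROUP BY row, SUM(vcol) AS sum_val.
    sum_val = defaultdict(int)
    for vcol, row in tagged:
        sum_val[row] += vcol

    # WHERE sum_val > 0, then replicate each surviving row sum_val times.
    result = []
    for row, count in sum_val.items():
        if count > 0:
            result.extend(repeat(row, count))
    return result


ut1 = [1, 1, 1, 2, 3]
ut2 = [1, 2, 2, 4]
print(sorted(except_all(ut1, ut2)))  # [1, 1, 3]
```

The value 1 appears three times on the left and once on the right, so two copies survive; 2 and 4 end with a non-positive sum and are dropped.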

How was this patch tested?

Added test cases in SQLQueryTestSuite, DataFrameSuite and SetOperationSuite

@dilipbiswal changed the title from "[SPARK-21274] Implement EXCEPT ALL clause." to "[SPARK-21274][SQL] Implement EXCEPT ALL clause." on Jul 24, 2018
@SparkQA

SparkQA commented Jul 24, 2018

Test build #93488 has finished for PR 21857 at commit 5cf8c4c.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • case class ReplicateRows(children: Seq[Expression]) extends Generator with CodegenFallback
      • abstract class ExceptBase(left: LogicalPlan, right: LogicalPlan) extends SetOperation(left, right)
      • case class Except(left: LogicalPlan, right: LogicalPlan) extends ExceptBase(left, right)
      • case class ExceptAll(left: LogicalPlan, right: LogicalPlan) extends ExceptBase(left, right)

Member

If it's for an internal purpose, you can just remove this though.

@SparkQA

SparkQA commented Jul 24, 2018

Test build #93490 has finished for PR 21857 at commit 988d3b4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

nit: i -> _

Member

nit: a internal -> an internal

Member

@maropu maropu left a comment

I left some comments.

Member

@maropu maropu Jul 24, 2018

IIUC the feature freeze will come soon, so I'm not sure this will appear in v2.4. cc: @gatorsmile @rxin

Member

We are still targeting it to 2.4. If we are unable to make it, we can change it.

Member

@maropu maropu Jul 24, 2018

Do we need a logical node for ExceptAll? As another option, we could add a flag as a field of Except (e.g., case class Except(left: LogicalPlan, right: LogicalPlan, all: Boolean)). I think Except and ExceptAll are not different for the analyzer.
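
As a rough illustration of the flag-based option, here is a small Python sketch. It is not Spark code: the `Except` dataclass, the `is_all` field, and both helper functions are hypothetical stand-ins. It only shows that code treating both variants alike can work with a single class, while the one place that cares about duplicates branches on the flag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Except:
    """Hypothetical stand-in for a logical plan node; not Spark's class."""
    left: str
    right: str
    is_all: bool = False  # True for EXCEPT ALL, False for plain EXCEPT


def describe(node: Except) -> str:
    # Shared handling: one class covers both variants.
    keyword = "EXCEPT ALL" if node.is_all else "EXCEPT"
    return f"{node.left} {keyword} {node.right}"


def choose_rewrite(node: Except) -> str:
    # Only the rewrite step needs to look at the flag.
    if node.is_all:
        return "rewrite-preserving-duplicates"
    return "rewrite-distinct"


print(describe(Except("ut1", "ut2", is_all=True)))
print(choose_rewrite(Except("ut1", "ut2")))
```

The trade-off is that every existing pattern match on the node has to account for the extra field, which is the cost discussed in the replies that follow.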

Contributor Author

@maropu Right. This way, most of the pattern matching happens on the base class, where things are common. I went back and forth on this as well. If there is a consensus, I will change it.

Contributor Author

@maropu Some details to aid the decision making. I remember now: this way, I had to change fewer files. I just looked at the usages of Except to double check.

Member

Let us avoid adding a new logical plan node. : )

Member

I feel it's ok to move these tests to except.sql.

Contributor Author

@maropu I thought we liked to keep these SQL files relatively small, without too many queries in each.

Member

nit: need indents

Contributor Author

@maropu will check and fix.

Member

We don't support codegen?

Contributor Author

@maropu I would like to handle this in a follow-up. I think we have codegen disabled for generators in general, so we would not be able to take advantage of it?

Member

ah, ok. It sounds ok to me.

Member

@maropu maropu Jul 24, 2018

We need tests for the RewriteExceptAll rule (you can check RewriteDistinctAggregatesSuite as a reference).

Contributor Author

@dilipbiswal dilipbiswal Jul 24, 2018

Ah.. ok... Thanks a lot.

Contributor Author

@maropu I have added a unit test to check the plan. Please look at it when you get a chance.

Member

super nit: need spaces like (0), (1), (2), (3), ....

Contributor Author

@maropu Will do.

@SparkQA

SparkQA commented Jul 24, 2018

Test build #93495 has finished for PR 21857 at commit a6fc341.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

then, we do not need these changes.

Contributor Author

@dilipbiswal dilipbiswal Jul 24, 2018

@gatorsmile That's right, Sean. We will not need changes here. However, may I ask you to command-B on the Except class? We may need to change the pattern matching in other places, right? I just wanted to make sure you are okay with it before I go ahead and make the changes.

Member

I am fine about that. Please make a change and avoid introducing a new LogicalPlan node.

Contributor Author

@gatorsmile @maropu I have removed ExceptAll operator.

@SparkQA

SparkQA commented Jul 24, 2018

Test build #93505 has finished for PR 21857 at commit 96e2dc9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

@dilipbiswal Also post the design doc in the PR description?

@gatorsmile
Member

cc @ueshin @maryannxue Please review this?

@dilipbiswal
Contributor Author

@gatorsmile I have the link to the design doc in the description. Is there another way?

@SparkQA

SparkQA commented Jul 24, 2018

Test build #93510 has finished for PR 21857 at commit c516f78.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

nit: there is no cnt?

Contributor Author

@ueshin Thanks.. will change.

Member

We can remove it from the pr description as well.

Contributor Author

@ueshin Done. Thanks.

@SparkQA

SparkQA commented Jul 26, 2018

Test build #93573 has finished for PR 21857 at commit b201b88.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 26, 2018

Test build #93587 has finished for PR 21857 at commit c49b15c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class Except(

@SparkQA

SparkQA commented Jul 26, 2018

Test build #93595 has finished for PR 21857 at commit 93abd51.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal force-pushed the dkb_except_all_final branch from 93abd51 to a415f6a on July 26, 2018 17:47
Member

@gatorsmile gatorsmile Jul 26, 2018

update this?

Contributor Author

Ah.. sorry.. will change it to Except.

Member

qualifier = None, ....

Contributor Author

@gatorsmile will change.

Member

isAll = true

Contributor Author

@gatorsmile will change

Member

Does replicate_rows's output include sum_val? I think it removes the multiplier value from output?

Contributor Author

@dilipbiswal dilipbiswal Jul 26, 2018

@viirya Yes, it removes the multiplier. You had suggested removing it in the generator PR. At that time it was an external function and I wanted to match Hive. Now that it's an internal function, I decided to remove it as per your suggestion.

Member

So I think here it should be replicate_rows(sum_val, c1) AS c1?

Contributor Author

yes.. thanks.. will fix.
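
To make the behavior discussed in this thread concrete, here is a hedged Python sketch of a generator in the spirit of replicate_rows. It is an illustration only, not the Spark expression: the first argument is the multiplier, the remaining arguments are the column values, and the multiplier itself is left out of the output rows.

```python
def replicate_rows(multiplier, *columns):
    """Yield the column values `multiplier` times, omitting the multiplier."""
    for _ in range(multiplier):
        yield columns


# One input row (sum_val=2, c1="a") becomes two output rows holding only c1.
print(list(replicate_rows(2, "a")))  # [('a',), ('a',)]
```

Because the output carries only the replicated columns, the enclosing SELECT has to name them, which is why the alias on the generator call matters.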

Member

Please don't forget to update the pr description as well.

Member

I think it would be good to mention here too that this resolves columns by position (not by name).

Member

except all

Contributor Author

There is no operator called "except all"?

Member

Hmm, it looks no different from the one above. Maybe "except (all) operator"?

Contributor Author

ok.. will change to except (all).. the difference was in the replaced operators, actually.

Member

Maybe it is better to add one more row to show the behavior of preserving duplicates.

@SparkQA

SparkQA commented Jul 26, 2018

Test build #93610 has finished for PR 21857 at commit a415f6a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 27, 2018

Test build #93639 has finished for PR 21857 at commit f0da978.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

;

Contributor Author

@gatorsmile Thank you. fixed.

@SparkQA

SparkQA commented Jul 27, 2018

Test build #93656 has finished for PR 21857 at commit 1f107aa.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal force-pushed the dkb_except_all_final branch from 1f107aa to 4e04883 on July 27, 2018 16:38
Member

@gatorsmile gatorsmile left a comment

LGTM

@SparkQA

SparkQA commented Jul 27, 2018

Test build #93678 has finished for PR 21857 at commit 4e04883.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit asfgit closed this in 10f1f19 Jul 27, 2018
@gatorsmile
Member

Thanks! Merged to master.

@dilipbiswal
Contributor Author

Thank you very much @gatorsmile @maropu @viirya @ueshin @HyukjinKwon @kiszk
