Conversation

@gatorsmile
Member

The current SQLContext allows the following query, which is copied from a test case in SQLQuerySuite:

     checkAnswer(sql(
       """
         |select key from ((select * from testData limit 1)
         |  union all (select * from testData limit 1)) x limit 1
       """.stripMargin),
       Row(1)
     )

However, it is rejected by the Hive parser.

This PR makes the Hive parser support LIMIT clauses inside set operators.

@SparkQA

SparkQA commented Jan 11, 2016

Test build #49080 has finished for PR 10689 at commit 310cb32.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member Author

@hvanhovell @rxin Could you take a look? Thank you!

Contributor

Should we test that a limit is actually being injected? Otherwise the parser could've just ignored the clause.
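
For illustration, a check along these lines could confirm that the limits survive parsing rather than being silently dropped (a rough sketch only; it assumes the 1.6-era Limit logical operator and the queryExecution.analyzed API, and the test name is made up):

    import org.apache.spark.sql.catalyst.plans.logical.Limit

    test("limit inside set operators is not silently dropped") {
      val df = sql(
        """
          |select key from ((select * from testData limit 1)
          |  union all (select * from testData limit 1)) x limit 1
        """.stripMargin)
      // If the parser injects the limits, the analyzed plan should contain
      // three Limit nodes: one per UNION ALL branch plus the outer one.
      val limits = df.queryExecution.analyzed.collect { case l: Limit => l }
      assert(limits.size === 3)
    }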

Member Author

Sure, will add such a test case.

Member Author

Sure, it sounds better. Will do.

@SparkQA

SparkQA commented Jan 11, 2016

Test build #49094 has finished for PR 10689 at commit 6244975.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor

Jenkins, retest this please.

@hvanhovell
Contributor

@gatorsmile the fix looks good.

@rxin / @marmbrus / @gatorsmile I am not sure if we should support this at all. Using a limit in SELECTs connected by a UNION ALL is fine, but things tend to get really strange once you start using this in combination with other set or join operations; it gets very hard to reason about the result. Most RDBMSes do not support this. I'd rather have an optimizer rule which pushes down limit clauses whenever this is possible.
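
To make that alternative concrete, a rule of that flavor might look roughly like this (not an actual Spark rule, just a sketch; it assumes the 1.6-era binary Union node and the Limit(limitExpr, child) shape, and only handles the UNION ALL case):

    import org.apache.spark.sql.catalyst.plans.logical.{Limit, LogicalPlan, Union}
    import org.apache.spark.sql.catalyst.rules.Rule

    object PushLimitThroughUnion extends Rule[LogicalPlan] {
      def apply(plan: LogicalPlan): LogicalPlan = plan transform {
        // Push the outer limit into both branches so each side can stop early;
        // the outer Limit still trims the combined result. The guard keeps the
        // rule from re-firing once the limits are already pushed down.
        case Limit(n, Union(left, right))
            if !left.isInstanceOf[Limit] && !right.isInstanceOf[Limit] =>
          Limit(n, Union(Limit(n, left), Limit(n, right)))
      }
    }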

@SparkQA

SparkQA commented Jan 11, 2016

Test build #49117 has finished for PR 10689 at commit 6244975.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member Author

Thank you for your review! @hvanhovell Let me share my two cents:

@hvanhovell
Contributor

@gatorsmile I do see the performance benefits of limiting while processing. My reservation is about reasoning about non-top-level limit statements. A set-operator example:

select a from db.tbl_a
intersect
select b from db.tbl_b

The result should be all distinct rows in a for which we can find an equal tuple in b. Let's add limits to this:

select a from db.tbl_a limit 10
intersect
select b from db.tbl_b limit 10

The result would now be the first (distinct?) 10 rows from a, filtered by checking whether they exist in the first 10 rows of b (I think). I am not sure this is what a user expects; furthermore:

  • You will probably end up with fewer than 10 rows here.
  • The results will probably be non-deterministic (unless you would also allow some kind of ordering in a subquery).

Do you have a concrete real-world example where you need this?

I don't really mind putting this back in the parser (the engine supports it anyway), but I don't think we should just do something like this without some consideration.

@gatorsmile
Member Author

Given two tables tbl_a and tbl_b, where tbl_a has billions of rows but tbl_b has only thousands: tbl_a has a column col_frkey_tbl_a whose values should all come from tbl_b's column col_key_tbl_b, and a user wants to do a quick check to confirm that. The query they can try is

select col_frkey_tbl_a from db.tbl_a limit 10000
intersect
select col_key_tbl_b from db.tbl_b 

The above query avoids fetching billions of rows from tbl_a. Hopefully that answers your question. @hvanhovell

@SparkQA

SparkQA commented Jan 11, 2016

Test build #49158 has finished for PR 10689 at commit 94386aa.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 11, 2016

Test build #49160 has finished for PR 10689 at commit b9ba021.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@marmbrus
Contributor

That example seems kind of artificial to me. Additionally, large non-terminal limits are not planned very well today, so I think users are going to be surprised.

@gatorsmile
Member Author

Yeah! I just read the implementation of Limit. As you said, the current non-terminal one is not highly efficient, especially when the number of limits is not small.

@rxin
Contributor

rxin commented Jan 12, 2016

@gatorsmile I think we'd need a more proper design for limits. Let's close this as 'later'.

@gatorsmile
Member Author

Sure, let me close it.

@gatorsmile closed this Jan 12, 2016
ghost pushed a commit to dbtsai/spark that referenced this pull request Jan 15, 2016
In this PR the new CatalystQl parser stack reaches grammar parity with the old Parser-Combinator based SQL Parser. This PR also replaces all uses of the old Parser, and removes it from the code base.

Although the existing Hive and SQL parser dialects were mostly the same, some kinks had to be worked out:
- The SQL Parser allowed syntax like ```APPROXIMATE(0.01) COUNT(DISTINCT a)```. In order to make this work we would need to either hardcode approximate operators in the parser or create an approximate expression. ```APPROXIMATE_COUNT_DISTINCT(a, 0.01)``` would also do the job and is much easier to maintain. So, this PR **removes** this keyword.
- The old SQL Parser supports ```LIMIT``` clauses in nested queries. This is **not supported** anymore. See apache#10689 for the rationale for this.
- Hive supports a charset-name/character-literal combination; for instance, the expression ```_ISO-8859-1 0x4341464562616265``` would yield the string ```CAFEbabe```. Hive only allows charset names that start with an underscore. This is quite annoying in Spark because as soon as you use a tuple, names will start with an underscore. In this PR we **remove** this feature from the parser. It would be quite easy to implement such a feature as an Expression later on.
- Hive and the SQL Parser treat decimal literals differently. Hive will turn any decimal into a ```Double``` whereas the SQL Parser would convert a non-scientific decimal into a ```BigDecimal```, and would turn a scientific decimal into a Double. We follow Hive's behavior here. The new parser supports a big decimal literal, for instance: ```81923801.42BD```, which can be used when a big decimal is needed.

cc rxin viirya marmbrus yhuai cloud-fan

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes apache#10745 from hvanhovell/SPARK-12575-2.