[SPARK-12745] [SQL] Hive Parser: Limit is not supported inside Set Operation #10689
Conversation
Test build #49080 has finished for PR 10689 at commit

@hvanhovell @rxin Could you take a look? Thank you!
Should we test that a limit is actually being injected? Otherwise the parser could've just ignored the clause.
Sure, will add such a test case.
can we do this similar to how we compare plans in various optimizer suites, e.g. https://github.com/apache/spark/blob/master/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeInSuite.scala#L84
Sure, it sounds better. Will do.
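As an aside, the plan-comparison style of assertion suggested above (assert on a normalized tree rather than on string output) can be sketched outside Spark with Python's `ast` module standing in for Catalyst plans; the `f(...)` snippets are invented for illustration:

```python
import ast

def norm(src: str) -> str:
    # Dump the parsed tree so whitespace and formatting differences vanish
    # and only structure is compared -- the same idea as comparing plans
    # in the optimizer suites instead of comparing query strings.
    return ast.dump(ast.parse(src))

same = norm("f(1 + 2)") == norm("f( 1+2 )")        # formatting differs, tree identical
different = norm("f(1 + 2)") == norm("f(2 + 1)")   # genuinely different expression
print(same, different)
```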
Test build #49094 has finished for PR 10689 at commit

Jenkins, retest this please.
@gatorsmile the fix looks good. @rxin / @marmbrus / @gatorsmile I am not sure if we should support this at all. Using a limit in SELECTs connected by a UNION ALL is fine, but things tend to get really strange once you start using this in combination with other SET or JOIN operations; it'll get very hard to reason about the result. Most RDBMSes do not support this. I'd rather have an optimizer rule which pushes down limit clauses whenever this is possible.
Test build #49117 has finished for PR 10689 at commit
Thank you for your review! @hvanhovell Let me share my two cents:
@gatorsmile I do see the performance benefits of … The result should be all distinct rows in … The result would now be the first (distinct?) 10 rows from …
Do you have a concrete real-world example where you need this? I don't really mind if we put this back in the parser (the engine supports it anyway). But I don't think we should just do something like this without some consideration.
Given two tables …, the above query can avoid fetching billions of rows from …
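The performance argument above can be sketched with SQLite as a stand-in (branch limits expressed through derived tables, since SQLite rejects the bare form); the tables `big1`/`big2` and the row counts are invented for illustration. Pushing a LIMIT into each UNION ALL branch means only a handful of rows per side ever reach the union:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE big1(x INTEGER);
    CREATE TABLE big2(x INTEGER);
""")
# Stand-ins for two large tables.
conn.executemany("INSERT INTO big1 VALUES (?)", [(i,) for i in range(1000)])
conn.executemany("INSERT INTO big2 VALUES (?)", [(i,) for i in range(1000, 2000)])

# Each branch is truncated to 10 rows before the UNION ALL, so the union
# only ever sees 20 candidate rows instead of 2000.
rows = conn.execute("""
    SELECT x FROM (SELECT x FROM big1 ORDER BY x LIMIT 10)
    UNION ALL
    SELECT x FROM (SELECT x FROM big2 ORDER BY x LIMIT 10)
    ORDER BY x LIMIT 10
""").fetchall()
print(len(rows))  # 10
```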
Test build #49158 has finished for PR 10689 at commit

Test build #49160 has finished for PR 10689 at commit
That example seems kind of artificial to me. Additionally, large non-terminal limits are not planned very well today, so I think users are going to be surprised.
Yeah! I just read the implementation of …
@gatorsmile I think we'd need a more proper design for limits. Let's close this as "later".
Sure, let me close it. |
In this PR the new CatalystQl parser stack reaches grammar parity with the old Parser-Combinator based SQL Parser. This PR also replaces all uses of the old Parser, and removes it from the code base. Although the existing Hive and SQL parser dialects were mostly the same, some kinks had to be worked out:

- The SQL Parser allowed syntax like ```APPROXIMATE(0.01) COUNT(DISTINCT a)```. In order to make this work we would need to hardcode approximate operators in the parser, or create an approximate expression. ```APPROXIMATE_COUNT_DISTINCT(a, 0.01)``` does the same job and is much easier to maintain, so this PR **removes** this keyword.
- The old SQL Parser supported ```LIMIT``` clauses in nested queries. This is **not supported** anymore. See apache#10689 for the rationale.
- Hive supports a charset-name/char-set-literal combination; for instance, the expression ```_ISO-8859-1 0x4341464562616265``` yields the string ```CAFEbabe```. Hive only allows charset names that start with an underscore. This is quite annoying in Spark, because as soon as you use a tuple, names will start with an underscore. This PR **removes** the feature from the parser; it would be quite easy to implement it as an Expression later on.
- Hive and the SQL Parser treat decimal literals differently: Hive turns any decimal into a ```Double```, whereas the SQL Parser converted a non-scientific decimal into a ```BigDecimal``` and a scientific decimal into a ```Double```. We follow Hive's behavior here. The new parser also supports a big decimal literal, for instance ```81923801.42BD```, which can be used when a ```BigDecimal``` is needed.

cc rxin viirya marmbrus yhuai cloud-fan

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes apache#10745 from hvanhovell/SPARK-12575-2.
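The decimal-literal point above (binary ```Double``` vs exact ```BigDecimal``` semantics) can be illustrated outside Spark; this is only an analogy, with Python's `Decimal` standing in for Java's `BigDecimal`:

```python
from decimal import Decimal

# Double (binary float) semantics, as in Hive's treatment of decimal literals:
as_double = 0.1 + 0.2
print(as_double == 0.3)  # False: binary rounding error

# Exact decimal semantics, which a literal like 81923801.42BD requests in the
# new parser (Python's Decimal standing in for Java's BigDecimal):
as_big = Decimal("0.1") + Decimal("0.2")
print(as_big == Decimal("0.3"))  # True: no precision loss
```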
The current SQLContext allows the following query, which is copied from a test case in SQLQuerySuite:
However, it is rejected by the Hive parser.
This PR makes the Hive parser support the LIMIT clause inside set operations.
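For context on the restriction, other engines behave like the unpatched Hive parser: SQLite, for example, rejects a bare LIMIT on a set-operation branch. The query below is illustrative only, not the elided SQLQuerySuite case:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t(a INTEGER)")

# A LIMIT directly on the left branch of a compound select is a parse error
# in SQLite, mirroring the Hive parser's behavior before this PR.
try:
    conn.execute("SELECT a FROM t LIMIT 5 UNION ALL SELECT a FROM t")
    err = None
except sqlite3.OperationalError as e:
    err = e
print("rejected:", err)
```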