Conversation

@gatorsmile
Member

The current SQLContext allows the following query, which is copied from a test case in SQLQuerySuite:

     checkAnswer(sql(
       """
         |select key from ((select * from testData limit 1)
         |  union all (select * from testData limit 1)) x limit 1
       """.stripMargin),
       Row(1)
     )

However, it is rejected by the Hive parser.

This PR makes the Hive parser support LIMIT clauses inside set operators.

@SparkQA

SparkQA commented Jan 11, 2016

Test build #49080 has finished for PR 10689 at commit 310cb32.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member Author

@hvanhovell @rxin Could you take a look? Thank you!

Contributor

Should we test that a limit is actually being injected? Otherwise the parser could've just ignored the clause.
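
For illustration, a check along these lines could confirm that the limits survive parsing rather than being silently dropped (a rough sketch only; it assumes the 1.6-era Limit logical operator and the queryExecution.analyzed API, and the test name is made up):

    import org.apache.spark.sql.catalyst.plans.logical.Limit

    test("limit inside set operators is not silently dropped") {
      val df = sql(
        """
          |select key from ((select * from testData limit 1)
          |  union all (select * from testData limit 1)) x limit 1
        """.stripMargin)
      // If the parser injects the limits, the analyzed plan should contain
      // three Limit nodes: one per UNION ALL branch plus the outer one.
      val limits = df.queryExecution.analyzed.collect { case l: Limit => l }
      assert(limits.size === 3)
    }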

Member Author

Sure, will add such a test case.

Member Author

Sure, it sounds better. Will do.

@SparkQA

SparkQA commented Jan 11, 2016

Test build #49094 has finished for PR 10689 at commit 6244975.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor

Jenkins, retest this please.

@hvanhovell
Contributor

@gatorsmile the fix looks good.

@rxin / @marmbrus / @gatorsmile I am not sure if we should support this at all. Using a limit in SELECTs connected by a UNION ALL is fine, but things tend to get really strange once you start using this in combination with other set or join operations; it gets very hard to reason about the result. Most RDBMSes do not support this. I'd rather have an optimizer rule which pushes down limit clauses whenever this is possible.
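
To make that alternative concrete, a rule of that flavor might look roughly like this (not an actual Spark rule, just a sketch; it assumes the 1.6-era binary Union node and the Limit(limitExpr, child) shape, and only handles the UNION ALL case):

    import org.apache.spark.sql.catalyst.plans.logical.{Limit, LogicalPlan, Union}
    import org.apache.spark.sql.catalyst.rules.Rule

    object PushLimitThroughUnion extends Rule[LogicalPlan] {
      def apply(plan: LogicalPlan): LogicalPlan = plan transform {
        // Push the outer limit into both branches so each side can stop early;
        // the outer Limit still trims the combined result. The guard keeps the
        // rule from re-firing once the limits are already pushed down.
        case Limit(n, Union(left, right))
            if !left.isInstanceOf[Limit] && !right.isInstanceOf[Limit] =>
          Limit(n, Union(Limit(n, left), Limit(n, right)))
      }
    }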

@SparkQA

SparkQA commented Jan 11, 2016

Test build #49117 has finished for PR 10689 at commit 6244975.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member Author

Thank you for your review! @hvanhovell Let me share my two cents:

@hvanhovell
Contributor

@gatorsmile I do see the performance benefits of limiting while processing. My reservation is about reasoning about non-top-level limit statements. A set-operator example:

select a from db.tbl_a
intersect
select b from db.tbl_b

The result should be all distinct rows in a for which we can find an equal tuple in b. Let's add limits to this:

select a from db.tbl_a limit 10
intersect
select b from db.tbl_b limit 10

The result would now be the first (distinct?) 10 rows from a, filtered by checking whether they exist in the first 10 rows of b (I think). I am not sure this is what a user expects; furthermore:

  • You will probably end up with fewer than 10 rows here.
  • The results will probably be non-deterministic (unless you would also allow some kind of ordering in a subquery).

Do you have a concrete real-world example where you need this?

I don't really mind putting this back in the parser (the engine supports it anyway), but I don't think we should just do something like this without some consideration.

@gatorsmile
Member Author

Given two tables tbl_a and tbl_b, where tbl_a has billions of rows but tbl_b has only thousands: tbl_a has a column col_frkey_tbl_a whose values should all come from tbl_b's column col_key_tbl_b, and a user wants to do a quick check to confirm that. The query they can try is

select col_frkey_tbl_a from db.tbl_a limit 10000
intersect
select col_key_tbl_b from db.tbl_b 

The above query avoids fetching billions of rows from tbl_a. Hopefully that answers your question. @hvanhovell

@SparkQA

SparkQA commented Jan 11, 2016

Test build #49158 has finished for PR 10689 at commit 94386aa.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 11, 2016

Test build #49160 has finished for PR 10689 at commit b9ba021.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@marmbrus
Contributor

That example seems kind of artificial to me. Additionally, large non-terminal limits are not planned very well today, so I think users are going to be surprised.

@gatorsmile
Member Author

Yeah! I just read the implementation of Limit. As you said, the current non-terminal one is not highly efficient, especially when the number of limits is not small.

@rxin
Contributor

rxin commented Jan 12, 2016

@gatorsmile I think we'd need a more proper design for limits. Let's close this as 'later'.

@gatorsmile
Member Author

Sure, let me close it.

@gatorsmile closed this Jan 12, 2016
ghost pushed a commit to dbtsai/spark that referenced this pull request Jan 15, 2016
In this PR the new CatalystQl parser stack reaches grammar parity with the old Parser-Combinator based SQL Parser. This PR also replaces all uses of the old Parser, and removes it from the code base.

Although the existing Hive and SQL parser dialects were mostly the same, some kinks had to be worked out:
- The SQL Parser allowed syntax like ```APPROXIMATE(0.01) COUNT(DISTINCT a)```. In order to make this work we would need to either hardcode approximate operators in the parser or create an approximate expression. ```APPROXIMATE_COUNT_DISTINCT(a, 0.01)``` would also do the job and is much easier to maintain. So, this PR **removes** this keyword.
- The old SQL Parser supports ```LIMIT``` clauses in nested queries. This is **not supported** anymore. See apache#10689 for the rationale for this.
- Hive supports a charset-name/character-literal combination; for instance, the expression ```_ISO-8859-1 0x4341464562616265``` would yield the string ```CAFEbabe```. Hive only allows charset names that start with an underscore. This is quite annoying in Spark because as soon as you use a tuple, names will start with an underscore. In this PR we **remove** this feature from the parser. It would be quite easy to implement such a feature as an Expression later on.
- Hive and the SQL Parser treat decimal literals differently. Hive will turn any decimal into a ```Double``` whereas the SQL Parser would convert a non-scientific decimal into a ```BigDecimal```, and would turn a scientific decimal into a Double. We follow Hive's behavior here. The new parser supports a big decimal literal, for instance: ```81923801.42BD```, which can be used when a big decimal is needed.

cc rxin viirya marmbrus yhuai cloud-fan

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes apache#10745 from hvanhovell/SPARK-12575-2.