Conversation

@hvanhovell
Contributor

In this PR the new CatalystQl parser stack reaches grammar parity with the old parser-combinator-based SQL Parser. This PR also replaces all uses of the old parser and removes it from the code base.

Although the existing Hive and SQL parser dialects were mostly the same, some kinks had to be worked out:

  • The SQL Parser allowed syntax like APPROXIMATE(0.01) COUNT(DISTINCT a). Making this work would have required either hardcoding approximate operators in the parser or creating a dedicated approximate expression. APPROXIMATE_COUNT_DISTINCT(a, 0.01) does the same job and is much easier to maintain, so this PR removes the APPROXIMATE keyword.
  • The old SQL Parser supported LIMIT clauses in nested queries; this is no longer supported. See [SPARK-12745] [SQL] Hive Parser: Limit is not supported inside Set Operation #10689 for the rationale.
  • Hive supports a charset-name/charset-literal combination: for instance, the expression _ISO-8859-1 0x4341464562616265 yields the string CAFEbabe. Hive only allows charset names that start with an underscore. This is quite annoying in Spark, because tuple field names also start with an underscore (_1, _2, ...). This PR removes the feature from the parser; it would be quite easy to implement it as an Expression later on.
  • Hive and the SQL Parser treat decimal literals differently. Hive turns any decimal into a Double, whereas the SQL Parser converted a non-scientific decimal into a BigDecimal and a scientific decimal into a Double. We follow Hive's behavior here. The new parser also supports a big decimal literal, for instance 81923801.42BD, which can be used when a big decimal is needed.
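Taken together, the literal changes above can be sketched as follows (hypothetical queries against a table t; the annotated result types follow the rules described in this PR):

```sql
-- Plain decimal literals follow Hive and parse as Double:
SELECT 81923801.42 AS dbl_col FROM t;     -- Double

-- The new BD suffix yields a big decimal when one is needed:
SELECT 81923801.42BD AS dec_col FROM t;   -- BigDecimal

-- No longer accepted after this PR:
-- SELECT APPROXIMATE(0.01) COUNT(DISTINCT a) FROM t;  -- APPROXIMATE keyword removed
-- SELECT _ISO-8859-1 0x4341464562616265 FROM t;       -- charset literal removed
```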

cc @rxin @viirya @marmbrus @yhuai @cloud-fan

Contributor Author

I am not happy about this one: we are using an unconfigured parser here.

Contributor

Not pretty, but you could get the conf from SQLContext getOrCreate / getActive?

Contributor

If we do that, we should make sure we have a default one when there is no active context.

Contributor Author

Could we also move this to SQLImplicits, or SQLContext for that matter?

Contributor

That won't work for Java though.

Contributor Author

Yeah, totally forgot about that.

Contributor Author

Almost all tests were moved to the CatalystQl suite in a previous PR.

@yhuai
Contributor

yhuai commented Jan 13, 2016

test this please

@SparkQA

SparkQA commented Jan 14, 2016

Test build #49347 has finished for PR 10745 at commit 2b6a876.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

are there any other database systems that allow this feature?

Contributor

Talked to a few more people about this. I think it's best to just drop this feature. Only MySQL and Hive support this, and we cannot support the identical syntax anyway. I'd say if this is a really desired feature, we can just build a function for it.

Contributor Author

I'll drop the feature.

@rxin
Contributor

rxin commented Jan 14, 2016

Looks pretty good overall.

@hvanhovell
Contributor Author

retest this please

@SparkQA

SparkQA commented Jan 14, 2016

Test build #49394 has finished for PR 10745 at commit 2b6a876.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 14, 2016

Test build #49409 has finished for PR 10745 at commit 8ea9865.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

are these copied from hive?

Contributor Author

No, this is what was supported in the old SqlParser.

Contributor

are you trying to support both hive's and our interval literal grammar?

Contributor Author

Hive does not support multi-time-unit intervals, such as: 1 year 3 month 10 milliseconds

Contributor Author

> are you trying to support both hive's and our interval literal grammar?

In this case I am trying to support both. Our interval grammar can be seen as an extension to Hive's interval grammar.

Contributor

+1 on supporting both. actually we have to here.
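As a sketch of the two interval forms discussed here (the exact quoting of values and unit keywords in the actual grammar may differ):

```sql
-- Hive-style single-unit interval:
SELECT INTERVAL '1' YEAR FROM t;

-- Multi-unit interval from the old SqlParser, also accepted by the
-- new parser but not by Hive:
SELECT INTERVAL 1 YEAR 3 MONTH 10 MILLISECOND FROM t;
```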

Contributor

How about we check elements.isEmpty first and throw an exception if needed, and then foldLeft? Then we don't need this updated variable.

Contributor

actually, instead of a foldLeft with two values, it might be easier and clearer to write this as a loop and just mutate two variables.
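A minimal sketch of the two styles being compared, with a hypothetical elements sequence (not the actual PR code):

```scala
val elements = Seq(1L, 2L, 3L)

// foldLeft threading two values through a tuple:
val (foldSum, foldCount) = elements.foldLeft((0L, 0)) {
  case ((sum, count), e) => (sum + e, count + 1)
}

// The same computation as a loop mutating two variables,
// which can read more clearly:
var sum = 0L
var count = 0
for (e <- elements) {
  sum += e
  count += 1
}
```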

@SparkQA

SparkQA commented Jan 15, 2016

Test build #49425 has finished for PR 10745 at commit e8c0813.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Contributor Author

retest this please

@hvanhovell
Contributor Author

Getting a weird, seemingly unrelated Python error.

@SparkQA

SparkQA commented Jan 15, 2016

Test build #49457 has finished for PR 10745 at commit e8c0813.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Jan 15, 2016

LGTM. Will merge once tests pass.

Contributor

why this change?

Contributor Author

HyperLogLogPlusPlus was failing because I was passing it Decimal literals. I thought I could solve this by casting. While this is no longer relevant, I still think the change is valid.

Contributor

inputTypes is used to check input types; however, HyperLogLogPlusPlus only has one input, see https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/HyperLogLogPlusPlus.scala#L132.

So we don't need to give two type constraints here.

Contributor Author

Yeah you are right. I am fixing it now.

@SparkQA

SparkQA commented Jan 15, 2016

Test build #49470 has finished for PR 10745 at commit 7e31ee8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 15, 2016

Test build #49483 has finished for PR 10745 at commit 5aa780f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

LGTM

@rxin
Contributor

rxin commented Jan 15, 2016

Thanks - going to merge this.
