Conversation

@hvanhovell
Contributor

In this PR the new CatalystQl parser stack reaches grammar parity with the old parser-combinator-based SQL Parser. This PR also replaces all uses of the old parser and removes it from the code base.

Although the existing Hive and SQL parser dialects were mostly the same, some kinks had to be worked out:

  • The SQL Parser allowed syntax like APPROXIMATE(0.01) COUNT(DISTINCT a). Making this work would have required either hardcoding approximate operators in the parser or creating a dedicated approximate expression. APPROXIMATE_COUNT_DISTINCT(a, 0.01) does the same job and is much easier to maintain, so this PR removes the APPROXIMATE keyword.
  • The old SQL Parser supported LIMIT clauses in nested queries; this is no longer supported. See [SPARK-12745] [SQL] Hive Parser: Limit is not supported inside Set Operation #10689 for the rationale.
  • Hive supports a charset-name/charset-literal combination: for instance, the expression _ISO-8859-1 0x4341464562616265 yields the string CAFEbabe. Hive only allows charset names that start with an underscore. This is quite annoying in Spark, because tuple field names also start with an underscore (_1, _2, ...). This PR removes the feature from the parser; it would be quite easy to implement it as an Expression later on.
  • Hive and the SQL Parser treat decimal literals differently. Hive turns any decimal into a Double, whereas the SQL Parser converted a non-scientific decimal into a BigDecimal and a scientific decimal into a Double. We follow Hive's behavior here. The new parser also supports a big decimal literal, for instance 81923801.42BD, which can be used when a big decimal is needed.
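Taken together, the literal changes above can be sketched as follows (hypothetical queries against a table t; the annotated result types follow the rules described in this PR):

```sql
-- Plain decimal literals follow Hive and parse as Double:
SELECT 81923801.42 AS dbl_col FROM t;     -- Double

-- The new BD suffix yields a big decimal when one is needed:
SELECT 81923801.42BD AS dec_col FROM t;   -- BigDecimal

-- No longer accepted after this PR:
-- SELECT APPROXIMATE(0.01) COUNT(DISTINCT a) FROM t;  -- APPROXIMATE keyword removed
-- SELECT _ISO-8859-1 0x4341464562616265 FROM t;       -- charset literal removed
```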

cc @rxin @viirya @marmbrus @yhuai @cloud-fan

Contributor Author

I am not happy about this one: we are using an unconfigured parser here.

Contributor

Not pretty, but you could get the conf from SQLContext getOrCreate / getActive?

Contributor

If we do that, we should make sure we have a default one when there is no active context.

Contributor Author

Could we also move this to SQLImplicits, or SQLContext for that matter?

Contributor

That won't work for Java though.

Contributor Author

Yeah, totally forgot about that.

Contributor Author

Almost all tests were moved to the CatalystQl suite in a previous PR.

@yhuai
Contributor

yhuai commented Jan 13, 2016

test this please

@SparkQA

SparkQA commented Jan 14, 2016

Test build #49347 has finished for PR 10745 at commit 2b6a876.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

are there any other database systems that allow this feature?

Contributor

Talked to a few more people about this. I think it's best to just drop this feature. Only MySQL and Hive support this, and we cannot support the identical syntax anyway. I'd say if this is a really desired feature, we can just build a function for it.

Contributor Author

I'll drop the feature.

@rxin
Contributor

rxin commented Jan 14, 2016

Looks pretty good overall.

@hvanhovell
Contributor Author

retest this please

@SparkQA

SparkQA commented Jan 14, 2016

Test build #49394 has finished for PR 10745 at commit 2b6a876.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 14, 2016

Test build #49409 has finished for PR 10745 at commit 8ea9865.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

are these copied from hive?

Contributor Author

No, this is what was supported in the old SqlParser.

Contributor

are you trying to support both hive's and our interval literal grammar?

Contributor Author

Hive does not support multi-time-unit intervals, such as: 1 year 3 month 10 milliseconds

Contributor Author

> are you trying to support both hive's and our interval literal grammar?

In this case I am trying to support both. Our interval grammar can be seen as an extension to Hive's interval grammar.

Contributor

+1 on supporting both. actually we have to here.
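As a sketch of the two interval forms discussed here (the exact quoting of values and unit keywords in the actual grammar may differ):

```sql
-- Hive-style single-unit interval:
SELECT INTERVAL '1' YEAR FROM t;

-- Multi-unit interval from the old SqlParser, also accepted by the
-- new parser but not by Hive:
SELECT INTERVAL 1 YEAR 3 MONTH 10 MILLISECOND FROM t;
```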

Contributor

How about we check elements.isEmpty first and throw an exception if needed, and then foldLeft? Then we don't need this updated variable.

Contributor

actually, instead of a foldLeft with two values, it might be easier and clearer to write this as a loop and just mutate two variables.
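A minimal sketch of the two styles being compared, with a hypothetical elements sequence (not the actual PR code):

```scala
val elements = Seq(1L, 2L, 3L)

// foldLeft threading two values through a tuple:
val (foldSum, foldCount) = elements.foldLeft((0L, 0)) {
  case ((sum, count), e) => (sum + e, count + 1)
}

// The same computation as a loop mutating two variables,
// which can read more clearly:
var sum = 0L
var count = 0
for (e <- elements) {
  sum += e
  count += 1
}
```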

@SparkQA

SparkQA commented Jan 15, 2016

Test build #49425 has finished for PR 10745 at commit e8c0813.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Contributor Author

retest this please

@hvanhovell
Contributor Author

Getting a weird, seemingly unrelated Python error.

@SparkQA

SparkQA commented Jan 15, 2016

Test build #49457 has finished for PR 10745 at commit e8c0813.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Jan 15, 2016

LGTM. Will merge once tests pass.

Contributor

why this change?

Contributor Author

HyperLogLogPlusPlus was failing because I was passing it Decimal literals. I thought I could solve this by casting. While this is no longer relevant, I still think the change is valid.

Contributor

inputTypes is used to check input types; however, HyperLogLogPlusPlus only has one input, see https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/HyperLogLogPlusPlus.scala#L132.

So we don't need to give two type constraints here.

Contributor Author

Yeah you are right. I am fixing it now.

@SparkQA

SparkQA commented Jan 15, 2016

Test build #49470 has finished for PR 10745 at commit 7e31ee8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 15, 2016

Test build #49483 has finished for PR 10745 at commit 5aa780f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

LGTM

@rxin
Contributor

rxin commented Jan 15, 2016

Thanks - going to merge this.
