# [SPARK-17364][SQL] Antlr lexer wrongly treats fully qualified identifier as a decimal number token when parsing SQL string #15006
## Conversation
(force-pushed from 29c1436 to 6b4ad92)
Cool. I was looking into lexer modes for solving this. I'll take a look at this in the morning.
Only this rule causes an issue, right?
SCIENTIFIC_DECIMAL_VALUE, DOUBLE_LITERAL, and BIGDECIMAL_LITERAL may also cause this issue.
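The reason several literal rules are affected is ANTLR's maximal-munch matching: the lexer always takes the longest token, so `.123` beats a lone `.`. Below is a toy Scala sketch of that behavior under the pre-fix rules (an illustrative model only, not ANTLR itself; the rule names and simplified regexes are assumptions for the demo):

```
// A toy maximal-munch lexer: at each position, take the longest match,
// with earlier rules winning ties (as ANTLR does).
val rules = Seq(
  "IDENTIFIER"    -> "[A-Za-z_][A-Za-z0-9_]*".r,
  "DECIMAL_VALUE" -> "\\.[0-9]+".r,
  "DOT"           -> "\\.".r
)

def tokenize(s: String): Seq[(String, String)] = {
  val out = scala.collection.mutable.ArrayBuffer.empty[(String, String)]
  var i = 0
  while (i < s.length) {
    val (name, lexeme) = rules
      .flatMap { case (n, r) => r.findPrefixOf(s.substring(i)).map(m => n -> m) }
      .maxBy(_._2.length)                 // maximal munch: longest match wins
    out += ((name, lexeme))
    i += lexeme.length
  }
  out.toSeq
}

// tokenize("default.123_table") yields:
//   (IDENTIFIER, "default"), (DECIMAL_VALUE, ".123"), (IDENTIFIER, "_table")
```

Because `.123` is longer than `.`, the decimal rule wins; the same longest-match pressure applies to the scientific, double, and BigDecimal rules whenever the text after the dot happens to look like one of those literals.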
Test build #65063 has finished for PR 15006 at commit
(force-pushed from 6b4ad92 to 793edce)
Test build #65079 has finished for PR 15006 at commit
Just FYI, the identifier must begin with
@gatorsmile FYI - Hive seems to allow identifiers to start with a number.
Also tried it in Hive.
More tries in Hive
More tries in Hive
If users really need names starting with a digit, could they use quoted identifiers? If we are more flexible than Hive with identifiers, we also need to detect and block such names when creating a table through the Hive metastore.
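For reference, the quoted-identifier workaround looks like this (a minimal sketch; backquotes are the quoting style implied by the BACKQUOTED_IDENTIFIER token in the parse errors quoted in the description below):

```
// Backquoting removes the ambiguity entirely: `123_table` is always an identifier.
spark.sql("select * from default.`123_table`")
```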
We are not "more flexible than Hive"; we are as flexible as Hive. With this PR, we only treat a string as an IDENTIFIER when there is no ambiguity, i.e. when it cannot be interpreted as a number token. For example, after this fix:
Besides, Spark 1.6 and Spark 2.0 support syntax like
uh... We use the Hive APIs. Basically, what we did is like passing a quoted identifier to Hive. Thus, it is OK to be more flexible. You are also right. I did not run Spark SQL after your fix. : )
This is a design decision. Either is fine to me. : )
@clockfly I also spent some time looking into this :-) I initially tried to handle this at the lexer level and found it difficult to distinguish between number literals and table names. In particular, SCIENTIFIC_DECIMAL looked troublesome for inputs like 1234e10. So I tried to solve this at the parser rule level, where we have more context to disambiguate between literals and identifiers. I took the approach of accepting these literals as table name identifiers and then handling or rejecting them in a post hook, much like how quoted identifiers are handled today. I have to admit it's a little complex, but it handles most of the cases (based on my current testing).
@dilipbiswal Thanks for this. Yes, I also tried to do this at the parser level, but found it would require us to change the visitor code, which is not very clean. The current lexer rule can handle cases like
@clockfly Thank you!! One question: by visitor code, do you mean the visitTableIdentifier code? If so, I didn't make any change there. I just added a post hook in ParseDriver - FYI. Also, I was looking at my notes on problematic cases and why I thought of handling the issue there. In this case 1.23 is a double value and X is an alias name, even though there is no space between them. The above results in a logical plan like the following. Not sure if we allowed this intentionally :-) or it's a defect.
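A hedged reconstruction of the case being described (the exact query was elided above; `1.23x` is an assumed stand-in):

```
// Assumed illustration: the lexer emits DECIMAL_VALUE "1.23" followed by
// IDENTIFIER "x", so the parser reads the whole expression as "1.23 AS x".
spark.sql("select 1.23x")
// Per the comment above, this yields a plan projecting the literal 1.23
// under the alias x, rather than a parse error.
```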
Test build #65082 has finished for PR 15006 at commit
What do you think about the case @dilipbiswal posted? Currently, there is a semantic mismatch between Spark and Postgres. I think our current way of handling
```
public boolean isValidDecimal() {
  int nextChar = _input.LA(1);
  if (nextChar >= 'A' && nextChar <= 'Z' || nextChar >= '0' && nextChar <= '9' ||
```
You are basically checking the IDENTIFIER rule here by hand.
You could also write:
```
return !(nextChar >= 'A' && nextChar <= 'Z' ||
         nextChar >= '0' && nextChar <= '9' ||
         nextChar == '_');
```
What would
Left two minor comments. Otherwise LGTM.
These issues are quite hairy, and it is clear that there is no consensus among other systems. I think we should try to maintain backwards compatibility with Spark 1.6.
Thank you @clockfly @hvanhovell @gatorsmile
Merging to master/2.0. Thanks!
[SPARK-17364][SQL] Antlr lexer wrongly treats fully qualified identifier as a decimal number token when parsing SQL string
## What changes were proposed in this pull request?
The Antlr lexer we use to tokenize a SQL string may wrongly tokenize a fully qualified identifier as a decimal number token. For example, table identifier `default.123_table` is wrongly tokenized as
```
default // Matches lexer rule IDENTIFIER
.123 // Matches lexer rule DECIMAL_VALUE
_TABLE // Matches lexer rule IDENTIFIER
```
The correct tokenization for `default.123_table` should be:
```
default // Matches lexer rule IDENTIFIER
. // Matches a single dot
123_TABLE // Matches lexer rule IDENTIFIER
```
This PR fixes the Antlr grammar so that it can tokenize fully qualified identifiers correctly (see the sketch after the list below):
1. Fully qualified table name can be parsed correctly. For example, `select * from database.123_suffix`.
2. Fully qualified column name can be parsed correctly, for example `select a.123_suffix from a`.
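A minimal sketch of the idea behind the fix (assumption: this is a Scala rendering of the `isValidDecimal()` helper shown in the review diff above, which checks only 'A'..'Z', digits, and '_'; the real predicate lives in the grammar file):

```
// A decimal token is only accepted when the character right after it could
// not continue an identifier; otherwise the lexer falls back to matching a
// plain dot followed by an identifier.
def isValidDecimal(nextChar: Int): Boolean =
  !(nextChar >= 'A' && nextChar <= 'Z' ||
    nextChar >= '0' && nextChar <= '9' ||
    nextChar == '_')

assert(!isValidDecimal('_'))  // ".123_table": '_' follows, so ".123" is rejected
assert(isValidDecimal(' '))   // ".123 ": a real decimal literal, accepted
```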
### Before change
#### Case 1: Failed to parse fully qualified column name
```
scala> spark.sql("select a.123_column from a").show
org.apache.spark.sql.catalyst.parser.ParseException:
extraneous input '.123' expecting {<EOF>,
...
, IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 8)
== SQL ==
select a.123_column from a
--------^^^
```
#### Case 2: Failed to parse fully qualified table name
```
scala> spark.sql("select * from default.123_table")
org.apache.spark.sql.catalyst.parser.ParseException:
extraneous input '.123' expecting {<EOF>,
...
IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 21)
== SQL ==
select * from default.123_table
---------------------^^^
```
### After change
#### Case 1: fully qualified column name, no ParseException thrown
```
scala> spark.sql("select a.123_column from a").show
```
#### Case 2: fully qualified table name, no ParseException thrown
```
scala> spark.sql("select * from default.123_table")
```
## How was this patch tested?
Unit test.
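A hypothetical shape of such a test (the parser entry point is real, but the exact suite and assertions here are assumptions, not the PR's actual test):

```
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser

// A digit-leading table part should now round-trip through the parser.
val ident = CatalystSqlParser.parseTableIdentifier("default.123_table")
assert(ident.database.contains("default"))
assert(ident.table == "123_table")
```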
Author: Sean Zhong <seanzhong@databricks.com>
Closes #15006 from clockfly/SPARK-17364.
(cherry picked from commit a6b8182)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>