Conversation

@clockfly
Contributor

@clockfly clockfly commented Sep 7, 2016

What changes were proposed in this pull request?

The Antlr lexer we use to tokenize a SQL string may wrongly tokenize a fully qualified identifier as a decimal number token. For example, the table identifier `default.123_table` is wrongly tokenized as:

```
default // Matches lexer rule IDENTIFIER
.123    // Matches lexer rule DECIMAL_VALUE
_TABLE  // Matches lexer rule IDENTIFIER
```

The correct tokenization for `default.123_table` should be:

```
default   // Matches lexer rule IDENTIFIER
.         // Matches a single dot
123_TABLE // Matches lexer rule IDENTIFIER
```

This PR fixes the Antlr grammar so that it can tokenize fully qualified identifiers correctly (see the sketch after this list):

  1. A fully qualified table name can be parsed correctly. For example, `select * from database.123_suffix`.
  2. A fully qualified column name can be parsed correctly, for example `select a.123_suffix from a`.
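As a minimal sketch of the approach (not the exact merged grammar; `DIGIT` and `isValidDecimal()` are assumed to match the names already used in SqlBase.g4), the fractional-literal lexer rule is guarded by a semantic predicate:

```antlr
// Sketch only: the semantic predicate {isValidDecimal()}? rejects the
// DECIMAL_VALUE match when the character right after the literal could
// extend an identifier (letter, digit, or underscore). With the guard,
// '.123_table' no longer matches DECIMAL_VALUE, so the lexer falls back
// to emitting '.' followed by the IDENTIFIER token '123_TABLE'.
DECIMAL_VALUE
    : DIGIT+ '.' DIGIT* {isValidDecimal()}?
    | '.' DIGIT+ {isValidDecimal()}?
    ;
```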

Before change

Case 1: Failed to parse fully qualified column name

scala> spark.sql("select a.123_column from a").show
org.apache.spark.sql.catalyst.parser.ParseException:
extraneous input '.123' expecting {<EOF>, 
...
, IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 8)
== SQL ==
select a.123_column from a
--------^^^

Case 2: Failed to parse fully qualified table name

scala> spark.sql("select * from default.123_table")
org.apache.spark.sql.catalyst.parser.ParseException:
extraneous input '.123' expecting {<EOF>, 
...
IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 21)

== SQL ==
select * from default.123_table
---------------------^^^

After change

Case 1: fully qualified column name, no ParseException thrown

scala> spark.sql("select a.123_column from a").show

Case 2: fully qualified table name, no ParseException thrown

scala> spark.sql("select * from default.123_table")

How was this patch tested?

Unit test.

@hvanhovell
Contributor

Cool. I was looking into lexer modes for solving this. I'll take a look at this in the morning.

Contributor

Only this rule causes an issue, right?

Contributor Author

SCIENTIFIC_DECIMAL_VALUE, DOUBLE_LITERAL, and BIGDECIMAL_LITERAL may also cause this issue.
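For illustration, a hedged sketch of how the same predicate guard would apply to the scientific-notation rule (assuming `DIGIT`/`EXPONENT` fragments and a rule shape similar to the one in SqlBase.g4; the exact definition may differ):

```antlr
// Without the guard, the '.123e4' in a hypothetical qualified name such as
// db.123e4_col could match SCIENTIFIC_DECIMAL_VALUE instead of dot + IDENTIFIER.
SCIENTIFIC_DECIMAL_VALUE
    : DIGIT+ ('.' DIGIT*)? EXPONENT {isValidDecimal()}?
    | '.' DIGIT+ EXPONENT {isValidDecimal()}?
    ;
```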

@SparkQA

SparkQA commented Sep 8, 2016

Test build #65063 has finished for PR 15006 at commit 6b4ad92.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@clockfly clockfly changed the title [SPARK-17364][SQL][WIP][Don't MERGE][POC] Allow number to be used in identifier [SPARK-17364][SQL] Antlr lexer wrongly treats identifier token as a decimal token when parsing SQL string Sep 8, 2016
@clockfly clockfly changed the title [SPARK-17364][SQL] Antlr lexer wrongly treats identifier token as a decimal token when parsing SQL string [SPARK-17364][SQL] Antlr lexer wrongly treats full qualified identifier token as a decimal token when parsing SQL string Sep 8, 2016
@clockfly clockfly changed the title [SPARK-17364][SQL] Antlr lexer wrongly treats full qualified identifier token as a decimal token when parsing SQL string [SPARK-17364][SQL] Antlr lexer wrongly treats full qualified identifier as a decimal token when parsing SQL string Sep 8, 2016
@clockfly clockfly changed the title [SPARK-17364][SQL] Antlr lexer wrongly treats full qualified identifier as a decimal token when parsing SQL string [SPARK-17364][SQL] Antlr lexer wrongly treats full qualified identifier as a decimal number token when parsing SQL string Sep 8, 2016
@SparkQA

SparkQA commented Sep 8, 2016

Test build #65079 has finished for PR 15006 at commit 793edce.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

Just FYI, the identifier must begin with:

| Platform | Specification |
| --- | --- |
| SQL:2003 Standard | A letter |
| MySQL | A letter |
| Oracle | A letter |
| DB2 | A letter |
| PostgreSQL | A letter or underscore (`_`) |
| SQL Server | A letter, underscore (`_`), at sign (`@`), or pound sign (`#`) |

@dilipbiswal
Contributor

@gatorsmile FYI - Hive seems to allow identifiers to start with a number.

@gatorsmile
Member

Also tried it in Hive.

```
hive> create table 3r (col1 int);
OK
Time taken: 1.293 seconds
hive> create database 4d;
OK
Time taken: 0.041 seconds
```

@gatorsmile
Member

More tries in Hive

```
hive> create table 3M (col3 int);
NoViableAltException(8@[192:1: tableName : (db= identifier DOT tab= identifier -> ^( TOK_TABNAME $db $tab) |tab= identifier -> ^( TOK_TABNAME $tab) );])

hive> create table 3L (col3 int);
NoViableAltException(7@[192:1: tableName : (db= identifier DOT tab= identifier -> ^( TOK_TABNAME $db $tab) |tab= identifier -> ^( TOK_TABNAME $tab) );])

hive> create table 3 (col3 int);
NoViableAltException(302@[192:1: tableName : (db= identifier DOT tab= identifier -> ^( TOK_TABNAME $db $tab) |tab= identifier -> ^( TOK_TABNAME $tab) );])
```

@gatorsmile
Member

More tries in Hive

```
hive> create table `3M` (col3 int);
OK
Time taken: 0.06 seconds
hive> create table `3L` (col3 int);
OK
Time taken: 0.062 seconds
hive> create table `3` (col3 int);
OK
Time taken: 0.041 seconds
```

@gatorsmile
Member

gatorsmile commented Sep 8, 2016

If users really need names that start with a digit, could they use quoted identifiers?

If we are more flexible than Hive in identifiers, we also need to detect and block such names when creating a table through the Hive metastore.

@clockfly
Contributor Author

clockfly commented Sep 8, 2016

@gatorsmile

We are not "more flexible than Hive"; we are as flexible as Hive. With this PR, we only treat a string as an IDENTIFIER when there is no ambiguity, i.e. when it cannot also be interpreted as a number token.

For example, after this fix:

scala> spark.sql("create table 3r") // allowed, 3r is not ambiguous
scala> spark.sql("create table 3M") // not allowed, as 3M is ambiguous, it can also mean BYTELENGTH_LITERAL.
scala> spark.sql("create table 3L") // not allowed, as 3L is ambiguous, it can also mean BIGINT_LITERAL.

@clockfly
Contributor Author

clockfly commented Sep 8, 2016

Besides, Spark 1.6 and Spark 2.0 support syntax like `select * from 123_LIST`, so maybe we should stay compatible and support similar syntax like `select * from default.123_LIST`.

@gatorsmile
Member

uh... We use the Hive APIs. Basically, what we do is equivalent to passing a quoted identifier to Hive. Thus, it is OK to be more flexible.

You are also right. I did not run Spark SQL after your fix. : )

@gatorsmile
Member

This is a design decision. Either is fine to me. : )

@dilipbiswal
Contributor

@clockfly I also spent some time looking into this :-) I initially tried to handle this at the lexer level and found it difficult to distinguish between number literals and table names. In particular, SCIENTIFIC_DECIMAL looked troubling, e.g. `1234e10`. So I tried to solve this at the parser rule level, where we have more context to disambiguate between literals and identifiers. I tried the approach of accepting these literals as table name identifiers and then handling or rejecting them in a post hook, much like how quoted identifiers are handled today. I have to admit it's a little complex, but it handles most of the cases (based on my current testing).

@clockfly
Contributor Author

clockfly commented Sep 8, 2016

@dilipbiswal Thanks for this. Yes, I also tried to do this at the parser level, but found it would require us to change the visitor code, which is not very clean. The current lexer-rule semantic predicate solution is the cleanest way I found to solve this case.

The current lexer rule can handle cases like `.1234e10`.

@dilipbiswal
Contributor

dilipbiswal commented Sep 8, 2016

@clockfly Thank you!! One question: by visitor code, do you mean the visitTableIdentifier code? If so, I didn't make any change there. I just added a post hook in ParseDriver - FYI.

Also, I was looking at my notes on problematic cases and why I thought of handling the issue at the parser level. I saw the following syntax work on both Hive and Spark:

```
select 1.23X
```

In this case `1.23` is a double value and `X` is an alias name, even though there is no space between them. The above results in a logical plan like the following:

```
== Parsed Logical Plan ==
Project [1.23 AS x#0]
+- OneRowRelation$

== Analyzed Logical Plan ==
x: decimal(3,2)
Project [1.23 AS x#0]
+- OneRowRelation$
```

Not sure if we allowed this intentionally :-) or if it's a defect.

@SparkQA

SparkQA commented Sep 8, 2016

Test build #65082 has finished for PR 15006 at commit 478ef69.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@clockfly
Contributor Author

clockfly commented Sep 9, 2016

@hvanhovell

What do you think of the case @dilipbiswal posted?

Currently, there is a semantic mismatch between Spark and Postgres.
In Spark 1.6/2.0, `SELECT 123X` treats `123X` as a single IDENTIFIER. In Postgres, `SELECT 123X` is interpreted as `SELECT 123 AS X`.

I think our current way of handling `123X` as an IDENTIFIER makes more sense.

```java
 */
public boolean isValidDecimal() {
  int nextChar = _input.LA(1);
  if (nextChar >= 'A' && nextChar <= 'Z' || nextChar >= '0' && nextChar <= '9' ||
```
Contributor

@hvanhovell hvanhovell Sep 14, 2016

You are basically checking the IDENTIFIER rule here by hand.

You could also write:

```java
return !(nextChar >= 'A' && nextChar <= 'Z' ||
         nextChar >= '0' && nextChar <= '9' ||
         nextChar == '_');
```
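For reference, a sketch of what the full predicate could look like in the grammar's `@members` block, following that suggestion (a sketch assuming the grammar exposes `_input` as in the snippet above; the merged code may differ in detail):

```antlr
@members {
  /**
   * Decides whether a just-matched numeric literal really is a literal:
   * if the next character could extend an identifier (uppercase letter,
   * digit, or underscore), the match is rejected so the input can be
   * tokenized as a dot followed by an IDENTIFIER instead.
   */
  public boolean isValidDecimal() {
    int nextChar = _input.LA(1);
    return !(nextChar >= 'A' && nextChar <= 'Z' ||
             nextChar >= '0' && nextChar <= '9' ||
             nextChar == '_');
  }
}
```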

@hvanhovell
Contributor

What would `tbl.123` produce?

@hvanhovell
Contributor

Left two minor comments. Otherwise LGTM.

@hvanhovell
Contributor

These issues are quite hairy, and it is clear that there is no consensus among other systems. I think we should try to maintain backwards compatibility with Spark 1.6.

@dilipbiswal
Contributor

Thank you @clockfly @hvanhovell @gatorsmile

@hvanhovell
Contributor

hvanhovell commented Sep 15, 2016

Merging to master/2.0. Thanks!

@asfgit asfgit closed this in a6b8182 Sep 15, 2016
asfgit pushed a commit that referenced this pull request Sep 15, 2016
…er as a decimal number token when parsing SQL string

Author: Sean Zhong <seanzhong@databricks.com>

Closes #15006 from clockfly/SPARK-17364.

(cherry picked from commit a6b8182)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
wgtmac pushed a commit to wgtmac/spark that referenced this pull request Sep 19, 2016
…er as a decimal number token when parsing SQL string

Author: Sean Zhong <seanzhong@databricks.com>

Closes apache#15006 from clockfly/SPARK-17364.
