# [SPARK-17364][SQL] Antlr lexer wrongly treats fully qualified identifier as a decimal number token when parsing SQL string #15006
## Conversation
(force-pushed from 29c1436 to 6b4ad92)
Cool. I was looking into lexer modes for solving this. I'll take a look at this in the morning.
Only this rule causes an issue, right?
SCIENTIFIC_DECIMAL_VALUE, DOUBLE_LITERAL, and BIGDECIMAL_LITERAL may also cause this issue.
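The reason several literal rules are affected is ANTLR's maximal-munch matching: the lexer always takes the longest token, so `.123` beats a lone `.`. Below is a toy Scala sketch of that behavior under the pre-fix rules (an illustrative model only, not ANTLR itself; the rule names and simplified regexes are assumptions for the demo):

```
// A toy maximal-munch lexer: at each position, take the longest match,
// with earlier rules winning ties (as ANTLR does).
val rules = Seq(
  "IDENTIFIER"    -> "[A-Za-z_][A-Za-z0-9_]*".r,
  "DECIMAL_VALUE" -> "\\.[0-9]+".r,
  "DOT"           -> "\\.".r
)

def tokenize(s: String): Seq[(String, String)] = {
  val out = scala.collection.mutable.ArrayBuffer.empty[(String, String)]
  var i = 0
  while (i < s.length) {
    val (name, lexeme) = rules
      .flatMap { case (n, r) => r.findPrefixOf(s.substring(i)).map(m => n -> m) }
      .maxBy(_._2.length)                 // maximal munch: longest match wins
    out += ((name, lexeme))
    i += lexeme.length
  }
  out.toSeq
}

// tokenize("default.123_table") yields:
//   (IDENTIFIER, "default"), (DECIMAL_VALUE, ".123"), (IDENTIFIER, "_table")
```

Because `.123` is longer than `.`, the decimal rule wins; the same longest-match pressure applies to the scientific, double, and BigDecimal rules whenever the text after the dot happens to look like one of those literals.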
Test build #65063 has finished for PR 15006 at commit
(force-pushed from 6b4ad92 to 793edce)
Test build #65079 has finished for PR 15006 at commit
Just FYI, the identifier must begin with
@gatorsmile FYI - Hive seems to allow identifiers to start with a number.
Also tried it in Hive.
More tries in Hive
More tries in Hive
If users really need names starting with a digit, could they use quoted identifiers? If we are more flexible than Hive with identifiers, we also need to detect and block such names when creating a table through the Hive metastore.
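For reference, the quoted-identifier workaround looks like this (a minimal sketch; backquotes are the quoting style implied by the BACKQUOTED_IDENTIFIER token in the parse errors quoted in the description below):

```
// Backquoting removes the ambiguity entirely: `123_table` is always an identifier.
spark.sql("select * from default.`123_table`")
```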
We are not "more flexible than Hive"; we are as flexible as Hive. With this PR, we only treat a string as an IDENTIFIER when there is no ambiguity, i.e. when it cannot be interpreted as a number token. For example, after this fix:
Besides, Spark 1.6 and Spark 2.0 support syntax like
uh... We use the Hive APIs. Basically, what we did is like passing a quoted identifier to Hive. Thus, it is OK to be more flexible. You are also right. I did not run Spark SQL after your fix. : )
This is a design decision. Either is fine to me. : )
@clockfly I also spent some time looking into this :-) I initially tried to handle this at the lexer level and found it difficult to distinguish between number literals and table names. In particular, SCIENTIFIC_DECIMAL looked troublesome for inputs like 1234e10. So I tried to solve this at the parser rule level, where we have more context to disambiguate between literals and identifiers. I took the approach of accepting these literals as table name identifiers and then handling or rejecting them in a post hook, much like how quoted identifiers are handled today. I have to admit it's a little complex, but it handles most of the cases (based on my current testing).
@dilipbiswal Thanks for this. Yes, I also tried to do this at the parser level, but found it would require us to change the visitor code, which is not very clean. The current lexer rule can handle cases like
@clockfly Thank you!! One question: by visitor code, do you mean the visitTableIdentifier code? If so, I didn't make any change there. I just added a post hook in ParseDriver - FYI. Also, I was looking at my notes on problematic cases and why I thought of handling the issue there. In this case 1.23 is a double value and X is an alias name, even though there is no space between them. The above results in a logical plan like the following. Not sure if we allowed this intentionally :-) or it's a defect.
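A hedged reconstruction of the case being described (the exact query was elided above; `1.23x` is an assumed stand-in):

```
// Assumed illustration: the lexer emits DECIMAL_VALUE "1.23" followed by
// IDENTIFIER "x", so the parser reads the whole expression as "1.23 AS x".
spark.sql("select 1.23x")
// Per the comment above, this yields a plan projecting the literal 1.23
// under the alias x, rather than a parse error.
```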
Test build #65082 has finished for PR 15006 at commit
What do you think about the case @dilipbiswal posted? Currently, there is a semantic mismatch between Spark and Postgres. I think our current way of handling
```
public boolean isValidDecimal() {
  int nextChar = _input.LA(1);
  if (nextChar >= 'A' && nextChar <= 'Z' || nextChar >= '0' && nextChar <= '9' ||
```
You are basically checking the IDENTIFIER rule here by hand.
You could also write:
```
return !(nextChar >= 'A' && nextChar <= 'Z' ||
         nextChar >= '0' && nextChar <= '9' ||
         nextChar == '_');
```
What would
Left two minor comments. Otherwise LGTM.
These issues are quite hairy, and it is clear that there is no consensus among other systems. I think we should try to maintain backwards compatibility with Spark 1.6.
Thank you @clockfly @hvanhovell @gatorsmile
Merging to master/2.0. Thanks!
[SPARK-17364][SQL] Antlr lexer wrongly treats fully qualified identifier as a decimal number token when parsing SQL string
## What changes were proposed in this pull request?
The Antlr lexer we use to tokenize a SQL string may wrongly tokenize a fully qualified identifier as a decimal number token. For example, table identifier `default.123_table` is wrongly tokenized as
```
default // Matches lexer rule IDENTIFIER
.123 // Matches lexer rule DECIMAL_VALUE
_TABLE // Matches lexer rule IDENTIFIER
```
The correct tokenization for `default.123_table` should be:
```
default // Matches lexer rule IDENTIFIER
. // Matches a single dot
123_TABLE // Matches lexer rule IDENTIFIER
```
This PR fixes the Antlr grammar so that it can tokenize fully qualified identifiers correctly (see the sketch after the list below):
1. Fully qualified table name can be parsed correctly. For example, `select * from database.123_suffix`.
2. Fully qualified column name can be parsed correctly, for example `select a.123_suffix from a`.
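A minimal sketch of the idea behind the fix (assumption: this is a Scala rendering of the `isValidDecimal()` helper shown in the review diff above, which checks only 'A'..'Z', digits, and '_'; the real predicate lives in the grammar file):

```
// A decimal token is only accepted when the character right after it could
// not continue an identifier; otherwise the lexer falls back to matching a
// plain dot followed by an identifier.
def isValidDecimal(nextChar: Int): Boolean =
  !(nextChar >= 'A' && nextChar <= 'Z' ||
    nextChar >= '0' && nextChar <= '9' ||
    nextChar == '_')

assert(!isValidDecimal('_'))  // ".123_table": '_' follows, so ".123" is rejected
assert(isValidDecimal(' '))   // ".123 ": a real decimal literal, accepted
```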
### Before change
#### Case 1: Failed to parse fully qualified column name
```
scala> spark.sql("select a.123_column from a").show
org.apache.spark.sql.catalyst.parser.ParseException:
extraneous input '.123' expecting {<EOF>,
...
, IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 8)
== SQL ==
select a.123_column from a
--------^^^
```
#### Case 2: Failed to parse fully qualified table name
```
scala> spark.sql("select * from default.123_table")
org.apache.spark.sql.catalyst.parser.ParseException:
extraneous input '.123' expecting {<EOF>,
...
IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 21)
== SQL ==
select * from default.123_table
---------------------^^^
```
### After change
#### Case 1: fully qualified column name, no ParseException thrown
```
scala> spark.sql("select a.123_column from a").show
```
#### Case 2: fully qualified table name, no ParseException thrown
```
scala> spark.sql("select * from default.123_table")
```
## How was this patch tested?
Unit test.
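A hypothetical shape of such a test (the parser entry point is real, but the exact suite and assertions here are assumptions, not the PR's actual test):

```
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser

// A digit-leading table part should now round-trip through the parser.
val ident = CatalystSqlParser.parseTableIdentifier("default.123_table")
assert(ident.database.contains("default"))
assert(ident.table == "123_table")
```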
Author: Sean Zhong <seanzhong@databricks.com>
Closes #15006 from clockfly/SPARK-17364.
(cherry picked from commit a6b8182)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>