[SPARK-20399][SQL] Can't use same regex pattern between 1.6 and 2.x due to unescaped sql string in parser #17736
Conversation
Let's see if it breaks any existing tests.
Test build #76091 has finished for PR 17736 at commit
LGTM. Thanks. @cloud-fan @rxin This fixes our production jobs when we port our applications from 1.6 to 2.0. I think it's an important bug fix.
Test build #76094 has finished for PR 17736 at commit
cc @hvanhovell for review ...
It seems all string literals in the Spark 2.0 parser behave differently from Spark 1.6?
Is it? Is there any significant difference? I don't remember any necessary migration from 1.6 to 2.0 for string literals.
Isn't the regex parsed as a string literal?
It is. But there is no problem for normal string literals. It causes a problem only when the string literal is used as a regex pattern string.
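A minimal sketch of that difference (my own illustrative snippet; df is a DataFrame with a string column value, as in the reproduction code later in this thread):
// A plain literal contains no backslashes, so 1.6 and 2.x parse it identically:
df.filter("value = 'abc'")        // same result on both versions
// A regex literal does contain backslashes; the 2.x parser unescapes '\x20'
// to 'x20' before RLIKE ever sees it, so the pattern stops matching a space:
df.filter("value rlike '\\x20'")  // matches a space in 1.6, but not in 2.x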
ping @cloud-fan @hvanhovell
what does
@cloud-fan Do you mean Spark 2.x: Spark 1.6:
Shall we fix this inconsistency too?
I don't know why we have
@cloud-fan Although @hvanhovell hasn't commented yet, I will go ahead and fix the inconsistency first and see if we have tests defined for it.
As I am trying to fix the inconsistency, one issue I found is
Another inconsistency. Spark 2.0: Spark 1.6:
@cloud-fan For the string literal inconsistency, after thinking and experimenting, I think Spark 2.0's approach is more reasonable. If we follow 1.6, we can't support something like
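For example (a hedged illustration of the kind of literal only 2.0-style unescaping can express; not the exact snippet from the original comment):
// With 2.0's unescaping, '\t' in the SQL text becomes a real tab character.
// With 1.6's verbatim parsing, the literal would keep the two characters
// backslash + t, and there would be no way to embed an actual tab at all.
spark.sql("SELECT 'a\\tb' AS s").show()  // 2.0 parser: a<TAB>b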
Hi @viirya, thanks for your example! I have one suggestion: can we use
@cloud-fan I've updated the example. Please check if it is better for you. Thanks.
Yea, much clearer now, and the string literal in Spark 2.0 looks more reasonable. For the regex, I think it's unfair to compare. Think about a SQL shell: users can write
For the regex, currently users need to write something like
Like I said before, it's because the Java string literal plays a role here; try to use
Migration is a real problem, but it's also a problem for string literals. We can add a config to fall back to the old SQL parser behavior.
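To make the two contexts concrete, here is a sketch of the escaping layers at play (the pattern fragment is the one from this PR):
// In a SQL shell there is only one unescaping layer (the 2.x SQL parser),
// so a user writes two backslashes to get one into the regex:
//   SELECT * FROM t WHERE value RLIKE '\\x20'
// In Scala-embedded SQL there are two layers: the Scala compiler strips one
// level and the 2.x SQL parser strips another, so four backslashes in the
// source survive as the single backslash the regex engine needs:
df.filter("value rlike '\\\\x20'")  // Scala: \\x20 -> SQL parser: \x20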
A config seems good to me. We can solve the migration issue without affecting Spark 2.0 behavior.
Yea, I think so. Before we create a new PR for the config, let's try to get some feedback from @hvanhovell
@hvanhovell What do you think about adding a config to fall back to string literal parsing consistent with the old SQL parser behavior (i.e., don't unescape the input string)?
For some reference: in 1.6 we used the Catalyst SqlParser to parse the expression. Java/Scala normal strings make things mind-meltingly confusing. I think it is fair that we provide an option to disable the parser's unescaping as a way to get out of this. This might not be the best solution if you use regexes in both pure SQL and in Scala at the same time, but it is at least an improvement.
@hvanhovell Thanks for the comment. It sounds reasonable to me. And the config can help with the migration issue. @cloud-fan I am going to add the config; is a new PR better, or should I just update this one?
let's create a new PR |
…nsistent with old sql parser behavior
## What changes were proposed in this pull request?
The new SQL parser was introduced in Spark 2.0. All string literals are unescaped in the parser, which seems to bring an issue with regex pattern strings.
The following code reproduces it:
val data = Seq("\u0020\u0021\u0023", "abc")
val df = data.toDF()
// 1st usage: works in 1.6
// Let parser parse pattern string
val rlike1 = df.filter("value rlike '^\\x20[\\x20-\\x23]+$'")
// 2nd usage: works in 1.6, 2.x
// Call Column.rlike so the pattern string is a literal which doesn't go through parser
val rlike2 = df.filter($"value".rlike("^\\x20[\\x20-\\x23]+$"))
// In 2.x, we need to add extra backslashes for the regex pattern to parse correctly
val rlike3 = df.filter("value rlike '^\\\\x20[\\\\x20-\\\\x23]+$'")
Following the discussion in #17736, this patch adds a config to fall back to 1.6 string literal parsing and mitigate the migration issue.
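A sketch of how the fallback is used once the config is in place (the name shown, spark.sql.parser.escapedStringLiterals, is the config that shipped in later Spark releases for this purpose):
// Enable the 1.6-style fallback for the current session:
spark.conf.set("spark.sql.parser.escapedStringLiterals", "true")
// String literals are now kept verbatim, so the 1.6-era pattern works again:
val rlike1Again = df.filter("value rlike '^\\x20[\\x20-\\x23]+$'")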
## How was this patch tested?
Jenkins tests.
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #17887 from viirya/add-config-fallback-string-parsing.
(cherry picked from commit 609ba5f)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
The new SQL parser was introduced in Spark 2.0, and it seems to bring an issue with regex pattern strings.
The following code reproduces it:
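val data = Seq("\u0020\u0021\u0023", "abc")
val df = data.toDF()
// 1st usage: works in 1.6
// Let parser parse pattern string
val rlike1 = df.filter("value rlike '^\\x20[\\x20-\\x23]+$'")
// 2nd usage: works in 1.6, 2.x
// Call Column.rlike so the pattern string is a literal which doesn't go through parser
val rlike2 = df.filter($"value".rlike("^\\x20[\\x20-\\x23]+$"))
// In 2.x, we need to add extra backslashes for the regex pattern to parse correctly
val rlike3 = df.filter("value rlike '^\\\\x20[\\\\x20-\\\\x23]+$'")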
Because the parser unescapes SQL strings, the first usage, which works in 1.6, does not work in 2.0. To make it work, we need to add extra backslashes.
It is quite weird that we can't use the same regex pattern string in the two usages. We should not unescape regex pattern strings.
How was this patch tested?
Jenkins tests.
Please review http://spark.apache.org/contributing.html before opening a pull request.