
Conversation

@viirya
Member

@viirya viirya commented Apr 24, 2017

What changes were proposed in this pull request?

The new SQL parser was introduced in Spark 2.0. It seems to bring an issue regarding regex pattern strings.

The following code reproduces it:

import spark.implicits._  // assumes a SparkSession named `spark` is in scope

val data = Seq("\u0020\u0021\u0023", "abc")
val df = data.toDF()

// 1st usage: works in 1.6
// Let the parser parse the pattern string
val rlike1 = df.filter("value rlike '^\\x20[\\x20-\\x23]+$'")
// 2nd usage: works in 1.6 and 2.x
// Column.rlike takes the pattern as a literal, so it doesn't go through the parser
val rlike2 = df.filter($"value".rlike("^\\x20[\\x20-\\x23]+$"))

// In 2.x, we need to add extra backslashes for the regex pattern to be parsed correctly
val rlike3 = df.filter("value rlike '^\\\\x20[\\\\x20-\\\\x23]+$'")

Because the parser unescapes SQL strings, the first usage, which works in 1.6, no longer works in 2.0. To make it work, we need to add additional backslashes.

It is odd that the same regex pattern string can't be used in both usages. We should not unescape regex pattern strings.
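
The escaping layers can be seen without Spark at all. A minimal pure-JVM sketch (names and values here are illustrative): the Scala literal "\\x20" is the regex \x20 at runtime and matches a space, but once the 2.x SQL parser drops the backslash before the unrecognized x, the pattern stops matching.

import java.util.regex.Pattern

// The Scala literal below is the regex ^\x20[\x20-\x23]+$ at runtime
val regex = "^\\x20[\\x20-\\x23]+$"
Pattern.matches(regex, "\u0020\u0021\u0023")              // true
// After 2.x unescaping the pattern becomes ^x20[x20-x23]+$, which no longer matches
Pattern.matches("^x20[x20-x23]+$", "\u0020\u0021\u0023")  // false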

How was this patch tested?

Jenkins tests.

Please review http://spark.apache.org/contributing.html before opening a pull request.

@viirya
Member Author

viirya commented Apr 24, 2017

cc @dbtsai @hvanhovell

@viirya
Member Author

viirya commented Apr 24, 2017

Let's see if it breaks any existing tests.

@SparkQA

SparkQA commented Apr 24, 2017

Test build #76091 has finished for PR 17736 at commit a0f4a13.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dbtsai
Member

dbtsai commented Apr 24, 2017

LGTM. Thanks. @cloud-fan @rxin this fixes our production jobs when we port our applications from 1.6 to 2.0. I think it's an important bug fix. Thanks.

@SparkQA

SparkQA commented Apr 24, 2017

Test build #76094 has finished for PR 17736 at commit f295782.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya viirya changed the title [SPARK-20399][SQL][WIP] Can't use same regex pattern between 1.6 and 2.x due to unescaped sql string in parser [SPARK-20399][SQL] Can't use same regex pattern between 1.6 and 2.x due to unescaped sql string in parser Apr 24, 2017
@rxin
Contributor

rxin commented Apr 24, 2017

cc @hvanhovell for review ...

@cloud-fan
Contributor

It seems all string literals in the Spark 2.0 parser behave differently from Spark 1.6?

@viirya
Member Author

viirya commented Apr 24, 2017

Do they? Is there any significant difference? I don't remember any migration being necessary from 1.6 to 2.0 for string literals.

@cloud-fan
Contributor

Isn't the regex parsed as a string literal?

@viirya
Member Author

viirya commented Apr 25, 2017

It is. But there is no problem for normal string literals. It causes a problem only when the string literal is used as a regex pattern string.

@viirya
Member Author

viirya commented Apr 27, 2017

ping @cloud-fan @hvanhovell

@cloud-fan
Contributor

What does SELECT '\\abc' result in? A row with the string value \abc, or \\abc? Is it consistent between Spark 1.6 and 2.0?

@viirya
Member Author

viirya commented Apr 28, 2017

@cloud-fan Do you mean SELECT '\\abc'?

Spark 2.x:

sql("select '\\abc'").show()  \\ Because 2.0 unescapes input string, "\\a" is interpreted as escaped 'a'

+---+
|abc|
+---+
|abc|
+---+

sql("select 'ab\\tc'").show()

+----+
|ab     c|
+----+
|ab     c|
+----+

sql("select 'ab\tc'").show()  // This is consistent with 1.6. The input is escaped character.

+----+
|ab     c|
+----+
|ab     c|
+----+

Spark 1.6:

sql("select '\\abc'").show()

+----+
| _c0|
+----+
|\abc|
+----+

sql("select 'ab\\tc'").show()  // 1.6 doesn't perform unescape, so "\\t" is interpreted as a backslash + 't'

+-----+
|  _c0|
+-----+
|ab\tc|
+-----+

sql("select 'ab\tc'").show()

+----+
| _c0|
+----+
|ab     c|
+----+
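
To summarize the rule, here is a simplified sketch of 2.x-style unescaping (an illustration only, not Spark's actual ParserUtils.unescapeSQLString): known escapes like \t become control characters, and the backslash before any other character is dropped, while 1.6 keeps the SQL text as-is.

// Simplified sketch of 2.x-style unescaping (illustrative only)
def unescapeSqlString(s: String): String = {
  val sb = new StringBuilder
  var i = 0
  while (i < s.length) {
    if (s(i) == '\\' && i + 1 < s.length) {
      s(i + 1) match {
        case 't'   => sb.append('\t')   // known escape: tab
        case 'n'   => sb.append('\n')   // known escape: newline
        case other => sb.append(other)  // unknown escape: backslash dropped
      }
      i += 2
    } else {
      sb.append(s(i))
      i += 1
    }
  }
  sb.toString
}

unescapeSqlString("""\abc""")   // "abc" -- matches the 2.x output above
unescapeSqlString("""ab\tc""")  // "ab" + tab + "c"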

@cloud-fan
Contributor

Shall we fix this inconsistency too?

@viirya
Member Author

viirya commented Apr 28, 2017

I don't know why we have unescapeSQLString to unescape the input string, which causes this inconsistency. Maybe @hvanhovell knows why.

@viirya
Member Author

viirya commented Apr 29, 2017

@cloud-fan Although @hvanhovell hasn't commented yet, I will go ahead and fix the inconsistency first and see whether we have tests defined for it.

@viirya viirya changed the title [SPARK-20399][SQL] Can't use same regex pattern between 1.6 and 2.x due to unescaped sql string in parser [SPARK-20399][SQL][WIP] Can't use same regex pattern between 1.6 and 2.x due to unescaped sql string in parser Apr 30, 2017
@viirya
Member Author

viirya commented Apr 30, 2017

While trying to fix the inconsistency, one issue I found is '\\Z', i.e. ASCII 26 (Ctrl+Z, EOF on Windows). If we make string literal parsing consistent with 1.6, we can't support it. (1.6 doesn't support it either.)
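
A quick check in a 2.x shell (a hedged sketch: it relies on the parser mapping \Z to U+001A, as described above):

sql("select ascii('\\Z')").show()  // 26, the code point of Ctrl+Z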

@viirya
Member Author

viirya commented Apr 30, 2017

Another inconsistency:

Spark 2.0:

sql("""select 'abc'def'""").show()

[info]   org.apache.spark.sql.catalyst.parser.ParseException: extraneous input ''' expecting {<EOF>, ',', 'FROM', 'WHERE', 'GROUP', 'ORDER', 'HAVING', 'LIMIT', 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 'SORT', 'CLUSTER', 'DISTRIBUTE'}(line 1, pos 15)
[info] 
[info] == SQL ==
[info] select 'abc'def'
[info] ---------------^^^

sql("""select 'abc\'def'""").show()

+-------+
|abc'def|
+-------+
|abc'def|
+-------+

Spark 1.6:

sql("""select 'abc'def'""").show()

[info]   java.lang.RuntimeException: [1.17] failure: ``union'' expected but ErrorToken(unclosed string literal) found
[info] 
[info] select 'abc'def'
[info]                 ^
[info]   at scala.sys.package$.error(package.scala:27)
[info]   at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:36)

sql("""select 'abc\'def'""").show()

[info]   java.lang.RuntimeException: [1.18] failure: ``union'' expected but ErrorToken(unclosed string literal) found
[info] 
[info] select 'abc\'def'
[info]                  ^
[info]   at scala.sys.package$.error(package.scala:27)
[info]   at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:36)

@viirya
Member Author

viirya commented Apr 30, 2017

@cloud-fan For the string literal inconsistency, after some thinking and experimenting, I think Spark 2.0's approach is more reasonable. If we follow 1.6, we can't support something like select 'abc\\'def'. So I'd prefer to keep string literal parsing untouched.

@viirya viirya changed the title [SPARK-20399][SQL][WIP] Can't use same regex pattern between 1.6 and 2.x due to unescaped sql string in parser [SPARK-20399][SQL] Can't use same regex pattern between 1.6 and 2.x due to unescaped sql string in parser Apr 30, 2017
@cloud-fan
Contributor

Hi @viirya, thanks for your examples! One suggestion: can we use """string""" instead of "string" in the examples? Otherwise I have to mentally parse the strings according to Java string literal rules, because "a'b" is exactly the same as "a\'b".

@viirya
Member Author

viirya commented May 3, 2017

@cloud-fan I've updated the example. Please check if it is better for you. Thanks.

@cloud-fan
Contributor

Yea, much clearer now, and the string literal handling in Spark 2.0 looks more reasonable.

For the regex, I think it's unfair to compare df.filter("value rlike '^\\x20[\\x20-\\x23]+$'") with df.filter($"value".rlike("^\\x20[\\x20-\\x23]+$")), because Java string literals also play a role here.

Think about a SQL shell: users can write SELECT ... WHERE value RLIKE '^\\x20[\\x20-\\x23]+$', which is consistent with the Java version, so I think the current SQL parser is correct.
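
A sketch with the Java-escaping layer removed (assuming a DataFrame df with a string column value and spark.implicits._ in scope): with triple-quoted Scala strings, what you type is exactly what each parser sees, and the two usages line up.

// The SQL parser unescapes \\x20 to \x20 before the regex is compiled:
val rlikeSql = df.filter("""value rlike '^\\x20[\\x20-\\x23]+$'""")
// Column.rlike takes the pattern verbatim, so no extra backslashes are needed:
val rlikeCol = df.filter($"value".rlike("""^\x20[\x20-\x23]+$"""))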

@viirya
Member Author

viirya commented May 3, 2017

For the regex, users currently need to write something like df.filter("value rlike '^\\\\x20[\\\\x20-\\\\x23]+$'"). Migration is an issue, as @dbtsai reported. It also seems unreasonable to me: so many backslashes are confusing, and no other system seems to have similar behavior (is there one?). That is what this patch tries to fix.

@cloud-fan
Contributor

It also seems unreasonable to me: so many backslashes are confusing, and no other system seems to have similar behavior

Like I said before, it's because the Java string literal plays a role here; try using """string""" and it gets much better. If we want to compare with other systems, we should compare SQL shells.

Migration is a real problem, but it's also a problem for string literals in general. We can add a config to fall back to the old SQL parser behavior.

@viirya
Member Author

viirya commented May 3, 2017

"""string""" can mitigate this issue. So the user input query is always going with this kind of string literal in SQL shell?

A config sounds good to me. We can solve the migration issue without affecting Spark 2.0 behavior.

@cloud-fan
Contributor

So in a SQL shell, does a user's input query always behave like this kind of string literal?

Yea, I think so.

Before we create a new PR for the config, let's try to get some feedback from @hvanhovell.

@viirya
Member Author

viirya commented May 4, 2017

@hvanhovell What do you think about adding a config to fall back to string literal parsing consistent with the old SQL parser behavior (i.e., don't unescape the input string)?

@hvanhovell
Contributor

For some reference: in 1.6 we used the Catalyst SqlParser to parse the expression in DataFrame.filter(), and we used the Hive (ANTLR based) parser for SQL commands. In Spark 2.0 we moved all of this into a single parser. When porting the parser, I followed the rules of the Hive parser (incl. the unescaping logic), and this fell through the cracks.

Java/Scala normal strings make things mind-meltingly confusing. I think it is fair that we provide an option to disable the parser's unescaping as a way out of this. It might not be the best solution if you use regexes in both pure SQL and Scala at the same time, but it is at least an improvement.
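
For reference, a sketch of how such a fallback ends up being used (the config name spark.sql.parser.escapedStringLiterals comes from the follow-up PR #17887 as released, not from this PR):

// Fall back to 1.6-style string literal parsing (the parser stops unescaping):
spark.conf.set("spark.sql.parser.escapedStringLiterals", "true")
// The 1.6 pattern now works again without doubled backslashes:
val rlike16 = df.filter("""value rlike '^\x20[\x20-\x23]+$'""")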

@viirya
Member Author

viirya commented May 4, 2017

@hvanhovell Thanks for the comment; it sounds reasonable to me, and the config can help with the migration issue. @cloud-fan I am going to add the config. Is a new PR better, or should I just update this one?

@cloud-fan
Contributor

let's create a new PR

@viirya viirya closed this May 7, 2017
asfgit pushed a commit that referenced this pull request May 12, 2017
…nsistent with old sql parser behavior

## What changes were proposed in this pull request?

The new SQL parser was introduced in Spark 2.0, and all string literals are unescaped in the parser. This brings an issue regarding regex pattern strings.

The following code reproduces it:

    val data = Seq("\u0020\u0021\u0023", "abc")
    val df = data.toDF()

    // 1st usage: works in 1.6
    // Let the parser parse the pattern string
    val rlike1 = df.filter("value rlike '^\\x20[\\x20-\\x23]+$'")
    // 2nd usage: works in 1.6 and 2.x
    // Column.rlike takes the pattern as a literal, so it doesn't go through the parser
    val rlike2 = df.filter($"value".rlike("^\\x20[\\x20-\\x23]+$"))

    // In 2.x, we need to add extra backslashes for the regex pattern to be parsed correctly
    val rlike3 = df.filter("value rlike '^\\\\x20[\\\\x20-\\\\x23]+$'")

Following the discussion in #17736, this patch adds a config to fall back to 1.6 string literal parsing and mitigate the migration issue.

## How was this patch tested?

Jenkins tests.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #17887 from viirya/add-config-fallback-string-parsing.

(cherry picked from commit 609ba5f)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
asfgit pushed a commit that referenced this pull request May 12, 2017
…nsistent with old sql parser behavior
robert3005 pushed a commit to palantir/spark that referenced this pull request May 19, 2017
…nsistent with old sql parser behavior
lycplus pushed a commit to lycplus/spark that referenced this pull request May 24, 2017
…nsistent with old sql parser behavior
@viirya viirya deleted the rlike-regex branch December 27, 2023 18:20