
Conversation

@viirya
Member

@viirya viirya commented Apr 24, 2017

What changes were proposed in this pull request?

The new SQL parser was introduced in Spark 2.0. It seems to bring an issue regarding regex pattern strings.

The following code reproduces it:

import spark.implicits._  // assumes a SparkSession named `spark` is in scope

val data = Seq("\u0020\u0021\u0023", "abc")
val df = data.toDF()

// 1st usage: works in 1.6
// Let the parser parse the pattern string
val rlike1 = df.filter("value rlike '^\\x20[\\x20-\\x23]+$'")
// 2nd usage: works in 1.6 and 2.x
// Column.rlike takes the pattern as a literal, so it doesn't go through the parser
val rlike2 = df.filter($"value".rlike("^\\x20[\\x20-\\x23]+$"))

// In 2.x, we need to add extra backslashes for the regex pattern to be parsed correctly
val rlike3 = df.filter("value rlike '^\\\\x20[\\\\x20-\\\\x23]+$'")

Because the parser unescapes SQL strings, the first usage, which works in 1.6, no longer works in 2.0. To make it work, we need to add additional backslashes.

It is odd that the same regex pattern string can't be used in both usages. We should not unescape regex pattern strings.
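
The escaping layers can be seen without Spark at all. A minimal pure-JVM sketch (names and values here are illustrative): the Scala literal "\\x20" is the regex \x20 at runtime and matches a space, but once the 2.x SQL parser drops the backslash before the unrecognized x, the pattern stops matching.

import java.util.regex.Pattern

// The Scala literal below is the regex ^\x20[\x20-\x23]+$ at runtime
val regex = "^\\x20[\\x20-\\x23]+$"
Pattern.matches(regex, "\u0020\u0021\u0023")              // true
// After 2.x unescaping the pattern becomes ^x20[x20-x23]+$, which no longer matches
Pattern.matches("^x20[x20-x23]+$", "\u0020\u0021\u0023")  // false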

How was this patch tested?

Jenkins tests.

Please review http://spark.apache.org/contributing.html before opening a pull request.

@viirya
Member Author

viirya commented Apr 24, 2017

cc @dbtsai @hvanhovell

@viirya
Member Author

viirya commented Apr 24, 2017

Let's see if it breaks any existing tests.

@SparkQA

SparkQA commented Apr 24, 2017

Test build #76091 has finished for PR 17736 at commit a0f4a13.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dbtsai
Member

dbtsai commented Apr 24, 2017

LGTM. Thanks. @cloud-fan @rxin this fixes our production jobs when we port our applications from 1.6 to 2.0. I think it's an important bug fix. Thanks.

@SparkQA

SparkQA commented Apr 24, 2017

Test build #76094 has finished for PR 17736 at commit f295782.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya viirya changed the title [SPARK-20399][SQL][WIP] Can't use same regex pattern between 1.6 and 2.x due to unescaped sql string in parser [SPARK-20399][SQL] Can't use same regex pattern between 1.6 and 2.x due to unescaped sql string in parser Apr 24, 2017
@rxin
Contributor

rxin commented Apr 24, 2017

cc @hvanhovell for review ...

@cloud-fan
Contributor

It seems all string literals in the Spark 2.0 parser behave differently from Spark 1.6?

@viirya
Member Author

viirya commented Apr 24, 2017

Do they? Is there any significant difference? I don't remember any migration being necessary from 1.6 to 2.0 for string literals.

@cloud-fan
Contributor

Isn't the regex parsed as a string literal?

@viirya
Member Author

viirya commented Apr 25, 2017

It is. But there is no problem for normal string literals. It causes a problem only when the string literal is used as a regex pattern string.

@viirya
Member Author

viirya commented Apr 27, 2017

ping @cloud-fan @hvanhovell

@cloud-fan
Contributor

What does SELECT '\\abc' result in? A row with the string value \abc, or \\abc? Is it consistent between Spark 1.6 and 2.0?

@viirya
Member Author

viirya commented Apr 28, 2017

@cloud-fan Do you mean SELECT '\\abc'?

Spark 2.x:

sql("select '\\abc'").show()  \\ Because 2.0 unescapes input string, "\\a" is interpreted as escaped 'a'

+---+
|abc|
+---+
|abc|
+---+

sql("select 'ab\\tc'").show()

+----+
|ab     c|
+----+
|ab     c|
+----+

sql("select 'ab\tc'").show()  // This is consistent with 1.6. The input is escaped character.

+----+
|ab     c|
+----+
|ab     c|
+----+

Spark 1.6:

sql("select '\\abc'").show()

+----+
| _c0|
+----+
|\abc|
+----+

sql("select 'ab\\tc'").show()  // 1.6 doesn't perform unescape, so "\\t" is interpreted as a backslash + 't'

+-----+
|  _c0|
+-----+
|ab\tc|
+-----+

sql("select 'ab\tc'").show()

+----+
| _c0|
+----+
|ab     c|
+----+
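
To summarize the rule, here is a simplified sketch of 2.x-style unescaping (an illustration only, not Spark's actual ParserUtils.unescapeSQLString): known escapes like \t become control characters, and the backslash before any other character is dropped, while 1.6 keeps the SQL text as-is.

// Simplified sketch of 2.x-style unescaping (illustrative only)
def unescapeSqlString(s: String): String = {
  val sb = new StringBuilder
  var i = 0
  while (i < s.length) {
    if (s(i) == '\\' && i + 1 < s.length) {
      s(i + 1) match {
        case 't'   => sb.append('\t')   // known escape: tab
        case 'n'   => sb.append('\n')   // known escape: newline
        case other => sb.append(other)  // unknown escape: backslash dropped
      }
      i += 2
    } else {
      sb.append(s(i))
      i += 1
    }
  }
  sb.toString
}

unescapeSqlString("""\abc""")   // "abc" -- matches the 2.x output above
unescapeSqlString("""ab\tc""")  // "ab" + tab + "c"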

@cloud-fan
Contributor

Shall we fix this inconsistency too?

@viirya
Member Author

viirya commented Apr 28, 2017

I don't know why we have unescapeSQLString to unescape the input string, which causes this inconsistency. Maybe @hvanhovell knows why.

@viirya
Member Author

viirya commented Apr 29, 2017

@cloud-fan Although @hvanhovell hasn't commented yet, I will go ahead and fix the inconsistency first and see whether we have tests defined for it.

@viirya viirya changed the title [SPARK-20399][SQL] Can't use same regex pattern between 1.6 and 2.x due to unescaped sql string in parser [SPARK-20399][SQL][WIP] Can't use same regex pattern between 1.6 and 2.x due to unescaped sql string in parser Apr 30, 2017
@viirya
Member Author

viirya commented Apr 30, 2017

While trying to fix the inconsistency, one issue I found is '\\Z', i.e. ASCII 26 (Ctrl+Z, EOF on Windows). If we make string literal parsing consistent with 1.6, we can't support it. (1.6 doesn't support it either.)
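
A quick check in a 2.x shell (a hedged sketch: it relies on the parser mapping \Z to U+001A, as described above):

sql("select ascii('\\Z')").show()  // 26, the code point of Ctrl+Z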

@viirya
Member Author

viirya commented Apr 30, 2017

Another inconsistency:

Spark 2.0:

sql("""select 'abc'def'""").show()

[info]   org.apache.spark.sql.catalyst.parser.ParseException: extraneous input ''' expecting {<EOF>, ',', 'FROM', 'WHERE', 'GROUP', 'ORDER', 'HAVING', 'LIMIT', 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 'SORT', 'CLUSTER', 'DISTRIBUTE'}(line 1, pos 15)
[info] 
[info] == SQL ==
[info] select 'abc'def'
[info] ---------------^^^

sql("""select 'abc\'def'""").show()

+-------+
|abc'def|
+-------+
|abc'def|
+-------+

Spark 1.6:

sql("""select 'abc'def'""").show()

[info]   java.lang.RuntimeException: [1.17] failure: ``union'' expected but ErrorToken(unclosed string literal) found
[info] 
[info] select 'abc'def'
[info]                 ^
[info]   at scala.sys.package$.error(package.scala:27)
[info]   at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:36)

sql("""select 'abc\'def'""").show()

[info]   java.lang.RuntimeException: [1.18] failure: ``union'' expected but ErrorToken(unclosed string literal) found
[info] 
[info] select 'abc\'def'
[info]                  ^
[info]   at scala.sys.package$.error(package.scala:27)
[info]   at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:36)

@viirya
Member Author

viirya commented Apr 30, 2017

@cloud-fan For the string literal inconsistency, after some thinking and experimenting, I think Spark 2.0's approach is more reasonable. If we follow 1.6, we can't support something like select 'abc\\'def'. So I'd prefer to keep string literal parsing untouched.

@viirya viirya changed the title [SPARK-20399][SQL][WIP] Can't use same regex pattern between 1.6 and 2.x due to unescaped sql string in parser [SPARK-20399][SQL] Can't use same regex pattern between 1.6 and 2.x due to unescaped sql string in parser Apr 30, 2017
@cloud-fan
Contributor

Hi @viirya, thanks for your examples! One suggestion: can we use """string""" instead of "string" in the examples? Otherwise I have to mentally parse the strings according to Java string literal rules, because "a'b" is exactly the same as "a\'b".

@viirya
Member Author

viirya commented May 3, 2017

@cloud-fan I've updated the example. Please check if it is better for you. Thanks.

@cloud-fan
Contributor

Yea, much clearer now, and the string literal handling in Spark 2.0 looks more reasonable.

For the regex, I think it's unfair to compare df.filter("value rlike '^\\x20[\\x20-\\x23]+$'") with df.filter($"value".rlike("^\\x20[\\x20-\\x23]+$")), because Java string literals also play a role here.

Think about a SQL shell: users can write SELECT ... WHERE value RLIKE '^\\x20[\\x20-\\x23]+$', which is consistent with the Java version, so I think the current SQL parser is correct.
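
A sketch with the Java-escaping layer removed (assuming a DataFrame df with a string column value and spark.implicits._ in scope): with triple-quoted Scala strings, what you type is exactly what each parser sees, and the two usages line up.

// The SQL parser unescapes \\x20 to \x20 before the regex is compiled:
val rlikeSql = df.filter("""value rlike '^\\x20[\\x20-\\x23]+$'""")
// Column.rlike takes the pattern verbatim, so no extra backslashes are needed:
val rlikeCol = df.filter($"value".rlike("""^\x20[\x20-\x23]+$"""))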

@viirya
Member Author

viirya commented May 3, 2017

For the regex, users currently need to write something like df.filter("value rlike '^\\\\x20[\\\\x20-\\\\x23]+$'"). Migration is an issue, as @dbtsai reported. It also seems unreasonable to me: so many backslashes are confusing, and no other system seems to have similar behavior (is there one?). That is what this patch tries to fix.

@cloud-fan
Contributor

It also seems unreasonable to me: so many backslashes are confusing, and no other system seems to have similar behavior

Like I said before, it's because the Java string literal plays a role here; try using """string""" and it gets much better. If we want to compare with other systems, we should compare SQL shells.

Migration is a real problem, but it's also a problem for string literals in general. We can add a config to fall back to the old SQL parser behavior.

@viirya
Member Author

viirya commented May 3, 2017

"""string""" can mitigate this issue. So the user input query is always going with this kind of string literal in SQL shell?

A config sounds good to me. We can solve the migration issue without affecting Spark 2.0 behavior.

@cloud-fan
Contributor

So in a SQL shell, does a user's input query always behave like this kind of string literal?

Yea, I think so.

Before we create a new PR for the config, let's try to get some feedback from @hvanhovell.

@viirya
Member Author

viirya commented May 4, 2017

@hvanhovell What do you think about adding a config to fall back to string literal parsing consistent with the old SQL parser behavior (i.e., don't unescape the input string)?

@hvanhovell
Contributor

For some reference: in 1.6 we used the Catalyst SqlParser to parse the expression in DataFrame.filter(), and we used the Hive (ANTLR based) parser for SQL commands. In Spark 2.0 we moved all of this into a single parser. When porting the parser, I followed the rules of the Hive parser (incl. the unescaping logic), and this fell through the cracks.

Java/Scala normal strings make things mind-meltingly confusing. I think it is fair that we provide an option to disable the parser's unescaping as a way out of this. It might not be the best solution if you use regexes in both pure SQL and Scala at the same time, but it is at least an improvement.
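
For reference, a sketch of how such a fallback ends up being used (the config name spark.sql.parser.escapedStringLiterals comes from the follow-up PR #17887 as released, not from this PR):

// Fall back to 1.6-style string literal parsing (the parser stops unescaping):
spark.conf.set("spark.sql.parser.escapedStringLiterals", "true")
// The 1.6 pattern now works again without doubled backslashes:
val rlike16 = df.filter("""value rlike '^\x20[\x20-\x23]+$'""")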

@viirya
Member Author

viirya commented May 4, 2017

@hvanhovell Thanks for the comment; it sounds reasonable to me, and the config can help with the migration issue. @cloud-fan I am going to add the config. Is a new PR better, or should I just update this one?

@cloud-fan
Contributor

let's create a new PR

@viirya viirya closed this May 7, 2017
asfgit pushed a commit that referenced this pull request May 12, 2017
…nsistent with old sql parser behavior

## What changes were proposed in this pull request?

The new SQL parser was introduced in Spark 2.0, and all string literals are unescaped in the parser. This brings an issue regarding regex pattern strings.

The following code reproduces it:

    val data = Seq("\u0020\u0021\u0023", "abc")
    val df = data.toDF()

    // 1st usage: works in 1.6
    // Let the parser parse the pattern string
    val rlike1 = df.filter("value rlike '^\\x20[\\x20-\\x23]+$'")
    // 2nd usage: works in 1.6 and 2.x
    // Column.rlike takes the pattern as a literal, so it doesn't go through the parser
    val rlike2 = df.filter($"value".rlike("^\\x20[\\x20-\\x23]+$"))

    // In 2.x, we need to add extra backslashes for the regex pattern to be parsed correctly
    val rlike3 = df.filter("value rlike '^\\\\x20[\\\\x20-\\\\x23]+$'")

Following the discussion in #17736, this patch adds a config to fall back to 1.6 string literal parsing and mitigate the migration issue.

## How was this patch tested?

Jenkins tests.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #17887 from viirya/add-config-fallback-string-parsing.

(cherry picked from commit 609ba5f)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
asfgit pushed a commit that referenced this pull request May 12, 2017
…nsistent with old sql parser behavior
robert3005 pushed a commit to palantir/spark that referenced this pull request May 19, 2017
…nsistent with old sql parser behavior
lycplus pushed a commit to lycplus/spark that referenced this pull request May 24, 2017
…nsistent with old sql parser behavior
@viirya viirya deleted the rlike-regex branch December 27, 2023 18:20