[SPARK-25660][SQL] Fix for the backward slash as CSV fields delimiter #22654

MaxGekk · 2018-10-06T10:15:48Z

What changes were proposed in this pull request?

The PR fixes an exception raised when code reads past the end of the delimiter string. In particular, using the backward slash \ as the CSV field delimiter causes the following exception when reading abc\1:

String index out of range: 1
java.lang.StringIndexOutOfBoundsException: String index out of range: 1
	at java.lang.String.charAt(String.java:658)

The exception occurs because str.charAt(1) in CSVUtils.toChar tries to read a char past the end of the one-character str.
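The failure mode can be reproduced with a minimal sketch. The function below is a simplified stand-in, not the actual `CSVUtils.toChar` source; the name `toCharUnchecked` and the reduced set of escape cases are assumptions made for illustration.

```scala
import scala.util.Try

object DelimiterBugSketch {
  // Simplified stand-in for the pre-patch logic, not the actual Spark source:
  // a leading backslash is assumed to start a two-character escape sequence.
  def toCharUnchecked(str: String): Char =
    if (str.charAt(0) == '\\') {
      // For a one-character string holding only a backslash, index 1 does not exist.
      str.charAt(1) match {
        case 't' => '\t'
        case _ =>
          throw new IllegalArgumentException(s"Unsupported special character for delimiter: $str")
      }
    } else {
      str.charAt(0)
    }

  def main(args: Array[String]): Unit = {
    // A lone backslash (written "\\" in Scala source) triggers the out-of-range access.
    println(Try(toCharUnchecked("\\")))
  }
}
```

Checking the string length before indexing, or matching on the whole character sequence, avoids the out-of-range access.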

How was this patch tested?

Added tests for the empty string and for a string containing the backward slash to CSVUtilsSuite. In addition, I added an end-to-end test that checks how the backward slash is handled when reading a CSV string that contains it.

SparkQA · 2018-10-06T11:52:42Z

Test build #97045 has finished for PR 22654 at commit 7bf453a.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

MaxGekk · 2018-10-06T12:28:54Z

jenkins, retest this, please

SparkQA · 2018-10-06T16:17:00Z

Test build #97048 has finished for PR 22654 at commit 7bf453a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MaxGekk · 2018-10-08T21:59:52Z

@gatorsmile Please, take a look at this.

SparkQA · 2018-10-09T13:27:53Z

Test build #97153 has finished for PR 22654 at commit 1c2ac25.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2018-10-09T14:26:46Z

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala

  }
+
+  test("using the backward slash as the delimiter") {
+    val input = Seq("""abc\1""").toDS()


Isn't \ the default escape character? This should then be read as the string "abc1", and not delimited. It would have to be \\, right? I'm talking about CSV escaping here, not Scala string escaping.

Or is the point that delimiting takes precedence?

> Or is the point that delimiting takes precedence?

Right. If a user specifies \ as the delimiter, the CSV parser treats it as the delimiter first. We can see in the parser's main loop that delimiters are handled before escape characters:
https://github.com/uniVocity/univocity-parsers/blob/6746adc2ddb420ebba7441339887e4bbc35cf087/src/main/java/com/univocity/parsers/csv/CsvParser.java#L115
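To illustrate the precedence argument, here is a toy field splitter. It is not univocity's parser; the `split` function and its structure are assumptions made for illustration. It only shows that when the delimiter test runs before the escape test, a backslash delimiter splits `abc\1` into two fields.

```scala
import scala.collection.mutable.ListBuffer

object DelimiterFirstSketch {
  // Toy splitter, not univocity's parser: the delimiter test runs before the escape test.
  def split(line: String, delimiter: Char, escape: Char): List[String] = {
    val fields = ListBuffer.empty[String]
    val current = new StringBuilder
    var i = 0
    while (i < line.length) {
      val ch = line.charAt(i)
      if (ch == delimiter) {
        // Checked first, so the delimiter wins when both roles use the same char.
        fields += current.toString
        current.clear()
      } else if (ch == escape && i + 1 < line.length) {
        // Drop the escape char and keep the following char literally.
        i += 1
        current.append(line.charAt(i))
      } else {
        current.append(ch)
      }
      i += 1
    }
    fields += current.toString
    fields.toList
  }
}
```

With a comma delimiter the backslash acts as an escape and the same input stays a single field, which matches the reading suggested in the question above.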

srowen · 2018-10-09T14:32:07Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVUtils.scala

-      throw new IllegalArgumentException(s"Delimiter cannot be more than one character: $str")
+    (str: Seq[Char]) match {
+      case Seq() => throw new IllegalArgumentException("Delimiter cannot be empty string")
+      case Seq(c) => c


I'm missing why we had to switch up the case statement like this. I get that we need to cover more cases, but there was duplication and now there is a bit more. What about ...

    str.length match {
      case 0 => // error
      case 1 => str(0)
      case 2 if str(0) == '\\' =>
        str(1) match {
          case c if """trbf"'\""".contains(c) => c
          case 'u' if str == """\u0000""" => '\0'
          case _ => // error
        }
      case _ => // error
    }

I would prefer a more declarative style with fewer nested levels of control flow, but this is a personal opinion. Let's look at your example:

    str.length

You didn't check whether str can be null.

    case 2 if str(0) == '\\' =>
    case 'u' if str == """\u0000""" => '\0'

If str has length 2, how could it be """\u0000"""?

    case c if """trbf"'\""".contains(c) => c

You should produce the control char, not just the second char. For example: \t -> Seq('\\', 't') -> '\t'.

In my approach, everything is simple: each input case maps to exactly one output, with no unnecessary complexity.

Ah yeah good points. This is too clever, off the top of my head. I still wonder if the code here can reduce the duplication of Seq('\\', c) => '\c' but I don't see a way that actually works, yeah.
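The flat style discussed in this thread can be sketched as follows. This is an illustration in the spirit of the diff above, not the merged code: the object name, the explicit null check, and the omission of the NUL-character case are assumptions. At this point in the review a single backslash is still accepted through the `Seq(c)` case.

```scala
object FlatDelimiterMatchSketch {
  // Illustrative only: one pattern per accepted input, each mapped to exactly one output char.
  // The NUL-character case mentioned in the review is omitted for brevity.
  def toChar(str: String): Char = {
    require(str != null, "Delimiter cannot be null")
    str.toSeq match {
      case Seq() => throw new IllegalArgumentException("Delimiter cannot be empty string")
      case Seq(c) => c
      // Escape sequences produce the control char, not the second char of the input.
      case Seq('\\', 't') => '\t'
      case Seq('\\', 'r') => '\r'
      case Seq('\\', 'b') => '\b'
      case Seq('\\', 'f') => '\f'
      case Seq('\\', '"') => '"'
      case Seq('\\', '\'') => '\''
      case Seq('\\', '\\') => '\\'
      case Seq('\\', _) =>
        throw new IllegalArgumentException(s"Unsupported special character for delimiter: $str")
      case _ =>
        throw new IllegalArgumentException(s"Delimiter cannot be more than one character: $str")
    }
  }
}
```

The repetition of `Seq('\\', c)` is the duplication mentioned above; the trade-off is that every accepted input is visible as its own line.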

MaxGekk · 2018-10-11T12:11:43Z

@gatorsmile Could you look at it one more time, please.

gatorsmile · 2018-10-11T22:44:39Z

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala

  }
+
+  test("using the backward slash as the delimiter") {
+    val input = Seq("""abc\1""").toDS()


If a user specifies \ as a delimiter, we should throw an exception with a message that tells users to add the escape symbol, i.e., \\. It is weird for both \\ and \ to represent the same thing, \. We should handle the backslash consistently in all cases.

I prohibited the single backslash; it now throws an exception with a tip to use a double backslash.
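A minimal sketch of the behaviour agreed on here, assuming a hypothetical helper named `checkDelimiter`; the exact wording of the error message in the merged code may differ. The `main` method also shows how the two-character delimiter string is written in Scala source.

```scala
object SingleBackslashGuardSketch {
  // Hypothetical helper, not the merged Spark code: rejects a lone backslash
  // and points the user at the escaped form.
  def checkDelimiter(str: String): Unit =
    if (str == "\\") {
      throw new IllegalArgumentException(
        "Single backslash is prohibited as a delimiter; pass two backslashes instead.")
    }

  def main(args: Array[String]): Unit = {
    // In Scala source, a regular literal needs four backslashes to hold two,
    // while a triple-quoted literal holds them as written.
    val escaped = "\\\\"
    assert(escaped == """\\""")
    assert(escaped.length == 2)
    checkDelimiter(escaped) // accepted
  }
}
```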

SparkQA · 2018-10-12T17:39:57Z

Test build #97306 has finished for PR 22654 at commit 20856b4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2018-10-12T19:03:08Z

LGTM

Thanks! Merged to master and 2.4.

## What changes were proposed in this pull request?

The PR addresses the exception raised on accessing chars out of delimiter string. In particular, the backward slash `\` as the CSV fields delimiter causes the following exception on reading `abc\1`:
```Scala
String index out of range: 1
java.lang.StringIndexOutOfBoundsException: String index out of range: 1
	at java.lang.String.charAt(String.java:658)
```
because `str.charAt(1)` tries to access a char out of `str` in `CSVUtils.toChar`

## How was this patch tested?

Added tests for empty string and string containing the backward slash to `CSVUtilsSuite`. Besides of that I added an end-to-end test to check how the backward slash is handled in reading CSV string with it.

Closes #22654 from MaxGekk/csv-slash-delim.

Authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>

(cherry picked from commit c7eadb5)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>

MaxGekk · 2018-10-12T20:00:30Z

@gatorsmile @srowen Thank you for your work.

MaxGekk added 2 commits October 6, 2018 11:12

Test for backward slash as the delimiter

dd16ca3

Bug fix + tests

7bf453a

gatorsmile reviewed Oct 9, 2018

gatorsmile reviewed Oct 9, 2018

MaxGekk added 3 commits October 9, 2018 11:24

Removing string interpolation

728aac2

Merge remote-tracking branch 'origin/master' into csv-slash-delim

983315b

# Conflicts: # sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala

Support backslash escaped by backslash

1c2ac25

srowen reviewed Oct 9, 2018

gatorsmile reviewed Oct 11, 2018

Prohibit single backslash

20856b4

asfgit closed this in c7eadb5 Oct 12, 2018

MaxGekk deleted the csv-slash-delim branch August 17, 2019 13:35
