Skip to content

Conversation

@MaxGekk
Copy link
Member

@MaxGekk MaxGekk commented Oct 6, 2018

What changes were proposed in this pull request?

The PR addresses the exception raised on accessing chars out of delimiter string. In particular, the backward slash \ as the CSV fields delimiter causes the following exception on reading abc\1:

String index out of range: 1
java.lang.StringIndexOutOfBoundsException: String index out of range: 1
	at java.lang.String.charAt(String.java:658)

because str.charAt(1) tries to access a char out of str in CSVUtils.toChar

How was this patch tested?

Added tests for empty string and string containing the backward slash to CSVUtilsSuite. Besides of that I added an end-to-end test to check how the backward slash is handled in reading CSV string with it.

@SparkQA
Copy link

SparkQA commented Oct 6, 2018

Test build #97045 has finished for PR 22654 at commit 7bf453a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Copy link
Member Author

MaxGekk commented Oct 6, 2018

jenkins, retest this, please

@SparkQA
Copy link

SparkQA commented Oct 6, 2018

Test build #97048 has finished for PR 22654 at commit 7bf453a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Copy link
Member Author

MaxGekk commented Oct 8, 2018

@gatorsmile Please, take a look at this.

# Conflicts:
#	sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
@SparkQA
Copy link

SparkQA commented Oct 9, 2018

Test build #97153 has finished for PR 22654 at commit 1c2ac25.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

test("using the backward slash as the delimiter") {
val input = Seq("""abc\1""").toDS()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't \ the default escape character? this should be read as the string "abc1" then, and not delimited. It would have to be \\, right? I'm not talking about Scala string escaping, but CSV here.

Or is the point that delimiting takes precedence?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or is the point that delimiting takes precedence?

Right, if an user specified \ as a delimiter, CSV parser considers it as the delimiter first of all. We can see that in the main loop that delimiters are handled before escape characters:
https://github.com/uniVocity/univocity-parsers/blob/6746adc2ddb420ebba7441339887e4bbc35cf087/src/main/java/com/univocity/parsers/csv/CsvParser.java#L115

throw new IllegalArgumentException(s"Delimiter cannot be more than one character: $str")
(str: Seq[Char]) match {
case Seq() => throw new IllegalArgumentException("Delimiter cannot be empty string")
case Seq(c) => c
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm missing why we had to switch up the case statement like this. I get that we need to cover more cases, but there was duplication and now there is a bit more. What about ...

str.length match {
  case 0 => // error
  case 1 => str(0)
  case 2 if str(0) == '\\' =>
    str(1) match {
      case c if """trbf"'\""".contains(c) => c
      case 'u' if str == """\u0000""" => '\0'
      case _ => // error
    }
  case _ => // error
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer more declarative way and less nested levels of controls. but this is personal opinion. Let's look at your example:

str.length

you didn't check that str can be null.

case 2 if str(0) == '\\' =>
case 'u' if str == """\u0000""" => '\0'

If it has length 2, how str could be """\u0000"""?

case c if """trbf"'\""".contains(c) => c

You should produce control chars not just second char. For example: \t -> Seq('', 't') -> '\t`.

In my approach, everything is simple. One input case is mapped to one output. There is no unnecessary complexity.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah good points. This is too clever, off the top of my head. I still wonder if the code here can reduce the duplication of Seq('\\', c) => '\c' but I don't see a way that actually works, yeah.

@MaxGekk
Copy link
Member Author

MaxGekk commented Oct 11, 2018

@gatorsmile Could you look at it one more time, please.

}

test("using the backward slash as the delimiter") {
val input = Seq("""abc\1""").toDS()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a user specified \ as a delimiter, we should issue an exception message and let users add the escape symbol i.e., \\. It is weird to see both \\ and \ are representing the same thing \. We should be consistent for handling backslash in all the cases.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prohibited single backslash and throw an exception with a tip of using double backslash.

@SparkQA
Copy link

SparkQA commented Oct 12, 2018

Test build #97306 has finished for PR 22654 at commit 20856b4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

gatorsmile commented Oct 12, 2018

LGTM

Thanks! Merged to master and 2.4.

asfgit pushed a commit that referenced this pull request Oct 12, 2018
## What changes were proposed in this pull request?

The PR addresses the exception raised on accessing chars out of delimiter string. In particular, the backward slash `\` as the CSV fields delimiter causes the following exception on reading `abc\1`:
```Scala
String index out of range: 1
java.lang.StringIndexOutOfBoundsException: String index out of range: 1
	at java.lang.String.charAt(String.java:658)
```
because `str.charAt(1)` tries to access a char out of `str` in `CSVUtils.toChar`

## How was this patch tested?

Added tests for empty string and string containing the backward slash to `CSVUtilsSuite`. Besides of that I added an end-to-end test to check how the backward slash is handled in reading CSV string with it.

Closes #22654 from MaxGekk/csv-slash-delim.

Authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
(cherry picked from commit c7eadb5)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
@asfgit asfgit closed this in c7eadb5 Oct 12, 2018
@MaxGekk
Copy link
Member Author

MaxGekk commented Oct 12, 2018

@gatorsmile @srowen Thank you for your work.

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
## What changes were proposed in this pull request?

The PR addresses the exception raised on accessing chars out of delimiter string. In particular, the backward slash `\` as the CSV fields delimiter causes the following exception on reading `abc\1`:
```Scala
String index out of range: 1
java.lang.StringIndexOutOfBoundsException: String index out of range: 1
	at java.lang.String.charAt(String.java:658)
```
because `str.charAt(1)` tries to access a char out of `str` in `CSVUtils.toChar`

## How was this patch tested?

Added tests for empty string and string containing the backward slash to `CSVUtilsSuite`. Besides of that I added an end-to-end test to check how the backward slash is handled in reading CSV string with it.

Closes apache#22654 from MaxGekk/csv-slash-delim.

Authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
@MaxGekk MaxGekk deleted the csv-slash-delim branch August 17, 2019 13:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants