[SPARK-25660][SQL] Fix for the backward slash as CSV fields delimiter #22654
Test build #97045 has finished for PR 22654 at commit
jenkins, retest this, please
Test build #97048 has finished for PR 22654 at commit
@gatorsmile Please take a look at this.
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVUtils.scala
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVUtils.scala
# Conflicts: # sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
Test build #97153 has finished for PR 22654 at commit
  }

  test("using the backward slash as the delimiter") {
    val input = Seq("""abc\1""").toDS()
Isn't \ the default escape character? This should then be read as the string "abc1", and not delimited. It would have to be \\, right? I'm not talking about Scala string escaping here, but about CSV.
Or is the point that delimiting takes precedence?
> Or is the point that delimiting takes precedence?

Right. If a user specifies \ as the delimiter, the CSV parser treats it as the delimiter first of all. We can see in the parser's main loop that delimiters are handled before escape characters:
https://github.com/uniVocity/univocity-parsers/blob/6746adc2ddb420ebba7441339887e4bbc35cf087/src/main/java/com/univocity/parsers/csv/CsvParser.java#L115
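To illustrate the precedence described above, here is a minimal toy splitter. It is not the univocity implementation: `DelimiterFirst` and `split` are hypothetical names, and the escape handling is deliberately simplified. It only shows that when the delimiter check comes first in the loop, a backslash configured as the delimiter never reaches the escape branch.

```scala
// Toy field splitter, NOT the univocity parser. All names are hypothetical.
object DelimiterFirst {
  def split(line: String, delimiter: Char, escape: Char = '\\'): List[String] = {
    val fields = List.newBuilder[String]
    val current = new StringBuilder
    var i = 0
    while (i < line.length) {
      val ch = line.charAt(i)
      if (ch == delimiter) {
        // The delimiter check comes first, so it wins even when the
        // delimiter is the same character as the escape.
        fields += current.toString
        current.clear()
      } else if (ch == escape && i + 1 < line.length) {
        // Simplified escape handling: take the next character literally.
        i += 1
        current.append(line.charAt(i))
      } else {
        current.append(ch)
      }
      i += 1
    }
    fields += current.toString
    fields.result()
  }
}
```

With this sketch, `DelimiterFirst.split("""abc\1""", '\\')` yields the two fields `abc` and `1`, while the same input with `,` as the delimiter is read as the single field `abc1`, which is the reading asked about in the question above.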
      throw new IllegalArgumentException(s"Delimiter cannot be more than one character: $str")
    (str: Seq[Char]) match {
      case Seq() => throw new IllegalArgumentException("Delimiter cannot be empty string")
      case Seq(c) => c
I'm missing why we had to switch up the case statement like this. I get that we need to cover more cases, but there was duplication and now there is a bit more. What about ...
    str.length match {
      case 0 => // error
      case 1 => str(0)
      case 2 if str(0) == '\\' =>
        str(1) match {
          case c if """trbf"'\""".contains(c) => c
          case 'u' if str == """\u0000""" => '\0'
          case _ => // error
        }
      case _ => // error
    }
I would prefer a more declarative approach with fewer nested levels of control flow, but that is a personal opinion. Let's look at your example:
`str.length`
You didn't check whether `str` can be null.
`case 2 if str(0) == '\\' =>`
`case 'u' if str == """\u0000""" => '\0'`
If it has length 2, how could `str` be `"""\u0000"""`?
`case c if """trbf"'\""".contains(c) => c`
You should produce the control character, not just the second char. For example: `\t` -> `Seq('\\', 't')` -> `'\t'`.
In my approach, everything is simple: each input case is mapped to one output. There is no unnecessary complexity.
Ah yeah good points. This is too clever, off the top of my head. I still wonder if the code here can reduce the duplication of Seq('\\', c) => '\c' but I don't see a way that actually works, yeah.
@gatorsmile Could you look at it one more time, please?
  }

  test("using the backward slash as the delimiter") {
    val input = Seq("""abc\1""").toDS()
If a user specifies \ as the delimiter, we should throw an exception and ask users to add the escape symbol, i.e., \\. It is weird to see both \\ and \ representing the same thing, \. We should be consistent in handling backslashes in all cases.
I prohibited a single backslash: it now throws an exception with a tip to use a double backslash instead.
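The behavior agreed on in this thread can be sketched as one declarative match, where each accepted delimiter string maps to exactly one character. This is an illustrative stand-in rather than the merged code: `DelimiterSketch` is a hypothetical name, and the error messages other than the two quoted in the diff above are assumptions.

```scala
// Hypothetical stand-in for CSVUtils.toChar; not the merged implementation.
object DelimiterSketch {
  def toChar(str: String): Char = (str: Seq[Char]) match {
    case Seq() =>
      throw new IllegalArgumentException("Delimiter cannot be empty string")
    case Seq('\\') =>
      // Assumed wording: a lone backslash is rejected with a tip.
      throw new IllegalArgumentException(
        "Single backslash is prohibited; pass two backslashes to get the backslash delimiter")
    case Seq(c) => c
    // Two-character escape sequences map to the character they denote.
    case Seq('\\', 't') => '\t'
    case Seq('\\', 'r') => '\r'
    case Seq('\\', 'b') => '\b'
    case Seq('\\', 'f') => '\f'
    case Seq('\\', '"') => '"'
    case Seq('\\', '\'') => '\''
    case Seq('\\', '\\') => '\\'
    // The six-character escape for the NUL character, spelled out per char.
    case Seq('\\', 'u', '0', '0', '0', '0') => '\u0000'
    case Seq('\\', _) =>
      throw new IllegalArgumentException(s"Unsupported special character for delimiter: $str")
    case _ =>
      throw new IllegalArgumentException(s"Delimiter cannot be more than one character: $str")
  }
}
```

Under these assumptions, a caller who wants the backslash as the delimiter passes the two-character string `\\`, while the one-character string `\` fails fast with a message instead of an index error.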
Test build #97306 has finished for PR 22654 at commit
LGTM Thanks! Merged to master and 2.4.
## What changes were proposed in this pull request?

The PR addresses the exception raised on accessing characters beyond the end of the delimiter string. In particular, using the backward slash `\` as the CSV field delimiter causes the following exception on reading `abc\1`:

```Scala
String index out of range: 1
java.lang.StringIndexOutOfBoundsException: String index out of range: 1
	at java.lang.String.charAt(String.java:658)
```

This happens because `str.charAt(1)` tries to access a character beyond the end of `str` in `CSVUtils.toChar`.

## How was this patch tested?

Added tests for the empty string and for strings containing the backward slash to `CSVUtilsSuite`. Besides that, I added an end-to-end test to check how the backward slash is handled when reading a CSV string that contains it.

Closes #22654 from MaxGekk/csv-slash-delim.

Authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
(cherry picked from commit c7eadb5)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
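The failure described in the commit message can be reproduced with a small sketch. The shape of the pre-fix logic is an assumption here (the description only states that `str.charAt(1)` was called on a string that was too short), and `OldToCharSketch` is a hypothetical name.

```scala
// Assumed shape of the pre-fix logic, for illustration only.
object OldToCharSketch {
  def toChar(str: String): Char =
    if (str.charAt(0) == '\\') {
      // No length check: for a string holding only a backslash, index 1
      // does not exist and charAt throws StringIndexOutOfBoundsException.
      str.charAt(1) match {
        case 't' => '\t'
        case other => other
      }
    } else {
      str.charAt(0)
    }
}
```

Calling `OldToCharSketch.toChar("\\")` (a Scala literal for the one-character string `\`) throws `java.lang.StringIndexOutOfBoundsException`, and the empty string fails the same way at `charAt(0)`, which is why both inputs are covered by the added tests.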
@gatorsmile @srowen Thank you for your work.