Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -97,23 +97,25 @@ object CSVUtils {
*/
@throws[IllegalArgumentException]
def toChar(str: String): Char = {
if (str.charAt(0) == '\\') {
str.charAt(1)
match {
case 't' => '\t'
case 'r' => '\r'
case 'b' => '\b'
case 'f' => '\f'
case '\"' => '\"' // In case user changes quote char and uses \" as delimiter in options
case '\'' => '\''
case 'u' if str == """\u0000""" => '\u0000'
case _ =>
throw new IllegalArgumentException(s"Unsupported special character for delimiter: $str")
}
} else if (str.length == 1) {
str.charAt(0)
} else {
throw new IllegalArgumentException(s"Delimiter cannot be more than one character: $str")
(str: Seq[Char]) match {
case Seq() => throw new IllegalArgumentException("Delimiter cannot be empty string")
case Seq('\\') => throw new IllegalArgumentException("Single backslash is prohibited." +
" It has special meaning as beginning of an escape sequence." +
" To get the backslash character, pass a string with two backslashes as the delimiter.")
case Seq(c) => c
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm missing why we had to switch up the case statement like this. I get that we need to cover more cases, but there was duplication and now there is a bit more. What about ...

str.length match {
  case 0 => // error
  case 1 => str(0)
  case 2 if str(0) == '\\' =>
    str(1) match {
      case c if """trbf"'\""".contains(c) => c
      case 'u' if str == """\u0000""" => '\0'
      case _ => // error
    }
  case _ => // error
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer more declarative way and less nested levels of controls. but this is personal opinion. Let's look at your example:

str.length

you didn't check that str can be null.

case 2 if str(0) == '\\' =>
case 'u' if str == """\u0000""" => '\0'

If it has length 2, how str could be """\u0000"""?

case c if """trbf"'\""".contains(c) => c

You should produce control chars not just second char. For example: \t -> Seq('', 't') -> '\t`.

In my approach, everything is simple. One input case is mapped to one output. There is no unnecessary complexity.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah good points. This is too clever, off the top of my head. I still wonder if the code here can reduce the duplication of Seq('\\', c) => '\c' but I don't see a way that actually works, yeah.

case Seq('\\', 't') => '\t'
case Seq('\\', 'r') => '\r'
case Seq('\\', 'b') => '\b'
case Seq('\\', 'f') => '\f'
// In case user changes quote char and uses \" as delimiter in options
case Seq('\\', '\"') => '\"'
case Seq('\\', '\'') => '\''
case Seq('\\', '\\') => '\\'
case _ if str == """\u0000""" => '\u0000'
case Seq('\\', _) =>
throw new IllegalArgumentException(s"Unsupported special character for delimiter: $str")
case _ =>
throw new IllegalArgumentException(s"Delimiter cannot be more than one character: $str")
}
}

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -1826,4 +1826,14 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
val df = spark.read.option("enforceSchema", false).csv(input)
checkAnswer(df, Row("1", "2"))
}

test("using the backward slash as the delimiter") {
val input = Seq("""abc\1""").toDS()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't \ the default escape character? this should be read as the string "abc1" then, and not delimited. It would have to be \\, right? I'm not talking about Scala string escaping, but CSV here.

Or is the point that delimiting takes precedence?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or is the point that delimiting takes precedence?

Right, if an user specified \ as a delimiter, CSV parser considers it as the delimiter first of all. We can see that in the main loop that delimiters are handled before escape characters:
https://github.com/uniVocity/univocity-parsers/blob/6746adc2ddb420ebba7441339887e4bbc35cf087/src/main/java/com/univocity/parsers/csv/CsvParser.java#L115

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a user specified \ as a delimiter, we should issue an exception message and let users add the escape symbol i.e., \\. It is weird to see both \\ and \ are representing the same thing \. We should be consistent for handling backslash in all the cases.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prohibited single backslash and throw an exception with a tip of using double backslash.

val delimiter = """\\"""
checkAnswer(spark.read.option("delimiter", delimiter).csv(input), Row("abc", "1"))
checkAnswer(spark.read.option("inferSchema", true).option("delimiter", delimiter).csv(input),
Row("abc", 1))
val schema = new StructType().add("a", StringType).add("b", IntegerType)
checkAnswer(spark.read.schema(schema).option("delimiter", delimiter).csv(input), Row("abc", 1))
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -28,6 +28,7 @@ class CSVUtilsSuite extends SparkFunSuite {
assert(CSVUtils.toChar("""\"""") === '\"')
assert(CSVUtils.toChar("""\'""") === '\'')
assert(CSVUtils.toChar("""\u0000""") === '\u0000')
assert(CSVUtils.toChar("""\\""") === '\\')
}

test("Does not accept delimiter larger than one character") {
Expand All @@ -44,4 +45,17 @@ class CSVUtilsSuite extends SparkFunSuite {
assert(exception.getMessage.contains("Unsupported special character for delimiter"))
}

test("string with one backward slash is prohibited") {
val exception = intercept[IllegalArgumentException]{
CSVUtils.toChar("""\""")
}
assert(exception.getMessage.contains("Single backslash is prohibited"))
}

test("output proper error message for empty string") {
val exception = intercept[IllegalArgumentException]{
CSVUtils.toChar("")
}
assert(exception.getMessage.contains("Delimiter cannot be empty string"))
}
}