
Commit 463ea1d
[SPARK-24540][SQL] Support for multiple character delimiter in Spark CSV read
- Updating univocity-parsers version to 2.8.3, which adds support for multiple character delimiters
- Moving univocity-parsers version to spark-parent pom dependencyManagement section
- Adding new utility method to build multi-char delimiter string, which delegates to existing one
- Adding tests for multiple character delimited CSV, and new test suite for the CSVExprUtils method
- Updating spark-deps* files with new univocity-parsers version
1 parent 275e044 commit 463ea1d

File tree

14 files changed: +133 -12 lines

dev/deps/spark-deps-hadoop-2.7

Lines changed: 1 addition & 1 deletion
@@ -198,7 +198,7 @@ stax-api-1.0.1.jar
 stream-2.9.6.jar
 stringtemplate-3.2.1.jar
 super-csv-2.2.0.jar
-univocity-parsers-2.7.3.jar
+univocity-parsers-2.8.3.jar
 validation-api-2.0.1.Final.jar
 xbean-asm7-shaded-4.14.jar
 xercesImpl-2.9.1.jar

dev/deps/spark-deps-hadoop-3.2

Lines changed: 1 addition & 1 deletion
@@ -217,7 +217,7 @@ stream-2.9.6.jar
 stringtemplate-3.2.1.jar
 super-csv-2.2.0.jar
 token-provider-1.0.1.jar
-univocity-parsers-2.7.3.jar
+univocity-parsers-2.8.3.jar
 validation-api-2.0.1.Final.jar
 woodstox-core-5.0.3.jar
 xbean-asm7-shaded-4.14.jar

pom.xml

Lines changed: 5 additions & 0 deletions
@@ -2180,6 +2180,11 @@
         </exclusion>
       </exclusions>
     </dependency>
+    <dependency>
+      <groupId>com.univocity</groupId>
+      <artifactId>univocity-parsers</artifactId>
+      <version>2.8.3</version>
+    </dependency>
   </dependencies>
 </dependencyManagement>

python/pyspark/sql/readwriter.py

Lines changed: 3 additions & 3 deletions
@@ -360,8 +360,8 @@ def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=Non
             or RDD of Strings storing CSV rows.
         :param schema: an optional :class:`pyspark.sql.types.StructType` for the input schema
             or a DDL-formatted string (For example ``col0 INT, col1 DOUBLE``).
-        :param sep: sets a single character as a separator for each field and value.
-            If None is set, it uses the default value, ``,``.
+        :param sep: sets a separator (one or more characters) for each field and value. If None is
+            set, it uses the default value, ``,``.
         :param encoding: decodes the CSV files by the given encoding type. If None is set,
             it uses the default value, ``UTF-8``.
         :param quote: sets a single character used for escaping quoted values where the
@@ -890,7 +890,7 @@ def csv(self, path, mode=None, compression=None, sep=None, quote=None, escape=No
         :param compression: compression codec to use when saving to file. This can be one of the
             known case-insensitive shorten names (none, bzip2, gzip, lz4,
             snappy and deflate).
-        :param sep: sets a single character as a separator for each field and value. If None is
+        :param sep: sets a separator (one or more characters) for each field and value. If None is
            set, it uses the default value, ``,``.
         :param quote: sets a single character used for escaping quoted values where the
            separator can be part of the value. If None is set, it uses the default

python/pyspark/sql/streaming.py

Lines changed: 2 additions & 2 deletions
@@ -596,8 +596,8 @@ def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=Non
         :param path: string, or list of strings, for input path(s).
         :param schema: an optional :class:`pyspark.sql.types.StructType` for the input schema
             or a DDL-formatted string (For example ``col0 INT, col1 DOUBLE``).
-        :param sep: sets a single character as a separator for each field and value.
-            If None is set, it uses the default value, ``,``.
+        :param sep: sets a separator (one or more characters) for each field and value. If None is
+            set, it uses the default value, ``,``.
         :param encoding: decodes the CSV files by the given encoding type. If None is set,
             it uses the default value, ``UTF-8``.
         :param quote: sets a single character used for escaping quoted values where the

sql/catalyst/pom.xml

Lines changed: 0 additions & 1 deletion
@@ -111,7 +111,6 @@
     <dependency>
       <groupId>com.univocity</groupId>
       <artifactId>univocity-parsers</artifactId>
-      <version>2.7.3</version>
       <type>jar</type>
     </dependency>
     <dependency>

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVExprUtils.scala

Lines changed: 46 additions & 0 deletions
@@ -17,6 +17,8 @@
 
 package org.apache.spark.sql.catalyst.csv
 
+import org.apache.commons.lang3.StringUtils
+
 object CSVExprUtils {
   /**
    * Filter ignorable rows for CSV iterator (lines empty and starting with `comment`).
@@ -79,4 +81,48 @@ object CSVExprUtils {
       throw new IllegalArgumentException(s"Delimiter cannot be more than one character: $str")
     }
   }
+
+  /**
+   * Helper method that converts a string representation of a character sequence to the actual
+   * delimiter characters. The input is processed in "chunks", and each chunk is converted
+   * by calling [[CSVExprUtils.toChar()]]. A chunk is either:
+   * <ul>
+   * <li>a backslash followed by another character</li>
+   * <li>a non-backslash character by itself</li>
+   * </ul>
+   * , in that order of precedence. The result of converting all chunks is returned as
+   * a [[String]].
+   *
+   * <br/><br/>Examples:
+   * <ul><li>`\t` will result in a single tab character as the separator (same as before)
+   * </li><li>`|||` will result in a sequence of three pipe characters as the separator
+   * </li><li>`\\` will result in a single backslash as the separator (same as before)
+   * </li><li>`\.` will result in an error (since a dot is not a character that needs escaping)
+   * </li><li>`\\.` will result in a backslash, then a dot, as the separator character sequence
+   * </li><li>`.\t.` will result in a dot, then a tab, then a dot as the separator character sequence
+   * </li>
+   * </ul>
+   *
+   * @param str the string representing the sequence of separator characters
+   * @return a [[String]] representing the multi-character delimiter
+   * @throws IllegalArgumentException if any of the individual input chunks is illegal
+   */
+  def toDelimiterStr(str: String): String = {
+    var idx = 0
+
+    var delimiter = ""
+
+    while (idx < str.length()) {
+      // if the current character is a backslash, check it plus the next char
+      // in order to use the existing escape logic
+      val readAhead = if (str(idx) == '\\') 2 else 1
+      // get the chunk of 1 or 2 input characters to convert to a single delimiter char
+      val chunk = StringUtils.substring(str, idx, idx + readAhead)
+      delimiter += toChar(chunk)
+      // advance the counter by the length of the input chunk processed
+      idx += chunk.length()
+    }
+
+    delimiter.mkString("")
+  }
 }
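The chunking loop in `toDelimiterStr` can be sketched in pure Python. This is an illustrative sketch only, not Spark's implementation: the hypothetical `escapes` table below only approximates the escape sequences that the real `CSVExprUtils.toChar` accepts.

```python
def to_delimiter_str(s: str) -> str:
    r"""Convert an escaped separator spec (e.g. "\t" or "|||") into the
    actual delimiter characters, one chunk at a time."""
    # Hypothetical escape table; it only approximates what
    # CSVExprUtils.toChar handles in Spark itself.
    escapes = {
        "\\t": "\t", "\\r": "\r", "\\b": "\b", "\\f": "\f",
        "\\\"": "\"", "\\'": "'", "\\\\": "\\", "\\0": "\0",
    }
    if not s:
        raise ValueError("Delimiter cannot be empty string")
    out = []
    i = 0
    while i < len(s):
        # A backslash starts a two-character chunk; anything else is a
        # single literal character.
        width = 2 if s[i] == "\\" else 1
        chunk = s[i:i + width]
        if width == 2:
            if chunk not in escapes:
                raise ValueError(
                    f"Unsupported special character for delimiter: {chunk}")
            out.append(escapes[chunk])
        else:
            out.append(chunk)
        i += len(chunk)
    return "".join(out)
```

Under these assumptions, `r"\."` raises (a dot needs no escape) while `r"\\."` yields the two-character delimiter backslash-then-dot, matching the cases exercised in `CSVExprUtilsSuite`.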

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala

Lines changed: 1 addition & 1 deletion
@@ -95,7 +95,7 @@ class CSVOptions(
     }
   }
 
-  val delimiter = CSVExprUtils.toChar(
+  val delimiter = CSVExprUtils.toDelimiterStr(
     parameters.getOrElse("sep", parameters.getOrElse("delimiter", ",")))
   val parseMode: ParseMode =
     parameters.get("mode").map(ParseMode.fromString).getOrElse(PermissiveMode)
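With this change, a user can pass a multi-character string through the `sep`/`delimiter` option, e.g. `spark.read.option("sep", "|||").csv(path)`. A minimal pure-Python illustration of the resulting field-splitting semantics (the actual parsing is delegated to univocity-parsers, not `str.split`; the sample row values are made up):

```python
# A sample row that uses a three-character delimiter between fields.
row = "2019-06-28|||SPARK-24540|||csv"

# With sep="|||", the whole sequence acts as one delimiter, so this row
# has exactly three fields; a single "|" would not split a field on its own.
fields = row.split("|||")
print(fields)  # ['2019-06-28', 'SPARK-24540', 'csv']
```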

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/csv/CSVExprUtilsSuite.scala

Lines changed: 38 additions & 0 deletions
@@ -17,6 +17,8 @@
 
 package org.apache.spark.sql.catalyst.csv
 
+import org.scalatest.prop.TableDrivenPropertyChecks._
+
 import org.apache.spark.SparkFunSuite
 
 class CSVExprUtilsSuite extends SparkFunSuite {
@@ -58,4 +60,40 @@ class CSVExprUtilsSuite extends SparkFunSuite {
     }
     assert(exception.getMessage.contains("Delimiter cannot be empty string"))
   }
+
+  val testCases = Table(
+    ("input", "separatorStr", "expectedErrorMsg"),
+    // normal tab
+    ("""\t""", Some("\t"), None),
+    // backslash, then tab
+    ("""\\t""", Some("""\t"""), None),
+    // invalid special character (dot)
+    ("""\.""", None, Some("Unsupported special character for delimiter")),
+    // backslash, then dot
+    ("""\\.""", Some("""\."""), None),
+    // nothing special, just straight conversion
+    ("""foo""", Some("foo"), None),
+    // tab in the middle of some other letters
+    ("""ba\tr""", Some("ba\tr"), None),
+    // null character, expressed in Unicode literal syntax
+    ("""\u0000""", Some("\u0000"), None),
+    // and specified directly
+    ("\0", Some("\u0000"), None)
+  )
+
+  test("should correctly produce separator strings, or exceptions, from input") {
+    forAll(testCases) { (input, separatorStr, expectedErrorMsg) =>
+      try {
+        val separator = CSVExprUtils.toDelimiterStr(input)
+        assert(separatorStr.isDefined)
+        assert(expectedErrorMsg.isEmpty)
+        assert(separator.equals(separatorStr.get))
+      } catch {
+        case e: IllegalArgumentException =>
+          assert(separatorStr.isEmpty)
+          assert(expectedErrorMsg.isDefined)
+          assert(e.getMessage.contains(expectedErrorMsg.get))
+      }
+    }
+  }
 }

sql/core/pom.xml

Lines changed: 0 additions & 1 deletion
@@ -38,7 +38,6 @@
     <dependency>
       <groupId>com.univocity</groupId>
       <artifactId>univocity-parsers</artifactId>
-      <version>2.7.3</version>
       <type>jar</type>
     </dependency>
     <dependency>
