[SPARK-19018][SQL] Add support for custom encoding on csv writer #20949
Conversation
ok to test

```scala
val df = spark
  .read
  .option("header", "false")
  .option("encoding", encoding)
```
I think our CSV read encoding option is incomplete for now; there are many ongoing discussions about it. I am going to fix the read path soon. Let me revisit this after fixing it.
Now it's fine. I think we decided to support encoding in CSV/JSON datasources. Ignore the comment above. We can proceed separately.
Test build #88779 has finished for PR 20949 at commit

Test build #90274 has finished for PR 20949 at commit

Test build #90275 has finished for PR 20949 at commit
I've been taking a look at this PR (I've hit this problem in the past and had a chat with @crafty-coder about it and his fixes, too); is there anything we could do to move it forward? Also, is there any way to trigger a rebuild on Jenkins without adding a dummy commit? It looks like the JVM on this test run blew the heap, so a re-run should be enough (cc @holdenk @HyukjinKwon)
I would say this change has value on its own. At the moment the csv reader applies the charset config but the csv writer ignores it, which I think is a bit confusing.
retest this please

Test build #90279 has finished for PR 20949 at commit
```scala
}

test("Save csv with custom charset") {
  Seq("iso-8859-1", "utf-8", "windows-1250").foreach { encoding =>
```
Could you check the UTF-16 and UTF-32 encodings too? The written csv files must contain BOMs for such encodings. I am not sure that the Spark CSV datasource is able to read them in per-line mode (multiLine set to false). Probably you need to switch to multiLine mode or read the files with Scala's library, like in JsonSuite:
spark/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala, lines 2322 to 2338 in c7e2742:
| test("SPARK-23723: write json in UTF-16/32 with multiline off") { | |
| Seq("UTF-16", "UTF-32").foreach { encoding => | |
| withTempPath { path => | |
| val ds = spark.createDataset(Seq(("a", 1))).repartition(1) | |
| ds.write | |
| .option("encoding", encoding) | |
| .option("multiline", false) | |
| .json(path.getCanonicalPath) | |
| val jsonFiles = path.listFiles().filter(_.getName.endsWith("json")) | |
| jsonFiles.foreach { jsonFile => | |
| val readback = Files.readAllBytes(jsonFile.toPath) | |
| val expected = ("""{"_1":"a","_2":1}""" + "\n").getBytes(Charset.forName(encoding)) | |
| assert(readback === expected) | |
| } | |
| } | |
| } | |
| } |
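A CSV counterpart of that JSON test could look like the sketch below. This is a hypothetical adaptation, not code from the PR: it assumes the suite's implicits for `toDF` and imports of `java.nio.file.Files` and `java.nio.charset.Charset` are in scope, and it uses the platform newline, a point raised later in this review.

```scala
test("SPARK-19018: write csv in UTF-16/32") {
  Seq("UTF-16", "UTF-32").foreach { encoding =>
    withTempPath { path =>
      // Write a single partition so exactly one csv file is produced.
      Seq("a").toDF().repartition(1).write
        .option("encoding", encoding)
        .csv(path.getCanonicalPath)
      path.listFiles().filter(_.getName.endsWith("csv")).foreach { csvFile =>
        // Compare raw bytes: the same charset encoder produces the BOM
        // (where applicable) on both the written file and the expected array.
        val readback = Files.readAllBytes(csvFile.toPath)
        val expected = ("a" + System.lineSeparator()).getBytes(Charset.forName(encoding))
        assert(readback === expected)
      }
    }
  }
}
```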
```scala
val originalDF = Seq("µß áâä ÁÂÄ").toDF("_c0")
// scalastyle:on
originalDF.write
  .option("header", "false")
```
The header flag is disabled by default. Just in case, are there any specific reasons for testing without a CSV header?
My bad, there is no reason. It's fixed in the next commit.
python/pyspark/sql/readwriter.py (outdated)

```python
    the quote character. If None is set, the default value is
    escape character when escape and quote characters are
    different, ``\0`` otherwise..
:param encoding: sets encoding used for encoding the file. If None is set, it
```
Could you reformulate this: "encoding used for encoding".
```scala
  context,
  new Path(path),
  charset
)
```
Move the `)` up, like `charset)`. See https://github.com/databricks/scala-style-guide
| Seq("iso-8859-1", "utf-8", "windows-1250").foreach { encoding => | ||
| withTempDir { dir => | ||
| val csvDir = new File(dir, "csv").getCanonicalPath | ||
| // scalastyle:off |
Let's disable only the specific rule here, e.g.:

```scala
// scalastyle:off nonascii
...
// scalastyle:on nonascii
```
ok to test

Test build #92277 has finished for PR 20949 at commit

Test build #93113 has finished for PR 20949 at commit
```scala
  }
}

test("Save csv with custom charset") {
```
Could you prepend SPARK-19018 to the test title.
Test build #93164 has finished for PR 20949 at commit
```scala
 * enclosed in quotes. Default is to only escape values containing a quote character.</li>
 * <li>`header` (default `false`): writes the names of columns as the first line.</li>
 * <li>`nullValue` (default empty string): sets the string representation of a null value.</li>
 * <li>`encoding` (default `UTF-8`): encoding to use when saving to file.</li>
```
I think we should match the doc with JSON's:

spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala, lines 525 to 526 in 6ea582e:

```scala
 * <li>`encoding` (by default it is not set): specifies encoding (charset) of saved json
 * files. If it is not set, the UTF-8 charset will be used. </li>
```
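Applied to the CSV writer, the matched doc entry might read something like this (illustrative wording, simply mirroring the JSON entry above):

```scala
 * <li>`encoding` (by default it is not set): specifies encoding (charset) of saved csv
 * files. If it is not set, the UTF-8 charset will be used.</li>
```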
```scala
private val charset = Charset.forName(params.charset)

private val writer = CodecStreams.createOutputStreamWriter(
  context,
```
tiny nit:

```scala
private val writer = CodecStreams.createOutputStreamWriter(context, new Path(path), charset)
```

```scala
  .option("encoding", encoding)
  .csv(csvDir.getCanonicalPath)

csvDir.listFiles().filter(_.getName.endsWith("csv")).foreach({ csvFile =>
```
`h({` => `h {`
What do you mean?
```scala
csvDir.listFiles().filter(_.getName.endsWith("csv")).foreach({ csvFile =>
  val readback = Files.readAllBytes(csvFile.toPath)
  val expected = (content + "\n").getBytes(Charset.forName(encoding))
```
Currently the newline depends on Univocity, so this test is going to break on Windows. Let's use the platform's newline.
Good point!
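A sketch of the suggested fix, reusing `content` and `encoding` from the test hunk above:

```scala
// Use the platform's line separator instead of a hardcoded "\n",
// since uniVocity emits OS-dependent line endings.
val expected = (content + System.lineSeparator()).getBytes(Charset.forName(encoding))
```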
```scala
test("SPARK-19018: error handling for unsupported charsets") {
  val exception = intercept[SparkException] {
    withTempDir { dir =>
```
withTempPath
```scala
// scalastyle:on nonascii

Seq("iso-8859-1", "utf-8", "utf-16", "utf-32", "windows-1250").foreach { encoding =>
  withTempDir { dir =>
```
`withTempDir` -> `withTempPath`
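For context, the difference between the two helpers, in a simplified sketch of Spark's SQLTestUtils (not code from this PR): `withTempDir` pre-creates the directory, while `withTempPath` hands the block a path that does not yet exist, which suits writers that create the output directory themselves.

```scala
// Simplified from Spark's SQLTestUtils:
protected def withTempPath(f: File => Unit): Unit = {
  val path = Utils.createTempDir()
  path.delete()                              // hand the block a non-existent path
  try f(path) finally Utils.deleteRecursively(path)
}
```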
```scala
withTempDir { dir =>
  val csvDir = new File(dir, "csv")

  val originalDF = Seq(content).toDF("_c0").repartition(1)
```
toDF("_c0") -> toDF()
python/pyspark/sql/readwriter.py (outdated)

```python
    escape character when escape and quote characters are
    different, ``\0`` otherwise..
:param encoding: sets the encoding (charset) to be used on the csv file. If None is set, it
    uses the default value, ``UTF-8``.
```
Likewise, let's match the doc to JSON's.
- Improve method documentation
- Inline method calls that are not too big
- Use platform newline instead of hardcoded one
- Replace withTempDir with withTempPath
Test build #93227 has finished for PR 20949 at commit

Test build #93226 has finished for PR 20949 at commit

Test build #93229 has finished for PR 20949 at commit
HyukjinKwon left a comment:
LGTM too
retest this please
| .option("encoding", encoding) | ||
| .csv(csvDir.getCanonicalPath) | ||
|
|
||
| csvDir.listFiles().filter(_.getName.endsWith("csv")).foreach({ csvFile => |
nit: `.foreach({` -> `.foreach {` per https://github.com/databricks/scala-style-guide#anonymous-methods
```scala
val csvDir = new File(path, "csv").getCanonicalPath
Seq("a,A,c,A,b,B").toDF().write
  .option("encoding", "1-9588-osi")
  .csv(csvDir)
```
nit: you could use `path.getCanonicalPath` directly
| Seq("iso-8859-1", "utf-8", "utf-16", "utf-32", "windows-1250").foreach { encoding => | ||
| withTempPath { path => | ||
| val csvDir = new File(path, "csv") | ||
| Seq(content).toDF().write |
nit: add `.repartition(1)` before `.write` to make sure we write only one file
Test build #93531 has finished for PR 20949 at commit
Merged to master.
@crafty-coder, what's your JIRA ID? I need it to assign the JIRA to you.
@HyukjinKwon and @MaxGekk, thanks for your help on this PR! My JIRA ID is also crafty-coder.
What changes were proposed in this pull request?
Add support for custom encoding on the csv writer; see https://issues.apache.org/jira/browse/SPARK-19018.
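A brief usage sketch of the new option (output path and charset are illustrative):

```scala
// Write a DataFrame as CSV in a non-default charset; previously the
// writer always produced UTF-8 regardless of this option.
df.write
  .option("encoding", "iso-8859-1")
  .csv("/tmp/output-latin1")
```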
How was this patch tested?
Added two unit tests in CSVSuite