
Conversation

@crafty-coder
Contributor

What changes were proposed in this pull request?

Add support for a custom encoding option in the CSV writer; see https://issues.apache.org/jira/browse/SPARK-19018
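For context, a plain-JVM sketch (illustration only, not the PR's actual code) of why the option matters: the writer-side `encoding` value names a `java.nio.charset.Charset`, and the same text encodes to different bytes per charset, so ignoring the option silently changes what lands on disk.

```scala
import java.nio.charset.Charset

// Hypothetical option value and sample text (the text is taken from the
// PR's own test data); not Spark code, just the underlying JVM behavior.
val encoding = "windows-1250"
val charset  = Charset.forName(encoding)
val line     = "µß áâä ÁÂÄ"

val bytes   = line.getBytes(charset)                       // single-byte codepage
val utf8Len = line.getBytes(Charset.forName("UTF-8")).length // multi-byte encoding

println(s"windows-1250: ${bytes.length} bytes, UTF-8: $utf8Len bytes")
```

Each non-ASCII character is one byte in windows-1250 but two bytes in UTF-8, which is exactly the difference a reader with the wrong charset would trip over.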

How was this patch tested?

Added two unit tests in CSVSuite

@HyukjinKwon
Member

ok to test

val df = spark
.read
.option("header", "false")
.option("encoding", encoding)
Member

I think our CSV read encoding option is incomplete for now; there are several ongoing discussions about it. I am going to fix the read path soon. Let me revisit this after fixing it.

Member

Now it's fine. I think we decided to support encoding in CSV/JSON datasources. Ignore the comment above. We can proceed separately.

@SparkQA

SparkQA commented Mar 31, 2018

Test build #88779 has finished for PR 20949 at commit b9a7bf0.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 6, 2018

Test build #90274 has finished for PR 20949 at commit 8fd3e0e.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 6, 2018

Test build #90275 has finished for PR 20949 at commit 0d0addf.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rberenguel
Contributor

I've been taking a look at this PR (I've hit this problem in the past and chatted with @crafty-coder about it and his fixes). Is there anything we could do to move it forward?

Also, is there any way to trigger a rebuild on Jenkins without adding a dummy commit? It looks like the JVM on this test run blew the heap; a re-run should be enough (cc @holdenk @HyukjinKwon).

@crafty-coder
Contributor Author

I would say this change has value on its own.

At the moment the CSV reader applies the charset config but the CSV writer ignores it, which I think is a bit confusing.

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented May 6, 2018

Test build #90279 has finished for PR 20949 at commit 0d0addf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

cc @MaxGekk @HyukjinKwon

}

test("Save csv with custom charset") {
Seq("iso-8859-1", "utf-8", "windows-1250").foreach { encoding =>
Member

Could you check the UTF-16 and UTF-32 encodings too? The written CSV files must contain BOMs for such encodings. I am not sure that the Spark CSV datasource is able to read them in per-line mode (multiLine set to false). Probably you need to switch to multiLine mode or read the files back byte-by-byte, as in JsonSuite:

test("SPARK-23723: write json in UTF-16/32 with multiline off") {
  Seq("UTF-16", "UTF-32").foreach { encoding =>
    withTempPath { path =>
      val ds = spark.createDataset(Seq(("a", 1))).repartition(1)
      ds.write
        .option("encoding", encoding)
        .option("multiline", false)
        .json(path.getCanonicalPath)
      val jsonFiles = path.listFiles().filter(_.getName.endsWith("json"))
      jsonFiles.foreach { jsonFile =>
        val readback = Files.readAllBytes(jsonFile.toPath)
        val expected = ("""{"_1":"a","_2":1}""" + "\n").getBytes(Charset.forName(encoding))
        assert(readback === expected)
      }
    }
  }
}
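The BOM concern can be seen with plain Scala (a sketch of the JVM behavior, not Spark code): Java's `UTF-16` encoder prepends a big-endian byte order mark (0xFE 0xFF) to the encoded output, so files written through it do not start with the text itself.

```scala
import java.nio.charset.StandardCharsets

// Encoding a single ASCII character under UTF-16 yields four bytes:
// a two-byte BOM followed by the two-byte code unit for 'a'.
val bytes = "a".getBytes(StandardCharsets.UTF_16)
val hex   = bytes.map(b => f"0x${b & 0xFF}%02X").mkString(" ")
println(s"UTF-16 bytes for \"a\": $hex")
```

A per-line CSV reader that splits on raw newline bytes would see that BOM glued onto the first field unless it handles the charset explicitly.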

val originalDF = Seq("µß áâä ÁÂÄ").toDF("_c0")
// scalastyle:on
originalDF.write
.option("header", "false")
Member

The header flag is disabled by default. Just in case, are there any specific reasons for testing without a CSV header?

Contributor Author

My bad, there is no reason. It's fixed in the next commit.

the quote character. If None is set, the default value is
escape character when escape and quote characters are
different, ``\0`` otherwise.
:param encoding: sets encoding used for encoding the file. If None is set, it
Member

Could you reformulate this: "encoding used for encoding" is redundant.

context,
new Path(path),
charset
)
Member

Seq("iso-8859-1", "utf-8", "windows-1250").foreach { encoding =>
withTempDir { dir =>
val csvDir = new File(dir, "csv").getCanonicalPath
// scalastyle:off
Member

Let's ignore the specific rule for this, e.g.:

// scalastyle:off nonascii
...
// scalastyle:on nonascii

@HyukjinKwon
Member

ok to test

@SparkQA

SparkQA commented Jun 25, 2018

Test build #92277 has finished for PR 20949 at commit 0d0addf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 16, 2018

Test build #93113 has finished for PR 20949 at commit 91f4750.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}
}

test("Save csv with custom charset") {
Member

Could you prepend SPARK-19018 to the test title.

@SparkQA

SparkQA commented Jul 17, 2018

Test build #93164 has finished for PR 20949 at commit 349e132.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* enclosed in quotes. Default is to only escape values containing a quote character.</li>
* <li>`header` (default `false`): writes the names of columns as the first line.</li>
* <li>`nullValue` (default empty string): sets the string representation of a null value.</li>
* <li>`encoding` (default `UTF-8`): encoding to use when saving to file.</li>
Member

I think we should match the doc with JSON's

* <li>`encoding` (by default it is not set): specifies encoding (charset) of saved json
* files. If it is not set, the UTF-8 charset will be used. </li>

private val charset = Charset.forName(params.charset)

private val writer = CodecStreams.createOutputStreamWriter(
context,
@HyukjinKwon Jul 18, 2018

tiny nit:

private val writer = CodecStreams.createOutputStreamWriter(context, new Path(path), charset)

.option("encoding", encoding)
.csv(csvDir.getCanonicalPath)

csvDir.listFiles().filter(_.getName.endsWith("csv")).foreach({ csvFile =>
Member

nit: `h({` -> `h {`, i.e. `.foreach { csvFile =>` instead of `.foreach({ csvFile =>`.

Contributor Author

What do you mean?


csvDir.listFiles().filter(_.getName.endsWith("csv")).foreach({ csvFile =>
val readback = Files.readAllBytes(csvFile.toPath)
val expected = (content + "\n").getBytes(Charset.forName(encoding))
Member

Currently, the newline is dependent on Univocity. This test is going to break on Windows. Let's use the platform's newline.

Contributor Author

Good point!
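A minimal sketch of the fix being discussed (hypothetical values; the point is building the expectation from `System.lineSeparator()` rather than a hardcoded `"\n"`):

```scala
import java.nio.charset.Charset

// Sample row content; the separator is whatever the current platform uses
// ("\n" on Unix, "\r\n" on Windows), so the expected bytes stay portable.
val content  = "a,b,c"
val sep      = System.lineSeparator()
val expected = (content + sep).getBytes(Charset.forName("UTF-8"))

println(s"expected: ${expected.length} bytes (separator is ${sep.length} byte(s))")
```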


test("SPARK-19018: error handling for unsupported charsets") {
val exception = intercept[SparkException] {
withTempDir { dir =>
Member

withTempPath
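For reference, why the error-handling test can expect a failure at all (plain-JVM sketch, not Spark code): resolving an unknown but syntactically legal charset name makes `Charset.forName` throw `UnsupportedCharsetException`, which the test above then sees wrapped in a `SparkException`. `"1-9588-osi"` is the bogus name used in the PR's test.

```scala
import java.nio.charset.{Charset, UnsupportedCharsetException}

// "1-9588-osi" is a legal charset *name* (letters, digits, '-'), but no
// provider supports it, so forName throws UnsupportedCharsetException.
val rejected =
  try { Charset.forName("1-9588-osi"); false }
  catch { case _: UnsupportedCharsetException => true }

println(s"unsupported charset rejected: $rejected")
```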

// scalastyle:on nonascii

Seq("iso-8859-1", "utf-8", "utf-16", "utf-32", "windows-1250").foreach { encoding =>
withTempDir { dir =>
Member

withTempDir -> withTempPath

withTempDir { dir =>
val csvDir = new File(dir, "csv")

val originalDF = Seq(content).toDF("_c0").repartition(1)
Member

toDF("_c0") -> toDF()

escape character when escape and quote characters are
different, ``\0`` otherwise.
:param encoding: sets the encoding (charset) to be used on the csv file. If None is set, it
uses the default value, ``UTF-8``.
Member

Likewise, let's match the doc to JSON's.

crafty-coder and others added 2 commits July 18, 2018 10:11
- Improve method documentation
- Inline method calls that are not too big
- Use platform newline instead of hardcoded one.
- Replace withTempDir with withTempPath
@SparkQA

SparkQA commented Jul 18, 2018

Test build #93227 has finished for PR 20949 at commit b6311e3.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 18, 2018

Test build #93226 has finished for PR 20949 at commit fd857b0.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 18, 2018

Test build #93229 has finished for PR 20949 at commit 025958a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon left a comment

LGTM too

@HyukjinKwon
Member

retest this please

.option("encoding", encoding)
.csv(csvDir.getCanonicalPath)

csvDir.listFiles().filter(_.getName.endsWith("csv")).foreach({ csvFile =>
Member

val csvDir = new File(path, "csv").getCanonicalPath
Seq("a,A,c,A,b,B").toDF().write
.option("encoding", "1-9588-osi")
.csv(csvDir)
Member

nit: you could use path.getCanonicalPath directly

Seq("iso-8859-1", "utf-8", "utf-16", "utf-32", "windows-1250").foreach { encoding =>
withTempPath { path =>
val csvDir = new File(path, "csv")
Seq(content).toDF().write
Member

nit: `.repartition(1).write` to make sure we write only one file

@SparkQA

SparkQA commented Jul 25, 2018

Test build #93531 has finished for PR 20949 at commit 025958a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Merged to master.

@HyukjinKwon
Member

@crafty-coder, what's your JIRA ID? I should know it to assign the JIRA to you.

@asfgit asfgit closed this in 78e0a72 Jul 25, 2018
@crafty-coder
Contributor Author

@HyukjinKwon and @MaxGekk thanks for your help in this PR!

My JIRA ID is also crafty-coder.
