[SPARK-19709][SQL] Read empty file with CSV data source #17068
Conversation
        caseSensitive: Boolean,
        options: CSVOptions): StructType = {
    -   val firstLine: String = CSVUtils.filterCommentAndEmpty(csv, options).first()
    +   val lines = CSVUtils.filterCommentAndEmpty(csv, options)
Hi @wojtek-szymanski, I think we should not rely on exception handling. I can think of take(1).headOption, but we could use a shorter one if you know any other good way. What do you think about this?
You are absolutely right. Relying on exception handling is smelly, while Option gives more opportunities. I also see no difference from a performance point of view, since both first() and take(1) call the same function, head(1).
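The trade-off under discussion can be sketched in plain Scala (names here are hypothetical, and `head` on a local collection stands in for `first()` on an RDD, since both throw on empty input):

```scala
object HeadOptionSketch extends App {
  val emptyLines: Seq[String] = Seq.empty

  // head, like RDD.first(), throws NoSuchElementException on empty input,
  // so the caller has to wrap it in a try/catch
  val viaException: Option[String] =
    try Some(emptyLines.head)
    catch { case _: NoSuchElementException => None }

  // take(1).headOption expresses the same intent without exceptions
  val viaOption: Option[String] = emptyLines.take(1).headOption

  assert(viaException == viaOption) // both are None for empty input
  println(viaOption)
}
```

Both forms ultimately read at most one element, so the choice is about style and error handling rather than performance.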
@HyukjinKwon, can somebody allow testing of this PR, as there are no more comments?
      .take(1)
      .headOption
      .map(firstLine => infer(sparkSession, parsedOptions, csv, firstLine))
      .orElse(Some(StructType(Seq())))
Could we maybe just match it to CSVDataSource.scala#L204-L224, just for consistency for now?
Personally, I think such chaining sometimes makes the code hard to read. Maybe we could consider the code de-duplication for this in another PR. It would be easier if they look similar at least.
I would suggest that we use pattern matching in order to make it more expressive and avoid code like this:

    if (maybeFirstRow.isDefined) {
      val firstRow = maybeFirstRow.get

I also touched WholeFileCSVDataSource to unify both implementations. What's your opinion?
Regarding code de-duplication, I fully agree that it should be done in a separate PR.
      assert(result.schema.fieldNames.size === 1)
    }

    test("test with empty file without schema") {
Let's re-use the test in CSVSuite.scala#L1083.
We could..
    test("Empty file produces empty dataframe with empty schema") {
      Seq(false, true).foreach { wholeFile =>
        val df = spark.read.format("csv")
          .option("header", true)
          .option("wholeFile", wholeFile)
          .load(testFile(emptyFile))
        assert(df.schema === spark.emptyDataFrame.schema)
        checkAnswer(df, spark.emptyDataFrame)
      }
    }
Good idea, done
Hi @wojtek-szymanski, these are all from me. Let me cc @cloud-fan, as my PRs related to this were reviewed by him, and I guess I can't trigger the test.
…ex/whole file csv data source.
ok to test
Test build #73950 has finished for PR 17068 at commit
Test build #73970 has started for PR 17068 at commit
      } else {
        // If the first row could not be read, just return the empty schema.
        Some(StructType(Nil))
      }.take(1).headOption match {
IMHO, Option.isDefined with Option.get, Option.map with Option.getOrElse, and Option with match case Some ... case None might all be fine. But how about minimising the change by matching the code above to Option.isDefined with Option.get? Then it would not require the changes here.
I would leave it as it is, since pattern matching still looks a bit clearer than conditionals. If minimizing changes is critical, I can revert to the previous version here and replace pattern matching with conditionals in my fix. @cloud-fan, please advise.
I don't have a strong preference, this looks fine
All three patterns I mentioned are used across the code base. There is no style guide for this in either https://github.com/databricks/scala-style-guide or http://spark.apache.org/contributing.html.
In this case, matching the new code to other similar code is the better choice to reduce changed lines, rather than doing the opposite. Personal taste should be secondary.
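For illustration, the three Option patterns named above are interchangeable; a minimal sketch in plain Scala, with a hypothetical `maybeFirstRow` standing in for the first CSV line:

```scala
object OptionStyles extends App {
  val maybeFirstRow: Option[String] = None // e.g. an empty input file

  // 1. isDefined with get
  val a = if (maybeFirstRow.isDefined) maybeFirstRow.get else "empty"

  // 2. map with getOrElse
  val b = maybeFirstRow.map(row => row).getOrElse("empty")

  // 3. pattern matching
  val c = maybeFirstRow match {
    case Some(row) => row
    case None      => "empty"
  }

  assert(a == b && b == c) // all three yield the same result
  println(c)
}
```

Since all three behave identically, picking whichever form the surrounding file already uses keeps the diff small, which is the point being made here.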
#17068 (comment) did not show up when I wrote my comment. I am fine as is. I am not supposed to decide this.
@HyukjinKwon @cloud-fan many thanks for your effort. I really appreciate it and I will take it into account when working with the codebase.
Thank you both for bearing with me.
retest this please
Test build #73983 has finished for PR 17068 at commit
thanks, merging to master!
What changes were proposed in this pull request?
Bugfix for reading an empty file with the CSV data source. Instead of throwing NoSuchElementException, an empty data frame is returned.

How was this patch tested?
Added new unit test in
org.apache.spark.sql.execution.datasources.csv.CSVSuite
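With the fix, reading an empty file should behave roughly as follows. This is a sketch only, assuming a running SparkSession named `spark` and a hypothetical empty file at the given path; it is not code from the PR itself:

```scala
// Sketch: requires a Spark environment; the path is hypothetical.
val df = spark.read
  .format("csv")
  .option("header", true)
  .load("/path/to/empty.csv")

// Before the fix, schema inference raised NoSuchElementException here;
// after the fix, an empty DataFrame with an empty schema is returned.
assert(df.schema.isEmpty)
assert(df.count() == 0)
```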