
Conversation

@wojtek-szymanski
Contributor

What changes were proposed in this pull request?

Bugfix for reading an empty file with the CSV data source. Instead of throwing a NoSuchElementException, an empty DataFrame is returned.
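For illustration, a minimal sketch of the behavior this fix targets; the local session and the zero-byte "empty.csv" path below are hypothetical, not part of the patch:

    import org.apache.spark.sql.SparkSession

    // Hypothetical local session and a zero-byte "empty.csv" file, purely for illustration.
    val spark = SparkSession.builder().master("local[*]").appName("empty-csv-sketch").getOrCreate()

    // Before this fix, schema inference called first() on the filtered lines and threw
    // NoSuchElementException for an empty file; with the fix, an empty DataFrame with an
    // empty schema is returned instead.
    val df = spark.read.format("csv").option("header", true).load("empty.csv")
    assert(df.schema.isEmpty)
    assert(df.count() == 0)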

How was this patch tested?

Added a new unit test in org.apache.spark.sql.execution.datasources.csv.CSVSuite.

caseSensitive: Boolean,
options: CSVOptions): StructType = {
val firstLine: String = CSVUtils.filterCommentAndEmpty(csv, options).first()
val lines = CSVUtils.filterCommentAndEmpty(csv, options)
Member

Hi @wojtek-szymanski, I think we should not rely on exception handling. I can think of take(1).headOption, but we could use a shorter one if you know any other good way. What do you think about this?
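A rough sketch of the difference on a toy Dataset[String]; the session and empty dataset below are illustrative only, not the actual patch:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").appName("headOption-sketch").getOrCreate()
    import spark.implicits._

    // Stands in for the filtered CSV lines of an empty file.
    val lines = spark.emptyDataset[String]

    // lines.first()                // throws NoSuchElementException on empty input
    val maybeFirst: Option[String] = lines.take(1).headOption   // None on empty input, no try/catch needed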

Contributor Author

You are absolutely right. Relying on exception handling is smelly, while an Option gives more flexibility. I also see no difference from a performance point of view, since both first() and take(1) call the same function, head(1).

@wojtek-szymanski
Contributor Author

@HyukjinKwon can somebody trigger the tests for this PR, as there are no more comments?

.take(1)
.headOption
.map(firstLine => infer(sparkSession, parsedOptions, csv, firstLine))
.orElse(Some(StructType(Seq())))
Member

Could we maybe match it to CSVDataSource.scala#L204-L224, just for consistency for now?

Personally, I think such chaining sometimes makes the code hard to read. Maybe we could consider code de-duplication for this in another PR. It would be easier if they at least looked similar.

Contributor Author

I would suggest that we use pattern matching in order to make it more expressive and avoid code like this:

    if (maybeFirstRow.isDefined) {
        val firstRow = maybeFirstRow.get

I also touched WholeFileCSVDataSource to unify both implementations. What's your opinion?
Regarding code de-duplication, I fully agree that it should be done in a separate PR.
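For contrast, a self-contained sketch of the pattern-matching shape suggested here; the helper name and its placeholder inference are hypothetical, not the actual patch:

    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    // Hypothetical helper: handle the Option with a match instead of isDefined/get.
    def schemaFor(maybeFirstRow: Option[Array[String]]): StructType = maybeFirstRow match {
      case Some(firstRow) =>
        // A real first row: build one placeholder string field per column.
        StructType(firstRow.indices.map(i => StructField(s"_c$i", StringType)))
      case None =>
        // No rows at all: return an empty schema, matching the behavior of the fix.
        StructType(Nil)
    }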

assert(result.schema.fieldNames.size === 1)
}

test("test with empty file without schema") {
Member

@HyukjinKwon commented Mar 4, 2017

Let's re-use the test in CSVSuite.scala#L1083.

We could do something like this:

  test("Empty file produces empty dataframe with empty schema") {
    Seq(false, true).foreach { wholeFile =>
      val df = spark.read.format("csv")
        .option("header", true)
        .option("wholeFile", wholeFile)
        .load(testFile(emptyFile))

        assert(df.schema === spark.emptyDataFrame.schema)
        checkAnswer(df, spark.emptyDataFrame)
      }
    }
  }

Contributor Author

Good idea, done

@HyukjinKwon
Member

Hi @wojtek-szymanski, these are all the comments from me. Let me cc @cloud-fan, as my PRs related to this were reviewed by him and I guess I can't trigger the test myself.

@cloud-fan
Contributor

ok to test

@SparkQA

SparkQA commented Mar 6, 2017

Test build #73950 has finished for PR 17068 at commit b9e9a84.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 6, 2017

Test build #73970 has started for PR 17068 at commit e7faa80.

} else {
// If the first row could not be read, just return the empty schema.
Some(StructType(Nil))
}.take(1).headOption match {
Member

IMHO, Option.isDefined with Option.get, Option.map with Option.getOrElse, and Option with match case Some ... case None might all be fine. But how about minimising the change by matching the code above to Option.isDefined with Option.get? Then it would not require the changes here.
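The three styles mentioned above, sketched side by side on a plain Option; the toy values are illustrative only, not from the patch:

    val maybeLine: Option[String] = Some("a,b,c")

    // 1. Option.isDefined with Option.get
    val n1 = if (maybeLine.isDefined) maybeLine.get.split(",").length else 0

    // 2. Option.map with Option.getOrElse
    val n2 = maybeLine.map(_.split(",").length).getOrElse(0)

    // 3. match with case Some ... case None
    val n3 = maybeLine match {
      case Some(line) => line.split(",").length
      case None       => 0
    }

    assert(n1 == n2 && n2 == n3)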

Contributor Author

I would leave it as it is, since pattern matching still looks a bit clearer than conditionals. If minimizing changes is critical, I can revert to the previous version here and replace pattern matching with conditionals in my fix; @cloud-fan, please advise.

Contributor

I don't have a strong preference; this looks fine.

Member

All three patterns I mentioned are used across the code base, and there is no style guide for this in either https://github.com/databricks/scala-style-guide or http://spark.apache.org/contributing.html.

In this case, matching the new code to other similar code, rather than doing the opposite, is the better choice because it reduces the number of changed lines. Personal taste should be secondary.

Member

#17068 (comment) did not show up when I wrote my comment. I am fine with it as is; I am not the one who should decide this.

Contributor Author

@HyukjinKwon @cloud-fan many thanks for your effort. I really appreciate it and I will take it into account when working with the codebase.

Member

Thank you both for bearing with me.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Mar 6, 2017

Test build #73983 has finished for PR 17068 at commit e7faa80.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@asfgit closed this in f6471dc on Mar 6, 2017