[SPARK-19709][SQL] Read empty file with CSV data source #17068
Conversation
        caseSensitive: Boolean,
        options: CSVOptions): StructType = {
    -   val firstLine: String = CSVUtils.filterCommentAndEmpty(csv, options).first()
    +   val lines = CSVUtils.filterCommentAndEmpty(csv, options)
Hi @wojtek-szymanski, I think we should not rely on exception handling. I can think of take(1).headOption, but we could use a shorter one if you know any other good way. What do you think about this?
You are absolutely right. Relying on exception handling is smelly, while Option gives more opportunities. I also see no difference from a performance point of view, since both first() and take(1) call the same function, head(1).
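The trade-off under discussion can be sketched in plain Scala (names here are hypothetical, and `head` on a local collection stands in for `first()` on an RDD, since both throw on empty input):

```scala
object HeadOptionSketch extends App {
  val emptyLines: Seq[String] = Seq.empty

  // head, like RDD.first(), throws NoSuchElementException on empty input,
  // so the caller has to wrap it in a try/catch
  val viaException: Option[String] =
    try Some(emptyLines.head)
    catch { case _: NoSuchElementException => None }

  // take(1).headOption expresses the same intent without exceptions
  val viaOption: Option[String] = emptyLines.take(1).headOption

  assert(viaException == viaOption) // both are None for empty input
  println(viaOption)
}
```

Both forms ultimately read at most one element, so the choice is about style and error handling rather than performance.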
@HyukjinKwon, can somebody allow testing of this PR, as there are no more comments?
      .take(1)
      .headOption
      .map(firstLine => infer(sparkSession, parsedOptions, csv, firstLine))
      .orElse(Some(StructType(Seq())))
Could we maybe just match it to CSVDataSource.scala#L204-L224, just for consistency for now?
Personally, I think such chaining sometimes makes the code hard to read. Maybe we could consider the code de-duplication for this in another PR. It would be easier if they look similar at least.
I would suggest that we use pattern matching in order to make it more expressive and avoid code like this:

    if (maybeFirstRow.isDefined) {
      val firstRow = maybeFirstRow.get

I also touched WholeFileCSVDataSource to unify both implementations. What's your opinion?
Regarding code de-duplication, I fully agree that it should be done in a separate PR.
      assert(result.schema.fieldNames.size === 1)
    }

    test("test with empty file without schema") {
Let's re-use the test in CSVSuite.scala#L1083.
We could..
    test("Empty file produces empty dataframe with empty schema") {
      Seq(false, true).foreach { wholeFile =>
        val df = spark.read.format("csv")
          .option("header", true)
          .option("wholeFile", wholeFile)
          .load(testFile(emptyFile))
        assert(df.schema === spark.emptyDataFrame.schema)
        checkAnswer(df, spark.emptyDataFrame)
      }
    }
Good idea, done
Hi @wojtek-szymanski, these are all from me. Let me cc @cloud-fan, as my PRs related to this were reviewed by him, and I guess I can't trigger the test.
…ex/whole file csv data source.
ok to test
Test build #73950 has finished for PR 17068 at commit
Test build #73970 has started for PR 17068 at commit
      } else {
        // If the first row could not be read, just return the empty schema.
        Some(StructType(Nil))
      }.take(1).headOption match {
IMHO, Option.isDefined with Option.get, Option.map with Option.getOrElse, and Option with match case Some ... case None might all be fine. But how about minimising the change by matching the code above to Option.isDefined with Option.get? Then it would not require the changes here.
I would leave it as it is, since pattern matching still looks a bit clearer than conditionals. If minimizing changes is critical, I can revert to the previous version here and replace pattern matching with conditionals in my fix. @cloud-fan, please advise.
I don't have a strong preference, this looks fine
All three patterns I mentioned are used across the code base. There is no style guide for this in either https://github.com/databricks/scala-style-guide or http://spark.apache.org/contributing.html.
In this case, matching the new code to other similar code is the better choice to reduce changed lines, rather than doing the opposite. Personal taste should be secondary.
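For illustration, the three Option patterns named above are interchangeable; a minimal sketch in plain Scala, with a hypothetical `maybeFirstRow` standing in for the first CSV line:

```scala
object OptionStyles extends App {
  val maybeFirstRow: Option[String] = None // e.g. an empty input file

  // 1. isDefined with get
  val a = if (maybeFirstRow.isDefined) maybeFirstRow.get else "empty"

  // 2. map with getOrElse
  val b = maybeFirstRow.map(row => row).getOrElse("empty")

  // 3. pattern matching
  val c = maybeFirstRow match {
    case Some(row) => row
    case None      => "empty"
  }

  assert(a == b && b == c) // all three yield the same result
  println(c)
}
```

Since all three behave identically, picking whichever form the surrounding file already uses keeps the diff small, which is the point being made here.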
#17068 (comment) did not show up when I wrote my comment. I am fine as is. I am not supposed to decide this.
@HyukjinKwon @cloud-fan many thanks for your effort. I really appreciate it and I will take it into account when working with the codebase.
Thank you both for bearing with me.
retest this please
Test build #73983 has finished for PR 17068 at commit
thanks, merging to master!
What changes were proposed in this pull request?
Bugfix for reading an empty file with the CSV data source. Instead of throwing NoSuchElementException, an empty data frame is returned.

How was this patch tested?
Added new unit test in
org.apache.spark.sql.execution.datasources.csv.CSVSuite
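With the fix, reading an empty file should behave roughly as follows. This is a sketch only, assuming a running SparkSession named `spark` and a hypothetical empty file at the given path; it is not code from the PR itself:

```scala
// Sketch: requires a Spark environment; the path is hypothetical.
val df = spark.read
  .format("csv")
  .option("header", true)
  .load("/path/to/empty.csv")

// Before the fix, schema inference raised NoSuchElementException here;
// after the fix, an empty DataFrame with an empty schema is returned.
assert(df.schema.isEmpty)
assert(df.count() == 0)
```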