@dongjoon-hyun
Member

What changes were proposed in this pull request?

Since SPARK-8501, Spark doesn't create an ORC file for empty data sets. However, SPARK-21669 tries to get the length of the written file at the end of each write task and fails with FileNotFoundException when no file was created. This is a regression in 2.3.0 only. We should fix it and add a test case to prevent future regressions.

scala> Seq("str").toDS.limit(0).write.format("orc").save("/tmp/a")
17/10/11 19:28:59 ERROR Utils: Aborting task
java.io.FileNotFoundException: File file:/tmp/a/_temporary/0/_temporary/attempt_20171011192859_0000_m_000000_0/part-00000-aa56c3cf-ec35-48f1-bb73-23ad1480e917-c000.snappy.orc does not exist
	at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
	at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
	at org.apache.spark.sql.execution.datasources.BasicWriteTaskStatsTracker.getFileSize(BasicWriteStatsTracker.scala:60)
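The failure above comes from asking for the size of a file that was never created. As a rough illustration only, and not the actual change in this PR or in #18979, one way to tolerate a missing output file is to return an `Option` instead of letting the exception propagate. The sketch below uses `java.nio.file` rather than Hadoop's `FileSystem` so that it runs standalone; `FileSizeSketch` and `fileSizeIfExists` are hypothetical names, not Spark APIs.

```scala
import java.nio.file.{Files, NoSuchFileException, Path}

object FileSizeSketch {
  // Hypothetical helper (not Spark's API): returns the size of `path` in bytes,
  // or None when the writer never created the file, e.g. for an empty data set.
  def fileSizeIfExists(path: Path): Option[Long] =
    try Some(Files.size(path))
    catch { case _: NoSuchFileException => None }

  def main(args: Array[String]): Unit = {
    val dir = Files.createTempDirectory("empty-write-sketch")
    // A task that actually wrote data: three bytes on disk.
    val written = Files.write(dir.resolve("part-00000"), "str".getBytes("UTF-8"))
    // A task that wrote nothing: the file does not exist at all.
    val missing = dir.resolve("part-00001")

    println(fileSizeIfExists(written)) // Some(3)
    println(fileSizeIfExists(missing)) // None
  }
}
```

With this shape, a stats tracker could treat `None` as "no bytes written" rather than aborting the task. Whether that is the right policy for `BasicWriteTaskStatsTracker` is a design decision for the real fix.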

How was this patch tested?

Pass the newly added test cases.

@dongjoon-hyun
Member Author

Hi, @gatorsmile and @cloud-fan.
This is a regression introduced by SPARK-21669 (Internal API for collecting metrics/stats during FileFormatWriter jobs) in Spark 2.3.0. Could you review this PR?

@viirya
Member

viirya commented Oct 12, 2017

@dongjoon-hyun This is more or less a duplicate of #18979, although it approaches the issue from a different angle.

}
}

Seq("orc", "parquet", "csv", "json", "text").foreach { format =>
Member

@viirya viirya Oct 12, 2017

This test case seems worth merging. cc @steveloughran: shall we include this test in #18979?

Member Author

@dongjoon-hyun dongjoon-hyun Oct 12, 2017

+1. Please do, @steveloughran. :)

@dongjoon-hyun
Member Author

Wow, there is already a PR for that. Thank you for letting me know, @viirya! Then we're good.

@dongjoon-hyun dongjoon-hyun deleted the SPARK-22258 branch October 12, 2017 03:41
@SparkQA

SparkQA commented Oct 12, 2017

Test build #82654 has finished for PR 19477 at commit b545f28.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

steveloughran added a commit to steveloughran/spark that referenced this pull request Oct 13, 2017
This is going to create merge conflict with this branch until I rebase it, which I'm about to

Change-Id: Ie2309066ad7892cb20155d9de8248c1682bba526