Revert "[SPARK-26081][SPARK-29999]" #26671
Conversation
cc @dongjoon-hyun @gatorsmile @gengliangwang @zsxwing - this is the PR to revert both commits. Once this is merged, I'll raise a separate PR to add the UT. Please take a look. Thanks!
I skipped copying the full title of each PR, since we're reverting two issues and that would make the title very long.
Well, I was expecting this PR to come with UTs...
Yes. I also expected only a revert of SPARK-26081 (including the part of SPARK-29999 that builds on it).
Sorry, I'm not clear on the suggestion. It sounds like you're not suggesting a "clean revert of both issues", but rather a "dirty revert of SPARK-26081" - reverting SPARK-26081 plus the things that depend on it, like part of SPARK-29999 - right? I understood the suggestion as the former, since the community has preferred clean reverts in most cases, so I'd like to confirm. And if we go with the latter (clean revert of SPARK-26081 and partial revert of SPARK-29999), should the PR mention that it reverts SPARK-26081 only, or keep the title as it is?
For now I restored the UT, moved it to a place where it is checked against both DSv1 and DSv2, and modified the PR description. I left the PR title as it is, since committers can modify it directly if desired.
@HeartSaVioR But I am open to this and not strongly suggesting doing so. You can do the revert and add tests as a follow-up.
Test build #114435 has finished for PR 26671 at commit
Ah yes, you're right that it cannot be reverted cleanly - so there's physically no clean revert. Maybe I overthought this; I was thinking about how we deal with the JIRA issues for SPARK-26081/SPARK-29999. If we reopen them (at least SPARK-26081) and leave a chance to do the right fix, it'd be ideal to have a "minimized" commit reverting SPARK-26081 - so we can track how SPARK-26081 was introduced, reverted, and later re-introduced. If we'd rather abandon the original idea of SPARK-26081 and close the issue as won't-fix, any approach is OK with me. Btw, would you mind elaborating on the suggestion for the new UT?
I'm not familiar enough with the expectations/requirements on the file sink. I feel the UT from SPARK-29999 can stay with the reverting commit, as it tests the regression we introduced - we're reverting and adding a guard so we don't break it again. Is the new UT the same case - did SPARK-26081 break an expectation? If not, it sounds like it serves a different purpose.
OK, I think the current PR is good :)
@gengliangwang Thanks for understanding and bearing with me. :)
Test build #114439 has finished for PR 26671 at commit
retest this, please
Test build #114449 has finished for PR 26671 at commit
```scala
if (addedFiles.nonEmpty) {
  val fs = new Path(addedFiles.head).getFileSystem(taskContext.getConfiguration)
  val statuses: Seq[SinkFileStatus] =
    addedFiles.flatMap { f =>
```
So, this if-else looked like overhead, and it was added to avoid files not being written?
FWIW, there's still an old case where files are not written:
spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala, lines 285 to 297 at 73183b3:
```scala
private lazy val recordWriter: RecordWriter[NullWritable, Writable] = {
  recordWriterInstantiated = true
  new OrcOutputFormat().getRecordWriter(
    new Path(path).getFileSystem(context.getConfiguration),
    context.getConfiguration.asInstanceOf[JobConf],
    path,
    Reporter.NULL
  ).asInstanceOf[RecordWriter[NullWritable, Writable]]
}

override def write(row: InternalRow): Unit = {
  recordWriter.write(NullWritable.get(), serializer.serialize(row))
}
```
```scala
spark.conf.set("spark.sql.orc.impl", "hive")
spark.range(10).filter(_ => false).write.orc("test.orc")
```
But I suspect it's okay, since this behaviour will be completely superseded by the "native" implementation in the future.
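The lazy-instantiation pattern quoted above can be sketched in plain Scala. This is a hypothetical `LazyFileWriter`, not the real Spark class, using local files via `java.nio.file` instead of Hadoop's `FileSystem`: the output file only comes into existence on the first `write`, so a task that sees zero rows leaves no file behind.

```scala
import java.nio.file.{Files, Paths}

// Hypothetical sketch of the lazy-writer pattern used by the old Hive ORC
// path: the underlying file is only created when the first row is written.
class LazyFileWriter(path: String) {
  private var instantiated = false

  // Mirrors the `private lazy val recordWriter` above: creating the writer
  // (and hence the file) is deferred until first use.
  private lazy val out = {
    instantiated = true
    Files.newBufferedWriter(Paths.get(path))
  }

  def write(row: String): Unit = { out.write(row); out.newLine() }

  // Closing without any writes never touches the filesystem,
  // so an empty partition produces no file at all.
  def close(): Unit = if (instantiated) out.close()
}
```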
Thanks for the pointer. That would also break the streaming sink if there's an empty partition. Ideally it should be fixed for streaming queries as well, but I feel that's beyond the scope of this PR.
sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSinkSuite.scala
I would remove this line in a revert.
I'm taking this as OK to remove the line; please let me know if you meant removing the line in "changeset".
I think @HyukjinKwon means removing this line in the "changeset".
Not a big deal. Will do.
cc @cloud-fan as well, since I've talked with him about empty files here and there (e.g. #18654).
Now we at least write one file for each partition?
I guess you mean we write at least one file for each task? Yes, I think so, except the old Hive case (#26671 (comment)).
Yes, that has been the assumption of ManifestFileCommitProtocol: it provides the path of the temp file, and there's no interface to report whether the file was actually created at commit time. So we either assume the file was created, or we need to check its existence (that was SPARK-29999).
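To illustrate that trade-off, here is a minimal sketch of the two commit strategies - assuming every temp path handed out was actually written (the pre-SPARK-29999 behavior), versus paying an existence check per file (the SPARK-29999 fix). All names are hypothetical, and local files via `java.nio.file` stand in for Hadoop's `FileSystem`; this is not the real `ManifestFileCommitProtocol`.

```scala
import java.nio.file.{Files, NoSuchFileException, Paths}

// Hypothetical stand-in for the manifest entry Spark records per file.
final case class SinkFileStatus(path: String, size: Long)

// Pre-SPARK-29999: assume every added path exists. If a task with an empty
// partition never created its file, the stat call throws - the analogue of
// the FileNotFoundException from fs.getFileStatus.
def commitAssumingExists(addedFiles: Seq[String]): Seq[SinkFileStatus] =
  addedFiles.map(f => SinkFileStatus(f, Files.size(Paths.get(f))))

// SPARK-29999: tolerate missing files, at the cost of one existence
// check per file on every task commit.
def commitCheckingExistence(addedFiles: Seq[String]): Seq[SinkFileStatus] =
  addedFiles.flatMap { f =>
    val p = Paths.get(f)
    if (Files.exists(p)) Some(SinkFileStatus(f, Files.size(p))) else None
  }
```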
I'm fine to revert. One file per task is not that bad, and we won't have regressions.
No, in DSV1, only partition 0 will write an empty file: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L260
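The DSV1 rule linked above condenses to a single predicate (a simplification of `FileFormatWriter`'s behavior, with hypothetical parameter names): a task writes a file when it has rows, and partition 0 writes one even when empty, so an entirely empty result still produces some output.

```scala
// Simplified sketch of the DSV1 behavior in FileFormatWriter:
// non-empty partitions write a file; partition 0 writes one regardless.
def shouldWriteFile(partitionId: Int, hasRows: Boolean): Boolean =
  hasRows || partitionId == 0
```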
@HeartSaVioR Please update the PR description with more details.
@gengliangwang I added the reason for moving the UT. If you meant something else, would you mind being more specific in review comments?
@HeartSaVioR I have updated the PR description so that the context is more straightforward to developers.
Looks like we are mixing up two different problems, so let me define the problem properly: the root issue of SPARK-29999 was the "empty partition" case, not the "empty job" case. SPARK-26081 also optimizes the "empty partition" case. (If empty jobs are also affected, that's an unintended side-effect.) That's why I have been trying to decouple "empty job" from this PR, and why I said "different purpose" about the request to add a UT verifying empty jobs. Btw, I see what you meant about "details"; thanks for the detailed explanation! Let me update the description so that it says "empty partition", not "empty job".
Just updated the description.
HyukjinKwon left a comment:
Looks fine to me too
I will merge it once the Jenkins test passes.
Test build #114481 has finished for PR 26671 at commit
Thanks, merging to master
Thanks all for reviewing and merging!
Closes apache#26671 from HeartSaVioR/revert-SPARK-26081-SPARK-29999. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>. Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
### What changes were proposed in this pull request?
This reverts commit 31c4fab (#23052) to make sure the partition calling `ManifestFileCommitProtocol.newTaskTempFile` creates an actual file.

This also reverts part of commit 0d3d46d (#26639), since that commit fixes the issue raised from 31c4fab, which we're reverting. The reason for the partial revert is that the UT is worth keeping as-is to prevent regression, given it can detect the "empty partition -> no actual file" issue. This makes one more change to the UT: it is intentionally moved so that it tests both DSv1 and DSv2.
### Why are the changes needed?
After the changes in SPARK-26081 (commit 31c4fab / #23052), CSV/JSON/TEXT don't create an actual file if the partition is empty. This optimization causes a problem in `ManifestFileCommitProtocol`: the API `newTaskTempFile` is called without actual file creation, and then `fs.getFileStatus` throws `FileNotFoundException` since the file was never created.

SPARK-29999 (commit 0d3d46d / #26639) fixes the problem, but it is too costly to check file existence on each task commit. We should simply restore the behavior before SPARK-26081.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Jenkins build will follow.