Skip to content

Conversation

@ypcat
Copy link
Contributor

@ypcat ypcat commented Mar 16, 2015

Add a DirectParquetOutputCommitter class that skips _temporary directory when saving to s3. To enable it, set hadoop configuration "spark.sql.parquet.output.committer.class" to "org.apache.spark.sql.parquet.DirectParquetOutputCommitter".

@ash211
Copy link
Contributor

ash211 commented Mar 16, 2015

Jenkins this is ok to test

@SparkQA
Copy link

SparkQA commented Mar 16, 2015

Test build #28650 has finished for PR 5042 at commit 769bd67.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DirectParquetOutputCommitter(outputPath: Path, context: TaskAttemptContext)

@shaneknapp
Copy link
Contributor

jenkins, test this please

@SparkQA
Copy link

SparkQA commented Mar 16, 2015

Test build #28653 has finished for PR 5042 at commit 769bd67.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DirectParquetOutputCommitter(outputPath: Path, context: TaskAttemptContext)

@SparkQA
Copy link

SparkQA commented Mar 17, 2015

Test build #28709 has finished for PR 5042 at commit c42468c.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 17, 2015

Test build #28712 has finished for PR 5042 at commit 0d540b9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Pei-Lun Lee added 2 commits March 23, 2015 18:32
Add a new configuration key: spark.sql.parquet.output.committer.class
which should be a sub-class of ParquetOutputCommitter
Conflicts:
	sql/core/src/test/scala/org/apache/spark/sql/parquet/ParquetIOSuite.scala
@ypcat
Copy link
Contributor Author

ypcat commented Mar 23, 2015

I change the design to allow more general usage. User can set "spark.sql.parquet.output.committer.class" to a class extending ParquetOutputFormat.
I still include the implementation of DirectParquetOutputCommitter, if it is unwanted, we can move it to be only used in test case.

@SparkQA
Copy link

SparkQA commented Mar 23, 2015

Test build #28993 has finished for PR 5042 at commit e17bf47.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 23, 2015

Test build #28992 has finished for PR 5042 at commit 9ae7545.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@liancheng
Copy link
Contributor

Hey @ypcat, sorry for the late review. If I understand this correctly, with DirectParquetOutputCommitter, all tasks writes to the final output directory directly. Then what if speculative tasks are scheduled? This may end up with two tasks trying to write to the same file simultaneously, right?

@liancheng
Copy link
Contributor

At least we should state explicitly that speculative tasks should be disabled when using the direct output committer.

@JoshRosen
Copy link
Contributor

@liancheng, if this is intended only for writing directly to S3, then I think we purposely want to avoid writing to _temporary because the move / rename of the temporary file to the final file is expensive and because S3 PUT operations are atomic. If you have two tasks racing to write to the same file, then the last writer should win and we shouldn't have to worry about partially-written files. As with DirectOutputCommitter, this is not safe for use on filesystems where file creation / overwrite isn't atomic, such as HDFS, but that's not a use-case that we have to worry about here (I think this is only for use by advanced users who know that they're writing directly to S3).

@ypcat
Copy link
Contributor Author

ypcat commented Apr 13, 2015

@liancheng this DirectParquetOutputCommitter is based on DirectOutputCommitter in this thread and was intended to use on s3. I was not aware of the problem of speculation. I think we can either:

  1. exclude DirectParquetOutputCommitter and let user implement their own like DirectOutputCommitter.
  2. write down the consequences in documentation as @JoshRosen described.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we check for mapreduce.output.fileoutputformat.outputdir instead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see, this is for Hadoop 1 compatibility.

@liancheng
Copy link
Contributor

@JoshRosen @ypcat Thanks for the detailed explanation, I didn't realize S3 PUT operations are atomic at first. This LGTM. Merging to master.

@ypcat Thanks again for working on this!

@asfgit asfgit closed this in b29663e Apr 13, 2015
@JoshRosen
Copy link
Contributor

We can't change this now, but note that the commit message / PR description here is slightly out of date w.r.t. the actual changes being committed, since we now use the spark.sql.parquet.output.committer.class configuration instead of spark.sql.parquet.useDirectParquetOutputCommitter.

We may want to document this configuration at some point; if we do, I think the right place is the Parquet Configuration section in the SQL programming guide: https://spark.apache.org/docs/latest/sql-programming-guide.html#configuration

@JoshRosen
Copy link
Contributor

Actually, on giving this a closer look I'm not sure whether this faithfully respects all of the Parquet configurations for controlling OutputCommitter behavior. For instance, it looks like there's a ParquetOutputFormat.ENABLE_JOB_SUMMARY configuration that can be used to suppress the writing of the summary file (https://github.com/Parquet/parquet-mr/blame/fa8957d7939b59e8d391fa17000b34e865de015d/parquet-hadoop/src/main/java/parquet/hadoop/ParquetOutputCommitter.java#L48). I don't really know much about Parquet, so I can't really speak to whether this is a configuration that would be commonly used, but we should probably try to support it.

@ypcat
Copy link
Contributor Author

ypcat commented Apr 14, 2015

@JoshRosen I updated PR description. It looks like the parquet code base I used was too old that it did not include the ParquetOutputFormat.ENABLE_JOB_SUMMARY thing. I will add support to this.

@JoshRosen
Copy link
Contributor

It looks like this patch broke the Hadoop 1.x build because the test code uses Configuration.unset, which isn't present in Hadoop 1.x:

[info] Compiling 1 Scala source to /home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop1.0/label/centos/examples/target/scala-2.10/classes...
[error] /home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop1.0/label/centos/sql/core/src/test/scala/org/apache/spark/sql/parquet/ParquetIOSuite.scala:402: value unset is not a member of org.apache.hadoop.conf.Configuration
[error]       configuration.unset("spark.sql.parquet.output.committer.class")
[error]                     ^
[error] one error found

To unbreak the build, I'm going to revert this patch and re-open the JIRA. Let's open a new PR to fix this. At the same time, we can also address the ENABLE_JOB_SUMMARY flag.

@ypcat
Copy link
Contributor Author

ypcat commented Apr 15, 2015

I cannot find a way to unset a config value in hadoop 1.x API. The closest thing is to set it to a default value, which I think should be fine in test code.
And I found I cannot add more commits to this PR since it is closed. Should we reopen it or use a new PR?

@ash211
Copy link
Contributor

ash211 commented Apr 15, 2015

I think it's best to open a new PR

@ypcat
Copy link
Contributor Author

ypcat commented Apr 15, 2015

New PR is #5525

asfgit pushed a commit that referenced this pull request Apr 28, 2015
Add new config "spark.sql.parquet.output.committer.class" to allow custom parquet output committer and an output committer class specific to use on s3.
Fix compilation error introduced by #5042.
Respect ParquetOutputFormat.ENABLE_JOB_SUMMARY flag.

Author: Pei-Lun Lee <pllee@appier.com>

Closes #5525 from ypcat/spark-6352 and squashes the following commits:

54c6b15 [Pei-Lun Lee] error handling
472870e [Pei-Lun Lee] add back custom parquet output committer
ddd0f69 [Pei-Lun Lee] Merge branch 'master' of https://github.com/apache/spark into spark-6352
9ece5c5 [Pei-Lun Lee] compatibility with hadoop 1.x
8413fcd [Pei-Lun Lee] Merge branch 'master' of https://github.com/apache/spark into spark-6352
fe65915 [Pei-Lun Lee] add support for parquet config parquet.enable.summary-metadata
e17bf47 [Pei-Lun Lee] Merge branch 'master' of https://github.com/apache/spark into spark-6352
9ae7545 [Pei-Lun Lee] [SPARL-6352] [SQL] Change to allow custom parquet output committer.
0d540b9 [Pei-Lun Lee] [SPARK-6352] [SQL] add license
c42468c [Pei-Lun Lee] [SPARK-6352] [SQL] add test case
0fc03ca [Pei-Lun Lee] [SPARK-6532] [SQL] hide class DirectParquetOutputCommitter
769bd67 [Pei-Lun Lee] DirectParquetOutputCommitter
f75e261 [Pei-Lun Lee] DirectParquetOutputCommitter
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 14, 2015
Add new config "spark.sql.parquet.output.committer.class" to allow custom parquet output committer and an output committer class specific to use on s3.
Fix compilation error introduced by apache#5042.
Respect ParquetOutputFormat.ENABLE_JOB_SUMMARY flag.

Author: Pei-Lun Lee <pllee@appier.com>

Closes apache#5525 from ypcat/spark-6352 and squashes the following commits:

54c6b15 [Pei-Lun Lee] error handling
472870e [Pei-Lun Lee] add back custom parquet output committer
ddd0f69 [Pei-Lun Lee] Merge branch 'master' of https://github.com/apache/spark into spark-6352
9ece5c5 [Pei-Lun Lee] compatibility with hadoop 1.x
8413fcd [Pei-Lun Lee] Merge branch 'master' of https://github.com/apache/spark into spark-6352
fe65915 [Pei-Lun Lee] add support for parquet config parquet.enable.summary-metadata
e17bf47 [Pei-Lun Lee] Merge branch 'master' of https://github.com/apache/spark into spark-6352
9ae7545 [Pei-Lun Lee] [SPARL-6352] [SQL] Change to allow custom parquet output committer.
0d540b9 [Pei-Lun Lee] [SPARK-6352] [SQL] add license
c42468c [Pei-Lun Lee] [SPARK-6352] [SQL] add test case
0fc03ca [Pei-Lun Lee] [SPARK-6532] [SQL] hide class DirectParquetOutputCommitter
769bd67 [Pei-Lun Lee] DirectParquetOutputCommitter
f75e261 [Pei-Lun Lee] DirectParquetOutputCommitter
mccheah pushed a commit to palantir/spark that referenced this pull request May 14, 2015
Add new config "spark.sql.parquet.output.committer.class" to allow custom parquet output committer and an output committer class specific to use on s3.
Fix compilation error introduced by apache#5042.
Respect ParquetOutputFormat.ENABLE_JOB_SUMMARY flag.

Author: Pei-Lun Lee <pllee@appier.com>

Closes apache#5525 from ypcat/spark-6352 and squashes the following commits:

54c6b15 [Pei-Lun Lee] error handling
472870e [Pei-Lun Lee] add back custom parquet output committer
ddd0f69 [Pei-Lun Lee] Merge branch 'master' of https://github.com/apache/spark into spark-6352
9ece5c5 [Pei-Lun Lee] compatibility with hadoop 1.x
8413fcd [Pei-Lun Lee] Merge branch 'master' of https://github.com/apache/spark into spark-6352
fe65915 [Pei-Lun Lee] add support for parquet config parquet.enable.summary-metadata
e17bf47 [Pei-Lun Lee] Merge branch 'master' of https://github.com/apache/spark into spark-6352
9ae7545 [Pei-Lun Lee] [SPARL-6352] [SQL] Change to allow custom parquet output committer.
0d540b9 [Pei-Lun Lee] [SPARK-6352] [SQL] add license
c42468c [Pei-Lun Lee] [SPARK-6352] [SQL] add test case
0fc03ca [Pei-Lun Lee] [SPARK-6532] [SQL] hide class DirectParquetOutputCommitter
769bd67 [Pei-Lun Lee] DirectParquetOutputCommitter
f75e261 [Pei-Lun Lee] DirectParquetOutputCommitter

Conflicts:
	sql/core/src/test/scala/org/apache/spark/sql/parquet/ParquetIOSuite.scala
mccheah pushed a commit to palantir/spark that referenced this pull request May 15, 2015
Add new config "spark.sql.parquet.output.committer.class" to allow custom parquet output committer and an output committer class specific to use on s3.
Fix compilation error introduced by apache#5042.
Respect ParquetOutputFormat.ENABLE_JOB_SUMMARY flag.

Author: Pei-Lun Lee <pllee@appier.com>

Closes apache#5525 from ypcat/spark-6352 and squashes the following commits:

54c6b15 [Pei-Lun Lee] error handling
472870e [Pei-Lun Lee] add back custom parquet output committer
ddd0f69 [Pei-Lun Lee] Merge branch 'master' of https://github.com/apache/spark into spark-6352
9ece5c5 [Pei-Lun Lee] compatibility with hadoop 1.x
8413fcd [Pei-Lun Lee] Merge branch 'master' of https://github.com/apache/spark into spark-6352
fe65915 [Pei-Lun Lee] add support for parquet config parquet.enable.summary-metadata
e17bf47 [Pei-Lun Lee] Merge branch 'master' of https://github.com/apache/spark into spark-6352
9ae7545 [Pei-Lun Lee] [SPARL-6352] [SQL] Change to allow custom parquet output committer.
0d540b9 [Pei-Lun Lee] [SPARK-6352] [SQL] add license
c42468c [Pei-Lun Lee] [SPARK-6352] [SQL] add test case
0fc03ca [Pei-Lun Lee] [SPARK-6532] [SQL] hide class DirectParquetOutputCommitter
769bd67 [Pei-Lun Lee] DirectParquetOutputCommitter
f75e261 [Pei-Lun Lee] DirectParquetOutputCommitter
christopherbozeman pushed a commit to christopherbozeman/spark that referenced this pull request May 25, 2015
Add new config "spark.sql.parquet.output.committer.class" to allow custom parquet output committer and an output committer class specific to use on s3.
Fix compilation error introduced by apache#5042.
Respect ParquetOutputFormat.ENABLE_JOB_SUMMARY flag.

Author: Pei-Lun Lee <pllee@appier.com>

Closes apache#5525 from ypcat/spark-6352 and squashes the following commits:

54c6b15 [Pei-Lun Lee] error handling
472870e [Pei-Lun Lee] add back custom parquet output committer
ddd0f69 [Pei-Lun Lee] Merge branch 'master' of https://github.com/apache/spark into spark-6352
9ece5c5 [Pei-Lun Lee] compatibility with hadoop 1.x
8413fcd [Pei-Lun Lee] Merge branch 'master' of https://github.com/apache/spark into spark-6352
fe65915 [Pei-Lun Lee] add support for parquet config parquet.enable.summary-metadata
e17bf47 [Pei-Lun Lee] Merge branch 'master' of https://github.com/apache/spark into spark-6352
9ae7545 [Pei-Lun Lee] [SPARL-6352] [SQL] Change to allow custom parquet output committer.
0d540b9 [Pei-Lun Lee] [SPARK-6352] [SQL] add license
c42468c [Pei-Lun Lee] [SPARK-6352] [SQL] add test case
0fc03ca [Pei-Lun Lee] [SPARK-6532] [SQL] hide class DirectParquetOutputCommitter
769bd67 [Pei-Lun Lee] DirectParquetOutputCommitter
f75e261 [Pei-Lun Lee] DirectParquetOutputCommitter
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
Add new config "spark.sql.parquet.output.committer.class" to allow custom parquet output committer and an output committer class specific to use on s3.
Fix compilation error introduced by apache#5042.
Respect ParquetOutputFormat.ENABLE_JOB_SUMMARY flag.

Author: Pei-Lun Lee <pllee@appier.com>

Closes apache#5525 from ypcat/spark-6352 and squashes the following commits:

54c6b15 [Pei-Lun Lee] error handling
472870e [Pei-Lun Lee] add back custom parquet output committer
ddd0f69 [Pei-Lun Lee] Merge branch 'master' of https://github.com/apache/spark into spark-6352
9ece5c5 [Pei-Lun Lee] compatibility with hadoop 1.x
8413fcd [Pei-Lun Lee] Merge branch 'master' of https://github.com/apache/spark into spark-6352
fe65915 [Pei-Lun Lee] add support for parquet config parquet.enable.summary-metadata
e17bf47 [Pei-Lun Lee] Merge branch 'master' of https://github.com/apache/spark into spark-6352
9ae7545 [Pei-Lun Lee] [SPARL-6352] [SQL] Change to allow custom parquet output committer.
0d540b9 [Pei-Lun Lee] [SPARK-6352] [SQL] add license
c42468c [Pei-Lun Lee] [SPARK-6352] [SQL] add test case
0fc03ca [Pei-Lun Lee] [SPARK-6532] [SQL] hide class DirectParquetOutputCommitter
769bd67 [Pei-Lun Lee] DirectParquetOutputCommitter
f75e261 [Pei-Lun Lee] DirectParquetOutputCommitter
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants