[SPARK-18024][SQL] Introduce an internal commit protocol API #15696
Conversation
cc @ericl
Test build #67818 has finished for PR 15696 at commit
Test build #67832 has finished for PR 15696 at commit
Test build #67835 has finished for PR 15696 at commit
```diff
  override def getDefaultWorkFile(context: TaskAttemptContext, extension: String): Path = {
-   new Path(stagingDir, fileNamePrefix)
+   new Path(path)
```
+ extension?
Nope -- no more extension coming from Hadoop.
(This is now specified explicitly in OutputWriterFactory.getFileExtension)
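For context, here is a minimal sketch of where the extension now comes from, assuming the `getFileExtension` hook named above; the class body and the Parquet example are illustrative, not this PR's exact code:

```scala
import org.apache.hadoop.mapreduce.TaskAttemptContext

// Sketch: the data source's writer factory, rather than Hadoop's work-file
// machinery, decides the extension of each output file.
abstract class OutputWriterFactory extends Serializable {
  // e.g. a Parquet source might return ".snappy.parquet" (illustrative)
  def getFileExtension(context: TaskAttemptContext): String
}
```

Presumably this value then feeds the `ext` argument of `addTaskTempFile` discussed below, which is why `getDefaultWorkFile` can return the bare path.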
Test build #67838 has finished for PR 15696 at commit
Test build #67840 has finished for PR 15696 at commit
```scala
/**
 * An interface to define how a Spark job commits its outputs. Implementations must be serializable.
```
add: since the same committer instance set up on the driver will be used for tasks.
| * The "dir" parameter specifies 2, and "ext" parameter specifies both 4 and 5, and the rest | ||
| * are left to the commit protocol implementation to decide. | ||
| */ | ||
| def addTaskTempFile(taskContext: TaskAttemptContext, dir: Option[String], ext: String): String |
s/add/new?
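Pulling the visible pieces of this diff together, the protocol appears to be roughly the following trait. This is a sketch assembled from the signatures quoted in this thread; the job-level methods and the `TaskCommitMessage` type are assumptions based on the surrounding discussion, not the exact merged code:

```scala
import org.apache.hadoop.mapreduce.{JobContext, TaskAttemptContext}

// Sketch only: serializable because the instance set up on the driver is
// shipped to executors (see the comment above).
abstract class CommitProtocol extends Serializable {
  // Driver side: bracket the whole job.
  def setupJob(jobContext: JobContext): Unit
  def commitJob(jobContext: JobContext, taskCommits: Seq[TaskCommitMessage]): Unit
  def abortJob(jobContext: JobContext): Unit

  // Executor side: bracket each task's writes.
  def setupTask(taskContext: TaskAttemptContext): Unit
  def addTaskTempFile(taskContext: TaskAttemptContext, dir: Option[String], ext: String): String
  def commitTask(taskContext: TaskAttemptContext): TaskCommitMessage
  def abortTask(taskContext: TaskAttemptContext): Unit
}

// Assumed carrier for whatever a successful task commit reports to the driver.
class TaskCommitMessage(val payload: Any) extends Serializable
```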
```scala
/**
 * Notifies the commit protocol to add a new file, and gets back the full path that should be
 * used. Must be called on the executors when running tasks.
```
add: Note that the returned temp file may have an arbitrary path. The commit protocol only promises that the file will be at the location specified by the arguments after job commit.
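To make the suggested note concrete, a hypothetical task-side use (the helper and names are invented; `CommitProtocol` is the trait sketched above):

```scala
import org.apache.hadoop.mapreduce.TaskAttemptContext

// Illustrative only: write one task's output through the commit protocol.
def writeTaskOutput(committer: CommitProtocol, taskContext: TaskAttemptContext,
                    lines: Iterator[String]): Unit = {
  // `dir = None` targets the output root; `ext` is the data source's extension.
  val tempFile = committer.addTaskTempFile(taskContext, dir = None, ext = ".txt")
  // `tempFile` may point anywhere (e.g. into a staging directory); only after
  // a successful job commit is the data promised to be at its final location.
  val out = new java.io.PrintWriter(tempFile)
  try lines.foreach(out.println) finally out.close()
}
```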
```scala
/**
 * Aborts a task after the writes have failed. Must be called on the executors when running tasks.
 */
def abortTask(taskContext: TaskAttemptContext): Unit
```
Is this also best-effort?
```scala
 *
 * Unlike Hadoop's OutputCommitter, this implementation is serializable.
 */
class MapReduceFileCommitterProtocol(path: String, isAppend: Boolean)
```
Should we call this HadoopCommitProtocolWrapper or something to be more clear?
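Whatever the final name, the serializability point in the class comment is the interesting part. Here is a minimal sketch of how that can work, with the non-serializable Hadoop committer held in a `@transient` field and rebuilt from the context on each node; this is an assumed structure, not the PR's exact code, and `isAppend` handling is omitted:

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.{JobContext, JobStatus, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter

class MapReduceFileCommitterProtocol(path: String, isAppend: Boolean)
  extends CommitProtocol {

  // Never serialized; recreated from the Hadoop context on whichever node runs.
  @transient private var committer: FileOutputCommitter = _

  override def setupJob(jobContext: JobContext): Unit =
    new FileOutputCommitter(new Path(path), jobContext).setupJob(jobContext)

  override def commitJob(jobContext: JobContext, taskCommits: Seq[TaskCommitMessage]): Unit =
    new FileOutputCommitter(new Path(path), jobContext).commitJob(jobContext)

  override def abortJob(jobContext: JobContext): Unit =
    new FileOutputCommitter(new Path(path), jobContext)
      .abortJob(jobContext, JobStatus.State.FAILED)

  override def setupTask(taskContext: TaskAttemptContext): Unit = {
    committer = new FileOutputCommitter(new Path(path), taskContext)
    committer.setupTask(taskContext)
  }

  override def addTaskTempFile(taskContext: TaskAttemptContext,
                               dir: Option[String], ext: String): String = {
    // Place the file under the committer's work path; Hadoop moves it on commit.
    val name = f"part-${taskContext.getTaskAttemptID.getTaskID.getId}%05d$ext"
    val parent = dir.map(new Path(committer.getWorkPath, _)).getOrElse(committer.getWorkPath)
    new Path(parent, name).toString
  }

  override def commitTask(taskContext: TaskAttemptContext): TaskCommitMessage = {
    committer.commitTask(taskContext)
    new TaskCommitMessage(null)
  }

  override def abortTask(taskContext: TaskAttemptContext): Unit =
    if (committer != null) committer.abortTask(taskContext)
}
```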
```diff
  final def filePrefix(split: Int, uuid: String, bucketId: Option[Int]): String = {
    val bucketString = bucketId.map(BucketingUtils.bucketIdToString).getOrElse("")
-   f"part-r-$split%05d-$uuid$bucketString"
+   f"part-$split%05d-$uuid$bucketString"
```
Is this still used?
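For reference, the rename just drops the legacy MapReduce "r" (reduce) marker from output file names. With made-up arguments, and assuming `BucketingUtils.bucketIdToString` pads the bucket id to five digits:

```scala
filePrefix(3, "f1e2d3", None)    // was "part-r-00003-f1e2d3", now "part-00003-f1e2d3"
filePrefix(3, "f1e2d3", Some(7)) // now "part-00003-f1e2d3_00007"
```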
Test build #67841 has finished for PR 15696 at commit
Test build #67852 has finished for PR 15696 at commit
Closing this in favor of #15707
What changes were proposed in this pull request?
This patch introduces an internal commit protocol API that the batch data source write path uses to commit its output. It currently has a single implementation built on Hadoop MapReduce's OutputCommitter API. In the future, this commit API can be used to unify streaming and batch commits.
How was this patch tested?
Should be covered by existing write tests.
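For readers landing here from SPARK-18024, a hedged sketch of the commit lifecycle the API implies, written against the trait sketched in the review thread above. The orchestration is illustrative; in Spark the task-side block would run inside executor tasks, not a local `map`:

```scala
import org.apache.hadoop.mapreduce.{JobContext, TaskAttemptContext}

// Illustrative end-to-end flow over the sketched CommitProtocol trait.
def runWriteJob(committer: CommitProtocol, jobContext: JobContext,
                taskContexts: Seq[TaskAttemptContext]): Unit = {
  committer.setupJob(jobContext) // driver, before any task starts
  try {
    val commits = taskContexts.map { tc => // conceptually one executor task each
      committer.setupTask(tc)
      try {
        val file = committer.addTaskTempFile(tc, dir = None, ext = ".txt")
        // ... write this task's partition to `file` ...
        committer.commitTask(tc)
      } catch {
        case e: Throwable =>
          committer.abortTask(tc) // best-effort cleanup (see question above)
          throw e
      }
    }
    committer.commitJob(jobContext, commits) // driver: publish all task output
  } catch {
    case e: Throwable =>
      committer.abortJob(jobContext) // best-effort cleanup of the whole job
      throw e
  }
}
```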