Conversation

@rxin (Contributor) commented Oct 31, 2016

What changes were proposed in this pull request?

This patch introduces an internal commit protocol API that batch data sources use to commit their writes. It currently has only one implementation, backed by Hadoop MapReduce's OutputCommitter API. In the future, this commit API can be used to unify streaming and batch commits.
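
For orientation, here is a minimal sketch of the proposed API shape, pieced together from the class and method names that appear in this PR; commitTask is an assumed name (implied by TaskCommitMessage), and the merged signatures may differ:

import org.apache.hadoop.mapreduce.TaskAttemptContext

// Message a task returns to the driver after a successful commit.
class TaskCommitMessage(obj: Any) extends Serializable

// Defines how a Spark job commits its outputs; implementations must be
// serializable, since the instance set up on the driver is reused by tasks.
abstract class FileCommitProtocol {
  // Registers a new task output file and returns the full path to write to.
  def addTaskTempFile(taskContext: TaskAttemptContext, dir: Option[String], ext: String): String

  // Commits a task's writes and reports back to the driver (assumed name).
  def commitTask(taskContext: TaskAttemptContext): TaskCommitMessage

  // Aborts a task after its writes have failed.
  def abortTask(taskContext: TaskAttemptContext): Unit
}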

How was this patch tested?

Should be covered by existing write tests.

@rxin (Contributor, Author) commented Oct 31, 2016

cc @ericl

@SparkQA commented Oct 31, 2016

Test build #67818 has finished for PR 15696 at commit 2a61351.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class TaskCommitMessage(obj: Any) extends Serializable
    • abstract class FileCommitProtocol
    • class MapReduceFileCommitterProtocol(committer: OutputCommitter) extends FileCommitProtocol

@SparkQA commented Oct 31, 2016

Test build #67832 has finished for PR 15696 at commit 6af14b5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class MapReduceFileCommitterProtocol(path: String, isAppend: Boolean)
    • logInfo(s\"Using user defined output committer class $
    • logInfo(s\"Using output committer class $

@SparkQA commented Oct 31, 2016

Test build #67835 has finished for PR 15696 at commit 6166093.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


override def getDefaultWorkFile(context: TaskAttemptContext, extension: String): Path = {
-  new Path(stagingDir, fileNamePrefix)
+  new Path(path)

Contributor commented:

+ extension?

rxin (Author) replied:

Nope -- no more extension coming from Hadoop.

rxin (Author) followed up:

(This is now specified explicitly in OutputWriterFactory.getFileExtension)
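
For reference, a sketch of what "specified explicitly" means here: the data source's writer factory reports its own extension instead of Hadoop appending one. The signature is assumed from the method name; a Parquet source would return something like ".snappy.parquet":

import org.apache.hadoop.mapreduce.TaskAttemptContext

abstract class OutputWriterFactory extends Serializable {
  // Each data source declares its full file extension itself; this value
  // feeds the "ext" argument of the commit protocol's addTaskTempFile.
  def getFileExtension(context: TaskAttemptContext): String
}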

@SparkQA commented Oct 31, 2016

Test build #67838 has finished for PR 15696 at commit 040bbba.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 31, 2016

Test build #67840 has finished for PR 15696 at commit 51d0919.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

rxin changed the title from "[SPARK-18024][SQL] Introduce an internal commit protocol API - WIP" to "[SPARK-18024][SQL] Introduce an internal commit protocol API" on Oct 31, 2016


/**
* An interface to define how a Spark job commits its outputs. Implementations must be serializable.

Contributor commented:

add: since the same committer instance setup on the driver will be used for tasks.

* The "dir" parameter specifies 2, and "ext" parameter specifies both 4 and 5, and the rest
* are left to the commit protocol implementation to decide.
*/
def addTaskTempFile(taskContext: TaskAttemptContext, dir: Option[String], ext: String): String

Contributor commented:

s/add/new?


/**
* Notifies the commit protocol to add a new file, and gets back the full path that should be
* used. Must be called on the executors when running tasks.

Contributor commented:

add: Note that the returned temp file may have an arbitrary path. The commit protocol only promises that the file will be at the location specified by the arguments after job commit.
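
To make that contract concrete, an illustrative call against the signature above (the dir and ext values are invented for the example):

// Inside a running write task; `committer` is the job's FileCommitProtocol.
val tmpPath: String = committer.addTaskTempFile(
  taskContext,
  dir = Some("date=2016-10-31"),  // partition sub-directory, if any
  ext = ".snappy.parquet")        // bucket id plus source-specific extension
// tmpPath may point anywhere, e.g. into a staging directory; the file is only
// guaranteed to be at the requested location after the job commits.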

/**
* Aborts a task after the writes have failed. Must be called on the executors when running tasks.
*/
def abortTask(taskContext: TaskAttemptContext): Unit

Contributor commented:

Is this also best-effort?
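
For context, the task-side pattern the question refers to, sketched against the API outline above (commitTask is an assumed name; whether abortTask failures are merely logged is exactly what is being asked here):

try {
  // ... write all rows to the path returned by addTaskTempFile ...
  committer.commitTask(taskContext)
} catch {
  case t: Throwable =>
    committer.abortTask(taskContext)  // best-effort cleanup, then rethrow
    throw t
}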

*
* Unlike Hadoop's OutputCommitter, this implementation is serializable.
*/
class MapReduceFileCommitterProtocol(path: String, isAppend: Boolean)

Contributor commented:

Should we call this HadoopCommitProtocolWrapper or something to be more clear?
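
For reference, the wrapper shape under discussion, sketched against the FileCommitProtocol outline earlier in this thread (method bodies are placeholders, not the PR's actual implementation):

import org.apache.hadoop.mapreduce.{OutputCommitter, TaskAttemptContext}

class MapReduceFileCommitterProtocol(path: String, isAppend: Boolean)
    extends FileCommitProtocol {

  // Hadoop's OutputCommitter is not serializable, so it is created lazily
  // on each JVM from the task/job context rather than shipped from the driver.
  @transient private var committer: OutputCommitter = _

  override def addTaskTempFile(
      taskContext: TaskAttemptContext, dir: Option[String], ext: String): String = ???

  override def commitTask(taskContext: TaskAttemptContext): TaskCommitMessage = ???

  override def abortTask(taskContext: TaskAttemptContext): Unit =
    committer.abortTask(taskContext)  // delegate straight to Hadoop
}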

final def filePrefix(split: Int, uuid: String, bucketId: Option[Int]): String = {
  val bucketString = bucketId.map(BucketingUtils.bucketIdToString).getOrElse("")
-  f"part-r-$split%05d-$uuid$bucketString"
+  f"part-$split%05d-$uuid$bucketString"

Contributor commented:

Is this still used?
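
For concreteness, what the renamed prefix above produces, dropping the MapReduce-era "-r-" reducer marker (values invented; the bucket suffix follows BucketingUtils' zero-padded format, roughly):

val split = 3
val uuid = "c0ffee00-1234"              // normally the job's full UUID
val bucketString = "_00002"             // e.g. BucketingUtils.bucketIdToString(2)
f"part-$split%05d-$uuid$bucketString"   // => "part-00003-c0ffee00-1234_00002"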

@SparkQA commented Nov 1, 2016

Test build #67841 has finished for PR 15696 at commit 2d7d373.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 1, 2016

Test build #67852 has finished for PR 15696 at commit cd23d2f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin (Contributor, Author) commented Nov 1, 2016

Closing this in favor of #15707

rxin closed this on Nov 1, 2016