Conversation

@rxin rxin commented Nov 1, 2016

What changes were proposed in this pull request?

This patch introduces an internal commit protocol API that is used by the batch data source to do write commits. It currently has only one implementation that uses Hadoop MapReduce's OutputCommitter API. In the future, this commit API can be used to unify streaming and batch commits.
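For context, a minimal self-contained sketch of what such a commit protocol abstraction could look like. The trait and the in-memory implementation below are illustrative only; the method names and signatures are assumptions, not the actual API introduced by this patch:

```scala
// Hypothetical sketch of a job/task commit protocol; names are illustrative.
trait CommitProtocol {
  def setupJob(): Unit                                   // driver side, before tasks run
  def setupTask(): Unit                                  // executor side, per task
  def newTaskTempFile(dir: String, ext: String): String  // staging path for a write
  def commitTask(): Seq[String]                          // files this task wants committed
  def commitJob(taskCommits: Seq[Seq[String]]): Unit     // driver-side final commit
  def abortJob(): Unit                                   // discard staged output
}

// A trivial in-memory implementation, just to demonstrate the flow.
class InMemoryCommitProtocol extends CommitProtocol {
  private val staged = scala.collection.mutable.ArrayBuffer[String]()
  private val committed = scala.collection.mutable.ArrayBuffer[String]()
  def setupJob(): Unit = ()
  def setupTask(): Unit = ()
  def newTaskTempFile(dir: String, ext: String): String = {
    val path = s"$dir/_temp/part-${staged.size}$ext"
    staged += path
    path
  }
  def commitTask(): Seq[String] = staged.toSeq
  def commitJob(taskCommits: Seq[Seq[String]]): Unit =
    taskCommits.foreach(committed ++= _)
  def abortJob(): Unit = staged.clear()
  def committedFiles: Seq[String] = committed.toSeq
}
```

The point of the abstraction is that the driver-side commit (`commitJob`) only sees the task-side commit messages, which is what lets the same API back both a Hadoop `OutputCommitter` and, eventually, a streaming commit log.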

How was this patch tested?

Should be covered by existing write tests.


rxin commented Nov 1, 2016

This is the same as #15696, but rebased with #15633.


ericl commented Nov 1, 2016

This LGTM, modulo the comments in #15696.

committer,
iterator = iter)
}).flatten.distinct
})
Move the distinct to updatedPartitions?
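If I read the suggestion right, it means deduplicating where partitions are recorded per task, rather than only after flattening all task results on the driver. A toy sketch with hypothetical names (`TaskWriteResult` is illustrative, not the patch's actual type):

```scala
// Each task reports the partitions it touched; using a Set dedupes per task.
case class TaskWriteResult(updatedPartitions: Set[String])

// Driver-side merge still needs a distinct across tasks, since two tasks
// can touch the same partition.
def collectUpdatedPartitions(results: Seq[TaskWriteResult]): Seq[String] =
  results.flatMap(_.updatedPartitions).distinct

val results = Seq(
  TaskWriteResult(Set("p=1", "p=2")),
  TaskWriteResult(Set("p=2", "p=3"))
)
val merged = collectUpdatedPartitions(results)
```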

@rxin rxin changed the title [SPARK-18024][SQL] Introduce an internal commit protocol API - rebased [SPARK-18024][SQL] Introduce an internal commit protocol API Nov 1, 2016

SparkQA commented Nov 1, 2016

Test build #67855 has finished for PR 15707 at commit 0647959.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


val STREAMING_FILE_COMMIT_PROTOCOL_CLASS =
SQLConfigBuilder("spark.sql.streaming.commitProtocolClass")
.internal()
nit: two spaces
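For what it's worth, a config key like this typically names a class that is instantiated by reflection at runtime. A self-contained sketch of that pattern (demonstrated with a JDK class so it runs standalone; Spark's actual loading code may differ):

```scala
// Sketch: instantiate an implementation chosen by a config value such as
// spark.sql.streaming.commitProtocolClass, by looking up a one-String-arg
// constructor reflectively.
def instantiateByName(className: String, arg: String): AnyRef = {
  val clazz = Class.forName(className)
  clazz.getConstructor(classOf[String]).newInstance(arg).asInstanceOf[AnyRef]
}

// Using java.lang.StringBuilder here purely so the example is runnable
// without Spark on the classpath.
val instance = instantiateByName("java.lang.StringBuilder", "commit-log")
```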


ericl commented Nov 1, 2016

This LGTM, just a minor comment


SparkQA commented Nov 1, 2016

Test build #67865 has finished for PR 15707 at commit 65ba5c1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


rxin commented Nov 1, 2016

Looks like the failure was due to a flaky test; everything else passed. I'm going to merge this optimistically.

@asfgit asfgit closed this in d9d1465 Nov 1, 2016

SparkQA commented Nov 1, 2016

Test build #3384 has finished for PR 15707 at commit 0177ded.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class HadoopCommitProtocolWrapper(path: String, isAppend: Boolean)


SparkQA commented Nov 1, 2016

Test build #3386 has finished for PR 15707 at commit 65ba5c1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
Author: Reynold Xin <rxin@databricks.com>
Author: Eric Liang <ekl@databricks.com>

Closes apache#15707 from rxin/SPARK-18024-2.