[SPARK-16034][SQL] Checks the partition columns when calling dataFrame.write.mode("append").saveAsTable #13749

clockfly · 2016-06-18T01:06:46Z

What changes were proposed in this pull request?

`DataFrameWriter` can be used to append data to existing data source tables. The append becomes problematic when the partition columns passed to `DataFrameWriter.partitionBy(columns)` don't match the actual partition columns of the underlying table. This pull request adds a check that enforces that the two lists of partition columns always match.

How was this patch tested?

Unit test.

SparkQA · 2016-06-18T02:17:15Z

Test build #60741 has finished for PR 13749 at commit 44a22dd.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-06-18T02:23:34Z

Test build #60742 has finished for PR 13749 at commit c6a7773.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-06-18T02:31:40Z

Test build #60743 has finished for PR 13749 at commit 8bacffb.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-06-18T02:34:54Z

Test build #60745 has finished for PR 13749 at commit f6b0fad.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-06-18T04:19:11Z

Test build #60750 has finished for PR 13749 at commit 72fdeaf.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-06-18T06:27:44Z

Test build #60753 has finished for PR 13749 at commit 5224802.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-06-18T06:28:19Z

Test build #60754 has finished for PR 13749 at commit 7a4293b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2016-06-18T06:30:13Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala

+            existingColumns.map(_.toLowerCase) == partitionColumns.map(_.toLowerCase)
+          if (existingColumns.size > 0 && !sameColumns) {
+            throw new AnalysisException(
+              s"""Requested partitioning does not match existing partitioning.


can you add "Requested partitioning does not match existing partitioning for table $table" ?

Thanks, updated
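
For readers following the thread, the check under review can be sketched in plain Scala without any Spark dependency. This is a minimal illustration, not the code merged in this PR: `PartitionCheck`, `checkPartitioning`, and the `table` parameter are hypothetical names, and `IllegalArgumentException` stands in for Spark's `AnalysisException`. It assumes the comparison is positional and case-insensitive, as the `toLowerCase` comparison in the hunk above suggests, and that a table with no existing partition columns is exempt from the check.

```scala
object PartitionCheck {
  /**
   * Throws if `requested` differs from the table's `existing` partition columns.
   * The comparison is order-sensitive and case-insensitive. An empty `existing`
   * list is treated as "nothing to check against" and always passes.
   */
  def checkPartitioning(table: String, existing: Seq[String], requested: Seq[String]): Unit = {
    val sameColumns = existing.map(_.toLowerCase) == requested.map(_.toLowerCase)
    if (existing.nonEmpty && !sameColumns) {
      // Stand-in for Spark's AnalysisException (assumption for this sketch).
      throw new IllegalArgumentException(
        s"""Requested partitioning does not match existing partitioning for table $table.
           |Existing partitioning columns: ${existing.mkString(", ")}
           |Requested partitioning columns: ${requested.mkString(", ")}""".stripMargin)
    }
  }
}
```

Under these assumptions, `PartitionCheck.checkPartitioning("t", Seq("year", "month"), Seq("Year", "MONTH"))` returns normally, while passing `Seq("month", "year")` as the requested columns throws, because the order differs.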

SparkQA · 2016-06-18T09:26:59Z

Test build #60776 has finished for PR 13749 at commit 611545c.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

clockfly · 2016-06-18T14:22:45Z

retest this please.

SparkQA · 2016-06-18T16:00:59Z

Test build #60783 has finished for PR 13749 at commit 611545c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

yhuai · 2016-06-18T17:35:30Z

sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala

+      case ex: AnalysisException =>
+        logError(s"Failed to write to table ${tableIdent.identifier} in $mode mode", ex)
+        throw ex
+    }


This log entry is mainly for capturing the table name and mode, right?
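
The pattern in this hunk, catching an exception only to record context and then rethrowing it unchanged, can be sketched as follows. This is a hedged, Spark-free illustration rather than the PR's code: `WriteLogging`, `withWriteContext`, and the `log` callback are hypothetical, and `IllegalStateException` stands in for `AnalysisException`.

```scala
object WriteLogging {
  /**
   * Runs `body`. If it throws IllegalStateException, reports the table name and
   * save mode through `log`, then rethrows the same exception so that callers
   * still see the original failure.
   */
  def withWriteContext[T](table: String, mode: String, log: String => Unit)(body: => T): T = {
    try {
      body
    } catch {
      case ex: IllegalStateException =>
        log(s"Failed to write to table $table in $mode mode: ${ex.getMessage}")
        throw ex
    }
  }
}
```

The only thing the wrapper adds is the table name and mode in the log line; the exception itself is not wrapped or altered.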

yhuai · 2016-06-18T17:41:03Z

LGTM. Let's address the case-sensitivity issue in a separate PR (together with the issue found in #13754). I will take care of the minor comments (i.e. variable naming).

Merging to master and branch 2.0.

…e.write.mode("append").saveAsTable

## What changes were proposed in this pull request?

`DataFrameWriter` can be used to append data to existing data source tables. It becomes tricky when partition columns used in `DataFrameWriter.partitionBy(columns)` don't match the actual partition columns of the underlying table. This pull request enforces the check so that the partition columns of these two always match.

## How was this patch tested?

Unit test.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #13749 from clockfly/SPARK-16034.

(cherry picked from commit ce3b98b)
Signed-off-by: Yin Huai <yhuai@databricks.com>

yhuai · 2016-06-19T02:48:56Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala

-                s"$ex != ${partitionColumns.toSet}.")
-            }
+          val existingColumns = Try {
+            resolveRelation()


Actually, the partitioning columns returned here are the user-provided ones, not the existing dataset's partitioning columns.

Also, this triggers partition discovery, which we should avoid.

…and improvement

## What changes were proposed in this pull request?

This PR is the follow-up PR for https://github.com/apache/spark/pull/13754/files and #13749. I will comment inline to explain my changes.

## How was this patch tested?

Existing tests.

Author: Yin Huai <yhuai@databricks.com>

Closes #13766 from yhuai/caseSensitivity.

(cherry picked from commit 6d0f921)
Signed-off-by: Yin Huai <yhuai@databricks.com>

…and improvement

## What changes were proposed in this pull request?

This PR is the follow-up PR for https://github.com/apache/spark/pull/13754/files and #13749. I will comment inline to explain my changes.

## How was this patch tested?

Existing tests.

Author: Yin Huai <yhuai@databricks.com>

Closes #13766 from yhuai/caseSensitivity.

clockfly force-pushed the SPARK-16034 branch 4 times, most recently from 9ac949c to f6b0fad Compare June 18, 2016 01:19

clockfly force-pushed the SPARK-16034 branch from f6b0fad to 72fdeaf Compare June 18, 2016 04:16

clockfly force-pushed the SPARK-16034 branch from 72fdeaf to 5224802 Compare June 18, 2016 04:34

SPARK-16034

7a4293b

clockfly force-pushed the SPARK-16034 branch from 5224802 to 7a4293b Compare June 18, 2016 04:37

clockfly changed the title ~~[SPARK-16034][SQL][WIP] Checks the partition columns when calling dataFrame.write.mode("append").saveAsTable~~ [SPARK-16034][SQL] Checks the partition columns when calling dataFrame.write.mode("append").saveAsTable Jun 18, 2016

rxin reviewed Jun 18, 2016

On Reynold's comment

611545c

clockfly force-pushed the SPARK-16034 branch from c004e18 to 611545c Compare June 18, 2016 08:06

yhuai reviewed Jun 18, 2016

asfgit closed this in ce3b98b Jun 18, 2016

yhuai reviewed Jun 19, 2016

yhuai mentioned this pull request Jun 19, 2016

[SPARK-16036][SPARK-16037][SPARK-16034][SQL] Follow up code clean up and improvement #13766

Closed
