[SPARK-15676] [SQL] Disallow Column Names as Partition Columns For Hive Tables #13415
Conversation
```scala
  comment = comment)

selectQuery match {
  case Some(q) => CreateTableAsSelectLogicalPlan(tableDesc, q, ifNotExists)
```
For CTAS, another PR (#13395) resolves the issue by disallowing users from specifying `PARTITIONED BY` clauses.
Test build #59666 has finished for PR 13415 at commit
```scala
val partitionColsInTable = partitionCols.map(_.name).toSet.intersect(cols.map(_.name).toSet)
if (partitionColsInTable.nonEmpty) {
  throw new ParseException(s"Column repeated in partitioning columns: " +
    partitionColsInTable.mkString("[", ",", "]"), ctx)
```
Hi @gatorsmile, this looks OK, but it seems a better place to do it is up at L885, where we concatenate the schema with the partition columns. There we can just check whether `schema.map(_.name)` has any duplicate values.
I see. I can move it there.
The reason I put it here is that CTAS should not see the partitioning columns. If we move it there, we could issue this error message before the expected message: https://github.com/yhuai/spark/blob/fa8908122a238d6cdc0a9fc0f003221ef5601565/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala#L940-L948
That's fine, but I would still move it. Maybe I would even move the data source partition check before this exception; we don't have to throw that one so late.
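The duplicate-name check discussed in this thread can be sketched in standalone Scala (the `Col` case class and `checkNoDuplicates` name are illustrative, not the actual `SparkSqlParser` code):

```scala
// Illustrative sketch of the suggestion: after concatenating the data columns
// with the partition columns, reject any name that appears more than once.
case class Col(name: String)

def checkNoDuplicates(cols: Seq[Col], partitionCols: Seq[Col]): Unit = {
  val names = (cols ++ partitionCols).map(_.name)
  // diff removes one occurrence per element of the distinct list,
  // so anything left over appeared at least twice
  val duplicates = names.diff(names.distinct).distinct
  if (duplicates.nonEmpty) {
    throw new IllegalArgumentException(
      "Column repeated in partitioning columns: " + duplicates.mkString("[", ",", "]"))
  }
}
```

With this shape, `PARTITIONED BY (data string, part string)` on a table that already has a `data` column fails at analysis of the concatenated schema rather than deep inside the metastore call.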
```scala
// Ensuring whether no duplicate name is used in table definition;
// Also ensuring the existing columns are not used as partition columns
checkDuplicateNames(colNames = schema.map(_.name), ctx)
```
After the code changes, we verify two cases: duplicate names in the table definition, and columns repeated in the partitioning columns.
Actually, it might be better to explicitly check whether there are common columns between `cols` and `partitionCols`. Then we can give a better error message.
I see. Thanks!
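The intersection-based check suggested above might look roughly like this (a standalone sketch with hypothetical names; the real code lives in `SparkSqlParser`):

```scala
// Illustrative sketch: intersect the table columns with the partition columns
// so the error message can name the offending columns directly.
def checkPartitionColumns(colNames: Seq[String], partitionColNames: Seq[String]): Unit = {
  val overlap = colNames.toSet.intersect(partitionColNames.toSet)
  if (overlap.nonEmpty) {
    throw new IllegalArgumentException(
      "Operation not allowed: partition columns duplicate table columns: " +
        overlap.toSeq.sorted.mkString("[", ",", "]"))
  }
}
```

Compared with a single duplicate scan over the concatenated schema, this variant can tell the user the problem is specifically a table column reused in `PARTITIONED BY`.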
Test build #59695 has finished for PR 13415 at commit
Test build #59705 has finished for PR 13415 at commit
retest this please
Test build #59712 has finished for PR 13415 at commit
retest this please
```scala
val duplicateColumns = colNames.groupBy(identity).collect {
  case (x, ys) if ys.length > 1 => "\"" + x + "\""
}
throw new ParseException(s"Duplicate column name key(s) in the table definition: " +
```
What does "column name key(s)" mean? I think we should just say: "Duplicated column names found in table definition: ..."
It would also be good to throw `operationNotAllowed` here.
Can you also print the table name? e.g. "found in table definition for 'my_table'"
: ) This just follows the error message of Hive. Will change it. Thanks!
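The `groupBy(identity)` idiom quoted in the diff above can be seen in isolation (a standalone sketch, not the PR's exact code):

```scala
// Collect every column name that occurs more than once.
// groupBy(identity) maps each name to the list of its occurrences,
// so a group longer than one marks a duplicate.
def duplicateNames(colNames: Seq[String]): Seq[String] =
  colNames.groupBy(identity).collect {
    case (name, occurrences) if occurrences.length > 1 => name
  }.toSeq
```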
I'm thinking about case sensitivity; maybe we should put this check in the analyzer instead of the parser?
@cloud-fan Yeah, agree. I knew you would say that. : )
Test build #59866 has finished for PR 13415 at commit
We should still do it in the parser, but use the
Sure, will do it. Thanks!
@cloud-fan @andrewor14 In this scenario, we do not have the case sensitivity issues. The names of all the catalog columns are converted to lower case by `sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala` (line 1230 in d109a1b).
I remember we gave up case sensitivity support in this release. Let me know if you have any questions regarding the current implementation. Thanks!
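The point about lower-casing can be illustrated with a small sketch (hypothetical helper names; the real normalization happens inside `SparkSqlParser` as noted above):

```scala
// If every catalog column name is lower-cased up front, a set-based duplicate
// check cannot be fooled by case differences such as "Data" vs "data".
def normalized(colNames: Seq[String]): Seq[String] = colNames.map(_.toLowerCase)

def hasOverlap(cols: Seq[String], partitionCols: Seq[String]): Boolean =
  normalized(cols).toSet.intersect(normalized(partitionCols).toSet).nonEmpty
```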
Test build #59912 has finished for PR 13415 at commit
retest this please
Test build #60099 has finished for PR 13415 at commit
LGTM, cc @andrewor14 for final sign off
Thank you! @cloud-fan
retest this please
Test build #60275 has finished for PR 13415 at commit
LGTM, sorry for the wait
Thank you! @andrewor14
retest this please
Test build #60408 has finished for PR 13415 at commit
Thanks. Merging to master and branch 2.0.
[SPARK-15676] [SQL] Disallow Column Names as Partition Columns For Hive Tables

#### What changes were proposed in this pull request?

When creating a Hive table (not a data source table), a common error users might make is to specify an existing column name as a partition column. Below is what Hive returns in this case:

```
hive> CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY (data string, part string);
FAILED: SemanticException [Error 10035]: Column repeated in partitioning columns
```

Currently, the error we issue is very confusing:

```
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:For direct MetaStore DB connections, we don't support retries at the client level.);
```

This PR fixes the above issue by capturing the usage error in the `Parser`.

#### How was this patch tested?

Added a test case to `DDLCommandSuite`.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #13415 from gatorsmile/partitionColumnsInTableSchema.

(cherry picked from commit 3b7fb84)
Signed-off-by: Yin Huai <yhuai@databricks.com>