Conversation

@gatorsmile
Member

What changes were proposed in this pull request?

When creating a Hive Table (not data source tables), a common error users might make is to specify an existing column name as a partition column. Below is what Hive returns in this case:

```
hive> CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY (data string, part string);
FAILED: SemanticException [Error 10035]: Column repeated in partitioning columns
```

Currently, the error Spark issues is very confusing:

```
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:For direct MetaStore DB connections, we don't support retries at the client level.);
```

This PR fixes the above issue by capturing this usage error in the parser.

How was this patch tested?

Added a test case to DDLCommandSuite
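As a rough standalone sketch (hypothetical helper names, not Spark's actual parser code), the check this PR introduces amounts to rejecting any PARTITIONED BY column that already appears in the table schema:

```scala
// Hypothetical sketch of the new parser-side check; names are illustrative,
// not Spark's real ones.
def repeatedPartitionCols(cols: Seq[String], partitionCols: Seq[String]): Seq[String] =
  partitionCols.filter(cols.toSet) // columns appearing in both the schema and PARTITIONED BY

def validatePartitioning(cols: Seq[String], partitionCols: Seq[String]): Unit = {
  val repeated = repeatedPartitionCols(cols, partitionCols)
  if (repeated.nonEmpty) {
    // Mirrors the Hive error text quoted above
    throw new IllegalArgumentException(
      "Column repeated in partitioning columns: " + repeated.mkString("[", ",", "]"))
  }
}
```

With the Hive example above, `validatePartitioning(Seq("id", "data"), Seq("data", "part"))` would fail fast in the parser instead of surfacing a confusing MetaStore error.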

```scala
    comment = comment)

selectQuery match {
  case Some(q) => CreateTableAsSelectLogicalPlan(tableDesc, q, ifNotExists)
```
Member Author

For CTAS, another PR (#13395) resolves the issue by disallowing users from specifying PARTITIONED BY clauses.

@SparkQA

SparkQA commented May 31, 2016

Test build #59666 has finished for PR 13415 at commit 48ddb08.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member Author

gatorsmile commented May 31, 2016

cc @cloud-fan @yhuai @andrewor14

```scala
val partitionColsInTable = partitionCols.map(_.name).toSet.intersect(cols.map(_.name).toSet)
if (partitionColsInTable.nonEmpty) {
  throw new ParseException(s"Column repeated in partitioning columns: " +
    partitionColsInTable.mkString("[", ",", "]"), ctx)
```
Contributor

Hi @gatorsmile, this looks OK, but it seems a better place to do it is up at L885, where we just concatenate the schema with the partition columns. There we can simply check whether `schema.map(_.name)` has any duplicate values.

Member Author

I see. I can move it there.

The reason I put it here is that CTAS should not see the partitioning columns. If we move it there, we could issue this error message before the expected one: https://github.com/yhuai/spark/blob/fa8908122a238d6cdc0a9fc0f003221ef5601565/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala#L940-L948

Contributor

@andrewor14 andrewor14 May 31, 2016

That's fine; I would still move it. Maybe I would even move the data source partition check before this exception; we don't have to throw that one so late.


```scala
// Ensure that no duplicate names are used in the table definition,
// and that existing columns are not used as partition columns.
checkDuplicateNames(colNames = schema.map(_.name), ctx)
```
Member Author

After the code changes, we verify two cases: duplicate names in the table definition, and existing columns repeated as partitioning columns.

Contributor

Actually, it might be better to explicitly check whether there are common columns between `cols` and `partitionCols`; then we can give a better error message.

Member Author

I see. Thanks!

@SparkQA

SparkQA commented Jun 1, 2016

Test build #59695 has finished for PR 13415 at commit 6c5c2d9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 1, 2016

Test build #59705 has finished for PR 13415 at commit 942366f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member Author

retest this please

@SparkQA

SparkQA commented Jun 1, 2016

Test build #59712 has finished for PR 13415 at commit 942366f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member Author

retest this please

```scala
val duplicateColumns = colNames.groupBy(identity).collect {
  case (x, ys) if ys.length > 1 => "\"" + x + "\""
}
throw new ParseException(s"Duplicate column name key(s) in the table definition: " +
```
Contributor

What does "column name key(s)" mean? I think we should just say: "Duplicated column names found in table definition: ..."
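A minimal sketch of that suggestion (hypothetical names, not the PR's final code): detect names occurring more than once in the combined column list and report them with the clearer wording:

```scala
// Hypothetical sketch: find names occurring more than once in the combined
// schema (table columns ++ partition columns).
def duplicateNames(colNames: Seq[String]): Seq[String] =
  colNames.groupBy(identity).collect {
    case (name, group) if group.length > 1 => name
  }.toSeq

def checkDuplicateNames(colNames: Seq[String]): Unit = {
  val dups = duplicateNames(colNames)
  if (dups.nonEmpty) {
    throw new IllegalArgumentException(
      "Duplicated column names found in table definition: " + dups.mkString(", "))
  }
}
```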

Contributor

It would also be good to throw `operationNotAllowed` here.

Contributor

@andrewor14 andrewor14 Jun 2, 2016

Can you also print the table name? e.g. "found in table definition for 'my_table'"

Member Author

: ) This just follows Hive's error message. Will change it. Thanks!

@cloud-fan
Contributor

I'm thinking about case sensitivity; maybe we should put this check in the analyzer instead of the parser?

@gatorsmile
Member Author

@cloud-fan Yeah, agreed. I knew you would say that. : )

@SparkQA

SparkQA commented Jun 2, 2016

Test build #59866 has finished for PR 13415 at commit 942366f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor

We should still do it in the parser, but use the `SQLConf` setting.

@gatorsmile
Member Author

Sure, will do it. Thanks!

@gatorsmile
Member Author

gatorsmile commented Jun 3, 2016

@cloud-fan @andrewor14 In this scenario, we do not have case sensitivity issues. The names of all the catalog columns are converted to lower case by

I remember we gave up case-sensitivity support in this release.

Let me know if you have any questions about the current implementation. Thanks!
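For illustration, here is one way the case-sensitivity concern could be handled: a sketch in which a plain boolean flag stands in for the real SQLConf setting (it is not Spark's actual API), normalizing names before duplicate detection when the session is case-insensitive.

```scala
// Hypothetical sketch: normalize names before duplicate detection unless the
// session is case-sensitive. `caseSensitive` stands in for a SQLConf setting.
def duplicatesWithConf(names: Seq[String], caseSensitive: Boolean): Seq[String] = {
  val normalized = if (caseSensitive) names else names.map(_.toLowerCase)
  normalized.groupBy(identity).collect {
    case (n, group) if group.length > 1 => n
  }.toSeq
}
```

Under the case-insensitive default, `Data` and `data` would count as a duplicate; in case-sensitive mode they would not.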

@SparkQA

SparkQA commented Jun 3, 2016

Test build #59912 has finished for PR 13415 at commit f4207e3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member Author

retest this please

@SparkQA

SparkQA commented Jun 7, 2016

Test build #60099 has finished for PR 13415 at commit f4207e3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

LGTM, cc @andrewor14 for final sign off

@gatorsmile
Member Author

Thank you! @cloud-fan

@gatorsmile
Member Author

retest this please

@SparkQA

SparkQA commented Jun 10, 2016

Test build #60275 has finished for PR 13415 at commit f4207e3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor

andrewor14 commented Jun 10, 2016

LGTM, sorry for the wait.

@gatorsmile
Member Author

Thank you! @andrewor14

@gatorsmile
Member Author

retest this please

@SparkQA

SparkQA commented Jun 13, 2016

Test build #60408 has finished for PR 13415 at commit f4207e3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai
Contributor

yhuai commented Jun 13, 2016

Thanks. Merging to master and branch 2.0.

asfgit pushed a commit that referenced this pull request Jun 13, 2016
…e Tables


Author: gatorsmile <gatorsmile@gmail.com>

Closes #13415 from gatorsmile/partitionColumnsInTableSchema.

(cherry picked from commit 3b7fb84)
Signed-off-by: Yin Huai <yhuai@databricks.com>
@asfgit asfgit closed this in 3b7fb84 Jun 13, 2016