[SPARK-15676] [SQL] Disallow Column Names as Partition Columns For Hive Tables #13415
Conversation
```scala
  comment = comment)

selectQuery match {
  case Some(q) => CreateTableAsSelectLogicalPlan(tableDesc, q, ifNotExists)
```
For CTAS, another PR (#13395) resolves the issue by disallowing users from specifying `PARTITIONED BY` clauses.
Test build #59666 has finished for PR 13415 at commit
```scala
val partitionColsInTable = partitionCols.map(_.name).toSet.intersect(cols.map(_.name).toSet)
if (partitionColsInTable.nonEmpty) {
  throw new ParseException(s"Column repeated in partitioning columns: " +
    partitionColsInTable.mkString("[", ",", "]"), ctx)
```
Hi @gatorsmile, this looks OK, but it seems a better place to do it is up at L885, where we concatenate the schema with the partition columns. There we can just check whether `schema.map(_.name)` has any duplicate values.
I see. I can move it there.
The reason I put it here is that CTAS should not see the partitioning columns. If we move it there, we could issue this error message before the expected message: https://github.com/yhuai/spark/blob/fa8908122a238d6cdc0a9fc0f003221ef5601565/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala#L940-L948
That's fine, but I would still move it. Maybe I would even move the data source partition check before this exception; we don't have to throw that one so late.
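The duplicate-name check discussed in this thread can be sketched in standalone Scala (the `Col` case class and `checkNoDuplicates` name are illustrative, not the actual `SparkSqlParser` code):

```scala
// Illustrative sketch of the suggestion: after concatenating the data columns
// with the partition columns, reject any name that appears more than once.
case class Col(name: String)

def checkNoDuplicates(cols: Seq[Col], partitionCols: Seq[Col]): Unit = {
  val names = (cols ++ partitionCols).map(_.name)
  // diff removes one occurrence per element of the distinct list,
  // so anything left over appeared at least twice
  val duplicates = names.diff(names.distinct).distinct
  if (duplicates.nonEmpty) {
    throw new IllegalArgumentException(
      "Column repeated in partitioning columns: " + duplicates.mkString("[", ",", "]"))
  }
}
```

With this shape, `PARTITIONED BY (data string, part string)` on a table that already has a `data` column fails at analysis of the concatenated schema rather than deep inside the metastore call.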
```scala
// Ensuring whether no duplicate name is used in table definition;
// Also ensuring the existing columns are not used as partition columns
checkDuplicateNames(colNames = schema.map(_.name), ctx)
```
After the code changes, we verify two cases: duplicate names in the table definition, and columns repeated in the partitioning columns.
Actually, it might be better to explicitly check whether there are common columns between `cols` and `partitionCols`. Then we can give a better error message.
I see. Thanks!
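The intersection-based check suggested above might look roughly like this (a standalone sketch with hypothetical names; the real code lives in `SparkSqlParser`):

```scala
// Illustrative sketch: intersect the table columns with the partition columns
// so the error message can name the offending columns directly.
def checkPartitionColumns(colNames: Seq[String], partitionColNames: Seq[String]): Unit = {
  val overlap = colNames.toSet.intersect(partitionColNames.toSet)
  if (overlap.nonEmpty) {
    throw new IllegalArgumentException(
      "Operation not allowed: partition columns duplicate table columns: " +
        overlap.toSeq.sorted.mkString("[", ",", "]"))
  }
}
```

Compared with a single duplicate scan over the concatenated schema, this variant can tell the user the problem is specifically a table column reused in `PARTITIONED BY`.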
Test build #59695 has finished for PR 13415 at commit
Test build #59705 has finished for PR 13415 at commit
retest this please
Test build #59712 has finished for PR 13415 at commit
retest this please
```scala
val duplicateColumns = colNames.groupBy(identity).collect {
  case (x, ys) if ys.length > 1 => "\"" + x + "\""
}
throw new ParseException(s"Duplicate column name key(s) in the table definition: " +
```
What does "column name key(s)" mean? I think we should just say: "Duplicated column names found in table definition: ..."
It would also be good to throw `operationNotAllowed` here.
Can you also print the table name? e.g. "found in table definition for 'my_table'"
: ) This just follows the error message of Hive. Will change it. Thanks!
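The `groupBy(identity)` idiom quoted in the diff above can be seen in isolation (a standalone sketch, not the PR's exact code):

```scala
// Collect every column name that occurs more than once.
// groupBy(identity) maps each name to the list of its occurrences,
// so a group longer than one marks a duplicate.
def duplicateNames(colNames: Seq[String]): Seq[String] =
  colNames.groupBy(identity).collect {
    case (name, occurrences) if occurrences.length > 1 => name
  }.toSeq
```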
I'm thinking about case sensitivity; maybe we should put this check in the analyzer instead of the parser?
@cloud-fan Yeah, agree. I knew you would say that. : )
Test build #59866 has finished for PR 13415 at commit
We should still do it in the parser, but use the
Sure, will do it. Thanks!
@cloud-fan @andrewor14 In this scenario, we do not have the case sensitivity issues. The names of all the catalog columns are converted to lower case by `sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala` (line 1230 in d109a1b).
I remember we gave up case sensitivity support in this release. Let me know if you have any questions regarding the current implementation. Thanks!
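The point about lower-casing can be illustrated with a small sketch (hypothetical helper names; the real normalization happens inside `SparkSqlParser` as noted above):

```scala
// If every catalog column name is lower-cased up front, a set-based duplicate
// check cannot be fooled by case differences such as "Data" vs "data".
def normalized(colNames: Seq[String]): Seq[String] = colNames.map(_.toLowerCase)

def hasOverlap(cols: Seq[String], partitionCols: Seq[String]): Boolean =
  normalized(cols).toSet.intersect(normalized(partitionCols).toSet).nonEmpty
```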
Test build #59912 has finished for PR 13415 at commit
retest this please
Test build #60099 has finished for PR 13415 at commit
LGTM, cc @andrewor14 for final sign off
Thank you! @cloud-fan
retest this please
Test build #60275 has finished for PR 13415 at commit
LGTM, sorry for the wait
Thank you! @andrewor14
retest this please
Test build #60408 has finished for PR 13415 at commit
Thanks. Merging to master and branch 2.0.
[SPARK-15676] [SQL] Disallow Column Names as Partition Columns For Hive Tables

#### What changes were proposed in this pull request?

When creating a Hive table (not a data source table), a common error users might make is to specify an existing column name as a partition column. Below is what Hive returns in this case:

```
hive> CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY (data string, part string);
FAILED: SemanticException [Error 10035]: Column repeated in partitioning columns
```

Currently, the error we issue is very confusing:

```
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:For direct MetaStore DB connections, we don't support retries at the client level.);
```

This PR fixes the above issue by capturing the usage error in the `Parser`.

#### How was this patch tested?

Added a test case to `DDLCommandSuite`.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #13415 from gatorsmile/partitionColumnsInTableSchema.

(cherry picked from commit 3b7fb84)
Signed-off-by: Yin Huai <yhuai@databricks.com>