[SPARK-20680][SQL] Spark-sql do not support for creating table with void column datatype #28833
Conversation
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
cc @maropu, @cloud-fan, @dongjoon-hyun, @hvanhovell, @gatorsmile FYI
sql/catalyst/src/main/scala/org/apache/spark/sql/types/NullType.scala
Test build #124046 has finished for PR 28833 at commit
```scala
spark.sharedState.externalCatalog.unwrapped.asInstanceOf[HiveExternalCatalog].client
client.runSqlHive("CREATE TABLE t (t1 int)")
client.runSqlHive("INSERT INTO t VALUES (3)")
client.runSqlHive("CREATE VIEW tabNullType AS SELECT NULL AS col FROM t")
```
Do we need the `t` table for this test? Can't we just write `CREATE VIEW tabNullType AS SELECT NULL AS col`?
Yes, table `t` is needed; otherwise an `InvalidTableException: Table not found _dummy_table` exception is thrown.
Test build #124056 has finished for PR 28833 at commit
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/DataTypeParserSuite.scala
Btw, could you brush up the PR description for better commit logs? What's the proposal of this PR, what's the behavior change before/after this PR, and so on? I feel the current one looks a bit ambiguous.
Let's also follow the new GitHub PR template - https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE
Test build #124080 has finished for PR 28833 at commit
retest this please
Test build #124098 has finished for PR 28833 at commit
retest this please
To confirm: in Hive, people can't create tables with the void type (including void type inside struct/array/map), and the only way is CTAS. Is this true? And how about Spark?
Test build #124121 has finished for PR 28833 at commit
How about adding another `HiveNullType`, like `HiveStringType`?

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/types/HiveStringType.scala, lines 65 to 81 in 4f274a4:

```scala
/**
 * Hive char type. Similar to other HiveStringType's, these datatypes should only used for
 * parsing, and should NOT be used anywhere else. Any instance of these data types should be
 * replaced by a [[StringType]] before analysis.
 */
case class CharType(length: Int) extends HiveStringType {
  override def simpleString: String = s"char($length)"
}

/**
 * Hive varchar type. Similar to other HiveStringType's, these datatypes should only used for
 * parsing, and should NOT be used anywhere else. Any instance of these data types should be
 * replaced by a [[StringType]] before analysis.
 */
case class VarcharType(length: Int) extends HiveStringType {
  override def simpleString: String = s"varchar($length)"
}
```
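A minimal sketch of what such a parse-only void type could look like, following the `CharType`/`VarcharType` pattern quoted above. The `HiveParseOnlyType` base class and all names below are hypothetical illustrations, not actual Spark code:

```scala
// Hypothetical sketch only: mirrors the HiveStringType pattern above.
// Such a type would exist only during parsing and be replaced by the
// real NullType before analysis.
sealed abstract class HiveParseOnlyType {
  def simpleString: String
}

case object HiveVoidType extends HiveParseOnlyType {
  // Hive spells this type "void".
  override def simpleString: String = "void"
}

object HiveVoidTypeDemo extends App {
  println(HiveVoidType.simpleString) // prints "void"
}
```

Like the string types, the point of the pattern is that the parse-only class never leaks past analysis.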
Could we add another test case?

`create table t1 stored as parquet as select null as null_col;`

CTAS statements throw an error when the select clause has a NULL/VOID type column, since HIVE-11217.
@wangyum that said, only legacy Hive tables can have VOID column type? It's also good to list the current Spark behaviors. I think it makes sense to forbid creating tables with VOID column type, maybe we can do that with an analyzer rule.
Emmm, thanks @wangyum. I think we should keep the same behavior as Hive 2.x and throw more readable exceptions for the SQLs below:

```sql
create table t as select 1 x, null z from dual;
create table t as select null as null_col;
create table t (v void);
```
Does Hive support inner void, like `struct<name:VOID>` or `array<VOID>`?
Success in Hive 2.3.7:

```sql
create table t (col1 struct<name:STRING, id: BIGINT>);
create table t (col1 array<STRING>);
```

Fail with:

```sql
create table t (col1 struct<name:VOID, id: BIGINT>);
create table t (col1 array<VOID>);
```
Can we have another PR to forbid creating tables with void type, via an analyzer rule?
Merged to master. Thank you for your patience, @LantaoJin.
Test build #125270 has finished for PR 28833 at commit
Thank you all for the kind reviews.
Re: #28833 (comment). Sorry, I read the comments just now. So the decision here is that we allow parsing `unknown`? How about other cases where we directly use a DDL-formatted string as the type? These simple type strings can be used in many places, such as …
If we're going to treat … If we'll still care and have … If we're not sure, let's not change it.
First of all, it's not a good idea to add …

Previously, this was supported until Apache Spark 2.0.0. After that, Apache Spark didn't support void. This PR also tried to forbid …

In any case, since this is a legitimate suggestion from @HyukjinKwon, cc @gatorsmile, too.
The intention is to only allow parsing `unknown`. We don't document NullType in the SQL reference. I think it's better to hide NullType from end users. It's usually type-coerced to other official types, and this PR forbids creating tables with it.
Could you make a follow-up (full revert or partial revert) as you suggested, @HyukjinKwon?
Thanks guys, sure. I will make a follow-up.
### What changes were proposed in this pull request?

This PR proposes to partially revert the simple string in `NullType` at #28833: `NullType.simpleString` back from `unknown` to `null`.

### Why are the changes needed?

- Technically speaking, it's orthogonal to the issue itself, SPARK-20680.
- It needs some more discussion, see #28833 (comment)

### Does this PR introduce _any_ user-facing change?

It reverts back the user-facing changes at #28833. The simple string of `NullType` is back to `null`.

### How was this patch tested?

I just logically reverted. Jenkins should test it out.

Closes #29041 from HyukjinKwon/SPARK-20680.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
What about views? When we execute …
@ulysses-you does it work before this PR?
@cloud-fan it doesn't work. We should choose a plan: either forbid it or support it.
Hi, @ulysses-you. We already chose the plan. This is a step to forbid that gracefully.
@dongjoon-hyun @cloud-fan Sorry for this, but I need to reconfirm: did we decide to forbid this for both the in-memory and Hive catalogs? Currently, we support …
I guess we can forbid that too, consistently, as a continuation of this approach.
I don't think it's a good idea to diverge the behavior between in-memory and hive catalogs.
I have created SPARK-32356 / #29152 to forbid this.
```scala
// session catalog and the table provider is not v2.
case c @ CreateTableStatement(
    SessionCatalogAndTable(catalog, tbl), _, _, _, _, _, _, _, _, _) =>
  assertNoNullTypeInSchema(c.tableSchema)
```
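A hedged sketch of what a recursive check like the `assertNoNullTypeInSchema` call above could look like. The type classes below are simplified stand-ins, not Spark's actual `org.apache.spark.sql.types` classes:

```scala
object NullTypeCheck {
  sealed trait DataType
  case object IntType extends DataType
  case object NullType extends DataType
  case class ArrayType(elementType: DataType) extends DataType
  case class MapType(keyType: DataType, valueType: DataType) extends DataType
  case class StructField(name: String, dataType: DataType)
  case class StructType(fields: Seq[StructField]) extends DataType

  // True if NullType appears anywhere in the type, including nested inside
  // struct/array/map, matching the cases Hive rejects (struct<name:VOID>, array<VOID>).
  def containsNullType(dt: DataType): Boolean = dt match {
    case NullType           => true
    case ArrayType(e)       => containsNullType(e)
    case MapType(k, v)      => containsNullType(k) || containsNullType(v)
    case StructType(fields) => fields.exists(f => containsNullType(f.dataType))
    case _                  => false
  }

  def assertNoNullTypeInSchema(schema: StructType): Unit =
    require(!containsNullType(schema), "Cannot create tables with VOID/NULL type columns")
}
```

The recursion over array/map/struct is what makes the rule cover nested void, not just top-level columns.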
It would be great if we could add a legacy flag for such a behavior change in the future. This changes the behavior for both v1 and v2 catalogs in order to fix a compatibility issue with Hive Metastore, but Hive Metastore is not the only catalog Spark supports, since we have opened up the Catalog APIs in DSv2.
I don't know any database that supports creating tables with null/void type column, so this change is not for hive compatibility but for reasonable SQL semantic.
I agree this is a breaking change that should be at least put in the migration guide. A legacy config can also be added but I can't find a reasonable use case for a null type column.
> I don't know any database that supports creating tables with null/void type column, so this change is not for hive compatibility but for reasonable SQL semantic.
>
> I agree this is a breaking change that should be at least put in the migration guide. A legacy config can also be added but I can't find a reasonable use case for a null type column.
I think the main reason why you would want to support it is when people are using tables / views / temp tables to structure existing workloads. We support NullType in CTEs, but in the case where people want to reuse the same CTE in multiple queries (i.e., multi-output workloads), they have no choice but to use views or temporary tables. (With DataFrames they'd still be able to reuse the same DataFrame for multiple outputs, but in SQL that doesn't work.)
One typical use case where you use CTEs to structure your code is if you have multiple sources with different structures that you then UNION ALL together into a single dataset. It is not uncommon for each of the sources to have certain columns that don't apply, and then you write explicit NULLs there. It would be pretty annoying if you had to write explicit casts of those NULLs to the right type in all of those cases.
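The reason those explicit NULL columns usually just work is that NullType is coercible to the column type supplied by the other UNION ALL branches. A toy sketch of that widening rule (simplified stand-ins, not Spark's actual TypeCoercion code):

```scala
object UnionTypeCoercion {
  sealed trait DataType
  case object NullType extends DataType
  case object IntType extends DataType
  case object StringType extends DataType

  // NullType widens to whatever type the other UNION ALL branch provides;
  // any other mismatch has no common type in this toy model.
  def widerType(a: DataType, b: DataType): Option[DataType] = (a, b) match {
    case (NullType, t)    => Some(t)
    case (t, NullType)    => Some(t)
    case (x, y) if x == y => Some(x)
    case _                => None
  }
}
```

This is also why the problem only surfaces at table-creation time: inside a query the NullType column is widened away, but a bare CTAS of `SELECT NULL AS col` has no other branch to widen against.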
@bart-samwel this makes sense. Shall we also support `CREATE TABLE t(c VOID)`? Your case seems like CTAS only.
> @bart-samwel this makes sense, shall we also support `CREATE TABLE t(c VOID)`? Your case seems like CTAS only.
I think the CREATE TABLE case with explicit types is not very useful, but it could be useful if there were tools that get a table's schema and then try to recreate it, e.g. for mocking purposes. Probably best to be orthogonal here.
@LantaoJin do you have time to fix it? I think we can simply remove the null type check and add a few tests with both in-memory and hive catalog.
### What changes were proposed in this pull request?

Previously we blocked creating tables with the null column to follow the hive behavior in PR #28833. In this PR, I propose to restore the previous behavior to support the null column in a table.

### Why are the changes needed?

For a complex query, it's possible to generate a column with null type. If this happens to the input query of CTAS, the query will fail because Spark doesn't allow creating a table with null type. From the user's perspective, it's hard to figure out why the null type column is produced in the complicated query and how to fix it. So removing this constraint is more friendly to users.

### Does this PR introduce _any_ user-facing change?

Yes, this reverts the previous behavior change in #28833. For example, the command below will succeed after this PR:

```sql
CREATE TABLE t (col_1 void, col_2 int)
```

### How was this patch tested?

Newly added and existing test cases.

Closes #33488 from linhongliu-db/SPARK-36241-support-void-column.

Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 8e7e14d)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?

Change `NullType.simpleString` to "void" to set "void" as the formal type name of `NullType`.

### Why are the changes needed?

This PR is intended to address the type name discussion in PR #28833. Here are the reasons:

1. The type name of NullType is displayed everywhere, e.g. schema string, error message, document. Hence it's not possible to hide it from users; we have to choose a proper name.
2. "void" is widely used as the type name of "NULL", e.g. Hive, pgSQL.
3. Changing to "void" enables the round trip of `toDDL`/`fromDDL` for NullType (i.e. makes `from_json(col, schema.toDDL)` work).

### Does this PR introduce _any_ user-facing change?

Yes, the type name of "NULL" is changed from "null" to "void". For example:

```
scala> sql("select null as a, 1 as b").schema.catalogString
res5: String = struct<a:void,b:int>
```

### How was this patch tested?

Existing test cases.

Closes #33437 from linhongliu-db/SPARK-36224-void-type-name.

Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 2f70077)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
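A toy illustration (not Spark code) of why the choice of type name matters for the `toDDL`/`fromDDL` round trip described above: the parser must accept whatever name the schema printer emits. The `DdlRoundTrip` object and its names are hypothetical:

```scala
object DdlRoundTrip {
  sealed trait DataType { def simpleString: String }
  case object IntType extends DataType { val simpleString = "int" }
  // Printing NullType as "void" lets a parser that understands "void" read it back.
  case object NullType extends DataType { val simpleString = "void" }

  // A tiny "fromDDL" stand-in that only knows a couple of type names.
  def parseType(s: String): DataType = s match {
    case "int"           => IntType
    case "void" | "null" => NullType
    case other           => sys.error(s"Cannot recognize type string: $other")
  }
}
```

Had `simpleString` emitted a name the parser does not accept (such as "unknown"), the round trip would fail with exactly the kind of "cannot recognize type string" error discussed in this thread.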
What changes were proposed in this pull request?
This is the new PR to address the closed one, #17953.

Handle the `void` type string in AstBuilder and point it to `NullType`.

Why are the changes needed?

In Spark 2.0.x, the behaviour of reading such a view is normal.

But in the latest Spark version, it fails with `SparkException: Cannot recognize hive type string: void`.

In Spark, creating a table with a VOID/NULL column should throw a readable exception message, including …

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added unit tests.