
Conversation

@LantaoJin
Contributor

@LantaoJin LantaoJin commented Jun 15, 2020

What changes were proposed in this pull request?

This is a new PR to address the closed one, #17953:

  1. Support the "void" primitive data type in AstBuilder, mapping it to NullType.
  2. Forbid creating tables with a VOID/NULL column type.
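The check in item 2 can be sketched as follows. This is a hypothetical Python sketch, not the actual Scala analyzer rule: the idea is a recursive walk over the schema that rejects any void/null column, including ones nested inside struct/array/map types.

```python
# Hypothetical sketch of the "forbid NullType in a table schema" check.
# Toy schema nodes: ("null",), ("int",), ("struct", [(name, type), ...]),
# ("array", element_type), ("map", key_type, value_type).

def contains_null_type(dtype):
    """Recursively check a toy schema node for a void/null type."""
    kind = dtype[0]
    if kind == "null":
        return True
    if kind == "struct":
        return any(contains_null_type(t) for _, t in dtype[1])
    if kind == "array":
        return contains_null_type(dtype[1])
    if kind == "map":
        return contains_null_type(dtype[1]) or contains_null_type(dtype[2])
    return False

def assert_no_null_type_in_schema(fields):
    """Raise a readable error, as the PR proposes, instead of a parser crash."""
    for name, dtype in fields:
        if contains_null_type(dtype):
            raise ValueError(
                f"Cannot create tables with unknown type: column '{name}'")

# Example: CTAS of `select 1 x, null z` would produce a schema like this.
schema = [("x", ("int",)), ("z", ("null",))]
```

The function and error message names here are illustrative only; the point is the recursive descent so that nested void types (e.g. `struct<v: void>`) are also rejected.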

Why are the changes needed?

  1. Spark is incompatible with Hive's void type. When a Hive table schema contains a void type, DESC table throws an exception in Spark.

hive> create table bad as select 1 x, null z from dual;
hive> describe bad;
OK
x int
z void

In Spark 2.0.x, reading this table behaves normally:

spark-sql> describe bad;
x int NULL
z void NULL
Time taken: 4.431 seconds, Fetched 2 row(s)

But in the latest Spark version, it fails with SparkException: Cannot recognize hive type string: void

spark-sql> describe bad;
17/05/09 03:12:08 ERROR thriftserver.SparkSQLDriver: Failed in [describe bad]
org.apache.spark.SparkException: Cannot recognize hive type string: void
Caused by: org.apache.spark.sql.catalyst.parser.ParseException:
DataType void() is not supported.(line 1, pos 0)
== SQL ==
void
^^^
... 61 more
org.apache.spark.SparkException: Cannot recognize hive type string: void

  2. Hive CTAS statements have thrown an error when the select clause has a NULL/VOID type column since HIVE-11217.
     In Spark, creating a table with a VOID/NULL column should throw a readable exception message, including:
  • create data source table (using parquet, json, ...)
  • create hive table (with or without stored as)
  • CTAS

Does this PR introduce any user-facing change?

No

How was this patch tested?

Add unit tests

@HyukjinKwon
Member

@SparkQA

SparkQA commented Jun 15, 2020

Test build #124046 has finished for PR 28833 at commit 3b8ddec.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val client =
  spark.sharedState.externalCatalog.unwrapped.asInstanceOf[HiveExternalCatalog].client
client.runSqlHive("CREATE TABLE t (t1 int)")
client.runSqlHive("INSERT INTO t VALUES (3)")
client.runSqlHive("CREATE VIEW tabNullType AS SELECT NULL AS col FROM t")
Member

Do we need the t table for this test? Can't we write CREATE VIEW tabNullType AS SELECT NULL AS col?

Contributor Author

Yes, table t is needed; otherwise an InvalidTableException: Table not found _dummy_table is thrown.

@SparkQA

SparkQA commented Jun 15, 2020

Test build #124056 has finished for PR 28833 at commit cf0db98.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Jun 16, 2020

Btw, could you brush up the PR description for better commit logs? What's the proposal of this PR, what's the behavior change before/after this PR, and so on... I feel the current one looks a bit ambiguous...

@HyukjinKwon
Member

Let's also follow the new GitHub PR template - https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE

@LantaoJin LantaoJin changed the title from "[SPARK-20680][SQL] Spark-sql do not support for void column datatype" to "[SPARK-20680][SQL] Make null type in Spark sql to be compatible with Hive void datatype" on Jun 16, 2020
@SparkQA

SparkQA commented Jun 16, 2020

Test build #124080 has finished for PR 28833 at commit 479901d.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Jun 16, 2020

retest this please

@SparkQA

SparkQA commented Jun 16, 2020

Test build #124098 has finished for PR 28833 at commit 479901d.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Jun 16, 2020

retest this please

@cloud-fan
Contributor

To confirm: In Hive, people can't create tables with the void type (including void type inside struct/array/map). The only way is CTAS. Is this true?

And how about Spark?

@SparkQA

SparkQA commented Jun 16, 2020

Test build #124121 has finished for PR 28833 at commit 479901d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@wangyum wangyum left a comment

  1. How about adding another HiveNullType, like HiveStringType?

     /**
      * Hive char type. Similar to other HiveStringTypes, these data types should only be used
      * for parsing, and should NOT be used anywhere else. Any instance of these data types
      * should be replaced by a [[StringType]] before analysis.
      */
     case class CharType(length: Int) extends HiveStringType {
       override def simpleString: String = s"char($length)"
     }

     /**
      * Hive varchar type. Similar to other HiveStringTypes, these data types should only be used
      * for parsing, and should NOT be used anywhere else. Any instance of these data types
      * should be replaced by a [[StringType]] before analysis.
      */
     case class VarcharType(length: Int) extends HiveStringType {
       override def simpleString: String = s"varchar($length)"
     }

  2. Could we add another test case?

     create table t1 stored as parquet as select null as null_col;

     CTAS statements throw an error when the select clause has a NULL/VOID type column since HIVE-11217.
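The parse-only-type idea above can be sketched like this. This is a hypothetical Python sketch (the real CharType/VarcharType hierarchy is Scala, and no HiveVoidType exists in Spark): a placeholder type that the parser may produce but that must be replaced by a real Spark type before analysis.

```python
# Hypothetical sketch of a parse-only placeholder type, modeled on how
# CharType/VarcharType are replaced by StringType before analysis.

class DataType:
    def simple_string(self):
        raise NotImplementedError

class NullType(DataType):
    """Spark's internal null type."""
    def simple_string(self):
        return "null"

class HiveVoidType(DataType):
    """Parse-only stand-in for Hive's `void`; must not survive analysis."""
    def simple_string(self):
        return "void"

def replace_parse_only_types(dtype):
    """Swap parse-only Hive types for their Spark equivalents."""
    if isinstance(dtype, HiveVoidType):
        return NullType()
    return dtype

parsed = HiveVoidType()            # what the parser would produce for `void`
resolved = replace_parse_only_types(parsed)
```

The class and function names are invented for illustration; the design point is that the placeholder gives the parser somewhere to put `void` without making it an official user-facing type.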

@cloud-fan
Contributor

@wangyum that said, only legacy Hive tables can have VOID column type?

It's also good to list the current Spark behaviors. I think it makes sense to forbid creating tables with VOID column type, maybe we can do that with an analyzer rule.

@LantaoJin
Contributor Author

Emmm, thanks @wangyum. I think we should keep the same behavior as Hive 2.x and throw more readable exceptions for the SQL statements below.

create table t as select 1 x, null z from dual;
create table t as select null as null_col;
create table t (v void);

@cloud-fan

@cloud-fan
Contributor

Does Hive support nested void, like struct<v: void>, array<void>, etc.?

@LantaoJin
Contributor Author

LantaoJin commented Jun 24, 2020

These succeed in Hive 2.3.7:

create table t (col1 struct<name:STRING, id: BIGINT>);
create table t (col1 array<STRING>);

These fail with NoViableAltException in Hive 2.3.7:

create table t (col1 struct<name:VOID, id: BIGINT>);
create table t (col1 array<VOID>);

@cloud-fan
Contributor

cloud-fan commented Jun 24, 2020

Can we have another PR to forbid creating tables with void type, via an analyzer rule?

@dongjoon-hyun
Member

dongjoon-hyun commented Jul 8, 2020

Merged to master. Thank you for your patience, @LantaoJin .
(The last commit is only about HiveDDLSuite. I tested it locally.)

@SparkQA

SparkQA commented Jul 8, 2020

Test build #125270 has finished for PR 28833 at commit 9ad57d1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@LantaoJin
Contributor Author

Thank you all for the kind reviews.

@HyukjinKwon
Member

HyukjinKwon commented Jul 8, 2020

Re: #28833 (comment)

Sorry, I read the comments just now. So the decision here is that we allow parsing void as NullType but don't allow it in some commands like CREATE TABLE.

How about other cases where we directly use a DDL-formatted string as the type? These simple type strings can be used in many places, such as from_csv (schema as a DDL-formatted string), from_json (schema as a DDL-formatted string), createDataFrame (Python), etc. However, StructType.simpleString still cannot be parsed back into valid types.
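The round-trip problem raised here can be illustrated with a toy serializer/parser (hypothetical Python, not Spark's actual DDL parser, and with an assumed type vocabulary): if simpleString emits a name the parser doesn't accept, `fromDDL(toDDL(...))` breaks.

```python
# Toy illustration of the simpleString round-trip problem: a schema
# serialized with a type name the parser doesn't know cannot be parsed back.

KNOWN_TYPES = {"int", "string", "void"}  # assumed parser vocabulary

def to_ddl(fields):
    """Serialize (name, type-name) pairs as a DDL-formatted string."""
    return ", ".join(f"{name} {tname}" for name, tname in fields)

def from_ddl(ddl):
    """Parse a DDL-formatted string back into (name, type-name) pairs."""
    fields = []
    for part in ddl.split(", "):
        name, tname = part.split(" ")
        if tname not in KNOWN_TYPES:
            raise ValueError(f"DataType {tname} is not supported.")
        fields.append((name, tname))
    return fields

# If NullType.simpleString were "unknown", the round trip would fail:
bad = to_ddl([("a", "unknown"), ("b", "int")])
ok = to_ddl([("a", "void"), ("b", "int")])
```

This is the same shape of failure shown in the `DataType void() is not supported` trace earlier in the thread, just reduced to a few lines.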

@HyukjinKwon
Member

If we're going to treat void as Hive legacy, let's not support it at all and set the direction to deprecate and remove NullType.

If we'll still care and have NullType, let's make it a proper type in Spark.

If we're not sure, let's not change simpleString to something else for now.

@dongjoon-hyun
Member

dongjoon-hyun commented Jul 8, 2020

First of all, it's not a good idea to officially add NullType as a new Spark data type. Not only does the exposure cause more complexity, but what could we do in the Spark SQL world (https://spark.apache.org/docs/latest/sql-ref-datatypes.html) if that were an official type?

If we'll still care and have NullType, let's make it a proper type in Spark.

Previously, this was supported until Apache Spark 2.0.0; after that, Apache Spark didn't support void. This PR also tries to forbid VOID. AstBuilder provides a way to warn gracefully. Currently, we are very careful even in the error message: we don't mention the void type, we call it an unknown type. I believe this PR is one way to implement your idea, too. Of course, we can add more messages as well.

If we're going to treat void as Hive legacy, let's not support it at all and set the direction to deprecate and remove NullType.

In any way, since this is a legitimate suggestion from @HyukjinKwon , cc @gatorsmile , too.

@cloud-fan
Contributor

NullType is a stable public class; I don't think we can drop it.

The intention was to only allow parsing NullType for the type strings of legacy Hive tables. But @HyukjinKwon is right that it also affects places like from_csv. Let's revert this part and think of a better solution.

We don't document NullType in the SQL reference. I think it's better to hide NullType from end users. It's usually type-coerced to other official types, and this PR forbids NullType if it leaks to the end (top-level columns). df.show is still OK with NullType, though. I agree that the NullType.simpleString update can be put in a separate PR and discussed separately.

@dongjoon-hyun
Member

dongjoon-hyun commented Jul 8, 2020

Could you make a follow-up (full revert or partial revert) as you suggest, @HyukjinKwon?

@HyukjinKwon
Member

Thanks guys, sure. I will make a follow-up.

dongjoon-hyun pushed a commit that referenced this pull request Jul 10, 2020
…own' to 'null'

### What changes were proposed in this pull request?

This PR proposes to partially revert the simple string of `NullType` from #28833: `NullType.simpleString` goes back from `unknown` to `null`.

### Why are the changes needed?

- Technically speaking, it's orthogonal to the issue itself, SPARK-20680.
- It needs some more discussion; see #28833 (comment)

### Does this PR introduce _any_ user-facing change?

It reverts the user-facing changes at #28833.
The simple string of `NullType` is back to `null`.

### How was this patch tested?

I just logically reverted. Jenkins should test it out.

Closes #29041 from HyukjinKwon/SPARK-20680.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@ulysses-you
Contributor

What about views?
After this PR we have no readable exception, and creating a view with NullType is also not supported.

When executing create view v1 as select null as col, the exception is

org.apache.spark.SparkException: Cannot recognize hive type string: null, column: col
  at org.apache.spark.sql.hive.client.HiveClientImpl$.getSparkSQLDataType(HiveClientImpl.scala:999)
  at org.apache.spark.sql.hive.client.HiveClientImpl$.$anonfun$verifyColumnDataType$1(HiveClientImpl.scala:1021)
  at scala.collection.Iterator.foreach(Iterator.scala:941)
  at scala.collection.Iterator.foreach$(Iterator.scala:941)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
  at scala.collection.IterableLike.foreach(IterableLike.scala:74)
  at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
  at org.apache.spark.sql.types.StructType.foreach(StructType.scala:102)
  at org.apache.spark.sql.hive.client.HiveClientImpl$.org$apache$spark$sql$hive$client$HiveClientImpl$$verifyColumnDataType(HiveClientImpl.scala:1021)
  at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$createTable$1(HiveClientImpl.scala:547)

@cloud-fan
Contributor

@ulysses-you did it work before this PR?

@ulysses-you
Contributor

@cloud-fan It doesn't work. We should choose a plan: forbid or support.

@dongjoon-hyun
Member

dongjoon-hyun commented Jul 13, 2020

Hi, @ulysses-you. We already chose the plan; this is a step toward forbidding it gracefully.
For create view v1 as select null as col, we can add an AnalysisException if you want.
Could you file a JIRA for that?

@ulysses-you
Contributor

@dongjoon-hyun @cloud-fan Sorry for this, but I need to reconfirm: did we decide to forbid this for both the in-memory and Hive catalogs?

Currently, create view v as select null succeeds with the in-memory catalog but fails with Hive.

@dongjoon-hyun
Member

dongjoon-hyun commented Jul 16, 2020

I guess we can forbid that too, consistently, as a continuation of this approach.
BTW, until now it has been beyond the scope, because this PR was designed to prevent the Hive void type.
Since Apache Spark doesn't talk to the Apache Hive Metastore in the in-memory catalog case, other PMC members may have a different opinion.

@cloud-fan
Contributor

I don't think it's a good idea to diverge the behavior between the in-memory and Hive catalogs.

@ulysses-you
Contributor

I have created SPARK-32356/#29152 to forbid this.

// session catalog and the table provider is not v2.
case c @ CreateTableStatement(
    SessionCatalogAndTable(catalog, tbl), _, _, _, _, _, _, _, _, _) =>
  assertNoNullTypeInSchema(c.tableSchema)
Member

It would be great if we could add a legacy flag for such a behavior change in the future. This changes the behavior for both v1 and v2 catalogs in order to fix a compatibility issue with the Hive Metastore. But the Hive Metastore is not the only catalog Spark supports, since we opened up the Catalog APIs in DSv2.

Contributor

I don't know of any database that supports creating tables with a null/void type column, so this change is not for Hive compatibility but for reasonable SQL semantics.

I agree this is a breaking change that should at least be put in the migration guide. A legacy config can also be added, but I can't find a reasonable use case for a null-type column.


I don't know of any database that supports creating tables with a null/void type column, so this change is not for Hive compatibility but for reasonable SQL semantics.

I agree this is a breaking change that should at least be put in the migration guide. A legacy config can also be added, but I can't find a reasonable use case for a null-type column.

I think the main reason why you would want to support it is when people are using tables / views / temp tables to structure existing workloads. We support NullType in CTEs, but in the case where people want to reuse the same CTE in multiple queries (i.e., multi-output workloads), they have no choice but to use views or temporary tables. (With DataFrames they'd still be able to reuse the same dataframe for multiple outputs, but in SQL that doesn't work.)

One typical use case where you use CTEs to structure your code is if you have multiple sources with different structures that you then UNION ALL together into a single dataset. It is not uncommon for each of the sources to have certain columns that don't apply, and then you write explicit NULLs there. It would be pretty annoying if you had to write explicit casts of those NULLs to the right type in all of those cases.
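This UNION ALL pattern can be reproduced outside Spark; here is a small SQLite sketch (table and column names invented for illustration) of two sources with non-overlapping columns padded with explicit NULLs:

```python
# Demonstrates the UNION ALL pattern from the comment above: each source
# writes explicit NULLs for columns that don't apply to it.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE web_events (user_id INT, url TEXT);
    CREATE TABLE app_events (user_id INT, screen TEXT);
    INSERT INTO web_events VALUES (1, '/home');
    INSERT INTO app_events VALUES (2, 'settings');
""")

# Columns missing from one source are filled with explicit NULLs so the
# branches of the UNION ALL have the same shape.
rows = conn.execute("""
    SELECT user_id, url, NULL AS screen FROM web_events
    UNION ALL
    SELECT user_id, NULL AS url, screen FROM app_events
    ORDER BY user_id
""").fetchall()
```

SQLite's dynamic typing sidesteps the null-type-column question entirely, which is exactly the convenience the comment argues users expect from the bare `NULL` literals here without having to cast them.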

Contributor

@bart-samwel this makes sense, shall we also support CREATE TABLE t(c VOID)? Your case seems like CTAS only.


@bart-samwel this makes sense, shall we also support CREATE TABLE t(c VOID)? Your case seems like CTAS only.

I think the CREATE TABLE case with explicit types is not very useful, but it could be useful if there were tools that get a table's schema and then try to recreate it, e.g. for mocking purposes. Probably best to be orthogonal here.

Contributor

@LantaoJin do you have time to fix it? I think we can simply remove the null type check and add a few tests with both the in-memory and Hive catalogs.

cloud-fan pushed a commit that referenced this pull request Jul 27, 2021
### What changes were proposed in this pull request?
Previously we blocked creating tables with null-type columns to follow the Hive behavior, in PR #28833.
In this PR, I propose to restore the previous behavior and support null-type columns in a table.

### Why are the changes needed?
For a complex query, it's possible to generate a column with null type. If this happens in the input query of a
CTAS, the query fails because Spark doesn't allow creating a table with a null-type column. From the user's perspective,
it's hard to figure out why the null-type column is produced in a complicated query and how to fix it, so removing
this constraint is friendlier to users.

### Does this PR introduce _any_ user-facing change?
Yes, this reverts the previous behavior change in #28833; for example, the command below will succeed after this PR
```sql
CREATE TABLE t (col_1 void, col_2 int)
```

### How was this patch tested?
newly added and existing test cases

Closes #33488 from linhongliu-db/SPARK-36241-support-void-column.

Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request Jul 27, 2021
(cherry picked from commit 8e7e14d)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request Aug 2, 2021
### What changes were proposed in this pull request?
Change the `NullType.simpleString` to "void" to set "void" as the formal type name of `NullType`

### Why are the changes needed?
This PR is intended to address the type name discussion in PR #28833. Here are the reasons:
1. The type name of NullType is displayed everywhere, e.g. in schema strings, error messages, and documentation. Hence it's not possible to hide it from users; we have to choose a proper name.
2. "void" is widely used as the type name of "NULL", e.g. in Hive and PostgreSQL.
3. Changing to "void" enables the round trip of `toDDL`/`fromDDL` for NullType (i.e., it makes `from_json(col, schema.toDDL)` work).

### Does this PR introduce _any_ user-facing change?
Yes, the type name of "NULL" is changed from "null" to "void". For example:
```
scala> sql("select null as a, 1 as b").schema.catalogString
res5: String = struct<a:void,b:int>
```

### How was this patch tested?
existing test cases

Closes #33437 from linhongliu-db/SPARK-36224-void-type-name.

Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request Aug 2, 2021
(cherry picked from commit 2f70077)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>