Conversation

@LantaoJin
Contributor

@LantaoJin LantaoJin commented Jun 28, 2020

What changes were proposed in this pull request?

This addresses the closed PR #17953 and is a refactor of #28833.
It adds HiveVoidType, analogous to HiveStringType, to prevent an exception when describing tables/views whose schema contains the Hive VOID/NULL type.

Why are the changes needed?

Spark is incompatible with the Hive void type. When a Hive table schema contains a void column, DESC table throws an exception in Spark.

hive> create table bad as select 1 x, null z from dual;
hive> describe bad;
OK
x int
z void

In Spark 2.0.x, the behavior when reading this table was normal:

spark-sql> describe bad;
x int NULL
z void NULL
Time taken: 4.431 seconds, Fetched 2 row(s)

But in the latest Spark version, it fails with SparkException: Cannot recognize hive type string: void

spark-sql> describe bad;
17/05/09 03:12:08 ERROR thriftserver.SparkSQLDriver: Failed in [describe bad]
org.apache.spark.SparkException: Cannot recognize hive type string: void
Caused by: org.apache.spark.sql.catalyst.parser.ParseException:
DataType void() is not supported.(line 1, pos 0)
== SQL ==
void
^^^
... 61 more
org.apache.spark.SparkException: Cannot recognize hive type string: void

Does this PR introduce any user-facing change?

No

How was this patch tested?

Add unit tests

It can also be tested manually:

spark-sql> describe bad;
x int NULL
z null NULL
Time taken: 0.486 seconds, Fetched 2 row(s)

Member

@HyukjinKwon HyukjinKwon left a comment

We should also fix

and add null type at
_all_atomic_types = dict((t.typeName(), t) for t in _atomic_types)

I am okay if you're not used to Python side - I can do it in a followup.

@LantaoJin
Contributor Author

Thanks @HyukjinKwon. If this gets merged, can you help on the Python side?

* and should NOT be used anywhere else. Any instance of these data types should be
* replaced by a [[NullType]] before analysis.
*/
class HiveNullType private() extends DataType {
Member

I know the context, but can we name this HiveVoidType literally?

Member

Currently, the description reads as if "the Hive null type should be replaced by a NullType before analysis".

Member

null is a value and Hive exposes void as a type.

Contributor Author

null is a value and Hive exposes void as a type.

You are right.

@dongjoon-hyun
Member

Thank you for working on this, @LantaoJin !

@LantaoJin LantaoJin changed the title [SPARK-20680][SQL] Adding HiveNullType in Spark to be compatible with Hive [SPARK-20680][SQL] Adding HiveVoidType in Spark to be compatible with Hive Jun 28, 2020
@SparkQA

SparkQA commented Jun 28, 2020

Test build #124583 has finished for PR 28935 at commit 17b1853.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 28, 2020

Test build #124584 has finished for PR 28935 at commit ba2ef06.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 28, 2020

Test build #124590 has finished for PR 28935 at commit 17aace2.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@LantaoJin
Contributor Author

retest this please

@SparkQA

SparkQA commented Jun 28, 2020

Test build #124597 has finished for PR 28935 at commit 17aace2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Thank you for updating, @LantaoJin .

@dongjoon-hyun
Member

@LantaoJin . Is there a reason why you use $ at the Scala file name?

  • sql/catalyst/src/main/scala/org/apache/spark/sql/types/HiveVoidType$.scala

@LantaoJin
Contributor Author

@LantaoJin . Is there a reason why you use $ at the Scala file name?

  • sql/catalyst/src/main/scala/org/apache/spark/sql/types/HiveVoidType$.scala

No, just a typo. Fixed.

case ("decimal" | "dec" | "numeric", precision :: scale :: Nil) =>
DecimalType(precision.getText.toInt, scale.getText.toInt)
case ("interval", Nil) => CalendarIntervalType
case ("void", Nil) => HiveVoidType
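For illustration only, a hypothetical Python mini-dispatcher mirroring the Scala parser cases above (names and return tags here are invented; the real code lives in the catalyst parser):

```python
# Map a (type name, parameter list) pair to a type tag, mirroring the
# pattern matches in the diff above. Tags are illustrative strings/tuples.
def primitive_type(name, params):
    if name in ("decimal", "dec", "numeric") and len(params) == 2:
        precision, scale = params
        return ("decimal", int(precision), int(scale))
    if name == "interval" and not params:
        return "calendar_interval"
    if name == "void" and not params:
        # HiveVoidType in this PR; per the review, NullType would do as well.
        return "void"
    raise ValueError(f"DataType {name}() is not supported.")

assert primitive_type("void", []) == "void"
assert primitive_type("decimal", ["10", "2"]) == ("decimal", 10, 2)
```

Without the "void" case, the dispatcher falls through to the "not supported" error, which is exactly the ParseException shown in the PR description.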
Contributor

We can just reuse NullType. We should forbid creating tables with the null type completely, including via spark.catalog.createTable.

@SparkQA

SparkQA commented Jun 29, 2020

Test build #124614 has finished for PR 28935 at commit a3a1cef.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 29, 2020

Test build #124639 has started for PR 28935 at commit 9b9d021.

@cloud-fan
Contributor

I think we need two changes:

  1. forbid creating tables with void column type. This could be done in the rule CommandCheck. We can test it with spark.catalog.createTable
  2. support "void" in the parser to fix SPARK-20680. This is only for legacy hive tables, as new tables can't have void type columns.

These two changes can be done with 2 PRs.
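A minimal sketch of what change 1 amounts to, assuming a much-simplified schema model (strings for leaf types, dicts for structs, lists for arrays; the real check would be a Scala analysis rule):

```python
# Reject any schema that contains a null/void column anywhere in its
# type tree before a table is created.
def contains_null_type(dt):
    """Recursively search a simplified data-type tree for 'null'."""
    if isinstance(dt, str):
        return dt == "null"
    if isinstance(dt, dict):          # struct: field name -> type
        return any(contains_null_type(t) for t in dt.values())
    if isinstance(dt, list):          # array: [element type]
        return any(contains_null_type(t) for t in dt)
    return False

def check_create_table(schema):
    if contains_null_type(schema):
        raise ValueError("Cannot create tables with VOID type.")

check_create_table({"x": "int", "tags": ["string"]})   # passes
# check_create_table({"x": "int", "z": "null"})        # would raise
```

The recursion matters: a null type nested inside a struct or array column should be rejected just like a top-level null column.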

@LantaoJin
Contributor Author

Yes. This PR is for the second change.

@cloud-fan
Contributor

We need to do 1 first; otherwise this PR would let users create tables with the void type via the CREATE TABLE command, which was not possible before because the parser didn't support it.

@LantaoJin
Contributor Author

LantaoJin commented Jun 29, 2020

Ah, now I understand the context. The 5th commit should be reverted in this PR, otherwise the UT will fail. And we need to do 1 first. I will work on that tomorrow; my laptop is out of power now.

@LantaoJin
Contributor Author

@cloud-fan I refactored some code; now I think this PR has no dependencies.

}
// Add Hive type string to metadata.
val cleanedDataType = HiveStringType.replaceCharType(dataType)
// Add Hive type 'string' and 'void' to metadata.
Contributor

We can be more aggressive here: forbid the void type in all cases, including Hive tables.

*/
private def visitSparkDataType(ctx: DataTypeContext): DataType = {
HiveStringType.replaceCharType(typedVisit(ctx))
HiveVoidType.replaceVoidType(HiveStringType.replaceCharType(typedVisit(ctx)))
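An illustrative sketch (hypothetical, not the Scala implementation) of what a replaceVoidType pass does: walk the type tree and rewrite any void leaf to null, just as replaceCharType rewrites char/varchar leaves to string:

```python
# Types are modeled as strings ("int", "void") or containers:
# dict = struct (field name -> type), list = array ([element type]).
def replace_void_type(dt):
    if isinstance(dt, dict):
        return {name: replace_void_type(t) for name, t in dt.items()}
    if isinstance(dt, list):
        return [replace_void_type(t) for t in dt]
    return "null" if dt == "void" else dt

assert replace_void_type({"x": "int", "z": "void"}) == {"x": "int", "z": "null"}
```

This is the "replace before analysis" step the HiveNullType/HiveVoidType scaladoc describes: the Hive-specific placeholder never survives past parsing.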
Contributor

I don't get it why we need HiveVoidType. What happens if we just parse void to NullType?

Contributor Author

Because that indicates VOID is a Hive type, the handling is more unified. Or we can just use PR #28833.

Contributor Author

For example, the function below points out that the failure is due to the legacy Hive void type. If we mix VOID and NULL, I am not sure that would be better than keeping them separate.

  def failVoidType(dt: DataType): Unit = {
    if (HiveVoidType.containsVoidType(dt)) {
      throw new AnalysisException(
        "Cannot create tables with Hive VOID type.")
    }
  }

Contributor

VOID and NULL are indeed the same type. We can just check for the null type and fail with the error message: Cannot create tables with VOID type

Contributor

The point is consistency: the VOID type in a SQL statement should be the same as the NullType specified via the Scala API in spark.catalog.createTable.

Contributor Author

@cloud-fan, OK. I will follow your suggestion and fix it in #28833, since this PR is a refactor introducing the new type HiveVoidType, which we no longer need.

@SparkQA

SparkQA commented Jun 30, 2020

Test build #124657 has finished for PR 28935 at commit 3fa76cf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@LantaoJin
Contributor Author

Closing this since #28833 was merged. Thank you!

@LantaoJin LantaoJin closed this Jul 8, 2020