
Conversation

@yhuai (Contributor) commented May 28, 2016

What changes were proposed in this pull request?

When `spark.sql.hive.convertCTAS` is true, for a CTAS statement we will create a data source table using the default source (i.e. parquet) if the CTAS does not specify any Hive storage format. However, there are two issues with this conversion logic.

  1. We determine whether a CTAS statement defines a storage format by checking the serde. However, TEXTFILE/SEQUENCEFILE have no default serde, and at the time of the check the default serde has not been set yet. So a query like `CREATE TABLE abc STORED AS TEXTFILE AS SELECT ...` actually creates a data source parquet table.
  2. The conversion logic ignores the user-specified location.

This PR fixes the above two issues.

Also, this PR makes the parser throw an exception when a CTAS statement has a PARTITIONED BY clause. This change is made because Hive's syntax does not allow it, and our current implementation does not actually work for this case (the insert operation always throws an exception because the insertion does not pick up the partitioning info).
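The corrected decision rule can be modeled roughly as follows. This is a minimal Python sketch of the logic described above, not Spark's actual implementation; the function and parameter names are illustrative:

```python
# Sketch of the fixed CTAS conversion rule: convert to a data source
# (parquet) table only when spark.sql.hive.convertCTAS is enabled AND the
# statement specifies no Hive storage format at all. Before the fix, the
# check looked only at the serde, and TEXTFILE/SEQUENCEFILE have no default
# serde, so "STORED AS TEXTFILE" was wrongly converted.

def should_convert_to_data_source(convert_ctas: bool,
                                  has_file_format: bool,
                                  has_row_format: bool) -> bool:
    """Return True when the CTAS should become a data source table."""
    has_storage_properties = has_file_format or has_row_format
    return convert_ctas and not has_storage_properties

# No storage clause at all: conversion applies.
assert should_convert_to_data_source(True, False, False) is True
# STORED AS TEXTFILE (a file format, even without a serde): keep as Hive table.
assert should_convert_to_data_source(True, True, False) is False
# Conversion disabled: always a Hive table.
assert should_convert_to_data_source(False, False, False) is False
```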

How was this patch tested?

I am adding new tests in SQLQuerySuite and HiveDDLCommandSuite.

@yhuai (Contributor, Author) commented May 28, 2016

@ericl @andrewor14 @liancheng Can you review this PR?

@SparkQA commented May 29, 2016

Test build #59570 has finished for PR 13386 at commit 2615f67.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 29, 2016

Test build #59579 has finished for PR 13386 at commit c5cb32c.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai (Contributor, Author) commented May 29, 2016

test this please

@SparkQA commented May 29, 2016

Test build #59581 has finished for PR 13386 at commit c5cb32c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai changed the title from "[SPARK-14507] [SPARK-15646] [SQL] When spark.sql.hive.convertCTAS is true, we should not convert the table stored as TEXTFILE/SEQUENCEFILE and we need respect the user-defined location" to "[SPARK-15646] [SQL] When spark.sql.hive.convertCTAS is true, we should not convert the table stored as TEXTFILE/SEQUENCEFILE and we need respect the user-defined location" on May 29, 2016
@yhuai changed the title from "[SPARK-15646] [SQL] When spark.sql.hive.convertCTAS is true, we should not convert the table stored as TEXTFILE/SEQUENCEFILE and we need respect the user-defined location" to "[SPARK-15646] [SQL] When spark.sql.hive.convertCTAS is true, the conversion rule needs to respect TEXTFILE/SEQUENCEFILE format and the user-defined location" on May 29, 2016
@yhuai (Contributor, Author) commented May 29, 2016

OK. External related changes will be handled by #13395.

@SparkQA commented May 29, 2016

Test build #59594 has finished for PR 13386 at commit fa89081.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai (Contributor, Author) commented May 31, 2016

test this please

@SparkQA commented May 31, 2016

Test build #59660 has finished for PR 13386 at commit fa89081.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```scala
    q
  )
} else {
  CreateTableAsSelectLogicalPlan(tableDesc, q, ifNotExists)
```
Contributor: Should this one also be renamed to `CreateHiveTableAsSelectLogicalPlan`?

Author (@yhuai): Looking at its implementation, it is not Hive-specific, so it seems fine to leave it as is.

Contributor: Seems like `HiveMetastoreCatalog` is the only user, though.

Author (@yhuai): OK, let me change it.

@ericl (Contributor) commented May 31, 2016

Looks good; I just have a couple of questions.

```scala
val hasStorageProperties = (ctx.createFileFormat != null) || (ctx.rowFormat != null)
if (conf.convertCTAS && !hasStorageProperties) {
  val mode = if (ifNotExists) SaveMode.Ignore else SaveMode.ErrorIfExists
  val options = rowStorage.serdeProperties ++ fileStorage.serdeProperties
```
Contributor: I think these will always be empty if we've reached here, no?

Author (@yhuai): Yeah.
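The reviewer's point holds because of the guard on this branch: conversion is attempted only when neither ROW FORMAT nor STORED AS was specified, so both serde-property maps are necessarily empty, and so is their union. A hypothetical Python sketch of the invariant (dicts stand in for Scala Maps; the names are illustrative, not Spark's):

```python
# On the convertCTAS branch, the CTAS had no ROW FORMAT and no STORED AS
# clause, so neither map has had serde properties added to it.
row_serde_properties: dict = {}   # would hold ROW FORMAT SERDEPROPERTIES
file_serde_properties: dict = {}  # would hold STORED AS serde properties

# Guard for the conversion branch: no storage properties were specified.
has_storage_properties = bool(row_serde_properties) or bool(file_serde_properties)
assert not has_storage_properties

# Hence the options handed to the data source are always the empty map.
options = {**row_serde_properties, **file_serde_properties}
assert options == {}
```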

@andrewor14 (Contributor)

LGTM, minor comments only.

@SparkQA commented Jun 1, 2016

Test build #59720 has finished for PR 13386 at commit b137cba.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • case class CreateHiveTableAsSelectLogicalPlan(

@SparkQA commented Jun 1, 2016

Test build #59748 has finished for PR 13386 at commit 88e7422.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14 (Contributor)

Merging into master and 2.0.

asfgit pushed a commit that referenced this pull request Jun 2, 2016

[SPARK-15646] [SQL] When spark.sql.hive.convertCTAS is true, the conversion rule needs to respect TEXTFILE/SEQUENCEFILE format and the user-defined location


Author: Yin Huai <yhuai@databricks.com>

Closes #13386 from yhuai/SPARK-14507.

(cherry picked from commit 6dddb70)
Signed-off-by: Andrew Or <andrew@databricks.com>
@asfgit closed this in 6dddb70 on Jun 2, 2016
@gatorsmile (Member)

Just realized this PR introduced the original changes. Could you also review my PR, #13907?

When users create a table as a query with STORED AS or ROW FORMAT while `spark.sql.hive.convertCTAS` is set to true, we do not convert it to a data source table. I am wondering whether we can still convert tables stored in the parquet and orc formats to data source tables. Thanks!
