
Conversation

@gatorsmile
Member

@gatorsmile gatorsmile commented Jun 25, 2016

What changes were proposed in this pull request?

Currently, the following CREATE TABLE AS SELECT statements produce Hive tables.

CREATE TABLE t STORED AS parquet AS SELECT 1 AS a, 1 AS b

CREATE TABLE t1
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
AS SELECT col1, col2 FROM t3

When users issue CREATE TABLE AS SELECT with STORED AS or ROW FORMAT, we currently do not convert the result to a data source table, even when spark.sql.hive.convertCTAS is set to true. For the Parquet and ORC formats, however, we can still perform the conversion even when the user specifies STORED AS or ROW FORMAT.
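As a hedged sketch of the behavior this PR targets (the table name is illustrative), enabling the flag and then issuing a Parquet CTAS with STORED AS would yield a data source table rather than a Hive table:

```sql
-- Assumes spark.sql.hive.convertCTAS=true; converted_t is an illustrative name.
SET spark.sql.hive.convertCTAS=true;
CREATE TABLE converted_t STORED AS parquet AS SELECT 1 AS a, 1 AS b;
-- With this patch, converted_t would be stored as a data source (Parquet) table.
```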

How was this patch tested?

Added test cases for both ORC and Parquet.

@SparkQA

SparkQA commented Jun 26, 2016

Test build #61243 has finished for PR 13907 at commit c4bde02.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile gatorsmile changed the title [SPARK-16209] [SQL] Convert Hive Tables to Data Source Tables for CREATE TABLE AS SELECT [SPARK-16209] [SQL] Convert Create Hive Tables As Select in Parquet/Orc to Data Source Tables for CREATE TABLE AS SELECT Jun 26, 2016
@gatorsmile gatorsmile changed the title [SPARK-16209] [SQL] Convert Create Hive Tables As Select in Parquet/Orc to Data Source Tables for CREATE TABLE AS SELECT [SPARK-16209] [SQL] Convert Hive Tables in PARQUET/ORC to Data Source Tables for CREATE TABLE AS SELECT Jun 26, 2016
@SparkQA

SparkQA commented Jun 26, 2016

Test build #61253 has finished for PR 13907 at commit a9ce0d8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai
Contributor

yhuai commented Jun 27, 2016

With your PR, if users specify ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde', will we convert?

@gatorsmile
Member Author

gatorsmile commented Jun 27, 2016

Nope. If users do not specify the input and output formats, we will use the default INPUTFORMAT, org.apache.hadoop.mapred.TextInputFormat, and the default OUTPUTFORMAT, org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat. These differ from the standard input and output formats for ORC: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat and org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat.
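The case described above can be illustrated with a hedged sketch (the table name is illustrative): when only the ORC SerDe is specified, Hive's defaults fill in the input/output formats, which do not match ORC's.

```sql
-- Only the SerDe is given, so INPUTFORMAT/OUTPUTFORMAT fall back to the defaults:
--   org.apache.hadoop.mapred.TextInputFormat
--   org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
-- These do not match ORC's OrcInputFormat/OrcOutputFormat, so no conversion happens.
CREATE TABLE t_serde_only
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
AS SELECT 1 AS a;
```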

I am not sure whether we should still convert it. Please let me know if you think we should still convert them. Thanks!

BTW, I also confirmed Spark SQL and Hive have the same default input and output formats.

@gatorsmile
Member Author

gatorsmile commented Aug 4, 2016

cc @cloud-fan This is not contained in #14482. Should I leave this open, or fix the conflicts after #14482 is merged?

@cloud-fan
Contributor

I don't think it's a very useful feature, and we may surprise users, since they explicitly used Hive syntax to specify the row format.

For advanced users, they can easily use USING xxx to explicitly create a data source table for better performance.
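The USING alternative mentioned above can be sketched as follows (a hedged example; the table name is illustrative):

```sql
-- Explicitly creates a data source table, bypassing Hive SerDe handling entirely.
CREATE TABLE t_ds USING parquet AS SELECT col1, col2 FROM t3;
```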

@gatorsmile
Member Author

I see. Let me close it.

@gatorsmile gatorsmile closed this Aug 9, 2016
