
Conversation

@clockfly
Contributor

@clockfly clockfly commented May 31, 2016

What changes were proposed in this pull request?

The current implementation of "CREATE TEMPORARY TABLE USING datasource..." does NOT create any intermediate temporary data directory (such as a temporary HDFS folder); instead, it only stores a SQL string in memory. We should probably use "TEMPORARY VIEW" instead.

This PR assumes that a temporary table has to be linked with some temporary intermediate data. It follows this definition of a temporary table (from the Hortonworks doc, https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_dataintegration/content/temp-tables.html):

A temporary table is a convenient way for an application to automatically manage intermediate data generated during a complex query

Example:

scala> spark.sql("CREATE temporary view  my_tab7 (c1: String, c2: String)  USING org.apache.spark.sql.execution.datasources.csv.CSVFileFormat  OPTIONS (PATH '/Users/seanzhong/csv/cars.csv')")
scala> spark.sql("select c1, c2 from my_tab7").show()
+----+-----+
|  c1|   c2|
+----+-----+
|year| make|
|2012|Tesla|
...

It NOW prints a deprecation warning if "CREATE TEMPORARY TABLE USING..." is used.

scala> spark.sql("CREATE temporary table  my_tab7 (c1: String, c2: String)  USING org.apache.spark.sql.execution.datasources.csv.CSVFileFormat  OPTIONS (PATH '/Users/seanzhong/csv/cars.csv')")
16/05/31 10:39:27 WARN SparkStrategies$DDLStrategy: CREATE TEMPORARY TABLE tableName USING... is deprecated, please use CREATE TEMPORARY VIEW viewName USING... instead

How was this patch tested?

Unit test.

@clockfly clockfly force-pushed the create_temp_view_using branch from 94d66c2 to c2a29b3 Compare May 31, 2016 17:49
@hvanhovell
Contributor

So I am not sure I understand this one. Why should we deprecate this in favour of creating a view? A create temp table ... using statement describes access to physical storage, which in my book is a table.

Could you elaborate on why we need this?

@SparkQA

SparkQA commented May 31, 2016

Test build #59662 has finished for PR 13414 at commit 94d66c2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class CreateTempViewUsing(

@SparkQA

SparkQA commented May 31, 2016

Test build #59663 has finished for PR 13414 at commit c2a29b3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class CreateTempViewUsing(

@clockfly
Contributor Author

@hvanhovell

create temp table ... using statement describes access to physical storage, which in my book is a table.

We still allow create table using...; what we deprecate is "create temporary table using...". We still treat external data sources as tables. To wrap an external data source as a table, we can use "create table using...".

Temporary views and temporary tables are intermediate layers between the user and the actual table:

User --> Temporary view/table --> External data source table (Can be wrapped by create table using...)

Currently, our implementation does not support temporary tables; we only support temporary views. The difference is:

  1. A temporary view is backed by a SQL string, which acts like a pointer. Every time the temporary view is used, the SQL is RE-executed, which asks for data from the original data source again.
  2. A temporary table is supposed to execute the SQL ONLY once and store the result in a temporary HDFS directory. Every time the temporary table is used, the data in the temporary HDFS directory is read directly, without touching the original data source.
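The distinction above can be sketched without Spark at all: a temporary view behaves like a query that re-executes on every access, while a temporary table behaves like a one-time materialization. A minimal plain-Scala illustration (all names here are hypothetical, not Spark API):

```scala
object ViewVsTable {
  // Counts how often the underlying "data source" is scanned.
  var scansOfSource = 0

  private def scanSource(): Seq[Int] = {
    scansOfSource += 1
    Seq(1, 2, 3)
  }

  // "Temporary view": a stored query that re-executes on every use,
  // so each access goes back to the original data source.
  def tempView(): Seq[Int] = scanSource()

  // "Temporary table": the query runs once and the result is kept
  // (here in memory; in the PR's definition, in a temp HDFS dir),
  // so later accesses never touch the original source again.
  lazy val tempTable: Seq[Int] = scanSource()
}
```

Reading `tempView()` twice scans the source twice; reading `tempTable` twice scans it only once.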

@hvanhovell
Contributor

I think the name SessionCatalog.createTempView is a misnomer - this is reinforced by the fact that the documentation and usage all refer to creating temp tables...

I am pretty sure that no query is executed in this case. It will just scan the data. For example the following REPL code:

import java.nio.file.Files
val location = Files.createTempDirectory("data").resolve("src")
spark.range(0, 100000).
  select($"id".as("key"), rand().as("value")).
  write.parquet(location.toString)
spark.sql(s"create temporary table my_src using parquet options(path '$location')")
spark.table("my_src").explain(true)

Yields the following plan:

== Parsed Logical Plan ==
SubqueryAlias my_src
+- Relation[key#14L,value#15] parquet

== Analyzed Logical Plan ==
key: bigint, value: double
SubqueryAlias my_src
+- Relation[key#14L,value#15] parquet

== Optimized Logical Plan ==
Relation[key#14L,value#15] parquet

== Physical Plan ==
*BatchedScan parquet [key#14L,value#15] Format: ParquetFormat, InputPaths: file:/tmp/data8602759574255545993/src, PushedFilters: [], ReadSchema: struct<key:bigint,value:double>

Am I missing something?

@clockfly
Contributor Author

clockfly commented Jun 1, 2016

@hvanhovell

I updated the description, please check whether it makes more sense now.

@hvanhovell
Contributor

@clockfly the description is getting there. IIUC the problem we are solving is the following:

CREATE TEMPORARY TABLE ... USING ... allows us to create a temporary (session bound) connection to a (potentially) permanent data store. When the session finishes, the table definition (connection) is dropped, but the data is not. This is more-or-less the behavior you expect with a TEMPORARY EXTERNAL table (do we have those?), and this actually violates the common definition of a temporary table in which both the table definition and the data are session bound.

Using CREATE TEMPORARY VIEW ... USING ... accomplishes two things:

  • It doesn't make assumptions about the underlying data (it can be either permanent or session bound).
  • It doesn't allow users to write to the data source.

I do have a couple of issues with this:

  • Using CREATE TEMPORARY TABLE ... USING ... should still be allowed if you are creating an actual session-local temporary table. We could detect these by checking whether a schema is defined (the location is also an issue). How do we deal with this use case?
  • I would support creating a CREATE TEMPORARY EXTERNAL TABLE ... USING ... to retain the current behavior.

What do you think?

@clockfly
Contributor Author

clockfly commented Jun 2, 2016

@hvanhovell Probably we can talk more face to face next week.

identifierCommentList? (COMMENT STRING)?
(PARTITIONED ON identifierList)?
(TBLPROPERTIES tablePropertyList)? AS query #createView
| CREATE (OR REPLACE)? TEMPORARY VIEW tableIdentifier ('(' colTypeList ')')? tableProvider
Contributor

NIT: Could you break this line up so we keep all #... hooks on the same column...
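For illustration, the requested reformatting might look like the following, with the `#...` hook kept on its own aligned column. The hook name `#createTempViewUsing` and the exact continuation of the rule are assumptions, since the quoted diff is truncated here:

```
| CREATE (OR REPLACE)? TEMPORARY VIEW tableIdentifier
    ('(' colTypeList ')')? tableProvider                       #createTempViewUsing
```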

@hvanhovell
Contributor

@clockfly this looks pretty good. I have left some (minor) comments.

@clockfly clockfly force-pushed the create_temp_view_using branch from e521310 to 127c309 Compare June 6, 2016 23:12
@clockfly
Contributor Author

clockfly commented Jun 6, 2016

@hvanhovell Thanks for the review.

Updated.

@hvanhovell
Contributor

LGTM pending Jenkins

@SparkQA

SparkQA commented Jun 7, 2016

Test build #60079 has finished for PR 13414 at commit e521310.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 7, 2016

Test build #60081 has finished for PR 13414 at commit 127c309.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Contributor

Thanks! Merging to master/2.0

asfgit pushed a commit that referenced this pull request Jun 7, 2016
… "CREATE TEMPORARY VIEW USING..." instead

Author: Sean Zhong <seanzhong@databricks.com>

Closes #13414 from clockfly/create_temp_view_using.

(cherry picked from commit 890baac)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
@asfgit asfgit closed this in 890baac Jun 7, 2016