
Conversation

@clockfly
Contributor

@clockfly clockfly commented May 31, 2016

What changes were proposed in this pull request?

The current implementation of "CREATE TEMPORARY TABLE USING datasource..." does NOT create any intermediate temporary data directory (such as a temporary HDFS folder); instead, it only stores a SQL string in memory. We should probably use "TEMPORARY VIEW" instead.

This PR assumes that a temporary table has to be linked with some temporary intermediate data. It follows this definition of a temporary table (from the Hortonworks doc, https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_dataintegration/content/temp-tables.html):

A temporary table is a convenient way for an application to automatically manage intermediate data generated during a complex query

Example:

scala> spark.sql("CREATE temporary view  my_tab7 (c1: String, c2: String)  USING org.apache.spark.sql.execution.datasources.csv.CSVFileFormat  OPTIONS (PATH '/Users/seanzhong/csv/cars.csv')")
scala> spark.sql("select c1, c2 from my_tab7").show()
+----+-----+
|  c1|   c2|
+----+-----+
|year| make|
|2012|Tesla|
...

It NOW prints a deprecation warning if "CREATE TEMPORARY TABLE USING..." is used.

scala> spark.sql("CREATE temporary table  my_tab7 (c1: String, c2: String)  USING org.apache.spark.sql.execution.datasources.csv.CSVFileFormat  OPTIONS (PATH '/Users/seanzhong/csv/cars.csv')")
16/05/31 10:39:27 WARN SparkStrategies$DDLStrategy: CREATE TEMPORARY TABLE tableName USING... is deprecated, please use CREATE TEMPORARY VIEW viewName USING... instead

How was this patch tested?

Unit test.

@clockfly clockfly force-pushed the create_temp_view_using branch from 94d66c2 to c2a29b3 Compare May 31, 2016 17:49
@hvanhovell
Contributor

So I am not sure I understand this one. Why should we deprecate this in favour of creating a view? A create temp table ... using statement describes access to physical storage, which in my book is a table.

Could you elaborate on why we need this?

@SparkQA

SparkQA commented May 31, 2016

Test build #59662 has finished for PR 13414 at commit 94d66c2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class CreateTempViewUsing(

@SparkQA

SparkQA commented May 31, 2016

Test build #59663 has finished for PR 13414 at commit c2a29b3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class CreateTempViewUsing(

@clockfly
Contributor Author

@hvanhovell

create temp table ... using statement describes access to physical storage, which in my book is a table.

We still allow create table using...; what we deprecate is "create temporary table using...". We still treat external data sources as tables. To wrap an external data source as a table, we can use "create table using...".

Temporary views and temporary tables are intermediate layers between the user and the actual table:

User --> Temporary view/table --> External data source table (Can be wrapped by create table using...)

Currently, our implementation does not support temporary tables; we only support temporary views. The difference is:

  1. A temporary view is backed by a SQL string, which acts like a pointer. Every time the temporary view is used, the SQL is RE-executed, which asks for data from the original data source again.
  2. A temporary table is supposed to execute the SQL ONLY once and store the result in a temporary HDFS directory. Every time the temporary table is used, the data in the temporary HDFS directory is read directly, without touching the original data source.
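The distinction above can be sketched without Spark at all: a temporary view behaves like a query that re-executes on every access, while a temporary table behaves like a one-time materialization. A minimal plain-Scala illustration (all names here are hypothetical, not Spark API):

```scala
object ViewVsTable {
  // Counts how often the underlying "data source" is scanned.
  var scansOfSource = 0

  private def scanSource(): Seq[Int] = {
    scansOfSource += 1
    Seq(1, 2, 3)
  }

  // "Temporary view": a stored query that re-executes on every use,
  // so each access goes back to the original data source.
  def tempView(): Seq[Int] = scanSource()

  // "Temporary table": the query runs once and the result is kept
  // (here in memory; in the PR's definition, in a temp HDFS dir),
  // so later accesses never touch the original source again.
  lazy val tempTable: Seq[Int] = scanSource()
}
```

Reading `tempView()` twice scans the source twice; reading `tempTable` twice scans it only once.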

@hvanhovell
Contributor

I think the name SessionCatalog.createTempView is a misnomer - this is reinforced by the fact that the documentation and usage all refer to creating temp tables...

I am pretty sure that no query is executed in this case. It will just scan the data. For example the following REPL code:

import java.nio.file.Files
val location = Files.createTempDirectory("data").resolve("src")
spark.range(0, 100000).
  select($"id".as("key"), rand().as("value")).
  write.parquet(location.toString)
spark.sql(s"create temporary table my_src using parquet options(path '$location')")
spark.table("my_src").explain(true)

Yields the following plan:

== Parsed Logical Plan ==
SubqueryAlias my_src
+- Relation[key#14L,value#15] parquet

== Analyzed Logical Plan ==
key: bigint, value: double
SubqueryAlias my_src
+- Relation[key#14L,value#15] parquet

== Optimized Logical Plan ==
Relation[key#14L,value#15] parquet

== Physical Plan ==
*BatchedScan parquet [key#14L,value#15] Format: ParquetFormat, InputPaths: file:/tmp/data8602759574255545993/src, PushedFilters: [], ReadSchema: struct<key:bigint,value:double>

Am I missing something?

@clockfly
Contributor Author

clockfly commented Jun 1, 2016

@hvanhovell

I updated the description, please check whether it makes more sense now.

@hvanhovell
Contributor

@clockfly the description is getting there. IIUC the problem we are solving is the following:

CREATE TEMPORARY TABLE ... USING ... allows us to create a temporary (session bound) connection to a (potentially) permanent data store. When the session finishes, the table definition (connection) is dropped, but the data is not. This is more-or-less the behavior you expect with a TEMPORARY EXTERNAL table (do we have those?), and this actually violates the common definition of a temporary table in which both the table definition and the data are session bound.

Using CREATE TEMPORARY VIEW ... USING ... accomplishes two things:

  • It doesn't make assumptions about the underlying data (it can be either permanent or session bound).
  • It doesn't allow users to write to the data source.

I do have a couple of issues with this:

  • Using CREATE TEMPORARY TABLE ... USING ... should still be allowed if you are creating an actual session-local temporary table. We could detect these by checking whether a schema is defined (the location is also an issue). How do we deal with this use case?
  • I would support creating a CREATE TEMPORARY EXTERNAL TABLE ... USING ... to retain the current behavior.

What do you think?

@clockfly
Contributor Author

clockfly commented Jun 2, 2016

@hvanhovell Probably we can talk more face to face next week.

identifierCommentList? (COMMENT STRING)?
(PARTITIONED ON identifierList)?
(TBLPROPERTIES tablePropertyList)? AS query #createView
| CREATE (OR REPLACE)? TEMPORARY VIEW tableIdentifier ('(' colTypeList ')')? tableProvider
Contributor

NIT: Could you break this line up so we keep all #... hooks on the same column...
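For illustration, the requested reformatting might look like the following, with the `#...` hook kept on its own aligned column. The hook name `#createTempViewUsing` and the exact continuation of the rule are assumptions, since the quoted diff is truncated here:

```
| CREATE (OR REPLACE)? TEMPORARY VIEW tableIdentifier
    ('(' colTypeList ')')? tableProvider                       #createTempViewUsing
```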

@hvanhovell
Contributor

@clockfly this looks pretty good. I have left some (minor) comments.

@clockfly clockfly force-pushed the create_temp_view_using branch from e521310 to 127c309 Compare June 6, 2016 23:12
@clockfly
Contributor Author

clockfly commented Jun 6, 2016

@hvanhovell Thanks for the review.

Updated.

@hvanhovell
Contributor

LGTM pending Jenkins

@SparkQA

SparkQA commented Jun 7, 2016

Test build #60079 has finished for PR 13414 at commit e521310.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 7, 2016

Test build #60081 has finished for PR 13414 at commit 127c309.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Contributor

Thanks! Merging to master/2.0

asfgit pushed a commit that referenced this pull request Jun 7, 2016
… "CREATE TEMPORARY VIEW USING..." instead

Author: Sean Zhong <seanzhong@databricks.com>

Closes #13414 from clockfly/create_temp_view_using.

(cherry picked from commit 890baac)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
@asfgit asfgit closed this in 890baac Jun 7, 2016