
Conversation

@windpiger
Contributor

@windpiger windpiger commented Feb 15, 2017

What changes were proposed in this pull request?

  spark.sql(
          s"""
             |CREATE TABLE t
             |USING parquet
             |PARTITIONED BY(a, b)
             |LOCATION '$dir'
             |AS SELECT 3 as a, 4 as b, 1 as c, 2 as d
           """.stripMargin)

Failed with the error message:

path file:/private/var/folders/6r/15tqm8hn3ldb3rmbfqm1gf4c0000gn/T/spark-195cd513-428a-4df9-b196-87db0c73e772 already exists.;
org.apache.spark.sql.AnalysisException: path file:/private/var/folders/6r/15tqm8hn3ldb3rmbfqm1gf4c0000gn/T/spark-195cd513-428a-4df9-b196-87db0c73e772 already exists.;
	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:102)

The same DDL on a Hive table works fine, so we should fix this for data source tables.

The reason is that the SaveMode check is placed in InsertIntoHadoopFsRelationCommand, and that check operates on the path. This is fine when we use DataFrameWriter.save(), because in that situation the SaveMode acts on the path.

But when we use CreateDataSourceTableAsSelectCommand, the SaveMode acts on the table, and we have already done the SaveMode check for the table in CreateDataSourceTableAsSelectCommand. So we should not repeat the SaveMode check on the path in InsertIntoHadoopFsRelationCommand; that logic is redundant and wrong for CreateDataSourceTableAsSelectCommand.
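A minimal sketch of the two check levels (hypothetical names and types, not the actual Spark classes):

```scala
// Hypothetical model of the two code paths, not Spark's real implementation.
sealed trait SaveMode
case object ErrorIfExists extends SaveMode
case object Overwrite extends SaveMode

// DataFrameWriter.save(): the SaveMode acts on the output path, so an
// ErrorIfExists write must fail when the path is already there.
def pathLevelCheck(mode: SaveMode, pathExists: Boolean): Either[String, Unit] =
  (mode, pathExists) match {
    case (ErrorIfExists, true) => Left("path already exists")
    case _                     => Right(())
  }

// CTAS: the SaveMode acts on the table, and that check already ran at the
// table level, so an existing path is legitimate and must not fail.
def tableLevelCheck(tableExists: Boolean): Either[String, Unit] =
  if (tableExists) Left("table already exists") else Right(())
```

Under this split, CREATE TABLE ... LOCATION path AS SELECT ... passes the table-level check even when the path already exists.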

After this PR, the following DDL will succeed; when the location already exists, we will append to it or overwrite it.

CREATE TABLE ... (PARTITIONED BY ...) LOCATION path AS SELECT ...

How was this patch tested?

unit test added

@SparkQA

SparkQA commented Feb 15, 2017

Test build #72932 has finished for PR 16938 at commit 058865b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@windpiger
Contributor Author

cc @gatorsmile @cloud-fan

@cloud-fan
Contributor

I don't think we should treat this as a bug just because Hive supports it; we should think it through. Does it make sense to specify an existing directory in CTAS?

@gatorsmile
Member

We need to define a consistent rule in the Catalog for handling the scenario where the to-be-created directory already exists. So far, in most DDL scenarios, when trying to create a directory that already exists, we simply use the existing directory without an error message. mkdir -p does not complain if the destination directory exists.

@tejasapatil
Contributor

From what I understand, this change applies to EXTERNAL tables only.

There are two main uses of EXTERNAL tables I am aware of (reposted from #16868 (comment)):

  • Ingest data from non-hive locations into Hive tables.
  • Create a logical "pointer" to an existing hive table / partition (without creating multiple copies of the underlying data).

The ability to point at an arbitrary location (which already has data) and create an EXTERNAL table over it is important for supporting EXTERNAL tables. If we don't accept this PR, the options left to users are:

  • Create an external table and point to some non-existing location.
  • Later do either of these 2 things:
    • issue ALTER TABLE SET LOCATION to set the external table's location to the source location having desired data.
    • do a dfs -mv from the source location of the data to the new location the table points at. This will be nasty if your source data was at a managed table's location.

@cloud-fan: I don't think Spark's interpretation of EXTERNAL tables is different from Hive's. If it is, can you share the differences? I think we should allow this. If you have specific concerns, let's discuss them.

@windpiger
Contributor Author

I think CTAS does not allow the table to already exist, but is not strict about whether the path exists. In DataFrameWriter.save with ErrorIfExists mode, an existing path is not allowed.

@windpiger
Contributor Author

@cloud-fan @gatorsmile @tejasapatil let's discuss this together?

@cloud-fan
Contributor

ok let's discuss it case by case:

  1. CREATE TABLE ... LOCATION path works if the path exists; that's expected.
  2. CREATE TABLE ... LOCATION path fails if the path doesn't exist. Is that expected?
  3. CREATE TABLE ... LOCATION path AS SELECT ...: shall we fail if the path exists?
  4. ALTER TABLE ... SET LOCATION path: shall we fail if the path doesn't exist?

@gatorsmile
Member

gatorsmile commented Feb 17, 2017

One more case for managed tables:
5. CREATE TABLE or CTAS without a location spec: if the default path exists, should we succeed or fail?

After we finish the TABLE-level DDLs, we also need to do the same for DATABASE-level and PARTITION-level DDLs.

@windpiger
Contributor Author

windpiger commented Feb 17, 2017

Updated (tested against Hive 2.0.0, adding EXTERNAL).
Comparison of the spark-master branch and Hive 2.0.0.
There are some behavioral differences. @cloud-fan @gatorsmile @tejasapatil

Summary:
spark(hive with HiveExternalCatalog) -> all commands below succeed.
spark(parquet with HiveExternalCatalog) and spark(parquet with InMemoryCatalog) behave the same.

1. CREATE TABLE ... LOCATION path

a) path exists
      hive(external) -> ok
      spark(hive with HiveExternalCatalog) -> ok
      spark(parquet with HiveExternalCatalog) -> ok
      spark(parquet with InMemoryCatalog) -> ok
b) path does not exist
      hive(external) -> ok
      spark(hive with HiveExternalCatalog) -> ok
      spark(parquet with HiveExternalCatalog) -> throws exception (path does not exist)
      spark(parquet with InMemoryCatalog) -> throws exception (path does not exist)

2. CREATE TABLE ... LOCATION path AS SELECT ...

a) path exists
      hive(external) -> FAILED: SemanticException [Error 10070]: CREATE-TABLE-AS-SELECT cannot create external table
      spark(hive with HiveExternalCatalog) -> ok
      spark(parquet with HiveExternalCatalog) -> throws exception (path already exists)
      spark(parquet with InMemoryCatalog) -> throws exception (path already exists)
b) path does not exist
      hive(external) -> FAILED: SemanticException [Error 10070]: CREATE-TABLE-AS-SELECT cannot create external table
      spark(hive with HiveExternalCatalog) -> ok
      spark(parquet with HiveExternalCatalog) -> ok
      spark(parquet with InMemoryCatalog) -> ok

3. ALTER TABLE ... SET LOCATION path

a) path exists
      hive -> ok
      spark(hive with HiveExternalCatalog) -> ok
      spark(parquet with HiveExternalCatalog) -> ok
      spark(parquet with InMemoryCatalog) -> ok
b) path does not exist
      hive -> ok
      spark(hive with HiveExternalCatalog) -> ok
      spark(parquet with HiveExternalCatalog) -> ok
      spark(parquet with InMemoryCatalog) -> ok

4. CREATE TABLE ....

a) default warehouse table path exists
       hive -> ok
       spark(hive with HiveExternalCatalog) -> ok
       spark(parquet with HiveExternalCatalog) -> ok
       spark(parquet with InMemoryCatalog) -> ok
b) default warehouse table path does not exist
       hive -> ok
       spark(hive with HiveExternalCatalog) -> ok
       spark(parquet with HiveExternalCatalog) -> ok
       spark(parquet with InMemoryCatalog) -> ok

5. CREATE TABLE ... AS SELECT ...

a) default warehouse table path exists
       hive -> ok
       spark(hive with HiveExternalCatalog) -> ok
       spark(parquet with HiveExternalCatalog) -> throws exception (path already exists)
       spark(parquet with InMemoryCatalog) -> throws exception (path already exists)
b) default warehouse table path does not exist
       hive ->  ok
       spark(hive with HiveExternalCatalog) -> ok
       spark(parquet with HiveExternalCatalog) -> ok
       spark(parquet with InMemoryCatalog) -> ok

@windpiger
Contributor Author

windpiger commented Feb 17, 2017

Updated (tested against Hive 2.0.0, adding EXTERNAL).
Comparison of the spark-master branch and Hive 2.0.0 (partitioned).

Summary:
spark(hive with HiveExternalCatalog) -> all commands below succeed.
spark(parquet with HiveExternalCatalog) and spark(parquet with InMemoryCatalog) behave the same.

1. CREATE TABLE ... PARTITIONED BY ... LOCATION path

a) path exists
      hive(external) -> ok
      spark(hive with HiveExternalCatalog) -> ok
      spark(parquet with HiveExternalCatalog) -> ok
      spark(parquet with InMemoryCatalog) -> ok 
b) path does not exist
      hive(external) -> ok
      spark(hive with HiveExternalCatalog) -> ok
      spark(parquet with HiveExternalCatalog) -> throws exception (path does not exist)
      spark(parquet with InMemoryCatalog) -> throws exception (path does not exist)

2. CREATE TABLE ... PARTITIONED BY ... LOCATION path AS SELECT ...

a) path exists
      hive(external) -> not supported
      spark(hive with HiveExternalCatalog) -> ok
      spark(parquet with HiveExternalCatalog) -> throws exception (path already exists)
      spark(parquet with InMemoryCatalog) -> throws exception (path already exists)
b) path does not exist
      hive(external) -> not supported
      spark(hive with HiveExternalCatalog) -> ok
      spark(parquet with HiveExternalCatalog) -> ok
      spark(parquet with InMemoryCatalog) -> ok

3. ALTER TABLE ... PARTITION(...) ... SET LOCATION path

a) path exists
      hive -> ok
      spark(hive with HiveExternalCatalog) -> ok
      spark(parquet with HiveExternalCatalog) -> ok
      spark(parquet with InMemoryCatalog) -> ok
b) path does not exist
      hive -> ok
      spark(hive with HiveExternalCatalog) -> ok
      spark(parquet with HiveExternalCatalog) -> ok
      spark(parquet with InMemoryCatalog) -> ok

4. CREATE TABLE ... PARTITIONED BY ...

a) default warehouse table path exists
       hive -> ok
       spark(hive with HiveExternalCatalog) -> ok
       spark(parquet with HiveExternalCatalog) -> ok
       spark(parquet with InMemoryCatalog) -> ok
b) default warehouse table path does not exist
       hive -> ok
       spark(hive with HiveExternalCatalog) -> ok
       spark(parquet with HiveExternalCatalog) -> ok
       spark(parquet with InMemoryCatalog) -> ok

5. CREATE TABLE ... PARTITIONED BY ... AS SELECT ...

a) default warehouse table path exists
       hive -> not supported
       spark(hive with HiveExternalCatalog) -> ok
       spark(parquet with HiveExternalCatalog) -> throws exception (path already exists)
       spark(parquet with InMemoryCatalog) -> throws exception (path already exists)
b) default warehouse table path does not exist
       hive -> not supported
       spark(hive with HiveExternalCatalog) -> ok
       spark(parquet with HiveExternalCatalog) -> ok
       spark(parquet with InMemoryCatalog) ->  ok

@gatorsmile
Member

@windpiger Thank you for your efforts! What you did above needs to be written as test cases. Could you do that in a separate PR?

In addition, all the cases you tried are only for Hive serde tables, right?

@gatorsmile
Member

Could you check the behaviors for both data source tables and Hive serde tables? Later, we also need to check the behaviors of InMemoryCatalog for data source tables without enabling Hive support.

@windpiger
Contributor Author

windpiger commented Feb 17, 2017

@gatorsmile
Sorry, I forgot to state that in the tests above, spark means a parquet table with HiveExternalCatalog, and hive means a Hive table in Hive 2.0.0.

I will add Hive serde tables for Spark with HiveExternalCatalog, and parquet tables with InMemoryCatalog, soon.

@windpiger
Contributor Author

windpiger commented Feb 19, 2017

@gatorsmile @cloud-fan @tejasapatil
I have tested all the updated cases above. The results show that:
Spark data source tables behave the same with HiveExternalCatalog and with InMemoryCatalog.
Spark Hive tables pass all the tests above.
There are some differences between Spark Hive tables and Spark data source tables, and we should decide which behavior is expected for each difference.

  1. CREATE TABLE ... (PARTITIONED BY ...) LOCATION path
  a) path does not exist
      hive -> ok
      spark(hive with HiveExternalCatalog) -> ok
      spark(parquet with HiveExternalCatalog) -> throws exception (path does not exist)
      spark(parquet with InMemoryCatalog) -> throws exception (path does not exist)
  2. CREATE TABLE ... (PARTITIONED BY ...) LOCATION path AS SELECT ...
  a) path exists
      hive -> not supported
      spark(hive with HiveExternalCatalog) -> ok
      spark(parquet with HiveExternalCatalog) -> throws exception (path already exists)
      spark(parquet with InMemoryCatalog) -> throws exception (path already exists)
  3. CREATE TABLE ... (PARTITIONED BY ...) AS SELECT ...
  a) default warehouse table path exists
      hive -> not supported
      spark(hive with HiveExternalCatalog) -> ok
      spark(parquet with HiveExternalCatalog) -> throws exception (path already exists)
      spark(parquet with InMemoryCatalog) -> throws exception (path already exists)

@tejasapatil
Contributor

@windpiger :

  • What does throw exception(...) mean? Is the operation supported or not? It might throw an exception while the operation itself still happened.
  • For the 2nd point, you said Hive does not support that. Can you share the error message? I am trying to understand whether there is a reason Hive disallows it that we would also need to think about for Spark.
  • I could not understand what default warehouse table path exists means in the 3rd point.

@cloud-fan
Contributor

CREATE TABLE ... (PARTITIONED BY ...) LOCATION path

I think hive's behavior makes more sense. Users may want to insert data into this table and put the data at a specified location, even if it doesn't exist at the beginning.

CREATE TABLE ...(PARTITIONED BY ...) LOCATION path AS SELECT ...

The same reasoning applies here too.

CREATE TABLE ... (PARTITIONED BY ...) AS SELECT ...

When users don't specify the location, they mostly expect a fresh table, so the table path should not exist.

@windpiger
Contributor Author

@tejasapatil

  • throw exception is the result of the test; it really happens on the current Spark master branch.

  • Hive CTAS is not supported for partitioned tables (hive-doc).

  • default warehouse table path exists means the table's directory under the warehouse path already exists before we create the table.

@windpiger
Contributor Author

@cloud-fan
situation 2, CREATE TABLE ... (PARTITIONED BY ...) LOCATION path AS SELECT ..., behaves differently when the path exists, which is what this PR resolves. Is it OK to make it consistent with hive with HiveExternalCatalog in Spark?

situation 3, CREATE TABLE ... (PARTITIONED BY ...) AS SELECT ..., also behaves differently when the default warehouse table path exists. Do you mean the parquet behavior (throwing an already-exists exception) is the expected one, and Hive tables should be made consistent with it?

@cloud-fan
Contributor

@windpiger yes for both questions.

@gatorsmile
Member

gatorsmile commented Feb 23, 2017

Based on the doc, Hive does not support CTAS when the target table is external.

1. CREATE TABLE ... LOCATION path
2. CREATE TABLE ... LOCATION path AS SELECT ...

When testing the above two cases to check Hive's behavior, you need to change the syntax a little by manually adding EXTERNAL, because Spark SQL is actually creating an external table. This is different from Hive.

1. CREATE EXTERNAL TABLE ... LOCATION path
2. CREATE EXTERNAL TABLE ... LOCATION path AS SELECT ...

Could you update the comparison? Thanks!

@windpiger
Contributor Author

@gatorsmile sorry, I made a mistake there. I have updated the comparison tests above.

@windpiger
Contributor Author

@cloud-fan @gatorsmile @tejasapatil As we discussed above, we have three changes to make:

1. CREATE TABLE ... (PARTITIONED BY ...) LOCATION path

situation: path does not exist

Item                                       Before                                    After
spark(parquet with HiveExternalCatalog)    throws exception (path does not exist)    ok
spark(parquet with InMemoryCatalog)        throws exception (path does not exist)    ok

2. CREATE TABLE ... (PARTITIONED BY ...) LOCATION path AS SELECT ...

situation: path exists

Item                                       Before                                    After
spark(parquet with HiveExternalCatalog)    throws exception (path already exists)    ok
spark(parquet with InMemoryCatalog)        throws exception (path already exists)    ok

3. CREATE TABLE ... (PARTITIONED BY ...) AS SELECT ...

situation: default warehouse table path exists

Item                                       Before    After
spark(hive with HiveExternalCatalog)       ok        throws exception (path already exists)

Please help confirm the actions above. If they are OK, situation 2 is what this PR resolves, and I will open another PR for situations 1 and 3. Thanks~

@gatorsmile
Member

I found you also changed the following cases:
4. CREATE TABLE ....
5. CREATE TABLE ... AS SELECT ...

Actually, they are managed tables, so you do not need to update them. Can you roll back those changes? Thanks!

@gatorsmile
Member

gatorsmile commented Feb 23, 2017

Basically, the rules you proposed can be summarized as below:

  • When users specify the location in CT or CTAS (i.e., creating an external table), we should create a new directory if it does not exist, or overwrite the directory if it already exists.
  • When users do not specify the location in CT or CTAS (i.e., creating a managed table), we should create a new directory if it does not exist. If the directory already exists, we should raise an error.

Is my understanding right?
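The two rules above can be sketched as a single decision (a hypothetical helper for illustration, not Spark's actual code):

```scala
// Proposed rule: an explicit LOCATION (external table) tolerates an
// existing directory; a managed table (default warehouse path) must fail
// if its directory already exists.
def resolveCreatePath(locationSpecified: Boolean, pathExists: Boolean): Either[String, String] =
  (locationSpecified, pathExists) match {
    case (false, true) => Left("path already exists")           // managed, dir present
    case (_, false)    => Right("create directory")             // fresh path
    case (true, true)  => Right("use or overwrite directory")   // external, dir present
  }
```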

@windpiger
Contributor Author

windpiger commented Feb 23, 2017

oh, you are right~ thanks! I have rolled back the changes.

Yes, the summary is consistent with what we discussed.

@gatorsmile
Member

Thank you for your work!

Maybe the last question.

**2. CREATE TABLE ... PARTITIONED BY ... LOCATION path AS SELECT ...**
a) path exists
      hive(external) -> not supported
      spark(hive with HiveExternalCatalog) -> ok
      spark(parquet with HiveExternalCatalog) -> throws exception (path already exists)
      spark(parquet with InMemoryCatalog) -> throws exception (path already exists)

In the above case, you used path exists. I assumed this is the existence of the table directory. Are these behaviors still the same when the specific partition directory exists?

@tejasapatil
Contributor

@windpiger: I realised that you are checking the Hive behavior against Hive 2.0.0, but Spark is expected to support the semantics of Hive 1.2.1:

val hiveExecutionVersion: String = "1.2.1"

I am not up to date with the differences between those two Hive releases with respect to this discussion. Can you confirm whether the observations reported earlier are still valid against Hive 1.2.1?

@SparkQA

SparkQA commented Feb 24, 2017

Test build #73399 has started for PR 16938 at commit 8559e4e.

@SparkQA

SparkQA commented Feb 24, 2017

Test build #73400 has finished for PR 16938 at commit afa1313.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 24, 2017

Test build #73405 has finished for PR 16938 at commit 1f2ce17.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 24, 2017

Test build #73411 has finished for PR 16938 at commit 5a3e5ac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 24, 2017

Test build #73424 has finished for PR 16938 at commit 416ea37.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@windpiger
Contributor Author

@gatorsmile @cloud-fan could you help to review this pr? thanks :)


saveDataIntoTable(
sparkSession, table, table.storage.locationUri, query, mode, tableExists = true)
saveDataIntoTable(sparkSession, table, table.storage.locationUri, query, mode,
Contributor

shall we just pass APPEND mode instead of creating a new overwrite parameter?

Contributor

we should also explain the behavior in the PR description: when path already exists, should we append to it or overwrite it?

Contributor Author

@windpiger windpiger Feb 26, 2017

Whether CreateTableAsSelectCommand overwrites the table path has no relation to the SaveMode of the CTAS. That is, if the table already exists we will overwrite it whatever the SaveMode is, except that for a managed table we should still check whether the path exists (another PR will resolve this). So I think overwrite can't be replaced by SaveMode.

@cloud-fan do I understand right? thanks~

Contributor Author

ping @cloud-fan ~

!exists
case (s, exists) =>
throw new IllegalStateException(s"unsupported save mode $s ($exists)")
if (overwrite) {
Contributor Author

It is simple for InsertIntoHadoopFsRelationCommand: it only needs to care whether it should overwrite the already existing path.
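That simplification can be sketched like this (a hypothetical model, not the real InsertIntoHadoopFsRelationCommand):

```scala
// After the change, the insert step only decides whether to clear an
// existing path before writing; it no longer re-applies SaveMode
// semantics to the path.
def planWrite(pathExists: Boolean, overwrite: Boolean): List[String] =
  (pathExists, overwrite) match {
    case (true, true)  => List("delete existing path", "write data")
    case (true, false) => List("write data")                // keep existing files
    case (false, _)    => List("create path", "write data")
  }
```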

}
val result = saveDataIntoTable(
sparkSession, table, tableLocation, query, mode, tableExists = false)
sparkSession, table, tableLocation, query, mode, overwrite = true, tableExists = false)
Contributor

in this branch the table does not exist, and the expected behavior is to create a directory or overwrite it if it already exists. So we can just pass an Overwrite mode here, and everything should work. Did I miss something?

@SparkQA

SparkQA commented Feb 28, 2017

Test build #73587 has finished for PR 16938 at commit 304ae31.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

s"got: ${allPaths.mkString(", ")}")
}

if (pathExists) {
Contributor

why did we move this check from InsertIntoHadoopFsRelationCommand to here?

Contributor Author

I think InsertIntoHadoopFsRelationCommand should just care about the overwrite logic, like InsertIntoHiveTable: https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala#L80

Contributor

But InsertIntoHadoopFsRelationCommand still needs to check path existence, so I do not quite agree with this refactor.

Anyway, let's revert it and put the refactor in a new PR, to unblock this bug-fix patch.

Contributor Author

ok, thanks~

@SparkQA

SparkQA commented Mar 1, 2017

Test build #73634 has finished for PR 16938 at commit 2498dfd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 1, 2017

Test build #73648 has finished for PR 16938 at commit a8dbcca.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@windpiger
Contributor Author

I am modifying the hacky code

@SparkQA

SparkQA commented Mar 1, 2017

Test build #73680 has finished for PR 16938 at commit d78b7d5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!
