
Conversation

@xwu0226
Contributor

@xwu0226 xwu0226 commented May 14, 2016

What changes were proposed in this pull request?

Symptom

scala> spark.range(1).write.json("/home/xwu0226/spark-test/data/spark-15269")
Datasource.write -> Path: file:/home/xwu0226/spark-test/data/spark-15269

scala> spark.sql("create table spark_15269 using json options(PATH '/home/xwu0226/spark-test/data/spark-15269')")
16/05/11 14:51:00 WARN CreateDataSourceTableUtils: Couldn't find corresponding Hive SerDe for data source provider json. Persisting data source relation `spark_15269` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
going through newSparkSQLSpecificMetastoreTable()
res1: org.apache.spark.sql.DataFrame = []

scala> spark.sql("drop table spark_15269")
res2: org.apache.spark.sql.DataFrame = []

scala> spark.sql("create table spark_15269 using json as select 1 as a")
org.apache.spark.sql.AnalysisException: path file:/user/hive/warehouse/spark_15269 already exists.;
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:88)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:62)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:60)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
...

The 2nd creation of the table fails, complaining that the path already exists.

Root cause:

When the first table is created as an external table with the data source path, but using json, createDataSourceTables considers it a non-Hive-compatible table because json is not a Hive SerDe. Then, newSparkSQLSpecificMetastoreTable is invoked to create the CatalogTable before asking HiveClient to create the metastore table. In this call, locationURI is not set. So when we convert the CatalogTable to a HiveTable before passing it to the Hive metastore, the Hive table's data location is not set. The Hive metastore then implicitly creates a data location as <database location>/tableName, which is file:/user/hive/warehouse/spark_15269 in the above case.

When dropping the table, Hive does not delete this implicitly created path because the table is external.

When we create the 2nd table with SELECT and without a path, the table is created as a managed table and is given a default path in the options, as follows:

val optionsWithPath =
      if (!new CaseInsensitiveMap(options).contains("path")) {
        isExternal = false
        options + ("path" -> sessionState.catalog.defaultTablePath(tableIdent))
      } else {
        options
      }

This default path happens to be Hive's warehouse directory plus the table name, which is the same as the path the Hive metastore implicitly created earlier for the 1st table. So when InsertIntoHadoopFsRelation tries to write the provided data to this data source table, it complains that the path already exists, since the SaveMode is SaveMode.ErrorIfExists.
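
To make the collision concrete, here is a small illustrative sketch (values are hypothetical, not actual Spark source):

// Illustration only: why the two paths collide.
val warehouseDir = "file:/user/hive/warehouse"

// 1st table: external, non-Hive-compatible, locationURI left unset, so the Hive
// metastore implicitly creates <database location>/<tableName> and keeps it on DROP.
val implicitLocation = s"$warehouseDir/spark_15269"

// 2nd table: managed CTAS without a path, so defaultTablePath also resolves to
// the warehouse directory plus the table name, i.e. the very same directory.
val defaultPath = s"$warehouseDir/spark_15269"

// InsertIntoHadoopFsRelation then writes with SaveMode.ErrorIfExists and fails,
// because the leftover directory from the 1st table already exists.
assert(implicitLocation == defaultPath)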

Solution:

When creating an external data source table that is non-Hive-compatible, make sure we set the provided path on CatalogTable.storage.locationURI, so that the Hive metastore does not implicitly create a data location for the table.
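
A minimal sketch of the idea (names like tableDesc and isExternal are placeholders here, not the exact code in CreateDataSourceTableUtils):

// Sketch only: for an external, non-Hive-compatible data source table, carry the
// user-supplied path into the CatalogTable so the metastore does not invent
// <database location>/<tableName> on its own.
val pathOption = new CaseInsensitiveMap(options).get("path")

val storageWithLocation =
  if (isExternal) {
    // external table: use the provided path as the location
    tableDesc.storage.copy(locationUri = pathOption)
  } else {
    // managed table: leave it to the catalog's default table path
    tableDesc.storage
  }

val tableWithLocation = tableDesc.copy(storage = storageWithLocation)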

How was this patch tested?

A test case is added, and the regression tests were run.

@xwu0226
Contributor Author

xwu0226 commented May 14, 2016

cc @liancheng @yhuai @gatorsmile Thanks!

@AmplabJenkins

Can one of the admins verify this patch?

Member

In this case, locationUri is still None. Does that mean we still let Hive generate the path?

Contributor Author

The Hive metastore will generate the path for an internal table too, but when the table is dropped, that path is also deleted by Hive.

Contributor

@liancheng liancheng May 18, 2016

@gatorsmile The else branch is for managed tables.

Member

Got it, thanks!

@xwu0226
Contributor Author

xwu0226 commented May 18, 2016

@yhuai @liancheng I updated the code and also did some manual tests creating a table with a real HDFS path on one of my clusters. For example:
scala> spark.sql("create table json_t3 (c1 int) using json options (path 'hdfs://bdavm009.svl.ibm.com:8020/tmp/json_t3')")
and my HDFS environment shows:

hdfs dfs -ls /tmp
drwxrwxrwx   - xwu0226   hdfs          0 2016-05-18 14:52 /tmp/json_t3 

Then, I created another table with a file under the previously created data path:

scala> spark.sql("create table json_t4 (c1 int) using json options (path 'hdfs://bdavm009.svl.ibm.com:8020/tmp/json_t3/part-r-00003-8382e0e2-8518-48df-82c8-b6c84ab03c45.json')")

scala> spark.sql("select * from json_t4").show
16/05/18 14:59:50 WARN DataSource: Error while looking for metadata directory.
+---+
| c1|
+---+
|  1|
+---+

Please take a look at the change again! Thank you very much!

}
} else {
None
},
Contributor

The following line should be enough for locationUri:

locationUri = new CaseInsensitiveMap(options).get("path")

Consider the following directory layout containing two Parquet files:

/tmp/dir/
  part-00001.parquet
  part-00002.parquet

If we pass "/tmp/dir/part-00001.parquet" as the file path, the logic above will use "/tmp/dir/" as the locationUri, and thus "part-00002.parquet" is also included, which is not the expected behavior.
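
A tiny sketch of that concern, assuming the fix takes the parent directory of a file path (using Hadoop's Path API):

import org.apache.hadoop.fs.Path

// If a single file is given as the data source path...
val filePath = new Path("/tmp/dir/part-00001.parquet")

// ...using its parent as the table location widens the scope to the whole directory,
// so a scan rooted there would also pick up part-00002.parquet.
val parentAsLocation = filePath.getParent   // /tmp/dir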

Contributor Author

@liancheng Thanks! I tried this before, but Hive complained that the path is either not a directory or it cannot create one at that path. This is the reason it failed the test cases in MetastoreDataSourcesSuite wherever we create a data source (non-Hive-compatible) table with an exact file name. Example:

[info] - CTAS a managed table *** FAILED *** (365 milliseconds)
[info]   org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:file:/home/xwu0226/spark/sql/hive/target/scala-2.11/test-classes/sample.json is not a directory or unable to create one);

I also tried in the Hive shell:

hive> create external table t_txt1 (c1 int) location '/tmp/test1.txt';
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt is not a directory or unable to create one)

So it seems Hive only accepts a directory as a table location. In our case, we need to give Hive a directory via locationURI.

Regarding your concern about a directory containing multiple files: in this case we are on the non-Hive-compatible code path, so do we still expect consistency between Hive and Spark SQL? Querying from Spark SQL will return the expected results, while the results from Hive will be different. But the current behavior of non-Hive-compatible tables is already like this.

Contributor

Hm... Then I think we probably should save the path as a SerDe property (similar to the schema of persisted data source tables). @yhuai What do you think? It breaks existing functionality if we can't read individual files.

Contributor Author

@xwu0226 xwu0226 May 21, 2016

The full path is already populated in the SerDe properties together with the options (see the linked lines). The SELECT will still work from Spark SQL because HiveMetastoreCatalog.lookupRelation uses cachedDataSourceTables, which loads the persisted data source table with options = CatalogTable.storage.serdeProperties, and DataSource.resolveRelation takes options.get("path") as the relation's file location.
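
Roughly what happens at read time, as a simplified sketch (not the exact lookupRelation code):

// Sketch only: the persisted serde properties are turned back into data source options,
// and the relation's file location is recovered from the "path" option, independently
// of whatever location the Hive metastore recorded for the table.
val options: Map[String, String] = catalogTable.storage.serdeProperties
val dataPath: Option[String] = options.get("path")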

Contributor

(For future reference, the above comment is replied to here.)

asfgit pushed a commit that referenced this pull request Jun 1, 2016
… while creating external Spark SQL data source tables.

This PR is an alternative to #13120 authored by xwu0226.

## What changes were proposed in this pull request?

When creating an external Spark SQL data source table and persisting its metadata to Hive metastore, we don't use the standard Hive `Table.dataLocation` field because Hive only allows directory paths as data locations while Spark SQL also allows file paths. However, if we don't set `Table.dataLocation`, Hive always creates an unexpected empty table directory under database location, but doesn't remove it while dropping the table (because the table is external).

This PR works around this issue by explicitly setting `Table.dataLocation` and then manually removing the created directory after creating the external table.

Please refer to [this JIRA comment][1] for more details about why we chose this approach as a workaround.

[1]: https://issues.apache.org/jira/browse/SPARK-15269?focusedCommentId=15297408&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15297408

## How was this patch tested?

1. A new test case is added in `HiveQuerySuite` for this case
2. Updated `ShowCreateTableSuite` to use the same table name in all test cases. (This is how I hit this issue in the first place.)

Author: Cheng Lian <lian@databricks.com>

Closes #13270 from liancheng/spark-15269-unpleasant-fix.

(cherry picked from commit 7bb64aa)
Signed-off-by: Cheng Lian <lian@databricks.com>
@xwu0226
Contributor Author

xwu0226 commented Jun 2, 2016

Closed as #13270 resolves the issue.

@xwu0226 xwu0226 closed this Jun 2, 2016