
Conversation

@xwu0226
Contributor

@xwu0226 xwu0226 commented May 14, 2016

What changes were proposed in this pull request?

Symptom

scala> spark.range(1).write.json("/home/xwu0226/spark-test/data/spark-15269")
Datasource.write -> Path: file:/home/xwu0226/spark-test/data/spark-15269

scala> spark.sql("create table spark_15269 using json options(PATH '/home/xwu0226/spark-test/data/spark-15269')")
16/05/11 14:51:00 WARN CreateDataSourceTableUtils: Couldn't find corresponding Hive SerDe for data source provider json. Persisting data source relation `spark_15269` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
going through newSparkSQLSpecificMetastoreTable()
res1: org.apache.spark.sql.DataFrame = []

scala> spark.sql("drop table spark_15269")
res2: org.apache.spark.sql.DataFrame = []

scala> spark.sql("create table spark_15269 using json as select 1 as a")
org.apache.spark.sql.AnalysisException: path file:/user/hive/warehouse/spark_15269 already exists.;
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:88)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:62)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:60)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
...

The 2nd creation of the table fails, complaining that the path already exists.

Root cause:

When the first table is created as an external table with the data source path, but using json, createDataSourceTables considers it a non-Hive-compatible table because json is not a Hive SerDe. Then, newSparkSQLSpecificMetastoreTable is invoked to create the CatalogTable before asking HiveClient to create the metastore table. In this call, locationURI is not set. So when we convert the CatalogTable to a HiveTable before passing it to the Hive metastore, the Hive table's data location is not set. The Hive metastore then implicitly creates a data location as <database location>/tableName, which is file:/user/hive/warehouse/spark_15269 in the above case.

When dropping the table, Hive does not delete this implicitly created path because the table is external.

When we create the 2nd table with SELECT and without a path, the table is created as a managed table and is given a default path in the options, as follows:

val optionsWithPath =
      if (!new CaseInsensitiveMap(options).contains("path")) {
        isExternal = false
        options + ("path" -> sessionState.catalog.defaultTablePath(tableIdent))
      } else {
        options
      }

This default path happens to be Hive's warehouse directory plus the table name, which is the same as the path the Hive metastore implicitly created earlier for the 1st table. So when InsertIntoHadoopFsRelation tries to write the provided data to this data source table, it complains that the path already exists, since the SaveMode is SaveMode.ErrorIfExists.
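
To make the collision concrete, here is a small illustrative sketch (values are hypothetical, not actual Spark source):

// Illustration only: why the two paths collide.
val warehouseDir = "file:/user/hive/warehouse"

// 1st table: external, non-Hive-compatible, locationURI left unset, so the Hive
// metastore implicitly creates <database location>/<tableName> and keeps it on DROP.
val implicitLocation = s"$warehouseDir/spark_15269"

// 2nd table: managed CTAS without a path, so defaultTablePath also resolves to
// the warehouse directory plus the table name, i.e. the very same directory.
val defaultPath = s"$warehouseDir/spark_15269"

// InsertIntoHadoopFsRelation then writes with SaveMode.ErrorIfExists and fails,
// because the leftover directory from the 1st table already exists.
assert(implicitLocation == defaultPath)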

Solution:

When creating an external data source table that is non-Hive-compatible, make sure we set the provided path on CatalogTable.storage.locationURI, so that the Hive metastore does not implicitly create a data location for the table.
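
A minimal sketch of the idea (names like tableDesc and isExternal are placeholders here, not the exact code in CreateDataSourceTableUtils):

// Sketch only: for an external, non-Hive-compatible data source table, carry the
// user-supplied path into the CatalogTable so the metastore does not invent
// <database location>/<tableName> on its own.
val pathOption = new CaseInsensitiveMap(options).get("path")

val storageWithLocation =
  if (isExternal) {
    // external table: use the provided path as the location
    tableDesc.storage.copy(locationUri = pathOption)
  } else {
    // managed table: leave it to the catalog's default table path
    tableDesc.storage
  }

val tableWithLocation = tableDesc.copy(storage = storageWithLocation)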

How was this patch tested?

A test case is added, and the regression tests were run.

@xwu0226
Contributor Author

xwu0226 commented May 14, 2016

cc @liancheng @yhuai @gatorsmile Thanks!

@AmplabJenkins

Can one of the admins verify this patch?

Member

In this case, locationUri is still None. Does that mean we still let Hive generate the path?

Contributor Author

The Hive metastore will generate the path for an internal table too, but when the table is dropped, that path is also deleted by Hive.

Contributor

@liancheng liancheng May 18, 2016

@gatorsmile The else branch is for managed tables.

Member

Got it, thanks!

@xwu0226
Contributor Author

xwu0226 commented May 18, 2016

@yhuai @liancheng I updated the code and also did some manual tests creating a table with a real HDFS path on one of my clusters. For example:
scala> spark.sql("create table json_t3 (c1 int) using json options (path 'hdfs://bdavm009.svl.ibm.com:8020/tmp/json_t3')")
and my HDFS environment shows:

hdfs dfs -ls /tmp
drwxrwxrwx   - xwu0226   hdfs          0 2016-05-18 14:52 /tmp/json_t3 

Then, I created another table with a file under the previously created data path:

scala> spark.sql("create table json_t4 (c1 int) using json options (path 'hdfs://bdavm009.svl.ibm.com:8020/tmp/json_t3/part-r-00003-8382e0e2-8518-48df-82c8-b6c84ab03c45.json')")

scala> spark.sql("select * from json_t4").show
16/05/18 14:59:50 WARN DataSource: Error while looking for metadata directory.
+---+
| c1|
+---+
|  1|
+---+

Please take a look at the change again! Thank you very much!

}
} else {
None
},
Contributor

The following line should be enough for locationUri:

locationUri = new CaseInsensitiveMap(options).get("path")

Consider the following directory layout containing two Parquet files:

/tmp/dir/
  part-00001.parquet
  part-00002.parquet

If we pass "/tmp/dir/part-00001.parquet" as the file path, the logic above will use "/tmp/dir/" as the locationUri, and thus "part-00002.parquet" is also included, which is not the expected behavior.
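
A tiny sketch of that concern, assuming the fix takes the parent directory of a file path (using Hadoop's Path API):

import org.apache.hadoop.fs.Path

// If a single file is given as the data source path...
val filePath = new Path("/tmp/dir/part-00001.parquet")

// ...using its parent as the table location widens the scope to the whole directory,
// so a scan rooted there would also pick up part-00002.parquet.
val parentAsLocation = filePath.getParent   // /tmp/dir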

Contributor Author

@liancheng Thanks! I tried this before, but Hive complained that the path is either not a directory or it cannot create one at that path. This is the reason it failed the test cases in MetastoreDataSourcesSuite wherever we create a data source (non-Hive-compatible) table with an exact file name. Example:

[info] - CTAS a managed table *** FAILED *** (365 milliseconds)
[info]   org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:file:/home/xwu0226/spark/sql/hive/target/scala-2.11/test-classes/sample.json is not a directory or unable to create one);

I also tried in the Hive shell:

hive> create external table t_txt1 (c1 int) location '/tmp/test1.txt';
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt is not a directory or unable to create one)

So it seems Hive only accepts a directory as a table location. In our case, we need to give Hive a directory via locationURI.

Regarding your concern about a directory containing multiple files: in this case we are on the non-Hive-compatible code path, so do we still expect consistency between Hive and Spark SQL? Querying from Spark SQL will return the expected results, while the results from Hive will be different. But the current behavior of non-Hive-compatible tables is already like this.

Contributor

Hm... Then I think we probably should save the path as a SerDe property (similar to the schema of persisted data source tables). @yhuai What do you think? It breaks existing functionality if we can't read individual files.

Contributor Author

@xwu0226 xwu0226 May 21, 2016

The full path is already populated in the SerDe properties together with the options (see the linked lines). The SELECT will still work from Spark SQL because HiveMetastoreCatalog.lookupRelation uses cachedDataSourceTables, which loads the persisted data source table with options = CatalogTable.storage.serdeProperties, and DataSource.resolveRelation takes options.get("path") as the relation's file location.
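
Roughly what happens at read time, as a simplified sketch (not the exact lookupRelation code):

// Sketch only: the persisted serde properties are turned back into data source options,
// and the relation's file location is recovered from the "path" option, independently
// of whatever location the Hive metastore recorded for the table.
val options: Map[String, String] = catalogTable.storage.serdeProperties
val dataPath: Option[String] = options.get("path")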

Contributor

(For future reference, the above comment is replied to here.)

asfgit pushed a commit that referenced this pull request Jun 1, 2016
… while creating external Spark SQL data source tables.

This PR is an alternative to #13120 authored by xwu0226.

## What changes were proposed in this pull request?

When creating an external Spark SQL data source table and persisting its metadata to Hive metastore, we don't use the standard Hive `Table.dataLocation` field because Hive only allows directory paths as data locations while Spark SQL also allows file paths. However, if we don't set `Table.dataLocation`, Hive always creates an unexpected empty table directory under database location, but doesn't remove it while dropping the table (because the table is external).

This PR works around this issue by explicitly setting `Table.dataLocation` and then manually removing the created directory after creating the external table.

Please refer to [this JIRA comment][1] for more details about why we chose this approach as a workaround.

[1]: https://issues.apache.org/jira/browse/SPARK-15269?focusedCommentId=15297408&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15297408

## How was this patch tested?

1. A new test case is added in `HiveQuerySuite` for this case
2. Updated `ShowCreateTableSuite` to use the same table name in all test cases. (This is how I hit this issue in the first place.)

Author: Cheng Lian <lian@databricks.com>

Closes #13270 from liancheng/spark-15269-unpleasant-fix.

(cherry picked from commit 7bb64aa)
Signed-off-by: Cheng Lian <lian@databricks.com>
@xwu0226
Contributor Author

xwu0226 commented Jun 2, 2016

Closed as #13270 resolves the issue.

@xwu0226 xwu0226 closed this Jun 2, 2016