[SPARK-15269][SQL] Set provided path to CatalogTable.storage.locationURI when creating external non-hive compatible table #13120
Conversation
cc @liancheng @yhuai @gatorsmile Thanks!

Can one of the admins verify this patch?
In this case, `locationUri` is still `None`. Does that mean we still let Hive generate the path?
The Hive metastore will generate the path for an internal table too, but when the table is dropped, the path will also be deleted by Hive.
@gatorsmile The `else` branch is for managed tables.
Got it, thanks!
…lowing up PR12974
Revert "…ause following up PR12974". This reverts commit 98a1f804d7343ba77731f9aa400c00f1a26c03fe.
@yhuai @liancheng I updated the code and also did some manual tests creating a table with a real HDFS path on one of my clusters, and then creating another table with the previously created data path. Please take a look at the change again! Thank you very much!
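A hypothetical reconstruction of those manual tests (the original snippets are not shown above; the table names, schema, and HDFS path are invented, and a SparkSession named `spark` is assumed):

```scala
// First table: external data source table backed by a real HDFS path.
spark.sql(
  """CREATE TABLE t1 (c1 INT)
    |USING parquet
    |OPTIONS (path 'hdfs://cluster/tmp/spark15269/data')""".stripMargin)

// Second table: points at the data path created by the first table.
spark.sql(
  """CREATE TABLE t2 (c1 INT)
    |USING parquet
    |OPTIONS (path 'hdfs://cluster/tmp/spark15269/data')""".stripMargin)
```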
```scala
  }
} else {
  None
},
```
The following line should be enough for `locationUri`:

```scala
locationUri = new CaseInsensitiveMap(options).get("path")
```

Consider the following directory layout containing two Parquet files:

```
/tmp/dir/
  part-00001.parquet
  part-00002.parquet
```

If we pass "/tmp/dir/part-00001.parquet" as the file path, the logic above will use "/tmp/dir/" as the `locationUri`, and thus "part-00002.parquet" will also be included, which is not the expected behavior.
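A small sketch of this concern using Hadoop's `Path` API (the parent-directory lookup here illustrates the behavior being criticized; it is not a quote of the patch):

```scala
import org.apache.hadoop.fs.Path

// If a single file is passed as the table path and its parent is used
// as the table location, the location silently widens to the whole
// directory.
val provided = new Path("/tmp/dir/part-00001.parquet")
val inferred = provided.getParent  // Path("/tmp/dir")
// A table located at /tmp/dir would also pick up part-00002.parquet.
```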
@liancheng Thanks! I tried this before, but Hive complained that the path is either not a directory or that it cannot create one at that path. This is why it failed the test cases in MetastoreDataSourcesSuite wherever we create a data source (non-Hive-compatible) table with an exact file name. Example:

```
[info] - CTAS a managed table *** FAILED *** (365 milliseconds)
[info] org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:file:/home/xwu0226/spark/sql/hive/target/scala-2.11/test-classes/sample.json is not a directory or unable to create one);
```

I also tried in the Hive shell:

```
hive> create external table t_txt1 (c1 int) location '/tmp/test1.txt';
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt is not a directory or unable to create one)
```

So it seems Hive only accepts a directory as a table location. In our case, we need to give Hive a directory via `locationURI`.

As for your concern about a directory containing multiple files: in that case we are in the non-Hive-compatible code path, so do we still expect consistency between Hive and Spark SQL? Querying from Spark SQL will return the expected results, while the results from Hive will differ. But the current behavior of non-Hive-compatible tables is already like this.
Hm... Then I think we should probably save the path as a SerDe property (similar to the schema of persisted data source tables). @yhuai What do you think? It breaks existing functionality if we can't read individual files.
The full path is already populated in the SerDe properties together with the other options; see the linked lines. A SELECT will still work from Spark SQL because `HiveMetastoreCatalog.lookupRelation` uses `cachedDataSourceTables`, which loads the persisted data source table with `options = CatalogTable.storage.serdeProperties`, and `DataSource.resolveRelation` takes `options.get("path")` as the relation's file location.
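A simplified sketch of that round trip (types and field names are pared down to the essentials; this is illustrative, not Spark's actual code):

```scala
// At CREATE TABLE time, the data source options (including "path")
// are persisted as SerDe properties of the metastore table.
case class CatalogStorageFormat(serdeProperties: Map[String, String])
case class CatalogTable(storage: CatalogStorageFormat)

val options = Map("path" -> "hdfs://ns/warehouse/t", "someOption" -> "x")
val persisted = CatalogTable(CatalogStorageFormat(serdeProperties = options))

// At lookup time, the persisted SerDe properties become the options
// again, and the relation's file location is read back from "path".
val restored = persisted.storage.serdeProperties
val location = restored.get("path")  // Some("hdfs://ns/warehouse/t")
```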
(For future reference, the above comment is replied to here.)
… while creating external Spark SQL data source tables. This PR is an alternative to #13120 authored by xwu0226.

## What changes were proposed in this pull request?

When creating an external Spark SQL data source table and persisting its metadata to the Hive metastore, we don't use the standard Hive `Table.dataLocation` field, because Hive only allows directory paths as data locations while Spark SQL also allows file paths. However, if we don't set `Table.dataLocation`, Hive always creates an unexpected empty table directory under the database location, but doesn't remove it while dropping the table (because the table is external). This PR works around this issue by explicitly setting `Table.dataLocation` and then manually removing the created directory after creating the external table. Please refer to [this JIRA comment][1] for more details about why we chose this approach as a workaround.

[1]: https://issues.apache.org/jira/browse/SPARK-15269?focusedCommentId=15297408&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15297408

## How was this patch tested?

1. A new test case is added in `HiveQuerySuite` for this case.
2. Updated `ShowCreateTableSuite` to use the same table name in all test cases. (This is how I hit this issue in the first place.)

Author: Cheng Lian <lian@databricks.com>

Closes #13270 from liancheng/spark-15269-unpleasant-fix.

(cherry picked from commit 7bb64aa)
Signed-off-by: Cheng Lian <lian@databricks.com>
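A minimal sketch of that workaround (the `createHiveTable` stub stands in for the real metastore call; names and flow are illustrative, not the code from #13270):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Stub standing in for the real "create external table" metastore
// call; in #13270 this goes through the Hive client with
// Table.dataLocation set explicitly.
def createHiveTable(dataLocation: Path): Unit = ()

def createExternalTableWithCleanup(placeholder: Path): Unit = {
  // Give Hive an explicit directory so it doesn't invent one under
  // the database location; the real (possibly file) path is kept in
  // the SerDe properties instead.
  createHiveTable(dataLocation = placeholder)
  // DROP TABLE never removes an external table's directory, so delete
  // the empty placeholder right after the table is created.
  val fs = placeholder.getFileSystem(new Configuration())
  fs.delete(placeholder, true)
}
```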
Closed as #13270 resolves the issue.
What changes were proposed in this pull request?
Symptom
The second creation of the table fails, complaining that the path already exists. A hypothetical repro is sketched below.
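The exact statements from the original description are not shown above; in the sketch below the JSON file path is invented and a SparkSession named `spark` is assumed:

```scala
// 1st table: external, and json is not a Hive SerDe, so the table is
// persisted as a non-Hive-compatible data source table.
spark.sql(
  """CREATE TABLE spark_15269
    |USING json
    |OPTIONS (path '/tmp/people.json')""".stripMargin)
spark.sql("DROP TABLE spark_15269")
// Hive leaves file:/user/hive/warehouse/spark_15269 behind.

// 2nd table: managed CTAS without a path; its default location is the
// leftover directory above, so the write fails with "path exists".
spark.sql("CREATE TABLE spark_15269 USING json AS SELECT 1 AS c1")
```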
Root cause:
When the first table is created as an external table with the data source path but as JSON, `createDataSourceTables` considers it a non-Hive-compatible table because `json` is not a Hive SerDe. Then `newSparkSQLSpecificMetastoreTable` is invoked to create the `CatalogTable` before asking `HiveClient` to create the metastore table. In this call, `locationURI` is not set. So when we convert the `CatalogTable` to a `HiveTable` before passing it to the Hive metastore, the Hive table's data location is not set. The Hive metastore then implicitly creates a data location of the form `<warehouse dir>/tableName`, which is `file:/user/hive/warehouse/spark_15269` in the above case. When the table is dropped, Hive does not delete this implicitly created path because the table is external.
When we create the second table with a SELECT and without a path, the table is created as a managed table and is given a default path in the options.
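A sketch of what those options look like, using the warehouse path from the example above (illustrative; not the literal snippet from the description):

```scala
// The default path is <warehouse dir>/<table name>, recorded in the
// data source options just like a user-provided path would be.
val options = Map("path" -> "file:/user/hive/warehouse/spark_15269")
```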
This default path happens to be Hive's warehouse directory plus the table name, which is the same path the Hive metastore implicitly created earlier for the first table. So when `InsertIntoHadoopFsRelation` tries to write the provided data to this data source table, it complains that the path already exists, since the SaveMode is `SaveMode.ErrorIfExists`.

Solution:
When creating an external data source table that is non-Hive compatible, make sure we set the provided path to `CatalogTable.storage.locationURI`, so that the Hive metastore does not implicitly create a data location for the table. A sketch of this change is shown below.
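A minimal sketch of the proposed fix, consistent with the diff context and review discussion above (the function and parameter names are invented, and `CaseInsensitiveMap` is replaced by an explicit case-insensitive lookup):

```scala
// Only external, non-Hive-compatible tables get an explicit location;
// managed tables keep letting Hive pick the default warehouse path.
def locationUriFor(
    isExternal: Boolean,
    options: Map[String, String]): Option[String] = {
  if (isExternal) {
    // Use the user-provided path so the Hive metastore does not
    // implicitly create (and later orphan) a directory of its own.
    options.collectFirst { case (k, v) if k.equalsIgnoreCase("path") => v }
  } else {
    None
  }
}
```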
How was this patch tested?

A test case is added, and the regression tests were run.