-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-17990][SPARK-18302][SQL] correct several partition related behaviours of ExternalCatalog #15797
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method just copies the code in the above if branch: https://github.com/apache/spark/pull/15797/files#diff-159191585e10542f013cb3a714f26075R199
|
cc @mallman this should fix https://issues.apache.org/jira/browse/SPARK-17990 |
|
Test build #68281 has finished for PR 15797 at commit
|
|
@cloud-fan Thanks for taking this on. I will test out a build from your branch to verify that it fixes the problem I was seeing. |
|
I tried your patch and it works for me. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to escape these, as in PartitioningUtils.escapePathName?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we pull this into PartitioningUtils?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this apply to external tables as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, Hive only update the partition location when renaming partitions for managed table.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if it had a custom location?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RENAME PARTITION can only change the partition spec, not partition path. The SQL syntax is :ALTER TABLE table_name PARTITION partition_spec RENAME TO PARTITION partition_spec;, users can't specify partition location
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the partition to be renamed has custom location, we will move the directory too.
Actually there is no such a concept called custom location for partitions. Partitions always have a location, if users didn't specify, we will generate a default one. The generated one and specified one are treated equally.
|
|
||
| import org.apache.spark.sql.catalyst.catalog.CatalogTypes.TablePartitionSpec | ||
|
|
||
| object ExternalCatalogUtils { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ericl PartitioningUtils is not accessible in catalyst module, so I created this.
|
Test build #68309 has finished for PR 15797 at commit
|
|
retest this please |
|
Test build #68314 has finished for PR 15797 at commit
|
|
Test build #68322 has finished for PR 15797 at commit
|
|
retest this please |
| builder.toString() | ||
| } | ||
|
|
||
| def unescapePathName(path: String): String = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If escapePathName is moved, I think we also need to move unescapePathName to the same file?
| val catalog = newBasicCatalog() | ||
| val db = catalog.getDatabase("db1") | ||
| val table = CatalogTable( | ||
| identifier = TableIdentifier("my_table", Some("db1")), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we keep the original names unchanged? Now, the table/database names allow alphanumeric characters and underscores. Thus, I think they are named using underscore on purpose?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually this test was written by me, and I picked myTable at the first time, but changed to my_table because of this bug.
| throw new AnalysisException(s"table $identifier did not specify database") | ||
| } | ||
|
|
||
| def location: String = storage.locationUri.getOrElse { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like the other functions in this file, add the function descriptions too?
|
Test build #68329 has finished for PR 15797 at commit
|
|
Test build #68336 has finished for PR 15797 at commit
|
|
Test build #68343 has finished for PR 15797 at commit
|
|
Test build #68376 has finished for PR 15797 at commit
|
| val fs = tablePath.getFileSystem(hadoopConf) | ||
| val newParts = newSpecs.map { spec => | ||
| val partition = client.getPartition(db, table, lowerCasePartitionSpec(spec)) | ||
| val wrongPath = new Path(partition.storage.locationUri.get) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about creating a function location for partitions too, just like what this PR did for table's location?
| fs.rename(wrongPath, rightPath) | ||
| } catch { | ||
| case e: IOException => | ||
| throw new SparkException(s"Unable to rename partition path $wrongPath", e) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we also need to add rightPath here
|
Test build #68392 has finished for PR 15797 at commit
|
|
I just tested this PR with #15814 modified with mixed case partition names in unit tests, and it fixes the issues there too. I'll leave it to @gatorsmile to lgtm since he has a few more comments. |
|
Test build #68435 has finished for PR 15797 at commit
|
| val dir = new Path(tableMeta.location) | ||
| try { | ||
| val fs = dir.getFileSystem(hadoopConfig) | ||
| fs.delete(dir, true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the deletion is recursive and thus it can delete the lcation/directory of a partitioned table. I checked the implementation of delete. It sounds like it does not guaranttee the atomicity. The deletion could be partial. We might need to also put a TODO here, like what we did for dropPartition
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's minor and not related to this PR, I'd like to not touch it.
|
In More generally, if we insert new data into a partitioned table, Hive will create a new directory for non-existent partitions. These newly created file path are not case-preserving, right? |
| val hasUpperCasePartitionColumn = partitionColumnNames.exists(col => col.toLowerCase != col) | ||
| if (tableMeta.tableType == MANAGED && hasUpperCasePartitionColumn) { | ||
| val tablePath = new Path(tableMeta.location) | ||
| val fs = tablePath.getFileSystem(hadoopConf) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the other functions, the function call getFileSystem is also part of the try-catch block. Do we need to make it consistent with others?
This PR tries its best to fix some partition related behaviours, I think it's OK to leave some hard works for follow-ups. |
|
Test build #68461 has finished for PR 15797 at commit
|
|
LGTM |
|
Merging in master/branch-2.1. |
…aviours of ExternalCatalog ## What changes were proposed in this pull request? This PR corrects several partition related behaviors of `ExternalCatalog`: 1. default partition location should not always lower case the partition column names in path string(fix `HiveExternalCatalog`) 2. rename partition should not always lower case the partition column names in updated partition path string(fix `HiveExternalCatalog`) 3. rename partition should update the partition location only for managed table(fix `InMemoryCatalog`) 4. create partition with existing directory should be fine(fix `InMemoryCatalog`) 5. create partition with non-existing directory should create that directory(fix `InMemoryCatalog`) 6. drop partition from external table should not delete the directory(fix `InMemoryCatalog`) ## How was this patch tested? new tests in `ExternalCatalogSuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #15797 from cloud-fan/partition. (cherry picked from commit 2f7461f) Signed-off-by: Reynold Xin <rxin@databricks.com>
…aviours of ExternalCatalog ## What changes were proposed in this pull request? This PR corrects several partition related behaviors of `ExternalCatalog`: 1. default partition location should not always lower case the partition column names in path string(fix `HiveExternalCatalog`) 2. rename partition should not always lower case the partition column names in updated partition path string(fix `HiveExternalCatalog`) 3. rename partition should update the partition location only for managed table(fix `InMemoryCatalog`) 4. create partition with existing directory should be fine(fix `InMemoryCatalog`) 5. create partition with non-existing directory should create that directory(fix `InMemoryCatalog`) 6. drop partition from external table should not delete the directory(fix `InMemoryCatalog`) ## How was this patch tested? new tests in `ExternalCatalogSuite` Author: Wenchen Fan <wenchen@databricks.com> Closes apache#15797 from cloud-fan/partition.
| val rightPath = ExternalCatalogUtils.generatePartitionPath( | ||
| spec, partitionColumnNames, tablePath) | ||
| try { | ||
| tablePath.getFileSystem(hadoopConf).rename(wrongPath, rightPath) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Found an issue here... When we call rename, not all the file systems have the same behaviors. For example, on mac OS, when we doing this .../tbl/a=5/b=6 -> .../tbl/A=5/B=6 . The result is .../tbl/a=5/B=6. Thus, it is not recursive. However, the file system used in Jenkin does not have such an issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch! let me think about how to fix it.
What changes were proposed in this pull request?
This PR corrects several partition related behaviors of
ExternalCatalog:HiveExternalCatalog)HiveExternalCatalog)InMemoryCatalog)InMemoryCatalog)InMemoryCatalog)InMemoryCatalog)How was this patch tested?
new tests in
ExternalCatalogSuite