Skip to content

Conversation

@cloud-fan
Copy link
Contributor

@cloud-fan cloud-fan commented Nov 7, 2016

What changes were proposed in this pull request?

This PR corrects several partition related behaviors of ExternalCatalog:

  1. default partition location should not always lower case the partition column names in path string(fix HiveExternalCatalog)
  2. rename partition should not always lower case the partition column names in updated partition path string(fix HiveExternalCatalog)
  3. rename partition should update the partition location only for managed table(fix InMemoryCatalog)
  4. create partition with existing directory should be fine(fix InMemoryCatalog)
  5. create partition with non-existing directory should create that directory(fix InMemoryCatalog)
  6. drop partition from external table should not delete the directory(fix InMemoryCatalog)

How was this patch tested?

new tests in ExternalCatalogSuite

@cloud-fan cloud-fan changed the title correct several partition related behaviours of ExternalCatalog [SPARK-17990][SPARK-18302][SQL] correct several partition related behaviours of ExternalCatalog Nov 7, 2016
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan
Copy link
Contributor Author

cloud-fan commented Nov 7, 2016

@SparkQA
Copy link

SparkQA commented Nov 7, 2016

Test build #68281 has finished for PR 15797 at commit e0e012c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mallman
Copy link
Contributor

mallman commented Nov 7, 2016

@cloud-fan Thanks for taking this on. I will test out a build from your branch to verify that it fixes the problem I was seeing.

@mallman
Copy link
Contributor

mallman commented Nov 7, 2016

I tried your patch and it works for me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to escape these, as in PartitioningUtils.escapePathName?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we pull this into PartitioningUtils?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this apply to external tables as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, Hive only update the partition location when renaming partitions for managed table.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if it had a custom location?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RENAME PARTITION can only change the partition spec, not partition path. The SQL syntax is :ALTER TABLE table_name PARTITION partition_spec RENAME TO PARTITION partition_spec;, users can't specify partition location

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the partition to be renamed has custom location, we will move the directory too.

Actually there is no such a concept called custom location for partitions. Partitions always have a location, if users didn't specify, we will generate a default one. The generated one and specified one are treated equally.


import org.apache.spark.sql.catalyst.catalog.CatalogTypes.TablePartitionSpec

object ExternalCatalogUtils {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ericl PartitioningUtils is not accessible in catalyst module, so I created this.

@SparkQA
Copy link

SparkQA commented Nov 8, 2016

Test build #68309 has finished for PR 15797 at commit aa2536f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Nov 8, 2016

Test build #68314 has finished for PR 15797 at commit aa2536f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 8, 2016

Test build #68322 has finished for PR 15797 at commit dddee47.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

retest this please

builder.toString()
}

def unescapePathName(path: String): String = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If escapePathName is moved, I think we also need to move unescapePathName to the same file?

val catalog = newBasicCatalog()
val db = catalog.getDatabase("db1")
val table = CatalogTable(
identifier = TableIdentifier("my_table", Some("db1")),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we keep the original names unchanged? Now, the table/database names allow alphanumeric characters and underscores. Thus, I think they are named using underscore on purpose?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually this test was written by me, and I picked myTable at the first time, but changed to my_table because of this bug.

throw new AnalysisException(s"table $identifier did not specify database")
}

def location: String = storage.locationUri.getOrElse {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like the other functions in this file, add the function descriptions too?

@SparkQA
Copy link

SparkQA commented Nov 8, 2016

Test build #68329 has finished for PR 15797 at commit dddee47.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 8, 2016

Test build #68336 has finished for PR 15797 at commit f85bb27.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 8, 2016

Test build #68343 has finished for PR 15797 at commit d965a56.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 9, 2016

Test build #68376 has finished for PR 15797 at commit d939b37.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val fs = tablePath.getFileSystem(hadoopConf)
val newParts = newSpecs.map { spec =>
val partition = client.getPartition(db, table, lowerCasePartitionSpec(spec))
val wrongPath = new Path(partition.storage.locationUri.get)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about creating a function location for partitions too, just like what this PR did for table's location?

fs.rename(wrongPath, rightPath)
} catch {
case e: IOException =>
throw new SparkException(s"Unable to rename partition path $wrongPath", e)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we also need to add rightPath here

@SparkQA
Copy link

SparkQA commented Nov 9, 2016

Test build #68392 has finished for PR 15797 at commit 9dbc3f1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ericl
Copy link
Contributor

ericl commented Nov 9, 2016

I just tested this PR with #15814 modified with mixed case partition names in unit tests, and it fixes the issues there too. I'll leave it to @gatorsmile to lgtm since he has a few more comments.

@SparkQA
Copy link

SparkQA commented Nov 10, 2016

Test build #68435 has finished for PR 15797 at commit 56f3073.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val dir = new Path(tableMeta.location)
try {
val fs = dir.getFileSystem(hadoopConfig)
fs.delete(dir, true)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the deletion is recursive and thus it can delete the lcation/directory of a partitioned table. I checked the implementation of delete. It sounds like it does not guaranttee the atomicity. The deletion could be partial. We might need to also put a TODO here, like what we did for dropPartition

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's minor and not related to this PR, I'd like to not touch it.

@gatorsmile
Copy link
Member

gatorsmile commented Nov 10, 2016

In renamePartition of HiveExternalCatalog, we update the file names to preserve the case after renaming partitions. How about another command loadPartition?

More generally, if we insert new data into a partitioned table, Hive will create a new directory for non-existent partitions. These newly created file path are not case-preserving, right?

val hasUpperCasePartitionColumn = partitionColumnNames.exists(col => col.toLowerCase != col)
if (tableMeta.tableType == MANAGED && hasUpperCasePartitionColumn) {
val tablePath = new Path(tableMeta.location)
val fs = tablePath.getFileSystem(hadoopConf)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the other functions, the function call getFileSystem is also part of the try-catch block. Do we need to make it consistent with others?

@cloud-fan
Copy link
Contributor Author

In renamePartition of HiveExternalCatalog, we update the file names to preserve the case after renaming partitions. How about another command loadPartition? More generally, if we insert new data into a partitioned table, Hive will create a new directory for non-existent partitions. These newly created file path are not case-preserving, right?

This PR tries its best to fix some partition related behaviours, I think it's OK to leave some hard works for follow-ups.

@SparkQA
Copy link

SparkQA commented Nov 10, 2016

Test build #68461 has finished for PR 15797 at commit 0a15edc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

LGTM

@rxin
Copy link
Contributor

rxin commented Nov 10, 2016

Merging in master/branch-2.1.

asfgit pushed a commit that referenced this pull request Nov 10, 2016
…aviours of ExternalCatalog

## What changes were proposed in this pull request?

This PR corrects several partition related behaviors of `ExternalCatalog`:

1. default partition location should not always lower case the partition column names in path string(fix `HiveExternalCatalog`)
2. rename partition should not always lower case the partition column names in updated partition path string(fix `HiveExternalCatalog`)
3. rename partition should update the partition location only for managed table(fix `InMemoryCatalog`)
4. create partition with existing directory should be fine(fix `InMemoryCatalog`)
5. create partition with non-existing directory should create that directory(fix `InMemoryCatalog`)
6. drop partition from external table should not delete the directory(fix `InMemoryCatalog`)

## How was this patch tested?

new tests in `ExternalCatalogSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #15797 from cloud-fan/partition.

(cherry picked from commit 2f7461f)
Signed-off-by: Reynold Xin <rxin@databricks.com>
@asfgit asfgit closed this in 2f7461f Nov 10, 2016
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…aviours of ExternalCatalog

## What changes were proposed in this pull request?

This PR corrects several partition related behaviors of `ExternalCatalog`:

1. default partition location should not always lower case the partition column names in path string(fix `HiveExternalCatalog`)
2. rename partition should not always lower case the partition column names in updated partition path string(fix `HiveExternalCatalog`)
3. rename partition should update the partition location only for managed table(fix `InMemoryCatalog`)
4. create partition with existing directory should be fine(fix `InMemoryCatalog`)
5. create partition with non-existing directory should create that directory(fix `InMemoryCatalog`)
6. drop partition from external table should not delete the directory(fix `InMemoryCatalog`)

## How was this patch tested?

new tests in `ExternalCatalogSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#15797 from cloud-fan/partition.
val rightPath = ExternalCatalogUtils.generatePartitionPath(
spec, partitionColumnNames, tablePath)
try {
tablePath.getFileSystem(hadoopConf).rename(wrongPath, rightPath)
Copy link
Member

@gatorsmile gatorsmile Jan 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found an issue here... When we call rename, not all the file systems have the same behaviors. For example, on mac OS, when we doing this .../tbl/a=5/b=6 -> .../tbl/A=5/B=6 . The result is .../tbl/a=5/B=6. Thus, it is not recursive. However, the file system used in Jenkin does not have such an issue

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch! let me think about how to fix it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants