Conversation

@ajantha-bhat
Member

All the Spark actions should have call procedures for easy SQL access. We don't have a call procedure for delete_reachable_files; hence this PR.
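
For context, Iceberg stored procedures are invoked through CALL. A minimal sketch of how the new procedure might be exercised in a test, assuming the sql(...) helper from the Spark test base class; the metadata_file argument name and the metadataLocation field are hypothetical here (the actual signature is whatever this PR defines):

@Test
public void testDeleteReachableFiles() {
  // Follows the CALL <catalog>.system.<procedure>(<named args>) convention;
  // 'metadata_file' and 'metadataLocation' are assumptions for illustration.
  sql("CALL %s.system.delete_reachable_files(metadata_file => '%s')",
      catalogName, metadataLocation);
}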

if (catalogName.equals("testhadoop") || catalogName.equals("testhive")) {
  // This procedure cannot work for the Hadoop catalog, because after dropping
  // the table the metadata file is deleted (even with purge=false).
  // For the Hive catalog, dropping the table with purge=true does not clean up
  // the table in the metastore.

Member Author

Is it an issue?

Contributor

Two things:

  1. I would use Assume.assumeFalse("The reason is... ", isHadoopCatalog);, and put the Hive one on its own line with its own explanation. Also, determine in the test constructor whether the catalog is a Hadoop or Hive catalog rather than relying on the names once you're in the tests (even if the constructor uses the names to determine the catalog type, it's still cleaner to use a simple boolean field, and the boolean condition can be changed as tests evolve).

And I'd place it at the start of each test vs in a nested function in the file so it's clear why it's skipped.

Here's an example:

@Test
public void testDefaultNamespace() {
  Assume.assumeFalse("Hadoop has no default namespace configured", isHadoopCatalog);
  // ... rest of the test ...
}

  2. But more importantly, I would see if you can use temp to forcefully drop the data in an @After (run after each test) so the tests can always run. Or otherwise refactor the tests so that the LOCATION is simply different for each test. You might have to save the location from temp.newFolder() in a private class-level variable to reference when dropping. I'm not sure. That would ideally get rid of this pre-requirement; it should be able to run for at least the Hive catalog (see the sketch below).

And then create an issue related to the purge if necessary.
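
A minimal sketch of that @After cleanup, assuming JUnit 4, the repo's sql(...) helper, commons-io's FileUtils, and hypothetical tableName/tableLocation fields (the location saved from temp.newFolder()):

private File tableLocation; // set from temp.newFolder() in each test

@After
public void cleanupTable() throws IOException {
  // Drop the table if a test left it behind, then delete whatever files remain
  // at its location so purge behavior can't leak state between runs.
  sql("DROP TABLE IF EXISTS %s", tableName);
  if (tableLocation != null) {
    FileUtils.deleteDirectory(tableLocation);
  }
}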

Contributor

An example of setting the isHadoopCatalog variable in that same file:

private final boolean isHadoopCatalog;

public TestNamespaceSQL(String catalogName, String implementation, Map<String, String> config) {
  super(catalogName, implementation, config);
  this.fullNamespace = ("spark_catalog".equals(catalogName) ? "" : catalogName + ".") + NS;
  this.isHadoopCatalog = "testhadoop".equals(catalogName);
}

Notice that it is the same check, but it's much cleaner to read in the test when you use Assume.assumeFalse(isHadoopCatalog).

Contributor

But more importantly (and sorry for the long-winded comments, but since you're asking, these are patterns we use a lot in the repo):

I'd suggest getting the temp folder to be in the correct state in either an @Before function or an @After function so this can be run with as many catalogs as are supported by it. It's usually just HadoopCatalog or spark_catalog that I see the assumption on.
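
For instance, a sketch of the @Before variant, assuming the class keeps a hypothetical tableLocation field and the usual JUnit TemporaryFolder rule named temp:

@Before
public void setupTableLocation() throws IOException {
  // temp.newFolder() hands each test its own fresh directory, so a leftover
  // table from a previous parameterized run can't collide on LOCATION.
  this.tableLocation = temp.newFolder();
  // e.g. sql("CREATE TABLE %s ... LOCATION '%s'", tableName, tableLocation.toURI());
}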

So the HiveCatalog assumption gives me a bit of pause and makes me wonder if we can't fix it temporarily within the test and then fix the behavior fully in a later patch (please open a ticket if there is an incorrect behavior taking place).

Also, have you merged in latest master? Somebody submitted a fix for properly purging data files, so that might help. But this seems to be about NOT wanting to purge and having the purge happen anyway (if my understanding is correct), so that seems like a new issue that needs to be looked into.

Contributor

@kbendick left a comment

Left some comments about the test.

I don't fully understand the issue, but I left some suggestions and links about how we typically format such things in the Iceberg repo. And if there is behavior happening that shouldn't be (it sounds like we're purging data and not respecting the purge flag), please open another issue and we'll investigate ASAP (or at least have a record of it to follow up on) 🙂
