Conversation

@ajantha-bhat
Member

All the Spark actions should have call procedures for easy SQL access. We don't have a call procedure for delete_reachable_files; hence this PR.
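
For context, Iceberg stored procedures are invoked through CALL. A minimal sketch of how the new procedure might be exercised in a test, assuming the sql(...) helper from the Spark test base class; the metadata_file argument name and the metadataLocation field are hypothetical here (the actual signature is whatever this PR defines):

@Test
public void testDeleteReachableFiles() {
  // Follows the CALL <catalog>.system.<procedure>(<named args>) convention;
  // 'metadata_file' and 'metadataLocation' are assumptions for illustration.
  sql("CALL %s.system.delete_reachable_files(metadata_file => '%s')",
      catalogName, metadataLocation);
}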

if (catalogName.equals("testhadoop") || catalogName.equals("testhive")) {
  // This procedure cannot work for the Hadoop catalog, because after dropping
  // the table the metadata file is deleted (even with purge=false).
  // For the Hive catalog, dropping the table with purge=true does not clean up
  // the table in the metastore.

Member Author

Is it an issue?

Contributor

Two things:

  1. I would use Assume.assumeFalse("The reason is... ", isHadoopCatalog);, and put the Hive one on its own line with its own explanation. Also, determine in the test constructor whether the catalog is a Hadoop or Hive catalog rather than relying on the names once you're in the tests (even if the constructor uses the names to determine the catalog type, it's still cleaner to use a simple boolean field, and the boolean condition can be changed as tests evolve).

And I'd place it at the start of each test vs in a nested function in the file so it's clear why it's skipped.

Here's an example:

@Test
public void testDefaultNamespace() {
  Assume.assumeFalse("Hadoop has no default namespace configured", isHadoopCatalog);
  // ... rest of the test ...
}

  2. But more importantly, I would see if you can use temp to forcefully drop the data in an @After (run after each test) so the tests can always run. Or otherwise refactor the tests so that the LOCATION is simply different for each test. You might have to save the location from temp.newFolder() in a private class-level variable to reference when dropping. I'm not sure. That would ideally get rid of this pre-requirement; it should be able to run for at least the Hive catalog (see the sketch below).

And then create an issue related to the purge if necessary.
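
A minimal sketch of that @After cleanup, assuming JUnit 4, the repo's sql(...) helper, commons-io's FileUtils, and hypothetical tableName/tableLocation fields (the location saved from temp.newFolder()):

private File tableLocation; // set from temp.newFolder() in each test

@After
public void cleanupTable() throws IOException {
  // Drop the table if a test left it behind, then delete whatever files remain
  // at its location so purge behavior can't leak state between runs.
  sql("DROP TABLE IF EXISTS %s", tableName);
  if (tableLocation != null) {
    FileUtils.deleteDirectory(tableLocation);
  }
}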

Contributor

An example of setting the isHadoopCatalog variable in that same file:

private final boolean isHadoopCatalog;

public TestNamespaceSQL(String catalogName, String implementation, Map<String, String> config) {
  super(catalogName, implementation, config);
  this.fullNamespace = ("spark_catalog".equals(catalogName) ? "" : catalogName + ".") + NS;
  this.isHadoopCatalog = "testhadoop".equals(catalogName);
}

Notice that it is the same check, but it's much cleaner to read in the test when you use Assume.assumeFalse(isHadoopCatalog).

Contributor

But more importantly (and sorry for the long-winded comments, but since you're asking, these are patterns we use a lot in the repo):

I'd suggest getting the temp folder to be in the correct state in either an @Before function or an @After function so this can be run with as many catalogs as are supported by it. It's usually just HadoopCatalog or spark_catalog that I see the assumption on.
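
For instance, a sketch of the @Before variant, assuming the class keeps a hypothetical tableLocation field and the usual JUnit TemporaryFolder rule named temp:

@Before
public void setupTableLocation() throws IOException {
  // temp.newFolder() hands each test its own fresh directory, so a leftover
  // table from a previous parameterized run can't collide on LOCATION.
  this.tableLocation = temp.newFolder();
  // e.g. sql("CREATE TABLE %s ... LOCATION '%s'", tableName, tableLocation.toURI());
}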

So the HiveCatalog assumption gives me a bit of pause and makes me wonder if we can't fix it temporarily within the test and then fix the behavior fully in a later patch (please open a ticket if there is an incorrect behavior taking place).

Also, have you merged in latest master? Somebody submitted a fix for properly purging data files, so that might help. But this seems to be about NOT wanting to purge and having the purge happen anyway (if my understanding is correct), so that seems like a new issue that needs to be looked into.

Contributor

@kbendick left a comment

Left some comments about the test.

I don't fully understand the issue, but I left some suggestions and links about how we typically format such things in the Iceberg repo. And if there is behavior happening that shouldn't be (it sounds like we're purging data and not respecting the purge flag), please open another issue and we'll investigate ASAP (or at least have a record of it to follow up on) 🙂
