Skip to content

Conversation

@abmo-x
Copy link
Contributor

@abmo-x abmo-x commented Apr 3, 2023

SparkCatalog in 3.3 passes the cascade flag, current implementation doesn't use the flag and ignores it quietly

public boolean dropNamespace(String[] namespace, boolean cascade)

we can use that to support drop_namespace with cascade and fix current implementation by adding new api

public boolean dropNamespace(String[] namespace, boolean cascade)

This PR adds new API to support cascade with drop_namespace in Iceberg Catalog

boolean dropNamespace(Namespace namespace, boolean cascade) throws NamespaceNotEmptyException;

drop namespace table purge behaviour by catalog for CASCADE

When namespace is dropped with CASCADE, all tables are dropped and contents are purged based on the type of
catalog.

Catalog Table Purge Nested Namespaces
Hive - [x] - [ ]
Hadoop - [ ] - [ x ]
JDBC - [ ] - [ x ]
Nessie NotSupported NotSupported
DynamoDb NotSupported NotSupported
Glue NotSupported NotSupported
ECS NotSupported NotSupported
Snowflake NotSupported NotSupported

Should Fix Issue: #3541

*
* @param context session context
* @param namespace a {@link Namespace namespace}
* @param cascade – When true, deletes all objects under the namespace
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs a stricter definition IMO. In some instances below here you delete folders, sometimes you delete files, sometimes it deletes entries. We need a strong definition here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One big missing thing here is how this works with "purge"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have two options, drop_namespace with cascade will:

  1. drop the tables not the data (no purge)
  2. drop the tables and delete the data (purge)

We can standardize this in Iceberg and go with the preferred option, #1 sounds better as #2 can take a long time depending on number of tables and data in them. This behavior is not consistent across all catalogs, for example hive metastore cascade deletes all the data (purge) in tables where as others don't.

should I create a vote thread for this to decide on the preference or can we decide that in this PR? I can then update the implementation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RussellSpitzer documented the current purge behavior in the PR description and docs.

Only Hive catalog purges the tables as that's the default behavior of hive when we pass the cascade=true parameter.

I can make it consistent across all catalogs and modify current hive catalog behavior to just drop table without purge similar to other catalogs. Let me know your thoughts

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah sorry I didn't get back to this but I do believe we probably need a discuss thread, or we can get feedback here but I think we need to hear from @jackye1995 and others who are maintaining other Catalog implementations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed implementation for other catalogs, will initiate a discussion once api change in this PR is merged

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Glue by default does cascade drop if just call glue.deleteDatabase, but does not drop data in the table. So technically you can support that directly. But I am okay to publish another PR to support that, up to you.

For doing purge or not, I think the behavior of cascade is not clearly specified in Spark. It could also be argued as something that is catalog-specific, just like the behavior of Hive and Glue are different and it will be difficult to satisfy the other behavior on both sides.

This is related to the issue about different behaviors of list namespace that comes up recently. Maybe we should make a page documenting all catalogs and their different behaviors.

+1 to have a devlist discussion thread first.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will start a devlist discussion and create a one pager with current state. Thanks @jackye1995

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jackye1995 To confirm are you OK with adding this api to Iceberg? Or do you want keep it within engine implementations?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes I am okay with adding this to Iceberg API, otherwise there is no way to leverage catalog service level integration features. I would imagine people using REST catalog would like to just handle this behind the service.

public boolean dropNamespace(Namespace namespace, boolean cascade)
throws NamespaceNotEmptyException {
if (cascade) {
listTables(namespace).forEach(this::dropTable);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think i'm missing something here since I'm not sure how dropTable get's called here with a single arg. The issue like I mentioned above is this would be a purely metadata delete for Dynamo I think in this case. Is that the expected behavior we want?

Copy link
Contributor Author

@abmo-x abmo-x Apr 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dropTable is overloaded and has two methods, 1 with just single arg TableIdentifier and another which takes the boolean purge arg, we are using the former one here where purge is false.

addressed the purge comment here (#7275 (comment)))

throws NamespaceNotEmptyException {
if (cascade) {
// recursively delete all nested namespaces
listNamespaces(namespace).forEach(n -> dropNamespace(n, true));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of the other catalogs do not drop nested namespaces, do we need to cover this in other cases?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my testing and existing tests, not all catalogs support nested namespaces only the ones I covered here do. Example Hive doesn't support nested namespaces.

If I missed any, I can add those as well.

throws NamespaceNotEmptyException {
if (cascade) {
// recursively delete all nested namespaces
listNamespaces(namespace).forEach(n -> dropNamespace(n, true));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JDBC Supports nested namespaces?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe so based on

@github-actions github-actions bot added the docs label Apr 18, 2023
@abmo-x abmo-x requested a review from RussellSpitzer April 18, 2023 22:45
Abid Mohammed added 2 commits April 18, 2023 15:52
@abmo-x
Copy link
Contributor Author

abmo-x commented Apr 19, 2023

@rdblue @kbendick
can you review when you get a chance, Thanks!

@abmo-x
Copy link
Contributor Author

abmo-x commented May 5, 2023

@RussellSpitzer
Makes sense, I removed implementation for various catalogs and left them as unsupported.
Only implemented for Hive, Hadoop and JDBC.

@RussellSpitzer
Copy link
Member

Tests are failing because of AssertHelpers usage, think that needs a minor change to AssertJ

@abmo-x
Copy link
Contributor Author

abmo-x commented May 9, 2023

@RussellSpitzer fixed build, Thanks

}

@Override
public boolean dropNamespace(Namespace namespace, boolean cascade) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't be needed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may be missing something, its needed to implement the interface though?

Class 'SnowflakeCatalog' must either be declared abstract or implement abstract method 'dropNamespace(Namespace, boolean)' in 'SupportsNamespaces'

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh above you would add this with a "default" throw unsupported operation exception to the api. That way you don't have to add exception throwing code to all the catalogs that don't implement it

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also prevents the change from breaking other catalogs not in the Iceberg repo

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah! I see what you mean, makes sense. Thanks

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RussellSpitzer made it default.

* @return true if the namespace was dropped, false otherwise.
* @throws NamespaceNotEmptyException If the namespace is not empty
*/
boolean dropNamespace(SessionContext context, Namespace namespace, boolean cascade);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need the default here as well? I think we do for any implementers of SessionCatalog

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

To drop a namespace and all its contents including tables, run:

```sql
DROP TABLE prod.db.sample CASCADE
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be table? or database?

if (asNamespaceCatalog != null) {
try {
return asNamespaceCatalog.dropNamespace(Namespace.of(namespace));
return asNamespaceCatalog.dropNamespace(Namespace.of(namespace), cascade);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs tests for the Spark DDL

@vinitamaloo-asu
Copy link

@abmo-x any when do you plan to merge this PR?

@supsupsap
Copy link

@abmo-x do you plan to merge this PR?

@github-actions
Copy link

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Aug 28, 2024
@github-actions
Copy link

github-actions bot commented Sep 5, 2024

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Sep 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants