-
Notifications
You must be signed in to change notification settings - Fork 2.9k
Spark 3.3: drop_namespace with CASCADE support #7275
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| * | ||
| * @param context session context | ||
| * @param namespace a {@link Namespace namespace} | ||
| * @param cascade – When true, deletes all objects under the namespace |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs a stricter definition IMO. In some instances below here you delete folders, sometimes you delete files, sometimes it deletes entries. We need a strong definition here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One big missing thing here is how this works with "purge"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we have two options, drop_namespace with cascade will:
- drop the tables not the data (no purge)
- drop the tables and delete the data (purge)
We can standardize this in Iceberg and go with the preferred option, #1 sounds better as #2 can take a long time depending on number of tables and data in them. This behavior is not consistent across all catalogs, for example hive metastore cascade deletes all the data (purge) in tables where as others don't.
should I create a vote thread for this to decide on the preference or can we decide that in this PR? I can then update the implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@RussellSpitzer documented the current purge behavior in the PR description and docs.
Only Hive catalog purges the tables as that's the default behavior of hive when we pass the cascade=true parameter.
I can make it consistent across all catalogs and modify current hive catalog behavior to just drop table without purge similar to other catalogs. Let me know your thoughts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah sorry I didn't get back to this but I do believe we probably need a discuss thread, or we can get feedback here but I think we need to hear from @jackye1995 and others who are maintaining other Catalog implementations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed implementation for other catalogs, will initiate a discussion once api change in this PR is merged
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Glue by default does cascade drop if just call glue.deleteDatabase, but does not drop data in the table. So technically you can support that directly. But I am okay to publish another PR to support that, up to you.
For doing purge or not, I think the behavior of cascade is not clearly specified in Spark. It could also be argued as something that is catalog-specific, just like the behavior of Hive and Glue are different and it will be difficult to satisfy the other behavior on both sides.
This is related to the issue about different behaviors of list namespace that comes up recently. Maybe we should make a page documenting all catalogs and their different behaviors.
+1 to have a devlist discussion thread first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will start a devlist discussion and create a one pager with current state. Thanks @jackye1995
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jackye1995 To confirm are you OK with adding this api to Iceberg? Or do you want keep it within engine implementations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes I am okay with adding this to Iceberg API, otherwise there is no way to leverage catalog service level integration features. I would imagine people using REST catalog would like to just handle this behind the service.
| public boolean dropNamespace(Namespace namespace, boolean cascade) | ||
| throws NamespaceNotEmptyException { | ||
| if (cascade) { | ||
| listTables(namespace).forEach(this::dropTable); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think i'm missing something here since I'm not sure how dropTable get's called here with a single arg. The issue like I mentioned above is this would be a purely metadata delete for Dynamo I think in this case. Is that the expected behavior we want?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dropTable is overloaded and has two methods, 1 with just single arg TableIdentifier and another which takes the boolean purge arg, we are using the former one here where purge is false.
addressed the purge comment here (#7275 (comment)))
| throws NamespaceNotEmptyException { | ||
| if (cascade) { | ||
| // recursively delete all nested namespaces | ||
| listNamespaces(namespace).forEach(n -> dropNamespace(n, true)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All of the other catalogs do not drop nested namespaces, do we need to cover this in other cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From my testing and existing tests, not all catalogs support nested namespaces only the ones I covered here do. Example Hive doesn't support nested namespaces.
If I missed any, I can add those as well.
| throws NamespaceNotEmptyException { | ||
| if (cascade) { | ||
| // recursively delete all nested namespaces | ||
| listNamespaces(namespace).forEach(n -> dropNamespace(n, true)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
JDBC Supports nested namespaces?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe so based on
| public void testDropNamespace() { |
|
@RussellSpitzer |
|
Tests are failing because of AssertHelpers usage, think that needs a minor change to AssertJ |
|
@RussellSpitzer fixed build, Thanks |
| } | ||
|
|
||
| @Override | ||
| public boolean dropNamespace(Namespace namespace, boolean cascade) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't be needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I may be missing something, its needed to implement the interface though?
Class 'SnowflakeCatalog' must either be declared abstract or implement abstract method 'dropNamespace(Namespace, boolean)' in 'SupportsNamespaces'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh above you would add this with a "default" throw unsupported operation exception to the api. That way you don't have to add exception throwing code to all the catalogs that don't implement it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This also prevents the change from breaking other catalogs not in the Iceberg repo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah! I see what you mean, makes sense. Thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@RussellSpitzer made it default.
| * @return true if the namespace was dropped, false otherwise. | ||
| * @throws NamespaceNotEmptyException If the namespace is not empty | ||
| */ | ||
| boolean dropNamespace(SessionContext context, Namespace namespace, boolean cascade); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need the default here as well? I think we do for any implementers of SessionCatalog
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added.
| To drop a namespace and all its contents including tables, run: | ||
|
|
||
| ```sql | ||
| DROP TABLE prod.db.sample CASCADE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be table? or database?
| if (asNamespaceCatalog != null) { | ||
| try { | ||
| return asNamespaceCatalog.dropNamespace(Namespace.of(namespace)); | ||
| return asNamespaceCatalog.dropNamespace(Namespace.of(namespace), cascade); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs tests for the Spark DDL
|
@abmo-x any when do you plan to merge this PR? |
|
@abmo-x do you plan to merge this PR? |
|
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions. |
|
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |
SparkCatalog in 3.3 passes the
cascadeflag, current implementation doesn't use the flag and ignores it quietlyiceberg/spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java
Line 506 in d485cc8
we can use that to support drop_namespace with cascade and fix current implementation by adding new api
This PR adds new API to support cascade with drop_namespace in Iceberg Catalog
drop namespace table purge behaviour by catalog for CASCADE
When namespace is dropped with CASCADE, all tables are dropped and contents are purged based on the type of
catalog.
Should Fix Issue: #3541