-
Notifications
You must be signed in to change notification settings - Fork 2.9k
Spark 3.3: drop_namespace with CASCADE support #7275
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
b41e4bf
f2a7ace
8441c74
00671d2
a4049d6
01de116
d13837a
da335fd
2c31a4f
db87d5a
aa63ee3
28b289b
c5dd8a5
5ad1e94
d4d9dc2
c75ea5c
4c05fcb
59bb5e3
e485616
4a04148
95d6676
f5219db
ac28327
187d378
e43de0e
3ba10f9
3e852ac
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -349,6 +349,17 @@ public boolean dropNamespace(Namespace namespace) { | |
| } | ||
| } | ||
|
|
||
| @Override | ||
| public boolean dropNamespace(Namespace namespace, boolean cascade) | ||
| throws NamespaceNotEmptyException { | ||
| if (cascade) { | ||
| // recursively delete all nested namespaces | ||
| listNamespaces(namespace).forEach(n -> dropNamespace(n, true)); | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. All of the other catalogs do not drop nested namespaces, do we need to cover this in other cases? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. From my testing and existing tests, not all catalogs support nested namespaces only the ones I covered here do. Example Hive doesn't support nested namespaces. If I missed any, I can add those as well. |
||
| listTables(namespace).forEach(this::dropTable); | ||
| } | ||
| return dropNamespace(namespace); | ||
| } | ||
|
|
||
| @Override | ||
| public boolean setProperties(Namespace namespace, Map<String, String> properties) { | ||
| throw new UnsupportedOperationException( | ||
|
|
||
| Original file line number | Diff line number | Diff line change | ||
|---|---|---|---|---|
|
|
@@ -419,6 +419,17 @@ public boolean dropNamespace(Namespace namespace) throws NamespaceNotEmptyExcept | |||
| return deletedRows > 0; | ||||
| } | ||||
|
|
||||
| @Override | ||||
| public boolean dropNamespace(Namespace namespace, boolean cascade) | ||||
| throws NamespaceNotEmptyException { | ||||
| if (cascade) { | ||||
| // recursively delete all nested namespaces | ||||
| listNamespaces(namespace).forEach(n -> dropNamespace(n, true)); | ||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. JDBC Supports nested namespaces? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I believe so based on
|
||||
| listTables(namespace).forEach(this::dropTable); | ||||
| } | ||||
| return dropNamespace(namespace); | ||||
| } | ||||
|
|
||||
| @Override | ||||
| public boolean setProperties(Namespace namespace, Map<String, String> properties) | ||||
| throws NoSuchNamespaceException { | ||||
|
|
||||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -140,6 +140,47 @@ AS SELECT ... | |
| The schema and partition spec will be replaced if changed. To avoid modifying the table's schema and partitioning, use `INSERT OVERWRITE` instead of `REPLACE TABLE`. | ||
| The new table properties in the `REPLACE TABLE` command will be merged with any existing table properties. The existing table properties will be updated if changed else they are preserved. | ||
|
|
||
| ## `DROP NAMESPACE` | ||
|
|
||
| ### `DROP EMPTY NAMESPACE` | ||
|
|
||
| To drop an _empty_ namespace, run: | ||
|
|
||
| ```sql | ||
| DROP database prod.db.sample | ||
| ``` | ||
| If the namespace is not empty, this will fail with _NamespaceNotEmptyException_. | ||
|
|
||
|
|
||
| ### `DROP NON_EMPTY NAMESPACE` | ||
|
|
||
| To drop a namespace and all its contents including tables, run: | ||
|
|
||
| ```sql | ||
| DROP TABLE prod.db.sample CASCADE | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Should this be table? or database? |
||
| ``` | ||
| **WARNING**: | ||
| drop table purge behaviour with cascade depends on the type of catalog managing the namespace. | ||
| see below mapping of purge behaviour for different catalogs. | ||
| - If the database is managed by **HiveCatalog**, this will _purge_ all the tables in the database. | ||
|
|
||
| #### `drop namespace table purge behaviour by catalog for CASCADE` | ||
|
|
||
| When namespace is dropped with _CASCADE_, all tables are dropped and contents are purged based on the type of | ||
| catalog. | ||
|
|
||
| | Catalog | Table Purge | Nested Namespaces | | ||
| |-----------|--------------|------------------------| | ||
| | Hive | - [ x ] | - [ ] | | ||
| | Hadoop | - [ ] | - [ x ] | | ||
| | JDBC | - [ ] | - [ x ] | | ||
| | ECS | - [ ] | - [ x ] | | ||
| | Nessie | NotSupported | NotSupported | | ||
| | DynamoDb | NotSupported | NotSupported | | ||
| | Glue | NotSupported | NotSupported | | ||
| | Snowflake | NotSupported | NotSupported | | ||
|
|
||
|
|
||
| ## `DROP TABLE` | ||
|
|
||
| The drop table behavior changed in 0.14. | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs a stricter definition IMO. In some instances below here you delete folders, sometimes you delete files, sometimes it deletes entries. We need a strong definition here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One big missing thing here is how this works with "purge"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we have two options, drop_namespace with cascade will:
We can standardize this in Iceberg and go with the preferred option, #1 sounds better as #2 can take a long time depending on number of tables and data in them. This behavior is not consistent across all catalogs, for example hive metastore cascade deletes all the data (purge) in tables where as others don't.
should I create a vote thread for this to decide on the preference or can we decide that in this PR? I can then update the implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@RussellSpitzer documented the current purge behavior in the PR description and docs.
Only Hive catalog purges the tables as that's the default behavior of hive when we pass the cascade=true parameter.
I can make it consistent across all catalogs and modify current hive catalog behavior to just drop table without purge similar to other catalogs. Let me know your thoughts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah sorry I didn't get back to this but I do believe we probably need a discuss thread, or we can get feedback here but I think we need to hear from @jackye1995 and others who are maintaining other Catalog implementations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed implementation for other catalogs, will initiate a discussion once api change in this PR is merged
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Glue by default does cascade drop if just call
glue.deleteDatabase, but does not drop data in the table. So technically you can support that directly. But I am okay to publish another PR to support that, up to you.For doing purge or not, I think the behavior of cascade is not clearly specified in Spark. It could also be argued as something that is catalog-specific, just like the behavior of Hive and Glue are different and it will be difficult to satisfy the other behavior on both sides.
This is related to the issue about different behaviors of list namespace that comes up recently. Maybe we should make a page documenting all catalogs and their different behaviors.
+1 to have a devlist discussion thread first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will start a devlist discussion and create a one pager with current state. Thanks @jackye1995
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jackye1995 To confirm are you OK with adding this api to Iceberg? Or do you want keep it within engine implementations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes I am okay with adding this to Iceberg API, otherwise there is no way to leverage catalog service level integration features. I would imagine people using REST catalog would like to just handle this behind the service.