
Spark: add property to disable client-side purging in Spark #11317

Open
wants to merge 6 commits into base: main

Conversation

@twuebi commented Oct 14, 2024

closes #11023

This PR adds a new table property, io.client-side.purge-enabled. It allows a catalog to control Spark's behavior on DROP .. PURGE. This, in turn, enables REST catalogs to offer an UNDROP feature for all storage providers. Currently, this is only possible with S3 due to the s3.delete-enabled property.
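Since the discussion below moves this to a catalog property, here is a minimal sketch of how it could be supplied to Spark; the catalog name and URI are placeholders, and the property name is the one initially proposed in this PR:

spark.sql.catalog.demo=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.demo.type=rest
spark.sql.catalog.demo.uri=http://localhost:8181
spark.sql.catalog.demo.io.client-side.purge-enabled=false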

@RussellSpitzer (Member) left a comment

I think the intent here is good, but I'm not sure the implementation is quite correct. My first issue is that the property probably needs a Spark-specific prefix or naming somewhere, but the bigger issue is that this flag does not actually guarantee that a server-side purge will occur when it is disabled. For many of our catalog implementations, the purge is still executed locally on the driver (see HiveCatalog). At best this would be a "request catalog purge" flag, which in some cases would be honored by the catalog and in other cases would not.

Also, this is placed within "Catalog Properties" but it is checked as a table property. I think Catalog is probably the correct place to put this, because it's really a client decision and not really intrinsic to the table.

I don't think these are deal breakers though, and I think with proper naming we can include this.

@twuebi (Author) commented Oct 14, 2024

Hi @RussellSpitzer, thanks for the quick feedback!

I changed the property to be read from catalog properties.

Regarding your other concern, I'm not sure I follow. In our use case, being a REST catalog, we'd like to set this flag to keep clients from deleting stuff, so we'd like it to not be a client decision. If the server sets this flag, it tells the client: "I guarantee the purge." Does this perspective conflict with other catalogs? I thought that with the current proposal, it would be up to the specific catalog to decide whether it or the respective client performs the deletes.

As for the names, I'm happy to change them; naming is hard.

Kind regards, Tobias

@RussellSpitzer (Member) commented Oct 14, 2024

The problem is that table properties will only be respected by clients that know how to use them, so although you may set this property, you have no guarantee that clients will honor it. We can keep this a table property, but then it goes in the TableProperties file.

The implementation problem is that we already have several catalog implementations which are essentially client-only and do not support a remote purge, for example HiveCatalog, GlueCatalog, and HadoopCatalog:

CatalogUtil.dropTableData(ops.io(), lastMetadata);

So this property actually controls whether to use the "catalog purge" implementation, which may or may not be on the client.
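For reference, the client-side purge path in such catalogs is shaped roughly like this (a simplified sketch of the common dropTable pattern, not the exact HiveCatalog code):

@Override
public boolean dropTable(TableIdentifier identifier, boolean purge) {
  TableOperations ops = newTableOps(identifier);
  TableMetadata lastMetadata = purge ? ops.current() : null;
  // ... remove the table entry from the metastore ...
  if (purge && lastMetadata != null) {
    // data and metadata files are deleted by the client/driver, not by a server
    CatalogUtil.dropTableData(ops.io(), lastMetadata);
  }
  return true;
}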

@twuebi (Author) commented Oct 15, 2024

The problem is that table properties will only be respected by clients which know how to use it, so although you may set this property, you have no guarantee clients will follow the property.

That is clear; we can prevent non-conforming clients from deleting stuff by not signing the respective requests or by handing out downscoped tokens without delete permissions. This would cause client-side errors, so eventually clients would have to conform if they want to talk to a catalog server with the UNDROP feature.

We can keep this a table property, but then it goes in the TableProperties file.

Sure, I made it a catalog property since I agree with you: it's a property of the catalog and will likely not change on a per-table basis. I had originally put it under table properties since that's where the s3.delete-enabled property was placed.

The implementation problem is that we already have several catalog implementations which are essentially client-only and do not support a remote purge, for example HiveCatalog, GlueCatalog, and HadoopCatalog:

CatalogUtil.dropTableData(ops.io(), lastMetadata);

So this property actually controls whether to use the "catalog purge" implementation, which may or may not be on the client.

Could we prefix the property with something like rest-catalog to make clear to clients that it has no effect on the other catalogs, for which this flag has no meaning?

@@ -365,24 +368,35 @@ public boolean purgeTable(Identifier ident) {
String metadataFileLocation =
((HasTableOperations) table).operations().current().metadataFileLocation();

A Member commented:

I think it's probably simpler here to just do:

// Remote purge and early exit
if (icebergCatalog instanceof RESTCatalog && restCatalogPurge) {
  // delegate the purge to the catalog
  icebergCatalog.dropTable(buildIdentifier(ident), true /* purge */);
  return true;
}

// Original code path

*
* <p>When set to false, the client will not purge data and metadata files on DROP TABLE ... PURGE; the purge is left to the catalog.
*/
public static final String IO_CLIENT_SIDE_PURGE_ENABLED = "io.client-side.purge-enabled";
@RussellSpitzer (Member) commented Oct 25, 2024

Possibly rename to:

/**
 * Controls whether engines using a REST catalog should delegate drop table purge requests to the catalog.
 * Defaults to false, allowing the engine to use its own implementation for purging.
 */
public static final String REST_PURGE = "rest-purge";
public static final boolean REST_PURGE_DEFAULT = false;

@twuebi force-pushed the tp/deletes-enabled-property branch from 68a07ae to 3033ab0 on October 28, 2024
@twuebi force-pushed the tp/deletes-enabled-property branch from 21257ee to 48de51b on November 4, 2024
import org.junit.jupiter.api.TestTemplate;
import org.junit.jupiter.api.extension.ExtendWith;

@ExtendWith(ParameterizedTestExtension.class)
A Member commented:

I think this should probably still be a part of the other test classes, or a new file called TestSparkCatalogREST

On a higher level, looking at this, I think we should just add a new method to SparkCatalog:

  /**
   * Reset the delegate Iceberg catalog to a mock catalog object for testing.
   */
  @VisibleForTesting
  void injectCatalog(Catalog icebergCatalog) {
    this.icebergCatalog = icebergCatalog;
  }

which we can use to inject a mock catalog into the Spark environment, so the test code would look something like this (pseudocode follows):

// Spark config here
RESTCatalog mockCatalog = mock(RESTCatalog.class);
((SparkCatalog) Spark3Util.loadCatalog("name")).injectCatalog(mockCatalog); // pseudocode: resolve the SparkCatalog by name
sql("DROP TABLE %s PURGE", tableName);
verify(mockCatalog).dropTable(any(TableIdentifier.class), eq(true) /* purge */);

@nastra and @amogh-jahagirdar, I know y'all have written a bunch of REST testing; how does that sound to you?

I am wondering whether we should add a parameter to the normal test code that uses Hive and Hadoop, and add in a REST variant that injects a mock like the above ... but that may be overkill.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need any of the injection logic. We should rather wait for #11093 to get in, which would enable much easier testing with Spark + REST catalog

@twuebi (Author) commented Nov 23, 2024

I see that #11093 has landed; could you help me proceed here, @nastra? I'm not sure how to set the properties of the REST catalog in CatalogTestBase to include

REST_CATALOG_PURGE,
Boolean.toString(purge)

for my test. I'm also not sure how to assert that the catalog was called with the purge flag.

@nastra (Contributor) commented Nov 25, 2024

@twuebi you'll need to extend CatalogTestBase and then override the parameters() method to set the catalog to be tested (where you can provide additional properties). An example can be seen in https://github.com/apache/iceberg/pull/11388/files#diff-7f5041107ee18a22c758b79de43bdfcf8582433710adffcfcedd76685a334081R102-R108
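A rough sketch of such an override (the REST enum member and the exact helper names are assumptions, not verified against that PR):

@Parameters(name = "catalogName = {0}, implementation = {1}, config = {2}")
protected static Object[][] parameters() {
  return new Object[][] {
    {
      SparkCatalogConfig.REST.catalogName(),
      SparkCatalogConfig.REST.implementation(),
      ImmutableMap.builder()
          .putAll(SparkCatalogConfig.REST.properties())
          .put(REST_CATALOG_PURGE, "true") // the property under discussion
          .buildOrThrow()
    }
  };
}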

A Member commented:

@nastra How would we be able to check the side-effects of the Catalog then? I'm also not a big fan of us having to do a whole new configuration for the entire test suite just to test one config property

@twuebi (Author) commented:

@RussellSpitzer & @nastra, I'm not sure how to proceed here.


@@ -60,6 +60,10 @@ public static Catalog wrap(
return new CachingCatalog(catalog, caseSensitive, expirationIntervalMillis);
}

public boolean wrapped_is_instance(Class<?> cls) {
A Contributor commented:

this can be removed once #11093 gets in

@twuebi (Author) commented:

This is used for checking the type of the catalog impl that CachingCatalog is wrapping so that we can correctly respect the purge property at https://github.com/apache/iceberg/pull/11317/files#diff-bd61838d4e3a9aef52a670696750b30deac10b90f526435e32beacb0107eea24R376. Unless CachingCatalog is only used in tests, I think we need to keep this method?
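For reference, a plausible body for that helper (an assumption; the diff above truncates it at the signature) would be:

public boolean wrapped_is_instance(Class<?> cls) {
  // 'catalog' is the delegate that CachingCatalog wraps
  return cls.isInstance(catalog);
}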

@RussellSpitzer (Member) commented:

Bringing this up in the Community Sync today to discuss the future of the API here.

@twuebi (Author) commented Dec 11, 2024

Thanks for bringing it up there.

Where can I find the calendar for today's Community Sync? I've been looking through the Iceberg community page, but the only calendar there shows the next sync on the 25th of December.

* Controls whether engines using a REST Catalog should delegate the drop table purge requests to the Catalog.
* Defaults to false, allowing the engine to use its own implementation for purging.
*/
public static final String REST_CATALOG_PURGE = "rest.catalog-purge";
A Contributor commented:

Isn't the option about whether to delegate to the catalog, not specifically about REST? Why is the property specific to the REST catalog?

|| this.icebergCatalog instanceof RESTSessionCatalog
|| (this.icebergCatalog instanceof CachingCatalog
&& ((CachingCatalog) this.icebergCatalog).wrapped_is_instance(RESTCatalog.class)))
&& this.restCatalogPurge) {
A Contributor commented:

I don't understand why this is specific to REST. Shouldn't the flag delegate purge to the catalog implementation, rather than handling it in the SparkCatalog wrapper? That's simpler.

I suspect that the argument against the simpler option is that we want to eventually make this the default behavior, in which case we would not get parallelized purge behavior. But in that case, I'd rather have some way to parallelize catalog operations than have hacky checks that break the layers of abstraction. The Spark catalog could instead pass a function that accepts a FileIO and metadata location and purges it when it creates the wrapped catalog.
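A minimal sketch of that alternative, assuming a hypothetical setPurgeFunction hook (not an existing API):

// The engine supplies a purge callback when it creates the wrapped catalog,
// instead of the wrapper type-checking for REST afterwards.
BiConsumer<FileIO, String> purgeFunc =
    (io, metadataLocation) -> {
      TableMetadata lastMetadata = TableMetadataParser.read(io, metadataLocation);
      CatalogUtil.dropTableData(io, lastMetadata); // Spark could parallelize this step
    };
// icebergCatalog.setPurgeFunction(purgeFunc); // hypothetical hook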


This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Jan 11, 2025

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Jan 19, 2025
@RussellSpitzer (Member) commented:

Note from our sync where we discussed this:

Today we had a little discussion in the Apache Iceberg Catalog Community Sync about DROP and DROP WITH PURGE. Currently the SparkCatalog implementation inside the reference library has a unique approach to DROP WITH PURGE compared to other implementations. The pseudocode is essentially:

1. use Spark to list the files to be removed and delete them
2. send a drop table request to the catalog

As opposed to other systems:

1. send a drop table request to the catalog with the purge flag enabled
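In Catalog API terms the difference looks roughly like this (a sketch; ops, lastMetadata, and ident are stand-ins):

// Spark today: purge on the client, then a metadata-only drop
CatalogUtil.dropTableData(ops.io(), lastMetadata); // file deletes run on the driver
catalog.dropTable(ident, false /* purge */);

// Other engines: delegate the purge to the catalog
catalog.dropTable(ident, true /* purge */);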

This has led us to a situation where it becomes difficult for REST catalogs with custom purge implementations (or those that ignore purge) to work properly with Spark.

Bringing this behavior in line with non-Spark implementations would have possibly dramatic impacts on users of the Iceberg library, but our consensus in the Catalog Sync today was that we should eventually make that the default behavior. To this end I propose the following:

1. We support a flag to allow current Spark users to delegate to the REST catalog (all other catalog behaviors remain the same). A PR is available here (credit to Tobias, who wrote the PR and brought up this topic).
2. We deprecate the client-side delete for Spark.
3. In the next major release (Iceberg 2.0?) we officially change the behavior to only send through the drop-purge flag, with no client-side file removal.
4. For all non-REST catalog implementations we keep the code the same for legacy compatibility.

A user of 1.8 will then have the ability to choose, for Spark DROP ... PURGE against a REST catalog, whether to purge locally or remotely.

A user of 2.0 will only be able to do a remote purge.

Users of non-REST catalogs will see no change in behavior.

@github-actions github-actions bot removed the stale label Jan 22, 2025

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Feb 21, 2025
@nastra nastra added not-stale and removed stale labels Feb 21, 2025