Core: HadoopFileIO to support bulk delete through the Hadoop Filesystem APIs #10233
base: main
Conversation
for (String path : pathnames) {
  Path p = new Path(path);
  final URI uri = p.toUri();
  String fsURI = uri.getScheme() + "://" + uri.getHost() + "/";
Using the URI class has some significant incompatibilities with various cloud providers, like GCP, which allow non-standard characters (e.g. an underscore in the bucket name); these result in the host being empty/null and cause problems. We use classes like GCSLocation or S3URI to work around these issues. We should try to avoid using URI.
So should we just use the root path of every filesystem as the key? It uses the URI as the hash code and comparator internally.
I'm using the root path of every filesystem instance; this relies on the Path type's internal URI handling. There are some S3 bucket names that aren't valid hostnames; for those the S3A policy is "not supported". Dots in bucket names are a key example, and even Amazon advises against those.
@danielcweeks quick resolution for this: FileSystem.get() uses that root path URI to look up filesystems from its cache.
Non-standard hostnames which can't be converted to a URI probably do not work through HadoopFileIO today.
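To make the cache-key discussion concrete, here is a minimal sketch (not the PR's code) of grouping locations by the FileSystem instance that owns them, letting Path.getFileSystem() and the FileSystem.get() cache do the URI-based lookup instead of hand-building scheme://host strings. All class and method names in the sketch are illustrative only.

// A minimal sketch, not the PR's implementation: group locations by the
// FileSystem that owns them. Path.getFileSystem() goes through the
// FileSystem.get() cache, which already keys on the URI's scheme/authority,
// so equivalent filesystems share one instance and one bucket of paths.
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class PathGrouping {
  static Map<FileSystem, List<Path>> groupByFileSystem(
      Collection<String> locations, Configuration conf) throws IOException {
    Map<FileSystem, List<Path>> grouped = new HashMap<>();
    for (String location : locations) {
      Path path = new Path(location);
      FileSystem fs = path.getFileSystem(conf);
      grouped.computeIfAbsent(fs, key -> new ArrayList<>()).add(path);
    }
    return grouped;
  }
}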
This PR is in sync with apache/hadoop#6686; the dynamic binding classes in here are from that PR (which also copies in the parquet DynMethods classes for consistency everywhere).
The latest update runs the tests against the local fs, parameterized on using/not using bulk delete; the library settings have been modified to use hadoop 3.4.1-SNAPSHOT for this. It works, and execution time on large file deletion (added to the listPrefix test) is comparable, but there are now two codepaths to test and maintain. I plan to coalesce them so test coverage is better, even on older builds.
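As a hedged sketch of what that on/off parameterization could look like: the property name "iceberg.hadoop.bulk.delete.enabled" comes from the commit notes further down, while the test class, method, and outlined body are purely illustrative and not this PR's actual test code.

import org.apache.hadoop.conf.Configuration;
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.ValueSource;

class TestBulkDeleteToggle {

  @ParameterizedTest
  @ValueSource(booleans = {true, false})
  void deleteManyFiles(boolean bulkDeleteEnabled) {
    Configuration conf = new Configuration();
    // Toggle the bulk delete code path; property name taken from the PR's commit notes.
    conf.setBoolean("iceberg.hadoop.bulk.delete.enabled", bulkDeleteEnabled);
    // Outline only: build a HadoopFileIO with conf, create many local files,
    // call deleteFiles() on all of them, then assert none still exist.
  }
}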
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that's incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.
For anyone watching this, there's a full integration test suite in the hadoop test code: apache/hadoop#7285. All is good, though as it's the first Java 17 code and depends on an iceberg jar with this patch in, it's actually dependent on this PR going in first. It does show that yes, bulk deletes through the S3A code do work, without me having to add a significant piece of test code to iceberg to wire up Hadoop FileIO to the minio docker container that S3FileIO uses. Anyway, with those tests happy, I just have a few more tests to write (mixing in some local files too) and then I'll be confident this is ready for review.
Commit messages:

Not compiling as branch currently is hadoop 2.7; fix first

Test classpath, bulk deletion invocation, fallback and testing
* Remove off the test classpath any yarn dependencies which hive pulls in. These taint the classpath with older libraries.
* Harden the fallback such that the list of undeleted paths and their failure causes are passed back up; those are the elements which are then passed down to the classic iteration algorithm.
* Tests for invocation validate fallback on non-empty directory deletion and verify that the behavior is the same on classic and bulk API paths.
+ Tests for: empty directory, empty list

Bulk delete: move directly to API, no option to disable.

Not compiling as branch currently is hadoop 2.7; fix first

Extra mocking resilience: if the bulk delete invocation didn't link (e.g. from mocking), then bulk delete is disabled for the life of this HadoopFileIO instance. Make resilient to mocking.

AWS: Use hadoop bulk delete API where available
Reflection-based use of the Hadoop 3.4.1+ BulkDelete API so that S3 object deletions can be done in pages of objects, rather than one at a time. Configuration option "iceberg.hadoop.bulk.delete.enabled" to switch to bulk deletes. There's a unit test which will turn this on if the wrapped APIs are loaded and probe the HadoopFileIO instance for it using the APIs.
* Parameterized tests for bulk delete on/off
* Cache the bulk delete page size; this brings performance of bulk delete with page size == 1 to that of single delete
* Use deleteFiles() in tests which create many files; helps highlight performance/scale issues against the local fs.
+ Updates aws.md to cover this and other relevant s3a settings, primarily for maximum parquet IO
The changes made earlier to the hadoop exclusions should ensure that no artifacts of earlier releases get onto the test classpath of other modules.
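For readers following the reflection approach described in those commits, here is a hedged sketch of how the binding might be probed. The WrappedIO class name and method signatures are my assumptions about the Hadoop 3.4.1+ wrapped-IO helper, not something this PR defines; a failed lookup simply signals "use the classic per-file delete path".

import java.lang.reflect.Method;
import java.util.Collection;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

final class ReflectiveBulkDelete {
  // Assumed signatures on org.apache.hadoop.io.wrappedio.WrappedIO:
  //   int bulkDelete_pageSize(FileSystem, Path)
  //   List<Map.Entry<Path, String>> bulkDelete_delete(FileSystem, Path, Collection<Path>)
  private final Method pageSizeMethod;
  private final Method deleteMethod;

  private ReflectiveBulkDelete(Method pageSizeMethod, Method deleteMethod) {
    this.pageSizeMethod = pageSizeMethod;
    this.deleteMethod = deleteMethod;
  }

  /** Returns null when the API is not loadable, signalling "fall back to single deletes". */
  static ReflectiveBulkDelete loadOrNull() {
    try {
      Class<?> wrappedIo = Class.forName("org.apache.hadoop.io.wrappedio.WrappedIO");
      Method pageSize = wrappedIo.getMethod("bulkDelete_pageSize", FileSystem.class, Path.class);
      Method delete =
          wrappedIo.getMethod("bulkDelete_delete", FileSystem.class, Path.class, Collection.class);
      return new ReflectiveBulkDelete(pageSize, delete);
    } catch (ReflectiveOperationException | LinkageError e) {
      return null; // older Hadoop on the classpath: caller uses the classic code path
    }
  }
}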
Not sure what's up with Flink; rebasing and testing locally to make sure that it is unrelated to my PR.
The failure is because the Spark 3.4 runtimes use older Hadoop releases.
@danielcweeks so here we are.

That forced update isn't great, but if iceberg compiles with hadoop 3.4.1 then it's already possibly broken somewhere else when run there. We put a lot of effort into backward compatibility of APIs, but it's really hard to do forward compatibility (openFile(), createFile() and other builders are the best we have here). A real risk is some overloaded method compiling down differently, different exception signatures, etc. That's independent of this PR, which simply finds the problem faster.

Ignoring that fundamental issue, I could tweak this one to look for the presence of the new API classes via a loadResource call, and not attempt bulk delete if they aren't found. Provided all uses of the bulk delete are isolated to a guarded method, this wouldn't exacerbate the existing issue. How to test whether that worked? Remove the forced dependency updates from the spark versions. Thoughts? It's easily done, and isn't the full reflection game, just a probe and a guard.
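A minimal sketch of that probe-and-guard idea, assuming org.apache.hadoop.fs.BulkDelete (added in Hadoop 3.4.1) is a reasonable class to probe for; if that guess is wrong the probe simply reports "unavailable" and the guarded code never attempts a bulk delete.

final class BulkDeleteCapability {
  // Class to probe for; assumed to exist only on Hadoop 3.4.1+ classpaths.
  private static final String BULK_DELETE_CLASS = "org.apache.hadoop.fs.BulkDelete";

  private static final boolean AVAILABLE = probe();

  private static boolean probe() {
    try {
      // Load without initializing; absence or a link failure means "no bulk delete".
      Class.forName(BULK_DELETE_CLASS, false, BulkDeleteCapability.class.getClassLoader());
      return true;
    } catch (ClassNotFoundException | LinkageError e) {
      return false;
    }
  }

  static boolean available() {
    return AVAILABLE;
  }
}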
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.
Aah, this got closed while I was off on a European-length vacation. PITA.
@steveloughran PRs get auto-closed after a certain period, but you can always revive it.
@nastra thanks. I think I'll split the "move everything to hadoop 3.4.1 libs" change from the code changes, so that the build changes which highlight the Spark 3.4 issues go in separately.
Code for #12055
Using bulk delete eliminates the per-file sequence of probing for the status of the destination object, issuing a single delete request, and then probing to see if we need to recreate an empty directory marker above it; instead a few hundred objects are deleted in a single request at a time.
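As a rough illustration of the batching described above (not the PR's code): the files to delete can be split into pages no larger than the filesystem's bulk delete page size, with one request issued per page. The partition helper below is purely illustrative.

import java.util.ArrayList;
import java.util.List;

final class Pages {
  // Split the items to delete into pages of at most pageSize entries,
  // so one delete request covers a page instead of a single file.
  static <T> List<List<T>> partition(List<T> items, int pageSize) {
    List<List<T>> pages = new ArrayList<>();
    for (int start = 0; start < items.size(); start += pageSize) {
      pages.add(items.subList(start, Math.min(start + pageSize, items.size())));
    }
    return pages;
  }
}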
Tested: S3 London.
In apache/hadoop#7316 there's an (uncommitted) PR for hadoop which takes an iceberg library and verifies the correct operation of the iceberg delete operations against a live AWS S3 store with and without bulk delete enabled.
https://github.com/apache/hadoop/blob/d37310cf355f3eb137f925bde9a2a299823b8230/hadoop-tools/hadoop-aws/src/test/java17/org/apache/hadoop/fs/contract/s3a/ITestIcebergBulkDelete.java
This is something we can merge into hadoop as a local regression test once Iceberg has a release with this.