HadoopFileIO to support bulk delete through the Hadoop Filesystem APIs #12055

@steveloughran

Feature Request / Improvement

Hadoop filesystems now support a paged bulk delete API.

For most filesystems the page size is 1, and each page simply maps to a single file delete.

For S3A, the page size is the value of fs.s3a.bulk.delete.page.size;
each page of deletions is executed as a single bulk delete POST in the AWS API.

The API makes no attempt to implement POSIX "safety checks", such as checking
whether the path is a directory or whether the parent directory exists afterwards.

As such it is the most efficient way to delete many objects, and its performance
should match that of S3FileIO.deleteFiles().
For filesystems without bulk delete, each call is mapped to delete(path), so
it is no less efficient than normal delete calls.
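
To make the paging behaviour concrete, here is a minimal in-memory sketch. `PagedDeleteStore` and its methods are hypothetical names invented for illustration; they are not Hadoop or Iceberg classes, and the store only counts requests rather than talking to a real filesystem. It models the points above: a page never exceeds the page size, each page costs one request, and missing keys are silently ignored.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical in-memory model of a paged bulk delete; not a Hadoop class. */
class PagedDeleteStore {
  private final Set<String> objects = new HashSet<>();
  private final int pageSize;
  private int requests = 0;

  PagedDeleteStore(int pageSize, Collection<String> initial) {
    if (pageSize < 1) {
      throw new IllegalArgumentException("page size must be >= 1");
    }
    this.pageSize = pageSize;
    this.objects.addAll(initial);
  }

  int pageSize() { return pageSize; }

  /** Number of delete requests issued so far. */
  int requests() { return requests; }

  /** Number of objects still present in the store. */
  int remaining() { return objects.size(); }

  /** Deletes one page in a single request; keys that do not exist are ignored. */
  void deletePage(List<String> page) {
    if (page.size() > pageSize) {
      throw new IllegalArgumentException("page of " + page.size() + " exceeds " + pageSize);
    }
    requests++;
    objects.removeAll(page);
  }

  /** Splits the keys into pages of at most pageSize entries and deletes page by page. */
  void deleteAll(List<String> keys) {
    for (int i = 0; i < keys.size(); i += pageSize) {
      deletePage(keys.subList(i, Math.min(i + pageSize, keys.size())));
    }
  }
}
```

With a page size of 1 this degenerates to one request per path, which is the fallback case described above; with a larger page size the request count drops to the number of paths divided by the page size, rounded up.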

Support this in Iceberg:

  • Add a new option, iceberg.hadoop.bulk.delete.enabled (default: false)
  • Invoke the bulk delete API by reflection, through the reflection-friendly
    org.apache.hadoop.io.wrappedio.WrappedIO class
  • Switch to the bulk delete mechanism if it is enabled and present
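
A rough sketch of the "enabled and present" check, using only the JDK so that it compiles without Hadoop on the classpath. `BulkDeleteProbe` and its methods are hypothetical names for illustration. The method name `bulkDelete_pageSize` is an assumption about what WrappedIO exposes and should be verified against the Hadoop release in use; the option name and class name are taken from the text above.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper that decides whether the bulk delete path can be used. */
class BulkDeleteProbe {
  // Option name and class name as given in this issue.
  static final String ENABLED_KEY = "iceberg.hadoop.bulk.delete.enabled";
  static final String WRAPPED_IO = "org.apache.hadoop.io.wrappedio.WrappedIO";
  // Assumed method name; verify against the Hadoop version on the classpath.
  static final String PAGE_SIZE_METHOD = "bulkDelete_pageSize";

  /** Reads the option from a property map; absent means false. */
  static boolean enabled(Map<String, String> props) {
    return Boolean.parseBoolean(props.getOrDefault(ENABLED_KEY, "false"));
  }

  /** Finds a public static method by name; empty if the class or method is absent. */
  static Optional<Method> findStatic(String className, String methodName) {
    try {
      Class<?> clazz = Class.forName(className);
      for (Method m : clazz.getMethods()) {
        if (m.getName().equals(methodName) && Modifier.isStatic(m.getModifiers())) {
          return Optional.of(m);
        }
      }
      return Optional.empty();
    } catch (ClassNotFoundException | LinkageError e) {
      // Older Hadoop releases do not ship the class: fall back to plain delete.
      return Optional.empty();
    }
  }

  /** Bulk delete is used only when it is both enabled and present. */
  static boolean useBulkDelete(Map<String, String> props) {
    return enabled(props) && findStatic(WRAPPED_IO, PAGE_SIZE_METHOD).isPresent();
  }
}
```

Looking the method up by name alone keeps the probe free of compile-time references to Hadoop types; a real implementation would likely resolve the exact signatures once and cache the resulting handles.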

If active, bulk delete is performed by:

  1. Building up a page of paths to delete for every target filesystem.
  2. Initiating an asynchronous bulk delete request whenever a page is full.
  3. Queuing page deletes for all incomplete pages once the end of the list
    has been reached.
  4. Awaiting the results and reporting any failures.
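
The four steps above can be sketched as follows. This is an illustration under stated assumptions, not the proposed HadoopFileIO code: `PagedBulkDeleter`, `PageDelete` and `fsKey` are hypothetical names, paths are plain URI strings, a filesystem is identified by scheme and authority, and the actual delete is a callback standing in for the Hadoop bulk delete call. A page whose request throws is reported as failed in full, which is one possible policy among several.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.function.ToIntFunction;

/** Hypothetical sketch of paging deletes per filesystem and running pages asynchronously. */
class PagedBulkDeleter {
  /** Hypothetical callback: deletes one page and returns the paths that failed. */
  interface PageDelete {
    List<String> delete(String fs, List<String> page);
  }

  static List<String> deleteAll(
      Iterable<String> paths,
      ToIntFunction<String> pageSizeOf,
      PageDelete deleter,
      ExecutorService pool) {
    Map<String, List<String>> pages = new HashMap<>();
    List<CompletableFuture<List<String>>> pending = new ArrayList<>();
    for (String path : paths) {
      String fs = fsKey(path);
      // Step 1: add the path to the current page of its filesystem.
      List<String> page = pages.computeIfAbsent(fs, k -> new ArrayList<>());
      page.add(path);
      // Step 2: a full page is submitted and replaced by a fresh one.
      if (page.size() >= Math.max(1, pageSizeOf.applyAsInt(fs))) {
        pending.add(submit(fs, page, deleter, pool));
        pages.put(fs, new ArrayList<>());
      }
    }
    // Step 3: end of input, so queue every incomplete page.
    pages.forEach((fs, page) -> {
      if (!page.isEmpty()) {
        pending.add(submit(fs, page, deleter, pool));
      }
    });
    // Step 4: wait for all requests and collect the failures.
    List<String> failures = new ArrayList<>();
    for (CompletableFuture<List<String>> result : pending) {
      failures.addAll(result.join());
    }
    return failures;
  }

  /** Identifies the target filesystem by scheme and authority, e.g. s3a://bucket. */
  static String fsKey(String path) {
    URI uri = URI.create(path);
    String authority = uri.getAuthority() == null ? "" : uri.getAuthority();
    return uri.getScheme() + "://" + authority;
  }

  private static CompletableFuture<List<String>> submit(
      String fs, List<String> page, PageDelete deleter, ExecutorService pool) {
    // If the request itself throws, treat every path in the page as failed.
    return CompletableFuture.supplyAsync(() -> deleter.delete(fs, page), pool)
        .exceptionally(e -> page);
  }
}
```

Keeping one open page per filesystem means paths for different buckets can arrive interleaved without forcing small pages; the cost is that incomplete pages are only flushed once the input is exhausted.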

Missing files are not reported as failures, because they are not detected.
Failures will come from permissions, network and possibly transient endpoint issues.

Add a parameterized test to verify that bulk delete works.
This needs to be run against Hadoop 3.4.1 to actually exercise the bulk delete code path.

Testing this feature all the way to S3 is complicated.
A test within the hadoop-aws module can validate
the feature through HadoopFileIO and act as regression
testing for the S3A Connector.

Query engine

None

Willingness to contribute

  • I can contribute this improvement/feature independently
  • I would be willing to contribute this improvement/feature with guidance from the Iceberg community
  • I cannot contribute this improvement/feature at this time

Labels: improvement, stale
