
Remove Dependency on Hadoop's Filesystem Class from Remove Orphan Files #11541

Open · RussellSpitzer opened this issue Nov 13, 2024 · 5 comments
Labels: good first issue, improvement

Comments

@RussellSpitzer (Member)

Feature Request / Improvement

Currently, RemoveOrphanFiles is hardwired to use Hadoop FS classes when listing directories and attempting to locate orphan files. This is a problem for anyone working with a REST catalog, since only the catalog IO (Iceberg's FileIO) will have credentials; the Hadoop conf will not.

Luckily, I think we currently have everything we need in SupportsPrefixOperations to allow a properly equipped FileIO instance to handle all of RemoveOrphanFiles without using any Hadoop classes.
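
For reference, SupportsPrefixOperations (in org.apache.iceberg.io) adds just two operations on top of FileIO, roughly:

    public interface SupportsPrefixOperations extends FileIO {
      // flat listing of every file under a storage prefix; no recursion needed
      Iterable<FileInfo> listPrefix(String prefix);

      // bulk delete of every file under a storage prefix
      void deletePrefix(String prefix);
    }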

I propose we branch wherever a Hadoop class is about to be used and use FileIO instead if it supports the proper operations. If it does not, we should fall back to the Hadoop classes.

Query engine

Spark

Willingness to contribute

  • I can contribute this improvement/feature independently
  • I would be willing to contribute this improvement/feature with guidance from the Iceberg community
  • I cannot contribute this improvement/feature at this time
@RussellSpitzer added the improvement and good first issue labels Nov 13, 2024

@amogh-jahagirdar (Contributor)

#7914 should address this; it'd be great if someone could take it forward.

@danielcweeks (Contributor)

@amogh-jahagirdar I don't think #7914 is a good approach to addressing this, as it's not scalable: it trades distributed listing for a single iteration.

To support prefix listing, we need to break up the key space in a way that lets us parallelize the work. I'll comment on the PR; there's a better approach we're looking into.
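
To illustrate the concern, here is a minimal sketch of what distributing the listing could look like, assuming the key space has already been split into sub-prefixes (deriving those sub-prefixes is exactly the open design question, and the helper name and wiring here are hypothetical):

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.iceberg.io.FileInfo;
    import org.apache.iceberg.io.SupportsPrefixOperations;
    import org.apache.spark.api.java.JavaSparkContext;

    // Hypothetical helper: list each sub-prefix in its own Spark task rather
    // than in a single driver-side iteration. FileIO is Serializable, so the
    // instance can generally be shipped to executors.
    static List<String> listPrefixesDistributed(
        JavaSparkContext sparkContext,
        SupportsPrefixOperations io,
        List<String> subPrefixes) {
      return sparkContext
          .parallelize(subPrefixes, subPrefixes.size())
          .flatMap(prefix -> {
                List<String> files = new ArrayList<>();
                for (FileInfo file : io.listPrefix(prefix)) {
                  files.add(file.location());
                }
                return files.iterator();
              })
          .collect();
    }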

@rocco408 (Contributor)

I'm happy to take a look, independently and/or with community input. I'll circle back in the coming days with any progress.

@rocco408 (Contributor)

I have a few questions to help get started.

  1. Is TestRemoveOrphanFilesProcedure a good place to add a new test for this? I found this issue, maybe this is a good starting point for codifying the current problem?

  2. Regarding "branch whenever a Hadoop class is about to be used and use FileIO": is the idea to branch in DeleteOrphanFilesSparkAction or in some other, more general location(s)?

  3. Regarding "we currently have everything we need in SupportsPrefixOperations to allow a properly equipped FileIO instance to handle all of RemoveOrphanFiles without using any Hadoop classes": I see SupportsPrefixOperations has 5 FileIO implementations. Is the idea to lean on ResolvingFileIO for selecting the appropriate FileIO implementation?

@RussellSpitzer (Member, Author) commented Nov 17, 2024

  1. "Is TestRemoveOrphanFilesProcedure a good place to add a new test for this?"

I think you could put new tests in there, but this may be a bit difficult to actually test, since the listing is a side effect within the action. You'll probably need to instrument the branches (FileIO vs. the current Hadoop implementation) and then check which one is called.
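
A minimal sketch of that instrumentation idea, assuming Mockito is available in the test suite (the helper name and the way the spy gets wired into the action are hypothetical):

    import static org.mockito.ArgumentMatchers.anyString;
    import static org.mockito.Mockito.atLeastOnce;
    import static org.mockito.Mockito.spy;
    import static org.mockito.Mockito.verify;

    import org.apache.iceberg.Table;
    import org.apache.iceberg.io.SupportsPrefixOperations;

    // Hypothetical check: wrap the table's FileIO in a spy, run the action
    // against a table backed by the spy, then verify the prefix-listing
    // branch (rather than the Hadoop path) was taken.
    void assertPrefixListingBranchUsed(Table table) {
      SupportsPrefixOperations spyIo = spy((SupportsPrefixOperations) table.io());
      // ... run DeleteOrphanFilesSparkAction on a table wired to spyIo ...
      verify(spyIo, atLeastOnce()).listPrefix(anyString());
    }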

  2. "Is the idea to branch in DeleteOrphanFilesSparkAction or in some other, more general location(s)?"

Just DeleteOrphanFiles; I don't believe there is another location that uses these classes. The base actions are Hadoop agnostic.

  3. "Is the idea to lean on ResolvingFileIO for selecting the appropriate FileIO implementation?"

The idea is that in places like https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/DeleteOrphanFilesSparkAction.java#L311 we would do a check like:

    if (table.io() instanceof SupportsPrefixOperations) {
      // listDirRecursivelyFileIO(location, predicate, .....)
    } else {
      // list at most MAX_DRIVER_LISTING_DEPTH levels and only dirs that have
      // fewer than MAX_DRIVER_LISTING_DIRECT_SUB_DIRS direct sub dirs on the driver
      listDirRecursively( // probably rename this function to listDirRecursivelyHadoop
          location,
          predicate,
          hadoopConf.value(),
          MAX_DRIVER_LISTING_DEPTH,
          MAX_DRIVER_LISTING_DIRECT_SUB_DIRS,
          subDirs,
          pathFilter,
          matchingFiles);
    }

I haven't checked the rest of the code, but basically anywhere you see org.apache.hadoop classes, we should check whether the table's FileIO implements SupportsPrefixOperations and, if it does, branch and do the equivalent work with prefix operations.
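
As a concrete shape for that branch, here is a minimal sketch of the hypothetical listDirRecursivelyFileIO helper named above (the predicate and matchingFiles parameters are assumptions modeled on the existing Hadoop path):

    import java.util.List;
    import java.util.function.Predicate;
    import org.apache.iceberg.io.FileInfo;
    import org.apache.iceberg.io.SupportsPrefixOperations;

    // Hypothetical FileIO-based replacement for listDirRecursively. Prefix
    // listing is flat (one listPrefix call covers everything under the
    // location), so the driver-side depth and sub-dir limits of the Hadoop
    // path are unnecessary here.
    static void listDirRecursivelyFileIO(
        SupportsPrefixOperations io,
        String location,
        Predicate<FileInfo> predicate,
        List<String> matchingFiles) {
      for (FileInfo file : io.listPrefix(location)) {
        if (predicate.test(file)) {
          matchingFiles.add(file.location());
        }
      }
    }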
