
Remove Dependency on Hadoop's Filesystem Class from Remove Orphan Files #11541

Open · RussellSpitzer opened this issue Nov 13, 2024 · 5 comments
Labels: good first issue, improvement

Comments

@RussellSpitzer (Member)

Feature Request / Improvement

Currently, RemoveOrphanFiles is hardwired to use Hadoop FS classes when listing directories and attempting to locate orphan files. This is a problem for anyone working with a REST catalog, since only the catalog IO (Iceberg's FileIO) will have credentials; the Hadoop conf will not.

Luckily, I think we currently have everything we need in SupportsPrefixOperations to allow a properly equipped FileIO instance to handle all of RemoveOrphanFiles without using any Hadoop classes.
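
For reference, SupportsPrefixOperations (in org.apache.iceberg.io) adds just two operations on top of FileIO, roughly:

    public interface SupportsPrefixOperations extends FileIO {
      // flat listing of every file under a storage prefix; no recursion needed
      Iterable<FileInfo> listPrefix(String prefix);

      // bulk delete of every file under a storage prefix
      void deletePrefix(String prefix);
    }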

I propose we branch wherever a Hadoop class is about to be used and use FileIO instead if it supports the proper operations. If it does not, we should fall back to the Hadoop classes.

Query engine

Spark

Willingness to contribute

  • I can contribute this improvement/feature independently
  • I would be willing to contribute this improvement/feature with guidance from the Iceberg community
  • I cannot contribute this improvement/feature at this time
@RussellSpitzer added the improvement and good first issue labels Nov 13, 2024

@amogh-jahagirdar (Contributor)

#7914 should address this; it'd be great if someone could take it forward.

@danielcweeks (Contributor)

@amogh-jahagirdar I don't think #7914 is a good approach to addressing this, as it's not scalable: it trades distributed listing for a single iteration.

To support prefix listing, we need to break up the key space in a way that lets us parallelize the work. I'll comment on the PR; there's a better approach we're looking into.
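
To illustrate the concern, here is a minimal sketch of what distributing the listing could look like, assuming the key space has already been split into sub-prefixes (deriving those sub-prefixes is exactly the open design question, and the helper name and wiring here are hypothetical):

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.iceberg.io.FileInfo;
    import org.apache.iceberg.io.SupportsPrefixOperations;
    import org.apache.spark.api.java.JavaSparkContext;

    // Hypothetical helper: list each sub-prefix in its own Spark task rather
    // than in a single driver-side iteration. FileIO is Serializable, so the
    // instance can generally be shipped to executors.
    static List<String> listPrefixesDistributed(
        JavaSparkContext sparkContext,
        SupportsPrefixOperations io,
        List<String> subPrefixes) {
      return sparkContext
          .parallelize(subPrefixes, subPrefixes.size())
          .flatMap(prefix -> {
                List<String> files = new ArrayList<>();
                for (FileInfo file : io.listPrefix(prefix)) {
                  files.add(file.location());
                }
                return files.iterator();
              })
          .collect();
    }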

@rocco408 (Contributor)

I'm happy to take a look, independently and/or with community input. I'll circle back in the coming days with any progress.

@rocco408 (Contributor)

I have a few questions to help get started.

  1. Is TestRemoveOrphanFilesProcedure a good place to add a new test for this? I found this issue, maybe this is a good starting point for codifying the current problem?

  2. Regarding "branch whenever a Hadoop class is about to be used and use FileIO": is the idea to branch in DeleteOrphanFilesSparkAction or in some other, more general location(s)?

  3. Regarding "we currently have everything we need in SupportsPrefixOperations to allow a properly equipped FileIO instance to handle all of RemoveOrphanFiles without using any Hadoop classes": I see SupportsPrefixOperations has 5 FileIO implementations. Is the idea to lean on ResolvingFileIO for selecting the appropriate FileIO implementation?

@RussellSpitzer (Member, Author) commented Nov 17, 2024

  1. "Is TestRemoveOrphanFilesProcedure a good place to add a new test for this?"

I think you could put new tests in there, but this may be a bit difficult to actually test, since the listing is a side effect within the action. You'll probably need to instrument the branches (FileIO vs. the current Hadoop implementation) and then check which one is called.
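
A minimal sketch of that instrumentation idea, assuming Mockito is available in the test suite (the helper name and the way the spy gets wired into the action are hypothetical):

    import static org.mockito.ArgumentMatchers.anyString;
    import static org.mockito.Mockito.atLeastOnce;
    import static org.mockito.Mockito.spy;
    import static org.mockito.Mockito.verify;

    import org.apache.iceberg.Table;
    import org.apache.iceberg.io.SupportsPrefixOperations;

    // Hypothetical check: wrap the table's FileIO in a spy, run the action
    // against a table backed by the spy, then verify the prefix-listing
    // branch (rather than the Hadoop path) was taken.
    void assertPrefixListingBranchUsed(Table table) {
      SupportsPrefixOperations spyIo = spy((SupportsPrefixOperations) table.io());
      // ... run DeleteOrphanFilesSparkAction on a table wired to spyIo ...
      verify(spyIo, atLeastOnce()).listPrefix(anyString());
    }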

  2. "Is the idea to branch in DeleteOrphanFilesSparkAction or in some other, more general location(s)?"

Just DeleteOrphanFiles; I don't believe there is another location that uses these classes. The base actions are Hadoop agnostic.

  3. "Is the idea to lean on ResolvingFileIO for selecting the appropriate FileIO implementation?"

The idea is that in places like https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/DeleteOrphanFilesSparkAction.java#L311 we would do a check like:

    if (table.io() instanceof SupportsPrefixOperations) {
      // listDirRecursivelyFileIO(location, predicate, .....)
    } else {
      // list at most MAX_DRIVER_LISTING_DEPTH levels and only dirs that have
      // fewer than MAX_DRIVER_LISTING_DIRECT_SUB_DIRS direct sub dirs on the driver
      listDirRecursively( // probably rename this function to listDirRecursivelyHadoop
          location,
          predicate,
          hadoopConf.value(),
          MAX_DRIVER_LISTING_DEPTH,
          MAX_DRIVER_LISTING_DIRECT_SUB_DIRS,
          subDirs,
          pathFilter,
          matchingFiles);
    }

I haven't checked the rest of the code, but basically anywhere you see org.apache.hadoop classes, we should check whether the table's FileIO implements SupportsPrefixOperations and, if it does, branch and do the equivalent work with prefix operations.
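
As a concrete shape for that branch, here is a minimal sketch of the hypothetical listDirRecursivelyFileIO helper named above (the predicate and matchingFiles parameters are assumptions modeled on the existing Hadoop path):

    import java.util.List;
    import java.util.function.Predicate;
    import org.apache.iceberg.io.FileInfo;
    import org.apache.iceberg.io.SupportsPrefixOperations;

    // Hypothetical FileIO-based replacement for listDirRecursively. Prefix
    // listing is flat (one listPrefix call covers everything under the
    // location), so the driver-side depth and sub-dir limits of the Hadoop
    // path are unnecessary here.
    static void listDirRecursivelyFileIO(
        SupportsPrefixOperations io,
        String location,
        Predicate<FileInfo> predicate,
        List<String> matchingFiles) {
      for (FileInfo file : io.listPrefix(location)) {
        if (predicate.test(file)) {
          matchingFiles.add(file.location());
        }
      }
    }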
