Contributor

mzhang commented Oct 29, 2025

What changes were proposed in this pull request?

Currently, the in-memory file index lists the non-direct children of the input path even when the recursiveFileLookup option is set to false. These extra listings can be very costly.

This PR skips the lookups of non-direct child nodes when recursiveFileLookup is false.
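To make the cost argument concrete, here is a minimal toy sketch in Python. It is not Spark's file index code: `list_files`, its `recursive` flag, and `build_demo_tree` are hypothetical names used only for illustration, and the sketch ignores partition discovery entirely. It only counts how many directory listings a non-recursive walk avoids.

```python
import os
import tempfile


def list_files(root, recursive, counter):
    """Return files under root; counter["listings"] counts directory listings."""
    counter["listings"] += 1
    files = []
    with os.scandir(root) as entries:
        for entry in sorted(entries, key=lambda e: e.name):
            if entry.is_file():
                files.append(entry.path)
            elif entry.is_dir() and recursive:
                # Only descend when the caller asked for a recursive lookup.
                files.extend(list_files(entry.path, recursive, counter))
    return files


def build_demo_tree():
    """Create root/a.txt, root/sub/b.txt and root/sub/deep/c.txt (hypothetical layout)."""
    root = tempfile.mkdtemp()
    os.makedirs(os.path.join(root, "sub", "deep"))
    for rel in ("a.txt", os.path.join("sub", "b.txt"), os.path.join("sub", "deep", "c.txt")):
        with open(os.path.join(root, rel), "w") as handle:
            handle.write("x")
    return root
```

On the demo tree, a non-recursive call issues one listing and a recursive call issues one per directory. On an object store, where each listing is a remote request, that difference is the cost this description refers to.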

Why are the changes needed?

These unnecessary lookups can be very costly.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests

Was this patch authored or co-authored using generative AI tooling?

No

Contributor

holdenk commented Oct 31, 2025

Hey @mzhang, thanks for the PR. Can you file a JIRA for this and link the PR to it? See #52039 for an example of how we link JIRAs in PR titles. Also, we currently use recursiveFileLookup to determine whether we infer partitions from the directory schema, rather than whether we scan or not, so it is maybe a confusingly named parameter (see FileBasedDataSourceSuite.scala L880). I'm a little fuzzy on the naming here, but let's ask @xiaonanyang-db for context too.
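For readers unfamiliar with the partition inference mentioned above, here is a small toy sketch in Python. It is not Spark code: `infer_partitions` is a hypothetical helper, and it assumes Hive-style `key=value` directory names with `/` separators. It shows why nested directories may still need to be visited when partitions are inferred from the layout.

```python
def infer_partitions(relative_path):
    """Parse key=value directory segments into a dict of partition values."""
    partitions = {}
    # The last segment is the file name, so only the directories are inspected.
    for segment in relative_path.split("/")[:-1]:
        key, sep, value = segment.partition("=")
        if sep and key:
            partitions[key] = value
    return partitions
```

For example, `infer_partitions("year=2025/month=10/part-0000.parquet")` yields `year` and `month` as partition columns, while a file directly under the input path yields none.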

holdenk marked this pull request as draft October 31, 2025 23:09
Contributor Author

mzhang commented Nov 2, 2025

Hey! Thanks for taking a look! My mistake for leaving this marked as reviewable. I wasn't sure whether the CI would run otherwise, but this PR wasn't ready for review yet. I will file a ticket per usual when I'm ready.
