@friendlymatthew friendlymatthew commented Oct 27, 2025

Which issue does this PR close?

Rationale for this change

This PR delegates the responsibility of projection pushdown evaluation from the DataSource trait layer (like FileScanConfig) down to the file source implementation itself (the FileSource trait level).

Previously, FileScanConfig::try_swapping_with_projection contained all the logic to determine whether projections can be pushed down: checking for partition columns, checking for aliases, and computing new projection indices. This meant the DataSource was responsible for implementation details that should belong to the underlying file format.

Now, FileSource::try_pushdown_projections handles this evaluation. The default impl performs the naive check mentioned above, and individual file sources like ParquetSource can override this method to provide format-specific pushdown behavior.
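
For illustration, the naive check could be sketched roughly as below. This is a simplified model, not the actual DataFusion code: `Expr`, `ProjectionExpr`, and `naive_pushdown_indices` are hypothetical stand-ins, and the sketch assumes that file columns precede partition columns in the table schema, so an index at or beyond the file column count denotes a partition column.

```rust
/// Hypothetical, simplified projection expression (not DataFusion's real type).
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    /// A bare column reference: its name and its index in the table schema.
    Column { name: String, index: usize },
    /// Anything more complex (operators, function calls, ...).
    Other(String),
}

/// An expression together with its output name.
#[derive(Debug, Clone, PartialEq)]
pub struct ProjectionExpr {
    pub expr: Expr,
    pub alias: String,
}

/// Naive check: push down only if every expression is a bare, un-renamed
/// file column. Assumption: indices >= `num_file_columns` are partition columns.
/// Returns the column indices to read, or `None` if pushdown is not possible.
pub fn naive_pushdown_indices(
    projection: &[ProjectionExpr],
    num_file_columns: usize,
) -> Option<Vec<usize>> {
    projection
        .iter()
        .map(|p| match &p.expr {
            Expr::Column { name, index }
                if *name == p.alias && *index < num_file_columns =>
            {
                Some(*index)
            }
            // Aliased columns, partition columns, and complex expressions
            // all block the naive pushdown.
            _ => None,
        })
        .collect()
}
```

Under these assumptions, a single aliased column or partition column makes the whole projection ineligible, which is the all-or-nothing behavior a format-specific override can improve on.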

DataSource::try_swapping_with_projection now returns a tuple containing both a new data source and optional remaining projections, allowing for partial pushdown scenarios where some projection expressions cannot be evaluated by the file source and must remain in a ProjectionExec node.

@github-actions bot added the datasource label (Changes to the datasource crate) on Oct 27, 2025

@alamb alamb left a comment

So exciting to see this moving -- thank you @friendlymatthew

}
}

pub type ProjectionPushdownResult =
Contributor

Can we please document this type (like what the two fields mean)?

I actually think it would be even nicer if this was a real enum so we could document it inline

Perhaps like

/// Result of evaluating projection pushdown ....
enum ProjectionPushdownResult {
  ...
}

Contributor Author

Makes sense to me. Can I get your thoughts on naming here? I made another ProjectionPushdownResult that is an enum. That type happens to live at the FileSource level

Contributor

Maybe if we go with https://github.com/apache/datafusion/pull/18309/files#r2467051055 we can have just 1 enum / structure?

Contributor Author

Hm, I'm not sure if that is possible. One stores an Arc<dyn DataSource> while the other stores an Arc<dyn FileSource>

Contributor

You can make them generic

Contributor Author

I guess what I'm trying to say is that it would be nice to give these types a proper name/alias. PartialPushdownResult<FileSource> and PartialPushdownResult<DataSource> don't sound the best to me IMO

Contributor

That's what we do for filter pushdown, seems okay
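
To make the generic suggestion in this thread concrete, here is a rough sketch of one result type shared by both layers, with type aliases supplying the friendlier names asked for above. Everything in it is a hypothetical stand-in rather than the PR's actual code: the `DataSource` and `FileSource` traits are reduced to a single `name` method, `ProjectionExprs` is modeled as a list of strings, and `map_source`, `CsvSource`, `FileScan`, and the alias names are invented for illustration.

```rust
use std::sync::Arc;

/// Minimal stand-ins for the two trait layers (not the real DataFusion traits).
pub trait DataSource {
    fn name(&self) -> &str;
}
pub trait FileSource {
    fn name(&self) -> &str;
}

/// Stand-in for the real projection expression list.
pub type ProjectionExprs = Vec<String>;

/// One result type, generic over the kind of source it carries.
pub struct ProjectionPushdown<T: ?Sized> {
    pub new_source: Arc<T>,
    /// Expressions the source could not absorb, if any.
    pub remaining_projections: Option<ProjectionExprs>,
}

/// Aliases give each layer a readable name.
pub type FileSourceProjectionPushdown = ProjectionPushdown<dyn FileSource>;
pub type DataSourceProjectionPushdown = ProjectionPushdown<dyn DataSource>;

impl<T: ?Sized> ProjectionPushdown<T> {
    /// Re-wrap the source (e.g. file source -> data source), keeping the remainder.
    pub fn map_source<U: ?Sized>(
        self,
        f: impl FnOnce(Arc<T>) -> Arc<U>,
    ) -> ProjectionPushdown<U> {
        ProjectionPushdown {
            new_source: f(self.new_source),
            remaining_projections: self.remaining_projections,
        }
    }
}

/// Example implementors, purely illustrative.
pub struct CsvSource;
impl FileSource for CsvSource {
    fn name(&self) -> &str {
        "csv"
    }
}

pub struct FileScan {
    pub file_source: Arc<dyn FileSource>,
}
impl DataSource for FileScan {
    fn name(&self) -> &str {
        "file_scan"
    }
}
```

With this shape, the data source layer can call the file source, then use `map_source` to wrap the returned file source in a new scan while passing the remaining projections through unchanged.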

}
}

pub enum ProjectionPushdownResult {
Contributor

docstrings please

))
}

fn try_pushdown_projections(
Contributor

A docstring would be great

Partial {
new_file_source: Option<Arc<dyn FileSource>>,
remaining_projections: Option<ProjectionExprs>,
new_projection_indices: Option<Vec<usize>>,
Contributor

Hmm something seems off here to me. In my mind this should be more like:

pub struct ProjectionPushdown {
    new_file_source: Arc<dyn FileSource>,
    remaining_projections: Option<ProjectionExprs>,
}

pub type ProjectionPushdownResult = Option<ProjectionPushdown>;        

I don't see how it could make sense to have a remaining projection if the source wasn't updated.

File sources like Parquet will absorb the entire projection.
File sources like CSV will push down indexes and create a remainder expression that handles anything more complex (aliases, operators, etc.). We can make helpers that those file sources call to handle splitting up the projection.
If no projection can be handled (the default) this returns None.
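
The three cases above could be sketched along these lines. This is only a rough model under stated assumptions, not the PR's implementation: `Expr`, `split_projection`, `OpaqueSource`, `CsvLikeSource`, and `ParquetLikeSource` are hypothetical names, expressions are reduced to "bare column" versus "complex expression with a list of input columns", and incoming column indices are assumed to refer to the file schema.

```rust
use std::sync::Arc;

/// Hypothetical, simplified expression model.
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    /// A bare column, by index.
    Column(usize),
    /// Anything more complex; `inputs` lists the columns it reads.
    Complex { description: String, inputs: Vec<usize> },
}

pub struct ProjectionPushdown {
    pub new_file_source: Arc<dyn FileSource>,
    pub remaining_projections: Option<Vec<Expr>>,
}

pub trait FileSource {
    /// Default: no projection can be handled.
    fn try_pushdown_projections(&self, _projection: &[Expr]) -> Option<ProjectionPushdown> {
        None
    }
}

/// Helper: split a projection into column indices to read plus an optional
/// remainder that must be evaluated on top of the narrowed output.
pub fn split_projection(projection: &[Expr]) -> (Vec<usize>, Option<Vec<Expr>>) {
    // Fast path: only bare columns, so the indices alone express everything.
    let plain: Option<Vec<usize>> = projection
        .iter()
        .map(|e| match e {
            Expr::Column(i) => Some(*i),
            _ => None,
        })
        .collect();
    if let Some(indices) = plain {
        return (indices, None);
    }
    // Otherwise read each referenced column once, in ascending order.
    let mut indices: Vec<usize> = projection
        .iter()
        .flat_map(|e| match e {
            Expr::Column(i) => vec![*i],
            Expr::Complex { inputs, .. } => inputs.clone(),
        })
        .collect();
    indices.sort_unstable();
    indices.dedup();
    // Rewrite column references to positions in the narrowed output.
    let position = |i: &usize| indices.binary_search(i).expect("index collected above");
    let remainder: Vec<Expr> = projection
        .iter()
        .map(|e| match e {
            Expr::Column(i) => Expr::Column(position(i)),
            Expr::Complex { description, inputs } => Expr::Complex {
                description: description.clone(),
                inputs: inputs.iter().map(|i| position(i)).collect(),
            },
        })
        .collect();
    (indices, Some(remainder))
}

/// Uses the default: handles nothing.
pub struct OpaqueSource;
impl FileSource for OpaqueSource {}

/// CSV-like: pushes down indices, leaves a remainder for anything complex.
pub struct CsvLikeSource {
    pub column_indices: Vec<usize>,
}
impl FileSource for CsvLikeSource {
    fn try_pushdown_projections(&self, projection: &[Expr]) -> Option<ProjectionPushdown> {
        let (column_indices, remaining_projections) = split_projection(projection);
        Some(ProjectionPushdown {
            new_file_source: Arc::new(CsvLikeSource { column_indices }),
            remaining_projections,
        })
    }
}

/// Parquet-like: absorbs the entire projection, so nothing remains.
pub struct ParquetLikeSource {
    pub projection: Vec<Expr>,
}
impl FileSource for ParquetLikeSource {
    fn try_pushdown_projections(&self, projection: &[Expr]) -> Option<ProjectionPushdown> {
        Some(ProjectionPushdown {
            new_file_source: Arc::new(ParquetLikeSource { projection: projection.to_vec() }),
            remaining_projections: None,
        })
    }
}
```

Because `new_file_source` is not optional, a remainder can only exist alongside an updated source, which is the invariant the proposed struct is meant to enforce.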

Contributor

How about we just return Option<ProjectionPushdown> directly (not wrap it in a typedef)?

@friendlymatthew friendlymatthew force-pushed the friendlymatthew/evaluate-projections-at-file-source-level branch from 158fcbf to b8b0cff Compare October 28, 2025 15:04