-
Couldn't load subscription status.
- Fork 1.7k
Evaluate projections at the file source level #18309
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Evaluate projections at the file source level #18309
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So exciting to see this moving -- thank you @friendlymatthew
| } | ||
| } | ||
|
|
||
| pub type ProjectionPushdownResult = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we please document this type (like what the two fields mean)?
I actually think it would be even nicer if this was a real enum so we could document it inline
Perhaps like
/// Result of evaluating projection pushdown ....
enum ProjectionPushdownResult {
...
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense to me. Can I get your thoughts on naming here? I made another ProjectionPushdownResult that is an enum. That type happens to live at the FileSource level
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe if we go with https://github.com/apache/datafusion/pull/18309/files#r2467051055 we can have just 1 enum / structure?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, I'm not sure if that is possible. One stores an Arc<dyn DataSource> while the other stores an Arc<dyn FileSource>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can make them generic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess what I'm trying to say is that it would be nice to give these types a proper name/alias. PartialPushdownResult<FileSource> and PartialPushdownResult<DataSource> doesn't sound the best to me IMO
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's what we do for filter pushdown, seems okay
| } | ||
| } | ||
|
|
||
| pub enum ProjectionPushdownResult { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doctorings please
| )) | ||
| } | ||
|
|
||
| fn try_pushdown_projections( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A docstring would be great
| Partial { | ||
| new_file_source: Option<Arc<dyn FileSource>>, | ||
| remaining_projections: Option<ProjectionExprs>, | ||
| new_projection_indices: Option<Vec<usize>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm something seems off here to me. In my mind this should be more like:
pub struct ProjectionPushdown {
new_file_source: Arc<dyn FileSource>,
remaining_projections: Option<ProjectionExprs>,
}
pub type ProjectionPushdownResult = Option<ProjectionPushdown>; I don't see how it could make sense to have a remaining projection if the source wasn't updated.
File sources like Parquet will absorb the entire projection.
File sources like CSV will push down indexes and create a remainder expression that handles anything more complex (aliases, operators, etc.). We can make helpers that those file sources call to handle splitting up the projection.
If no projection can be handled (the default) this returns None.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about we just return Option<ProjectionPushdown> directly (not wrap it in a typedef)?
158fcbf to
b8b0cff
Compare
Which issue does this PR close?
TableProviders#14993FileScanConfigown a list ofProjectionExprs #18253Rationale for this change
This PR delegates the responsibility of projection pushdown evaluation from the
DataSourcetrait layer (likeFileScanConfig) down to the file source implementation itself (theFileSourcetrait level).Previously,
FileScanConfig::try_swapping_with_projectioncontained all the logic to determine whether projections can be pushed down: checking for partition columns, aliases, and computing new projection indices. This meant theDataSourcewas responsible for implementation details that should belong to the underlying file formatNow,
FileSource::try_pushdown_projectionshandles this evaluation. The default impl performs the naive check that mentioned above, and individual file sources likeParquetSourcecan override this method to provide format-specific pushdown behaviorDataSource::try_swapping_with_projectionnow returns a tuple containing both a new data source and optional remaining projections, allowing for partial pushdown scenarios where some projection expressions cannot be evaluated by the file source and must remain in aProjectionExecnode.