          Redesign ownership model between FileScanConfig and FileSources
          #17242
        
              Conversation
I was able to identify the root cause behind several test failures related to roundtripping physical plans in the impacted test cases.

Thoughts on fixes: I see 2 possible directions for resolving the test failures.

I'm leaning towards 1 since it reflects the current lossless boundary and avoids overcomplicating serialization. But I'm curious if people have any opinions.

cc @adriangb @berkaysynnada @mbrubeck @xudong963 @comphead @blaginin @alamb
I agree that there were always losses in serialization / deserialization, e.g. loss of custom …
Hi, this PR is ready for review -- it's quite sizable and touches 55+ files.
I've split up the PR into 2 commits:
- Move all fields shared across `FileSource` implementations into `FileScanConfig`
- Have `FileSource` own `FileScanConfig` and directly impl `DataSource`

I would recommend starting from `datafusion/datasource/src/file_scan_config.rs` to see which fields were added to `FileScanConfig`. Then, all concrete types that impl `FileSource`, like `ParquetSource`.
`datafusion/datasource/src/file.rs` is the main change, as it implements `DataSource` for all `T: FileSource`.
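For readers skimming, the blanket-impl shape can be sketched with stand-in types (`FileScanConfig`, `FileSource`, `DataSource`, and `ParquetSource` below are simplified stand-ins, not the real DataFusion definitions; the actual PR implements `DataSource` for `&T` rather than `T`):

```rust
// Simplified stand-ins for the real DataFusion types, just to show the shape.
struct FileScanConfig {
    limit: Option<usize>,
}

trait FileSource {
    fn config(&self) -> &FileScanConfig;
}

trait DataSource {
    fn fetch(&self) -> Option<usize>;
}

// Shared behavior lives in one blanket impl instead of being duplicated
// in every concrete source.
impl<T: FileSource> DataSource for T {
    fn fetch(&self) -> Option<usize> {
        self.config().limit
    }
}

// A concrete source only has to expose its owned config...
struct ParquetSource {
    config: FileScanConfig,
}

impl FileSource for ParquetSource {
    fn config(&self) -> &FileScanConfig {
        &self.config
    }
}

fn main() {
    let source = ParquetSource {
        config: FileScanConfig { limit: Some(10) },
    };
    // ...and picks up the DataSource behavior for free.
    assert_eq!(source.fetch(), Some(10));
    println!("ok");
}
```

The payoff is that each concrete source can still override the shared behavior to specialize, which is what the Parquet projection-pushdown path does in the real PR.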
```diff
-let file_source = Arc::new(
-    ParquetSource::default()
+let file_scan_config = FileScanConfigBuilder::new(object_store_url, schema)
+    .with_limit(limit)
+    .with_projection(projection.cloned())
+    .with_file(partitioned_file)
+    .build();
+
+let file_source =
+    ParquetSource::new(TableParquetOptions::default(), file_scan_config.clone())
         // provide the predicate so the DataSourceExec can try and prune
         // row groups internally
         .with_predicate(predicate)
         // provide the factory to create parquet reader without re-reading metadata
-        .with_parquet_file_reader_factory(Arc::new(reader_factory)),
-);
-let file_scan_config = FileScanConfigBuilder::new(object_store_url, schema, file_source)
-    .with_limit(limit)
-    .with_projection(projection.cloned())
-    .with_file(partitioned_file)
-    .build();
+        .with_parquet_file_reader_factory(Arc::new(reader_factory));

 // Finally, put it all together into a DataSourceExec
 Ok(DataSourceExec::from_data_source(file_scan_config))
```
You'll see this pattern a lot in this PR.
The old flow goes something like:
- Define the file source
- Define the file scan config and move the file source inside
- Derive a data source plan from the file scan config

Inside this flow, there's a circular dependency: the config holds the file source, and the file source creates the file opener but must also be handed the config back.
The new flow goes something like:
- Define the config
- Define the file source, which now owns the config
- Derive a data source plan from the file source
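The new flow above can be sketched with hypothetical simplified types (none of these are the real DataFusion APIs):

```rust
// Hypothetical simplified types illustrating the new ownership direction:
// the config is built first, and the source owns it outright.
struct FileScanConfig {
    batch_size: usize,
}

struct CsvSource {
    config: FileScanConfig,
}

impl CsvSource {
    fn new(config: FileScanConfig) -> Self {
        Self { config }
    }

    // The opener's parameters come straight from the owned config -
    // no need to pass the config back in from the outside.
    fn open_batch_size(&self) -> usize {
        self.config.batch_size
    }
}

fn main() {
    // 1. Define the config
    let config = FileScanConfig { batch_size: 8192 };
    // 2. Define the file source, which now owns the config
    let source = CsvSource::new(config);
    // 3. Derive behavior from the source directly
    assert_eq!(source.open_batch_size(), 8192);
    println!("ok");
}
```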
The biggest point of the refactor
Just small comments for now. Will review more tomorrow.
This is just moving code around to initialize the FileScanConfig before the ParquetSource ✅
```rust
.with_schema(schema)
.with_batch_size(8192)
.with_projection(&scan_config);
```
It makes sense that the schema, batch size, and projection are not properties inherent to CSVs and thus should be part of FileScanConfig. In fact, they are currently duplicated!
I'll start reviewing in the next two days.
Overall this looks really nice!
Big picture the idea of having the shared behavior via impl <T: FileSource> DataSource for &T is really smart: it gives us code sharing / common behavior while allowing each FileSource to override the implementations to specialize (e.g. Parquet for projection pushdown).
I have left some comments and questions, and would like to see some other reviews as well before approving.
But again, overall I commend you on the attention to detail in this PR. It is unfortunately massive and hard to review, but I think the organization into stacked commits you did is good and I don't see any way to split this up better.
```diff
 /// Does not hold anything special, since [`FileScanConfig`] is sufficient for arrow
-#[derive(Clone, Default)]
+#[derive(Clone)]
 pub struct ArrowSource {
     metrics: ExecutionPlanMetricsSet,
     schema_adapter_factory: Option<Arc<dyn SchemaAdapterFactory>>,
+    config: FileScanConfig,
 }

+impl ArrowSource {
+    /// Returns a [`ArrowSource`]
+    pub fn new(config: FileScanConfig) -> Self {
+        Self {
+            metrics: Default::default(),
+            schema_adapter_factory: None,
+            config,
+        }
+    }
+}
```
This makes sense - all of the FileSource implementations now need a FileScanConfig, so they can't impl Default anymore.
```rust
fn config(&self) -> &FileScanConfig {
    &self.config
}

fn with_config(&self, config: FileScanConfig) -> Arc<dyn FileSource> {
    let mut this = self.clone();
    this.config = config;

    Arc::new(this)
}
```
These are the APIs I'm not so sure about
```diff
 Arc::new(ArrowOpener {
     object_store,
-    projection: base_config.file_column_projection_indices(),
+    projection: self.config().file_column_projection_indices(),
```
Use self.config instead? There seems to be no point in calling the public method.
Both approaches work in this case because:
- The trait method `config()` just returns `&self.config`
- So `self.config()` and `self.config` access the same data

I use `self.config()` for consistency with the trait interface, especially since:
- It maintains abstraction by going through the trait method rather than direct field access
- It would keep working correctly even if the trait implementation changed, though that hasn't happened
```diff
-fn metrics(&self) -> &ExecutionPlanMetricsSet {
+fn metrics_inner(&self) -> &ExecutionPlanMetricsSet {
```
Hmm is metrics_inner now a method on FileSource?
```rust
fn with_fetch(&self, limit: Option<usize>) -> Option<Arc<dyn DataSource>> {
    let config = FileScanConfigBuilder::from(self.config().clone())
        .with_limit(limit)
        .build();

    Some(self.with_config(config).as_data_source())
}
```
I wonder if we should invert this:

```rust
fn with_fetch(&self, limit: Option<usize>) -> Option<Arc<dyn DataSource>> {
    Some(Arc::new(FileSource::with_fetch(self, limit)) as Arc<dyn DataSource>)
}
```

Where `FileSource::with_fetch` is something along the lines of `self.config = self.config.with_fetch(limit)`.

Basically, can we make `FileScanConfig` an implementation detail one level lower?
```rust
fn as_file_source(&self) -> Option<Arc<dyn FileSource>> {
    None
}
```
Wondering / questioning this method
MemorySourceConfig impls DataSource but not FileSource
Right, but I wonder if this is the right way to go about that, or if whatever needs this sort of downcasting should be part of the trait.
```rust
if let Some(file_source) = self.data_source.as_file_source() {
    if let Some(file_group) = file_source.config().file_groups.get(partition) {
```
I guess the point is that not all data sources are file sources, and for the ones that aren't partitioning doesn't necessarily apply.
Can't we make this a method on the `DataSource` trait so that we can implement it as `Ok(Statistics::new_unknown())` for `MemoryDataSource` and copy this code into `impl<T: FileScan> DataSource for &T`?
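A minimal sketch of this suggestion, with hypothetical stand-in types (`Statistics`, `MemoryDataSource`, and the method name `partition_statistics` here are illustrative, not the real API): a default trait method returns unknown statistics, file-backed sources override it, and no downcasting is needed.

```rust
// Hypothetical stand-ins sketching "statistics as a trait method with a
// default" instead of downcasting to a concrete source type.
#[derive(Debug, PartialEq)]
struct Statistics {
    known: bool,
}

impl Statistics {
    fn new_unknown() -> Self {
        Statistics { known: false }
    }
}

trait DataSource {
    // Default: sources without file groups report unknown statistics.
    fn partition_statistics(&self, _partition: usize) -> Statistics {
        Statistics::new_unknown()
    }
}

struct MemoryDataSource;
impl DataSource for MemoryDataSource {} // inherits the unknown default

struct FileDataSource {
    file_groups: Vec<Vec<&'static str>>,
}

impl DataSource for FileDataSource {
    fn partition_statistics(&self, partition: usize) -> Statistics {
        // File-backed sources can compute real statistics per partition.
        Statistics {
            known: self.file_groups.get(partition).is_some(),
        }
    }
}

fn main() {
    assert_eq!(
        MemoryDataSource.partition_statistics(0),
        Statistics::new_unknown()
    );
    let fs = FileDataSource {
        file_groups: vec![vec!["a.parquet"]],
    };
    assert!(fs.partition_statistics(0).known);
    println!("ok");
}
```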
```rust
pub fn downcast_to_file_source<T: FileSource + 'static>(&self) -> Option<&T> {
    self.data_source().as_any().downcast_ref::<T>()
}
```
Same with this method - I wonder where it is used. It's a nice convenience, but to me this sort of downcasting is always a smell of missing design on the trait / a leaky abstraction.
I don't think this is necessarily a bad code smell. It's used whenever we need to downcast to a FileSource. It just reduces the # LOC
I mean that sort of down casting being necessary is a code smell
```rust
fn parquet_exec_multiple_sorted(
    output_ordering: Vec<LexOrdering>,
) -> Arc<DataSourceExec> {
```
Not a problem for today but again the fact that optimizer rules have to special case execution node implementations via downcasting is IMO a smell
```diff
-fn data_source_statistics(&self) -> Result<Statistics> {
-    Ok(self.config().projected_stats(self.file_source_statistics()))
+fn data_source_statistics(&self) -> Statistics {
+    let file_source_statistics = self.file_source_statistics();
```
I don't love data_source_statistics and file_source_statistics, curious if people had thoughts on better names here
Both statistics() names are good to me.
It would be better to have a high-level diagram to describe the current relationship. Before the PR, it looks like … After the PR, → ?

If we have this, it'll definitely be helpful for review, and especially for our users to understand the changes in the PR.

Also, if we want to have more eyes on this redesign, we can send an email to https://lists.apache.org/list.html?dev@datafusion.apache.org.
          
Fwiw, this change involves removing the … Here's a diagram that captures the redesign: [diagram] You can check it out here.
```rust
fn with_config(&self, config: FileScanConfig) -> Arc<dyn FileSource> {
    let mut this = self.clone();
    this.config = config;

    Arc::new(this)
}
```
I don't see a reason why we don't give it a default implementation in FileSource.
```rust
fn config(&self) -> &FileScanConfig {
    &self.config
```
ditto
```rust
fn as_data_source(&self) -> Arc<dyn DataSource> {
    Arc::new(self.clone())
}
```
ditto
```diff
 fn with_statistics(&self, statistics: Statistics) -> Arc<dyn FileSource>;
 /// Return execution plan metrics
-fn metrics(&self) -> &ExecutionPlanMetricsSet;
+fn metrics_inner(&self) -> &ExecutionPlanMetricsSet;
```
Can we avoid the method here? It seems that we can add metrics to FileScanConfig
```rust
let mut source = source
    .as_any()
    .downcast_ref::<ParquetSource>()
    .unwrap()
```
It would be better to replace this with proper error handling.
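One way to avoid the panic is `ok_or_else` on the `downcast_ref` result. The sketch below uses hypothetical minimal stand-ins (a `String` error instead of the real error type, and toy `FileSource`/`ParquetSource`/`CsvSource` types):

```rust
use std::any::Any;

// Toy stand-ins for the real traits, to show .unwrap() -> ok_or_else.
trait FileSource {
    fn as_any(&self) -> &dyn Any;
}

struct ParquetSource;
impl FileSource for ParquetSource {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

struct CsvSource;
impl FileSource for CsvSource {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

fn expect_parquet(source: &dyn FileSource) -> Result<&ParquetSource, String> {
    source
        .as_any()
        .downcast_ref::<ParquetSource>()
        // Surface a descriptive error instead of panicking via .unwrap()
        .ok_or_else(|| "expected ParquetSource, found a different FileSource".to_string())
}

fn main() {
    assert!(expect_parquet(&ParquetSource).is_ok());
    assert!(expect_parquet(&CsvSource).is_err());
    println!("ok");
}
```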
```rust
// Apply schema adapter factory before building the new config
let file_source = source.apply_schema_adapter(&conf)?;

let conf = FileScanConfigBuilder::from(conf)
    .with_source(file_source)
    .build();
```
Note for later reviewers: these steps have been pulled up into create_file_source_with_schema_adapter.
```diff
-fn file_source_statistics(&self, config: &FileScanConfig) -> Statistics {
-    let statistics = config.file_source_projected_statistics.clone();
+fn file_source_statistics(&self) -> Statistics {
+    let statistics = self.config.file_source_projected_statistics.clone();
```
Seems like just a naming issue.
```diff
 /// from the logical schema to the physical schema of the file.
 pub expr_adapter_factory: Option<Arc<dyn PhysicalExprAdapterFactory>>,

+pub file_source_projected_statistics: Statistics,
```
projected_statistics is good to me
          
Thanks @friendlymatthew @xudong963 for the diagrams - that's incredibly helpful for a high-level review. A couple of questions: …
Thank you for this innovative discussion! Learned a lot from it 💯

After digging a bit into the related code, I have a few different opinions. First, I was wondering how …

Here is a brief walkthrough of the current implementation: …

Some points that seem unreasonable to me are: …

And the final one: …

I'd like to propose a few changes based on the current PR (as a discussion in this refactor topic, not as a review comment - it might be too large for one PR): …

A diagram covering the above proposed changes (drawio xml embedded): [diagram]
@waynexia thanks so much for the input! I agree with most of your points - let's see what @friendlymatthew thinks.

Don't we need a …

FWIW I also think we should get rid of the partition value handling in …
          
Partition is closer to a runtime config (passed from the execution plan's exec), so it should be OK to have only one FileFormat for all partitions. The partition number is used to get a set of file segments from FileScanConfig and to mark metrics.
Thanks @waynexia for the diagram and explanation. For example, if a user would like to onboard the Orc file format, …

For the diagram, it might still be confusing to have the memory datasource under file source configs. Making some changes in the proposal: …

However, as you correctly mentioned, FileFormat, FileOpener, and FileStream can probably be encapsulated into some facade object taking a config and providing …
@friendlymatthew do you need help with finishing this PR?
@blaginin we've essentially "abandoned" this PR because it was too big of a change. We've been working on more iterative changes - tracking in #14993; the latest PR is #18309.

@friendlymatthew I'm going to close this PR for now; there are so many conflicts that I think even if we wanted to continue down this path, it'd essentially mean starting over.





Which issue does this PR close?
- FileSources are applying projections? #17095
- `FileScanConfigBuilder`, `FileScanConfig` and `FileSource` #15952

Rationale for this change

This PR simplifies the relationship between `FileSource` <--> `FileScanConfig` <--> `DataSource`.

Currently, `FileScanConfig` is a struct used to group common parameters shared across different file sources. However, the existing design also makes `FileScanConfig` impl `DataSource`. This means that to construct a data source execution plan, you must derive it from a configuration struct.

This PR removes that indirection. Instead, each `FileSource` struct holds a `FileScanConfig` field, and all types that impl `FileSource` also impl `DataSource`.

This redesign removes a lot of redundant code. For instance, `AvroSource` previously duplicated fields from `FileScanConfig`, which required additional boilerplate to manually get/set values: --

We still maintain an abstraction boundary between `FileSource` and `DataSource`. The `DataSource` impl remains generic over any `T: FileSource`.

Are there any user-facing changes?

Yes -- and they are substantial.

Note: the current diff does not yet include deprecation strategies for existing methods, to keep the review process clearer.
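To illustrate the kind of boilerplate removed (with hypothetical simplified field names - the real AvroSource fields differ): before, each source mirrored config fields and needed per-field setters; after, it simply owns the config.

```rust
// Hypothetical simplified types showing the duplication the PR removes.
#[derive(Default)]
struct FileScanConfig {
    batch_size: Option<usize>,
}

// Before: the source mirrors config fields and re-implements get/set.
#[derive(Default)]
struct AvroSourceBefore {
    batch_size: Option<usize>,
}

impl AvroSourceBefore {
    fn with_batch_size(mut self, n: usize) -> Self {
        self.batch_size = Some(n); // repeated for every shared field
        self
    }
}

// After: one owned config, no per-field plumbing.
struct AvroSourceAfter {
    config: FileScanConfig,
}

impl AvroSourceAfter {
    fn batch_size(&self) -> Option<usize> {
        self.config.batch_size
    }
}

fn main() {
    let before = AvroSourceBefore::default().with_batch_size(8192);
    let after = AvroSourceAfter {
        config: FileScanConfig {
            batch_size: Some(8192),
        },
    };
    assert_eq!(before.batch_size, after.batch_size());
    println!("ok");
}
```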