Pass `batch_size` directly when creating file opener #17076

friendlymatthew · 2025-08-07T16:38:44Z

Rationale for this change

This PR simplifies the design by making batch size a required parameter when calling FileSource::create_file_opener.

When we go to open these files in create_file_opener, we do an .expect() call to ensure batch size is properly configured ahead of time. By passing in the batch size directly as a parameter, we can avoid the .expect() and avoid the Option altogether.

Instead of:

let source = self
    .file_source
    .with_batch_size(batch_size) // sets `batch_size` to `Some(batch_size)`
    .with_projection(self);

let opener = source.create_file_opener(object_store, self, partition); // inside here, we immediately do an .expect()

We now pass the batch size directly:

let source = self
    .file_source
    .with_projection(self);

let opener = source.create_file_opener(object_store, self, partition, batch_size);
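
A minimal, self-contained sketch of the difference between the two shapes. The types below (`SourceBefore`, `SourceAfter`, `Opener`) are hypothetical stand-ins for illustration, not DataFusion's real `FileSource` implementations, and the panic message is made up:

```rust
// Hypothetical stand-ins for illustration; these are not DataFusion's real types.

/// What an opener needs in this sketch.
#[derive(Debug, PartialEq)]
pub struct Opener {
    pub partition: usize,
    pub batch_size: usize,
}

/// "Before": the batch size lives on the source as an `Option` and is
/// unwrapped when the opener is created.
#[derive(Default)]
pub struct SourceBefore {
    batch_size: Option<usize>,
}

impl SourceBefore {
    pub fn with_batch_size(mut self, batch_size: usize) -> Self {
        self.batch_size = Some(batch_size);
        self
    }

    /// Panics if `with_batch_size` was never called.
    pub fn create_file_opener(&self, partition: usize) -> Opener {
        Opener {
            partition,
            batch_size: self
                .batch_size
                .expect("batch size must be set before creating a file opener"),
        }
    }
}

/// "After": the source no longer stores a batch size at all.
#[derive(Default)]
pub struct SourceAfter;

impl SourceAfter {
    /// The caller must supply `batch_size`, so forgetting it is a compile error
    /// rather than a runtime panic.
    pub fn create_file_opener(&self, partition: usize, batch_size: usize) -> Opener {
        Opener { partition, batch_size }
    }
}
```

In this sketch, `SourceBefore::default().create_file_opener(0)` compiles but panics at runtime, while the equivalent mistake with `SourceAfter` does not compile.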

Are there any user-facing changes?

Yes, the FileSource trait is modified, as well as the DataSource structs.

adriangb · 2025-08-08T04:13:59Z

I think this is a positive change. The fact that there is an Option<T> that always gets expect()ed seems like a smell that something is wrong. The diff is also (currently) +35 / -72, another sign that this is a good change.

Generally we would like to disentangle this area of code a bit, starting by refactoring places like this. There are a couple of other cases where FileSource gets initialized in multiple places. At the risk of going off topic I'll use file_schema as an example:

https://github.com/pydantic/datafusion/blob/9cb592c70f9926b81102ee2270875fb4e19e4e1e/datafusion/datasource-parquet/src/source.rs#L553-L558

Which also gets pulled in from FileScanConfig:

https://github.com/pydantic/datafusion/blob/9cb592c70f9926b81102ee2270875fb4e19e4e1e/datafusion/datasource-parquet/src/source.rs#L532

I think where some of this is heading is splitting FileScanConfig into:

FileScanConfig: holds configurations / is a builder for data shared across multiple data sources (only the very shared things)
FileScan: does the actual execution

By doing this refactor we can end up with something like:

let config = FileScanConfigBuilder::new(file_schema).build();
let source = ParquetSource::new(config.clone());  // needs access to file schema, for filter pushdown evaluation
let scan = FileScan::new(source, config);  // needs access to the file schema for projection stuff? needs access to the files to be scanned?
let exec = DataSourceExec::new(scan);

There are other things to do along the way, e.g. I think this should be moved to CsvSource:

datafusion/datafusion/datasource/src/file_scan_config.rs

Lines 185 to 186 in f9efba0

    /// Are new lines in values supported for CSVOptions
    pub new_lines_in_values: bool,

adriangb · 2025-08-08T04:18:07Z

@blaginin would love your input on this change / maybe a bit of the whole plan!

adriangb · 2025-08-08T04:19:15Z

datafusion-examples/examples/csv_json_opener.rs

    let config = CsvSource::new(true, b',', b'"')
        .with_comment(Some(b'#'))
        .with_schema(schema)


These sources may need builders or some arguments moved to new() to avoid Option<>.expect()s
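
A hedged sketch of the two options mentioned here, using a hypothetical `CsvLikeSource`. The field names, the builder, and the error type are assumptions for illustration and do not reflect the actual `CsvSource` API:

```rust
// Hypothetical CSV-like source used only for illustration; field and method
// names are assumptions, not DataFusion's `CsvSource` API.

#[derive(Debug, PartialEq)]
pub struct CsvLikeSource {
    pub has_header: bool,
    pub delimiter: u8,
    pub schema: Vec<String>,
    /// Genuinely optional, so an `Option` is appropriate here.
    pub comment: Option<u8>,
}

impl CsvLikeSource {
    /// Option 1: everything that is required goes through `new`.
    pub fn new(has_header: bool, delimiter: u8, schema: Vec<String>) -> Self {
        Self { has_header, delimiter, schema, comment: None }
    }

    pub fn with_comment(mut self, comment: Option<u8>) -> Self {
        self.comment = comment;
        self
    }
}

/// Option 2: a builder whose `build` reports what is missing up front
/// instead of panicking later at the point of use.
#[derive(Default)]
pub struct CsvLikeSourceBuilder {
    has_header: bool,
    delimiter: Option<u8>,
    schema: Option<Vec<String>>,
    comment: Option<u8>,
}

impl CsvLikeSourceBuilder {
    pub fn delimiter(mut self, delimiter: u8) -> Self {
        self.delimiter = Some(delimiter);
        self
    }

    pub fn schema(mut self, schema: Vec<String>) -> Self {
        self.schema = Some(schema);
        self
    }

    pub fn build(self) -> Result<CsvLikeSource, String> {
        Ok(CsvLikeSource {
            has_header: self.has_header,
            // A missing delimiter falls back to a comma in this sketch.
            delimiter: self.delimiter.unwrap_or(b','),
            // A missing schema is an error surfaced at build time.
            schema: self.schema.ok_or_else(|| "schema is required".to_string())?,
            comment: self.comment,
        })
    }
}
```

Either way, the constructed source has no required field left as an `Option`, so there is nothing to `expect()` later.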

adriangb · 2025-08-08T04:22:22Z

datafusion/datasource/src/file.rs

        object_store: Arc<dyn ObjectStore>,
        base_config: &FileScanConfig,
        partition: usize,
+        batch_size: usize,


My only qualm with this change is that it adds another argument to FileSource::create_file_opener and it's not clear if this will be it or if we're going to add 5 more and realize that it's a bad pattern. I think there may be 1 more or something and that's it, in which case this is the simplest and best way to do it I can think of, but if anyone has better ideas or thinks there will be a proliferation of arguments here I'd be interested in your thoughts.
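
If the argument list did start to grow, one commonly used alternative is to group the per-open arguments into a single struct. This is only a sketch of that idea; `OpenArgs`, `SourceLike`, and `DemoSource` are hypothetical names, and nothing like this is proposed in the PR itself:

```rust
// Hypothetical names throughout; this is not an existing DataFusion API.

/// Groups the per-open arguments so that adding one later does not change
/// the trait method's signature.
#[derive(Debug, Clone, PartialEq)]
pub struct OpenArgs {
    pub partition: usize,
    pub batch_size: usize,
}

pub trait SourceLike {
    /// Returns a description of the opener instead of a real one,
    /// to keep the sketch small.
    fn create_file_opener(&self, args: &OpenArgs) -> String;
}

pub struct DemoSource;

impl SourceLike for DemoSource {
    fn create_file_opener(&self, args: &OpenArgs) -> String {
        format!("partition={} batch_size={}", args.partition, args.batch_size)
    }
}
```

The trade-off is that adding a public field to such a struct still breaks callers that build it with a struct literal, so this mainly helps when construction goes through a constructor or builder.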

comphead

Thanks @friendlymatthew

While I can see the rationale behind this change, I'd like to understand more about why this approach is definitively better than the existing implementation. Given that this modifies the FileSource trait, wouldn't it be a breaking change for downstream users?

Could you share a specific example where the builder wouldn't work for you?

The current approach using with_batch_size() follows a builder pattern which is flexible and composable. Given that this modifies the FileSource trait and affects multiple DataSource structs, what's the migration path for downstream users?

friendlymatthew · 2025-08-08T18:16:29Z

> Thanks @friendlymatthew
>
> While I can see the rationale behind this change, I'd like to understand more about why this approach is definitively better than the existing implementation. Given that this modifies the FileSource trait, wouldn't it be a breaking change for downstream users?
>
> Could you share a specific example where the builder wouldn't work for you?

Hi, the FileConfigBuilder initializing batch size as an Option just for every subsequent call site to require it as Some felt awkward. The with_batch_size signature reinforces this, since it only accepts a usize and immediately wraps it in Some.

From what I understand, batch size is only used when opening/reading files. Instead of requiring the user to set it in advance (and risking a panic if omitted), I think it's cleaner to pass it as a required parameter at the point where it's actually needed:

let config = CsvSource::new(true, b',', b'"')
    .with_comment(Some(b'#'))
    .with_schema(schema)
//  .with_batch_size(8192) --> If I omit this, we'll panic when we create a file opener
    .with_projection(&scan_config);

let opener = config.create_file_opener(object_store, &scan_config, 0); // let's just pass it in here?

> The current approach using with_batch_size() follows a builder pattern which is flexible and composable. Given that this modifies the FileSource trait and affects multiple DataSource structs, what's the migration path for downstream users?

One option would be to deprecate with_batch_size and point users toward the new parameter-passing approach. (I'm considering doing something similar for with_projection: #17095).
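
A sketch of what such a deprecation could look like, using Rust's built-in `#[deprecated]` attribute. The `SourceLike` type, the `since` value, and the note text are placeholders, not taken from DataFusion:

```rust
// Hypothetical source type; the version string and note text are placeholders.

#[derive(Debug, Default, Clone, PartialEq)]
pub struct SourceLike {
    // Still stored so the old builder call keeps compiling; this sketch
    // ignores the stored value in favor of the explicit argument below.
    batch_size: Option<usize>,
}

impl SourceLike {
    /// Kept for a deprecation cycle so existing callers still compile,
    /// but with a compiler warning pointing at the replacement.
    #[deprecated(since = "0.0.0", note = "pass `batch_size` to `create_file_opener` instead")]
    pub fn with_batch_size(mut self, batch_size: usize) -> Self {
        self.batch_size = Some(batch_size);
        self
    }

    /// New entry point: the batch size is an explicit argument.
    /// Returns the pair it was given, standing in for a real opener.
    pub fn create_file_opener(&self, partition: usize, batch_size: usize) -> (usize, usize) {
        (partition, batch_size)
    }
}
```

Note that a deprecation like this only softens the change for callers. Anyone implementing the trait downstream would still need to update their method signature.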

Stepping back, my main goal is to break the tight coupling between FileScanConfig and FileSource. Right now, the config struct stores both configuration values and the underlying file source, which in turn needs to call back into the config. That feels like a design smell.

adriangb · 2025-08-08T18:26:12Z

Right, big picture here: I think it's quite evident that there are large issues with the current design: coupling, circular references, etc. Just take a look at this:

datafusion/datafusion/datasource/src/source.rs

Lines 74 to 122 in 407a965

    /// The following diagram shows how DataSource, FileSource, and DataSourceExec are related
    /// ```text
    ///                       ┌─────────────────────┐                              -----► execute path
    ///                       │                     │                              ┄┄┄┄┄► init path
    ///                       │   DataSourceExec    │
    ///                       │                     │
    ///                       └───────▲─────────────┘
    ///                               ┊  │
    ///                               ┊  │
    ///                       ┌──────────▼──────────┐                            ┌──────────-──────────┐
    ///                       │                     │                            |                     |
    ///                       │  DataSource(trait)  │                            | TableProvider(trait)|
    ///                       │                     │                            |                     |
    ///                       └───────▲─────────────┘                            └─────────────────────┘
    ///                               ┊  │                                                  ┊
    ///               ┌───────────────┿──┴────────────────┐                                 ┊
    ///               |   ┌┄┄┄┄┄┄┄┄┄┄┄┘                   |                                 ┊
    ///               |   ┊                               |                                 ┊
    ///    ┌──────────▼──────────┐             ┌──────────▼──────────┐                      ┊
    ///    │                     │             │                     │           ┌──────────▼──────────┐
    ///    │   FileScanConfig    │             │ MemorySourceConfig  │           |                     |
    ///    │                     │             │                     │           |  FileFormat(trait)  |
    ///    └──────────────▲──────┘             └─────────────────────┘           |                     |
    ///               │   ┊                                                      └─────────────────────┘
    ///               │   ┊                                                                 ┊
    ///               │   ┊                                                                 ┊
    ///    ┌──────────▼──────────┐                                               ┌──────────▼──────────┐
    ///    │                     │                                               │     ArrowSource     │
    ///    │ FileSource(trait)   ◄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄│          ...        │
    ///    │                     │                                               │    ParquetSource    │
    ///    └─────────────────────┘                                               └─────────────────────┘
    ///               │
    ///               │
    ///               │
    ///               │
    ///    ┌──────────▼──────────┐
    ///    │     ArrowSource     │
    ///    │          ...        │
    ///    │    ParquetSource    │
    ///    └─────────────────────┘
    ///               |
    /// FileOpener (called by FileStream)
    ///               │
    ///    ┌──────────▼──────────┐
    ///    │                     │
    ///    │     RecordBatch     │
    ///    │                     │
    ///    └─────────────────────┘
    /// ```

@xudong963 made this diagram (thank you again) because we were all having trouble wrapping our heads around how these things are related. Just yesterday I ran into a gnarly bug in this area that's probably going to be really hard to unravel because logic and information are split across multiple places: #17077. This complexity is also blocking important work, e.g. #14993.

All this is to say: the status quo is not great. The thing I'm struggling with is how to improve it. Unfortunately, small incremental improvements (like what this PR attempts to do) will result in a lot of churn for users. Maybe a better approach is to work on a greenfield replacement that attempts to minimize the final API churn? I'm not sure, open to ideas.

blaginin · 2025-08-09T12:13:00Z

I agree with the idea overall:

batch_size doesn’t seem to me like a FileSource property; it’s rather a FileScanConfig param (and I feel like it’ll be even more straightforward once we do the big refactoring @adriangb mentioned)
FileSource doesn’t currently have any methods to inspect the current batch_size, so the user impact shouldn’t be too big

Happy for us to merge this, but I feel like we may need to agree on the final approach first. If we’ll end up rewriting this in the next release, should we make small incremental updates now? I fear it may be annoying for library users to face a breaking change in every release - especially in the world of File Scan configs, which are a bit messy / hard to understand right now.

comphead · 2025-08-09T20:22:06Z

Thanks @adriangb, @blaginin, @friendlymatthew
IMO crashing when batch_size is not set is indeed super confusing. However, as long as the error is descriptive and the API lets users overcome the issue by calling with_batch_size, we should be fine. I totally agree that the bigger refactoring is needed, including having a default value for batch_size and making the file open routine less troublesome.

For this PR, I'm pretty sure it would add migration pain for downstream users with limited benefit.

@friendlymatthew would you like to start a refactoring?
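
The two mitigations mentioned above, a default value and a descriptive error, could look roughly like the following. This is a sketch under assumptions: `SourceLike`, `DEFAULT_BATCH_SIZE`, and the error text are hypothetical and not part of DataFusion's API:

```rust
// Hypothetical sketch; `DEFAULT_BATCH_SIZE` and the type names are assumptions.

pub const DEFAULT_BATCH_SIZE: usize = 8192;

#[derive(Debug, Default)]
pub struct SourceLike {
    batch_size: Option<usize>,
}

impl SourceLike {
    pub fn with_batch_size(mut self, batch_size: usize) -> Self {
        self.batch_size = Some(batch_size);
        self
    }

    /// Variant 1: fall back to a default rather than panicking.
    pub fn batch_size_or_default(&self) -> usize {
        self.batch_size.unwrap_or(DEFAULT_BATCH_SIZE)
    }

    /// Variant 2: keep the value required, but return a descriptive error
    /// that tells the caller how to fix the problem.
    pub fn require_batch_size(&self) -> Result<usize, String> {
        self.batch_size.ok_or_else(|| {
            "batch_size is not set; call with_batch_size() before creating a file opener"
                .to_string()
        })
    }
}
```

Both variants keep the existing builder call working, which is why they avoid the migration cost discussed in this thread, at the price of keeping the `Option` around.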

friendlymatthew · 2025-08-10T00:45:05Z

Both comments make sense to me, and yes, I'm happy to start the refactor.

xudong963 · 2025-08-11T08:01:07Z

> @xudong963 made this diagram (thank you again) because we were all having trouble wrapping our heads around how these things are related.

One thing I can be sure of is that when I was making the diagram, I felt that traits/code are coupled deeply, and I was worried I missed some configs when upgrading. Do we have an issue/RFC to discuss the future refactor?

friendlymatthew · 2025-08-12T15:53:39Z

> @xudong963 made this diagram (thank you again) because we were all having trouble wrapping our heads around how these things are related.
>
> One thing I can be sure of is that when I was making the diagram, I felt that traits/code are coupled deeply, and I was worried I missed some configs when upgrading. Do we have an issue/RFC to discuss the future refactor?

Hi, I'm still investigating the existing relationships, but I'll open an issue for the redesign shortly.

I'll be curious to get your thoughts @xudong963

adriangb · 2025-08-12T15:56:00Z

> Do we have an issue/RFC to discuss the future refactor?

The closest thing to that in my mind is #15952

alamb · 2025-08-12T20:06:21Z

> Do we have an issue/RFC to discuss the future refactor?
>
> The closest thing to that in my mind is #15952

I marked this one with my experimental "PROPOSED EPIC" tag

alamb

Thank you @friendlymatthew and @adriangb -- I think this change makes sense to me

Is it ok with you too @comphead ?

comphead · 2025-08-12T20:10:41Z

> Thank you @friendlymatthew and @adriangb -- I think this change makes sense to me
>
> Is it ok with you too @comphead ?

I think we were inclined to discuss refactoring the FileScan-related structures rather than pursuing this PR, which patches an API flaw.

Agreed on a different approach

alamb · 2025-08-22T19:15:58Z

Marking as draft as I think this PR is no longer waiting on feedback and I am trying to make it easier to find PRs in need of review.

friendlymatthew · 2025-08-25T15:28:56Z

> Thanks @adriangb, @blaginin, @friendlymatthew
> IMO crashing when batch_size is not set is indeed super confusing. However, as long as the error is descriptive and the API lets users overcome the issue by calling with_batch_size, we should be fine. I totally agree that the bigger refactoring is needed, including having a default value for batch_size and making the file open routine less troublesome.
>
> For this PR, I'm pretty sure it would add migration pain for downstream users with limited benefit.
>
> @friendlymatthew would you like to start a refactoring?

Hi, here's the proposed refactor: #17242

github-actions · 2025-10-28T02:06:37Z

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

github-actions bot added the core and datasource labels Aug 7, 2025

Pass batch size directly to file opener

f706d4f

friendlymatthew force-pushed the friendlymatthew/pass-batch-size-directly-to-opener branch from 93e4b42 to f706d4f Compare August 7, 2025 19:40

Merge branch 'main' into friendlymatthew/pass-batch-size-directly-to-opener

9cb592c

adriangb previously approved these changes Aug 8, 2025

friendlymatthew mentioned this pull request Aug 8, 2025

Unify how various FileSources are applying projections? #17095

Open

comphead reviewed Aug 8, 2025

alamb added the api change label Aug 8, 2025

blaginin self-requested a review August 9, 2025 12:02

adriangb mentioned this pull request Aug 12, 2025

Clean up APIs around FileScanConfigBuilder, FileScanConfig and FileSource #15952

Open

alamb previously approved these changes Aug 12, 2025

alamb marked this pull request as draft August 22, 2025 19:16

github-actions bot added the Stale label Oct 28, 2025

github-actions bot closed this Nov 4, 2025
