@friendlymatthew
Contributor

Rationale for this change

This PR simplifies the design by making batch size a required parameter when calling FileSource::create_file_opener.

When we open these files in create_file_opener, we call .expect() to assert that the batch size was configured ahead of time. By passing the batch size directly as a parameter, we can avoid the .expect() and the Option altogether.

Instead of:

let source = self
    .file_source
    .with_batch_size(batch_size) // sets `batch_size` to `Some(batch_size)`
    .with_projection(self);

let opener = source.create_file_opener(object_store, self, partition); // inside here, we immediately do an .expect()

We now pass the batch size directly:

let source = self
    .file_source
    .with_projection(self);

let opener = source.create_file_opener(object_store, self, partition, batch_size);
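
To make the difference concrete, here is a minimal, self-contained sketch of the two shapes. The types `SourceBefore`, `SourceAfter`, and `Opener` are hypothetical stand-ins, not the real DataFusion types, and the real `create_file_opener` takes more arguments than shown here.

```rust
// Hypothetical stand-ins for illustration; these are NOT the real DataFusion types.
#[derive(Debug, PartialEq)]
struct Opener {
    batch_size: usize,
}

/// "Before": the batch size lives in an Option that must be set ahead of time.
#[derive(Default)]
struct SourceBefore {
    batch_size: Option<usize>,
}

impl SourceBefore {
    fn with_batch_size(mut self, batch_size: usize) -> Self {
        self.batch_size = Some(batch_size);
        self
    }

    fn create_file_opener(&self) -> Opener {
        Opener {
            // Panics at runtime if `with_batch_size` was never called.
            batch_size: self.batch_size.expect("batch size must be set"),
        }
    }
}

/// "After": the caller must supply the batch size, so there is nothing to forget.
#[derive(Default)]
struct SourceAfter;

impl SourceAfter {
    fn create_file_opener(&self, batch_size: usize) -> Opener {
        Opener { batch_size }
    }
}
```

In the "after" shape, forgetting the batch size is a compile error rather than a runtime panic.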

Are there any user-facing changes?

Yes, the FileSource trait is modified, as well as the DataSource structs.

@github-actions github-actions bot added the core (Core DataFusion crate) and datasource (Changes to the datasource crate) labels Aug 7, 2025
@friendlymatthew friendlymatthew force-pushed the friendlymatthew/pass-batch-size-directly-to-opener branch from 93e4b42 to f706d4f Compare August 7, 2025 19:40
@adriangb
Contributor

adriangb commented Aug 8, 2025

I think this is a positive change. The fact that there is an Option<T> that always gets expect()ed seems like a smell that something is wrong. The diff is also (currently) +35 / -72, another sign that this is a good change.

Generally we would like to disentangle this area of code a bit, starting by refactoring places like this. There are a couple of other ways in which FileSource gets initialized in multiple places. At the risk of going off topic, I'll use file_schema as an example:

https://github.com/pydantic/datafusion/blob/9cb592c70f9926b81102ee2270875fb4e19e4e1e/datafusion/datasource-parquet/src/source.rs#L553-L558

Which also gets pulled in from FileScanConfig:

https://github.com/pydantic/datafusion/blob/9cb592c70f9926b81102ee2270875fb4e19e4e1e/datafusion/datasource-parquet/src/source.rs#L532

I think where some of this is heading is splitting FileScanConfig into:

  • FileScanConfig: holds configurations / is a builder for data shared across multiple data sources (only the very shared things)
  • FileScan: does the actual execution

By doing this refactor we can end up with something like:

let config = FileScanConfigBuilder::new(file_schema).build();
let source = ParquetSource::new(config.clone());  // needs access to file schema, for filter pushdown evaluation
let scan = FileScan::new(source, config);  // needs access to the file schema for projection stuff? needs access to the files to be scanned?
let exec = DataSourceExec::new(scan);

There are other things to do along the way, e.g. I think this should be moved to CsvSource:

/// Are new lines in values supported for CSVOptions
pub new_lines_in_values: bool,

@adriangb
Contributor

adriangb commented Aug 8, 2025

@blaginin would love your input on this change / maybe a bit of the whole plan!

adriangb
adriangb previously approved these changes Aug 8, 2025
Comment on lines 68 to 70
let config = CsvSource::new(true, b',', b'"')
.with_comment(Some(b'#'))
.with_schema(schema)

These sources may need builders or some arguments moved to new() to avoid Option<>.expect()s
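
As a rough sketch of what "arguments moved to new()" could look like: the required schema becomes a constructor argument, and only genuinely optional settings stay as `Option`. `CsvSourceSketch` and `Schema` below are hypothetical illustrations, not the real `CsvSource` API.

```rust
// Hypothetical sketch; `CsvSourceSketch` and `Schema` are not the real DataFusion types.
#[derive(Debug, Clone, PartialEq)]
struct Schema {
    columns: Vec<String>,
}

#[allow(dead_code)]
#[derive(Debug)]
struct CsvSourceSketch {
    schema: Schema,      // required: supplied in `new`, never an Option
    has_header: bool,
    delimiter: u8,
    quote: u8,
    comment: Option<u8>, // genuinely optional, so Option is the right type
}

impl CsvSourceSketch {
    /// Everything the source cannot work without is a constructor argument.
    fn new(schema: Schema, has_header: bool, delimiter: u8, quote: u8) -> Self {
        Self {
            schema,
            has_header,
            delimiter,
            quote,
            comment: None,
        }
    }

    /// Optional settings keep the builder-style setter.
    fn with_comment(mut self, comment: Option<u8>) -> Self {
        self.comment = comment;
        self
    }

    /// No `.expect()` needed: the schema is always present.
    fn schema(&self) -> &Schema {
        &self.schema
    }
}
```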

object_store: Arc<dyn ObjectStore>,
base_config: &FileScanConfig,
partition: usize,
batch_size: usize,

My only qualm with this change is that it adds another argument to FileSource::create_file_opener and it's not clear if this will be it or if we're going to add 5 more and realize that it's a bad pattern. I think there may be 1 more or something and that's it, in which case this is the simplest and best way to do it I can think of, but if anyone has better ideas or thinks there will be a proliferation of arguments here I'd be interested in your thoughts.
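
One possible way to hedge against a proliferation of arguments, sketched under the assumption that it fits the trait at all, is to group the per-call values into a single struct. All names below (`OpenerArgs`, `FileSourceSketch`, `CsvSketch`, `Opener`) are hypothetical and do not exist in DataFusion.

```rust
// Hypothetical sketch: none of these names exist in DataFusion.
#[derive(Debug, Clone, PartialEq)]
struct OpenerArgs {
    partition: usize,
    batch_size: usize,
}

#[derive(Debug, PartialEq)]
struct Opener {
    partition: usize,
    batch_size: usize,
}

trait FileSourceSketch {
    // A single struct parameter: adding a field later leaves this signature unchanged.
    fn create_file_opener(&self, args: OpenerArgs) -> Opener;
}

struct CsvSketch;

impl FileSourceSketch for CsvSketch {
    fn create_file_opener(&self, args: OpenerArgs) -> Opener {
        Opener {
            partition: args.partition,
            batch_size: args.batch_size,
        }
    }
}
```

The trade-off is that callers who build `OpenerArgs` with a struct literal still break when a field is added, unless the struct is constructed through a function or builder.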

Contributor

@comphead comphead left a comment

Thanks @friendlymatthew

While I can see the rationale behind this change, I'd like to understand more about why this approach is definitively better than the existing implementation. Given that this modifies the FileSource it would be a breaking change for downstream users?

Please let me know the specific example where builder wouldn't work for you?

The current approach using with_batch_size() follows a builder pattern which is flexible and composable. Given that this modifies the FileSource trait and affects multiple DataSource structs, what's the migration path for downstream users?

@friendlymatthew
Contributor Author

> Thanks @friendlymatthew
>
> While I can see the rationale behind this change, I'd like to understand more about why this approach is definitively better than the existing implementation. Given that this modifies the FileSource it would be a breaking change for downstream users?
>
> Please let me know the specific example where builder wouldn't work for you?

Hi, the FileConfigBuilder initializing batch size as an Option just for every subsequent call site to require it as Some felt awkward. The with_batch_size signature reinforces this, since it only accepts a usize and immediately wraps it in Some.

From what I understand, batch size is only used when opening/reading files. Instead of requiring the user to set it in advance (and risking a panic if omitted), I think it's cleaner to pass it as a required parameter at the point where it's actually needed:

let config = CsvSource::new(true, b',', b'"')
    .with_comment(Some(b'#'))
    .with_schema(schema)
//  .with_batch_size(8192) --> If I omit this, we'll panic when we create a file opener
    .with_projection(&scan_config);

let opener = config.create_file_opener(object_store, &scan_config, 0); // let's just pass it in here?

> The current approach using with_batch_size() follows a builder pattern which is flexible and composable. Given that this modifies the FileSource trait and affects multiple DataSource structs, what's the migration path for downstream users?

One option would be to deprecate with_batch_size and point users toward the new parameter-passing approach. (I'm considering doing something similar for with_projection: #17095).
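
A deprecation path along those lines could look roughly like the sketch below. It uses Rust's standard `#[deprecated]` attribute; the types `SourceSketch` and `OpenerSketch` and the exact note text are assumptions for illustration, not the actual DataFusion change.

```rust
// Hypothetical sketch of a deprecation path; not the real DataFusion types.
#[derive(Debug, PartialEq)]
struct OpenerSketch {
    batch_size: usize,
}

#[derive(Default)]
struct SourceSketch {
    // Kept only so the deprecated setter still compiles for existing callers.
    #[allow(dead_code)]
    legacy_batch_size: Option<usize>,
}

impl SourceSketch {
    /// Existing callers keep compiling but get a compiler warning pointing
    /// them at the new parameter.
    #[deprecated(note = "pass `batch_size` to `create_file_opener` instead")]
    #[allow(dead_code)]
    fn with_batch_size(mut self, batch_size: usize) -> Self {
        self.legacy_batch_size = Some(batch_size);
        self
    }

    /// New shape: the batch size is required at the point of use.
    fn create_file_opener(&self, batch_size: usize) -> OpenerSketch {
        OpenerSketch { batch_size }
    }
}
```

In this sketch the explicit parameter always wins and the stored value is ignored, which is one of several possible transition behaviours.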

Stepping back, my main goal is to break the tight coupling between FileSourceConfig and FileSource. Right now, the config struct stores both the configuration values and the underlying file source, which in turn needs to call back into the config. That feels like a design smell.

@adriangb
Contributor

adriangb commented Aug 8, 2025

Right, big picture here: I think it's quite evident that there are large issues with the current design: coupling, circular references, etc. Just take a look at this:

/// The following diagram shows how DataSource, FileSource, and DataSourceExec are related
/// ```text
///                       ┌─────────────────────┐                            -----► execute path
///                       │                     │                            ┄┄┄┄┄► init path
///                       │   DataSourceExec    │
///                       │                     │
///                       └───────▲─────────────┘
///                               ┊  │
///                               ┊  │
///                       ┌──────────▼──────────┐                            ┌──────────-──────────┐
///                       │                     │                            |                     |
///                       │  DataSource(trait)  │                            | TableProvider(trait)|
///                       │                     │                            |                     |
///                       └───────▲─────────────┘                            └─────────────────────┘
///                               ┊  │                                                  ┊
///               ┌───────────────┿──┴────────────────┐                                 ┊
///               |   ┌┄┄┄┄┄┄┄┄┄┄┄┘                   |                                 ┊
///               |   ┊                               |                                 ┊
///    ┌──────────▼──────────┐             ┌──────────▼──────────┐                      ┊
///    │                     │             │                     │           ┌──────────▼──────────┐
///    │   FileScanConfig    │             │ MemorySourceConfig  │           |                     |
///    │                     │             │                     │           |  FileFormat(trait)  |
///    └──────────────▲──────┘             └─────────────────────┘           |                     |
///               │   ┊                                                      └─────────────────────┘
///               │   ┊                                                                 ┊
///               │   ┊                                                                 ┊
///    ┌──────────▼──────────┐                                               ┌──────────▼──────────┐
///    │                     │                                               │     ArrowSource     │
///    │ FileSource(trait)   ◄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄│         ...         │
///    │                     │                                               │    ParquetSource    │
///    └─────────────────────┘                                               └─────────────────────┘
///               │
///               │
///               │
///               │
///    ┌──────────▼──────────┐
///    │     ArrowSource     │
///    │         ...         │
///    │    ParquetSource    │
///    └─────────────────────┘
///               |
/// FileOpener (called by FileStream)
///               │
///    ┌──────────▼──────────┐
///    │                     │
///    │     RecordBatch     │
///    │                     │
///    └─────────────────────┘
/// ```

@xudong963 made this diagram (thank you again) because we were all having trouble wrapping our heads around how these things are related. Just yesterday I ran into a gnarly bug in this area that's probably going to be really hard to unravel because logic / information is split across multiple places: #17077. And this complexity is blocking important work, e.g. #14993. All this is to say: the status quo is not great. The thing I'm struggling with is how to improve that. Unfortunately, small incremental improvements (like what this PR attempts to do) will result in a lot of churn for users. Maybe a better approach is to work on a greenfield replacement that attempts to minimize the final API churn? I'm not sure, open to ideas.

@alamb alamb added the api change (Changes the API exposed to users of the crate) label Aug 8, 2025
@blaginin blaginin self-requested a review August 9, 2025 12:02
@blaginin
Collaborator

blaginin commented Aug 9, 2025

I agree with the idea overall:

  • batch_size doesn’t seem to me like a FileSource property; it’s rather a FileScanConfig param (and I feel like it’ll be even more straightforward once we do the big refactoring @adriangb mentioned)
  • FileSource doesn’t currently have any methods to inspect the current batch_size, so the user impact shouldn’t be too big

Happy for us to merge this, but I feel like we may need to agree on the final approach first. If we’ll end up rewriting this in the next release, should we make small incremental updates now? I fear it may be annoying for library users to face a breaking change in every release - especially in the world of File Scan configs, which are a bit messy / hard to understand right now.

@comphead
Contributor

comphead commented Aug 9, 2025

Thanks @adriangb, @blaginin, @friendlymatthew
IMO crashing when batch_size is not set is super confusing indeed. However, as long as the error is pretty descriptive and the API allows users to overcome this issue by calling with_batch_size, we should be fine. Totally agree that the bigger refactoring is needed, along with having some default value for batch_size and making the file open routine less troublesome.

For this PR, I'm pretty sure it would add migration pain for downstream users while having limited benefit.
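
The "default value" idea can be sketched as a fallback instead of a panic. The constant below is an assumed default chosen to mirror the 8192 used earlier in this thread, and `DefaultingSource` is a hypothetical type, not DataFusion code.

```rust
// Hypothetical sketch; `DefaultingSource` is not a real DataFusion type.
// Assumed default, mirroring the 8192 used in the example earlier in the thread.
const DEFAULT_BATCH_SIZE: usize = 8192;

#[derive(Default)]
struct DefaultingSource {
    batch_size: Option<usize>,
}

impl DefaultingSource {
    fn with_batch_size(mut self, batch_size: usize) -> Self {
        self.batch_size = Some(batch_size);
        self
    }

    /// Falls back to the default instead of panicking when nothing was set.
    fn effective_batch_size(&self) -> usize {
        self.batch_size.unwrap_or(DEFAULT_BATCH_SIZE)
    }
}
```

This keeps the builder API unchanged for downstream users while removing the panic path.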

@friendlymatthew would you like to start a refactoring?

@friendlymatthew
Contributor Author

Both comments make sense to me, and yes I'm happy to start the refactor

@xudong963
Member

xudong963 commented Aug 11, 2025

> @xudong963 made this diagram (thank you again) because we were all having trouble wrapping our heads around how these things are related.

One thing I can be sure of is that when I was making the diagram, I felt that traits/code are coupled deeply, and I was worried I missed some configs when upgrading. Do we have an issue/RFC to discuss the future refactor?

@friendlymatthew
Contributor Author

> > @xudong963 made this diagram (thank you again) because we were all having trouble wrapping our heads around how these things are related.
>
> One thing I can be sure of is that when I was making the diagram, I felt that traits/code are coupled deeply, and I was worried I missed some configs when upgrading. Do we have an issue/RFC to discuss the future refactor?

Hi, I'm still investigating the existing relationships, but I'll open an issue for the redesign shortly.

I'll be curious to get your thoughts @xudong963

@adriangb
Contributor

> Do we have an issue/RFC to discuss the future refactor?

The closest thing to that in my mind is #15952

@alamb
Contributor

alamb commented Aug 12, 2025

> > Do we have an issue/RFC to discuss the future refactor?
>
> The closest thing to that in my mind is #15952

I marked this one with my experimental "PROPOSED EPIC" tag

alamb
alamb previously approved these changes Aug 12, 2025
Contributor

@alamb alamb left a comment

Thank you @friendlymatthew and @adriangb -- I think this change makes sense to me

Is it ok with you too @comphead ?

@comphead
Contributor

> Thank you @friendlymatthew and @adriangb -- I think this change makes sense to me
>
> Is it ok with you too @comphead ?

I think we were inclined to discuss refactoring the FileScan-related structures rather than pursuing this PR, which patches an API flaw.

@alamb alamb dismissed stale reviews from adriangb and themself August 22, 2025 19:15

Agreed on a different approach

@alamb
Contributor

alamb commented Aug 22, 2025

Marking as draft as I think this PR is no longer waiting on feedback and I am trying to make it easier to find PRs in need of review.

@alamb alamb marked this pull request as draft August 22, 2025 19:16
@friendlymatthew
Contributor Author

> Thanks @adriangb, @blaginin, @friendlymatthew
> IMO crashing when batch_size is not set is super confusing indeed. However, as long as the error is pretty descriptive and the API allows users to overcome this issue by calling with_batch_size, we should be fine. Totally agree that the bigger refactoring is needed, along with having some default value for batch_size and making the file open routine less troublesome.
>
> For this PR, I'm pretty sure it would add migration pain for downstream users while having limited benefit.
>
> @friendlymatthew would you like to start a refactoring?

Hi, here's the proposed refactor: #17242

@github-actions

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale (PR has not had any activity for some time) label Oct 28, 2025
@github-actions github-actions bot closed this Nov 4, 2025