-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Pass batch_size directly when creating file opener
#17076
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Pass batch_size directly when creating file opener
#17076
Conversation
93e4b42 to
f706d4f
Compare
|
I think this is a positive change. The fact that there is an Generally we would like to disentangle this area of code a bit starting by refactoring places like this. There are a couple other ways in which Which also gets pulled in from I think where some of this is heading is splitting
let config = FileScanConfigBuilder::new(file_schema).build();
let source = ParquetSource::new(config.clone()); // needs access to file schema, for filter pushdown evaluation
let scan = FileScan::new(source, config); // needs access to the file schema for projection stuff? needs access the the files to be scanned?
let exec = DataSourceExec::new(scan);There's other things to do along the way, e.g. I think this should be moved to CSVSource: datafusion/datafusion/datasource/src/file_scan_config.rs Lines 185 to 186 in f9efba0
|
|
@blaginin would love your input on this change / maybe a bit of the whole plan! |
| let config = CsvSource::new(true, b',', b'"') | ||
| .with_comment(Some(b'#')) | ||
| .with_schema(schema) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These sources may need builders or some arguments moved to new() to avoid Option<>.expect()s
| object_store: Arc<dyn ObjectStore>, | ||
| base_config: &FileScanConfig, | ||
| partition: usize, | ||
| batch_size: usize, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My only qualm with this change is that it adds another argument to FileSource::create_file_opener and it's not clear if this will be it or if we're going to add 5 more and realize that it's a bad pattern. I think there may be 1 more or something and that's it, in which case this is the simplest and best way to do it I can think of, but if anyone has better ideas or thinks there will be a proliferation of arguments here I'd be interested in your thoughts.
comphead
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @friendlymatthew
While I can see the rationale behind this change, I'd like to understand more about why this approach is definitively better than the existing implementation. Given that this modifies the FileSource it would be a breaking change for downstream users?
Please let me know the specific example where builder wouldn't work for you?
The current approach using with_batch_size() follows a builder pattern which is flexible and composable. Given that this modifies the FileSource trait and affects multiple DataSource structs, what's the migration path for downstream users?
Hi, the From what I understand, batch size is only used when opening/reading files. Instead of requiring the user to set it in advance (and risking a panic if omitted), I think it's cleaner to pass it as a required parameter at the point where it's actually needed: let config = CsvSource::new(true, b',', b'"')
.with_comment(Some(b'#'))
.with_schema(schema)
// .with_batch_size(8192) --> If I omit this, we'll panic when we create a file opener
.with_projection(&scan_config);
let opener = config.create_file_opener(object_store, &scan_config, 0); // let's just pass it in here?
One option would be to deprecate Stepping back, my main goal is to break the tight coupling between |
|
Right big picture here I think it's quite evident that there are large issues with the current design: coupling, circular references, etc. Just take a look at this: datafusion/datafusion/datasource/src/source.rs Lines 74 to 122 in 407a965
@xudong963 made this diagram (thank you again) because we were all having trouble wrapping our heads around how these things are related. Just yesterday I ran into a gnarly bug in this area that's probably going to be really hard to unravel because logic / information is split across multiple places: #17077 And this complexity is blocking important work e.g. #14993. All this is to say: the current status quo is not great. The thing I'm struggling with is how to improve that. Unfortunately small incremental improvements (like what this PR attempts to do) will result in a lot of churn for users. Maybe a better approach is to work on a greenfield replacement that attempts to minimize the final API churn? I'm not sure, open to ideas. |
|
I agree with the idea overall:
Happy for us to merge this, but I feel like we may need to agree on the final approach first. If we’ll end up rewriting this in the next release, should we make small incremental updates now? I fear it may be annoying for library users to face a breaking change in every release - especially in the world of File Scan configs, which are a bit messy / hard to understand right now. |
|
Thanks @adriangb, @blaginin, @friendlymatthew For this PR I'm pretty sure it would add a migration pain for downstream users having a limited benefit. @friendlymatthew would you like to start a refactoring? |
|
Both comments make sense to me, and yes I'm happy to start the refactor |
One thing I can be sure of is that when I was making the diagram, I felt that traits/code are coupled deeply, and I was worried I missed some configs when upgrading. Do we have an issue/RFC to discuss the future refactor? |
Hi, I'm still investigating the existing relationships, but I'll open a issue for the redesign shortly. I'll be curious to get your thoughts @xudong963 |
The closest thing to that in my mind is #15952 |
I marked this one with my experimental "PROPOSED EPIC" tag |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @friendlymatthew and @adriangb -- I think this change makes sense to me
Is it ok with you too @comphead ?
I think we were inclined to discuss refactor |
Agreed on a different approach
|
Marking as draft as I think this PR is no longer waiting on feedback and I am trying to make it easier to find PRs in need of review. |
Hi, here's the proposed refactor: #17242 |
|
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days. |
Rationale for this change
This PR simplifies the design by making batch size a required parameter when calling
FileSource::create_file_opener.When we go to open these files in
create_file_opener, we do an.expect()call to ensure batch size is properly configured ahead of time. By passing in the batch size directly as a parameter, we can avoid the.expect()and avoid theOptionaltogetherInstead of:
We now directly pass the batch size
Are there any user-facing changes?
Yes, the
FIleSourcetrait is modified, as well asDataSourcestructs