Avoid reading entire stream to determine schema of arrow file #6368

Closed
jonmmease opened this issue May 16, 2023 · 0 comments · Fixed by #7962
Labels
enhancement New feature or request performance Make DataFusion faster

Comments

@jonmmease
Contributor

jonmmease commented May 16, 2023

Follow on to #6337.

Currently, when reading an Arrow file from a stream, the entire stream is parsed as a file just to determine the schema:

https://github.com/apache/arrow-datafusion/blob/8a47c42096311cf9b6191cfb9d96e2d9ba3a630d/datafusion/core/src/datasource/file_format/arrow.rs#L60-L63

As a result, the stream is parsed twice: once to determine the schema, and again later to actually build RecordBatches from it.

Can we be more efficient here by only looking as far into the stream as necessary to read the schema?
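
To make the idea concrete, here is a rough sketch of what "only looking as far as necessary" could mean. It is not taken from DataFusion or from the eventual fix. It assumes the standard Arrow IPC file layout: the `ARROW1` magic, two padding bytes, then the schema as the first encapsulated message, with an optional `0xFFFFFFFF` continuation marker ahead of a little-endian metadata length. The function name `read_schema_message_bytes` is hypothetical, and it returns the raw flatbuffer bytes rather than a decoded schema, since decoding would need an Arrow library.

```rust
use std::io::{self, Read};

/// Magic bytes that open an Arrow IPC file.
const ARROW_MAGIC: &[u8; 6] = b"ARROW1";
/// Marker that precedes the metadata length in current IPC messages.
const CONTINUATION_MARKER: [u8; 4] = [0xff; 4];

/// Hypothetical helper: reads only the prefix of `reader` that holds the
/// first (schema) message and returns its raw flatbuffer-encoded metadata.
/// Nothing past the schema message is consumed.
pub fn read_schema_message_bytes<R: Read>(reader: &mut R) -> io::Result<Vec<u8>> {
    // File header: 6 magic bytes followed by 2 padding bytes.
    let mut header = [0u8; 8];
    reader.read_exact(&mut header)?;
    if !header.starts_with(ARROW_MAGIC) {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "missing ARROW1 magic",
        ));
    }

    // Next 4 bytes are either the continuation marker or, in the legacy
    // layout, the metadata length itself.
    let mut word = [0u8; 4];
    reader.read_exact(&mut word)?;
    if word == CONTINUATION_MARKER {
        reader.read_exact(&mut word)?;
    }

    // Metadata length is a little-endian 32-bit integer.
    let len = i32::from_le_bytes(word);
    if len <= 0 {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "invalid schema message length",
        ));
    }

    // Read exactly the schema message metadata and stop.
    let mut metadata = vec![0u8; len as usize];
    reader.read_exact(&mut metadata)?;
    Ok(metadata)
}
```

For an 8-byte schema message this touches 24 bytes in total (8 header + 4 marker + 4 length + 8 metadata), regardless of how many record batches follow. A real implementation would likely also want to cap the length before allocating and to handle async byte streams rather than `Read`.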
