CSV inference reads in the whole file to memory, regardless of row limit #3658
Comments
This should just be a case of hooking up #2936. I think the inference code dates from a time before that.
@tustvold thanks for the comment and pointer. I'm having a hard time figuring out how to hook up a Reader (as used by the infer_reader_schema function) to a futures Stream. Do you have any hints for me? I was able to pull from the stream (regardless of file vs. networked object store) into a buffer and turn that into a reader, but hooking a reader up to the stream directly is confounding me.
You will want to do something vaguely like the following (not properly tested):
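Roughly this shape — a sketch only, assuming `object_store`'s `GetResult::into_stream` and arrow's `csv::reader::infer_reader_schema`; the function name `infer_schema_bounded` and the naive newline counting are illustrative, not the real implementation:

```rust
use std::io::Cursor;

use arrow::csv::reader::infer_reader_schema;
use arrow::datatypes::Schema;
use futures::StreamExt;
use object_store::{path::Path, ObjectStore};

/// Fetch bytes only until enough complete lines are buffered to cover
/// `max_records`, then run inference on that buffer alone.
async fn infer_schema_bounded(
    store: &dyn ObjectStore,
    location: &Path,
    max_records: usize,
    has_header: bool,
) -> Result<Schema, Box<dyn std::error::Error>> {
    let mut stream = store.get(location).await?.into_stream();

    // Complete lines needed: the records plus an optional header row.
    let needed = max_records + has_header as usize;

    let mut buf: Vec<u8> = Vec::new();
    let mut newlines = 0usize;

    while let Some(chunk) = stream.next().await {
        let chunk = chunk?;
        // NOTE: naive newline counting over-counts when quoted fields
        // contain embedded newlines; a real implementation needs
        // quote-aware framing.
        newlines += chunk.iter().filter(|b| **b == b'\n').count();
        buf.extend_from_slice(&chunk);
        if newlines >= needed {
            break;
        }
    }

    // Drop any trailing partial record so inference only sees whole lines.
    if let Some(pos) = buf.iter().rposition(|b| *b == b'\n') {
        buf.truncate(pos + 1);
    }

    let (schema, _rows_read) =
        infer_reader_schema(Cursor::new(buf), b',', Some(max_records), has_header)?;
    Ok(schema)
}
```

The key point is that the loop stops pulling from the stream as soon as it has buffered enough complete records, instead of draining the whole object first.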
Describe the bug
When inferring the schema, the complete CSV is read into memory even if you keep the default limit of 1000 rows to infer from.
To Reproduce
Happens here:
https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/src/datasource/file_format/csv.rs#L109
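For context, the pattern at that line is roughly the following (a paraphrase, not the exact source; `infer_schema_unbounded` is a hypothetical name): the entire object is fetched before inference begins, so the record limit only bounds parsing, not I/O.

```rust
use arrow::csv::reader::infer_reader_schema;
use arrow::datatypes::Schema;
use bytes::Buf;
use object_store::{path::Path, ObjectStore};

// Paraphrased shape of the problem, not the exact datafusion code:
async fn infer_schema_unbounded(
    store: &dyn ObjectStore,
    location: &Path,
) -> Result<Schema, Box<dyn std::error::Error>> {
    // This pulls the ENTIRE object into memory up front...
    let bytes = store.get(location).await?.bytes().await?;
    // ...so the Some(1000) record limit below only bounds how many rows
    // are parsed, not how many bytes were fetched.
    let (schema, _) = infer_reader_schema(bytes.reader(), b',', Some(1000), true)?;
    Ok(schema)
}
```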
Expected behavior
It should read only as much data as it needs to infer the schema from the given number of rows.