pl.scan_parquet().head().collect() uses huge ram on 4GB file #3818
Comments
Thanks, I can reproduce it. Could you send me the schema of the file? I am curious why there is such a large difference from pyarrow.
Perhaps I misunderstood what you mean by that - there is only a single column of type float64, generated with numpy:

```python
>>> import pyarrow.parquet as pq
>>> pq.read_schema('big.parquet')
a: double
```
Found the culprit. Will also make reading a bit faster in the process.
Sweet!
Fixed by #3821. Note that polars is probably faster when reading multiple columns. We chose to parallelize horizontally, whereas I believe pyarrow parallelizes vertically.
Great. Did you check the RAM usage of scan_parquet?
It didn't OOM anymore, so yes, it's better. :)
Does this same situation/fix also apply to IPC files? #3360
I don't think so. They are implemented very differently. jorgecarleitao/arrow2#1105 might help it. I have plans to make IPC reading parallel; then I also want to investigate that one. And keep reminding me. ;)
Hi @ritchie46, this probably should be a different thread, but I have a PR coming which adds lazy streaming to Arrow Streaming IPC files. Would it be worth holding off until those changes land, or getting that in first?
Could you open an issue and discuss this further? I am curious what you mean by lazy streaming.
What language are you using?
Python
Have you tried latest version of polars?
Yes
What version of polars are you using?
0.13.50
What operating system are you using polars on?
macOS 12.3.1, ARM architecture
What language version are you using
Python 3.10
Describe your bug.
Reading lazily from Parquet files uses enormous amounts of RAM and takes forever when calling `df.head().collect()`. `df.collect()` works fine, see example: