
pl.scan_parquet().head().collect() uses huge ram on 4GB file #3818

Closed
thomasaarholt opened this issue Jun 26, 2022 · 11 comments
Labels
bug Something isn't working

Comments

@thomasaarholt
Contributor

What language are you using?

Python

Have you tried latest version of polars?

yes

What version of polars are you using?

0.13.50

What operating system are you using polars on?

Mac OS 12.3.1, Arm arch

What language version are you using

python 3.10

Describe your bug.

Reading lazily from Parquet files uses an enormous amount of RAM and takes forever when calling df.head().collect(). A plain df.collect() works fine; see the example:

import numpy as np
import polars as pl

# 400 MB dataset
data = np.random.random(50_000_000)
df = pl.DataFrame({"a": data})
df.to_parquet("small.parquet")

pl.read_parquet("small.parquet").head()            # takes 0.8 sec
pl.scan_parquet("small.parquet").head().collect()  # takes 1.9 sec


# 4 GB dataset
data = np.random.random(500_000_000)
df = pl.DataFrame({"a": data})
df.to_parquet("big.parquet")

pl.read_parquet("big.parquet").head()              # takes 20 sec, 8 sec if use_pyarrow=True
pl.scan_parquet("big.parquet").head().collect()    # runs forever, uses 30+ GB of RAM and dies

pl.scan_parquet("big.parquet").collect()           # note: without .head, takes 20 sec

thomasaarholt added the bug label on Jun 26, 2022
@ritchie46
Member

Thanks, I can reproduce it.

Could you send me the schema of the file? I am curious why there is such a large difference with pyarrow.

@thomasaarholt
Contributor Author

thomasaarholt commented Jun 26, 2022

Perhaps I misunderstood what you meant by that: there is only a single column of type float64 (the default for numpy's random()), i.e. double:

>>> import pyarrow.parquet as pq
>>> pq.read_schema('big.parquet')
a: double

@ritchie46
Member

Found the culprit. Will also make reading a bit faster in the process.

@thomasaarholt
Contributor Author

Sweet!

@ritchie46
Member

Fixed by #3821

Note that polars is probably faster when reading multiple columns. We choose to parallelize horizontally, whereas I believe pyarrow parallelizes vertically.
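
(Editor's note: a hypothetical sketch, not part of the original exchange, of how one might compare a multi-column read between polars and pyarrow. The file name, column count, and use of to_parquet mirror the example above and are illustrative only.)

import time

import numpy as np
import polars as pl
import pyarrow.parquet as pq

# Eight float64 columns with the same total size as the original 400 MB example.
df = pl.DataFrame({f"col_{i}": np.random.random(6_250_000) for i in range(8)})
df.to_parquet("wide.parquet")  # polars 0.13 API as used in this issue; newer versions use write_parquet

t0 = time.perf_counter()
pl.read_parquet("wide.parquet")   # polars reads columns in parallel (horizontal parallelism)
print("polars :", time.perf_counter() - t0)

t0 = time.perf_counter()
pq.read_table("wide.parquet")     # pyarrow for comparison
print("pyarrow:", time.perf_counter() - t0)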

@thomasaarholt
Contributor Author

Great. Did you check the RAM usage of scan_parquet?

@ritchie46
Member

It didn't OOM anymore, so yes, it's better. :)
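
(Editor's note: not part of the original exchange, but a rough way to spot-check peak memory of the query is the stdlib resource module, available on Unix-like systems such as the reporter's macOS setup.)

import resource

import polars as pl

# Run the query that previously ran out of memory.
df = pl.scan_parquet("big.parquet").head().collect()

# Peak resident set size of the process so far.
# ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print("peak RSS:", peak)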

@cbilot

cbilot commented Jun 27, 2022

Does this same situation/fix also apply to IPC files? #3360

@ritchie46
Member

ritchie46 commented Jun 28, 2022

Does this same situation/fix also apply to IPC files? #3360

I don't think so. They are implemented very differently.

jorgecarleitao/arrow2#1105 might help with it. I have plans to make IPC reading parallel, and then I also want to investigate that one. And keep reminding me. ;)

@joshuataylor
Contributor

Hi @ritchie46, this should probably be a different thread, but I have a PR coming that adds lazy streaming for Arrow Streaming IPC files. Would it be worth holding off until those changes land, or getting that PR in first?

@ritchie46
Member

Could you open an issue so we can discuss this further? I am curious what you mean by lazy streaming.
