
pl.scan_parquet().head().collect() uses huge ram on 4GB file #3818

Closed
thomasaarholt opened this issue Jun 26, 2022 · 11 comments
Labels
bug Something isn't working

Comments

@thomasaarholt
Contributor

What language are you using?

Python

Have you tried latest version of polars?

yes

What version of polars are you using?

0.13.50

What operating system are you using polars on?

Mac OS 12.3.1, Arm arch

What language version are you using

python 3.10

Describe your bug.

Reading lazily from Parquet files uses an enormous amount of RAM and takes forever when calling df.head().collect(). A plain df.collect() works fine; see the example:

import numpy as np
import polars as pl

# 400 MB dataset
data = np.random.random(50_000_000)
df = pl.DataFrame({"a": data})
df.to_parquet("small.parquet")

pl.read_parquet("small.parquet").head()            # takes 0.8 sec
pl.scan_parquet("small.parquet").head().collect()  # takes 1.9 sec


# 4 GB dataset
data = np.random.random(500_000_000)
df = pl.DataFrame({"a": data})
df.to_parquet("big.parquet")

pl.read_parquet("big.parquet").head()              # takes 20 sec, 8 sec if use_pyarrow=True
pl.scan_parquet("big.parquet").head().collect()    # runs forever, uses 30+ GB of RAM and dies

pl.scan_parquet("big.parquet").collect()           # note: without .head, takes 20 sec

thomasaarholt added the bug label on Jun 26, 2022
@ritchie46
Member

Thanks, I can reproduce it.

Could you send me the schema of the file? I am curious why there is such a large difference with pyarrow.

@thomasaarholt
Contributor Author

thomasaarholt commented Jun 26, 2022

Perhaps I misunderstood what you meant by that: there is only a single column of type float64 (the default for numpy's random()), i.e. double:

>>> import pyarrow.parquet as pq
>>> pq.read_schema('big.parquet')
a: double

@ritchie46
Member

Found the culprit. Will also make reading a bit faster in the process.

@thomasaarholt
Contributor Author

Sweet!

@ritchie46
Member

Fixed by #3821

Note that polars is probably faster when reading multiple columns. We choose to parallelize horizontally, whereas I believe pyarrow parallelizes vertically.
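
(Editor's note: a hypothetical sketch, not part of the original exchange, of how one might compare a multi-column read between polars and pyarrow. The file name, column count, and use of to_parquet mirror the example above and are illustrative only.)

import time

import numpy as np
import polars as pl
import pyarrow.parquet as pq

# Eight float64 columns with the same total size as the original 400 MB example.
df = pl.DataFrame({f"col_{i}": np.random.random(6_250_000) for i in range(8)})
df.to_parquet("wide.parquet")  # polars 0.13 API as used in this issue; newer versions use write_parquet

t0 = time.perf_counter()
pl.read_parquet("wide.parquet")   # polars reads columns in parallel (horizontal parallelism)
print("polars :", time.perf_counter() - t0)

t0 = time.perf_counter()
pq.read_table("wide.parquet")     # pyarrow for comparison
print("pyarrow:", time.perf_counter() - t0)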

@thomasaarholt
Contributor Author

Great. Did you check the RAM usage of scan_parquet?

@ritchie46
Member

It didn't OOM anymore, so yes, it's better. :)
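
(Editor's note: not part of the original exchange, but a rough way to spot-check peak memory of the query is the stdlib resource module, available on Unix-like systems such as the reporter's macOS setup.)

import resource

import polars as pl

# Run the query that previously ran out of memory.
df = pl.scan_parquet("big.parquet").head().collect()

# Peak resident set size of the process so far.
# ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print("peak RSS:", peak)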

@cbilot

cbilot commented Jun 27, 2022

Does this same situation/fix also apply to IPC files? #3360

@ritchie46
Member

ritchie46 commented Jun 28, 2022

Does this same situation/fix also apply to IPC files? #3360

I don't think so. They are implemented very differently.

jorgecarleitao/arrow2#1105 might help with it. I have plans to make IPC reading parallel, and then I also want to investigate that one. And keep reminding me. ;)

@joshuataylor
Contributor

Hi @ritchie46, this should probably be a different thread, but I have a PR coming that adds lazy streaming for Arrow Streaming IPC files. Would it be worth holding off until those changes land, or getting that PR in first?

@ritchie46
Member

Could you open an issue so we can discuss this further? I am curious what you mean by lazy streaming.
